AI models struggle with expert-level global history knowledge
Researchers recently evaluated how well advanced artificial intelligence (AI) models answer questions about global history, using a benchmark derived from the Seshat Global History Databank. The study, presented at the Neural Information Processing Systems conference in Vancouver, found that the best-performing model, GPT-4 Turbo, scored 46% on a multiple-choice test: a marked improvement over random guessing, but far from expert comprehension. The findings highlight significant limitations in current AI tools’ ability to process and understand historical knowledge, particularly outside well-documented regions like North America and Western Europe.
The study was motivated by a desire to explore whether AI tools could aid historical and archaeological research. History and archaeology often involve analyzing vast amounts of complex and unevenly distributed data, which makes these fields particularly challenging for researchers. AI, and large language models (LLMs) in particular, has proved useful in fields like law and data labeling, raising the question of whether these tools could similarly assist historians by processing and synthesizing historical knowledge. Researchers …