FACTS Benchmark Exposes Critical Gaps in AI Chatbots as Multimodal Accuracy Falls Below 50%

https://www.forbes.com/sites/quickerbettertech/2025/12/21/small-business-technology-news-this-week-google-says-chatbots-are-69-accurate/

In recent years, artificial intelligence (AI) has rapidly transitioned from a niche technology to an essential component of enterprise operations, customer engagement, and everyday digital tools. Generative AI, particularly large language models (LLMs), has shown remarkable capabilities—from drafting documents to assisting with research and automating workflows. However, recent assessments, including Google’s FACTS Benchmark Suite, have revealed a sobering reality: even the most advanced AI models struggle with factual accuracy, frequently getting roughly one in three responses wrong.


This article delves into the latest findings, explores the implications for businesses and developers, and provides a data-driven perspective on how enterprises can navigate the current AI landscape responsibly.


Google’s FACTS Benchmark: A Reality Check for AI Accuracy

The launch of Google’s FACTS Benchmark Suite represents a significant shift in evaluating AI reliability. Unlike earlier benchmarks that focused primarily on task completion, FACTS specifically measures factual accuracy across four distinct domains:


  • Parametric Knowledge: Evaluates a model’s ability to recall factual information from its training data without external assistance.

  • Search Performance: Assesses how effectively models can use web-based tools to retrieve accurate information in real time.

  • Grounding: Measures the ability to produce responses strictly based on a provided source document without adding external information (a toy illustration follows this list).

  • Multimodal Understanding: Tests comprehension and interpretation of images, diagrams, and charts.
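
The article does not describe the FACTS graders themselves, but the grounding idea can be illustrated with a toy check: does each sentence of a response share enough vocabulary with the provided source document? The sketch below is a hypothetical stand-in, not the benchmark's actual method; the word-overlap metric and the 0.6 threshold are illustrative assumptions.

```python
import re

# Toy illustration of a grounding-style check: flag response sentences
# with little lexical overlap with the provided source document.
# This is NOT the FACTS grader; the 0.6 threshold and the word-overlap
# metric are illustrative assumptions, not a real semantic judge.

def tokens(text: str) -> set[str]:
    """Lowercase word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z0-9%]+", text.lower()))

def ungrounded_sentences(response: str, source: str,
                         support_threshold: float = 0.6) -> list[str]:
    """Return response sentences that look unsupported by the source."""
    source_tokens = tokens(source)
    flagged = []
    for sentence in filter(None, (s.strip() for s in response.split("."))):
        sent_tokens = tokens(sentence)
        overlap = len(sent_tokens & source_tokens) / max(len(sent_tokens), 1)
        if overlap < support_threshold:
            flagged.append(sentence)
    return flagged

source_doc = "Revenue grew 12% in Q3, driven by cloud subscriptions."
answer = "Revenue grew 12% in Q3. Growth came from new retail stores."
for sentence in ungrounded_sentences(answer, source_doc):
    print("Needs verification:", sentence)  # flags the second sentence
```

Production graders use far stronger semantic comparisons, but the shape is the same: every claim in the output must be traceable to the source, and unsupported claims are penalized.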


According to early benchmark results, no model surpasses 70% overall accuracy. Gemini 3 Pro led the leaderboard with a 68.8% overall FACTS score; Gemini 2.5 Pro and OpenAI's GPT-5 followed in the 61–62% range, while xAI's Grok 4 and Anthropic's Claude 4.5 Opus fell below 55%.

Model            | Overall FACTS Score | Search Performance | Multimodal Accuracy
-----------------|---------------------|--------------------|---------------------
Gemini 3 Pro     | 68.8%               | 83.8%              | 46.1%
Gemini 2.5 Pro   | 62.1%               | 63.9%              | 46.9%
GPT-5            | 61.8%               | 77.7%              | 44.1%
Grok 4           | 53.6%               | 75.3%              | 25.7%
Claude 4.5 Opus  | 51.3%               | 73.2%              | 39.2%

The data highlights a critical insight: AI performance is uneven across tasks, with multimodal understanding consistently lagging. Reading charts, interpreting diagrams, or analyzing images yields the lowest scores, often below 50%. For enterprises relying on AI for financial reporting, data visualization, or document analysis, this represents a substantial risk.


Why the Factuality Gap Matters

Despite their impressive capabilities, AI chatbots can be dangerously misleading when assumed to be fully reliable. Industries such as healthcare, legal services, and finance are particularly sensitive to factual errors, where even minor inaccuracies can have significant consequences. For example:

  • A healthcare AI that misinterprets patient data or guidelines could lead to incorrect treatment recommendations.

  • Financial models relying on AI to summarize reports or extract numbers may propagate errors into forecasts and dashboards.

  • Legal research tools using AI to parse regulations or case law may inadvertently provide incorrect citations, exposing firms to liability.

As noted by Carl Franzen in his analysis, “The era of ‘trust but verify’ is far from over. Enterprise systems must treat AI outputs as probabilistic rather than absolute.”


Understanding the Benchmarks in Detail

The FACTS Benchmark introduces a nuanced approach by splitting factuality into contextual and world knowledge components:

  • Contextual Factuality: Measures whether AI can produce correct answers grounded in provided source material.

  • World Knowledge Factuality: Assesses the ability to retrieve and accurately report information from memory or external tools.


Parametric vs. Search Discrepancy

For developers building retrieval-augmented generation (RAG) systems, the distinction between parametric and search-based capabilities is crucial. For instance, Gemini 3 Pro scores 76.4% on parametric tasks but 83.8% on search tasks, illustrating that AI performs better when augmented with real-time information retrieval rather than relying solely on internal memory.


Implication: Enterprises should integrate AI models with search tools or knowledge bases to improve factuality, especially when handling dynamic or complex information.
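
A minimal retrieval-augmented pattern illustrates the point. The sketch below is a generic outline, not any particular vendor's API: `search_index` and `call_llm` are hypothetical stand-ins for whatever retrieval backend and model client an enterprise actually uses.

```python
# Minimal retrieval-augmented generation (RAG) sketch. `search_index`
# and `call_llm` are hypothetical stand-ins for a real vector store /
# search API and a real model client -- wire in your own backends.

def search_index(query: str, k: int = 3) -> list[str]:
    """Placeholder: return the k passages most relevant to the query."""
    raise NotImplementedError("connect a vector DB or search tool here")

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a model, return its text reply."""
    raise NotImplementedError("connect a model client here")

def answer_with_retrieval(question: str) -> str:
    passages = search_index(question)
    context = "\n\n".join(passages)
    # Ground the model in retrieved text and allow an explicit refusal
    # when the context does not contain the answer.
    prompt = (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```

The design choice mirrors the benchmark finding: fresh retrieved context pushes the model toward its stronger search-style behavior instead of its weaker parametric recall.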


Multimodal Limitations

The multimodal component consistently registers the lowest accuracy scores. Even top performers struggle to interpret charts, images, and diagrams correctly.

  • Gemini 3 Pro: 46.1%

  • Gemini 2.5 Pro: 46.9%

  • GPT-5: 44.1%

This underlines a critical caution for automation: using AI for unsupervised extraction from visual data may introduce substantial errors, necessitating human review.
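
One common mitigation is to route any value extracted from a chart or image into a human review queue unless it clears a confidence cutoff. The sketch below illustrates the shape of such a triage layer; the `confidence` field and the 0.9 cutoff are assumptions for illustration, since many multimodal APIs expose no calibrated confidence at all.

```python
# Route chart/image extractions below a confidence cutoff to human review.
# The `confidence` field and the 0.9 cutoff are illustrative assumptions;
# when an API exposes no confidence, sampling the model several times and
# checking agreement is one common proxy.

from dataclasses import dataclass

@dataclass
class Extraction:
    field: str         # e.g. "Q3 revenue"
    value: str         # e.g. "$4.2M"
    confidence: float  # model- or heuristic-derived, 0.0-1.0

def triage(extractions: list[Extraction],
           cutoff: float = 0.9) -> tuple[list[Extraction], list[Extraction]]:
    """Split extractions into auto-accepted and needs-human-review."""
    auto, review = [], []
    for e in extractions:
        (auto if e.confidence >= cutoff else review).append(e)
    return auto, review

auto_accepted, needs_review = triage([
    Extraction("Q3 revenue", "$4.2M", 0.97),
    Extraction("YoY growth", "12%", 0.61),  # below cutoff -> human review
])
print(len(auto_accepted), "auto-accepted;", len(needs_review), "for review")
```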



Enterprise Adoption: Strategic Considerations

Despite factuality limitations, enterprise leaders continue to invest heavily in AI. A Wall Street Journal survey cited by Gene Marks reports that 68% of CEOs plan to increase AI spending in 2026, even as less than half of current initiatives yield net positive returns.


Key observations include:

  • Marketing and Customer Service: Higher ROI reported due to structured data and repetitive tasks.

  • HR, Legal, and Security: AI implementations lag behind in effectiveness.

  • Workforce Impact: 67% of CEOs anticipate AI will increase entry-level headcount, emphasizing augmentation rather than replacement.

“AI is a strategic investment, but it is not yet a turnkey solution. Human oversight remains critical,” notes an enterprise AI consultant specializing in automation workflows.

Practical Guidelines for Safe AI Deployment

  1. Verify Critical Outputs: Always incorporate human-in-the-loop processes for high-stakes decisions.

  2. Integrate Search and Vector Databases: Avoid relying solely on parametric knowledge; retrieval augmentation improves accuracy.

  3. Exercise Caution with Multimodal Tasks: For financial charts, invoices, or images, maintain a review layer.

  4. Hedging and Refusal: Design models to admit uncertainty rather than risk hallucinations; strategic silence often outperforms overconfident errors (see the sketch after this list).

  5. Benchmark Regularly: Use FACTS and similar tools to monitor model performance and update deployment strategies.
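
Guideline 4 can be made concrete with a simple policy: if the system's estimated certainty falls below a threshold, it returns a refusal instead of a guess. The sketch below uses self-consistency (sample the model several times and measure agreement) as the certainty proxy; that proxy, the 0.8 threshold, and `call_llm` are all illustrative assumptions, not a calibrated production method.

```python
from collections import Counter

# Uncertainty-aware answering: refuse rather than guess. The
# self-consistency proxy (sample N times, measure agreement) is an
# illustrative assumption, not a calibrated method; `call_llm` is a
# hypothetical model client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect a model client here")

def answer_or_refuse(question: str, samples: int = 5,
                     threshold: float = 0.8) -> str:
    """Answer only when repeated samples agree often enough."""
    answers = [call_llm(question).strip() for _ in range(samples)]
    best, count = Counter(answers).most_common(1)[0]
    if count / samples < threshold:
        return "I'm not confident enough to answer that reliably."
    return best
```

An explicit refusal is cheap to handle downstream (escalate to a human), whereas a confident hallucination can silently propagate into reports and decisions.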


Use Cases That Benefit from Current AI Capabilities

While factuality remains a challenge, certain applications are well-suited for current AI:

  • Administrative Automation: Extracting contact information or generating draft emails, where errors are low-risk and easily correctable.

  • Content Drafting: Summarizing documents, generating reports, and automating repetitive text-based tasks with human editing.

  • Customer Engagement: Basic chatbots and FAQ assistants, where AI can handle structured queries with human supervision.

Limitations: For decision-critical tasks in finance, legal, and healthcare, AI outputs must be treated as supportive rather than authoritative.


The Path Forward for Enterprise AI

Google’s FACTS Benchmark Suite underscores a fundamental truth: current AI models are impressive, but inherently fallible. With top models achieving roughly 69% factual accuracy and significant gaps in multimodal and grounding tasks, enterprises must integrate AI thoughtfully. Strategies such as retrieval augmentation, human oversight, and uncertainty-aware design are critical to safe deployment.


The message is clear for organizations and developers: AI should augment human expertise, not replace it. As these technologies evolve, benchmarks like FACTS will provide indispensable guidance for evaluating performance, informing investment decisions, and mitigating risks.


For businesses seeking to navigate the AI landscape safely and effectively, insights from Dr. Shahid Masood and the expert team at 1950.ai can provide valuable guidance. Their research highlights the importance of verification, robust architecture, and strategic integration of AI into enterprise workflows.

