GPT-4 and Gemini scored less than 2 percent on this new AI benchmark

2 weeks ago

Epoch AI, a California-based research institute, launched a new benchmark for artificial intelligence (AI) last week. The benchmark, called FrontierMath, tests large language models (LLMs) on their ability to reason and solve mathematical problems. The institute claims that existing mathematical benchmarks are of limited use, both because of data contamination and because AI models already score very high on them. According to Epoch AI, even the leading LLMs scored less than two percent on the new benchmark.

Epoch AI Launches FrontierMath Benchmark

In a post, Epoch AI claimed that these questions would take even expert mathematicians hours to solve. The institute said it developed the new benchmark because of the limitations of existing benchmarks such as GSM8K and MATH, on which AI models generally score highly.

Epoch AI claimed that the high scores of LLMs are largely due to data contamination. This means that the questions had somehow already been fed to the AI models, making it easy for them to produce the answers.
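To illustrate what data contamination means in practice, here is a minimal sketch of one common way to look for it: measuring how many word sequences (n-grams) of a benchmark question also appear verbatim in a body of text a model may have seen. The function names, the choice of five-word sequences, and the toy strings are all assumptions made up for this illustration; the source does not describe how Epoch AI or anyone else checks for contamination.

```python
def ngrams(text, n=5):
    """Return the set of word-level n-grams in text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(question, corpus_text, n=5):
    """Fraction of the question's n-grams that also occur in corpus_text."""
    question_grams = ngrams(question, n)
    if not question_grams:
        # Question is shorter than n words, so there is nothing to compare.
        return 0.0
    shared = question_grams & ngrams(corpus_text, n)
    return len(shared) / len(question_grams)


# Hypothetical toy data, invented for this example only.
corpus = ("a farmer has 12 apples and gives 5 to a neighbor "
          "how many apples are left")
seen = "A farmer has 12 apples and gives 5 to a neighbor"
unseen = "Compute the number of rational points on this curve over a finite field"

print(overlap_ratio(seen, corpus))    # high overlap suggests possible contamination
print(overlap_ratio(unseen, corpus))  # no shared five-word sequence
```

A high ratio only hints that a question may have leaked. Paraphrased or translated copies would slip past an exact-match check like this one.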

FrontierMath addresses the problem by including new problems that are unique and have not been published anywhere, mitigating the risks associated with data contamination. Furthermore, the benchmark covers a wide range of questions, including computationally intensive problems in number theory, real analysis and algebraic geometry, as well as topics such as Zermelo-Fraenkel set theory. Epoch AI says all questions are “guess-proof,” meaning they cannot be solved by chance without strong reasoning.

Epoch AI emphasized that to measure an AI model's capability, benchmarks must be built around creative problem solving that requires the model to sustain its reasoning across multiple steps. Notably, many industry veterans believe that existing benchmarks are not sufficient to accurately measure how advanced an AI model is.

Responding to the new benchmark in a post, Noam Brown, an OpenAI researcher who was behind the company’s o1 model, welcomed it, saying, “I’m excited to see a new evaluation with such low success rates for frontier models.”
