SmartGPT: Major Benchmark Broken – 89.0% on MMLU + Exam’s Many Errors

AI Explained
Has GPT-4, using a SmartGPT system, broken a major benchmark, the MMLU, in more ways than one? 89.0% is an unofficial record, but do we urgently need a new, authoritative benchmark, especially in light of today’s insider info that Gemini will get 5x more compute than GPT 5?

Learn all about the power of exemplars and self-consistency, and how you can tangibly benefit, through real-world examples. You’ll learn more about everything from cutting-edge benchmarking to AGI forecasting.

Joshua Stapleton is a Machine Learning Engineer who has worked in the healthcare and defence sectors. He recently pivoted into AI capabilities and safety, with a concentration on LLMs. He now works as a research engineer, consults on the applications of AI across various industries, and is pursuing his Master's in Machine Learning and Data Science at Imperial College London.
Feel free to reach out to Josh via his email, [email protected]

  2. One way you can *reduce the cost* of answering many questions is to *gradually increase* how many processes you use, escalating only when there are disagreements (the paired numbers below are two example budgets, smaller/larger):
    1. Ask for several answers in parallel, say 3 or 8;
    2. If all agree, or all but 1 agree, pick that answer;
    3. If there are more disagreements than that, ask another 5 or 24 processes in parallel;
    4. If all but 1 (or all but 4) of the 8/32 answers agree, pick that answer;
    5. If there are still too many disagreements, ask another 8/64 processes in parallel;
    6. If all but 2/10 of the 16/96 answers agree, pick that answer;
    7. Otherwise, add that question to a list for human review.
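
The escalation steps in the comment above could be sketched roughly as follows. This is only an illustration of the idea, not the SmartGPT code: `escalating_vote`, `SMALL_SCHEDULE`, `LARGE_SCHEDULE` and the `sample` callable are hypothetical names, and the sketch assumes `sample(n)` returns `n` independent answers to one fixed question and that the 8/32 and 16/96 totals in the comment are cumulative.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

# Each round is (extra_samples, max_dissenters). The numbers mirror the
# comment: 3, then 5 more (8 total), then 8 more (16 total), and the
# larger variant 8, then 24 more (32 total), then 64 more (96 total).
SMALL_SCHEDULE: Sequence[Tuple[int, int]] = ((3, 1), (5, 1), (8, 2))
LARGE_SCHEDULE: Sequence[Tuple[int, int]] = ((8, 1), (24, 4), (64, 10))


def escalating_vote(
    sample: Callable[[int], List[str]],
    schedule: Sequence[Tuple[int, int]] = SMALL_SCHEDULE,
) -> Optional[str]:
    """Return the agreed answer, or None if the question needs human review.

    `sample` is an assumed callable that draws that many fresh answers
    (for example, parallel model calls) for a single question.
    """
    answers: List[str] = []
    for extra, max_dissenters in schedule:
        # Only pay for more samples when the previous round did not settle it.
        answers.extend(sample(extra))
        top_answer, top_count = Counter(answers).most_common(1)[0]
        dissenters = len(answers) - top_count
        if dissenters <= max_dissenters:
            return top_answer
    return None  # too much disagreement after the final round
```

Questions the model answers consistently stop after the first, cheapest round, so the larger sample counts are only spent on the contested ones. A `None` result corresponds to the last step of the comment, where the question is set aside for human review.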

  7. It surprises me that tests with all multiple choice questions and no “essay questions” are used for benchmarking expert level intelligence/knowledge/reasoning. Even in mathematics, where one would expect “right answers” to be possible and unambiguous, true expertise involves choice of the best methodology, and what is “best” may change depending on context. In humanities fields like history and philosophy, the idea that a multiple choice test can determine true expertise is laughable.

  8. 9:00 So what you are saying is that you are building a quantum superposition by creating multiples of the same thing and collapsing on the most probable outcome because it was the most probable… If only there was a new type of computer that used the properties of probability to create superpositions and instantly collapse these quantities of possible answers down to the correct bit…

  13. I'm noticing a trend of us trying to dumb down LLMs and trying to hide how capable they are… this is not good, for a number of reasons.

  Error Types in the MMLU:
    Here is the summary:

    Missing Text: Some questions lacked vital context or statements that made them understandable.

    Factual Errors: There were numerous factual inaccuracies in the answers provided by the MMLU.

    Misspellings: Some questions or answers had spelling mistakes.

    Grammatical Ambiguity: Some questions were grammatically ambiguous.

    Formatting Ambiguity: The way some questions were formatted could lead to confusion.

    Multi-Question Dependence: Some questions depended on other questions for context, but that context was missing.

    No Clear Answer: Some questions were ambiguous to the point that there wasn't a clear correct answer.

    Controversial Questions: Some questions touched on controversial topics, leading to potential bias or ambiguity in the answers.

    MMLU Question Categories with Errors:

    Business Ethics: Missing vital context in many questions.

    High School Chemistry: Factual errors and missing context.

    High School Psychology: Missing context.

    Professional Psychology: Missing context.

    Microeconomics: Missing context.

    Professional Law: Missing context.

    Professional Accounting: Missing context.

    Virology: Numerous factual errors.

    College Chemistry: Numerous factual errors.

    Econometrics: Factual errors.

    Philosophy: Multi-question dependence.

    High School Biology: Multi-question dependence.

    Public Relations: No clear answer.

    Moral Scenarios: No clear answer.

    Global Facts: Questions where the answer might depend on the source.

    Security Studies: Controversial questions.

