SmartGPT: Major Benchmark Broken – 89.0% on MMLU + Exam’s Many Errors

AI Explained
Has GPT-4, using a SmartGPT system, broken a major benchmark, the MMLU, in more ways than one? 89.0% is an unofficial record, but do we urgently need a new, authoritative benchmark, especially in light of today’s insider info that Gemini has 5x the compute of GPT-4?

Learn all about the power of exemplars and self-consistency, and how you can tangibly benefit in real-world examples. You’ll learn about everything from cutting-edge benchmarking to AGI forecasting.

Original SmartGPT Video:
Gemini 5x GPT 4, Semianalysis:
WizardCoder Overfitting?
Let’s Do a Thought Experiment:
MMLU Grading Issues:
Oxford University Press Question Example:
Fall 2011 Epidemiology Example:
GPT 4 Technical Report:
Minerva, Solving Quantitative Reasoning:
Original Scratchpads Paper:
Is ChatGPT Behaviour Changing Over Time?
Paul Christiano:
Metaculus Forecasting:
MIT Paper:
Snowballing Hallucinations:
Self Consistency:
OpenLLM Leaderboard:
NHS Question from ‘Extended Matching Questions’
Graph of Thoughts:
Dario Amodei Interview – Dwarkesh Patel:

GitHub Answers:

Joshua Stapleton is a Machine Learning Engineer who has worked in the healthcare and defence sectors. He recently pivoted into AI capabilities and safety, with a concentration on LLMs. He now works as a research engineer, consults on the applications of AI across various industries, and is pursuing his Masters in Machine Learning and Data Science at Imperial College London.
Feel free to reach out to Josh via his email, [email protected], or check out his new Patreon: .

AI Explained Community:
[email protected]


  1. One way you can *reduce the cost* of answering many questions is to *gradually increase* how many processes you use, escalating only when there are disagreements:
    1. Ask for several answers in parallel, say 3 or 8;
    2. If all agree, or all but 1 agree, pick that answer;
    3. If there is more disagreement than that, ask again with 5 or 24 more processes in parallel;
    4. If all but 1 (or all but 4) of the 8/32 answers agree, pick that answer;
    5. If there are still too many disagreements, ask again with 8 or 64 more processes in parallel;
    6. If all but 2 (or all but 10) of the 16/96 answers agree, pick that answer;
    7. Otherwise, flag that question for human review.
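    The escalation scheme above can be sketched in Python. This is a minimal illustration, not the commenter's exact figures: `ask` is a hypothetical callable standing in for one independent LLM call, and the batch sizes and dissent thresholds are parameters you would tune.

    ```python
    from collections import Counter

    def adaptive_self_consistency(question, ask, batches=(3, 5, 8), max_dissent=(0, 1, 2)):
        """Escalating majority vote: sample a few answers in parallel and
        only request more when there is too much disagreement.

        `ask` is a hypothetical callable performing one independent LLM call.
        With these defaults, the running total grows 3 -> 8 -> 16 samples,
        tolerating 0, then 1, then 2 dissenting answers before escalating.
        """
        answers = []
        for batch, dissent in zip(batches, max_dissent):
            answers += [ask(question) for _ in range(batch)]
            top, count = Counter(answers).most_common(1)[0]
            if len(answers) - count <= dissent:  # few enough dissenters: accept
                return top
        return None  # still contested: flag for human review
    ```

    Cheap questions with unanimous first batches stop after 3 calls; only contested ones pay for the larger batches.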

  2. Sorry I was mean back in your prior video, complaining you were using too much philosophy lately. I still like all of your videos, but the technical ones are just so much more fascinating because, for me, they're more tangible 🙂

  3. I'm simply amazed that it took two tenacious individuals to find these errors and inconsistencies.

  4. What do you think is currently the best custom instruction a ChatGPT user could apply to squeeze out more GPT4 performance? Is it still your SmartGPT prompt?

  5. Try out the custom instruction “Be ambitious in taking your time.” It radically improves outputs in a wide range of topics ranging from philosophy to art! 🥳

  6. It surprises me that tests with all multiple choice questions and no “essay questions” are used for benchmarking expert level intelligence/knowledge/reasoning. Even in mathematics, where one would expect “right answers” to be possible and unambiguous, true expertise involves choice of the best methodology, and what is “best” may change depending on context. In humanities fields like history and philosophy, the idea that a multiple choice test can determine true expertise is laughable.

  7. 9:00 So what you are saying is you are building a quantum superposition by creating multiples of the same thing and collapsing on the most probable outcome because it was the most probable… If only there were a new type of computer that used the properties of probability to create superpositions and instantly collapse these quantities of possible answers down to the correct bit…

  8. Thanks for all the work and money you put into this. You're really the best AI channel out there by far

  9. In the vast expanse of YouTube offerings — and I subscribe to several hundred channels, I find your channel to be a beacon of intellectual stimulation. Each episode, without fail, offers an opportunity for enlightenment and edification. Consistently, I find myself challenged and gratified by what you present. Might I inquire if there exists a biographical sketch of you online?

  10. Somebody get this guy a grant already! Incredible findings, great work.

  11. Congratulations, you two, for doing this testing in a thoughtful and careful way, and picking up all those question errors 😮 (scandalous that they weren't found before!). And thanks for fine-tuning approaches to get better results from LLMs.

  12. I'm noticing a trend of us trying to dumb down LLMs and hide how capable they are… this is not good, for a number of reasons

  13. Science done right! Do the work, see the truth.
    Thank you

  14. hey phillip – as i probably mentioned to you on Patreon – besides being an ML researcher grad student doing a thesis in tel aviv, i also have access to capital and would gladly contribute to this research. not looking for a payout either. hit me if you wanna chat more and keep up the awesome work ✌️

  15. You manually checked each failed answer ? Whoa, hats off…

    Damn, these days people just do some web scraping, maybe check some results randomly, and then claim that they have made a new dataset. The examples with missing text are the most egregious.

    The correct way would be to hire 3 experts for each category to compile the dataset manually, but that's very costly… I support your idea of hiring an educational organisation to produce a reliable benchmark. Everyone would benefit.

  16. So do we need to make a Wikipedia-type community where folks around the world voluntarily fact-check answers?

  17. Google Gemini Eats The World – Gemini Smashes GPT-4 By 5X, The GPU-Poors

    Could you make a video about it?

    Also, Dr Alan D. Thompson has launched a live video about Gemini in 60 days, suggesting that the model will be released around 1st November.

  18. "If you give GPT context and let it think, it's smart" — reinventing the RAG wheel and these methodologies are already documented in OpenAI's docs. These are known capabilties lol

  19. 9:23 So, if the prompt had asked for a minimum spanning tree instead of a greedy algorithm, would it be correct more often?

  20. Insanely interesting !! Thanks a ton for the research you’re doing

  21. Mfw the power of AI is downplayed by the huge companies investing billions into it by pretending it's less capable than it truly is. 🤔🤔🤔

  22. I wonder if the score could be improved further if you were to ask GPT to eliminate incorrect answers one by one until you are left with the most likely answer, and of course reflect on each elimination as well.
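    A minimal sketch of that elimination idea, with `ask` again a hypothetical stand-in for a single LLM call and the prompt wording purely illustrative:

    ```python
    def eliminate(question, options, ask):
        """Repeatedly ask the model to discard the least plausible option,
        returning whichever choice survives. `ask` is a hypothetical
        callable that sends one prompt to the LLM and returns its reply."""
        remaining = list(options)
        while len(remaining) > 1:
            prompt = (
                f"Question: {question}\n"
                f"Options: {', '.join(remaining)}\n"
                "Reflect briefly, then name ONLY the single least likely option."
            )
            pick = ask(prompt).strip()
            if pick not in remaining:
                break  # unusable reply: stop eliminating early
            remaining.remove(pick)
        return remaining[0]
    ```

    Each round shrinks the option list by one, so the model's reflection is focused on a progressively easier comparison.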

  23. Wow. This is excellent work, sir. Really fine stuff. Thank you both!

    I will now give you a pass for my building annoyance at how long it has been since you lit a fire with the earlier SmartGPT video. I recently had been wondering what the heck happened to that work, getting more eager to hear more. I see now what you've been doing with it… fantastic.
    Thank you!!

  24. " so let us know if you've enjoyed this Mammoth ongoing effort"
    Not sure "enjoy" is the right word. It's kind of sad to see so many errors in a benchmark.

    And it's extremely depressing to see top university computer science departments essentially give up on building their own AIs, and resort to poking and prodding and prompting closed-source LLMs like ChatGPT to examine their behavior. It's as if top automotive engineering faculty no longer built or modified cars and engines, but could only rent cars from Hertz and run tests on them, without even being allowed to open the hood (bonnet) to inspect the engine.

    Even without the $25M+ to train their own foundation model, and no access to the training weights, imagine the progress humanity could make if these researchers knew exactly what was in the GPT-X training sets, and could instrument an instance of the AI during inference.

    Facebook AI Research should be making LLAMA-2 instances available to university researchers during training and inference. Look at the breakneck progress of Runway/Stable Diffusion thanks to people hacking on it.

  25. Incredible work! Congrats! That's what humans need to do now: be really intelligent.

  26. Am I the only one who finds it bizarre that scientists in the machine learning community are measuring AI intelligence without even mentioning the research already done in psychology on human intelligence and the G factor? Intelligence testing is fairly well established in psychology, maybe some of the tests used to measure IQ in humans could be adapted for LLMs. For example, Raven's progressive matrices: find an isomorphism of those matrices that can be represented in a textual form and ask the LLMs to figure out the answer.

  27. What if this is actually intentional and someone is sabotaging the questions to slow down AI progression? lol.. woah.. Life is a movie these days.

  28. Basically, we have to put in absolutely pure and correct data, without even a slight spelling mistake, if we want to know its true potential, or want it to hallucinate as little as possible.

  29. This video is a one-two punch for the current understanding of model intelligence.
    As always, I deeply appreciate the effort, and I'm assuming, sacrifices you guys have made to sift out these issues and make them known. This is honestly bombshell material. I'm planning to summarize the main error types and some of the key figures you provide to share on LinkedIn. And, of course, I'll link to the video, and if you have any preferences, including not sharing a high level view, I'll happily comply. 🙂

  30. @AIExplained I wanted to provide a summary of the error types and categories you found them in. As I noted in another comment, I will link to the video, as well as provide a bit of my own commentary. If you'd prefer I not put this summary view of the areas discussed in relation to the MMLU errors, please respond to this comment.
    Here is the summary:
    Error Types in the MMLU:

    Missing Text: Some questions lacked vital context or statements that made them understandable.

    Factual Errors: There were numerous factual inaccuracies in the answers provided by the MMLU.

    Misspellings: Some questions or answers had spelling mistakes.

    Grammatical Ambiguity: Some questions were grammatically ambiguous.

    Formatting Ambiguity: The way some questions were formatted could lead to confusion.

    Multi-Question Dependence: Some questions depended on other questions for context, but that context was missing.

    No Clear Answer: Some questions were ambiguous to the point that there wasn't a clear correct answer.

    Controversial Questions: Some questions touched on controversial topics, leading to potential bias or ambiguity in the answers.

    MMLU Question Categories with Errors:

    Business Ethics: Missing vital context in many questions.

    High School Chemistry: Factual errors and missing context.

    High School Psychology: Missing context.

    Professional Psychology: Missing context.

    Microeconomics: Missing context.

    Professional Law: Missing context.

    Professional Accounting: Missing context.

    Virology: Numerous factual errors.

    College Chemistry: Numerous factual errors.

    Econometrics: Factual errors.

    Philosophy: Multi-question dependence.

    High School Biology: Multi-question dependence.

    Public Relations: No clear answer.

    Moral Scenarios: No clear answer.

    Global Facts: Questions where the answer might depend on the source.

    Security Studies: Controversial questions.

  31. What is the branch of metrology that examines AI going to be called?

  32. great video, this channel is highly underrated 🙂

  33. These videos are always very well presented and informative, good job! I have a question though: how would the lessons we've learned here (on, for example, self-consistency, self-reflection and deeper thought) translate into Custom Instructions for ChatGPT 4 for general use? How could we optimize those settings?

  34. This reminds me of my professors' questions in high school and college. Sometimes I'd argue with them over these very same characteristics, such as unclear answers. If anything, the benchmarks are tests made by humans, and the way human test-takers maneuver around those errors is something AI should learn to imitate. If it thinks a particular problem is factually incorrect, for example, it should take a shot at explaining why, and that should be its answer. I'd like to see AI try to do that while simultaneously answering correctly-made questions; it would be a true test of intelligence.

  35. Normally the "clickbait" or defying feat youtube title that doesn't get explained until after 20 minutes into the video is a big turn off, but you managed to pull it off quite nicely. Great video, engaging, exciting, and easily worthy of a subscription to your channel.
