OpenAI API Update, Black Spatula, AI Beats Doctors

Key Points

  • OpenAI’s Developer Day brought the o1 model to the API with a new reasoning‑effort parameter (a “slider” for how much the model should think), vision capabilities for image input, and expanded token limits for longer prompts and outputs.
  • The “Black Spatula” project aims to evaluate AI’s ability to detect errors across hundreds of peer‑reviewed papers, offering a real‑world benchmark beyond the tightly controlled tests typically used by model developers.
  • In a New England Journal of Medicine clinicopathological conference challenge, physicians averaged about 30% accuracy with wide variance, while o1 achieved roughly 80% with low variance, highlighting AI’s superior diagnostic performance in this test.
  • These results suggest that AI tools will increasingly enter clinical practice by 2025, prompting a cultural shift as healthcare providers begin to incorporate AI assistance alongside their own judgments.
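
The reasoning slider and image input described above map onto OpenAI's chat completions API. Below is a minimal sketch assuming the official Python SDK; the model name, effort value, image URL, and token limit are illustrative, and the helper function is invented for this example — check OpenAI's current docs before relying on exact parameter names.

```python
from typing import Optional

def build_o1_request(question: str, image_url: Optional[str] = None,
                     effort: str = "medium") -> dict:
    """Assemble a chat.completions request for an o-series model.

    `effort` is the "reasoning slider": "low", "medium", or "high".
    """
    # User content can mix text parts and image parts (vision input).
    content = [{"type": "text", "text": question}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": "o1",                    # illustrative model name
        "reasoning_effort": effort,       # the new reasoning parameter
        "max_completion_tokens": 4000,    # o-series output-token cap
        "messages": [{"role": "user", "content": content}],
    }

if __name__ == "__main__":
    # Requires `pip install openai` and an OPENAI_API_KEY in the environment.
    from openai import OpenAI
    client = OpenAI()
    req = build_o1_request("What does this chart show?",
                           image_url="https://example.com/chart.png",
                           effort="high")
    resp = client.chat.completions.create(**req)
    print(resp.choices[0].message.content)
```

Raising `effort` trades latency and cost for more deliberate reasoning, which is what the "slider" metaphor in the video is getting at.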

Full Transcript

# OpenAI API Update, Black Spatula, AI Beats Doctors

**Source:** [https://www.youtube.com/watch?v=yOGMZq-_q60](https://www.youtube.com/watch?v=yOGMZq-_q60)
**Duration:** 00:04:40

## Sections

- [00:00:00](https://www.youtube.com/watch?v=yOGMZq-_q60&t=0s) **OpenAI API Enhancements and Black Spatula Project** - The speaker highlights OpenAI's new developer-day features, including a reasoning-level slider, vision support, and expanded token limits for the o1 model, then introduces the Black Spatula initiative to systematically test AI's ability to catch errors in peer-reviewed scientific papers.

## Full Transcript
[0:00] Three pieces of AI news today. We're going to start with OpenAI: 12 Days of OpenAI. They did a developer day yesterday. o1 is now in the API, which means you can call it and give it a reasoning parameter, which is really cool because essentially it gives you a little slider, and you can move the slider up and down to suggest the effortfulness of reasoning, how much it should think about a particular question you pose. Apparently you can bring in a vision element and pass that to o1 now, which kind of makes sense because you can put photos into the chatbot with o1 already, so it has visual reasoning. They also did a bunch of other releases: 4o mini is out in the API, and there's a ton of other stuff. I'm going to link the full developer docs in the video here; check them out, there's more than I can get into in this video. Oh, one more that's really fun: they bumped the output token and input token limits, which means you can get more into the query and more out of the prompt, and that's very exciting.

[1:00] Okay, number two: the Black Spatula project. We know that o1 has been useful anecdotally for catching mistakes in peer-reviewed papers, but like many other real-world examples of AI application, it's been hard to get a consistent eval, a consistent evaluation or test. That is changing. The Black Spatula project is a project to review hundreds of peer-reviewed, published papers with AI and see if AI can catch mistakes in those papers in a way that's useful to the field. We will see what happens; it's not done yet, but the fact that it exists is a significant step forward, because at the end of the day most of the evals for AI are very, very tightly controlled, model-maker-defined evals, which they kind of have to be to allow apples-to-apples comparisons. But it means we don't really understand which models do well at real-world tasks in specific fields, even though we're using them in fields like medicine.

[2:00] Which brings me to my third piece of news: medicine. The New England Journal of Medicine has a famously difficult diagnosis test called the clinicopathological conference. It's a sort of differential-diagnosis case study that they do, and what they decided to do for this academic paper was run o1 and a few other models against the questions posed in the clinicopathological conference tests, and assign physicians to do the same. Well, physicians scored about 30% and had a really wide range of variance; I actually looked at the error bars. o1 scored 80% and had a very narrow range of variance.

[2:45] I look at that and I think two things. The first is that we are going to see startups bringing AI into the exam room and medical settings in 2025. This is too big a difference for AI not to be in the room, especially with capabilities like vision being rolled out. It's going to happen, and that's going to take a cultural change, because we've already seen studies where doctors were given the option to try AI alongside their own judgment for an academic paper, and the doctors just refused to trust what the AI was saying and trusted their own judgment. So there's a cultural change that's going to need to happen.

[3:28] But even if the cultural change happens, and even if we bring AI in, the other piece that stands out to me is that there is an intangible factor in some of what is going on that is difficult to measure, that we need to think more about, but that is probably fixable with prompting. I'll give you an example: some of the treatment plans proposed by o1 were not incorrect, but they were impractical. It is very difficult to catch "correct but impractical." For example, they called for too many tests, tests that would not necessarily incrementally add value but would just add that sort of fine-grained piece of detail. Physicians are usually quite sparing with the tests that they run, partly because it's invasive to the patient, partly because of expense, and partly because they want to efficiently get the most value per test in understanding the disease, and they know how much granularity they need in their results to get a correct diagnosis. So there's some human judgment there that o1 wasn't necessarily showing, and it's interesting to think about how you would prompt for that.

[4:32] Okay, that's your news, mostly science and math, but hey developers, it's a fun day. Go over and check out the developer tools OpenAI released. Cheers.
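
One way to picture the "prompt for that" idea at the end of the transcript: bake the physician-style test-budget judgment directly into the system prompt before asking for a plan. This is a hypothetical sketch, not the paper's method; the prompt wording and function name are invented for illustration, and note that some o-series models expect a "developer" role in place of "system".

```python
# Hypothetical sketch: steer a model toward practical treatment plans by
# encoding the constraints the transcript describes (be sparing with tests,
# weigh invasiveness and cost, seek the most value per test) into the prompt.

PRACTICALITY_SYSTEM_PROMPT = (
    "You are assisting with a differential diagnosis. When proposing tests, "
    "be as sparing as a practicing physician: order a test only if it is "
    "likely to change the diagnosis or treatment. For each test, state the "
    "incremental value it adds, its invasiveness, and its rough cost, and "
    "prefer the smallest set of tests that yields a confident diagnosis."
)

def build_practical_plan_messages(case_description: str) -> list:
    """Wrap a case write-up in the practicality-constrained prompt."""
    return [
        {"role": "system", "content": PRACTICALITY_SYSTEM_PROMPT},
        {"role": "user", "content": case_description},
    ]

# These messages would then be sent via the chat completions API, e.g.:
# client.chat.completions.create(model="o1",
#                                messages=build_practical_plan_messages(case))
```

Whether a system prompt alone can reproduce the human judgment described here is exactly the open question the speaker raises; this only shows where such a constraint would live.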