Learning Library

← Back to Library

OpenAI API Update, Black Spatula, AI Beats Doctors

4m • Unknown Channel • ai-ml • news • intermediate • Watch on YouTube ↗

Key Points

OpenAI’s Developer Day unveiled GPT‑4o (referred to as “01”) on the API with a new “reasoning” slider, vision capabilities for image input, and expanded token limits for longer prompts and outputs.
The “Black Spatula” project aims to evaluate AI’s ability to detect errors across hundreds of peer‑reviewed papers, offering a real‑world benchmark beyond the tightly controlled tests typically used by model developers.
In a New England Journal of Medicine clinical‑pathological conference challenge, physicians averaged about 30% accuracy while GPT‑4o achieved roughly 80% with low variance, highlighting AI’s superior diagnostic performance in this test.
These results suggest that AI tools will increasingly enter clinical practice by 2025, prompting a cultural shift as healthcare providers begin to incorporate AI assistance alongside their own judgments.

Sections

00:00:00 OpenAI API Enhancements and Black Spatula Project - The speaker highlights OpenAI's new developer‑day features—including a reasoning‑level slider, vision support, and expanded token limits for the 01 model—then introduces the Black Spatula initiative to systematically test AI's ability to catch errors in peer‑reviewed scientific papers.

Full Transcript

# OpenAI API Update, Black Spatula, AI Beats Doctors **Source:** [https://www.youtube.com/watch?v=yOGMZq-_q60](https://www.youtube.com/watch?v=yOGMZq-_q60) **Duration:** 00:04:40 ## Summary - OpenAI’s Developer Day unveiled GPT‑4o (referred to as “01”) on the API with a new “reasoning” slider, vision capabilities for image input, and expanded token limits for longer prompts and outputs. - The “Black Spatula” project aims to evaluate AI’s ability to detect errors across hundreds of peer‑reviewed papers, offering a real‑world benchmark beyond the tightly controlled tests typically used by model developers. - In a New England Journal of Medicine clinical‑pathological conference challenge, physicians averaged about 30% accuracy while GPT‑4o achieved roughly 80% with low variance, highlighting AI’s superior diagnostic performance in this test. - These results suggest that AI tools will increasingly enter clinical practice by 2025, prompting a cultural shift as healthcare providers begin to incorporate AI assistance alongside their own judgments. ## Sections - [00:00:00](https://www.youtube.com/watch?v=yOGMZq-_q60&t=0s) **OpenAI API Enhancements and Black Spatula Project** - The speaker highlights OpenAI's new developer‑day features—including a reasoning‑level slider, vision support, and expanded token limits for the 01 model—then introduces the Black Spatula initiative to systematically test AI's ability to catch errors in peer‑reviewed scientific papers. ## Full Transcript

0:00three pieces of AI news today we're 0:02going to start with open AI 12 days of 0:03open AI they did a developer day 0:05yesterday 01 is now in API which means 0:11that you can call it you can give it a 0:13reasoning parameter which is really cool 0:14because essentially it gives you a 0:16little slider and you can move the 0:17slider up and down to uh suggest the 0:19effortfulness of reasoning how much it 0:21should think about a particular question 0:23you pose apparently you can bring in a 0:25vision element and pass that to 01 now 0:28uh which kind of makes sense because you 0:30can put photos into the chat bot with 01 0:32already so it it has visual reasoning 0:35and they also did a bunch of other 0:37releases 40 mini uh out in API there's a 0:41ton of other stuff I'm going to link the 0:43full developer docs in the video here 0:46check them out there's more than I can 0:47get into on this on this video oh one 0:49more that's really fun they bumped the 0:51output token and the input token limit 0:54which means that you can get more into 0:55the query and more out of the prompt uh 0:58and that's very exciting 1:00okay number two black spatula project so 1:04we know that 01 has been useful 1:06anecdotally for catching mistakes in 1:08peer-reviewed papers but like many other 1:12realworld examples of AI application 1:14it's been hard to get a consistent eval 1:16it's hard to get a consistent evaluation 1:18or test that is changing the black 1:21spatula project is a project to review 1:23hundreds of peer-reviewed published 1:26papers with AI and see if AI can catch 1:29mist stakes in those papers in a way 1:31that's useful to the field we will see 1:34what happens it's not done yet but the 1:35fact that it exists is a significant 1:38step forward because at the end of the 1:40day most of the evals for AI are very 1:44very tightly controlled model maker 1:46defined evals which they kind of have to 1:49be to have Apples to Apples comparisons 1:51but it means we don't really understand 1:53which models do well at real world tasks 1:57in specific Fields even though we're 1:58using them like 2:00like medicine which brings me to my 2:02third piece of news medicine the New 2:04England Journal of Medicine has a 2:06famously difficult uh diagnosis test 2:09called the 2:10clinicopathological conference it's a 2:12sort of differential diagnosis uh uh 2:15case study that they do and what they 2:18decided to do for this academic paper 2:21was run 01 and a few other models 2:24against the questions posed in the 2:26clinical pathological conference tests 2:29and and assign Physicians to do the same 2:33well Physicians scored about 2:3530% and had a really wide range of 2:37variant I actually looked at the err 2:38bars 01 scored 80% and had a very narrow 2:43sort of range of 2:45variance and I look at that and 2:51I I think two things the first is we are 2:56going to see startups bringing AI into 2:59the ex exam room and medical settings in 3:012025 this is too big a difference for AI 3:04not to be in the room especially with 3:06capabilities like Vision being rolled 3:08out it's going to happen and that's 3:10going to take a cultural change because 3:11we've already seen studies where doctors 3:13were given the option to try AI 3:16alongside their own judgment uh for an 3:20academic paper and the doctors just 3:22refused to sort of trust what AI was 3:24saying and trusted their own judgment so 3:25there's there's a cultural change that's 3:27going to need to 3:28happen but but even if the cultural 3:31change happens and even if we bring AI 3:33in the other piece that stands out to me 3:35is that there is a 3:38intangible factor in some of what is 3:41going on that is difficult to measure 3:43and we need to think more about but 3:44probably fixable with prompting I'll 3:46give you an example some of the 3:48treatment plans proposed by 01 were not 3:52incorrect but they were impractical it 3:55is very difficult to catch correct but 3:57impractical for example they for too 4:00many tests tests that would not 4:02necessarily incrementally add value but 4:03would just add that sort of fine grain 4:05piece of detail Physicians are usually 4:07quite sparing with the test that they 4:09run partly because it's invasive to the 4:11patient and partly because of expense 4:13and partly because they want to 4:14efficiently get the most value per test 4:17in sort of understanding the disease and 4:18they know how much granularity they need 4:20in their results to get a correct 4:22diagnosis so there's some human judgment 4:24there that 01 wasn't necessarily showing 4:28and it's interesting to think about how 4:29you would prompt for that okay that's 4:32your news mostly science and math but 4:34hey developers it's it's a fun day go 4:36over and check out the developer tools 4:37open AI released cheers