# Panel Debates OpenAI's $200 o1 Pro

**Source:** [https://www.youtube.com/watch?v=UVMndg9WX9g](https://www.youtube.com/watch?v=UVMndg9WX9g)
**Duration:** 00:40:44

## Summary

- The episode "Mixture of Experts" introduces a panel of AI experts—Marina Danilevsky, Vyoma Gajjar, and Kate Soule—to discuss current AI developments, including NeurIPS trends, AGI evaluation design, and the release of Llama 3.3 70B.
- OpenAI announced a new premium tier, o1 Pro, priced at $200 per month, prompting a debate among the panelists: Vyoma supports subscribing for its reduced latency and higher speed, while Kate and Marina express skepticism about the cost.
- Sam Altman's year-end product rollout aims to accelerate adoption, with a target of reaching roughly one billion users by 2025, and the o1 Pro tier is positioned to attract AI developers who need faster, more reliable model access despite higher operating expenses.
- The discussion highlights broader industry concerns about pricing models for advanced AI services, balancing accessibility for developers against the substantial costs of running large-scale, high-performance models.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=UVMndg9WX9g&t=0s) **Debating OpenAI's $200 o1 Pro** - Panelists on the Mixture of Experts podcast discuss OpenAI's new $200-a-month o1 Pro tier while also covering hot AI trends such as NeurIPS highlights, AGI evaluation design, and the release of Llama 3.3 70B.
- [00:03:04](https://www.youtube.com/watch?v=UVMndg9WX9g&t=184s) **Cost Concerns Over AI Subscriptions** - The speaker argues that paying a high monthly fee for powerful AI models is unjustified for infrequent, modest tasks, advocating for occasional use and open-source alternatives instead.
- [00:06:06](https://www.youtube.com/watch?v=UVMndg9WX9g&t=366s) **Evaluating Premium AI Pricing** - The speaker argues that high-cost AI services need clear 10× value and use-case justification, likening them to Apple's profitable premium products.
- [00:09:15](https://www.youtube.com/watch?v=UVMndg9WX9g&t=555s) **OpenAI's Multimodal Strategy Debate** - The speakers evaluate whether OpenAI's push into video understanding and unified multimodal models is a strategic step toward AGI or an overextension in a fragmented market.
- [00:12:20](https://www.youtube.com/watch?v=UVMndg9WX9g&t=740s) **Model Discovery and Synthetic Data** - The speakers note that AI designers can't foresee all uses, observing that user prompts become a rich source of data for creating synthetic training material, a feedback loop that helps build ever-larger models and pushes the technology closer to AGI.
- [00:15:25](https://www.youtube.com/watch?v=UVMndg9WX9g&t=925s) **Navigating Emerging AI Papers** - The speaker expresses feeling overwhelmed by the avalanche of AI research and spotlights two promising papers—Waggle on unlearning in large language models and Trans-LoRA on transferring fine-tuned adapters—to guide listeners toward valuable work.
- [00:18:29](https://www.youtube.com/watch?v=UVMndg9WX9g&t=1109s) **Structured Execution of Language Models** - The speaker outlines emerging research on using SGLang and LoRA adapters to enable non-linear, programmable execution of LLMs with tool calling, multimodal inputs, and built-in safety/uncertainty checks.
- [00:21:33](https://www.youtube.com/watch?v=UVMndg9WX9g&t=1293s) **Open Source Script & ARC Benchmark** - The speaker mentions a proud personal script they'd consider open-sourcing before introducing the ARC Prize—a benchmark prize from the founders of Zapier and Keras aimed at measuring models' ability to learn new tasks as a step toward evaluating AGI.
- [00:24:37](https://www.youtube.com/watch?v=UVMndg9WX9g&t=1477s) **Debating New Benchmark Utility** - The speakers question the value of unsolved, highly specific puzzles as AGI tests, argue that true intelligence assessment requires a diverse "pentathlon" of tasks, and express concern that current evaluation benchmarks have become saturated and unreliable.
- [00:27:42](https://www.youtube.com/watch?v=UVMndg9WX9g&t=1662s) **Evaluating AI: Benchmarks and Meta Trends** - The speakers argue that current AI evaluation relies on artificially difficult benchmarks and high-stakes prizes like the ARC Prize, which they view as crude, uncertain measures of true progress.
- [00:30:45](https://www.youtube.com/watch?v=UVMndg9WX9g&t=1845s) **Rethinking Model Size vs Performance** - The speaker argues that model performance is no longer dictated solely by scale, highlighting how the smaller Llama 3.3 70B outperforms the larger 3.1 405B on several benchmarks thanks to higher-quality data and alignment strategies.
- [00:34:01](https://www.youtube.com/watch?v=UVMndg9WX9g&t=2041s) **Shift to Smaller, Data-Driven Models** - The speaker notes that companies are moving away from huge, costly AI models toward smaller, domain-specific models built on curated data to reduce expenses and satisfy legal and finance scrutiny.
- [00:37:08](https://www.youtube.com/watch?v=UVMndg9WX9g&t=2228s) **Shrinking Large Models Efficiently** - The speaker describes the accelerating effort to compress massive models such as Llama 405B into far smaller, high-performing versions, highlighting cost-driven limits on model size, the benefits of better data quality, the need for lightweight agents, and the growing focus on energy-efficient training pipelines.
- [00:40:12](https://www.youtube.com/watch?v=UVMndg9WX9g&t=2412s) **Legal Woes, Benchmark Gaming, and Mixture** - The hosts discuss legal and financial concerns about incentivizing overfitting, warn that benchmarks may be gamed, and preview a future episode of the Mixture of Experts podcast.

## Full Transcript
Will you be paying 200 a month for o1 Pro?
Marina Danilevsky is a Senior Research Scientist.
Marina, welcome to the show.
Uh, will you?
No, I will not.
Vyoma Gajjar is an AI Technical Solutions Architect.
Uh, Vyoma, welcome back.
Uh, are you subscribing?
Yes, shockingly.
And last but not least is Kate Soule, Director of Technical Product
Management on the Granite team.
Kate, welcome back.
Will you be subscribing?
Absolutely not.
Okay, all that and more on today's Mixture of Experts.
I'm Tim Hwang, and welcome to Mixture of Experts.
Each week, MoE is dedicated to bringing the top quality banter you need
to make sense of the ever evolving landscape of artificial intelligence.
Today, in addition to having the best panel, don't tell the other
panelists, Kate, Vyoma, Marina, very excited to have you on the show.
We're going to talk about the latest hot trends coming out of NeurIPS,
designing evaluations for AGI, and the release of Llama 3.3 70B.
But first, we have to talk about what Sam Altman's been cooking at OpenAI.
If you've been catching the news, OpenAI released, uh, uh, an
announcement that for the next 12 days, they'll be making a product release
announcement every day, um, to kind of celebrate the end of the year.
And there's already been a number of interesting announcements, not least
of which is the release of a new 200 a month tier for o1 Pro, which is
their kind of creme de la creme of models that they are making available.
Um, suffice to say 200 a month is a lot of money, much more than
companies who have been providing these services have charged before.
And so I really wanted to kind of just start there because I think
it's such an intriguing thing.
Um, Vyoma, I think you were the standout, you said that you would subscribe.
So I want to hear that argument and then we'll get to Kate and
Marina's unwarranted skepticism.
Sure.
I feel OpenAI's strategy here is to increase more adoption, and
that is something that they have been speaking continuously about.
Sam has been speaking continuously about in a multiple, uh, conferences
and talks that he's been giving.
He said that he wants to reach almost like 1 billion users by 2025.
And the whole aim behind coming up with the o1 Pro at, like, $200 is to
try to get, like, AI developers, who are the majority of the market trying
to build these applications, to start using it.
Some of the key features that he says are like reduced
latency during like peak hours.
It gives you like higher speed to implement some of these
models and use cases as well.
And it'll be surprising, but I was reading about it on X that it, compared
to ChatGPT, et cetera, takes like almost 30 times more money, it's more
expensive to run.
Um, so if you look at it from a perspective as a daily software
engineer, developer, engineer, web developer, it, it, it seems
to be a steal for those people.
And yeah, that's, that's why I feel that I would pay it.
That's great.
Yeah.
All right.
Maybe Kate, I'll turn to you because I think your response was
no hesitation, absolutely not.
Um, what's the argument, I guess, for, for not wanting to pay?
Because I mean, it sounds like they're, they're like, here, get
access to one of the most powerful artificial intelligences in the world.
And, you know, it's money, but, you know, I guess what they're trying to
encourage is for us to think about this as if it were a luxury product.
I think my biggest, uh, umbrage at the price tag is, you know, I can
see use cases for o1 and having a powerful model in your arsenal and
at your disposal, but I don't want to run that model for every single
task, and there's still a lot out there.
So trying to then have unlimited access for a really high cost on a monthly basis
just doesn't quite make sense for the usage patterns that I use these models
for and that I, that I see out in the world, like, I want to be able to hit
that model when I need to on the outlying cases where I really need that power.
The rest of the time, I don't want to pay for that cost.
Why would I carry that with this really high price tag month to month?
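Kate's usage-pattern argument is, at bottom, a break-even calculation between a flat subscription and metered, pay-as-you-go access. A minimal sketch, using purely hypothetical per-request pricing (the $0.50 figure is a placeholder, not OpenAI's actual rate):

```python
# Break-even point between a flat monthly subscription and metered access.
# All prices below are hypothetical placeholders for illustration.

def break_even_requests(monthly_fee: float, cost_per_request: float) -> float:
    """Requests per month at which the flat fee pays for itself."""
    return monthly_fee / cost_per_request

# Suppose a heavy reasoning query cost roughly $0.50 through a metered API.
n = break_even_requests(monthly_fee=200.0, cost_per_request=0.50)
print(n)  # 400.0 -- below ~400 heavy queries a month, paying per use is cheaper
```

At occasional-use volumes, a flat tier only wins if each query is worth far more than its metered price, which is exactly the outlying-cases pattern she describes.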
Yeah, I was gonna say, I mean, I think one of the funny things about this is the
prospect of paying $200 a month and then being like, I need help writing my email.
Yeah.
Like, it's kind of like a very silly sort of thing to think about.
Um, I guess I have to ask you this cause you work on Granite.
Open source, right?
I assume one of the arguments is just that open source is better.
Getting better and is free.
I don't know if you would say that that actually like is one reason
why you're more skeptical here.
I mean, I think that's certainly a reason how long you know do I want to pay to
have that early access or am I willing to wait a couple of months and see what new
open source models come out that start to, you know, tear away at the performance
margins that o1's been able to gain.
I don't have a need to have that today, and I'm willing to wait and to continue
working on the open source side of things as the field continues to play catch up.
You know, I think with every release we're seeing of proprietary models, it takes
less and less time for an open source model to be released that can start to
match and be equitable in that capability.
Yeah, it feels like they've really kind of gotten out on a
little bit of a limb here, right?
I didn't even think about it until you mentioned it, is like, once you've gotten
all these people paying $200 a month, it will feel really bad to now say, hey,
these capabilities are now available for 50 bucks a month all of a sudden.
I think there's some, you know, market testing, right?
They need to see how far they can push this.
That's, that's a reasonable thing for businesses to do, but I, it's
past my, uh, my taste.
It's a little too fine.
Yeah, for sure.
Um, I guess Marina, maybe I'll toss it to you as kind of the last person.
I know you were also a skeptic being like, no, I don't, I don't really think so.
Um, maybe one twist to the question I'll ask you is, uh, you know, when I was
working on chatbots back in the day, we often thought like, oh, what we're doing
is we're competing with like Netflix.
So we can't really charge on more on a monthly basis than someone
would pay for a Netflix because it's like entertainment, basically.
Um, and I guess, I don't know, maybe the question is someone
who's kind of skeptical of $200, how much would you pay, right?
Like, is an AI model worth $100 a month or $50 a month?
I guess, how do you think a little bit about that?
I think that's about what a lot of them are charging, right?
OpenAI's got a lovely $20 a month uh, tier, so does Anthropic, uh,
Midjourney has something like that.
So honestly, I think the market has said that if you're going to be
doing something consistent, that's a kind of a reasonable amount of
money, somewhere in that 20 to 50.
The 200 seems like a bit of a play from Steve Jobs of do you really,
really want to be an early adopter?
Okay, you get to say, ha ha, I'm playing with the real model.
Realistically though, I agree with Kate.
I think most people don't know how to create sophisticated enough use cases
to warrant the use of that much of a nuclear bomb, and you don't even know why
you're spending the money that you are.
So you can actually get pretty far in figuring out how to make use of all of
these models that are coming out and coming out quickly in the lower tier.
I mean, if again, if I was in charge of the finances, I'd say give me a reason
why this is a 10x quality increase.
And I don't see why it's a 10x quality increase when you don't
have a 10x better understanding of how to actually make use of it.
Um, so I'm, I'm on Kate's side.
I think part of this, and I think the comparison to Apple is quite apt
in some ways, is, um, you know, like, Apple has turned out to not
necessarily be the most popular phone, but the most profitable phone, right?
And it actually just turns out that a lot of people do really want to pay premium.
I guess maybe what we're learning is like, does that actually also
apply for AI, because I think, you know, it's hard to imagine other
things that you pay $200 a month for.
It's like getting like to like commuting expenses, utilities, like you pay
that much for your internet bill, I guess, you know, in some cases.
So yeah, I think we're about to find out whether or not like people are
going to bid up in that particular way.
I guess Vyoma maybe I'll turn it back to you, I mean, with all this criticism,
still sticking with it, though.
Yeah, I'm telling you, I feel the market niche that the OpenAI wants to stick to
is getting people to utilize these APIs, um, for purposes in the case that they
want to build a small application, like, uh, they have a black box environment as
well, where they can build, uh, something on their own, get it out quick and dirty.
Experimentation is much more easier.
And let's be honest, OpenAI has the first mover advantage.
So everyone, like majority of the people, know ChatGPT as the
go-to thing for generative AI.
So they are leeching it and I completely see them, um, doing
some of the, these, uh, marketing strategies around the money, et cetera.
I feel they are monetizing on it now, and that's one of the key reasons
they might be getting some push from investors.
I don't know, but that's somehow I feel the strategy that startups do follow.
And that's what everyone is doing as well.
Yeah, for sure.
The other announcement I kind of quickly wanted to touch on was, uh, OpenAI had
been hyping, uh, Sora, which is their kind of video generation, um, model.
Um, and it's now finally kind of widely available.
Um, and I think this is a little bit interesting just because you know, this
is almost like a very different kind of service that they're supporting, right?
Like they came up with language models.
Now they kind of want to go multimodal.
They want to get into video, you know, in part to kind of compete with all
the people that are doing generative AI on the image and video side.
And I guess I'm curious if the panel has any, any thoughts on this.
Um, Kate, maybe I'll throw it to you is like, it kind of feels like
this is like a pretty new front for OpenAI to try to go compete in from
a technological standpoint, right?
Like I think like, this is like a pretty different set of tools
and teams and infrastructure.
I guess kind of like, do you think ultimately this is sort of like a
smart move on the part of OpenAI?
Because it does feel like they're kind of like stretching themselves kind of in
every direction to try to compete on every single front in the generative AI market.
I mean, I think it does make sense under the broader vision, or OpenAI's
broader vision of pursuing AGI.
I mean, I think you're going to need to be able to have better,
uh, video understanding and generation capabilities to kind of
handle this more multimodal task.
And we're starting to see models being able, one single model being
able to handle multiple different modalities and capabilities.
So you need to develop models that can handle that right before you
can start to merge it all together.
So I think under that broader pursuit and umbrella, it does make sense to try and
develop those technologies and techniques.
Yeah, I think it's kind of like, well, we'll have to see.
I mean, I think again, like part of the goal is just like whether or not
AGI itself is is the right bid, um, to kind of take on this market, um, and,
and whether or not this market really will be kind of like one company to
rule them all, if it will be like, you know, he's the winner on video, and
you have the winner on text, and it'll kind of break down in a multimodal way.
I mean, I'm really skeptical that there's like the right economic
incentive to develop AGI in the way that a lot of people are pursuing it.
So we'll, we'll see, you know, but if that's your broader vision,
I don't think you can have a language-only model for AGI.
Right?
It needs to have better, different domain understanding.
Um, how about this announcement?
I mean, Vyoma, Marina, are you more excited about this than the
prospect of having to pay, you know, your internet bill's worth
each month for a language model?
Yeah.
Uh, I feel like the Sora announcement that we saw, and I was going through the videos
and I was actually playing through it.
The way that they've created, if you look at the UI, it looks very, very
similar to your iCloud photos UI.
Again, they're trying to drive more and more people to, um, use it seamlessly
and also, uh, it, it creates an, um, era of creativity, like people are going
over there playing a little bit with their prompts, increases the nuances
around prompt engineering as well.
I saw a lot of that, uh, happening with different, uh, AI developers
that I work with day in and day out.
They're like, if I tweak this in a different manner, uh, will that
particular frame in which it is being developed change, et cetera.
So it's, I feel it's also coming up with a whole different, um, arena
of doing much more better prompt engineering, prompt tuning as well.
I'll second that in saying that it's a really good way again to get a
better understanding of what this space really is and a lot of data.
This is something that we don't have an entire text's worth of internet or
internet's worth of text stuff for, whereas here trying to see whether
anecdotally or if people are willing to share what they've done, people
will get a much better sense of what can these models do and then maybe
economic things will come where you have a true multimodal model that can
understand, you know, graphs and charts and pictures and videos at the same time.
Um, but this is a good way to get a lot of data of what comes to
people's minds and what they think the technology ought to be useful for.
And that is interesting and it'll be really interesting to see what
comes out from, from this capability.
Yeah, I think kind of the model discovery, like you kind of build the
model, but it's sort of interesting that the people who design it are not
necessarily well positioned to know what it will be used for effectively.
That's absolutely true.
yeah, and the market's just like, all right, well, let's
just like throw it out there.
And then they're kind of just sort of like waiting, hoping
that something will pop out.
That's a great point that Marina brought about that, and I know Kate
also spoke on the same point about AGI.
Imagine, like, I just thought about it.
All the users are writing their prompt creativity onto that
particular, uh, Sora interface.
That is data itself.
Imagine that data being utilized to gauge human creativity and
getting much more closer to AGI, so.
And building on that, then also that model that you've trained can now generate
more synthetic data that you can then, even if you don't want an AGI model to be
able to generate videos, you still need an AGI model that can understand videos.
And for that, you need more training data through either collecting data that's
been generated by, you know, prompts, creating synthetic data from the model
itself, Sora, to create some larger model.
So it all, all I think is certainly related.
Yeah, for sure.
And yeah, there's kind of a cool point there about, I think, like,
If we think synthetic data is going to be one big component to the
future, um, there's almost like a first mover advantage, right?
Well, yeah, okay, maybe it's uncontroversial, right?
But it's kind of just like, well, you actually, if you're the first
mover, you can acquire the data that helps you make the synthetic data.
And so there's kind of this interesting dynamic of like who
gets there first actually ends up having a big impact on your ability.
And this is OpenAI's playbook, like one of the reasons they were able to
scale so quickly is they had first mover advantage and their terms and
conditions allow them to use every single prompt that was originally put
into the model when it first released.
It wasn't a little bit later until they started to have more terms to protect
the user's privacy with those prompts.
So yeah, definitely a model they can rinse and repeat here, so to speak.
And now everyone else is caught on and is like, Oh, any model you put
out where we can't store the data or don't you dare store my data.
So OpenAI got in there before people caught up with critical thinking
of, oh, that's what you're doing.
Yes.
Yeah.
I'm going to move us on.
So this week is the Lollapalooza, maybe that's too old of a reference.
The Coachella of machine learning is happening.
This week, uh, NeurIPS, the annual machine learning conference, uh, one of the big
ones next to, you know, ICML and ICLR, um, and, uh, there's a ton of papers
being presented, a ton of awards going on, a ton of industry action happening
at this conference, certainly more than we're going to have time to cover today.
But I do think I did want to take a little bit of time just because I
think it is a big research event and we have a number of folks who are in
the research space, uh, here with us.
On the episode.
Um, I guess maybe Kate, I know we were talking about before the episode, maybe
I'll kick it to you is, you know, given the many thousands, thousands of papers
circulating around coming out of NeurIPS.
Um, I'm curious if there's things that have caught your eye, things you're
like, oh, that's what I'm reading.
That's what I'm excited by.
Um, what are pointers?
Because I think for me personally, it's just like overwhelming.
Like you look on Twitter, it's like, this is the big paper
that's going to change everything.
And then pretty soon you have like more papers than you're ever going to read.
So maybe I'll tee you up as I'm curious if there's like particular things
you point people to take a look at
I mean, I think there's some really exciting work that our colleagues
at IBM are presenting right now that I'm just really, really fascinated
by and think has a lot of potential.
So I definitely encourage people to check out the paper called Waggle,
which is a paper on unlearning that our own panel expert Nathalie, uh,
is representing, talking about unlearning in large language models,
and they've got a new method there.
Uh, there's also a paper called Trans-LoRA that was produced by some of my colleagues
who sit right in, uh, Cambridge, Mass.
And I'm really excited by this one because it's all about how do you
take a LoRA adapter that's been fine tuned for a very specific model
and represents a bunch of different capabilities and skills that you've added
to this model and you've trained it.
And transfer it to a new model that it wasn't originally trained for,
because normally LoRA adapters are pairwise kind of designed for an exact
model during their training process.
And so I think that's going to be super critical as we start to look at how
do we make generative AI and building on top of generative AI more modular?
How do we keep pace with, like, these breakneck releases, you know, every month?
It seems like we're getting new Llama models; with Granite, we're
continuing to release a bunch of updates similarly, and I think that's
just where the field is headed.
And if we have to fine-tune something from scratch or retrain a LoRA from
scratch every single time a new model is released, it's just going to be
unsustainable, um, if we want to be able to keep pace.
So having more universal types of LoRAs that can better adapt to these new
models, um, all, I think, is going to be a really important, uh, part moving
forward for the broader ecosystem.
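The coupling Kate describes, where an adapter only makes sense relative to the exact base model it was trained against, can be seen with a toy low-rank update. This is just an illustrative sketch of why naive reuse across bases fails, not the Trans-LoRA transfer method itself:

```python
# Minimal sketch of why a LoRA adapter is coupled to its base model.
# Toy numbers only; this is not the Trans-LoRA method itself.

def matmul(M, N):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def add(M, N):
    return [[m + n for m, n in zip(rm, rn)] for rm, rn in zip(M, N)]

# Rank-1 LoRA update delta = B @ A for a 2x2 "weight matrix" (d=2, r=1).
B = [[1.0], [2.0]]          # d x r
A = [[0.5, -0.5]]           # r x d
delta = matmul(B, A)        # the learned low-rank update

W_v1 = [[1.0, 0.0], [0.0, 1.0]]   # base model the adapter was trained on
W_v2 = [[0.0, 1.0], [1.0, 0.0]]   # a newer base model release

x = [[1.0], [2.0]]
y_v1 = matmul(add(W_v1, delta), x)  # intended behavior on the original base
y_v2 = matmul(add(W_v2, delta), x)  # same adapter on a new base: behavior shifts

print(y_v1)  # [[0.5], [1.0]]
print(y_v2)  # [[1.5], [0.0]]
```

Because the update was optimized against W_v1, dropping it unchanged onto W_v2 produces different outputs, which is why an adapter normally has to be retrained, or, as in Trans-LoRA, deliberately transferred, for each new base.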
That's great.
So, yeah, we definitely, uh, listeners, you should check those out.
Um, Vyoma, Marina, I'm curious if there's other kind of things that
caught your eye, papers that you're of interest or, or otherwise.
So, one of the papers that I was looking into was on understanding the
bias in large-scale visual data sets.
So, we've been working a lot with large language models, uh, and, uh,
uh, data, which is, uh, language data.
But here, this was based off of a data set or an experiment, which was done in
2011, called Name That Dataset, and what I, what they showcased in this
entire paper is how you can break down the image by doing certain transformations,
such as like semantic segmentation, object detection, finding that boundary and edge
detection, and then kind of doing some sort of color and frequency transformation
on a piece of particular image to break it down such that you are able to, uh,
ingest that data in such a better manner that a model that is being created on that
data is much more accurate and precise.
So very, very, um, old techniques, I might say, but, like, the order
in which they performed them was great in a visual use case.
I think that was one of the papers that really caught my eye.
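One of the transformations mentioned above, edge detection, can be sketched in a few lines on a toy grayscale grid. This is purely illustrative, a crude gradient filter rather than the paper's actual pipeline:

```python
# Toy sketch of edge detection via a horizontal gradient filter on a
# tiny grayscale "image" (nested lists). Illustrative only; the paper's
# actual pipeline (segmentation, detection, frequency transforms) is
# far more involved.

def horizontal_edges(img):
    """Absolute difference between neighboring pixels in each row."""
    return [[abs(row[j + 1] - row[j]) for j in range(len(row) - 1)]
            for row in img]

# A 4x4 image with a sharp vertical boundary down the middle.
img = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

print(horizontal_edges(img))
# [[0, 9, 0], [0, 9, 0], [0, 9, 0], [0, 9, 0]] -- the edge lights up
```

The high values trace exactly where intensity jumps, which is the kind of structural signal such decompositions feed to the downstream model.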
I think what's interesting to me lately is the increase in structure, now not
just structured data but the structured execution of language models for
various tasks, as we continue to get more and more multimodal: not even just
text, you know, text, image, video, but also text, uh, with functions, with
tool calling, with things of that nature.
I think we talked about this on a previous episode as well.
There's now some interesting work going forward.
Uh, one particular paper I think I read recently, uh, SGLang, on how
to actually execute the language model, what your state is, and how
to have it be forking and going in different directions.
I think that there's a lot to be said here about how to make these models work for
you in a way that's not just sequential, and not just, oh, chain of thought, first
do this, then do this, then do this.
No, let's turn it into a proper programming language and a proper
structure with a definition with some intrinsic capabilities that the model
has besides just text generation.
So that happens to be a particular topic that I'm looking at with interest.
Yeah, and IBM actually has a demo, I think, on that topic.
Yes, it does.
So how do we use SGLang and some LoRA adapters, coming back into play,
different LoRAs, in order to set up models that can run different
things like uncertainty quantification and safety checks, all within
one workflow.
Using some clever masking to make sure you're not running inferences
multiple times and to kind of set up this really nice programmatic flow
for more advanced executions, uh, with the, with the model in question.
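The forking idea behind this can be illustrated with a hand-rolled sketch. This is not SGLang's actual API; `fake_generate` is a stub standing in for a real model call, and a real runtime would share cached prefix state (e.g. the KV cache) rather than copying strings.

```python
# Illustration of forking a shared language-model state so several
# checks (safety, uncertainty, ...) reuse one prompt prefix instead
# of re-running it from scratch.

def fake_generate(prompt):
    # Stub "model": in practice this would be an LLM inference call.
    return f"<answer to: {prompt.splitlines()[-1]}>"

class LMState:
    def __init__(self, prefix=""):
        self.text = prefix

    def append(self, chunk):
        self.text += chunk
        return self

    def fork(self, n):
        # Each branch copies the shared prefix; a real runtime would
        # share the cached state to avoid recomputing it per branch.
        return [LMState(self.text) for _ in range(n)]

    def gen(self):
        return fake_generate(self.text)

state = LMState("System: be helpful.\n").append("User: summarize X\n")
safety, uncertainty = state.fork(2)
safety.append("Check: is this request safe?")
uncertainty.append("Check: how confident are you?")
results = [safety.gen(), uncertainty.gen()]
print(results)
```

The point of the design is that both checks branch off one shared context, which is what makes running several auxiliary inferences in one workflow affordable.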
So if anyone's at NeurIPS, definitely recommend checking out the booth.
That's great.
Yeah.
I feel like, uh, I don't know, my main hope right now is like
to have more time to read papers.
I do miss that period of my life when I was able to do that.
Um, I guess maybe the final question, I mean, Marina, maybe I'll kick
it to you is, uh, how do you keep up with all the papers in the
space, just as a meta question?
Uh, I think I can't possibly, but, uh, in general, giant
shout out to my IBM colleagues.
We have some real good active Slack channels where people post the things that
they like, and there's particular folks with particular areas of expertise that
I can look to and see, oh, what has some particular researcher been
posting lately.
And that is the way, because, um, yeah, it's a lot of things,
especially now that there's a very welcome shift toward people posting
research early, just, you know, preprints on arXiv and things of that
nature.
And you really need the human curation to let you know what's noise and
what's worth paying attention to.
And yeah, I can't beat human curation for that right now.
Yeah.
It feels like, I feel like the key infrastructure is group chats.
Like that's all I have now.
Yes.
Just gonna add that this is gonna make Kate very happy.
As a true AI developer, I go to watsonx, the AI platform.
I use the Granite model.
I feed in my papers one by one.
First I ask, okay, summarize this for me.
Then I'm like, tell me the key points.
And then I go deeper, deeper, deeper.
I mean, I go the other way around to reverse engineer the paper to kind
of figure out what to do with it.
There's a script that I've written for it, which I'm very proud of.
So I usually-
You should open source that.
Yeah, you gotta open source that.
I need that in my life.
Maybe I could do that.
Yes, you should.
Absolutely.
Okay.
Thank you.
You heard it here first on Mixture of Experts.
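For the curious, a workflow like the one Vyoma describes might look roughly like this sketch. The `ask_model` stub and the specific prompts are illustrative placeholders, not her actual script or any real API.

```python
# Progressive "reverse engineering" of a paper: start broad, then ask
# deeper and deeper follow-up questions about the same text.

def ask_model(prompt, paper_text):
    # Stub: a real version would send the prompt and paper to a
    # hosted model (e.g. a Granite model on watsonx) and return text.
    return f"[model response to '{prompt}' over {len(paper_text)} chars]"

def drill_into_paper(paper_text, depth=3):
    """Summarize, extract key points, then go progressively deeper."""
    prompts = ["Summarize this paper.", "List the key points."]
    prompts += [f"Go one level deeper (round {i})." for i in range(1, depth)]
    return [ask_model(p, paper_text) for p in prompts]

notes = drill_into_paper("...full text of a paper...", depth=2)
for note in notes:
    print(note)
```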
I'm going to move us to our third topic of the day.
Um, so, uh, ARC Prize, which is an effort that was set up by Mike
Knoop of Zapier and François Chollet of Keras, is a benchmark that
attempts to evaluate whether or not models can learn new skills.
And ostensibly what it's trying to do is to be a benchmark for AGI.
In practice, what it means is that you're asking the machine to solve
a puzzle with these colored squares.
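For a flavor of what that means, here is a toy in the spirit of those puzzles, though not an actual ARC task: grids are 2D arrays of color indices, and a solver has to infer the hidden transformation from a handful of example pairs.

```python
# Toy ARC-style task. The hidden rule here is "mirror the grid
# left-to-right"; a solver must infer it from training pairs and
# then apply it to an unseen test input.

def mirror(grid):
    return [list(reversed(row)) for row in grid]

# (input, output) training pairs that demonstrate the hidden rule.
train_examples = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
]

# A correct solver reproduces every training pair...
assert all(mirror(x) == y for x, y in train_examples)

# ...and is then judged on a held-out test grid.
test_input = [[3, 0, 0], [0, 4, 0]]
print(mirror(test_input))  # [[0, 0, 3], [0, 4, 0]]
```

Real ARC tasks use rules far less obvious than mirroring, which is exactly what makes them a test of learning a new skill rather than recalling a known one.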
Um, and this is very interesting.
I bring it up today just because I think they did the latest round
of kind of competition against the benchmark and showed the results
and a technical report came out.
But I think this effort is just so intriguing because, you know, we've
gotten into this on the show, where people say, AGI, what does it
really even mean? And I think in most cases, people have no idea what
it really means, or can't really point to how they would measure it.
And this seems to me to be like at least one of the efforts that
say, well, here's maybe one way we could go about measuring this.
Um, and so I did want to kind of just like bring this up to kind of maybe
square the circle, particularly with this group, um, about sort of evals for AGI.
Like, does that even make sense as a category?
Are people even looking for those types of evals?
There's just a bunch of really interesting questions there.
And I guess Vyoma, maybe I'll turn it to you first.
I'm kind of curious about, like, when you see this kind of eval, you
know, is it a helpful eval? Is it mostly a research curiosity?
Like how do you think about something like ARC Prize?
Yeah, so when I look at ARC Prize, I think it was created back in
2019, when generative AI and large language models weren't a thing
yet, and I think it helps because it's one of the first in the game,
so people immediately relate back to it as the benchmark to evaluate
AGI. But AGI is much bigger than that. There are so many things going
on: companies like OpenAI, and others such as Mistral, are coming up
with models which can annotate human data to help you act like a
human, and there are different methods to do that. So I won't say ARC
is the pristine benchmark or standard, but I do get the point as to
why people refer back to it a lot.
Yeah, that sounds right. I mean, we talked about it earlier in this
episode. I think, Kate, you were like, if you were an AGI company,
this is the strategy you would pursue.
And yeah, it's kind of interesting: even though we can think about it
and talk about it in those concrete terms, when it comes down to the
nitty-gritty of machine learning, it's like, what do we even use to
measure progress against this? And no one really knows. I guess you're
kind of indicating that we fall back to this metric because we don't
have anything else.
Um, yeah, I'm kind of curious, Kate, how you think about that?
I think there are, there's a number of different ways you can think about it.
One, we're always going to need new benchmarks to continue to have
some targets we can solve for.
So having a benchmark that hasn't been like, cracked,
so to speak, is interesting.
I don't know that that means it's more than a research curiosity, honestly,
but there is something of value there.
There's something that we're measuring that models can't do today.
Is that thing valuable?
I don't know.
We're really talking about solving puzzles with colored dots. How well
does the ability to solve those puzzles correlate with the different
tasks outside of solving those puzzles? I'm not too sure.
I also think there's something wrong with calling that a test for AGI,
because, like, general is in the name.
That task in that benchmark is very specific.
It's like oddly specific.
And it's one that humans can't do very well today either.
So, you know, it doesn't quite resonate with me as a, quote, general
intelligence, where I think breadth is super important, um, if that's
what we're really after.
Yeah, it almost feels like you need, like, the pentathlon or
something. It's got to be a bunch of different tasks, I guess, in
theory.
Um, I guess Marina, do you, do you think, I was talking with a friend
recently, I was like, I was like, do you think evals are just broken right now?
Like, um, there's kind of a received wisdom that most of the market
benchmarks, or the well-understood benchmarks, are all super
saturated. Um, and then it's very clear that the vibes evals, where
you just play around with a model, are not comprehensive in the way we
want. And so there's kind of this big blurry thing about, well, what's
the next generation of evals that we think are going to be useful
here?
And how broken is it?
Maybe I'm being too much of a pessimist.
It's a very hard thing to do.
I mean, even this particular benchmark that we're talking about is one
specific way to instantiate a few assumptions about intelligence that
they laid out.
I was refreshing my memory on what they had said, and I was like,
okay, there are objects and objects have goals, and there's something
about the topology of space. Okay, yes, this is all true, and this is
one way to go at it. It's certainly not a comprehensive way, but with
research it's all about, well, we've got to have some instantiation of
it or we're never gonna make any progress. So I think you always have
to take every benchmark with a grain of salt.
A benchmark is not an actual measure of quality. It's a proxy. If you
want to really get into ML speak: quality is hidden, the benchmark is
observed. And it is a limited proxy in a smaller space than what the
quality is. Think about all the hidden layers of quality; we get a
specific proxy.
Um, the more variety you can get, the better, and the more you can
also understand that if something's been around for several months, as
you said, it's been learned. That's it, you've learned it; you need to
move on and do something else.
But the problem is, if we don't have something that's quantitative, then
people are just going to argue over vibes.
Like, "well, I had these five examples in my head," "well, I had
these five examples in my head."
And then you really do just say, I don't trust it, or I don't believe
it, or, can't these things be faked? That way lies madness, as far as
the actual use of these things goes.
We have to agree on something and put out the limitations and put out the
constraints and still be able to agree that there is something to compare on.
Um, so I, there's, there's no way around it with evaluation.
It's never been easy.
It's never going to be easy.
I don't think it's more broken than it ever really used to be.
Yeah, exactly.
It's exactly as broken as it always has been.
It's as broken as it's been.
Yeah.
I think it's because you're seeing two meta trends. I feel like one of
them is, we talked about the hard math benchmark that this group
called Epoch put out.
Um, and you know, it feels like one bit of the meta is, we're going to
just make the difficulty so extreme that it's almost a way of
recreating that consensus, where we're kind of like, oh, well, if a
machine can do that, something is really happening. But it feels to me
that's almost a very crude way of going at evals: all we do, to try to
get some agreement and move beyond the vibes, is create something so
difficult that it's indisputable that hitting it would be a
breakthrough in progress. But, you know, on a day-to-day basis, it's
like, how useful is a metric like that?
You know?
Well, so ultimately, uh, ARC Prize, I guess, are we pretty sympathetic to it?
It kind of sounds like it's measuring something; we're just not quite
sure what it is just yet.
I don't know if I'm a million dollars sympathetic to it.
I'm sympathetic to it as a benchmark, but I guess it's up to them.
Yeah.
I like how large dollar amounts have just been this theme for the episode.
I feel that once AI agents are utilized to kind of simplify these AGI
concepts, well, I wouldn't go to the extent of saying that that
particular benchmark can be achieved and someone will win that prize.
But I feel that with multiple permutations and transformations with AI
agents, let's say someone used generalization and some sort of
transfer learning and then created an agent to understand the way
humans learn, maybe, maybe not. I feel that's a gray area right now,
and we don't know what can be achieved. So I'm not here to say whether
it's here to stay or not, but until something new comes along, I feel
that's what we're measuring against.
I mean, and to Marina's point, one of the theories I've been sort of
chasing after is that AI is just being used in so many different ways
by so many different people now that we will just end up seeing this
vast fragmentation in evals, right?
Like it won't be the old days where it's like, it was good on MMLU, so
I guess it's just good in general.
Like everything is going to just be measured by like very
local needs and constraints.
And, you know, talk about group chats: I've been encouraging all of my
group chats, like, we need our own bench, you know, because I think
every community is so specific that we should just have our own
bespoke eval that we run against models as they come out.
So for our next topic, I really want to focus on the release of Llama 3.3 70B.
Uh, background here is that Meta announced that it was launching, uh, another
generation of its own Llama models, um, and most notably a sort of 70B version of
the model that promised 405B performance, but in a much more compact format.
This is a trend that we've been seeing for a while.
And I guess, um, you know, Kate, maybe I'll kick it to you. The
question I want to ask is, do we think we're going to eventually just
be able to have our cake and eat it too?
Like that, like we've been operating under this trade off of
big model, hard to run, but good.
Little model, not so good, but fast to run.
And, you know, where everything seems to be going in my mind is like,
maybe that's just a total historical artifact?
Like, I don't know, do you think that's the case?
I think that we often conflate size as the only driver of performance in a model.
And I think with this release of Llama 3.3 70B, comparing it to the older 3.1
405B, we're seeing firsthand that size isn't the only way to drive performance,
and that increasingly the quality of the data used in the alignment step
of the model training is going to be incredibly important for driving
performance, particularly in key areas.
So if you look at the eval results, right, the 3.3 70B is matching, or
actually exceeding on some benchmarks, the older 405B in places like
math reasoning.
And so I think that really speaks to the fact that you don't need
a big model to do every task.
Smaller models can be just as good for different areas.
And if we increasingly invest in the data, versus just letting it sit
on compute for longer and training it at a bigger size, we can find
new ways to unlock performance.
Yeah, that's a really interesting outcome. My friend was commenting
that it's almost kind of a very heartening message, you know, the idea
that you don't need to be born with a big brain so long as you've got
good training. Like, a good pedagogy is actually what makes the
difference.
And, you know, I think we are kind of seeing that in some ways, right?
That like, I guess like the dream of massive, massive architectures may not
be like the ultimate lever that kind of gets us to really increase performance.
Um, I guess one idea I want to run by you is just whether or not you
think this will be the trend, right?
Like I guess to Kate's point, like you can imagine in the future that companies
end up spending just a lot more time on their data more than anything else.
Um, which is a little bit of a flip.
I mean, I think most of my experience with machine learning people was like, I don't
really know where the data comes from.
So long as there's a lot of it, it will work.
Um, and this almost points towards a pretty different, more
discriminating approach to doing this work.
Yeah.
So I work with clients day in and day out, and I feel that the trend
is catching on. Clients no longer want to be paying so many dollars
for every API call to, like, a large model, on something which is
lying outside their control. Even though providers say, oh, we
indemnify the data, we are not storing your data, in the clients'
heads it's still not there yet. So people want it on their own prem: a
smaller model trained on their own specific data.
There have been so many times that I've sat with them and curated the
data flow: listen, this is what we'll get in, this is how we'll get
it.
So the trend is definitely, definitely catching on.
And historically, I've seen that the efficiency gains we see are
promising, but with some of these models there are some trade-offs in
things like context handling and adaptability, et cetera. So now I
feel that if we have a smaller model with a good amount of
domain-specific data, clients are getting better value out of it. And
I see that happening.
So yeah, I feel it's good and refreshing that it's no longer the case
that every time I walk into a board meeting, everyone goes, ooh, 70
billion, oh, 13 billion, I'll be comparing it with 405 billion. I no
longer have to have that conversation anymore. So good for us.
Yeah, I think it's kind of like, it's almost like people want the metric.
They're like, oh, that's a lot of B.
Like, where is this 405 B?
That's a lot.
Because now they have the legal team, the finance team, as Marina was
mentioning, breathing down their necks.
They're like, why do you have such a big model? Why is it inflating
our resources and the money that we have to write a check for every
month?
So everything's coming back to that.
Yeah.
There's a little bit of a race against time here though.
I don't know if Marina's got views on how this will evolve as like, part of this
is driven by just the cost of API calls.
And so there's kind of almost this game where it's like, how cheap will the API
become versus how much work are people willing to do upfront around their data?
I guess what you're almost saying is that companies are really tending
towards the data direction.
Uh, so as a committed data centric researcher, I'm very pleased to see
this direction of, uh, of things.
Is it good?
Excellent.
Um, again, I'll just restate what Kate had said, which is that the 3.3
model versus the 3.1 is only post-training. It's not, you know, making
a new model. It is differences in post-training techniques, so fine
tuning, alignment, things of that nature.
And this also shows the value of going in the direction of the small,
different ways of adapting, the LoRAs, because, yeah, clients want
things that are not just good on the general benchmark. They want
things that are good for them.
And look, the big was good, because whenever you have new technology,
first you want to get it to work. Then you want to get it to work
better. Then you want it to work cheaper, faster.
So we have like, all right, there's a new thing.
Okay.
Now we're getting those things a little bit smaller, cheaper, faster.
There's a new thing again.
Now we're getting it smaller, cheaper, faster.
This is normal.
This is a normal cyclic way of having the innovation.
Clients are for sure catching up to this fact and saying, yes, okay, I
see your 405, but I'm not gonna pay money for that, because I already
know you're going to figure out ways to bring that down.
And I don't need all the things that that model can do.
I need really specific things for me.
So this is again, even goes back to our conversation on benchmarks.
You look at the benchmark that matters for you.
You look at the size of the model that matters for you and how much it costs.
And this, this really matters a ton as we try to make use of this technology,
not get the technology to work, but to get the technology to work for us.
This trend is going to continue, and I see it as a very good thing, a
very heartening thing. It means people are getting a better intuition
of what the point of this tech is going to be, which is not size for
the sake of size.
I also think there are some really interesting, like, scaling laws
that are starting to emerge. Like, you look at the performance of
Llama 1 65B versus, you know, okay, maybe Llama 2 13B was able to
accomplish all of that. You look at what Llama 3 8B could do compared
to Llama 2 70B. You know, again, we were able to take that and shrink
it down. Now we're taking Llama 405B and shrinking it down into 70B.
And I think these updates are happening more rapidly, and we're
increasingly decreasing the amount of time that it takes to take that
large-model performance and shrink it down into fewer parameters.
And so it'd be interesting to plot that out at some point and see,
because I think we're seeing a ramp-up as we continue to look at
what's scalable. So, like, the amount of training data and the size of
the model aren't very scalable; it just costs exponentially more to
increase the size of your model, right? But if we are looking at
things like investing in data quality and other resources that we can
invest in more easily, I think we're going to continue to see that
increase in model performance and shrinking of the model size.
And to Vyoma's earlier point about agents, right, the complexity of
that is already exponential in itself. So you do not want each agent
to have 405 billion parameters. That is not something you can do. So
it's yet another driver, another motivator, in this direction.
One more driver that I've seen, and I don't know if anyone else has: I
was on a call with one of the banks, and there's also a shift towards
using energy-efficient training pipelines. Everyone's looking into,
how do we optimize the hardware utilization? Is there any sort of
long-term environmental effect? And that's also a nuanced topic which
is building up. I saw some papers at NeurIPS on that too, but I
haven't had the chance to look deeper into it. I also see these
conversations coming up day in and day out.
Although I guess one thing, I mean, maybe, Kate, to push back a little
bit: it is actually an important thing, probably, for our listeners to
know that you kind of need the 405B to get to this new Llama model.
Um, and I guess that is one of the interesting dynamics: for all the
benefit that these small models provide, is it right to say that we
still need the mega-size model to get to this?
Again, I think we're conflating size as the only driver of performance.
So I think you need more performant models to get to smaller performant
models, regardless of what size they are.
Um, and if you have something bigger that's performant, it's
easier to shrink it down in size.
But if I think about the normal way we'd go about doing this, right,
taking a big model and shrinking it down, it's generating synthetic
data from it, using it as what's called a teacher model, and training
a student model on that data. And you can use a smaller model if it's
better at the task. You know, Llama 3.3 70B is outperforming 405B,
according to a few benchmarks, on math and instruction following and
code, so I could, and would prefer to, use that smaller model to train
a new 70 billion parameter model rather than the 405B. I don't have to
go with the bigger one. I want to go wherever performance is highest.
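The teacher-student setup Kate describes can be sketched schematically. Everything below is a trivial stand-in, assuming nothing about any real training pipeline: the "teacher" just uppercases its prompt, and "training" just memorizes pairs, where a real student would be fit by gradient descent on the teacher's outputs.

```python
# Schematic teacher-student distillation via synthetic data: the
# teacher labels prompts, and the student is "trained" on those
# (input, target) pairs.

def teacher(prompt):
    # Stub teacher "model": its answer is just the uppercased prompt.
    return prompt.upper()

def build_synthetic_dataset(prompts):
    """Use the teacher to turn raw prompts into (input, target) pairs."""
    return [(p, teacher(p)) for p in prompts]

def train_student(dataset):
    # Stub "training": memorize the pairs and look them up later.
    memory = dict(dataset)
    return lambda prompt: memory.get(prompt, "<unknown>")

data = build_synthetic_dataset(["what is 2+2", "define LoRA"])
student = train_student(data)
print(student("define LoRA"))  # the student reproduces the teacher's answer
```

Kate's point maps onto the `teacher` slot: whichever model is most performant on the target tasks plays the teacher, regardless of its parameter count.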
Yeah, this all calls to mind, I mean, I want a benchmark, or some kind
of machine learning competition, where you take the smallest amount of
data and try to create the highest level of performance. It's almost
like a form of machine learning golf: what's the smallest number of
strokes that gets you to the goal? What's the smallest amount of data
that gets you to a model that can actually achieve the task? And it
feels like, you know, we may just be forced there, because legal and
finance are complaining. Now it feels like it's going to become more
of an incentive within the space.
You're going to promote overfitting, Tim.
If you really do that kind of thing, people will just game the benchmark.
Well, that's another topic for another day.
As per usual, we're at the end of the episode and there's a lot more to talk
about, so we will have to bring this to a future episode panel with you all on.
Uh, thanks for joining us.
Uh, and thanks to all you listeners out there.
If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify,
and podcast platforms everywhere.
And we will see you next week on Mixture of Experts.