
DeepSeek Challenges AI Giants

Key Points

  • DeepSeek’s recent R1 model delivers performance comparable to OpenAI’s o1, reigniting debate over whether the open‑source challenger can truly surpass industry leaders.
  • Panelists agree DeepSeek is making a strong splash, but emphasize that leadership hinges on more than raw benchmarks, requiring robust integration, ecosystem support, and sustained innovation.
  • Geopolitical considerations and the broader AI “arms race” heavily influence how these advanced models are developed, deployed, and regulated worldwide.
  • The episode also highlights other hot topics: Mistral’s potential IPO, controversy surrounding the FrontierMath benchmark, and an IDC study contrasting generalized versus specialized coding assistants.

**Source:** [https://www.youtube.com/watch?v=86rz0mV3jZE](https://www.youtube.com/watch?v=86rz0mV3jZE)
**Duration:** 00:39:37

Sections

- [00:00:00](https://www.youtube.com/watch?v=86rz0mV3jZE&t=0s) **Untitled Section**
- [00:03:12](https://www.youtube.com/watch?v=86rz0mV3jZE&t=192s) **Shifting Licenses, Open Competition** - The speakers examine how new commercial licensing models reshape ideas of transparency and openness in AI, highlight emerging non-big-tech players such as DeepSeek, and note the renewed focus on reinforcement learning within the evolving competitive landscape.
- [00:06:25](https://www.youtube.com/watch?v=86rz0mV3jZE&t=385s) **Evaluating R1's Edge Over o1** - The speaker highlights R1's touted contextual and reasoning improvements, calls for rigorous benchmarking against o1, questions its enterprise-grade features, and discusses the broader open-source implications, rapid release cycles, quality-scaling challenges, and the pressure it may place on major AI providers through lower pricing.
- [00:09:37](https://www.youtube.com/watch?v=86rz0mV3jZE&t=577s) **Beyond Benchmarks: End-to-End AI Integration** - The speaker stresses that true AI leadership hinges on safely and ethically integrating large language models across ecosystems, prioritizing efficiency, specialized adaptability, and regulatory compliance over pure benchmark performance.
- [00:12:44](https://www.youtube.com/watch?v=86rz0mV3jZE&t=764s) **Distilling DeepSeek into Llama Models** - The speaker explains how DeepSeek's large model is used to guide the creation of smaller, Llama-based models through knowledge distillation, enabling plug-and-play compatibility and lower VRAM requirements.
- [00:15:51](https://www.youtube.com/watch?v=86rz0mV3jZE&t=951s) **Open-Source AI and Global Diversity** - The speakers discuss IBM's commitment to open-source AI models, compare emerging competitors such as DeepSeek, and emphasize the need for geographic diversity and representation from the global majority in the AI ecosystem.
- [00:18:59](https://www.youtube.com/watch?v=86rz0mV3jZE&t=1139s) **Regional Customization of AI Models** - The speaker discusses how AI systems might be adapted to local cultures, languages, and data sets, resulting in region-specific behaviors and tonal variations.
- [00:22:04](https://www.youtube.com/watch?v=86rz0mV3jZE&t=1324s) **Skepticism Over Corporate Benchmark Involvement** - The speaker warns that while industry players may help design evaluation sets, their vested interests can lead to biased results, urging reliance on independent, third-party verification before accepting claimed performance gains.
- [00:25:23](https://www.youtube.com/watch?v=86rz0mV3jZE&t=1523s) **Skepticism Over Model Benchmarks** - The speakers question the reliability of current AI evaluation metrics, citing rapid model releases and controversies like FrontierMath, and suggest independent governance to ensure fair and trustworthy benchmarking.
- [00:28:33](https://www.youtube.com/watch?v=86rz0mV3jZE&t=1713s) **Automating LLM Evaluation via Vibes** - The speaker critiques current evaluation gaps, argues that model progress is hard to measure, and proposes using specialized LLMs to conduct interactive "vibes" assessments that let LLMs evaluate other LLMs at scale.
- [00:31:38](https://www.youtube.com/watch?v=86rz0mV3jZE&t=1898s) **Balancing Legacy and General Code Models** - The speaker explains IBM's dual strategy of maintaining resource-specific models for legacy systems like COBOL while also developing broader models for modern languages, with an eventual goal of unifying them into a single solution.
- [00:34:48](https://www.youtube.com/watch?v=86rz0mV3jZE&t=2088s) **Focus on Code Explanation, Not Unit Tests** - The speaker advises developers to prioritize the ability to explain code, an area where AI assistants are underused, rather than specializing in unit test generation, highlighting current AI strengths, gaps, and the future of human-AI co-creation in software development.
- [00:37:56](https://www.youtube.com/watch?v=86rz0mV3jZE&t=2276s) **Think Before You Code** - The speakers stress that effective software development relies on strategic, conceptual planning and problem decomposition rather than just writing code, noting academic gaps, the difficulty of inventing new algorithms, and the current limits of AI.

Full Transcript

0:00 At the end of 2025, is DeepSeek leading the state of the art in artificial intelligence? Abraham Daniels is a Senior Technical Product Manager with Granite. Abraham, welcome back to the show, joining us for the second time. What do you think?

0:11 They're definitely making a splash in the open source, uh, space, but you know, it's, it's a really competitive, uh, landscape, so I guess we'll have to wait and see.

0:19 Kaoutar El Maghraoui is a Principal Research Scientist and Manager at the AI Hardware Center. Kaoutar, I feel like you're becoming a regular here on the show. Uh, what's your take on this question?

0:27 DeepSeek is definitely reshaping the AI landscape, challenging giants with open source ambition and state of the art innovations, but talking about leading, I think that remains to be seen. It's not just about the raw performance, but it's also about the whole integration.

0:44 And finally, last but not least is Skyler Speakman, who is a Senior Research Scientist. Uh, Skyler, welcome back. What is your take?

0:50 Um, amazing technology. Great splash, as we said earlier. But I think there's really some really big geopolitics at play on how these models really get developed and are used across the world.

1:01 All right. All that and more on today's Mixture of Experts.

1:10 I'm Tim Hwang and welcome to Mixture of Experts. Each week, MoE is the place to tune in to hear the news and analysis on some of the biggest headlines and trends in artificial intelligence. Today, we're going to cover quite a lot as per usual. We're going to talk about Mistral potentially going IPO, uh, controversy around the FrontierMath benchmark, uh, and a recent interesting IDC report on generalized versus specialized coding assistants. But first, I want to start with DeepSeek.

1:35 Um, so just this past, uh, last week or so, um, DeepSeek released R1. And if you recall and you're a listener to the show, you know that just a few episodes ago, I believe we were talking about DeepSeek V3, uh, which is their release, uh, which at the time I think kind of blew everybody's mind, where they were showing really, really incredible performance with incredibly sort of less compute and costs than what we're traditionally used to in the AI space. And with R1, um, it basically is DeepSeek's pretty fast on its heels release, showing that it has performance comparable with kind of state of the art stuff coming out of OpenAI, specifically, to wit, uh, o1, and kind of the inference compute sort of techniques that really seem to give it a bunch of, um, sort of benefit, uh, for that model.

2:22 Um, and so I guess maybe Abraham, I'll, I'll start with you. Do you want to talk us through a little bit about why this is a big deal? Because I remember when, you know, o1 was released, people were like, this is a huge innovation and, you know, really shows that OpenAI has this big technological edge. Pretty soon afterwards, it seems like DeepSeek's doing almost the same thing, though. So I don't know if you want to talk our listeners through, like, how do they, how do they do that? How do they catch up so quickly?

2:45 Yeah, that's a great question. Um, so I think there's kind of two things that are really cool here.
2:47 One is, of course, just, you know, the comparative performance with, you know, a state-of-the-art kind of leading edge, bleeding edge model, like, uh, o1. But, um, unlike o1, it's been pretty cool that DeepSeek has decided to open source it, which, you know, has been able to kind of proliferate some pretty powerful models across the community without the blockage or, you know, added need for a commercial license. So I think they're really kind of shifting the paradigm, given a lot of these model providers are starting to slap on more, um, you know, specific licenses that are tailored to more commercial practices, given, you know, the business model that they're in. So I think it kind of shifts the idea of, you know, what does it mean to be transparent? What does it mean to be open without having to risk performance?

3:33 Skyler, it strikes me a little bit that, like, I think when we've talked about this issue in the past, you know, we've really talked about it in terms of, you know, OpenAI versus Meta, you know, right? And Meta's trying to kind of go compete with OpenAI by releasing these incredibly powerful models open source. This almost feels like now like everybody's after OpenAI exactly the same way. And obviously the distinction here, which is pretty interesting, is, you know, DeepSeek is, is not a kind of classic player. It's not a big tech player. Um, so do you want to speak a little bit to that? I know you kind of mentioned that, like, you think the competitive dynamics here are really interesting to watch.

4:05 So, uh, first off, I think we'll get to the competitive dynamics in a bit, but reinforcement learning is back on the scene. And I know it kind of sort of died out for a while when, uh, deep neural networks really took over. But there now are multiple companies, and I think DeepSeek is an example of making it quite public, of bringing this back into, uh, the large language models. So, uh, cool to see these ebbs and tides of various parts of AI and machine learning come and go. Uh, so that's kind of more on the technology side. It's really cool to see some of these things, uh, pop back up.

4:36 Yeah, totally. And I guess a quick comment on that. I mean, I think it is funny that, um, you know, for DeepMind, right, which originally made its bet on reinforcement learning, I think the rhetoric of the last year was, ah, they made the wrong bet and now they're trying to catch up. And now it's like, were they just really, really far ahead of everybody else? Like, I don't know.

4:51 Yes. No, great comment. There was this big push in reinforcement learning before, I think, the transformer, basically. And now these things seem to be, uh, you know, I'd say cohabitating, or at least, uh, being in the same technology. Uh, DeepSeek has shown that they can put both of those techniques into the same package. And I think that is a really compelling argument, uh, for their strength going into 2025.

5:18 Kaoutar, maybe I'll turn to you. I know out of the kind of set, uh, of folks on the panel, you know, I think you sounded the most, uh, sort of, um, you know, cautious about DeepSeek. Um, you know, I think there's one point of view, which is, oh man, they're releasing V3. That's incredible.
5:35 Not like a month or so later, you know, oh my God, now they're releasing R1, you know, they're, they're catching up so quickly. Uh, you know, I guess there's a, there's a way the human mind is just like, well, if we continue these trends, then, you know, AGI by the end of the year from DeepSeek. Um, do you want to speak a little bit about why you're still ultimately kind of skeptical that, you know, DeepSeek, this is like the arrival of a genuine deep challenger to something like OpenAI?

6:00 Yes, I think the key question is, what advancements does R1 introduce compared to V3? And how does it compare to o1? Are we talking about incremental changes or really, like, true innovations and new things that are leapfrogging the AI community? So they're claiming that they're improving the search precision, the scalability, the usability, while their V3 release focused on optimizing the core algorithms. So they're saying that R1 has capabilities, you know, such as better contextual understanding, and especially for these complex reasoning tasks, which makes it competitive, kind of toe-to-toe with o1. So, so I think we still need to test these models to see really whether they're there, because this is a new release, so it still remains to be tested and to see what capabilities they're really bringing to the table. And how do they really compare with o1? I mean, they're showing some of the benchmarks where sometimes, you know, they exceed o1. So I think that's something that needs to be validated.

7:02 Um, but one thing that I'm a bit skeptical about is, you know, I think o1 still benefits from their proprietary integration with enterprise-grade features, which R1 might lack. So, and that's something that still needs to be tested and evaluated. And another thing is, what are the broader implications, you know, of this rapid iteration for the open source ecosystem? You know, the release cycles are, it's pretty impressive, they're very fast cycles. And, you know, this release pace showcases the power also of community-driven innovations. However, maintaining quality while scaling adoption remains a challenge here. And, you know, the open nature of DeepSeek could accelerate AI democratization, and it's also challenging the big players like OpenAI, putting, you know, kind of pressure, especially as they're coming with very competitive pricing, much cheaper compared to o1, OpenAI's pricing. So I think it still remains to be validated whether we're really talking about true innovation that goes, you know, kind of hand in hand with what o1 is doing, or even better. So that needs to be still validated, but I still think, you know, the fine-tuning capabilities, the integration with the enterprise use cases are probably still lacking there.

8:29 Yeah, for sure. I guess, Abraham, that's like a very natural place, I think, to turn to you. You know, what I hear in Kaoutar's argument is kind of the idea that the models are going to become kind of more of a commodity with time, and sort of the competitive edge is integration, right? Which is, well, OpenAI can kind of win now because it's like hooked into all these other types of systems.
8:47 And that's actually where the advantage is. You know, as someone who's working on Granite, is that kind of how you see the market? Or, I'm kind of curious about your response to all that.

8:54 Yeah, I think there's kind of two groups that we gear towards. There's the commercial users, you know, where, you know, they're, they're really focused on enterprise use cases, ensuring that there's proper governance wrapped around the model, and indemnification, and just that safety and support. And then there's the open source developers that, in my opinion, kind of dictate what is the best, you know, outside of benchmarks, which, you know, to Kaoutar's point, is not always exactly what it seems. You know, our developer community really dictates what the best is given what the adoption rate is. So, um, I think over here at Granite, you know, we're focused on open source, so I think DeepSeek is a phenomenal play in terms of being able to open up the aperture when it comes to some of the most performant models on the market. Um, and honestly, I'm looking forward to kind of seeing what comes from this in terms of the learnings that are shared and, you know, how developers in the community actually start to use, uh, R1 to start to, you know, develop new ways, uh, of creating, um, to your point, like, applications and spaces where this model can perform.

9:57 Yeah, I think really to, to truly lead, you know, LLMs, or these, um, you know, large language models, need to move beyond just the raw benchmarking performance. To really reach true innovation, you have to innovate across efficiency, ethical frameworks, specialized adaptability, ecosystem support. So pushing the boundaries, not just in AI, but also how it's going to transform human interactions, technology, enterprise applications. So it's really a story about end-to-end integration while being safe, being ethical. So that's, you know, when you really can claim true leadership in the AI space. So a full story of integration, not just looking at the benchmark performance. Benchmark performance is, I'm not saying it's not important, that's important, but I think integrating it full end-to-end and meeting all the regulations, safety, and the ethical considerations will be really important to drive adoption, wide-scale adoption.

10:57 And if I may just add, the DeepSeek release did come along with a number of distilled versions. Um, so just to the point of adoption, like, you know, the 650 billion parameter model is not gonna fit everywhere in terms of compute, you know, availability. So the fact that DeepSeek understood that in order to adopt the model, you have to have, you know, different weight classes for different use cases, I think that just adds to, you know, their story as well.

11:24 Yeah, totally. Sounds like Skyler wants to get in. I think Skyler also, before your response, if I can prompt you a little bit, is, um, distillation. You should explain a little bit what distillation is, because I think it is super important. It's going to totally change a lot of the competitive dynamics in the space, but, um, you know, even I have kind of like the barest understanding of what it is.
11:43 So I think probably you should start with an explanation of, like, what does it mean that they've released a bunch of distilled models? And then, and then you should do whatever hot take you're going to do.

11:50 All right. I'll, I'll try not to get into lecture mode too much. Knowledge distillation is when a much larger, probably a much more complex, model is used as a target for a, uh, smaller or less capable model. So what do I mean by a target? Hopefully our listeners understand the idea of the next token prediction task, right? You have to complete the rest of the sentence. Knowledge distillation doesn't care quite as much about predicting the next token, but rather taking a smaller model and asking it to match the internal representation of a larger model. So before that larger model gives its answer, it has its own internal representation of the answer. Now we are tasking the smaller model to match that representation rather than making a prediction of another token.

12:44 And actually, last year, Llama showed great results of getting Llama 3.2, I believe, smaller through knowledge distillation. But what's different here is they are now fine-tuning a Llama-based model, but the larger one is coming from DeepSeek. So this is kind of, uh, you know, spanning across different companies here in different ways of training. The original DeepSeek model is way too large to actually run in a lot of circumstances. But as part of this release, they also have Llama-based models that have been fine-tuned as guided, or as distilled, from the DeepSeek model. And I think that's something that was a very, very smart play, because people are used to kind of the Llama sizes and, uh, Llama APIs, and these seem to be plug and play with those existing, uh, with those existing tools already. So knowledge distillation is a way of taking a much larger, much more complex model and using it to guide the training process of a smaller, um, smaller model that uses a lot less VRAM and makes a lot of the users much happier.

13:51 Yeah. I think I like the analogy of the teacher-student model. Think of the big model as a teacher and the smaller models as students, and they're just trying to mimic, like Skyler said, the internal representation and mimic the final answers while still having much smaller footprints.
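To make the distillation idea above concrete, here is a minimal sketch of the common soft-target distillation objective in PyTorch. The tiny teacher and student networks, the temperature, and the loss weighting are illustrative assumptions rather than details from the episode; the point is that the student is trained to match the teacher's softened output distribution via KL divergence instead of only the hard next-token labels. (Matching intermediate hidden states, as Skyler alludes to, works the same way with a regression loss between internal representations.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: in practice the teacher would be a large model (e.g. DeepSeek-R1)
# and the student a smaller, Llama-sized one. Both map a hidden state to
# vocabulary logits; the sizes here are illustrative assumptions.
VOCAB, HIDDEN = 1000, 64
teacher = nn.Linear(HIDDEN, VOCAB)  # stands in for the frozen, already-trained teacher
student = nn.Linear(HIDDEN, VOCAB)  # the smaller model being distilled

T = 2.0      # temperature: softens both distributions
ALPHA = 0.5  # mix between distillation loss and ordinary cross-entropy
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distill_step(hidden_states: torch.Tensor, hard_labels: torch.Tensor) -> float:
    """One training step: match the teacher's distribution, not just the label."""
    with torch.no_grad():                      # teacher only provides targets
        teacher_logits = teacher(hidden_states)
    student_logits = student(hidden_states)

    # KL divergence between softened distributions, scaled by T^2 as is standard
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Keep some signal from the ordinary next-token objective as well
    ce_loss = F.cross_entropy(student_logits, hard_labels)

    loss = ALPHA * kd_loss + (1 - ALPHA) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random data
x = torch.randn(8, HIDDEN)
y = torch.randint(0, VOCAB, (8,))
print(distill_step(x, y))
```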
14:12 So I'm going to move us on to our next topic. Mistral, the French open source AI company, um, recently, uh, appeared at the World Economic Forum happening in Davos. Um, and, uh, sort of after much rumor, confirmed that they were not attempting to sell the company or be acquired, but instead would be pushing for an IPO. Um, and I think it's a kind of nice opportunity to talk about Mistral, because, you know, I remember, like, many moons ago, and by that I mean, I don't know, 18 months ago, Mistral was like the thing that everybody was talking about in terms of open source AI. Um, and candidly, we haven't really heard from them in some time, right? Like, we haven't talked about Mistral at all in the last, say, 10 episodes of Mixture of Experts, and open source seems to have appeared to become much more dominated, say, by Meta. And I guess the question I wanted to kind of ask the panel first is, you know, uh, is open source really Meta's game right now? Or do, is there kind of a chance for these kind of, like, earlier players that really moved along open source AI in a really big way in kind of the early innings of this game? Um, you know, do they still have a fighting chance here? Or is it really kind of Meta's game in some way? Um, and Abraham, maybe I'll toss it to you. I'm curious about what you think about that.

15:23 Uh, I mean, in short, I don't think it's only Meta's game. Um, so the, the most recent Llama license, although it allows for open source, there are some intricacies in terms of, you know, the model nomenclature has to include Llama. So they do still wrap some, you know, uh, restrictions around how you use your model, especially if you are, you know, an IBM or a different model developer that wants to distill, uh, you know, DeepSeek into Llama. So I think the, I think the market is still open. IBM is 100 percent committed to open source. Our entire roadmap will ensure that our dense models and our MoE models are released on Hugging Face, uh, fully open source under Apache 2 licensing. So, um, personally, I think it's, you know, I think the market is still, uh, the field is still kind of open to, you know, who wants to lead that charge. And just based on our last conversation, you know, obviously DeepSeek now entering the space with, uh, you know, an extremely high, extremely high performance model, it's, uh, I think right now it's just, like, you know, who's committed to it more so than, you know, who owns it right now.

16:24 Skyler, do you agree with that?

16:25 Yes, I do. I'm rooting for them. I think, uh, perhaps, um, I don't know, living in the global majority, I do pay more attention to where these models come from. And so I'm, I am rooting for models coming from, uh, the EU or any of kind of the, kind of, non-traditional large players. So, uh, it's great to see them, uh, you know, at least not being up for sale. Um, you know, we'll see how long that stays. But yeah, it was really cool to see that statement. And, uh, again, rooting for models that are coming from as diverse parts of the world as possible. And so I'm still holding out for Mistral to still represent, uh, large parts of the world.

17:04 Yeah, of course, because I think that that is a big part I did want to bring up, is, is the global majority and kind of the geography of all this, right? I mean, we talked about DeepSeek, right, China; Mistral, for a long time, it's kind of considered, like, oh, okay, Europe's also going to have its kind of open source player in the space. And so, yeah, I think it is exciting. I guess, Skyler, to kind of push you a little bit further, you know, do you think that different countries, different regions of the world will produce very different kinds of models, right? Like, I guess that's kind of the thing that you might be suggesting here, but I don't know if that's what you're implying.

17:33 "Should they" or "could they" might be the, the key difference there. Um, I think, um, I think if they could, they would have by now. I think it is proving much more difficult to kind of, you know, uh, scale these efforts, uh, across a country. And it's also why I think, uh, two, uh, two countries have really dominated this space.
17:54 Um, so I would like to see more of that. Again, why I would be a Mistral, uh, a fan. Um, I think it would take lots of investments, uh, from governments, from universities, if that money exists, to really push that type of homegrown effort of models. And I don't really see that now. That's why, again, Mistral, stay strong, still, uh, still represent other parts of the world.

18:19 Definitely. Yeah. So, Kaoutar, are you going to buy into the Mistral IPO?

18:23 I think it's a great strategic move by Mistral. So, uh, you know, especially it's great for the European startup ecosystem, because they often face these challenges, uh, around scaling due to limited venture capital compared to what we see in the U.S. So Mistral's IPO will really test whether Europe can foster these globally competitive AI companies. And of course, you know, I think it's important not to have this centralization just, you know, between the U.S. and China. It's good also to see other countries, you know, uh, the Middle East and Europe, also contributing models.

18:59 I think, going to the question you had, whether we're going to see different models coming from different regions, there might be some nuances there. For example, the cultural, uh, cultural, uh, implications, or the, um, the, the language, you know, all these things. Maybe some of these regions might tailor their models to their specific cultures, their specific traditions, uh, focus more on incorporating, you know, their languages also in terms of the APIs and answering questions and things like that, which would be great. While, of course, for general questions and so on, there will be commonalities, but I think there might be also some, uh, regionalization that might happen in the future.

19:41 Yeah, for sure. I think that'll be so interesting, because I think it'll, you know, I mean, there's almost nothing mysterious about it. It's almost like, okay, if you're based in a country, you may think to use certain data sets that people in other countries may not think to use. Right. And, like, that'll actually have a material effect on the behavior of the model. And so, you know, I think there's really kind of interesting aspects of, like, oh, what would you choose to use, you know, if you're based in France versus, you know, Menlo Park, California? And I think that that's, that's a really interesting twist of it.

20:06 Even, I think, the way that the model responds to you, for example, maybe the tone of the language, uh, whether you want it to be polite or don't want it to be aggressive. I think if we can inject some of these human traits in these human, uh, AI interactions and kind of tint it with some cultural aspects, that would be really great. You know, the way you greet a person will be different from region to region. Would you incorporate maybe some religious aspects to it or some cultural aspects? It would be nice to see some of these specializations per region.

20:40 Yeah, definitely. I'd love to do the test, which is, um, you know, talk to this chatbot: which country do you think this chatbot is from? Like, whether or not you could be like, oh, that's definitely an American chatbot, I would know.

20:57 Next topic that we're going to cover today is a pretty interesting one.
21:00 Um, a few episodes ago, we talked about the release of a benchmark called FrontierMath from a group called Epoch AI. And FrontierMath is fascinating, uh, to me at least, because it is an attempt to kind of create evaluations that can keep up with how high capability these models are becoming. And so what FrontierMath is, is that you work with a group of, um, uh, really kind of graduate mathematicians, kind of like professional expert mathematicians, to put together incredibly hard math problems that even they have a hard time solving. Um, and, uh, using that as the source of the eval benchmark, right? And, you know, the idea here is that all the classic evals, right, like MMLU or whatever, have kind of become saturated. Like, no one really thinks that they give us good signal anymore on model performance.

21:49 Now, I bring it up again today because there was sort of an interesting controversy that emerged, where it sort of came out that OpenAI had been involved in the development of this eval, and in fact had gotten sort of access to, um, sort of these kind of initial test questions. And, um, you know, I think there's a couple of kind of responses that Epoch had. You know, one of them is that there's a holdout set, right, that the OpenAI team won't be able to get access to. There's kind of a commitment not to train on these questions, right, which might also distort the eval performance. But I kind of wanted to raise it because I think we're kind of in this interesting time where everybody knows the existing evals that are kind of the main benchmarks in the industry are kind of broken. Everybody's seeking to create better evals. And we're kind of in this new world where we're trying to work out, like, what should that look like exactly? And, uh, and I guess, Skyler, I want to kind of throw it to you, is, like, you know, how, how should we sort of think about the involvement of companies in developing benchmarks?

22:45 I guess the skeptical part of me would just say, expect that type of back-and-forth between the companies and the evals, and then take whatever performance gains they're advertising with a grain of salt and wait for third party confirmations. So that's, that's probably my, my largest takeaway there: don't say it's never going to happen. In some cases, perhaps it really is great to have smart people get into the same room and break down barriers between companies and the goals of making benchmarks. But don't just take that particular company's word about how amazing their product is on arguably, uh, overfitted results. So yes, just add overall to, uh, skepticism, and just kind of raise the bar a little bit on consumer education of what these kinds of results really mean, and, and make people really be appreciative of, of third party confirmations.

23:44 Definitely. Cause I think, I don't know, I, I take that. And I think that, you know, I'm a little bit sympathetic to Epoch, right? Which is, well, you want to create an eval that challenges the very best models. And part of that involves working kind of closely with the companies to design those evals. Like, the worst thing is you release an eval that is completely irrelevant to actually testing any model performance at all.
24:05 And so almost by necessity, there is this kind of interaction. You know, Abraham, do you kind of buy that this is sort of, like, inevitable? I know I have some friends who are like, you know, church and state, right? Like, you know, the eval people should never talk to the companies, which, at least in my mind, is a little broken, but curious about what you think.

24:20 Yeah, I would echo the same sentiment, to be honest. I think it's, um, I think the evaluations and benchmarks over the last, you know, year have become less and less, I wouldn't, I mean, not trustworthy, but, um, uh, transparent in terms of what they're actually using as part of their, uh, you know, what benchmark data makes it into the training, uh, versus, you know, what they're actually evaluating on. Um, I think in a space like this, it really is the community that dictates the performance of the model. Um, you're even starting to see where, you know, you'd have ubiquitous benchmarks across models. You're starting to see model providers pick and choose which benchmarks they publish versus which ones they leave out, to be able to narrate the story that they want. So I think as, as, you know, as that trend continues, and as, you know, data curators work with model developers to figure out what the best way is to evaluate these models, I think it's just going to be on the community at large to be, you know, the judge and jury, um, in terms of, you know, is this model actually performing what the benchmarks say, or is this another kind of, you know, gaming the system? Because a model comes out every few months, and somehow every single model is better than the previous one, so everything is always state of the art. We should have been at AGI months ago, but, you know, why are we not there?

25:34 Kaoutar, I guess this kind of leaves us in a funny place, though, if we take sort of Skyler's rule, right, which is we should see all these evals with a bit of skepticism. Is it true that kind of in the end, like, vibes still are the best eval? Like, you know, can we trust any eval anymore? Like, it kind of leaves me in a funny place, because I'm like, well, I really desperately want to have some kind of quantitative metric here, but it sort of feels like maybe that's ultimately kind of a lost game.

25:58 Yeah, I think it's, it's a very controversial thing here. You know, what can you really trust here? So there are all these benchmarks out there. But, you know, with this controversy that happened around FrontierMath, you can see that OpenAI has this advance access, which raises concerns about fairness, because it gives them an advantage in optimizing their models specifically for those benchmarks. And this compromises the integrity of this fair benchmarking, where all the participants should start from the same baseline. So how can we fix this? Can we maybe establish some governance around, you know, these evals? Can we have some transparent access rules, some independent oversight, like a third party that makes sure that everybody starts at the same baselines and, you know, that they don't get access maybe to data that will help them tune their models for those specific use cases? And then can we have an open review process for these results? So that's going to require a lot of work, but I think it can be done. Technically, it can be done to have these third parties that are completely independent, that establish a governance and write these tools and processes and so on, to be able to really ensure a fair evaluation process. And I hope we get to that at some point, because what can you trust? And you have to do these evaluations sometimes yourselves. And I think maybe the community can also contribute to all these evaluations and provide more validation.
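One concrete mechanism for the kind of independent oversight Kaoutar describes, offered here as an illustrative sketch rather than anything Epoch or OpenAI actually does: an eval maintainer can publish a cryptographic commitment to a holdout set, so that anyone can later verify the questions were fixed in advance and never altered, without revealing them early.

```python
import hashlib
import json

def commit_to_holdout(questions: list[str], salt: str) -> str:
    """Publish this digest now; reveal the questions and salt only at audit time."""
    payload = json.dumps({"salt": salt, "questions": sorted(questions)})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def verify_holdout(questions: list[str], salt: str, published_digest: str) -> bool:
    """Anyone can recompute the digest and confirm the set was never changed."""
    return commit_to_holdout(questions, salt) == published_digest

# Example: the maintainer commits before results ship, auditors verify afterwards.
holdout = ["Problem 17: ...", "Problem 42: ..."]  # hypothetical placeholder questions
digest = commit_to_holdout(holdout, salt="random-secret-chosen-once")
assert verify_holdout(holdout, "random-secret-chosen-once", digest)
print(digest)
```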
27:27 Yeah, I think the incentives are kind of a little bit interesting here, too, because I think, you know, Epoch gets burned in this story, but OpenAI gets burned as well, right? Because, like, it's not a great look in some ways. Um, and I feel like, you know, almost there's incentive to, like, be as hands-off as possible, because, look, when o3 comes out, I really do believe it will be better at very hard math, right? Like, I think there is actually some genuine signal here, but, like, where we are now maybe, a little bit, you know, happens in the shadow of, oh, well, we know this arrangement, and they had access, and all that.

27:59 I mean, the jump was pretty significant in the benchmark. I think it went from, uh, before the o3 results, it was at 2 percent, and jumped to 25 percent with the o3 result. That's a big jump.

28:12 Yeah. The question is, like, how much of that delta is the model, right?

28:16 And how much of it is, you know, being able to kind of study for the test, basically.

28:20 Yeah. And I think there was also someone, I think Chollet, the creator of the ARC-AGI benchmark, he refuted OpenAI's claim of exceeding human performance. You know, he highlighted, you know, that o3 still struggles with some of the basic tasks. So, so then, you know, it remains, you know, what do you trust? You know, the 25% leap here compared to the 2%? Or maybe there are still some, uh, gaps, and they're, they're not telling the full story.

28:49 So yeah, I think we're going to have to keep on this. Um, you know, there's a great article that I saw, um, uh, it just came out, I think, a few weeks back, that was kind of making the observation that models are getting better, but we can't really measure how. You know, we live in this kind of funny world where, like, all the evals kind of seem broken. We have a general strong intuition that things seem to get better, but, like, we have no way of actually assessing that, which I think is kind of a funny situation to be in.

29:15 Can we create an eval LLM? So, some model that evaluates all of these other models. Can we automate this evaluation process?

29:24 Yeah, I think that's kind of where we end up, is, like, I think if we think that vibes are going to be a powerful way of evaluating models, and what we really mean by vibes is, like, an interactive evaluation, like, you talk with the model to get a better understanding. It seems very intuitively obvious to me that at some point you will end up with, like, well, to scale that, we need LLMs talking to LLMs. And that, kind of, like, they're conducting a scaled vibes eval. I don't know where that goes, but it kind of feels like that's maybe one set of research paths that you'd go down.

29:55 You might be onto something.

29:56 Yeah, we'll see. I just host the show. Someone else needs to do that work.
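The "LLMs talking to LLMs" idea the panel is circling is often called LLM-as-a-judge. Below is a minimal sketch of how a scaled interactive "vibes" eval might be automated; the `LLM` callable is a hypothetical stand-in for whatever chat API you use, and the prompts, turn count, and 1-10 scoring scale are illustrative assumptions, not anything described in the episode.

```python
import json
from typing import Callable

# Hypothetical stand-in: a function that takes a prompt and returns the model's
# text reply. Any chat-completion client could be wrapped to match this shape.
LLM = Callable[[str], str]

def interactive_vibes_eval(candidate: LLM, judge: LLM, topic: str, turns: int = 3) -> dict:
    """The judge model interviews the candidate model, then scores the exchange."""
    transcript = []
    question = judge(f"Ask one probing question to test a model's grasp of: {topic}")
    for _ in range(turns):
        answer = candidate(question)
        transcript.append({"q": question, "a": answer})
        # Follow up on the candidate's last answer, like a human tester would.
        question = judge("Given this answer, ask a harder follow-up question:\n" + answer)
    verdict = judge(
        "Score this conversation 1-10 for coherence, reasoning, and honesty. "
        'Reply as JSON: {"score": <int>, "reason": "..."}\n' + json.dumps(transcript)
    )
    return json.loads(verdict)

# Example usage with a dummy function standing in for real model APIs:
if __name__ == "__main__":
    dummy = lambda p: '{"score": 7, "reason": "placeholder"}' if "Score" in p else "A canned answer."
    print(interactive_vibes_eval(candidate=dummy, judge=dummy, topic="knowledge distillation"))
```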
30:07 So for our final topic today, we're going to talk about a report that came out of the research group IDC, uh, about generalist versus specialized coding assistants. Um, and it was released, uh, just earlier this month, I believe. Um, so the report kind of takes a look at, you know, what programmers are getting out of, um, uh, out of, uh, coding assistants. And they show a lot of the results that I think we are familiar with at this point. So they report that 91 percent of developers are using coding assistants. They say that 80 percent of those developers are seeing productivity increases, with the mean productivity increasing by 35%. So all kind of the good news that we're used to, which is that these coding assistants really do seem to be helping people along in doing better at their job as software engineers.

30:53 I think the really interesting thing, though, that they make a distinction on is between generalist and specialized coding assistants. So generalists are basically, like, overall coding help, with specialized assistants focusing on specific programming languages, specific frameworks, um, industry-specific requirements. And they kind of make the distinction that these are actually, like, two different markets. And right now, like, you kind of need both to do coding assistance. And I guess maybe the question, you know, maybe I'll throw it to you, Abraham, first, is, like, you know, I always thought that, like, where we're headed with these coding assistants is that there will just be one coding assistant model to rule them all. Um, but it is kind of interesting to me, they seem to be making the argument that, like, no, there's going to be these really interesting niches for, like, you know, my joke is, like, the FORTRAN model, right? It's just, like, specific to this particular use case. Is that what you guys are seeing at Granite? Like, I'm kind of curious, because I know you've done a fair amount of coding work.

31:44 Yeah, yeah. So I, I agree, at least in the current space right now. You know, the, the perfect world there would be, you know, uh, one ring that fits all, like, you know, that rules them all...

31:54 The one ring that fits all.

31:55 ...kind of methodology. But here at IBM, you know, we support, uh, we develop our resource-specific models, and the reason behind that is there are these legacy applications, you know, COBOL on Z, where it's a low-resource language. There's not a ton of, you know, data that we can use to be able to train models, where, if we were to start to bake it into our more general code model, some of the capabilities might get lost in terms of being able to support that use case. So we find that, you know, you do have these legacy systems that people are still on, where, you know, resource support might not be as prominent as it was 5, 10, 15 years ago, where you do need to backfill some of the work with, you know, code assistants. And then you do have your larger, more general models that support, you know, your more, uh, widely used languages. So in our space, we really do have that two-pronged approach in terms of how we develop our, our code models.
32:46 And of course, you know, the, the ultimate goal is to start to consolidate into something that can fit everything. But right now, that's just not the case.

32:54 So I guess your prediction is that we will actually just see, like, this is temporary, and we will see the merger, like, generalists will become specialized at some point.

33:01 You know what? I'm, I'm trying not to make predictions in this space, because everything changes so fast.

33:05 Yeah, I think it's hard.

33:06 But what I will say is that, um, there's a shift in workforce, specifically around, you know, capabilities. So I think that for organizations that need to be able to maintain their environment, they will look for models that help that. And if that can be provided as a part of a general model, all the better. But I think right now, it's, it's still looking to be more of a specialist model focus.

33:31 Skyler, do you want to talk a little bit about, I mean, the interesting kind of labor impact of all this? Um, you know, I was joking with a friend recently, I was like, what you really need to do now, talking about the FORTRAN code assistant, is, like, you need to specialize in languages that no one programs in anymore. Right, because if you do Python, you do, you know, any of the popular languages, you're about to get wiped out, because the models are going to get really good really fast. And so the main thing is to flee into, like, what weird, obscure version of Haskell, you know, and kind of, that's your, that's your defensive moat if you're a coder. Is that good advice? Or is that just crazy?

34:06 That's a great anecdote, um, and I think actually it's not just a story. I do think actually IBM's got a lot of vested interest in keeping some of those old languages up and running. So, uh, beyond, beyond just a, um, a punchline, I think there's a great breakdown here. As part of this, um, survey that was done from the IDC, they also asked what particular tools, or what particular tasks, do you use these assistants for, and at the top of the list was unit test case generation. So this is, like, the really boring part of software engineering, writing all these unit tests to try to break your code. In that sense, I would say to your friend, don't specialize in building unit tests. That is something that I think machines are doing a great job of, and people are already leveraging for that task.

35:02 But at the bottom of this list, of where they aren't using these tools as much, is code explanation, which is: now, if I copy in a set of this code, can I have an LLM tell me what this code is doing? So I think there's this really cool breakdown between what tasks software developers really want to be automated for them, things like coding up unit tests, and other areas where they actually need to, you know, use kind of higher level processing of, ooh, what is this code doing? Can I explain what this code is doing to somebody else? And that kind of breakdown here of how at least software developers in the U.S. are currently using tools, I think, represents that gap.
35:45 So to your friend, don't tell them to specialize in unit test generation, but maybe have them skill up a little bit on the ability to explain what that code is doing, because that's something that currently the AI assistants, at least, are not being used for.

36:02 I see the future as an AI co-creation with, uh, software developers. So the future of programming will involve human-AI collaboration, with AI as a coding assistant helping to brainstorm, optimize, and refine solutions. But going to your friend, I think where they should focus, uh, is on areas where AI struggles, things like system design, security and handling edge cases, uh, creative problem solving. So, uh, you know, responsible AI use cases. Those are still areas where AI struggles, because I think designing and solving and programming complex software systems involves not just coding, but a lot of other elements and, and, uh, angles here, and especially the collaborative nature of understanding the end user's requirements, the client edge cases, the requirements, the security implications, all of that, and putting it all together in a full, full end-to-end solution, with testing, with coding. So there are a lot of elements here that AI still cannot handle completely, and software developers are still needed, but I think they need to focus more on those situations that AI struggles with, while of course enhancing their productivity with these code assistants and copilots.

37:27 Yeah, I think that's right, and I think, I don't know, Skyler's emphasis, I think, on, like, don't do unit tests, but work on explaining the code, I think is very interesting. Um, you know, I mean, classically, documentation is always terrible for any software. Um, and I guess, Skyler, kind of what you're saying is maybe that's actually where the future is. Like, you really, you really gotta get better at that soon.

37:46 I was actually having a conversation with a former, uh, coworker, and, uh, I don't want to date him, but when he was doing his grad school in computer science, he said they didn't code. They just, their goal was to think about how to strategically, um, you know, outline your code and what's the thought process behind building it, as opposed to just going and building. And he recently took on a new, uh, role in a new space, and he's had to learn a new language. And it was funny, he was saying, I don't have to build code anymore. I think the gap that I see with a lot of these, you know, PhDs coming out is they don't have to build code, but they're never taught how to think through and explain why we're doing what we're doing. So he found it a lot easier to actually learn, given that that was kind of where he started. So, to your point, Skyler, he's actually seeing that the better you can actually structure your code in your head before you actually start to write it, the easier it is to learn.

38:41 I agree with you, Abraham. I think the problem solving process, how do you decompose a problem into subproblems, and also the algorithms, I think understanding, you know, how to create a very innovative algorithm, this is something, you know, that requires deeper thinking, deeper expertise, that probably AI cannot solve today. Like coming up with a new algorithm that solves something, um, like, uh, some of the existing problems. So it's still challenging for an AI system to do.
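For listeners who want to act on Skyler's advice, here is a small sketch of pointing a model at the underused task he names, code explanation. The `QueryModel` callable is a hypothetical placeholder for whatever chat-completion client you actually use, and the prompt wording is an illustrative assumption.

```python
from typing import Callable

# Hypothetical placeholder: wrap your real chat-completion client in this shape.
QueryModel = Callable[[str], str]

EXPLAIN_PROMPT = (
    "You are a senior engineer. Explain what the following {language} code does, "
    "in plain language a new teammate could follow. Cover its purpose, its inputs "
    "and outputs, and any non-obvious behavior or edge cases.\n\n{code}"
)

def explain_code(query_model: QueryModel, code: str, language: str = "Python") -> str:
    """Ask the model for a code explanation rather than generated tests."""
    return query_model(EXPLAIN_PROMPT.format(language=language, code=code))

# Example usage with a stub standing in for a real model call:
if __name__ == "__main__":
    stub = lambda prompt: "This function deduplicates a list while preserving order."
    snippet = (
        "def dedupe(xs):\n"
        "    seen = set()\n"
        "    return [x for x in xs if not (x in seen or seen.add(x))]"
    )
    print(explain_code(stub, snippet))
```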
39:12 Well, let that be a lesson, or a word of advice, to all you coders out there who are listening to the show. Um, as always, I say this every single episode, but we are out of time for all the things that we need to talk about. Um, thank you for joining us, Abraham, we'll have you back on the show. Kaoutar, as always. And, and Skyler, thanks for coming on, and thanks for joining us.

39:30 If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we will see you next week on Mixture of Experts.