AI Roundup: Rabbit Hardware, GPT-2 Bot, FT Deal

Key Points

  • “Mixture of Experts” is a weekly AI‑focused programme that brings together a rotating panel of specialists to cut through the flood of news and highlight the most consequential developments.
  • The current episode features three IBM‑affiliated experts: Chris Hay (Distinguished Engineer, IBM), Kush Varshney (IBM Fellow, AI governance), and Shobhit Varshney (Senior Partner, Consulting, AI & IoT), each representing a different AI domain.
  • The first headline covers Rabbit’s AI‑companion hardware launch, which quickly ran into technical glitches (a battery‑drain firmware issue) and credibility concerns after it was revealed the device was essentially just an Android app that could run on a phone.
  • The second story examines the surprise appearance of a “GPT‑2 chatbot” on Chatbot Arena, raising questions about the reliability of current LLM evaluation methods and the transparency of model provenance.
  • The final discussion notes OpenAI’s new agreement with the Financial Times to license the newspaper’s content for training, illustrating the growing trend of commercial media data being used to improve large‑language models.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=hwNkFnR1U0I](https://www.youtube.com/watch?v=hwNkFnR1U0I)
**Duration:** 00:41:34

## Sections

- [00:00:00](https://www.youtube.com/watch?v=hwNkFnR1U0I&t=0s) **AI News Roundup with Experts** - The show introduces an IBM-led panel to dissect recent AI headlines, beginning with the troubled rollout of Rabbit AI hardware.

## Full Transcript
**Tim (0:05):** Hello and welcome to Mixture of Experts. On this show we're going to be meeting weekly to review the deluge of news happening in the world of AI. The goal here is to distill it down: it can be really hard to keep track of everything that's flying around on a weekly basis, but the hope is that by bringing together a group of experts we can distill what's happening, give you an understanding of what's happening in the world of AI, and tell you what to look for in the week ahead. Today I'm joined by a great panel of three experts hailing from different areas of the AI world. To quickly run through them: Chris Hay is a Distinguished Engineer at IBM and the CTO of their customer transformation operation. Chris, welcome to the show.

**Chris:** Hey, thanks for having me. Looking forward to this.

**Tim:** Yeah, absolutely. Kush Varshney is an IBM Fellow working on AI governance issues. Kush, welcome.

**Kush:** Yeah, thanks, Tim.

**Tim:** And Shobhit Varshney is a Senior Partner in Consulting, running the AI and IoT business in the US, Canada, and Latin America. Welcome to the show.

**Shobhit:** Thanks, guys. Thanks for having me.

**Tim (1:14):** Well, let's go ahead and get started. We're going to cover three really big stories of the last few weeks. The first is the recent release of the Rabbit AI hardware product; we'll talk a little about the trouble they've been having on the rollout and what it all means for the future of AI-enabled hardware. Secondly, we're going to talk about what's happening with gpt2-chatbot, a mysterious chatbot that has just appeared on Chatbot Arena: what it is and what it really tells us about evals in the AI and LLM space in particular. And finally, we're going to talk about OpenAI concluding a deal with the Financial Times to license their data for training purposes.

**Tim (2:05):** I'd like to start with the Rabbit story. Rabbit, if you've been watching, is a really widely discussed hardware startup whose bet is basically that in the future we're going to have AI-first hardware; Rabbit, effectively, is a little device intended to be an AI companion for you. They rolled out just recently and immediately ran into a number of problems. They had to push a firmware update to deal with a battery problem, the battery was draining too quickly, and they've been criticized because it turned out their product was essentially an Android app that could run on a phone. I wanted to bring up this story because it feels like the second data point: there was the release of Rabbit, and Humane, another company that released an AI-enabled pin, has similarly run into a lot of criticism, people asking why would I buy this, is this a good product at all, what is this all for? So, Shobhit, I was hoping to bring you in to kick us off here, because personally I'm very excited by the future of AI hardware; there are so many cool things that can happen once AI is on-device and is a thing you can carry around with you. But clearly some of the first forays, some of the most talked-about forays happening today, are having some teething issues. So I want to get your take, as someone
who's deep in the AI and IoT space and thinking about the relationship between AI and hardware: how do you see these recent stories, and what do you think they tell us about how this market is evolving?

**Shobhit (3:46):** Thanks, Tim. I had the pleasure of playing with the Rabbit R1 at CES this year, and obviously, as a geek, I did drop my $200; I received my Rabbit.

**Tim:** You own one, right?

**Shobhit:** It is a fantastic effort. If you think about the direction AI is moving, it will go to the edge more and more. You're seeing the models getting smaller; there's a lot of work happening in on-device computing; Apple is breaking down its walled gardens and open-sourcing OpenELM; you'll see Google with its models. All of those are moving closer and closer to the edge, so I generally love the direction it's taking. That way you get to address things around privacy: your data is being computed on the device, and it stays with you. The direction the tech is taking is fantastic; I'm all for it. I think there's a lack of appreciation, though, of what problem you're really trying to solve and whether there are other devices better suited for it. When you react to a device like that, in your brain you're creating a set of things you're going to evaluate it on, and I think that's where the problem is with the R1, or even when you look at Meta's Ray-Ban glasses and the Humane pin. We have a set of things we're looking to evaluate it against. As an example, I would appreciate it understanding me as a person really well if it's attempting to be a personal assistant, and I would appreciate instant responses: if I can do something half a second or a split second faster on a regular mobile phone, I'm going to tend to go do that. We're all optimizing for how to make things more effective in our own lives. Then you start to look at the fact that I'm already carrying a cell phone in my pocket, so it has to be net new; the watch I'm wearing, for example, is adding something to the ecosystem. When you create that set of criteria and start to evaluate Rabbit's R1, it starts to fail on some of the basic capabilities we expect from it. The direction is great, but the battery life is very low, and with the screen they tease you with certain things you can do with the touchscreen, like typing on the terminal, but you can't really interact with the menus. The menus remind you of the old scroll-wheel iPods: those were amazing for scrolling through music, but terrible at changing settings, and we see that paradigm shift over to the Rabbit R1 as well. There are a few things around taking images where the visual recognition of what's in front of you has been pretty decent; I've had good responses pointing it at certain things in front of me. Documents are hit or miss right now, and handwriting recognition is still taking a lot more time.

**Tim (6:30):** I was curious: what's your most magical experience so far? I'm almost interested in the steel-man case, the most exciting thing you've done with it so far, because it's gotten so much hate online that it's almost more interesting to think about the killer application. I remember when I bought my smartphone for the first time, I thought, this has maps on it; I'm literally never going to be lost again, and that's a huge deal. And
I guess in the AI space I'm still kind of waiting for that, and I'm curious, as someone who owns it, uses it, and is playing around with it, whether there are things where you're like, oh, this is starting to be really cool.

**Shobhit (7:04):** So I think the promise of the large action model is pretty cool. It has solved this to a decent extent with a few apps, like Uber and others. The fact is that a lot of the services we use today are hidden behind applications, and not all of those capabilities are exposed through APIs, so it's difficult for, say, a personal assistant like Siri or ChatGPT to call them and take actions. The large action model, I think, has a lot of promise. The training data becomes the constraint for them; that's their Achilles' heel. So far they've had hundreds of people manually go and train these models, and they're going to open up a catalog of hundreds of different models over time, but in its current form it's very limited in what actions I can take on it. Still, the fact that you can delegate an end-to-end process that's very complex, one you couldn't have done with APIs, is what really excites me. I see a lot of applications.

**Kush (8:04):** So I don't have one of these; I'm not as much of a gadget guru as Shobhit is. But there are going to be fits and starts with any new paradigm, and things have to start somewhere. I'm more of an optimist on things generally, so to me, what this may be leading to is actually a fourth paradigm of how we interact with computing. There were punch cards, there was the command line, then there were GUIs, and now we're in this fourth era of natural-language interactions. Maybe there's no killer app yet, but the killer app may be the fact that we have this new way of interacting, and that's what these devices are going to start us down the road toward: having this more mutual theory of mind, where the system interacts with us, it understands us, we understand it. I think that's where we're headed, and the more we can keep going down that road, the better. Of course, the first instantiation of anything isn't always the most perfect or the best, but that's why I'm optimistic about it.

**Tim (9:22):** Yeah, and I think there's this interesting hill-climbing going on, because a friend of mine said this is Google Glass all over again: you're going to have a couple of products with such a bad rep that they taint the entire market for a decade plus. But I was kind of agreeing with you; it's not like these products are failing that hard. If you remember when Google Glass came out, people went into bars and got beat up for wearing it. We're not there yet, so it feels like we're more in this hill-climbing scenario.

**Chris (9:53):** I don't have the device, but I think it's utter nonsense, if I'm honest.

**Tim:** Well, tell us why.

**Chris:** Think about what it is. What do you need to have here? A camera, a touchscreen, access to Wi-Fi, and then, for it to be useful, a cell connection as you move around. It's going to do image recognition, and it needs AI hardware on board. What is it? It's a phone. So this is why you can't find
a killer app, because the killer app is a phone. And I'm going to give a practical example. Apple silicon is absolutely incredible. Last night, on my M3, I fine-tuned the Mistral 7B model with my own data set in 15 minutes, at 250 tokens per second. The GPU is incredible, and that same technology is coming into the phones. Apple is going to go on-device; they've got the hardware with Apple silicon, and the mobile phone manufacturers are going to do the same. So as far as I'm concerned, and I agree with the paradigm, it's like trying to sell a pager to somebody today: here's this thing that's got what you need, you can get messages, and so on. But nobody has a pager, because it was replaced by the phone. I do think there will be AI-on-hardware devices; I just don't get that one.

**Tim (11:21):** And I think you're raising one final point I wanted to hit before we move to the next story: obviously the thousand-pound gorilla in the room is Apple. It's not even a thousand pounds; it's basically the 100,000-pound gorilla, because it's got the hardware, it's got the data, and it should be able to execute on all of this. They haven't yet, really, and I think this is where all the other hardware companies see an opening: Apple is going to be so conservative that there's an opening in the market to get in, and at least maybe be a good acquisition target. Shobhit, to bring it back to you, I'm curious if you have any thoughts on Chris's attack on this whole idea, because clearly you were bullish enough to buy the product and experiment with it. If you've got the Chris takedown here, what's the thing he's not seeing?

**Shobhit (12:09):** No, he's on the right track. In the current state it's not a great product, but I'm just being optimistic about where the tech is going: I see the promise of what this can bring. I think these devices will evolve, and Apple takes a while to come into an industry. The same thing goes for the Vision Pro: again, I was a big fan, I bought one early on, and 30 days in I returned it. I found some experiences, and the promise of where this can be; at this point I'm just waiting for the next version to come out. But I'm with you, Chris: in the current state, yes, I would have used the $200 elsewhere. I'm just a sucker for good tech, man.

**Chris (12:52):** Yeah, me too, Shobhit. But maybe my challenge back to you is this: let's fast-forward six months, post-WWDC, when all of the AI capabilities start to move onto the regular iPhone, which I think already has the hardware it needs for these scenarios, and let's see if the Rabbit actually comes out of your drawer at that point, or whether you're just doing those same scenarios on your phone.

**Tim (13:17):** One of the funny scenarios I was thinking of, too: people right now are obviously focused on the top end of the market, who's willing to pay hundreds of dollars for the Rabbit. But as models get more efficient and more energy-efficient, we may just end up putting really small models in all sorts of existing technologies, and that could be both an interesting thing and potentially a bad thing: it's 2035 and you're arguing with your toaster to get it working, because at some point someone made language the only interface for it.

**Tim (13:53):** So let's go ahead and move to our second story. I wanted to talk a little about gpt2-chatbot. If you're not familiar, essentially this mysterious thing happened. There's a platform called Chatbot Arena, which has become in some ways the gold standard for evaluating models. It's a really simple idea: you have people talk to two models, and you tell them which one you like more. This has basically enabled pairwise comparison across a lot of the open-source and proprietary models floating around the space. And this mysterious one emerged, gpt2-chatbot, which everybody claims is incredible. I agree, actually; playing around with it, it's quite impressive. It was accompanied by a mysterious, opaque tweet from Sam Altman saying he also has good feelings about gpt2, and it immediately led to this fan fiction, if you will, about what this model is and whether it's a Trojan-horse stealth release of what could be GPT-4.5 or GPT-5. So, Chris, I want to throw it to you: what are we seeing here? Is gpt2-chatbot really the next-generation model? If you have any theories, I'd love your take. Do you buy the hype?

**Chris (15:16):** I don't know. I had a play with it; it's pretty good, actually, to be fair. Is it GPT-5? I don't know. I think they've hyped GPT-5 so much that if this is it, at this point it has to be AGI or it's not even going to impress us.
Exactly. So maybe it's GPT-4.5, but I don't think so. I read a theory online, and I can't say who said it, but I actually like it. Somebody said to take the GPT-2 LLM, which they've open-sourced and you can download on Hugging Face, and they reckon they may have trained GPT-2 on the latest data that trains GPT-4. I think that's an interesting theory: GPT-2 with GPT-4 data. Maybe it's something like that. I don't know, but I don't think it's GPT-5; it probably is GPT-4.5. And as you say, you've got to put it in some sort of arena to see how well it's actually performing. They'll have run all the MMLU benchmarks and the twenty thousand other benchmarks out there, so sticking it in the Chatbot Arena and seeing how it performs is probably quite a smart move; it's a good way of testing that model. So seriously, it's probably GPT-4.5, but I really like the idea that it's GPT-2 with GPT-4.5 data. I think that's a cool theory.

**Tim (16:36):** There are two things there. One is that if that actually turns out to be the case, and the model's performance is really good in the arena, then do these architectures really matter? Is it just that you have enough data and can make the model amazing, so data ends up being the bigger lever? I do have a follow-up question, but just to quickly pause: Kush, Shobhit, I'm curious if you have thoughts. First, do you buy the hype? Do you think this is the next model, or is it all overhyped?

**Kush (17:06):** It could be. Anyone's guess is as good as anyone else's. I'm sure it is something that's coming up next, but why speculate? They'll tell us pretty soon.

**Tim:** Exactly.

**Shobhit (17:21):** If you followed the talk by Andrew on how agentic flows are going to be the way we get to AGI: I think over time, the next set of models they bring out will have a decent router that picks the right models, and you're seeing these kinds of things come out from OpenAI already; they're automatically picking the right model based on the query. So I think gpt2 would be a step in the direction of getting to 4.5 and 5, but it won't be just one big model that solves all of that. I think they may be testing in public and getting feedback on how people are reacting and what kinds of questions people are asking in these open LLM arenas. It's very entertaining; it's great drama in the AI world. I love it; I just pulled up popcorn and enjoyed what was happening. Somebody posted that originally Sam Altman had tweeted "gpt-2" and then edited it to "gpt2"; they're just leaving breadcrumbs to make this more entertaining. So I love the direction this is going in. I think over time it will not be one big 4.5 or 5 model; you'll end up with a mixture of experts as the way they solve for this.

**Tim (18:37):** Yeah, for sure. Just before we move on, you mentioned Andrew: who is Andrew, and is that stuff public if people want to check it out, or is it internal?

**Shobhit:** Andrew is a god of AI. Look at deeplearning.ai: he started Google Brain, he co-founded Coursera, and he's a great, great guide to follow on AI.

**Tim:** Yeah, and
it's also very funny to me; it occurs to me that whether it's Sam Altman or Taylor Swift, both are basically dropping breadcrumbs on social media as a way of driving engagement around their products. And for folks, earlier, that's Andrew Ng; if you want to check out his stuff, he's great. I agree.

**Tim (19:17):** So I want to put on my tinfoil hat for a moment, for the next turn of the screw in this story. Let's assume for a moment that gpt2-chatbot is the next great thing OpenAI is going to release. I think it's actually very indicative that one major way they want to evaluate this model is to release it on Chatbot Arena. One of the interesting things I see evolving in the space is that you meet a lot of people who are really deep machine-learning people and who, I think, desperately hate the idea that in order to tell whether a model is good, you just talk to it for a bit and then say whether you think it's good or not. They would prefer a much more structured evaluation for measuring conversational quality. But as it turns out, Chatbot Arena is dominating the space over time, and it leads to this very interesting world where it has become more and more difficult to quantify model quality, and we're almost falling back to the most one-brain-cell way of evaluating models: I don't know, you talk to it for ten minutes and then you say whether you think it's good or not. I was joking with a friend recently that what we should do is start reviewing models like we review fine wine: oh, this is a model with oaky overtones, and it's a little more conversational. I think we're moving in that direction. But I'd be curious to hear, particularly from Chris, whether you agree that this is the future, because it is so funny that what's happening is that we have this super-advanced technology, but our eval methods remain very rudimentary. Some people would say that's good; some would say that's not how it's always going to be.

**Chris (21:01):** I think it's a good thing. Is it really that different from the original Turing test? That's what Alan Turing said: you go have a conversation, and if you can't tell the difference, is it human or not? And if I think about the problems with the benchmarks, this is why I quite like the Elo ratings. Again, it's not perfect; LMSYS, I think, do a good job there with the leaderboard. But the problem is that because all the benchmarks are published, we know everybody is fine-tuning to the benchmarks. So how valuable are the benchmarks, really? Everybody says, I'm 84, or I'm 85, and you think, well, if I then ask a query that's completely different, that's not on a benchmark, it starts to mess up. One of my favorite tests is that I play Hangman with the various models. Sometimes I'm playing the game, and sometimes I'm choosing the word, and I can tell you straight up that none of the models play Hangman very well, including GPT-4. If I give it a word, and I use "cheese" as my test, it very quickly guesses the E, so you get blank, blank, E, E, blank, E. There are no other words at that point, and then every model says, I don't know, an R? And I'm thinking, why are you guessing an R? So that sort of vibes testing you get from the arena is really important, because those are the creative kinds of tests people will run: we'll play Hangman, we'll play tic-tac-toe, we'll ask different questions. If a model is literally trained within an inch of its virtual life, fine-tuned to the benchmark, then how valuable are those benchmarks, really? So I think the future has got to be tests you can't fine-tune to, i.e., you don't know what the questions and answers are in advance, and it's got to be a little more creative. Whether that turns into benchmarks, whether it turns into the kind of thing LMSYS is doing today, or whether it turns into, as you're saying, "this model has this kind of vibe, it's a little bit chatty, it's good at classifications," and so on, I think you're right that it might move in that direction. But at the moment the arena is probably the only sensible place where you can actually rank these models.

**Tim (23:27):** Sure, and I hadn't really thought about that. If I hear you right, one of the arguments is that basically all the benchmarks we've been using are now kind of useless. There's been this collective-action problem where everybody's been gaming the benchmark, and so the only thing you can really trust ends up being, I don't know, you have a twelve-year-old talk to it for a little bit and tell
you whether or not you know 23:50they like it or not you can trust if 23:53everybody's at the same level of you 23:55can't trust if everybody's at the same 23:56level of Benchmark that's no indication 23:59model model but obviously a model is low 24:04you know then you can say they got a 24:06little bit of work to do on that model 24:08but at the higher end you know if you go 24:10oh I'm 0.2 better in this Benchmark 24:13reality who cares yeah that's right 24:17that's right yeah and I think um yeah 24:19and I also buy that right which is 24:20basically maybe it's actually a sign of 24:22the success of these models is that like 24:25a lot of the leading ones are so good 24:26now right that the benchmarks are a lot 24:29less useful because like yeah we're 24:30talking about these gradations that in 24:32terms of like actual experience of the 24:34model like very limited um uh most of my 24:37my accounts of when I'm partnering with 24:39clients these are 1400 companies and and 24:42where we're doing gen at scale and we 24:44putting things into production there's a 24:46much higher threshold of what good looks 24:48like and how can you define it 24:50especially in the regulated industries 24:51that I work in final Services federal 24:53government and things of that nature 24:54right in those cases you have to be very 24:56precise on how do you measure the 24:58accuracy how to measure effect the 25:01answer is is correct it's grounded it's 25:03hallucinating things of that nature 25:05right so for majority of my Fortune 100 25:08companies we have when we partner with 25:10them we create a system of benchmarks 25:12that are very tailored to the way the 25:14use case that they're putting into 25:15production so for example if you're 25:17looking at a rack pattern you're looking 25:19at pulling the right content is that is 25:21that content correct from that from 25:23those Snippets are they rank the right 25:25way given those Snippets can I reliably 25:27create the answer 
— is it grounded, what's the grounded score — and given the answer that you retrieve, does it really answer the question that was asked, and so on? Across each one of those, it's based on the kind of domain they're working in. If you're looking at, say, contracts, and you're trying to analyze whether the answers it's pulling out are correct or not, the question itself will define what a good metric is. I may ask a question about a given contract: tell me whether I can order this particular part under that contract or not — which means it's looking at the top of the contract, it's looking at an exception on page 19, and so on and so forth. So it's more of a chain of thought, understanding how things are connected. But then if I say "can you contrast these two contracts," that's no longer a plain RAG pattern: you're now asking a question where it's pulling out the right information, comparing it together, and giving it to an LLM. So each query type has its own set of metrics that we need to evaluate. We've created some very robust metrics to evaluate these models, and whenever a new model comes along — like Llama 3 came out, and Snowflake Arctic came out in the last few days — we need to plop that model into the workflow: in a ten-step process, step number four is "call an LLM," and for everything that comes before and after we need a good set of metrics to evaluate it. The majority of the Fortune 100 companies we're working with will look at the public metrics and say "hey, a cool new model came in," but for the actual evaluation we do not look at HumanEval scores, we do not look at the scores that are public in nature, because those are not as meaningful for enterprise use cases. So, partnering with Consulting, our clients
have built these really robust benchmarking mechanisms, and that's how we've been bringing these to production — during experimentation and in production, continuously evaluating throughout the day.

Yeah, totally. And I think it's one of the interesting things — I was talking with a friend recently and said, in the future we're probably going to have agencies that just focus on evaluation. It feels like an emerging business: essentially, as models and pre-training become more and more commodified, the big question will be which one you should actually choose. It feels like there's a whole industry to be built around bespoke evaluations, even around curating the people who evaluate your model — that seems like a critical question.

Yeah, we're going to start interviewing models like we interview humans, right? "You, model, are applying for this HR job, so you're going to be evaluated on your HR skills. You, model, are a developer — let's see what your React coding skills are like." And actually, I think it's a fair point, right, Shar: it doesn't matter how good a model is on a benchmark. All that matters is whether it's good at the task you need it to do. If you need it to do legal contract comparisons, it doesn't matter if it's the best poetry writer in Snoop Dogg style — what matters is, can it evaluate contracts, because that's the job you want it to do, and can it do it reliably?

One point I wanted to make, related to what Chris was saying about people fine-tuning to the benchmarks: the Haifa research lab of IBM is putting together a hidden benchmark — so, not releasing it anywhere — and they have this thing open-
sourced called Unitxt, which is a way to actually construct these very quickly and easily. So I think one direction that's going to be emerging is that the job interview also gets hidden away, so that people can't train to it. And I think what Shar was saying is precisely right: these have to be right on point, on task, for the sort of usage that you want. One thing we talk a lot about with customers is something called usage governance, and that is precisely it — you don't care about what else the model can do, just what is important for your application, your industry, and things like that. So I think it's a great area, and government regulations are going to require a lot of this third-party testing and evaluation too, very soon, so everything is headed in that direction.

Yeah, totally. One of the stories I was thinking about covering — which we'll probably end up doing in a future episode, and Kush, it'd be great to have you back for it — is NIST, and the development of all of these federal standards in the space, and what that's going to look like. I hadn't really thought that it would look like a standard HR interview. I love the idea that in 2035 you're basically asking the model "what's your greatest weakness?" or "I'm going to need you to do this bubble sort." And hopefully models will find it as frustrating as humans do.

Yeah. And I think models are going to evaluate models too — we're already seeing that
quite a bit. One thing our team is working on: the Arena is just a pairwise comparison, with a human judging two things. But if you have three models, we can figure out smart ways of having models judge each other. When you have three models — say one is an expert, one is a novice, and one is intermediate — the expert can tell which one is the novice, and the intermediate one can also tell which one is the novice. So using three at a time, you can actually work out a total ranking of models. It's a fun game to be in.

Fantastic. Hey Kush, yesterday we had a new paper by Cohere talking about the panel of LLMs, right? That's a very interesting way of looking at it: instead of having one LLM as a judge, when you mix different LLMs as panelists you get better accuracy. I also think that for the tasks we're asking an LLM to do, we fundamentally need a better appreciation of which steps an LLM is better at versus humans. I feel there's a bit of a flaw in our benchmarking systems today: if you look at a ten-step process, there are certain steps along the way that humans do that are incredibly easy for an LLM to take on — they just slam through those — and then something very fundamental will be so darn hard for an LLM to get right. So I think we are projecting what we are good at as the benchmark for evaluating LLMs. As we play around with these more, we'll have a better understanding of what we should evaluate an LLM on, so I think the benchmarks themselves will start to change. I'll probably not ask them
what their strengths and weaknesses are, or to tell me a joke, and things of that nature — but I'm sure we'll have a better sense of what kinds of questions we should be evaluating these LLMs on, in context and grounded in the use cases that are in production for our enterprise clients.

[Music]

Let's move on to the final story. This will be a quick final conversation, but it was a big enough story that it's worth bringing in. The news broke earlier this week that OpenAI had signed a licensing deal with the Financial Times, basically to license their content for training purposes. Obviously this happens against the backdrop of OpenAI getting sued by the New York Times and a number of other rights holders, and the question of what you're allowed to train on — is it copyright infringement, what do companies like OpenAI owe to the people whose data is integrated into their models — is a really big one. Kush, I know you work on AI governance, so I want to throw this over to you for the first take. I'm curious if you have any reactions to this news: do you think we're going to see more of these types of licensing deals going forward, and if so, why? I'm interested in what's driving the business decision here.

Yeah. So the content creators certainly need to receive something in order for them to just exist, right? Because pretty soon we'll have run out of all the tokens in the world for these things to be trained on. So the new content needs to come from somewhere — it can't be just fully synthetically generated
data; that will lead to model collapse and all sorts of other problems. As for how that happens: copyright was always meant not to be a permanent thing — it's there to protect creators during their lifetimes so that they have some livelihood. So I think that's the idea we need to keep going with, and it might just lead to a completely different business model. Local journalism has kind of died around the world, and maybe this is a way to resurrect it, because you need this information to come from somewhere. When we have a RAG pattern or anything else, there needs to be timely information as well. So the fact that content needs to exist, and that we need a way to incentivize it, is the crux of it. At IBM, we do a great job trying to eliminate copyrighted content from our Granite models, and we do a lot of work there. But eventually, I think it's not a question of who's infringing, who's getting sued, who's doing licensing deals, but how we make an ecosystem in which the creators are valued as much as anyone else.

Yeah, for sure. And I think this is really at the crux of whether or not this AI economy can work. If you want to use Google as an earlier template: it was this interesting moment where we said, okay, you'll allow us to index all of the web, and in exchange we're going to send traffic to you, because we're a search engine. That actually created a trade by which an ad economy could work. It feels like here are the
challenges: we haven't yet built the infrastructure to create that kind of symbiosis. So essentially there's nothing incentivizing the creation of new high-quality tokens, which is ultimately going to be a structural issue for the market. I guess, Kush, the question I'd have for you — to push back a little, because I was debating this with a friend of mine — my friend argued this is all just window dressing, because it turns out Financial Times tokens are just not that valuable to OpenAI: (a) it's not a whole lot of tokens, and (b) they already have a lot of news stories. Do you buy that? How valuable are the tokens we're talking about here, when it comes to a newspaper? Is it maybe even more valuable to get, say, movie scripts than news stories? I think this ends up being a really interesting question about where the most valuable tokens come from, and I'm curious whether you think these journalism tokens, which have been the focus of so much attention, really are where it's at.

Yeah, journalism tokens have been in the news, but comedians have been suing as well — it's not one or the other. So I think it's just the fact that it has to be new tokens. And there's distribution shift, right? The world is changing. Now we have whatever we talked about at the beginning — these new gadgets; the terminology for them isn't going to exist in any historical documents. So we need to keep up with the world, the way the meanings of words change, any
sort of new thing that comes up. News tends to be one place for that; I guess comedy routines tend to be another. Wherever the world changes, however we can bring that in — that's where the value is, because it's not about the number of tokens, it's the quality, and quality in the sense of getting these models to keep adapting to the world as it exists.

Yeah, for sure. Well, we're probably in wrap-up mode right now, but Chris, Shar, I'm curious if you've got any final takes on this before we close up.

I'll give one take, and it's probably going to be on the opposite side — we had this chat before, Tim — which is that I think there's going to be a whole business in data washing. Because if you really look at this: yes, there will be some folks like the Financial Times who license their data, and that's great. But if I take five news articles on the same subject, run them through a model, get them summarized, and then maybe open-source that data set — and then somebody else who's training a model goes and uses that open-source data set — where that data originally came from is gone. As far as the model trainer is concerned, it's "oh no, I used this open-source data set, which is MIT-licensed, etc., that I pulled off the internet." You're now one step removed from the original content sources. So I think that's going to become a big thing, and I see people doing it commercially as well. As much as we're all good people and we want the provenance to travel with the data, I just don't see a world where everything is so lovely and we're all
high-fiving each other about how good we are. I see this data-washing world coming really, really quickly.

Yeah, for sure. I have a slightly different take on this: I think it is quite dangerous. I'm a big proponent of decentralized AI — I'm in Emad's camp, right, who left Stability AI with this mission of decentralizing AI. The fact that OpenAI is making a decision to partner with one news outlet — if they'd picked Fox, or they'd picked CNN, they would have a different set of news articles being trained on. There's quite a bit of bias in the media itself; if you look at the elections coming up and whatnot, what decisions are you making about which data you think is high quality? It's a single entity's definition of high-quality data, whereas human beings are exposed to both ends of the spectrum of news articles. So I think decentralizing is going to be very important, and this also speaks to open models, where people can add data on the fly and personalize them more. I think it's a little bit dangerous when large organizations that have become centers of power through AI are making unilateral decisions about what good looks like. The news articles the FT produces may not be representative of the culture or the demographics of a particular country or region. So there's a real question of: if I'm using a gen AI model from a particular vendor, do I get to personalize it and say, hey, I'm left-wing, I'm right-wing, or these are my thoughts on particular topics, so that it gets more and more personalized
to me, and that defines how it responds and becomes my personalized version of it?

I wonder, Shar, whether the training is actually the important part of the OpenAI deal. I wonder if it's really going to be RAG over that data that is the key thing for them, because you're getting the up-to-date article. As you move into those agentic platforms, training the model is not going to be that valuable, but actually being able to serve up "this is the latest news from the Financial Times, and it's valid, and it's not hallucinating" — I think that's probably valuable.

Chris, I think it's just like kids, right? It's nature and nurture, both of them. I think that's where we're heading: what was the nature of the kid that was born, and how were they nurtured? But thanks so much, Tim — this was extremely helpful, a great set of questions.

Yeah, absolutely. Well, thanks everybody — I could not have asked for a better panel for our inaugural episode. So Chris, Kush, Shar, thanks for joining us, and we hope to have you on a future episode.

Thanks, everyone.

[Music]
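The tailored, per-query-type RAG evaluation the panel describes — score retrieval, grounding, and answer quality separately for each kind of question, then drop any new model into the same pipeline — can be sketched as below. This is a minimal illustration, not the panel's actual tooling; all class, metric, and query-type names are assumptions, and the token-overlap grounding check stands in for what would really be an LLM judge.

```python
# Sketch of a per-query-type RAG evaluation harness, in the spirit of the
# tailored enterprise benchmarks described in the episode. All names and
# metric definitions here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    retrieved_snippets: list   # snippet texts the retriever returned
    relevant_snippets: set     # ground-truth snippets that should be retrieved
    answer: str                # model's generated answer
    reference_answer: str      # expected answer
    query_type: str            # e.g. "lookup" vs "contrast"


def retrieval_recall(case):
    """Did we pull the right content? Fraction of relevant snippets retrieved."""
    if not case.relevant_snippets:
        return 1.0
    hits = case.relevant_snippets & set(case.retrieved_snippets)
    return len(hits) / len(case.relevant_snippets)


def groundedness(case):
    """Crude proxy for a 'grounded score': share of answer tokens that appear
    in the retrieved snippets. A production system would use an LLM judge or
    an NLI model here instead of token overlap."""
    snippet_words = {w.lower() for s in case.retrieved_snippets for w in s.split()}
    answer_words = [w.lower() for w in case.answer.split()]
    if not answer_words:
        return 0.0
    return sum(w in snippet_words for w in answer_words) / len(answer_words)


# Each query type gets its own metric set, as in the contracts example: a
# simple lookup is scored on retrieval, while a contrast query is not a plain
# retrieval problem and is scored on grounding alone.
METRICS_BY_QUERY_TYPE = {
    "lookup": [retrieval_recall, groundedness],
    "contrast": [groundedness],
}


def evaluate(cases):
    """Average each metric per query type, so a new model dropped into the
    pipeline is compared step by step rather than on public leaderboards."""
    buckets = {}
    for case in cases:
        for metric in METRICS_BY_QUERY_TYPE[case.query_type]:
            key = (case.query_type, metric.__name__)
            buckets.setdefault(key, []).append(metric(case))
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}
```

Swapping in a new model (Llama 3, Snowflake Arctic, etc.) then just means regenerating the answers and re-running `evaluate` over the same tailored cases.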
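Kush's three-at-a-time ranking idea — stronger models can reliably spot the novice, so triplet judgments yield a total ordering — can be toy-modeled as follows. The judging rule and skill labels are invented for illustration; a real setup would use actual model outputs and an LLM judge.

```python
# Toy sketch of the triplet-judging idea from the episode: each model judges
# the pairs it is not part of, and tallied wins give a total ranking.
from itertools import permutations
from collections import Counter


def rank_by_triplets(models, judge):
    """models: list of model names; judge(j, a, b) -> name of the winner of
    a vs b as decided by judge j. Each model judges each unordered pair of
    the others once; models are ranked by total wins received."""
    wins = Counter({m: 0 for m in models})
    for j, a, b in permutations(models, 3):
        if a < b:  # count each unordered pair once per judge
            wins[judge(j, a, b)] += 1
    return [m for m, _ in wins.most_common()]
```

With three models this gives each model one pair to judge, exactly the expert/intermediate/novice scenario described; with more models it generalizes to a round-robin tournament.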