Genie AI Beats Bench for Bug Fixing

10m • ai-ml • news • intermediate

Key Points

The AI‑coding assistant “Genie,” built by Cosine, recently topped the SWE‑bench leaderboard with a score of about 30% on bug‑fixing tasks, roughly twice the 13% that Devin reported a few months earlier.
Genie’s edge comes from a heavy emphasis on structured reasoning—encoding planning, code‑location, and architectural logic outside the LLM rather than relying on the model to “throw code at the wall.”
The system combines retrieval‑augmented generation (RAG) with context‑aware “in‑place edits,” allowing it to modify existing code while respecting the project’s style and architecture before performing an agentic validation step.
The speaker also teases a second new AI product aimed at children, which will be covered after the developer‑focused discussion.
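The edit-then-validate loop described in the points above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Genie's actual implementation, which has not been published. The names `fix_in_place`, `toy_propose_edit`, and `toy_validate` are hypothetical, and the "model" here is a hard-coded stand-in that guesses wrong once and then uses the validator's feedback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class FixResult:
    source: str   # final text of the file after the last edit
    passed: bool  # whether validation succeeded
    tries: int    # how many edit attempts were made


def fix_in_place(
    source: str,
    propose_edit: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Tuple[bool, Optional[str]]],
    max_tries: int = 3,
) -> FixResult:
    """Edit the existing source, validate it, and retry with feedback on failure."""
    current = source
    feedback: Optional[str] = None
    for attempt in range(1, max_tries + 1):
        # Modify the existing code rather than regenerating the whole file.
        current = propose_edit(current, feedback)
        # Independently check the edit; feedback describes any failure.
        ok, feedback = validate(current)
        if ok:
            return FixResult(current, True, attempt)
    return FixResult(current, False, max_tries)


# Hypothetical buggy file: add() subtracts instead of adding.
BUGGY = "def add(a, b):\n    return a - b\n"


def toy_propose_edit(source: str, feedback: Optional[str]) -> str:
    # Stand-in for a model call: the first guess is wrong, the second is right.
    if feedback is None:
        return source.replace("a - b", "a * b")
    return source.replace("a * b", "a + b")


def toy_validate(source: str) -> Tuple[bool, Optional[str]]:
    # Stand-in for running the project's tests against the edited code.
    namespace: dict = {}
    exec(source, namespace)
    got = namespace["add"](2, 3)
    if got == 5:
        return True, None
    return False, f"add(2, 3) returned {got}, expected 5"


result = fix_in_place(BUGGY, toy_propose_edit, toy_validate)
print(result.passed, result.tries)  # prints: True 2
```

The point of the design is that validation sits outside the model: the loop, not the LLM, decides when the fix is done, and each failed check is fed back into the next edit.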

Sections

00:00:00 Genie AI Tops Bug‑Fix Benchmark - The speaker highlights Genie, a new LLM tool from Cosine that now leads the SWE‑bench bug‑fixing leaderboard, outperforming previous solutions by emphasizing advanced reasoning for software debugging.
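A leaderboard score of this kind is a resolve rate: the share of benchmark issues the system fixes. The short sketch below shows the arithmetic behind "roughly twice as good". The issue counts are hypothetical, chosen only to match the rounded percentages quoted in the video (13% and 30%), and `resolve_rate` is an illustrative helper, not part of any benchmark tooling.

```python
def resolve_rate(resolved: int, total: int) -> float:
    """Fraction of benchmark issues that were fixed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return resolved / total


# Hypothetical counts matching the rounded scores quoted in the video.
devin = resolve_rate(13, 100)
genie = resolve_rate(30, 100)

ratio = genie / devin
print(f"{ratio:.1f}x")  # prints: 2.3x
```

With these rounded figures the ratio is a little over 2, which is consistent with the speaker's "roughly twice as good".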

Full Transcript

**Source:** [https://www.youtube.com/watch?v=VxzQiJ8Ngkk](https://www.youtube.com/watch?v=VxzQiJ8Ngkk)
**Duration:** 00:10:07

0:00 I like to stay up to date on AI in part by looking at where new products are pushing the edges of LLM capabilities, and I saw two products this week that I want to call out. One is for software developers and one is for kids. We're going to cover both of them.

0:13 So the one for software developers first. It's called Genie. It's developed by a company called Cosine. It was just announced, I think today the 12th or yesterday the 11th, that Genie is now on the leaderboard as the number one performer for a set of tasks that test LLM capabilities at solving bugs, known as SWE-bench.

0:38 If you're not familiar with it, you basically give the LLM a set of GitHub bugs to resolve, it goes and fixes them, and you measure how well it does. And it's much better than the much-ballyhooed solutions that we have seen previously. Devin came out like three, four, five months ago, I forget, and they were celebrating because they got a score of 13% on resolving novel GitHub bugs. And now the bar is 30%, because that's what Genie got. So Genie is roughly twice as good as Devin.

1:16 And the reason why is kind of why I brought this up: they are focusing specifically on pushing the edges of LLMs around reasoning, and that's been one of the big gaps that has held back LLM performance in code environments in the past, or at least made it less effective.

1:35 So one of the things that LLM agents will sometimes do if they're trying to solve a software problem is they just sort of throw code at the wall until it works. I've even seen this when I've played with Claude and I've had Claude write JavaScript for me. If I tell it what the bug is, I have to actually give it a fair bit of direction before it starts to adjust the code the way I want it to adjust.

2:01 Cosine and the Genie product seem to be different because they've worked really, really hard on starting to encode reasoning outside the LLM itself. So the LLM can handle the language piece, but there's a lot of training on the back end to make sure that the approach is structured the way an engineer would structure it. They emphasize locating code in the correct place in your existing code structure, so there's a deep integration component. They emphasize planning, and structured planning, and that's where the logic comes in. They said they use RAG in there. They're kind of cagey about what else is in there. It's not just RAG; it's very clear there's something else.

2:37 They talk about in-place edits as another thing. One of the things I noticed is that in-place edits require a degree of contextual awareness that human engineers are typically much better at, because they understand the code base. So they're working on resolving that by making sure that you can do an in-place edit that's contextually aware of the style of your code and the architectural principles, and then come back and do the final step, which is agentic validation. So they go back and independently validate and test the in-place edits they've made, and then if it doesn't work they go and rewrite the code in place until it does, and that's the final result.

3:15 So they are cagey. This is not something where they're going to publish enough of their methodology that you can replicate it. They are deliberately not being open source. They view being closed source as a competitive advantage, which of course they can.

3:29 And I think one of the things I'm keeping an eye on is the overall direction of the market as a product like this comes online, because I've noticed in the past that when something like this hits the market, you wait two months, three months, and someone else is going to beat them, and they're going to take the same approach and push it harder.

3:49 One of the hardest things right now is competitive value. How do you have durable value? When we were building the railroad, durable value was literally laying track to a village. You sort of owned the track and you owned access to the village, and the trains could go there, and you could harvest rents off of the value that you had put down.

4:12 It's not that clear when you're building an LLM, because LLMs are expensive. You have to train them. Now you have to have this agentic reasoning piece that comes behind it, all of the thinking that goes into how you design a system that doesn't just generate tokens but actually plans and reasons. And at the end of the day, the bar is moving so fast in this market, and so much money is being poured into this intelligence development initiative from the industry as a whole, that you will be overtaken within 90 days. It's a safe bet. It's happened just about every 90-day period: whatever the bar is, it gets raised about every three months.

4:48 And yet you've put so much into that. You jumped on the treadmill, you've decided to raise your flag and say you're a company that focuses on enabling developers, and now you're having to just pump out better and better solutions. From a developer perspective, from a consumer perspective, from a working perspective, that's fantastic. Intelligence is getting cheaper and more available. But if you're trying to build a company around LLM intelligence, it is not at all clear right now how you monetize this kind of development.

5:15 So the two takeaways I have are, first, that the technical side is really interesting. I love that they're working on the reasoning. I think we're going to find out a whole lot more about how they did it in the next 90 to 120 days, because someone else is going to do it and the details are going to start to leak. These solutions are leaking out almost faster than we can track. So I would expect we're going to have a real solid sense of how to get reasoning to solve at a 30% benchmark quite quickly, even if they choose not to release how they did it.

5:41 And then the other thing is that the competitive pressure means it's hard to build moats in this economy. If you are building in AI, you have to think really carefully about your distribution advantage and building moats.

5:53 And that brings me to the second development, which is in the kids department, and I think this is a good example of distribution advantage, or at least implied distribution advantage. This is a company called Magical Toys. They produce something called Dino. I'm going to link these at the bottom of the YouTube.

6:11 The thing that's interesting is that this is an embodied LLM. I'm not saying the LLM can sense its surroundings the way a human can, but I'm saying that they've placed the LLM inside a plushie, or a stuffed animal, so that a kid can interact with it. And the idea is: kids have too much screen time, so let's make sure the child is talking, developing verbal skills, and interacting with a creature on a daily basis. And oh, the creature happens to be a plushie with an LLM capability. Mom and Dad can monitor the chats and kind of keep track, and the Dino will use positive emotional reinforcement and be supportive and emotionally empathetic and all of that.

6:51 I don't doubt the ability to implement an LLM with that kind of emotional parameter, right? One that is safe to talk to kids, that's going to be emotionally supportive of the child's feelings, that's going to be able to help the child learn verbal skills. We are headed to a world where children and the elderly are definitely going to be interacting with robots as a part of the human development process, and whether or not you and I are comfortable with it, it's happening.

7:22 I think the piece that is interesting to me is that they are getting the distribution advantage in the space by literally pushing the product into people's homes. And then they have it, and they're going to get another one when the next kid comes along, and they're going to upgrade it when the company releases another one. There's a loyalty because Dino is a member of the family. That's a really interesting play for distribution advantage in what is effectively a commoditized LLM space. So they're looking to differentiate it by making it more human. They're looking to differentiate it by making it more of the family, and that's what they actually say on the site: "Welcome Dino into your family."

8:03 We are going to see more plays like that, and that lines up with the whole monetizable movement around AI companions. One of the few areas where LLMs have monetized really successfully is as talking companions for people.

8:16 And I think one of the things that we haven't yet figured out how we as a species feel about is: is it more human or less human? Is it positive for us to practice this human language skill with something that is effectively a robot that we trained on human language, and that's really good at it? I think net-net it will probably make us better at language. I would buy that spending two hours a day with Dino is better for a child's language development than two hours a day with the iPad. I think that's pretty logical. The interactivity is going to reinforce learning.

8:53 And I still think that we are going to see people who take that learning and say, well, where are the other children playing with the child, who would normally reinforce language development? So I think we're going to get questions, as these companies come into our lives, around how they work not just with individuals, not just with families, but with communities. How does Dino work when the child has a playgroup? Does Dino pull the child away from the playgroup or not?

9:23 So I will be curious to see how that unfolds. I think from a pure distribution perspective they have a distribution advantage by personalizing the product, and I think we're going to see more of that. I also think they have an advantage by playing in the emotional space, which will help them be not a commodity for the families that purchase them. And I am beginning to wonder if that kind of play is something that will help LLMs break through on the consumer side.

9:48 So, two things I'm watching. One is on the developer side, definitely nerdy, looking at agentic reasoning. The other is on the consumer side, looking at stuffed animals and how childhood development is aided by interactive LLM-driven dinosaurs. Have a great one.