Genie AI Beats Bench for Bug Fixing

Key Points

  • The AI‑coding assistant “Genie,” built by Cosine, recently topped the SWE‑bench leaderboard, outperforming the previous leader Devin by roughly 2× on bug‑fixing tasks.
  • Genie’s edge comes from a heavy emphasis on structured reasoning—encoding planning, code‑location, and architectural logic outside the LLM rather than relying on the model to “throw code at the wall.”
  • The system combines retrieval‑augmented generation (RAG) with context‑aware “in‑place edits,” allowing it to modify existing code while respecting the project’s style and architecture before performing an agentic validation step.
  • The speaker also teases a second new AI product aimed at children, which will be covered after the developer‑focused discussion.
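
The edit-then-validate loop described in the points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Genie's actual implementation, which has not been published. The names `Edit`, `apply_in_place`, `fix_with_validation`, `stub_propose`, and `stub_validate` are all hypothetical; the stub proposer stands in for a model call, and the retrieval and planning steps are left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Files = Dict[str, str]  # hypothetical in-memory "repository": path -> source text


@dataclass
class Edit:
    """Hypothetical in-place edit: swap one snippet for another in one file."""
    path: str
    old: str
    new: str


def apply_in_place(files: Files, edit: Edit) -> Files:
    """Return a patched copy of `files`; the edit must match exactly one spot."""
    source = files[edit.path]
    if source.count(edit.old) != 1:
        raise ValueError(f"edit must match exactly once in {edit.path}")
    patched = dict(files)
    patched[edit.path] = source.replace(edit.old, edit.new)
    return patched


def fix_with_validation(
    files: Files,
    propose_edit: Callable[[Files, str], Edit],
    validate: Callable[[Files], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Optional[Files]:
    """Propose an edit, validate it, and retry with feedback until it passes."""
    feedback = ""
    for _ in range(max_attempts):
        edit = propose_edit(files, feedback)
        try:
            candidate = apply_in_place(files, edit)
        except (KeyError, ValueError) as exc:
            feedback = str(exc)  # the edit did not fit the existing code
            continue
        ok, feedback = validate(candidate)
        if ok:
            return candidate
    return None  # give up after max_attempts


def stub_propose(files: Files, feedback: str) -> Edit:
    """Stand-in for a model call: wrong on the first try, right after feedback."""
    new = "return a + b" if feedback else "return a * b"
    return Edit("calc.py", "return a - b", new)


def stub_validate(files: Files) -> Tuple[bool, str]:
    """Stand-in for a test run: execute the module and check one behaviour."""
    namespace: dict = {}
    exec(files["calc.py"], namespace)
    result = namespace["add"](2, 3)
    return (result == 5, f"add(2, 3) returned {result}, expected 5")


repo = {"calc.py": "def add(a, b):\n    return a - b\n"}
fixed = fix_with_validation(repo, stub_propose, stub_validate)
print(fixed["calc.py"] if fixed else "no fix found")
```

Each attempt is applied to a copy of the original files, so a failed candidate never leaks into the next try. The validator's message is passed back to the proposer, which is one simple way to avoid blindly throwing code at the wall.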

**Source:** [https://www.youtube.com/watch?v=VxzQiJ8Ngkk](https://www.youtube.com/watch?v=VxzQiJ8Ngkk)
**Duration:** 00:10:07

## Sections

- [00:00:00](https://www.youtube.com/watch?v=VxzQiJ8Ngkk&t=0s) **Genie AI Tops Bug‑Fix Benchmark** - The speaker highlights Genie, a new LLM tool from Cosine that now leads the SWE‑bench bug‑fixing leaderboard, outperforming previous solutions by emphasizing advanced reasoning for software debugging.

## Full Transcript

0:00 I like to stay up to date on AI in part by looking at where new products are pushing the edges of LLM capabilities, and I saw two products this week that I want to call out. One is for software developers and one is for kids. We're going to cover both of them.

0:13 So, the one for software developers first. It's called Genie, and it's developed by a company called Cosine. It was just announced, I think today, the 12th, or yesterday, the 11th, that Genie is now the number one performer on the leaderboard for a set of tasks that test LLM capabilities at solving bugs, known as SWE-bench. If you're not familiar with it, you basically give the LLM a set of GitHub bugs to resolve, it goes and fixes them, and you measure how well it does.

0:50 And it's much better than much-ballyhooed solutions that we have seen previously. Devin came out three, four, five months ago, I forget, and they were celebrating because they got a score of 13% on resolving novel GitHub bugs. Now the bar is 30%, because that's what Genie got. So Genie is roughly twice as good as Devin.

1:16 The reason why is kind of why I brought this up. They are focusing specifically on pushing the edges of LLMs around reasoning, and that's been one of the big gaps that has prevented LLM performance in code environments in the past, or at least mitigated it or made it less effective. One of the things that LLM agents will sometimes do if they're trying to solve a software problem is just sort of throw code at the wall until it works. I've even seen this when I've played with Claude and had Claude write JavaScript for me: if I tell it what the bug is, I have to actually give it a fair bit of direction before it starts to adjust the code the way I
want it to adjust.

2:01 Cosine and the Genie product seem to be different because they've worked really, really hard on starting to encode reasoning outside the LLM itself. The LLM can handle the language piece, but there's a lot of training on the back end to make sure that the approach is structured the way an engineer would structure it. They emphasize locating code in the correct place in your existing code structure, so there's a deep integration component. They emphasize planning, and structured planning; that's where the logic comes in. They said they use RAG in there. They're kind of cagey about what else is in there; it's not just RAG, it's very clear there's something else.

2:37 They talk about in-place edits as another thing. One of the things I noticed is that in-place edits require a degree of contextual awareness that human engineers are typically much better at, because they understand the code base. They're working on resolving that by making sure that you can do an in-place edit that's contextually aware of the style of your code and the architectural principles, and then come back and do the final step, which is agentic validation. They go back and independently validate and test the in-place edits they've made, and if it doesn't work, they go and rewrite the code in place until it does, and that's the final result.

3:15 They are cagey. This is not something where they're going to publish enough of their methodology that you can replicate it. They are deliberately not being open source; they view being closed source as a competitive advantage, which of course they can. One of the things I'm keeping an eye on is the overall direction of the market as a product like this comes online, because I've noticed in the past when something
like this hits the market, you wait two months, three months, and someone else is going to beat them, and they're going to take the same approach and push it harder.

3:49 One of the hardest things right now is competitive value. How do you have durable value? When we were building the railroads, durable value was literally laying track to a village: you owned the track and you owned access to the village, the trains could go there, and you could harvest rents off of the value that you had put down. It's not that clear when you are building an LLM, because LLMs are expensive. You have to train them, and now you have to have this agentic reasoning piece that comes behind it, all of the thinking that goes into how you design a system that doesn't just generate tokens but actually plans and reasons.

4:25 At the end of the day, the bar is moving so fast in this market, and so much money is being poured into this intelligence development initiative from the industry as a whole, that you will be overtaken within 90 days. It's a safe bet. It's happened just about every 90-day period: whatever the bar is, it gets raised about every three months. And yet you've put so much into that. You've jumped on the treadmill, you've decided to raise your flag and say you're a company that focuses on enabling developers, and now you're just having to pump out better and better solutions.

5:01 From a developer perspective, from a consumer perspective, from a working perspective, that's fantastic. Intelligence is getting cheaper and more available. But if you're trying to build a company around LLM intelligence, it is not at all clear right now how you monetize this kind of development.

5:15 So the two takeaways I have are that the technical side is really interesting. I love that they're working on
the reasoning. I think we're going to find out a whole lot more about how they did it in the next 90 to 120 days, because someone else is going to do it and the details are going to start to leak. These solutions are leaking out almost faster than we can track. So I would expect we're going to have a real solid sense of how to get reasoning to solve at a 30% benchmark quite quickly, even if they choose not to release how they did it.

5:41 The other thing is that the competitive pressure means it's hard to build moats in this economy. If you are building in AI, you have to think really carefully about your distribution advantage and building moats.

5:53 That brings me to the second development, which is in the kids' department, and I think this is a good example of distribution advantage, or at least implied distribution advantage. This is a company called Magical Toys. They produce something called Dino. I'm going to link these at the bottom of the YouTube video. The thing that's interesting is that this is an embodied LLM. I'm not saying the LLM can sense its surroundings the way a human can, but I'm saying that they've placed the LLM inside a plushie, a stuffed animal, so that a kid can interact with it.

6:26 The idea is that kids have too much screen time, so let's make sure the child is talking, developing verbal skills, and interacting with a creature on a daily basis. And, oh, the creature happens to be a plushie with an LLM capability. Mom and Dad can monitor the chats and keep track, and the Dino will use positive emotional reinforcement and be supportive and emotionally empathetic and all of that.

6:51 I don't doubt the ability to implement an LLM with that kind of emotional parameter, right? That it is safe to talk to kids, that it's
going to be emotionally supportive of the child's feelings, that it's going to be able to help the child learn verbal skills. We are headed to a world where children and the elderly are definitely going to be interacting with robots as a part of the human development process, and whether or not you and I are comfortable with it, it's happening.

7:22 The piece that is interesting to me is that they are getting the distribution advantage in the space by literally pushing the product into people's homes. Then they have it, and they're going to get another one when the next kid comes along, and they're going to upgrade it when the company releases another one. There's a loyalty, because Dino is a member of the family. That's a really interesting play for distribution advantage in what is effectively a commoditized LLM space. They're looking to differentiate it by making it more human, by making it more part of the family, and that's what they actually say on the site: "Welcome Dino into your family."

8:03 We are going to see more plays like that, and that lines up with the whole monetizable movement around AI companions. One of the few areas where LLMs have monetized really successfully is as talking companions for people.

8:16 One of the things that we haven't yet figured out how we as a species feel about is: is it more human or less human? Is it positive for us to practice this human language skill with something that is effectively a robot that we trained on human language, and that's really good at it? I think, net net, it will probably make us better at language. I would buy that spending two hours a day with Dino is better for a child's language development than two hours a day with the iPad. I think that's pretty
logical; the interactivity is going to reinforce learning. And I still think that we are going to see people who take that learning and say, "Well, where are the other children playing with the child, who would normally reinforce language development?" So I think we're going to get questions, as these companies come into our lives, around how they work not just with individuals, not just with families, but with communities. How does Dino work when the child has a playgroup? Does Dino pull the child away from the playgroup or not? I will be curious to see how that unfolds.

9:26 From a pure distribution perspective, they have a distribution advantage by personalizing the product, and I think we're going to see more of that. I also think they have an advantage by playing in the emotional space that will help them be not a commodity for the families that purchase them. I am beginning to wonder if that kind of play is something that will help LLMs break through on the consumer side.

9:48 So, two things I'm watching. One is on the developer side, definitely nerdy, looking at agentic reasoning, and the other is on the consumer side, looking at stuffed animals and how childhood development is aided by interactive, LLM-driven dinosaurs. Have a great one.
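
As a footnote to the benchmark discussion earlier in the talk, the scoring idea (give the model a set of GitHub bugs, count how many it resolves) and the "roughly twice as good" comparison can be worked through with a small sketch. The function names and the sample outcomes below are made up for illustration, and the 13% and 30% figures are the approximate scores as quoted by the speaker, not official leaderboard values.

```python
from typing import Sequence


def resolve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of benchmark issues whose proposed fix passed its tests."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def relative_improvement(new_rate: float, old_rate: float) -> float:
    """How many times higher the new resolve rate is than the old one."""
    if old_rate <= 0:
        raise ValueError("old_rate must be positive")
    return new_rate / old_rate


# Made-up per-issue results: True means the issue was resolved.
sample_outcomes = [True, False, False, True, False,
                   False, True, False, False, False]
rate = resolve_rate(sample_outcomes)

# Approximate scores as quoted in the talk (assumed, not official numbers).
ratio = relative_improvement(0.30, 0.13)

print(f"sample resolve rate: {rate:.0%}")
print(f"improvement factor: {ratio:.1f}x")
```

With the quoted figures the factor comes out a little above two, which is consistent with the speaker's rounding.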