Learning Library

← Back to Library

Beyond Chatbots: Tools for LLM Gaps

Key Points

  • We rely on chatbots by default because the AI landscape is flooded with thousands of tools, and developers keep them “sticky” (e.g., adding memory) to capture our attention.
  • Large language models still have six core structural limitations—such as weak spatial reasoning and poor spreadsheet context handling—that prevent them from fully replacing specialized tools.
  • These gaps exist not because they’re impossible to solve, but because model builders are preoccupied with broader challenges like scaling GPU resources to serve massive user bases.
  • Examples of the limitations include LLMs producing sub‑par design work (e.g., messy PowerPoint slides) and struggling to interpret the multidimensional relationships in complex spreadsheets.
  • The video proposes 12 complementary tools (two per identified gap) to help users fill those shortcomings and work more efficiently.

Full Transcript

# Beyond Chatbots: Tools for LLM Gaps

**Source:** [https://www.youtube.com/watch?v=kN1h33Fbiio](https://www.youtube.com/watch?v=kN1h33Fbiio)
**Duration:** 00:17:56

## Sections

- [00:00:00](https://www.youtube.com/watch?v=kN1h33Fbiio&t=0s) **The Limits of Chatbot Dominance** - The speaker explains why users default to chatbots amid a flood of AI tools, outlines six structural limitations of large language models, and urges exploring specialized alternatives beyond chat interfaces.
- [00:04:18](https://www.youtube.com/watch?v=kN1h33Fbiio&t=258s) **Limits of LLMs in Production** - The speaker outlines how current LLMs lack safe code execution, operational visibility, and the ability to craft coherent visual narratives, exposing critical gaps for AI integration in production pipelines.
- [00:07:58](https://www.youtube.com/watch?v=kN1h33Fbiio&t=478s) **Tool Comparisons: Mockups & Spreadsheets** - The speaker differentiates Visily's fast wireframing from code-generating tools, then contrasts Shortcut AI's advanced spreadsheet creation capabilities with the more widely available Numerous AI for embedding AI into existing sheets.
- [00:12:34](https://www.youtube.com/watch?v=kN1h33Fbiio&t=754s) **Chronicle Aims to Replace PowerPoint** - The speaker argues that Chronicle, now in public beta, offers pixel-perfect, interactive presentations that outclass PowerPoint and rival tools like Gamma, positioning it as a professional, keyboard-first workflow for rapid, high-quality storytelling.
- [00:15:47](https://www.youtube.com/watch?v=kN1h33Fbiio&t=947s) **Choosing the Right AI Tool** - The speaker advises listeners to pinpoint their biggest workflow bottlenecks or shared team pain points, then match those specific needs to targeted AI solutions, like Whisper Flow's app-level voice interface or Nata's high-accuracy transcription, rather than over-relying on generic chatbot workflows.

## Full Transcript
We all use our chatbots too much. We do. We all use our chatbots because that is the default thing to do. And it doesn't help that there are a hundred thousand other tools. I'm not actually making that number up; that's roughly the total number of AI tools out there. It's just too many. How do we figure out which one to use? So we end up defaulting back into the chatbot space. And the model makers know that our mind share is allocated there, so they continue to invest in making those experiences more sticky. That's why ChatGPT has memory, for instance. It's a stickiness feature. When you think about it that way, it makes sense that we would periodically poke our noses out beyond chatbot land, actually look around the landscape, and ask: what other tools fill gaps that LLMs are inherently struggling to close?

So first in this video, I'm going to lay out some of those gaps where LLMs may not be the best in the world. Not because it's impossible for the model makers to close the gaps, but because the model makers are preoccupied with larger, what I would call generic, problem sets, like frankly finding enough GPUs to serve their models to all the people who want them. And that is also not something I made up; it's very well documented as a prime concern for both Anthropic and OpenAI right now. So what are the six structural limitations of LLMs that are being partially compensated for inside chatbots right now, where specialized tools might help us get farther and do our work more effectively? And then from there we'll get into 12 tools, two for each of the structural gaps, that you can survey. My goal here isn't to convince you to use these tools.
It's to help you get a sense of how to think about structural gaps in the chatbot experience, and then to understand what tools might be useful for closing the structural gaps that matter to you.

Gap number one: spatial reasoning. Yes, LLMs are absolutely getting better at this. I'm still impressed that o3 can produce 3D graphs. But fundamentally, if you are trying to get to design, LLMs are not phenomenal designers. I have yet to see an LLM do a great job at that. My favorite anecdote here is agent mode, where with great effort ChatGPT taught an AI agent to make a PowerPoint. The results are less than stellar, to put it very kindly. The text tends to run over and off the slide. It tends to be poorly organized on the slide. It doesn't work well with the visuals; the visuals feel slapped on. I know interns that could do a vastly better job.

Gap number two: spreadsheet context. We have an issue with spreadsheets because spreadsheets have orthogonal meaning. In other words, they have relationships horizontally and vertically, and in complex spreadsheets there are relationships between tabs, relationships between special columns and rows and the regular columns and rows of data, and there are formulas. It is really, really challenging for LLMs that are designed for next-token prediction to master spreadsheets. Again, we see some progress. I'll go back to agent mode: it can make a spreadsheet, and it can make a spreadsheet with a simple formula. But it cannot process your existing spreadsheet well, and it can't build a fully complex spreadsheet yet. I have tried it. Eh, it's okay.
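The transcript's point about CSVs being "more friendly to tokens" than Excel files can be sketched concretely: one workaround is to flatten each tab of a workbook into labeled CSV text before pasting it into a prompt. This is a minimal illustration of that idea, not part of any tool mentioned here; the `flatten_for_prompt` helper and the sample workbook are my own hypothetical names.

```python
import csv
import io

def flatten_for_prompt(workbook: dict[str, list[list]]) -> str:
    """Render each tab of a workbook as labeled CSV text.

    `workbook` maps tab names to rows of cell values. Token-based
    models tend to handle this flat text far more reliably than
    .xlsx binaries, though cross-tab formulas, formatting, and the
    "orthogonal" relationships the transcript describes are lost.
    """
    sections = []
    for tab, rows in workbook.items():
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)  # comma-delimited, one row per line
        sections.append(f"## Tab: {tab}\n{buf.getvalue().strip()}")
    return "\n\n".join(sections)

# Example: a two-tab workbook with a header row per tab.
wb = {
    "Revenue": [["month", "amount"], ["Jan", 1200], ["Feb", 1350]],
    "Costs":   [["month", "amount"], ["Jan", 800],  ["Feb", 950]],
}
print(flatten_for_prompt(wb))
```

Even with this flattening, the model still has to infer the relationships between tabs from the labels alone, which is exactly the gap the purpose-built spreadsheet tools below try to close.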
When you ask other LLMs, like Claude Opus 4, ChatGPT o3, or Gemini 2.5 Pro, they range from insisting on CSVs, which are comma-delimited and therefore more friendly to tokens, to trying to ingest and process Excel files and still struggling: still struggling if they're large, still struggling to read all the detail. I've uploaded 40- or 50-row Excel spreadsheets and found that even at that scale, which anyone who uses Excel will know is tiny, they can still sometimes struggle to list every row. They just can't seem to read all of the data. So spreadsheets are a problem.

Code execution also remains a challenge. Fundamentally, none of the LLMs were constructed with the idea of being code execution environments, and I don't anticipate them becoming code execution environments anytime soon. For those of you who are not coders, that means running the code. The fact that Claude can spin up a little React component and you can kind of run a little applet inside a preview window is about the best it gets right now. And that's still very, very minor: it's not really a full code execution environment, and certainly not something you would want to put into production. That may seem obvious, and you may think it isn't even an AI-related thing. But increasingly, because prompts, AI-generated code, and LLMs themselves are integrated into our production pipelines for software, we do need software that has code execution and AI capabilities.

Another gap: operational visibility. Again, why would you expect this? LLMs are not built to give you any kind of operational visibility on your AI software in production. They're just not. No big surprise there.

Next, narrative structure is a huge problem for AI.
And this is one that I don't think gets talked about a lot. Text versus experience is very difficult for LLMs to convey. They often respond with various versions of text, because they can output text, but they can't think through the visual hierarchy. They sometimes have trouble thinking through the structure of the story in a way that's accessible. This is an area where I would expect breakthroughs like GPT-5 to be helpful, but I still think there's going to be a complex interplay between the structure of a narrative and the way a narrative is visually presented that will be hard for traditional LLMs to master. It's just not something that is easy to do unless it's your sole focus, and even then it's quite difficult.

One more, last but not least: voice processing. ChatGPT famously launched meeting notes recently. I have used them. They are only okay. They don't give you live transcriptions, they give you only one generic summary, and you can't really access the transcript. It's very much a bolt-on feature. And that is exactly what I would expect from a team that is fundamentally resource constrained and trying to ship a lot of things to a user base of 800 million or more. They cannot do everything perfectly. And therein lies the opportunity for builders like the 12 tools we're going to outline here. Again, these are not the best 12 tools ever, but I think they are great answers for the six gaps I've called out: voice processing, narrative structure, operational visibility, code execution, spreadsheet context, and spatial reasoning. Those are not the only gaps, but I thought they were really illustrative of the kinds of gaps that LLMs have, and these 12 tools do a good job hitting them. So look at these.
Think about the strengths. Think about the weaknesses I'll call out. And think about where your workflow doesn't work well with a chat.

Tool number one, in interface builders: Magic Patterns. I've had people coming to me with Magic Patterns. I've not been the one sharing it out; people have come to me and shown me Magic Patterns because they like it so much. Fundamentally, it makes it extremely easy to extract a design out of a screenshot or something else, turn it into working components, and get back a working piece of style-compliant front-end code that illustrates a vision for a new interface. Which is a complicated way of saying it is really easy now to copy the style off a website, change it, and show your engineers. And that is something every single marketer, PM, program manager, and anyone else, CS included, who has an idea for something that should be different about the tool or the app or the website has wished for. We have all wished for it to be magically easy to say, here's my sketch, here's my concept, and have it magically come back in the right style. Now that's as simple as taking a screenshot and throwing it into Magic Patterns. A specialized tool closes a specific gap in an LLM. Is it perfect? No. It's not designed for full app building. But does it give you a quick sketch sense? Is it designed for exactly what it does? Yeah, it is.

Visily is another option there. It is a little cheaper than Magic Patterns, and it focuses on rapid mockup creation rather than code generation. So if you need the code components, don't go with Visily. If you just need the quick mockup, wireframing can be much faster with Visily. So again, both are in the interface category.
They do slightly different things, so I want to lay them out as distinct. My goal here isn't to make these competitors, but to help you understand how each tool is attacking a particular gap that a chatbot has.

All right, let's move on to the second one: spreadsheet intelligence. What do we have? Shortcut AI is exploding. It's in early access, so you may not be able to get it. It is definitely the best I've seen at tackling complex spreadsheet creation, and I want to underline the word creation. There are still some struggles with macros and existing spreadsheets, but if you want to create something and you are a power Excel user, I am getting rave reviews on this one. Again, not from me; people come to me saying, "I'm trying Shortcut AI and it's incredible." I suspect once this goes more widely public, there's a good chance it becomes the definitive answer for AI in Excel.

The other solution, which is more widely available, is Numerous AI, which really focuses on embedding AI in your existing spreadsheet through custom functions. That's a different use case: it's supposed to help you add AI in useful ways to your current spreadsheets versus creating new sheets. From a product strategy perspective, Shortcut is in the stronger position, because they're inventing a solution to the entire spreadsheet problem I discussed rather than just trying to wrap AI into your existing sheets. As far as I can tell, there's no way for Numerous AI to create a very complex brand-new spreadsheet from scratch, especially from a prompt, and have it handle the kind of complexity that Shortcut is bringing to the table. They just do different things. Again, it's not necessarily a competitor thing; they just do different things.
And Shortcut is solving the bigger part of the spreadsheet intelligence problem.

Let's move on to another gap: executing code. We wouldn't expect most LLMs to do this, but we do need solutions that include AI. I'm going to give two. I don't hear a ton about either of these, but I want to throw them out there, and you can tell me what you think about them and which one you think is more useful. The first one is e2b.dev. It starts at a free tier, and it leverages AWS Firecracker. The critical piece is that it's effortless to integrate. If you want to make it easy to execute code, e2b.dev makes it easy to stand up a sandbox and try something. It's super quick. Daytona is not as cheap. I love that it's named Daytona, by the way. It's also, as you would expect, a little more established: it has ISO 27001, SOC 2, all that good stuff from a certification perspective. And it again makes it easy to execute code while ensuring you won't damage production systems. That is one of the biggest concerns people have with vibe coding: the risk that it will damage production systems. So the stakes here are real, and I think you're going to see a lot of traction in this space from startups like e2b.dev and Daytona. I'm curious if you have a strong opinion between the two.

Moving on to LLM observability. This is really important: if you're running a lot of prompts through production-grade AI, you have to understand how they're actually working. And I want to call out, too, that both of these tools are very established at this point.
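To make the sandbox idea above concrete: the core pattern these services provide is "run generated code somewhere isolated, with a hard timeout, never in your main environment." A rough local approximation is sketched below. This is a conceptual sketch only: a child process still shares the host's filesystem and network, nothing like the microVM isolation e2b.dev or Daytona actually provide, and the `run_untrusted` helper name is my own.

```python
import subprocess
import sys

def run_untrusted(code: str, timeout: float = 5.0) -> tuple[str, str]:
    """Execute generated code in a separate interpreter with a timeout.

    NOTE: this is only a stand-in for real isolation (Firecracker
    microVMs, containers). A subprocess can still touch your files
    and network; it only protects your interpreter's state and caps
    runaway executions via the timeout.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],  # fresh interpreter, not your session
        capture_output=True,
        text=True,
        timeout=timeout,               # raises TimeoutExpired on runaways
    )
    return proc.stdout, proc.stderr

out, err = run_untrusted("print(sum(range(10)))")
print(out.strip())
```

The gap between this sketch and a production sandbox (filesystem and network isolation, resource limits, snapshots) is exactly the product surface these startups are competing on.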
Helicone is, very simply, a clear visibility proxy that sits across your stack, makes it really obvious where your chatbot logs are and how you can monitor them, and ultimately enables you to track latency, costs, and errors across more than 100 model providers in a single gateway. So far so good. I actually really like it, and a lot of companies use it. Another one that is also strong is Langfuse. You get observability, tracing, and evaluation frameworks; again, they have SOC 2 and ISO 27001; you can track parent-child relationships with execution tracing; and you can automate quality assessment in ways that Helicone doesn't always attempt. So there are some differences between the two to dig into. In a sense, the observability piece is something we have had a little more runway on; it's been an obvious problem for a while, whereas the vibe-coding execution piece and the sandbox piece are newer, because vibe coding itself is only a few months old, and so we're still figuring out where the wins are there.

Let's move on to story delivery, another gap I want to call out. This one, again, is maybe not quite as widely accessible as it could be. I believe it's in public beta now, but I do worry that they're going to get overwhelmed. Chronicle is out. Chronicle enables really, really high-quality storytelling: pixel-perfect components, built-in interactivity and motion. The idea is to reach the massive consultant army that is always building PowerPoints and struggling to use AI to do so effectively. So you want a workflow that is keyboard-first and enables presentation creation in 8 or 10 minutes versus hours.
And you want to be in a position where you can deliver on that promise in a way that is near-perfect out of the gate, as long as you know what you want to say from a story perspective. You'll notice I am not mentioning Gamma here, and that is because Gamma has been able to evolve but has not gotten to the level of professional quality where I, or anyone else who presents to a serious CEO, would really want to use it. It just hasn't been able to master the combined storytelling arc in visual and text.

Storydoc is an option. It's a little more mature, and it really is designed to create elements that ChatGPT can't conceptualize. It does not fit neatly into the PowerPoint bucket in the same way that Chronicle does. I think part of what Chronicle is looking to do is become the new PowerPoint, with more dynamic features that PowerPoint just can't do, and it's designed to key off the fact that we really like slides; we've had slides in the workplace for 40 years. In that sense, I think Chronicle is better positioned for high-stakes presentations where excellence matters, especially design excellence, and Storydoc is really handy if you just need to put together a quick, somewhat visual doc. Maybe sales teams can use it, marketing content, that kind of thing.

All right, let's go to voice intake. We talked about the fact that ChatGPT just summarizes notes. There's a lot of note-taking out there. I use Granola, but Granola is actually not what I'm going to talk about here. I want to talk about Nata and Whisper Flow. Nata is an extremely accurate, high-quality audio transcriber, and it can process hour-long recordings in just about 5 minutes.
It's very efficient at processing a recording, and it handles 58 different transcription languages, a bunch of them. If you're just trying to get meeting notes and transcribe them really effectively, a chatbot is not great for that; Nata, like most of these purpose-built tools, is definitely going to be better than your standard chatbot for the experience. Whisper Flow has a different approach. Nata is just obsessed with transcriptions; Whisper Flow is more like, we think voice is the new interface, and we're going to enable system-wide dictation. So you're going to be able to use Whisper Flow in all of your existing apps, which some people really like. They want to move to voice as the interface because we can talk faster than we can type, and Whisper Flow gives them 3 or 4x their traditional typing speed in a wide range of apps. I think it's billed as sub-second latency; I've tried it, and it's not always sub-second, but it's quite fast. And it supports a hundred-some languages with automatic detection. Again, I think it's really interesting to see in these examples how these products are solving different pieces of the problem. Whisper Flow really conceptualizes voice as an interface, so they're looking to plug into your existing apps, whereas Nata conceptualizes voice as something that needs accuracy to transcribe, so they're just obsessed with that, and it's a very clean point solution. Your mileage is going to vary, as is your problem set. You have to think about where you really care about the workflow speedup.

And as we wrap this up, that's where I want to leave you. I want you to think about your biggest time sink. And if you're on a team, I want you to think about your biggest shared pain point.
Really, what you need is to get clear on that and then go back and look at tools that make sense. I've laid out 12 tools here that I think are useful for some structural gaps that come up in LLMs. Your biggest time sink, your biggest pain point, may or may not be one of the six issues I identified with AI chatbots like ChatGPT. You may have a different one. That's okay. The point is that this video should challenge you to think about where you are overindexing on time spent in a chatbot, or time spent working around a chatbot flow, and ask yourself: is there a point solution for AI that could solve this that I just haven't taken the time to invest in? If it would save you 10 hours weekly, it's worth finding out if there's an AI tool that can do it. And there are lots of stories across the 12 tools I've described that are in that category. You can imagine that if you're using Shortcut and creating a bunch of Excel sheets is your living, it's going to save you a lot of time. Same with Nata: if you're just trying to transcribe stuff, it's going to save you a ton of time. So my challenge to you is to not regard the 100,000-tool universe of AI as a blank sea of tools that is impossible to parse. There are useful tools in there, and the way to fish them out is understanding your own pain points. That's what really matters. That's what distinguishes people who can add tools strategically to their stack, fixing what ChatGPT can't do, from people who are just rolling their eyes and saying, it's too much, I can't do it. So there you go: do you know your own pain points?