Beyond Chatbots: Tools for LLM Gaps
Key Points
- We rely on chatbots by default because the AI landscape is flooded with thousands of tools, and developers keep them “sticky” (e.g., adding memory) to capture our attention.
- Large language models still have six core structural limitations—such as weak spatial reasoning and poor spreadsheet context handling—that prevent them from fully replacing specialized tools.
- These gaps exist not because they’re impossible to solve, but because model builders are preoccupied with broader challenges like scaling GPU resources to serve massive user bases.
- Examples of the limitations include LLMs producing sub‑par design work (e.g., messy PowerPoint slides) and struggling to interpret the multidimensional relationships in complex spreadsheets.
- The video proposes 12 complementary tools (two per identified gap) to help users fill those shortcomings and work more efficiently.
Sections
- The Limits of Chatbot Dominance - The speaker explains why users default to chatbots amid a flood of AI tools, outlines six structural limitations of large language models, and urges exploring specialized alternatives beyond chat interfaces.
- Limits of LLMs in Production - The speaker outlines how current LLMs lack safe code execution, operational visibility, and the ability to craft coherent visual narratives, exposing critical gaps for AI integration in production pipelines.
- Tool Comparisons: Mockups & Spreadsheets - The speaker differentiates Visily’s fast wireframing from code‑generating tools, then contrasts Shortcut AI’s advanced spreadsheet creation capabilities with the more widely available Numerous AI for embedding AI into existing sheets.
- Chronicle Aims to Replace PowerPoint - The speaker argues that Chronicle, now in public beta, offers pixel‑perfect, interactive presentations that outclass PowerPoint and rivals tools like Gamma, positioning it as a professional, keyboard‑first workflow for rapid, high‑quality storytelling.
- Choosing the Right AI Tool - The speaker advises listeners to pinpoint their biggest workflow bottlenecks or shared team pain points, then match those specific needs to targeted AI solutions—like Whisperflow’s app‑level voice interface or Nata’s high‑accuracy transcription—rather than over‑relying on generic chat‑bot workflows.
Full Transcript
Source: https://www.youtube.com/watch?v=kN1h33Fbiio (Duration: 00:17:56)
Section timestamps: 00:00:00, 00:04:18, 00:07:58, 00:12:34, 00:15:47
We all use our chatbots too much. We do. We all use our chatbots because that is the default thing to do. And it doesn't help that there are a hundred thousand other tools. I'm not actually making that number up; that's roughly how many AI tools are out there in total. It's just too many.
How do we figure out which one to use?
And so we end up defaulting back into the chatbot space. The model makers know that our mind share is allocated there, so they continue to invest in making those experiences stickier. That's why ChatGPT has memory, for instance: it's a stickiness feature. When you think about it that way, it makes sense to periodically poke our noses out beyond chatbot land, look around the landscape, and ask: what other tools fill gaps that LLMs are inherently struggling to close?
So, first in this video, I'm going to lay out some of those gaps where LLMs may not be the best in the world. Not because it's impossible for the model makers to close the gaps, but because they are preoccupied with larger, more generic problem sets, like, frankly, finding enough GPUs to serve their models to all the people who want them. And that is also not something I made up; it's very well documented as a prime concern for both Anthropic and OpenAI right now. So
what are the six structural limitations of LLMs that are being partially compensated for inside chatbots right now, and are there specialized tools that would help us get farther and do our work more effectively? From there, we'll get into 12 tools, two for each of the structural gaps, that you can survey. My goal here isn't to
convince you to use these tools. It's to
help you get a sense of how to think
about structural gaps in the chatbot
experience and then to understand what
tools might be useful for closing the
structural gaps that matter to you. So
gap number one: spatial reasoning. Yes, LLMs are absolutely getting better at this. I'm still impressed that o3 can produce 3D graphs, but fundamentally, if you are trying to get to design, LLMs are not phenomenal designers. I have yet to see an LLM do a great job at that. My favorite anecdote here is Agent Mode, where, with great effort, ChatGPT taught an AI agent to make a PowerPoint. The results are less than stellar, to put it very kindly. The text tends to run over and off the slide, it tends to be poorly organized on the slide, and it doesn't work well with the visuals; the visuals feel slapped on. I know interns that could do a vastly better job. Gap number two: spreadsheet
context. We have an issue with spreadsheets because spreadsheets have orthogonal meaning. In other words, they have relationships horizontally and vertically, and in complex spreadsheets there are relationships between tabs, relationships between special columns and rows and the regular columns and rows of data, and there are formulas. It is really, really challenging for LLMs that are designed for next-token prediction to master spreadsheets. Again, we see some progress. I'll go back to Agent Mode: it can make a spreadsheet, even a spreadsheet with a simple formula. But it cannot process your existing spreadsheet well, and it can't build a fully complex spreadsheet yet. I have tried it, and it's only okay. When you
ask other LLMs, like Claude Opus 4, ChatGPT's o3, or Gemini 2.5 Pro, they range from insisting on CSVs, which are comma-delimited and therefore more token-friendly, to trying to ingest and process Excel files and still struggling: still struggling if the files are large, still struggling to read all the detail. I've uploaded 40- or 50-row Excel spreadsheets and found that even at that scale, which anyone who uses Excel will know is tiny, they can still sometimes struggle to list every row. They just can't seem to read all of
problem. Code execution also remains a
challenge. Fundamentally, none of the
LLMs were constructed with the idea of
being code execution environments. and I
don't anticipate them becoming code
execution environments anytime soon. And
for those of you who are not coders,
that means running the code. The fact
that Claude can spin up a little React
component and you can kind of run a
little applet inside a preview window is
about the best it gets right now. And
that's still very very minor, right?
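To make the gap concrete, here is a minimal sketch of what a code-execution environment has to provide at minimum: process isolation, a hard time limit, and captured output. This is not the API of any product mentioned in this video; the sandbox services discussed later do this properly with microVMs, while this sketch just runs untrusted code in a separate Python subprocess with a timeout.

```python
import subprocess
import sys

def run_untrusted(code: str, timeout_s: float = 5.0) -> tuple[str, str, int]:
    """Run a snippet in a separate Python process with a hard timeout.

    A real sandbox also isolates the filesystem and network; a bare
    subprocess only isolates interpreter state and lets us kill
    runaway code.
    """
    try:
        proc = subprocess.run(
            # -I puts Python in isolated mode (ignores env vars and user site)
            [sys.executable, "-I", "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.stdout, proc.stderr, proc.returncode
    except subprocess.TimeoutExpired:
        return "", "timed out", -1

stdout, stderr, rc = run_untrusted("print(2 + 2)")
print(stdout.strip(), rc)  # → 4 0
```

An infinite loop passed to `run_untrusted` simply comes back as `"timed out"` instead of hanging the caller, which is the basic safety property a chatbot's preview window does not give you.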
It's not really a full code execution environment, and certainly not something you would want to put into production. That may seem obvious, and you may think it isn't even an AI-related thing. But increasingly, because prompts, AI-generated code, and LLMs themselves are integrated into our software production pipelines, we do need software that has both code execution and AI capabilities.
Another gap, operational visibility.
Again, why would you expect this? But
LLMs are not built to give you any kind
of operational visibility on your AI
software in production. They're just
not. No big surprise there. Next, narrative structure is a huge problem for AI, and this is one that I don't think gets talked about a lot. The difference between text and experience is very difficult for LLMs to bridge. They often respond with various versions of text, because they can output text, but they can't think through visual hierarchy, and they sometimes have trouble thinking through the structure of a story in a way that's accessible. This is an area where I would expect breakthroughs like GPT-5 to be helpful, but I still think there's going to be a complex interplay between the structure of a narrative and the way a narrative is visually presented that is going to be hard for traditional LLMs to master. It's just not something that is easy to do unless it's your sole focus. And even then,
it's quite difficult. One more, last but
not least: voice processing. ChatGPT famously launched meeting notes recently. I have used them; they are only okay. They don't give you live transcriptions, they give you only one generic summary, and you can't really access the transcript. It's very much a bolt-on feature. And that is exactly what I would expect from a team that is fundamentally resource-constrained and trying to ship a lot of things to a user base of 800 million or more. They
cannot do everything perfectly. And
therein lies the opportunity for
builders like the 12 tools that we're
going to outline here. Again, these are
not the best 12 tools ever. I think they
are great answers for these six gaps
that I've called out, the six gaps being voice processing, narrative structure, operational visibility, code execution, spreadsheet context, and spatial reasoning. Those are not the only gaps, but I thought they were really illustrative of the kinds of gaps that LLMs have, and these 12 tools do a good job hitting them. So look at these, think about the strengths and weaknesses I'll call out, and think about where your workflow doesn't work well with a chatbot. Tool number one is in the interface-builder category: Magic Patterns. I've had people coming to me with Magic Patterns; I've not been the one sharing it out, but people have come to me and shown it to me because they like it so much.
Fundamentally, it makes it extremely easy to extract a design out of a screenshot or something else, turn it into working components, and get back a working piece of style-compliant front-end code that illustrates a vision for a new interface. Which is a complicated way of saying it is really easy now to copy the style off a website, change it, and show your engineers. And that is something every single marketer, PM, program manager, CS person, and anyone else who has an idea for something that should be different about the tool or the app or the website has wished for. We have all wished for it to be magically easy to say: here's my sketch, here's my concept, and magically it's in the right style. Now that's as simple as taking a screenshot and throwing it into Magic Patterns. A specialized tool closes a specific gap in an LLM. Is it perfect? No. It's not designed for full app building, right? But does it give you a quick sketch sense? Is it designed for exactly what it does? Well, yeah, it does. Visily is
another option there. It is a little bit cheaper than Magic Patterns, and it focuses on rapid mockup creation rather than code generation. So, if you need the code components, don't go with Visily; if you just need the quick mockup, wireframing can be much faster with Visily. Again, both are in the interface category, but they do slightly different things, so I want to lay them out as distinct. My goal here isn't to make these out to be competitors, but to help you understand how each tool is attacking a particular gap that a chatbot has. All right, let's move on
to the second one: spreadsheet intelligence. What do we have? Shortcut AI is exploding. It's in early access, so you may not be able to get it, but it is definitely the best I've seen at tackling complex spreadsheet creation. And I want to underline the word creation. There are still some struggles with macros and existing spreadsheets, but if you want to create something and you are a power Excel user, I am hearing rave reviews on this one. Again, not from me; people come to me saying, "I'm trying Shortcut AI and it's incredible." And so, I suspect that once this goes more widely public, there's a good chance it becomes the definitive answer for AI and Excel.
The other solution, which is more widely available, is Numerous AI, which really focuses on embedding AI in your existing spreadsheet through custom functions.
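It's worth sketching what the custom-function pattern amounts to. Tools in this category typically expose a formula such as =AI(...) that, for each cell, sends a prompt plus the cell's value to a model and writes the answer back. The sketch below is hypothetical (it is not Numerous AI's actual API), and it substitutes a trivial keyword rule for the real model call:

```python
def fake_model(prompt: str, value: str) -> str:
    """Stand-in for an LLM call; a real tool would hit a model API here."""
    if "sentiment" in prompt.lower():
        positives = ("great", "love", "excellent")
        return "positive" if any(w in value.lower() for w in positives) else "negative"
    return value.upper()

def ai_column(prompt: str, column: list[str]) -> list[str]:
    """What an =AI(...) formula filled down a column amounts to:
    one model call per cell, results written back cell by cell."""
    return [fake_model(prompt, cell) for cell in column]

reviews = ["Great product, love it", "Broke after two days"]
print(ai_column("sentiment of this review", reviews))  # → ['positive', 'negative']
```

Filling the formula down a column is just this mapping applied row by row, which is why the approach fits enriching existing sheets rather than creating complex new ones from scratch.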
That's a different use case: it's supposed to help you add AI in useful ways to your current spreadsheets rather than just creating new sheets. From a product strategy perspective, Shortcut is in the stronger position, because they're inventing a solution to the entire spreadsheet problem I discussed rather than just wrapping AI into your existing sheets. There's no way, as far as I can tell, for Numerous AI to create a brand-new, very complex spreadsheet from scratch, especially from a prompt, and handle the kind of complexity that Shortcut is bringing to the table. They just do different things; again, it's not necessarily a competitor thing. And Shortcut is solving the bigger part of the spreadsheet intelligence problem. Let's
move on to another gap: executing code. We wouldn't expect most LLMs to do this, but we do need solutions that include AI. I'm going to give two. I don't hear a ton about either of these, but I want to throw them out there, and you can tell me what you think about them and which one you think is more useful. The first one is e2b.dev. It starts at a free tier and leverages AWS Firecracker. The critical piece is that it's effortless to integrate: if you want to make it easy to execute code, e2b.dev makes it easy to stand up a sandbox and try something. It's super quick. Daytona is
not as cheap. I love that it's named Daytona, by the way. It's also, as you would expect, a little more established: it has ISO 27001, SOC 2, all that good stuff from a certification perspective. And it again makes it easy to execute code and ensures you won't damage production systems. That is one of the biggest concerns people have with vibe coding: the risk that it will damage production systems. So, the stakes here are real, and I think you're going to see a lot of traction in this space from startups like e2b.dev and Daytona. I'm curious if you have a strong opinion between the two. Moving on to
LLM observability. This is really important: if you're running a lot of prompts through production-grade AI, you have to understand how they're actually working. And I want to call out that both of these tools are very established at this point. Helicone is, very simply, a clear-visibility proxy that sits across your stack, makes it really obvious where your chatbot logs are and how you can monitor them, and ultimately enables you to track latency, costs, and errors across more than 100 model providers in a single gateway. So far so good. I actually really like it, and a lot of companies use it. Another strong one is Langfuse. You get observability, tracing, and evaluation frameworks; again, they have SOC 2 and ISO 27001. You can track parent-child relationships with execution tracing, and you can automate quality assessment in ways that Helicone doesn't always attempt, so there are some differences between the two to dig into. I think, in a sense, we've had a little more runway on the observability piece; it's been an obvious problem for a while, whereas the vibe-coding execution piece and the sandbox piece are newer, because vibe coding itself is only a few months old, and so we're still figuring out where the wins are. Let's
move on to story delivery, another gap that we called out. This one, again, is maybe not quite as widely accessible as it could be. I believe it's in public beta now, but I do worry that they're going to get overwhelmed. Chronicle is out. Chronicle enables really, really high-quality storytelling: pixel-perfect components, built-in interactivity, and motion. The idea is to reach the sort of massive consultant army that is always building PowerPoints and struggling to use AI to do so effectively. And so you want a workflow that is keyboard-first, that enables presentation creation in 8 or 10 minutes versus hours, and that delivers on that promise in a way that is near-perfect out of the gate, as long as you know what you want to say from a story perspective. You'll notice I am not mentioning Gamma here, and that is because Gamma has been able to evolve but has not gotten to the level of professional quality where I, or anyone else who presents to a serious CEO, would really want to use it. It just hasn't been able to master the combined storytelling arc in visuals and text.
Storydoc is an option. It's a little more mature, and it really is designed to create elements that ChatGPT can't conceptualize. It does not fit neatly into the PowerPoint bucket in the same way that Chronicle does. I think, in a sense, part of what Chronicle is looking to do is to become the new PowerPoint, with more dynamic features that PowerPoint just can't do. And it's designed to key off the fact that we really like slides; we've had slides in the workplace for 40 years. So in that sense, I think Chronicle is better positioned for high-stakes presentations where excellence matters, especially design excellence, and Storydoc is really handy if you just need to put together a quick, somewhat visual doc. Maybe sales teams can use it, marketing content, that kind of thing.
All right, let's go to voice intake. So, we talked about the fact that ChatGPT just summarizes notes. There's a lot of note taking out there. I use Granola, but Granola is actually not what I'm going to talk about here; I want to talk about Nata and Whisper Flow. Nata is an extremely accurate, high-quality audio transcriber, and it can process hour-long recordings in just about five minutes; it's very efficient at processing a recording. It handles 58 different transcription languages, a bunch of them. And if you're just trying to get to meeting notes and transcribe them really effectively, it's definitely going to be, as most of these purpose-built tools are, better than your standard chatbot for that experience. Whisper Flow has a different
approach. So, Nata is just obsessed with transcriptions, right? Whisper Flow is more like: we think voice is the new interface, and we're going to enable systemwide dictation. So you can use Whisper Flow in all of your existing apps, which some people really like. They want to move to voice as the interface, because we can talk faster than we can type, and so Whisper Flow gives them 3 or 4x their traditional typing speed in a wide range of apps. I think it claims sub-second latency; I've tried it, and it's not always sub-second, but it's quite fast. And it supports a hundred-some languages with automatic detection.
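That 3-to-4x figure is easy to sanity-check with rough numbers. The speeds below are ballpark assumptions on my part (roughly 40 words per minute for a typical typist versus roughly 150 for conversational speech), not figures from the video:

```python
# Rough sanity check of the claimed speedup from dictating instead of typing.
typing_wpm = 40     # ballpark average typing speed (assumption)
speaking_wpm = 150  # ballpark conversational speaking rate (assumption)

speedup = speaking_wpm / typing_wpm
print(f"speedup: {speedup:.2f}x")  # → speedup: 3.75x
```

With those assumptions the ratio lands at 3.75x, squarely inside the 3-to-4x range the speaker cites.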
Again, I think it's really interesting to see in these examples how these products are solving different pieces of the problem. Whisper Flow really conceptualizes voice as an interface, so it looks to plug into your existing apps, whereas Nata conceptualizes voice as something that needs accuracy to transcribe, so it is just obsessed with that; it's a very clean point solution. Your mileage is going to vary, as is your problem set. You have to think about where you really care about the workflow speedup. And as we wrap this up, that's
where I want to leave you. I want you to think about your biggest time sink. If you're on a team, I want you to think about your biggest shared pain point. Really, what you need is to get clear on that and then go back and look at tools that make sense. I've laid out 12 tools here that I think are useful for some structural gaps that come up in LLMs. Your biggest time sink, your biggest pain point, may or may not be one of the six issues I identified with AI chatbots like ChatGPT. You may have a different one. That's okay. The point is that this video should challenge you to think about where you are over-indexing on time spent in a chatbot, or time spent working around a chatbot flow, and ask yourself: is there a point solution for AI that could solve this that I just haven't taken the time to invest in? If it would save you 10 hours weekly, it's worth finding out if there's an AI tool that can do it.
And there are lots of stories across the 12 tools I've described that are in that category. You can imagine that if you're using Shortcut and you can create a bunch of Excel sheets, and that's your living, it's going to save you a lot of time. Similarly with Nata: if you're just trying to transcribe stuff, it's going to save you a ton of time. And so my challenge to you is to not regard the 100,000-tool universe of AI tools as this blank sea of tools that is impossible to parse. There are useful tools in there, and the way to fish them out is to understand your own pain points. That's what really matters. That's what distinguishes people who can strategically add tools to their stack that fix what ChatGPT can't do from people who are just rolling their eyes and saying, it's too much, I can't do it. So there you go. Do you know your own pain points?