
Tokenizable Data: Docs vs Spreadsheets

Key Points

  • The first step in assessing whether AI can handle a task is determining if the underlying data is “tokenizable,” meaning it can be represented as text-like chunks that fit into a document.
  • Tokenizable data is categorized into tiers: Tier A (easily tokenized, like wiki text), Tier B (moderately tokenizable, such as spreadsheet‑scale tables that may need preprocessing), and Tier C (large data lakes or massive time‑series that are difficult to fit into a context window).
  • While AI readily processes Word documents, it struggles with spreadsheets and numeric accuracy, requiring specialized tools (e.g., Datarails) or advanced techniques to extract meaningful insights.
  • Recent advances like OpenAI’s agent mode, which can generate and manipulate Excel sheets, show progress, but handling large, complex datasets still often exceeds current AI capabilities without dedicated solutions.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=SRYgH2WvknQ](https://www.youtube.com/watch?v=SRYgH2WvknQ)
**Duration:** 00:13:04

## Sections

- [00:00:00](https://www.youtube.com/watch?v=SRYgH2WvknQ&t=0s) **Understanding Tokenizable Data in AI** - The speaker clarifies the concept of “tokenizable” data by suggesting that anything that can be expressed in a document is likely tokenizable, and explains how this notion determines whether a task fits within AI’s context window and handling capabilities.
- [00:03:33](https://www.youtube.com/watch?v=SRYgH2WvknQ&t=213s) **Tokenizable Data Simplifies AI Adoption** - The speaker argues that data small enough to fit in a Word document or even on a napkin is far easier to tokenize and integrate with LLMs, whereas massive, complex datasets in data lakes make AI architecture much more challenging.
- [00:09:22](https://www.youtube.com/watch?v=SRYgH2WvknQ&t=562s) **Choosing Prompt Length Strategically** - The speaker explains that large, context-rich prompts are best for well-defined, production-focused tasks, while short, iterative prompts are more effective for exploratory, brainstorming, or casual conversations.
- [00:12:57](https://www.youtube.com/watch?v=SRYgH2WvknQ&t=777s) **Balancing Thoughts on Big Prompts** - The speaker acknowledges that conversations aren't solely about big prompts, yet confesses a personal preference for them and stresses their honesty.

## Full Transcript
I want to take a minute to talk about three tricky ideas in AI. I want to explain why they're confusing, why they're hard to understand, why I often get questions about them, and I want to make sure I explain them clearly enough that you can understand them and teach them to others, because they underlie a lot of the concepts I teach and talk about, and I find that people often misunderstand them.

Number one: what is tokenizable data? I talk about things that are tokenizable, I talk about tokenizable distributions, and I can see people glazing over: what does tokenizable mean? Very simply, ask yourself whether a piece of data in your business, or in your world, could appear in a document. If it could, that's a really good sign; it's probably tokenizable. If you can't imagine it fitting in a document, it's probably not tokenizable. When people ask whether AI can do something, they often think about the task as a whole, but I always ask about the data and the tokens first. Can I even fit it in? Can I even see whether the tokens will go into the system? Then we get to the subsequent questions I talk about a lot: Is there too much data here? Is it too big for the context window? Is the task too multifaceted for a single prompt? Is the task too complex for an AI to handle with nuance? AI often polishes the nuance off a task. Those are all questions downstream of tokenizable data. Understand that thinking about whether AI can do something starts with the token. A token is just a little chunk that passes into the transformer; it's a piece of a word, about four characters.

As an example of something that doesn't easily tokenize: spreadsheets. You need special techniques for spreadsheets.
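That four-characters-per-token rule of thumb gives a quick back-of-the-envelope check for whether a piece of data even fits a context window. Here is a minimal sketch in Python; the 128,000-token default window is an illustrative assumption for this example, not any particular model's spec:

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters per token rule of thumb."""
    return max(1, round(len(text) / 4))

def fits_context(text: str, context_window: int = 128_000) -> bool:
    """Quick triage: would this text plausibly fit in the context window?
    The 128k default is an illustrative assumption, not a model spec."""
    return estimate_tokens(text) <= context_window

# A long Word doc of ~40,000 characters is roughly 10,000 tokens.
doc = "x" * 40_000
print(estimate_tokens(doc), fits_context(doc))  # → 10000 True
```

This is only a triage heuristic; real tokenizers vary by model, so treat the estimate as order-of-magnitude guidance for the "can I even fit it in?" question.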
AI is still considerably farther behind on spreadsheets than it is on Word docs. Is it getting better? Absolutely. But it's not where Word-doc processing is. You can hand a very large Word document to an AI and ask it to at least give you a sense of what's in there. If you hand a large spreadsheet to an AI, and you value accuracy in numbers, you're not going to get nearly as lucky in most cases unless you have a specialized tool. That's why tools like Datarails exist: you need specialized tools that help. And we are seeing progress. Notably, OpenAI released agent mode, which can now create Excel sheets. It's getting better.

Think of tokenizable data in tiers. Tier A is easily tokenized: anything in your wiki is tokenizable, super easily. Tier B is data at spreadsheet scale: it fits in a spreadsheet, it's not super easy to tokenize, but there's probably some massaging you can do to get it in there. Tier C is data in a data lake. It can potentially be made available for search through concepts like agentic search, but it's too big; it doesn't easily tokenize, because it's hundreds of thousands or millions of rows of time-series data that you have to relate, and traditional LLM transformer architectures don't do well with that kind of data. Now, you can take small pieces of it, look at tokenization, and maybe learn something there. But most of the time, when people talk about hooking LLMs up to large sources of data, what they're really saying is that they've figured out how to search the data lake to retrieve useful pieces of information they can ladder into insights, and there are some preparatory steps they need to take to do that.
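One way to make the tier idea concrete is a rough triage by dataset size before deciding on an architecture. A sketch follows; the row-count thresholds are made-up illustrative assumptions, not numbers from the talk:

```python
def data_tier(rows: int, is_plain_text: bool = False) -> str:
    """Rough triage of a dataset into the talk's tiers A/B/C.
    Thresholds are illustrative assumptions, not fixed rules."""
    if is_plain_text or rows <= 1_000:
        return "A"   # wiki pages, policy docs: tokenize directly
    if rows <= 100_000:
        return "B"   # spreadsheet scale: massage/preprocess, then tokenize
    return "C"       # data-lake scale: search and retrieve, don't tokenize wholesale

print(data_tier(0, is_plain_text=True))  # → A
print(data_tier(50_000))                 # → B
print(data_tier(5_000_000))              # → C
```

The point of the sketch is the decision shape, not the cutoffs: Tier A goes straight into the prompt, Tier B needs preprocessing, and Tier C needs a retrieval layer in front of the LLM.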
Well, I'm trying to keep it simple; we're not going to go too far down that path today. Think of tokenizable data this way: first, stuff you can fit in a Word doc. Second, maybe it can go into an Excel sheet. Third, anything so big, massive, complex, and structured that it has to go into a data warehouse or a data lake is going to be much harder. And what's interesting is that the easier something is to tokenize, the more chance you have to shape your destiny with AI and that content. If you're working with data lakes, it is actually quite difficult to pivot and figure out how to architect AI solutions on top; organizations wrestle with this all the time. But if you have something much simpler, like your company policies and how you write your documents, all in one neat Word doc or three or four, you can easily get that into LLMs, immediately control your destiny, and be off to the races. So think in terms of tokenizable data. Think in terms of whether it fits in a Word doc, and maybe whether you can sketch it on a napkin, because whether you can sketch it on a napkin is also a really handy test for context-window size. If you can draw the complexity on a napkin, AI can be very helpful. If you can't fit the complexity onto the napkin, it may be too complex for a nuanced perspective from AI.

Okay, concept number two, moving on from tokenization: jagged intelligence. I talk about jagged intelligence a lot, and again, people's eyes glaze over. Jagged intelligence simply means that we have AIs that are in some ways as smart as Einstein and in other ways worse than the worst intern you've ever met. (I was a pretty bad intern, to be honest with you.) The problem here is that AI is not a continuous intelligence surface.
It has really, really large gaps, driven by well-known issues, particularly around memory. If AI can't remember something, it can't learn as it goes. And yes, LLM teams are working on this, but it's a hard problem, and they haven't made a ton of progress yet. For the moment, it is very difficult to get an AI to consistently do certain simple things that require memory. For example, if you talk to the AI about your role and ask it to fulfill an assignment, to write you an excellent article, a proposal, or an email, you have to walk it through all of that again and again and again. And if you make any mistake, it will make a mistake. That's jagged intelligence. It's good enough to write those emails; it's good enough to write that proposal or that article. It is not good enough to remember how to do it, and it's not good enough to avoid being extremely sensitive to mistakes you make in the briefing. In a sense, you have a Shakespeare who is obsessed with following instructions, and if you make any mistake in the instructions, Shakespeare is going to make mistakes. This is why prompting matters so much: you're essentially trying to get the LLM to do what it does best rather than getting stuck where it doesn't do well. Another place LLMs don't do well, a low point in jagged intelligence, is math. They will call other tools to do math, and there are specialized models; Gemini has one, and OpenAI apparently has one that does math-olympiad problems. But when it comes to whether 9.9 or 9.11 is bigger, LLMs can still struggle.
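Part of why the 9.9-versus-9.11 question trips models up is that the two strings support two legitimate readings: as decimals, 9.9 is larger, but as version numbers, 9.11 comes after 9.9. A small sketch of the two interpretations:

```python
def as_decimal(s: str) -> float:
    """Read the string as a decimal number: 9.9 means 9.90."""
    return float(s)

def as_version(s: str) -> tuple[int, ...]:
    """Read the string as a dotted version: '9.11' -> (9, 11),
    comparing each component as a separate integer."""
    return tuple(int(part) for part in s.split("."))

print(as_decimal("9.9") > as_decimal("9.11"))   # decimals: 9.90 > 9.11 → True
print(as_version("9.11") > as_version("9.9"))   # versions: (9, 11) > (9, 9) → True
```

Both comparisons print True, which is the ambiguity in a nutshell: the "bigger" answer flips depending on which reading you intend, and a model that silently picks the wrong one looks innumerate.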
And so if you're trying to look at mathematical modeling of concepts, if you're trying to understand how to weigh the levers of a business, you get some insight from AI, but I find the insight tends to cluster around the existing distribution of strategic advice, McKinsey decks more broadly. It doesn't tend to be deeply insightful unless you are extraordinarily good at giving it strategic intent and excellent context; then it can reason across your information specifically. And that highlights another one of the tricky things about jagged intelligence: it can be made less jagged if you prompt better. If you are better at communicating your intent, you can erase some of those gaps a little bit. You're still going to feel the gaps. I still feel them, because I find that AI is really, really good at things like outlining and often not as good at things like capturing tone the way I want it captured; I'm very picky about tone. I feel it when I'm asking AI to think about strategy: AI is good as a sounding board, but it doesn't feel as refined as it needs to be. The more you cultivate high taste, the more you cultivate saying "it can be better, and I know how it can be better," the more sensitive you are going to be to jagged intelligence. So my challenge to you is to ask: where is my taste bar? If I can't sense jagged intelligence, have I insisted on a high enough bar with AI? Because I bet you know something better than AI does, and you can start to insist on a high bar there. So that's the idea of jagged intelligence: a Shakespeare who has to follow instructions and has amnesia.

Third concept: when do you apply big prompts versus casual chats?
I get this question because I think people perceive me as the kind of guy who always writes the fancy prompts. I get it; I write the big prompts. I understand that. What I want to suggest is that the planning and thoughtfulness that go into an excellent prompt pay off when you have an important task that you want to do. And when you have something you need to iterate on and discover as you go, it more often pays just to start with a sharp one- or two-liner and go from there. Maybe people will say, well, that's obvious: if it's important, you put more time into it. It's a little more nuanced than that. If it's important, and if it needs to be anchored around a lot of context you give it, a big prompt can make sense. If it's casual, or if it's iterative in nature, it can make sense to have a longer conversation as you go, starting with shorter prompts. In other words, in iterative tasks you are discovering meaning with the AI as you go, so it makes sense to start shorter, with just a little bit of intent. Larger prompts are for anchoring around a specific topic. Sure, you're going to have multiple turns, but it's going to be a conversation that happens inside the box you've set, with a big piece of strategic intent at the top in a big prompt. If you're trying to iterate and riff and brainstorm and think through things, it's often much more useful to start with a very short prompt and leave the model room to expand. So to me, it's a little deceptive to think that meaningful work only gets done with big prompts, because I can get very meaningful work done with short prompts that I kick back and forth rapidly when I need to discover the meaning iteratively.
And so my encouragement to you is to ask: is this piece of work something I already know enough about that I want to focus on production? Probably a bigger prompt. Is this piece of work something I don't even know the shape of, something I need to discover? Probably a shorter prompt. Both can be valuable. And for casual chats, where you're just trying to brainstorm and riff around (again, iterative), it's going to be a shorter prompt. You can use a more formal brainstorming process if you have context and you want to constrain it and set your assumptions; this is often what we do with humans when we have a formal brainstorming session. But we all know that humans also think well at drinks after work, and you can have that equivalent conversation with AI and still get a ton of value. So one of the things I want to encourage you with is this: don't think of it as "Nate writes big prompts, so I have to write them too." Think of it as knowing when to use a larger, more formal prompt versus when to use a casual one.

If you put this all together, I think you are going to get farther with AI this week if you can do a couple of little exercises that help you think through these challenges. First, find something you can tokenize this week, maybe something you haven't tokenized before. I scribble stuff on notepads all the time, in terrible handwriting. I find that with the right model (o3 is better at this), I can visually process that data, get it into text, and tokenize it. That's an example of tokenization for me. You can find one for you.
Second, look for something that feels jagged with AI, and be intentional about how you cultivate the strength, the peak, the good part, the Shakespeare part of AI, rather than the part that isn't so intelligent, isn't so good. And then third, just keep an eye on how often you feel like the prompt fit the project. If you feel like the prompt fit the project some of the time, say 60 to 70 percent of the time, that's about where I am; I sometimes have to restart prompts because I realize that wasn't the right prompt and I need to retry. And if you feel like you are just never getting the right prompt, that's a signal to dive in (I've got lots of material on prompts) and to think about how you communicate intent and what kind of work you want to do. Where do you want to iterate for value, and where do you want to anchor and define and have a big conversation first? Those three pieces, if you get them, will help you enormously in understanding how to use AI and how to augment your work with it. So, I hope this has been helpful for you. I hope you understand tokenization better. I hope you understand jagged intelligence better. And I hope you understand that it's not just all about big prompts. But I do like big prompts. And I cannot lie.