Scaling Multilingual Data to Trillion Tokens
Key Points
- The “data‑as‑oil” metaphor highlights a looming scarcity of high‑quality training data for large language models, prompting a search for scalable pathways beyond the current trillion‑token datasets.
- Scaling to ~10 trillion tokens requires a truly multilingual corpus — roughly 30‑40 % English and the rest diverse languages like Chinese, Hindi, French, and Spanish — supported by automated cleaning, deduplication, and adaptable tokenizers that respect morphological differences.
- Achieving ~100 trillion tokens demands incorporating multimodal sources such as video streams, high‑quality transcriptions of podcasts and calls (with permission), and vision data, turning the dataset into a continuous, real‑time web‑scale ingest.
- This massive expansion will need unprecedented compute, storage, and network capacity, as well as next‑generation transformer architectures capable of unifying text, image, and video modalities within a single model.
- Current models (e.g., ChatGPT) feel “unnatural” in non‑English languages because their training sets are heavily English‑centric; balancing language representation is the simplest and most effective route to truly global, high‑performance LLMs.
**Source:** [https://www.youtube.com/watch?v=wIWtp0KZa3c](https://www.youtube.com/watch?v=wIWtp0KZa3c)
**Duration:** 00:12:29

Sections
- [00:00:00](https://www.youtube.com/watch?v=wIWtp0KZa3c&t=0s) **Scaling Data for Multilingual LLMs** — The speaker argues that advancing beyond trillion-token corpora requires curated, multilingual datasets and automated cleaning tools to provide useful, diverse training data for large language models.

Full Transcript
So this past fall at the NeurIPS conference, Ilya Sutskever said that we are in a world that has data as oil, a world where data is running out, and that phrase has been haunting me ever since. I want to suggest a pathway to more data. Right now, large language models can get trained on something like a trillion tokens of very curated text. It's not the whole internet. I mean, you can start with the whole internet, but it's not clean, curated text.
There are problems, though. That's a static snapshot, and the internet continues to grow. There are also, of course, larger training sets of private data being generated all the time. So I asked myself: what kinds of breakthroughs are needed to lead to useful data, not total data but useful data, for large language models? Basically, what is the scaling framework to move from a trillion tokens up past 10 trillion tokens and beyond? That's what I want to talk about today.
If we scale to 10 trillion tokens in our training data set, we're going to need to go multilingual. We're going to need very focused, curated expansion of historically underrepresented texts. We have to go beyond our English-focused text corpus for these LLMs to a much more multinational one. It can't be 80% or 90% English; it has to be more like 30-40% English, with a lot of Chinese, a lot of Hindi, a lot of French, a lot of Spanish. It has to better represent the world's actual language mix.
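To make that mix concrete, here's a quick back-of-envelope sketch; the shares below are illustrative assumptions, not figures from the talk:

```python
# Back-of-envelope: tokens per language for a 10-trillion-token corpus.
# These shares are illustrative assumptions, not figures from the talk.
TOTAL_TOKENS = 10e12

shares = {
    "English": 0.35,   # the "30-40% English" range
    "Chinese": 0.15,
    "Hindi":   0.10,
    "Spanish": 0.10,
    "French":  0.08,
    "Other":   0.22,   # long tail of remaining languages
}
assert abs(sum(shares.values()) - 1.0) < 1e-9

for lang, share in shares.items():
    print(f"{lang:>8}: {share * TOTAL_TOKENS / 1e12:.1f}T tokens")
```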
We also need much better curation tools. We'll need to automate cleaning, and we have to have better deduplication pipelines that handle diverse languages and formats.
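Here's a minimal sketch of the core comparison in a language-agnostic dedup step; real pipelines use MinHash/LSH on top of this idea to scale to billions of documents:

```python
# Near-duplicate detection via character n-gram Jaccard similarity.
# Character shingles sidestep word segmentation, so the same check
# works for Chinese or Hindi as well as English. A production pipeline
# would swap exact Jaccard for MinHash/LSH to scale.

def shingles(text: str, n: int = 5) -> set[str]:
    text = " ".join(text.lower().split())  # normalize whitespace
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

doc1 = "Large language models need diverse multilingual data."
doc2 = "Large language models need diverse, multilingual data!"
sim = jaccard(shingles(doc1), shingles(doc2))
print(f"similarity = {sim:.2f}")   # ~0.8 for these two strings
if sim > 0.7:
    print("near-duplicate: keep only one copy")
```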
We have to have more flexible tokenizers that adapt to morphological differences across languages. And if we can get there, models are going to become much more truly multilingual, and they'll capture a broader range of our expression.
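As one concrete example of a tokenizer that adapts across languages, here's a minimal sketch of training a shared multilingual subword model with SentencePiece; the corpus path and settings are illustrative assumptions, not specifics from the talk:

```python
# Minimal sketch: train one shared subword vocabulary over a mixed
# multilingual corpus with SentencePiece. File path, vocab size, and
# coverage are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",  # hypothetical mixed-language file
    model_prefix="mlg_tokenizer",
    vocab_size=128_000,               # bigger vocabs waste fewer tokens
                                      # on morphologically rich languages
    character_coverage=0.9995,        # keep rare CJK/Indic characters
    model_type="unigram",             # handles agglutination better than
                                      # naive whitespace splitting
)

sp = spm.SentencePieceProcessor(model_file="mlg_tokenizer.model")
print(sp.encode("नमस्ते दुनिया", out_type=str))  # "Hello world" in Hindi
```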
I will say, as someone who knows Indonesian, that the model doesn't feel as natural in Indonesian as it does in English. That would make sense, because ChatGPT and other models are mostly trained on English data. So to me, the simplest path to 10 trillion tokens is actually making the training set truly reflective of the languages we speak as a global community.

Let's go a step further: what's the path to 100 trillion tokens in our training sets that are high quality?
I think video, obviously: real-time web streams, continuous ingestion of high-quality web content, social media streams, and getting large-scale transcription going of good-quality podcasts and phone calls, obviously with permission, if they're high quality.
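As a small sketch of what that transcription step could look like, here's the open-source Whisper model applied to a couple of hypothetical audio files; a production pipeline would add consent checks, quality filtering, and distributed batching:

```python
# Minimal sketch of speech-to-text ingestion with open-source Whisper.
# The file names are hypothetical stand-ins.
import whisper

model = whisper.load_model("medium")  # trades accuracy for speed

for path in ["podcast_ep1.mp3", "call_0001.wav"]:  # hypothetical files
    result = model.transcribe(path)
    # Whisper also returns a detected language, useful for routing text
    # into the right per-language curation pipeline.
    print(result["language"], result["text"][:80])
```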
And multimodal as well: getting vision involved, where we have high-quality vision tokens. Those are all pieces we have to cobble together to make the jump from 10 trillion to 100 trillion tokens. It's going to require an immense amount of compute, a lot of storage, and huge network capacity; it's going to dwarf anything we've got today. And it's going to take Transformers that can truly unify text, images, and even video in the same architecture.
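One simplified way to picture a unified architecture is a single token id space shared across modalities, so one transformer attends over everything in one sequence. This is a toy sketch; the sizes and offsets are illustrative assumptions, not any lab's actual design:

```python
# Toy sketch of a single token stream shared across modalities: text
# tokens and discretized image-patch tokens live in one vocabulary.
# All sizes/offsets are illustrative assumptions.
TEXT_VOCAB = 100_000
IMAGE_VOCAB = 8_192        # e.g., codes from a learned image tokenizer
IMAGE_OFFSET = TEXT_VOCAB  # image codes occupy ids [100000, 108192)

def to_unified(text_ids: list[int], image_codes: list[int]) -> list[int]:
    """Interleave modalities into one id space for a single model."""
    return text_ids + [IMAGE_OFFSET + c for c in image_codes]

seq = to_unified(text_ids=[17, 942, 5], image_codes=[301, 4096])
print(seq)  # [17, 942, 5, 100301, 104096]
```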
We are just now getting to truly native multimodal architectures for text and images. I know text, images, and video are on the horizon at OpenAI, but it needs to be scalable.
And if we're at 100 trillion tokens and we have continuous feeds, that opens the possibility of continuously updated models that stay current on world events. Don't ask me quite how we'd architect that; I think that's two or three jumps beyond what we have now. It's okay to dream a little bit. I'm mostly trying to color in roughly what it would take, from a tokenization perspective, to get there.
Let's go beyond: what's past 100 trillion tokens? If we get into the quadrillion space for tokens, 10 to the 15th power, we're talking about pulling in sensor logs. This is where robotics starts to come into play. I know a lot of folks have been very excited about robotics getting into the physical space, because embodied experience should dramatically expand our healthy token input selection for LLMs. So: sensor logs from cameras, lidar, tactile sensors, motor commands, the Internet of Things, and wearables, and being able to pull all of that in and actually train on it as good-quality tokens, so that we can understand, embody that understanding, and output it.
That would take even bigger GPU and TPU clusters. It would take systems that can learn from interactions and unlabeled sensor streams. It would take tools that can compress raw visual and audio frames into tokens that are semantically meaningful, and they would have to do all of this at an unheard-of scale. So again, that's yet another set of technical challenges we'd have to overcome.
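As a toy illustration of compressing raw frames into discrete tokens, here's the core quantization step used in VQ-VAE-style tokenizers: snap each patch vector to its nearest codebook entry. The random codebook is a placeholder; a real system learns it from data:

```python
# Toy sketch: discretize frame patches by nearest-codebook lookup,
# the core step in VQ-VAE-style visual tokenizers. The random
# codebook is a stand-in for a learned one.
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK = rng.normal(size=(8192, 64))      # 8192 codes, 64-dim each

def tokenize_frame(patches: np.ndarray) -> np.ndarray:
    """patches: (num_patches, 64) -> one discrete token id per patch."""
    # argmin ||p - c||^2 == argmax (p.c - 0.5 * ||c||^2), which avoids
    # materializing a (patches x codes x dims) distance tensor.
    scores = patches @ CODEBOOK.T - 0.5 * (CODEBOOK ** 2).sum(axis=1)
    return scores.argmax(axis=1)

frame_patches = rng.normal(size=(256, 64))  # stand-in for one frame
tokens = tokenize_frame(frame_patches)
print(tokens.shape, tokens[:8])             # 256 discrete token ids
```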
And then if we want to go even farther, into zettabytes, like 10 to the 20th tokens and beyond, now we have to look at three-dimensional scans, full audio streams, simulation logs, Internet of Things sensor data at city scale, and data across multiple industries in real time in standardized formats: healthcare, manufacturing, agriculture, autonomous vehicles. Our infrastructure has to handle tens to hundreds of zettabytes, and we need modular architectures that can handle multi-trillion-parameter scales. It's an absolutely mind-boggling task.
So there, in just a few minutes, I've taken you from where we're at now with trillion-token models, up to 10 trillion tokens, up to 100 trillion tokens, up to quadrillions, and then to the zettabyte era. I don't know when we'll get there, and one of the interesting things is, I don't know if we have to get there for artificial general intelligence that's meaningful. If you think about it, it is notable to me that Clairo was saying that product requirements documents are actually higher quality, in general, without reasoning models. In other words, the models we have today, the GPT-4-class models we use, are good enough for that piece of work. That doesn't mean I'm under the illusion that we have good-enough models for everything else and all general-purpose work; obviously not, we have other things to do.
But I don't think it's eminently clear that we need to get to zettabyte scale in order to have general intelligence, or even to unlock superintelligence and kick off a self-improving chain reaction of intelligence. We may be able to get there sooner. One of the reasons is the reasoning scaling law, which doesn't even touch any of this: you can scale reasoning by scaling compute at test time, regardless of the training data set size. So imagine having a much, much smarter trained model at the 10 trillion or 100 trillion token scale, and then layering reasoning on top of that as a second scaling law.
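To make "scaling compute at test time" concrete, here's a toy best-of-n calculation, a simplified stand-in for reasoning-style sampling where the numbers are illustrative: if one sampled answer is correct with probability p and candidates can be verified, extra samples buy accuracy without touching the weights:

```python
# Toy illustration of test-time compute scaling: with per-sample
# success probability p and a verifier to pick winners, drawing n
# samples succeeds with probability 1 - (1 - p)^n. More inference
# compute -> higher accuracy, no retraining required.
p = 0.30  # illustrative per-sample success rate

for n in (1, 4, 16, 64):
    success = 1 - (1 - p) ** n
    print(f"n={n:>2} samples -> {success:.1%} chance of a correct answer")
# n= 1 -> 30.0%, n= 4 -> 76.0%, n=16 -> 99.7%, n=64 -> ~100%
```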
And so when I think about the path over the next few years, I think it makes a lot of sense to be really honest about some of these big Lego blocks that are steps along the pathway to artificial-intelligence scale. We need to be honest that we have barely scratched the surface of high-quality tokens, but at the same time, getting 10x, 100x, 1,000x and more high-quality tokens is going to be very, very difficult. It's going to be difficult from a research perspective, just understanding how to do it at scale. It's going to be difficult from an industrial data-management perspective: building the data centers, ingesting the data, cleaning the data at a scale we've never touched before, and so on. It's going to be difficult from the perspective of serving these models, because they're absolutely huge models to serve. And we have that crosscutting factor of reasoning, where model makers will have to decide how big these models need to be in order to be useful and do economic work, especially if you can use reasoning as a sort of cheat code to scale up.
This is what was on Satya Nadella's mind when he was talking about the value of being a renter in the gold rush. He was talking about Azure and how he wants to rent data centers to people who are interested in model making, and he was implicitly saying he was kind of out of the model-making business as just a Microsoft project. I think he was seeing the immense data costs here, and he was like, you know what, I'd rather just rent these guys all the data centers they need; that's a money-making proposition.
Because if you think about it, this is an immense amount of effort, scaling up data like this, and the value you get is something you have to continue to accelerate in order to harvest. OpenAI right now is riding the tiger: they have gotten a little lead through o1 and o3, and they need to continue to hold on to that lead in order to keep making the case, and monetizing, for enterprise. So they are locked into the poker betting cycle for model makers; they can't get out. And I think Satya looked at that and thought: that's a very expensive game to get into.
You can put all of this investment in, and then you have to stay locked in, investing farther and farther, because right behind you are people like DeepSeek, who are relentlessly distilling models down, and people like Qwen, who are distilling DeepSeek models; and I think there was another Chinese model today that distilled even further. Now you have tiny, tiny models, billion-parameter models or less, that are really, really good, and those are only possible because of the original training work that was done on the GPT-4-class models. But the fact that you can train once and then distill across the world means that you can only harvest the value of these models if you keep maintaining your edge, if you keep growing and growing and maintaining your edge no matter what.
And so when I look at this token ladder, up from a trillion tokens where we are now to the zettabyte era, what I think is: OpenAI is kind of locked into this arms race. Anthropic is locked into this arms race. Gemini is pretty locked into this arms race. Arguably Meta is locked into this arms race. And whether they like it or not, I think some of the Chinese LLM makers are locked in as well. They all need to get to the winner's circle, they don't know how to define that, and they only know they have to keep climbing. That's very good for consumers and very good for builders, but there are immense technical challenges coming up here, and I'll be really curious to see how they balance the challenge of getting at these new high-quality tokens versus reasoning.
One more thought I'll leave you with: this entire ladder is very friendly to Nvidia. It's friendly in terms of GPUs; it's friendly in terms of the wearables, the lidar, and the robotics that have to happen for some of the later, sensor-heavy stages. You're not going to get robotics at scale, I think, without some kind of Nvidia stack underneath. And I might be wrong; maybe another company will come along that can scale up GPUs and sensor collection. But my gut says Nvidia is already here, with their big robotics push at the beginning of this year, and they intend to relentlessly build into this space the way they've relentlessly built into a bunch of other spaces in the past.
So we'll see. Anyway, some late-night thinking on where we're at with data. I just don't want to sit there and say that data is oil and we can't go get more. I want to think about the world in terms of tokenization, and I want to get more creative. So, you tell me.