# Meta Unveils Non-Generative VL-JEPA Model

**Source:** [https://www.youtube.com/watch?v=Cis57hC3KcM](https://www.youtube.com/watch?v=Cis57hC3KcM) **Duration:** 00:10:26

## Key Points
- Meta's former AI chief scientist Yann LeCun published a paper on "VL-JEPA," a vision-language model that uses a joint-embedding predictive architecture (JEPA) as an extension of the earlier V-JEPA design.
- Unlike generative models (e.g., ChatGPT, GPT-4) that produce text token by token, VL-JEPA is a non-generative system that directly predicts a meaning vector in semantic space and only converts it to words when required.
- This semantic-space approach makes VL-JEPA roughly twice as parameter-efficient and faster than traditional vision-language models while often delivering superior performance on image, video, and language tasks.
- The authors argue that intelligence is about world understanding rather than language generation, positioning VL-JEPA as a step toward agents that "think" first and "speak" later, an architecture with strong implications for robotics and embodied AI.
- LeCun's departure from Meta to start his own AI venture underscores his commitment to this philosophy, suggesting the VL-JEPA paper may signal a shift away from token-based large language models toward more grounded, meaning-centric AI systems.

## Sections
- [00:00:00](https://www.youtube.com/watch?v=Cis57hC3KcM&t=0s) **Meta Unveils Non-Generative Vision-Language Model** - Meta's FAIR team released the VL-JEPA paper, describing a joint-embedding predictive architecture that directly maps visual inputs to semantic meaning, bypassing token-by-token generation, cutting parameters in half, and promising faster, more efficient performance for vision-language tasks and robotics.
- [00:03:22](https://www.youtube.com/watch?v=Cis57hC3KcM&t=202s) **VL-JEPA vs. Cheap Vision Models** - The speaker explains that unlike low-cost frame-by-frame models that merely label each image, VL-JEPA processes a continuous video stream, builds temporal context, and only outputs stable, confident action descriptions.
- [00:07:10](https://www.youtube.com/watch?v=Cis57hC3KcM&t=430s) **Efficient Video Understanding via V-JEPA** - The speaker contrasts massive vision-language models with a streamlined 0.5-billion-parameter JEPA predictor, showing it outperforms token-based video classifiers while underscoring the richness and complexity of real-world visual data.

## Full Transcript
So Meta's AI chief released a new paper. Is this the beginning of the end for LLMs? Let's talk about it. Most of you know that Meta's AI chief scientist Yann LeCun has reportedly left Meta, or is leaving, to build his own AI startup. But before that, he and a group of Meta researchers put out a really interesting paper called VL-JEPA. It's a vision-language model built on a joint-embedding predictive architecture (JEPA), which you could call an extension of the V-JEPA architecture. This comes out of Meta's FAIR lab, with LeCun leading the work, and the thing I found super interesting is that unlike models like ChatGPT that generate answers word by word, VL-JEPA does something completely different. It's a non-generative model: it predicts meaning directly, not via text. The model builds an internal understanding of what it sees, images and video, and only converts that understanding into words if needed. Because it learns in a semantic space instead of a token space, it's faster, more efficient, and uses about half the parameters of traditional vision-language models while often performing better. And what this means for robotics agents is huge, so let's get into it.

One of the things I really want to point out, to show you how different this architecture is, is that the paper describes it as a non-generative system. A generative model like ChatGPT or GPT-4 produces tokens, words, one at a time, left to right, and every output must be fully written out to exist. To answer "what's happening in this video?", a generative model goes: okay, I'll decide the first word, then the second, then the third, until it finishes the entire sentence. It literally can't know the final answer until it finishes generating it, which is slow and painful. A non-generative system, by contrast, does not need to talk in order to think. VL-JEPA does not generate words by default. It doesn't predict the next token, and it doesn't need sentences to exist.
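To make that contrast concrete, here's a minimal toy sketch in Python. Everything here is invented for illustration (the `step_fn` rule, the dimensions, the `predict_meaning` helper), not the paper's actual API; the point is only that a generative model needs one forward pass per token, while a JEPA-style model produces its "answer," a meaning vector, in a single pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative style: the answer only exists once every token has been emitted.
def generate_tokens(prompt, step_fn, max_new=5):
    """Emit tokens one at a time; each step re-reads everything so far."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(step_fn(tokens))  # one forward pass per token
    return tokens

# Non-generative, JEPA-style sketch: one pass maps input to a meaning vector.
def predict_meaning(video_embedding, predictor):
    """No tokens are produced; words are only decoded later if asked for."""
    return predictor @ video_embedding

toy_step = lambda toks: f"w{len(toks)}"   # stand-in "next token" rule
print(generate_tokens(["what's", "happening?"], toy_step))

predictor = rng.standard_normal((16, 16))
meaning = predict_meaning(rng.standard_normal(16), predictor)
print(meaning.shape)  # one vector, produced in one step
```

The asymmetry is the whole story: the first function's runtime grows with the length of the answer, the second's does not.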
Instead, it predicts a meaning vector directly. Think of the difference like this: generative AI says "let me explain what I think while I'm still figuring it out," while non-generative AI says "I already know, and I'll only explain if you ask." And remember, this is the entire reason Yann LeCun cares about this so much: he has been saying for a long time that language is not intelligence. His belief is that intelligence equals understanding the world, and language is simply an output format. VL-JEPA reflects that philosophy exactly. That's why this video talks about what might come after LLMs: instead of thinking in language and reasoning in tokens, you're thinking in the latent space, reasoning in meaning, and language is actually optional. That's the paradigm shift this paper is talking about, and I think that maybe, just maybe, if this gains more traction, this could be post-LLMs.

So, essentially, what you're looking at in this video is a map of the model's internal understanding over time. Each dot is what the AI thinks is happening at that moment. The red dots are the instant guesses, while the blue dots are the stabilized understanding. And what you're seeing on the left is the vision model's view, what it would be able to see.
Now, what most people are going to ask is: how is this even different from a cheap vision model just describing exactly what the video is doing? The short answer is that cheap models talk, but VL-JEPA understands, so let's break down exactly what that means. A low-cost vision model, a describer, works like this: frame, then label, then frame, then label. It looks at each frame, guesses what it sees, and spits out text immediately: "hand," "bottle," "picking up canister." It's jumpy and inconsistent, has no memory, and is basically reacting, not understanding. VL-JEPA does this instead: it takes a video stream, maintains a continuous meaning, and then emits the event. It tracks meaning over time, building a stable understanding, and only labels the action once it's confident. That's why you see a red dot, an instant guess that might be wrong, it might say "bottle," and then a blue dot, the stabilized meaning: "canister." The reason this matters so much is that the cheap model will say "I see a bottle, I see a bottle, I see a bottle," while VL-JEPA actually understands the action and says "the action is picking up a canister." The killer difference, of course, is time. Low-cost models think in single frames and have no real sense of before and after. VL-JEPA thinks in temporal meaning: it knows when an action starts, continues, and ends. That's why it's extremely useful for robotics, wearables, agents, and real-world planning.
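That "stay silent until confident" behavior can be sketched with a simple running average over per-frame embeddings. This is a hand-rolled illustration of the idea, not the paper's actual mechanism; the prototype vectors, momentum, and threshold are all invented:

```python
import numpy as np

def stabilized_labels(frame_embeddings, prototypes, names,
                      momentum=0.9, threshold=0.9):
    """Keep a running-average meaning across frames and only emit a label
    once its cosine similarity to a class prototype is high enough.
    Rows of `prototypes` are assumed to be unit-norm."""
    state = np.zeros_like(frame_embeddings[0], dtype=float)
    outputs = []
    for emb in frame_embeddings:
        state = momentum * state + (1 - momentum) * emb  # temporal context
        unit = state / (np.linalg.norm(state) + 1e-8)
        sims = prototypes @ unit
        best = int(np.argmax(sims))
        # stay silent (None) until the evidence is stable and confident
        outputs.append(names[best] if sims[best] >= threshold else None)
    return outputs

# A noisy first frame looks like "bottle"; the running average locks in
# "canister" once more evidence arrives.
protos = np.array([[1.0, 0.0], [0.0, 1.0]])  # unit-norm class directions
frames = [np.array([0.6, 0.8]), np.array([1.0, 0.0]), np.array([1.0, 0.0])]
print(stabilized_labels(frames, protos, ["canister", "bottle"]))
# → [None, 'canister', 'canister']
```

A per-frame describer would have shouted "bottle" at the first frame; the smoothed state holds its guess until the similarity clears the threshold.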
And the reason the dot cloud matters is that it shows meaning drifting slightly from frame to frame, then locking in once enough evidence exists. That's something token-based models can't really do efficiently, because, number one, they need to keep generating text, and number two, they can't hold a silent semantic state. If you think about it, a cheap model is basically a CCTV motion detector shouting guesses, while VL-JEPA is a human watching and saying, "Ah, okay, he's picking something up."

Then, of course, you might want to understand the architecture diagram. This is the VL-JEPA model architecture; if you wanted to know how this works, this is it. But honestly, it was a little confusing, so I decided to get a simpler description; I actually used GPT Image 1.5 to generate this picture, because it's actually pretty good. And if that's too much, I also have this one right here: language is optional, understanding is not. Basically, the X encoder handles the visual input, the video frames. The predictor is the brain. The Y encoder handles the textual query, which is what you'd be asking it, and the Y decoder turns the encoded meanings back into words.
Then you've got the comparison of "thoughts," which is the training loss, meaning the model gets better over time, and finally the output: the correct answer, the actual meaning.

Now, if we look at the tests, this is currently the best. On the scoreboard we can see the other AI models, CLIP, SigLIP, and PE-Core, older, well-known vision models, compared against VL-JEPA base and VL-JEPA SFT, the fine-tuned version, and VL-JEPA is a really, really incredible improvement. One thing I think a lot of people are going to miss is that VL-JEPA is super, super small. You know how generative models are just tokens on tokens on tokens; but for something that actually reasons more like a human, look at the number of parameters and the number of samples seen: VL-JEPA is 1.6 billion parameters with 2 billion samples seen. It's remarkably more efficient than the other models we're looking at, which I think is pretty incredible. If we continue over here, you can see the zero-shot video captioning chart, which shows that with the same data and the same setup, VL-JEPA learns faster and reaches higher caption quality; predicting meaning learns faster than predicting words.
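Part of why "predicting meaning" trains cheaply is visible in where the loss lives. Here's a toy numpy sketch of that idea, with made-up matrices standing in for the X encoder, Y encoder, and predictor from the diagram; this follows the generic JEPA recipe, not the paper's exact objective. The error signal is just a distance between two embedding vectors, so no token-by-token decoder is needed during training.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_LAT = 32, 8  # invented toy dimensions
W_x = 0.1 * rng.standard_normal((D_LAT, D_IN))   # stands in for the X encoder
W_y = 0.1 * rng.standard_normal((D_LAT, D_IN))   # stands in for the Y encoder
W_p = 0.1 * rng.standard_normal((D_LAT, D_LAT))  # stands in for the predictor

def latent_loss(x, y):
    """The training signal lives in embedding space: compare the predicted
    meaning vector to the encoded target meaning. No tokens are generated."""
    pred = W_p @ (W_x @ x)   # predicted meaning of the visual input
    target = W_y @ y         # encoded meaning of the target (e.g. a caption)
    diff = pred - target
    return float(diff @ diff) / D_LAT  # mean squared error in latent space

x = rng.standard_normal(D_IN)  # e.g. video-clip features
y = rng.standard_normal(D_IN)  # e.g. caption features
print(latent_loss(x, y))       # a single scalar to minimize
```

Contrast this with a generative captioner, whose loss is a sum over every output token and whose decoder parameters all need gradients.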
Then there's chart two, zero-shot video classification, and it's the same story: VL-JEPA pulls ahead quickly while the visual language models improve very slowly. So even without fine-tuning, VL-JEPA understands videos better, and that undercuts the idea that you need token generation to understand things. It's clear that LeCun is onto something.
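Zero-shot classification in an embedding space needs no generation at all: embed the clip, embed each class label once, and pick the nearest. A minimal sketch of that pattern (function and variable names are mine, and the embeddings are dummies):

```python
import numpy as np

def zero_shot_classify(clip_embedding, class_embeddings, class_names):
    """Nearest class by cosine similarity: no fine-tuning, no generated text."""
    v = clip_embedding / np.linalg.norm(clip_embedding)
    C = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    return class_names[int(np.argmax(C @ v))]

# Dummy embeddings: the clip vector points closest to the first class.
classes = np.array([[1.0, 0.2, 0.0],
                    [0.0, 1.0, 0.1],
                    [0.1, 0.0, 1.0]])
names = ["picking up canister", "pouring water", "opening door"]
print(zero_shot_classify(np.array([0.9, 0.3, 0.1]), classes, names))
# → picking up canister
```

Adding a new class is just adding a row, which is why this setup scales without retraining.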
Once again, if we look at the actual sizes of the models, the visual language models are much larger and much less efficient, while VL-JEPA only needs about 0.5 billion parameters in its predictor, and there's no heavy decoder during training. So VL-JEPA gets better results with half the trainable parameters, which is pretty insane in machine-learning terms. And here we have Yann LeCun talking about this stuff; I think this was around two to three weeks ago.
>> A four-year-old has seen as much visual data as the biggest LLM trained on all the text ever produced. And what that tells you is that there is way more information in the real world, but it's also much more complicated. It's noisy, it's high-dimensional, it's continuous. And basically the methods that are employed to train LLMs do not work in the real world. That explains why we have LLMs that can pass the bar exam, or solve equations, or compute integrals like college students, and solve math problems, but we still don't have a domestic robot that can do the chores in the house. We don't even have level-five self-driving cars. I mean, we have them, but we cheat. We certainly don't have self-driving cars that can learn to drive in 20 hours of practice like any teenager.

And then, of course, I actually went on LeCun's Twitter and saw him reposting this from Sonia Joseph. Now, this is someone who works at Meta, and she essentially said: "We don't simulate every atom to model intelligence. We don't use quantum field theory to model road traffic. JEPA taught me the importance of learning physics at the right level of abstraction. Thank you, Yann LeCun and the JEPA team. It was a privilege to work with you." So I'll definitely take a look at this. As she puts it: the thesis behind JEPA is that our current models are not predicting causal dynamics. And if you both predict in latent space and predict the future, then you're more likely to abstract away all those pixel-level details. For example, when we model even this conversation right now, we don't have to model it down to the level of atoms; that would be so computationally costly and so inefficient. We model things at the representation that's suited for our goal. Similarly, JEPA is optimized to have physical representations at the level of abstraction it needs, which enables it to plan in the physical world and to do counterfactual reasoning about objects that are moving around.
>> Now, I did see a few comments on Reddit about the video saying that most of the actions it detects are wrong: "If you stop the video at any time to actually read what it says, it's really bad." And the same person says, "I stopped it like five times and they were all wrong. Made up a side of pizza, made up something else." But I think the most important thing here is not whether it's 100% right. The most important thing is that it's actually moving us in the right direction of where AI models should be, instead of getting completely distracted by chatbots.