# FAccT Highlights: Fairness, Safety, Benchmarks

**Source:** [https://www.youtube.com/watch?v=0sA6X6n3goc](https://www.youtube.com/watch?v=0sA6X6n3goc)
**Duration:** 00:40:28

## Summary

- Shobhit Varshney cautions that AGI still feels far off, predicting only “very intelligent machines” within the next five years rather than true general intelligence.
- Host Tim Hwang outlines the episode’s focus: the FAccT conference on AI fairness, an AI safety interview with Leopold Aschenbrenner and Dwarkesh Patel, and the latest developments in Retrieval‑Augmented Generation (RAG) benchmarking.
- The annual FAccT conference in Rio is highlighted as the premier venue for the newest research and debates on machine‑learning fairness, accountability, and responsible AI.
- A recent AI‑safety discussion explores methods for forecasting AI capabilities and updates on safety efforts at OpenAI.
- Benchmarking RAG systems is emphasized as a key indicator of industry progress, with experts weighing in on what current results reveal about the state of the technology.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=0sA6X6n3goc&t=0s) **AI Forecasts, Fairness & Benchmarking** - The segment opens with Shobhit Varshney’s hot take on near‑term AGI feasibility and introduces the Mixture of Experts podcast episode covering the FAccT conference, an AI safety interview, and the latest RAG benchmarking developments.
- [00:03:11](https://www.youtube.com/watch?v=0sA6X6n3goc&t=191s) **From Fairness Theory to Organizational Practice** - The speakers discuss the shift from defining fair machine‑learning algorithms to the practical challenges of implementing responsible AI within companies, covering education, internal workflows, copyright, and labor impacts of LLMs, with examples from IBM.
- [00:06:14](https://www.youtube.com/watch?v=0sA6X6n3goc&t=374s) **Ethical AI Governance in RAG Systems** - The speakers critique using isolated legal groups for ethical AI, argue for interdisciplinary oversight to ensure fairness and responsibility in retrieval‑augmented generation, and highlight incentives and liability risks that drive proper organizational structures.
- [00:09:23](https://www.youtube.com/watch?v=0sA6X6n3goc&t=563s) **Governance Bottleneck Drives Platform Shift** - The speaker argues that lengthy governance approvals are throttling AI project delivery and advocates consolidating RAG processes into a unified platform with pre‑approved accelerators to eliminate the delay.
- [00:12:47](https://www.youtube.com/watch?v=0sA6X6n3goc&t=767s) **Choosing and Scaling Enterprise AI** - The speakers highlight enterprises’ growing focus on selecting the right model size and deployment approach for responsible AI, and on scaling workflow steps—such as improving OCR accuracy with larger language models—to boost task performance.
- [00:15:49](https://www.youtube.com/watch?v=0sA6X6n3goc&t=949s) **AGI “Situational Awareness” Sparks Policy Attention** - The speaker explains how Leopold Aschenbrenner’s extensive “Situational Awareness” essay on artificial general intelligence has vaulted into public and congressional focus, highlighting the contrast between superintelligence hype and the practical, everyday use of AI technology.
- [00:19:00](https://www.youtube.com/watch?v=0sA6X6n3goc&t=1140s) **AI Compute, Nuclear Analogy, Governance** - The speaker likens future AI power demands to nuclear reactors, arguing that merely adding compute can't fix poor data or algorithms, and insists that such massive, potentially hazardous technology should be overseen by national governments rather than private entities.
- [00:22:04](https://www.youtube.com/watch?v=0sA6X6n3goc&t=1324s) **Predicting AI Safety and Trends** - The speakers discuss governmental safety mandates, access and transparency at major AI firms, and debate whether linear extrapolation of current developments can accurately forecast AI capabilities by 2026.
- [00:25:11](https://www.youtube.com/watch?v=0sA6X6n3goc&t=1511s) **Powerful Models, Complex Trade‑offs** - The speakers argue that while AI models grow more powerful, their accuracy isn’t guaranteed, prompting concerns about escalating compute demands, concentration of control, environmental impact, and the necessity of responsible, purpose‑driven development.
- [00:28:20](https://www.youtube.com/watch?v=0sA6X6n3goc&t=1700s) **AGI Emerging, Not Yet Distributed** - The speaker contends that AGI‑level abilities already exist in narrow AI, will disseminate gradually across tasks, and that future super‑intelligent machines will surpass humans in knowledge sharing and collaboration, leading to safer, more powerful AI development.
- [00:31:25](https://www.youtube.com/watch?v=0sA6X6n3goc&t=1885s) **RAG Failure Points and Evaluation** - The speaker outlines how poorly formed queries cause incorrect retrieval, leading to hallucinated or incomplete answers in RAG systems, and stresses the need to assess relevance, answerability, faithfulness, and completeness.
- [00:34:29](https://www.youtube.com/watch?v=0sA6X6n3goc&t=2069s) **RAG Benchmark Saturation and Limits** - The speakers discuss how RAG evaluation metrics quickly become saturated, lack a universally accepted standard, and drive the community toward incremental improvements and new directions such as AI agents.
- [00:37:31](https://www.youtube.com/watch?v=0sA6X6n3goc&t=2251s) **Debating RAG Definitions and Evaluation** - Participants argue that RAG encompasses more than simple retrieval, highlighting the need for routing, structured/unstructured data integration, and new metrics for end‑to‑end accuracy.
## Full Transcript
Shobhit Varshney: I've never seen, uh, like AGI being more plausible
than we are standing right now.
So my hot take, where we are right now, 2024, um, five years out, I
would see us, uh, be able to get to very, very intelligent machines.
Tim Hwang: Hello, and happy Friday.
You're listening to Mixture of Experts.
I'm your host, Tim Hwang, back again.
Each week, Mixture of Experts distills down the week's most
important headlines and chatter in the world of artificial intelligence.
From research papers and product announcements to ethics governance and
just plain gossip, we've got you covered.
This week on the show: first, the annual ACM Conference on Fairness,
Accountability, and Transparency, or FAccT, is happening this week in Rio.
We'll talk about the latest developments in ML fairness and
the state of responsible AI.
Next up, Leopold Aschenbrenner's AI safety screed Situational Awareness hit the airwaves with a widely talked-about interview with Dwarkesh Patel: what's the best way to forecast AI capabilities, and what's going on with safety at OpenAI?
And finally, benchmarking, benchmarking, benchmarking.
This week we talk about the latest in RAG benchmarking and what it tells
us about the industry as a whole.
As always, I'm joined by an incredible group of experts who
will help us cut through the noise and drop some hot takes as we go.
Vagner Santana, Staff Research Scientist, Master Inventor, and importantly,
debuting for the first time on MOE.
Vagner, welcome to the show.
Vagner Figueredo de Santana: Thanks for having me.
Tim Hwang: Uh, next up, Marina Danilevsky, Senior Research Scientist.
Welcome back to the show.
Marina Danilevsky: Thanks, happy to be here.
Tim Hwang: And Shobhit Varshney, who has been with us since episode number
one, uh, senior partner consulting on AI for US, Canada, and Latin America.
Shobhit, welcome back to the show.
Shobhit Varshney: Absolutely love these.
Thanks for having me again.
Tim Hwang: All right, well, let's just jump right into it.
So the first story I want to cover is the annual FAccT conference
is happening this year in Rio.
So for those who don't know, it is, uh, arguably the leading conference on topics of machine learning fairness and responsible AI.
And I thought this would be a good jumping off point just because if you've
been watching this space for some time, responsible AI and ML fairness has become
kind of a buzzword that lots and lots of people have used in recent years.
And I think these conferences are a good time to check in on what this kind of,
you know, state of play is in fairness and accountability questions in AI.
And Vagner, one of the reasons I wanted to have you on the show was, um, you've
been watching kind of the papers and the chatter around the conference.
Maybe I can just kind of toss it to you first for our listeners, uh, any sort of
patterns or trends that you've noticed, I think this year, um, at FAccT, if there's
particular papers that you think people should check out, just curious about
your review or your kind of thoughts on, um, what you're seeing out there,
um, uh, at this year's conference.
Vagner Figueredo de Santana: Well, there are interesting discussions around, um,
uh, synthetic data around how, uh, how people are using LLMs to create data
and then also to assess LLMs using LLMs.
So there, there's this discussion going on also about, uh, responsible AI.
Uh, one of the papers I selected to discuss with y'all, I think, has to do with how people are learning about responsible AI on the job.
I think that that's important because people are getting
interested and people are following and trying to find resources.
But then that comes with all the complexities of
working in an organization.
So that is one aspect as well.
And, well, the other aspect connects with copyright, and also how to deal with all the labor that is being impacted by the, the, uh, the use of LLMs in a wide range of jobs around the world.
Tim Hwang: Yeah, for sure.
And I did want to pick up on that second theme specifically; you know, as is true with all these conferences, there are many more papers than you'd ever have time to discuss.
But I think what's so interesting about the responsible AI topic is, you know, this is really an evolution, I think, in fairness in ML, where I would
say even a few years ago, basically a lot of the attention was like, can we
define in computational terms, what a fair machine learning algorithm is?
And it kind of feels like there's a lot more work now that's happening in
this much broader question, which is okay, well, we have all these techniques
and approaches around fairness in ML.
How do we actually get like an organization to implement it?
How do we get people to learn about it?
What are the techniques that people use?
And, you know, Vagner, I think in addition to your research, it kind
of sounds like you've been doing some work on this internally within IBM.
And so I'm kind of just curious if you want to talk a little bit about that
paper that you mentioned and then just kind of map it to your own experience.
I think I'm curious about, like, what you're sort of learning as someone who's,
you know, very much in the trenches, you know, trying to get this work to work.
Vagner Figueredo de Santana: Yeah, and one, one of the aspects that, um, the paper covers has to do with incentives.
And we need to be aware of the symptoms that our organizations have before
thinking about responsible AI, because otherwise, uh, well, we'll be facing
a lot of blockers all along the way.
Um, and also, uh, there is an interesting aspect that the paper highlights about, um, the discipline identities that we have when we are, like, in hard technical teams; they have their own discipline identities.
And, uh, when they are looking for, uh, let's say, resources about responsible AI, they'll probably go to resources they are used to looking for.
So they're going to be looking for technical, uh, technical libraries or metrics.
And sometimes we need to go beyond this, um, beyond our own discipline identities, and look for other skills and other disciplines, and learn more about, let's say, sociotechnical impacts, right?
Go beyond focusing on, let's say, some fairness metrics for folks, uh, focusing on more technical aspects.
And the other way as well, right?
For people, uh, thinking about, uh, let's say, indirect impacts on society, they also need to be aware of the daily job of data scientists and coders and researchers, and how can we, like, make this connection, right?
Tim Hwang: Yeah.
I think your rundown runs into, or, I think, highlights a bunch of the issues.
Uh, you know, it was a joke that I had with a friend for a while that, like, oh, the main thing a lot of big companies would do when they wanted to do, like, ethical AI or responsible AI would be like, well, we're going to create this, like, secret group of lawyers that will just determine everything.
Um, and it was like, this is not a good way of doing, you know, things.
Um, and I guess I'm kind of curious, I mean, Marina, if I can bring you into this discussion, you know, um, I guess maybe one thing I'd be curious about is, like, how you all think about things like fairness and responsible AI in RAG, right? Where you're literally trying to pull information from another source.
Um, and, and I guess I'm kind of curious about like if you've kind
of, you know, have thoughts on this particular discussion, right?
Like how should organizations kind of best organize themselves to, to do this right?
Because I think part of it is this kind of interdisciplinary crosstalk, which
I think organizations of different size, you know, do better or worse
at, you know, in different capacities.
Marina Danilevsky: I think something that Vagner said about incentives
really pops up here as well, which is why should you care?
Well, it's because you would like your customers finally
to be using your rag system.
And if it is giving answers that are not, um, not even so much fair, but there's
a risk that, uh, it's going to give something that is irresponsible that is
going to lead to your users being misled or being upset or taking legal action,
then your solution is not going to be, uh, it's not going to be taken.
So actually, because we usually are looking at enterprise use cases, we
are very, very incentivized to make sure that we are communicating things
that are, you know, fair, ethical, regardless of what our own ideas are.
It's because if we don't succeed in that, it will not be purchased.
The risk is too high.
Um, there's too many, you know, fun stories in the news about,
uh, what happens when you don't pay enough attention to that.
Tim Hwang: So you're actually seeing that.
Cause I think, I dunno, I had a Fear, you know, um, which I still kind
of have, which is like, maybe this discussion is gonna become a little
bit like, um, like data privacy.
Where like, I think early on there was kind of this idea that like, oh, well
the minute there's a really big data breach, then everybody's suddenly gonna
care about data privacy and security.
And like consumers will all prefer the, the better privacy option, right?
But then I think you can make the argument that one of the things that's happened is that there's just, like, so many huge data breaches now, so many big failures, that, like, almost the Overton window has shifted.
We're just kind of like, Oh, you know, someone leaked
billions of customer records.
I guess that just happens.
But it sounds like that is something you're seeing, at least in fairness: because we've had all these high-profile failures, it's not that people have become resigned to it; it actually still remains kind of like a thing that people are really concerned about.
Shobhit Varshney: We've seen this quite a bit, right?
If you look at, uh, the AI culture, the AI framework, responsible AI framework, and what we should prioritize
and which ones are high risk, how do you categorize use cases, so on and so forth.
And there's actual tooling and platforms that are needed to go
drive these at a price, right?
And those three layers have to be addressed one by one.
Um, the AI culture around, hey, look at your day to day workflows and see
where you can apply AI, and you have to do this in a responsible way,
and here's a framework around it.
Unfortunately, the reality on the ground for most of the Fortune 100 companies that I work with, the responsible AI team: it's easy to go create a governance board, and you go to the governance board for guidance and coaching and making sure that you're not doing the wrong things.
Unfortunately, it becomes, I'm going to quote Lord of the Rings, Gandalf, standing
on a bridge and saying, you shall not pass, right, go back to the shadows.
So we've, we've somehow created a, uh, forcing function that anything that goes to the governance board adds about two months of delay to a project.
So the value of the unlock for the business gets diminished and I might
as well not even deal with this and I should just go stick to my RPAs
or automation scripts or regular AI stuff, and that'll be just fine, right?
So they've become a rate limiting step at this point, and that
has to fundamentally change.
And for that, the next layer that I was talking about in terms of platforms,
that becomes more and more critical.
So instead of saying that, hey, you need to go figure out all these 20 different checklists in your RAG pattern so I can know exactly where the data is coming from, there's, uh, there's metrics that I need to report against, and so on and so forth, now you start to move towards a platform approach where you say: hey, use the platform's pre-approved accelerators for all the RAG.
When we looked at RAG patterns within IBM Consulting, within a week's time we had like 121 different ways in which people were doing RAG, and we said, guys, time out, we've got to go consolidate.
We'll create Scribe Flow, we'll create a mechanism that has the best of all techniques in one single spot, right?
So when you start to get to a platform, then you come to a point where the governance boards are pointing you towards accelerators versus becoming a "you shall not pass" moment, right?
I think that's the whole flow: culture leading to governance, then leading to the stack.
And we're doing this with a lot of our Fortune 100 companies.
Recently, a couple of weeks back, I had Pepsi on stage with us where we're
talking about how we're helping them build a culture of responsible AI and
the frameworks and so on and so forth.
This is one of the many examples where we've had to go do this end to end
from culture to frameworks to actual tooling that goes and deploys that.
Tim Hwang: Yeah, that's really interesting, and actually, I should add
that I'm surprised that it has taken this long for us to get to a Lord of
the Rings reference, so I think episode six right now is the first one where we've actually, uh, heard one.
Yeah, exactly.
Well, and I think, I don't know, I mean, maybe one last nuance to kind of touch
on, I'd be curious to get the panel's thoughts, and Vagner, maybe to throw
it back to you, is like, you know, so for the last few episodes, we've all
been very excited about open source.
And it feels like part of the problem of open source is that, you know, suddenly, like, your fairness methodologies are almost competing with, like, just
being able to like pull something off the shelf and like deploy it in any
reckless way that you really want to.
Um, and I guess I'm kind of curious about like how we think about sort
of responsible AI going forwards in a world where like anyone can just
pull AI off the shelf and use it.
Um, because it feels like in a world where like maybe there's only two or
three platforms, you really can say, okay, well, if you want access to this advanced
technology, you're going to have to go through this additional compliance cost,
even if it takes you, um, a little bit more time, but that kind of lever is, I
don't know, from my point of view, seems to be like breaking down a little bit
as it becomes more and more accessible.
I don't know if you'd buy that.
You might also just say, Tim, you're totally wrong, but I don't know, Vagner, if you've got any thoughts on that, or anyone really.
Vagner Figueredo de Santana: In terms of open source, I think that the interesting aspect is that, uh, people, um, have more transparency, as we all know.
And also thinking about, uh, well, fully open source models: people are also discussing that when you only have the model and you don't know the data used to train the model, you just have an open source model; and when you know more about the model, then you have fully open source, right?
So I think that that's important, and people are getting more and more interested in that.
And when we talk to clients, they are also interested in, um, finding the right model for the right task.
I think that that is also interesting.
Uh, that, uh, pattern that is emerging, like people discussing,
okay, is this the right size of model for solving my problem?
Is this, uh, like, is this language model or, uh, generative AI fully open?
Uh, can I host that in my own private cloud?
So these are questions that are appearing when we talk about responsible AI right now.
Shobhit Varshney: Yeah, we, uh, again, I'm coming in from a very
enterprise approach to this.
My, my square focus is how do we scale these, right?
And when you look at a, uh, step by step process, any workflow that's happening
in an organization today, right?
Seven different steps.
Uh, step number one, you're going to pull some data.
Step number three, you're going to do some fraud detection.
Step number four, now you're going to go extract something from a
document, an invoice, a contract, something that came to you, right?
Now, say I was able to do OCR and pull that out with about 80 percent accuracy.
So far, that's the best we could do.
All of a sudden, we have LLMs, and we say, hey, I have reason to believe that I could potentially get about 90, 92 percent accuracy and squeeze more out of this document, right?
So now you're seeing about 10, 12 points of additional benefit that you can derive from it, right?
At that point, we stop and say: if you have reason to believe an LLM could do this, let's talk about constraints.
The constraints around cost and envelope, right?
How much can I afford if I'm doing this a thousand times or a million times?
There's a different ROI attached to it, right?
Then there is security: where the data resides.
Models follow the data gravity; we deploy them closer to where the restricted data sets are, and so forth.
Then you start to look at how quickly you need an answer.
The latency of a model matters.
You're trying to start figuring out: okay, uh, from a compliance perspective, when I have to go explain to somebody
how I came up with this answer, which means I need auditable responses, I
need more deterministic responses in certain use cases and so on and so forth.
So you come up with a set of constraints, and given those five, ten different
constraints, now you have two or three good athletes that you start to test with.
And then from there on, we start to move towards metrics and
see which one is giving me more versus the others and so forth.
But it's very critical: at a step level, at a sub-task level, you're trying to figure out which LLM is going to do the job.
And we're getting away from, hey, earlier we said, hey, can a GPT-4 model do the entire workflow end to end?
So we'll talk about that in a little bit, but I think we're
still at the sub task level.
We're surgically infusing AI and Gen AI and seeing if it can do
this one thing incredibly well.
I'll take care of the rest before and after.
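The per-task selection flow described above (quantify the accuracy uplift, filter candidates against hard constraints like cost envelope, latency, data residency, and auditability, then shortlist two or three "athletes" for metric testing) can be sketched roughly as follows. The 80-to-92-percent OCR figures come from the discussion; every model name, cost, and threshold below is invented for illustration.

```python
# Illustrative sketch of the per-task LLM selection flow described above.
# The 80% -> ~92% OCR accuracy figures come from the discussion; every
# model name, cost, and threshold below is made up for illustration.

# Step 1: quantify the uplift. Moving extraction from ~80% to ~92%
# accuracy over a million documents avoids roughly 120,000 bad extractions.
docs = 1_000_000
baseline_acc, llm_acc = 0.80, 0.92
fewer_errors = round((llm_acc - baseline_acc) * docs)

# Step 2: filter candidates by hard constraints (cost envelope, latency,
# data residency, auditability), leaving a short list of "athletes".
candidates = [
    {"name": "model-a", "cost_per_1k_calls": 40.0, "latency_ms": 900,
     "regions": {"us"}, "auditable": True},
    {"name": "model-b", "cost_per_1k_calls": 4.0, "latency_ms": 250,
     "regions": {"us", "eu"}, "auditable": True},
    {"name": "model-c", "cost_per_1k_calls": 1.0, "latency_ms": 120,
     "regions": {"eu"}, "auditable": False},
]

def shortlist(models, max_cost_per_1k, max_latency_ms, region, needs_audit):
    """Keep only the models that satisfy every hard constraint."""
    return [
        m["name"]
        for m in models
        if m["cost_per_1k_calls"] <= max_cost_per_1k
        and m["latency_ms"] <= max_latency_ms
        and region in m["regions"]
        and (m["auditable"] or not needs_audit)
    ]

# Step 3: the survivors go on to task-level metric testing.
finalists = shortlist(candidates, max_cost_per_1k=10.0,
                      max_latency_ms=500, region="us", needs_audit=True)
```

Only the filtering step is mechanical; as the discussion notes, the finalists still go through task-level metric comparison before one model gets the sub-task.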
Tim Hwang: Yeah, no, I think it makes a lot of sense, and I think goes to
this really interesting question, which we won't have time to address
today, but we should do on a future episode is, you know, what's that
mean for responsible AI, right?
Because it's like, you know, you have lots and lots of sub modules
that may have, you know, various, various different types of problems.
There's almost kind of a question about whether or not any one deployment is
responsible, but then whether or not the whole system hangs together is a
whole nother set of analysis, right?
That actually is another question.
Great.
So I want to move us to the second topic of today.
This is a really big week if you track the discourse around
artificial general intelligence.
Leopold Aschenbrenner, who is a former OpenAI Superalignment team
member, published this massive online screed called Situational Awareness.
Um, This would have kind of existed, I think, as sort of a weird, obscure
screed, but Leopold ended up doing an interview with Dwarkesh Patel, the
sort of influential tech podcaster, and this story and this document
has now just gone everywhere.
So I've caught up with friends who, you know, work in policy in D.C.
saying, we're getting calls from congressional offices saying what
our take is on situational awareness.
So I just want to take a quick breather here, um, uh, because the claims of Situational Awareness are quite breathless.
Um, the argument is if you take all the existing trends in AI and you project
linearly, we will reach a point where AI becomes, you know, sort of, um,
uh, transformational in its impact.
And so I think this is kind of a great opportunity to bring you, Shobhit, into this conversation, because the way I sort of see the discourse is that there's a circle of people who are in, like, AGI superintelligence land, right?
Who are like the AI is going to take over the world, right?
But then I think like there's this vast group of other people who are just
like doing work with the technology, who are like talking to companies
that are implementing the technology.
And I'm kind of curious, as someone who's, like, really right, you know, at the front lines of that: like, are companies asking, situational awareness, do we need to be worried about AGI?
Like, does that even enter into the commercial discussion?
Or is this, like, almost in a parallel dimension?
Shobhit Varshney: I've never seen, um, like, AGI being more plausible than where we are standing right now.
So, My hot take, where we are right now, 2024, um, five years out, I
would see us, uh, be able to get to very, very intelligent machines.
Now, the definition of AGI has been very weak, right?
Everybody has their own interpretation of what Artificial
General Intelligence would mean.
And even if you compare two different people, it's very difficult for us to really have a good metric on whether this person on the show is really intelligent or not, right?
Like if you ask my wife or my kids, you'd have a very different answer.
So it's a very difficult point, even being able to define what AGI looks like, right?
But if you just talk about intelligence, we've been doing an incredibly good
job at making progress every two years.
If you, like, step away from half-a-year increments and look at a two-year horizon.
Right?
When, uh, GPT-4, uh, stopped training, and, you know, they've, they've discussed this, in 2022, you're looking at about a half-a-billion-dollar spend, uh, about 10 megawatts, uh, about, uh, 25,000 A100, uh, GPUs from Nvidia at that point, right?
That's kind of, kind of what they, uh, must have spent doing this.
In 2024, today, you can have a hundred thousand H100 equivalents.
You're seeing how much investment Meta and others are making into this.
Right?
And then you had this huge, big announcement from OpenAI and
Microsoft that they're going to establish a hundred-billion-dollar supercomputer.
Now we're talking about something that starts to get
into the 2026 timeframe, when you can potentially have a gigawatt cluster.
You can have this big, giant machine, and the power needed for it would be
equal to, say, the Hoover Dam, right?
Or a nuclear reactor.
Right.
So now you start to say that I can solve a lot of tough
problems by throwing more compute at it.
That's just part of the equation, right?
There's better algorithms, there are better data that's needed.
It's a combination of those.
You can't overcorrect for bad quality data with having more compute.
So we're getting to a point where now you would get to more and more
compute power being available.
If you keep extrapolating that out, I see a situation where we would have more than
a nuclear reactor attached to one of these big machines, and you can then have
a huge cluster that's just intelligently crunching through numbers.
I think what he's extrapolating is that by 2030, we'll be at 100 gigawatts.
I think that's a stretch.
That's about 20 percent of U.S.
electricity production, but it does bring in a few different
aspects: the safety of the AI, who should have access to it, nations versus the private sector.
We solved for that with nuclear energy by saying, hey, only the
big national governments should have access to nuclear power, right?
To a nuclear arsenal.
And then we trust that there's a mechanism in place, with checks and balances
in the government, for access to something that's super foundational,
with such a massive impact on humanity.
It should be in the hands of governments.
But if you start to look at some of the world leaders right now, a lot
of them, with potential elections coming up and such, don't quite
understand what we're dealing with.
I'm just looking at the axis of AI.
I could easily see a geopolitical issue here around
which country has those clusters.
If you think about Oppenheimer, if you go back in time and look at what we did
during that stage, you would not want to have that entire establishment
in a different country, right?
The U.S. went out of its way to ensure that it was being built
inside the United States, right?
So you'll see a lot more concentration of AI superpowers,
and how much they're investing in building the energy requirements,
building these massive clusters.
And if you follow the trajectory of electricity production: I was giving
a talk recently on the impact of AGI and supercomputers and such.
And I had looked at this detail around the per capita electric generation, right?
How much electricity does each of the countries generate?
And if you just look back at the last 30 years, the United
States has declined 5 percent in electricity production per capita.
The United Kingdom has gone down 23 percent in the last 30 years.
China has gone up nine times in energy production per capita, right?
So you're starting to see axes of power: who has access to what kind of
energy, who has access to what kind of compute power.
And then, to your earlier point, Vagner, once you start to open up these
models, with open weights being available, you're essentially giving people
the recipes for how to go replicate these things on your own, right?
So I think we're at this weird intersection of private versus government.
And then, does the AI intelligence then dictate geopolitical power?
And when does that tip over?
At what point does the government start getting really,
really serious about safety?
Who has access to these technologies inside of OpenAI, or the big tech
giants, and things of that nature?
How open are you about that, about what data is going in, and so forth?
I'm just very fascinated by the impact it's going to have.
Tim Hwang: Yeah, it's actually, I don't know, I feel like
you surprised me there, actually, right?
I thought you were going to go in a completely different direction.
I feel like when I talk to many kind of folks who are like in enterprise
on the business side of this, they'll basically say, This is not happening.
This is not realistic.
Like, you see what's happening with AI right now; it's never gonna be like
what this guy Leopold says.
And it feels like you're actually going the opposite direction.
You kind of say, look, you take all the existing trends, you extrapolate them
out, and we're gonna be in a really weird place in 24 months.
I guess, Marina, Vagner,
I'm curious if you two sort of like agree with this kind of assessment or, you
know, from the researcher side, is it right to say, hey, these linear trends
are basically what we should use to think about capabilities in, say, 2026?
Marina Danilevsky: Tim, we all know that nothing wrong has ever
happened from linear extrapolation.
In the history of
Tim Hwang: humans, it's very dependable,
Marina Danilevsky: very dependable.
This is always how things go.
Um, I, I think I do have a bit of a different perspective than Shobhit
and maybe a little bit more like the one that you had said where,
yeah, I don't, I don't agree with all of the linear extrapolations.
And of course we all can have the perspectives that we have
on how things are going to go.
But I think that even if you continue to throw more compute, more data at
the way that AI is currently implemented, we're just in another wave.
We've gone through waves before.
We're in the current wave.
There are, to my mind, limitations to what you're going to be able to achieve.
And it is not completely clear how you will actually get to
an AI never recommending that you put glue on pizza, just because you
gave it more compute and more data.
Um, and so I think that while we are closer, we're still not there.
And in my mind, there's at least another one or several technological waves that
need to come before we really get there.
So is there going to be a lot of interesting things coming?
Sure.
The points that Shobhit raises about accessibility and who gets to actually
have these models, that has a lot of really interesting implications for
being able to disseminate misinformation, to
have an impact on how people perceive information, and so on and so forth.
Do I think that that's going to get to AGI?
Personally, no, but it doesn't mean that it's not going to get
to places that are very impactful.
Tim Hwang: Yeah, and I think there's actually one thread, and
maybe I'll throw it to Vagner.
I'm curious about your thoughts that I hadn't really thought about,
which is very, very interesting.
It's kind of, I guess, this bet about, like, what
does compute actually get you?
Right?
Um, there's kind of one view, which is: so long as I feed in more data and more
compute, the representation in the model will eventually just become accurate.
Like, we'll solve the "just eat rocks" problem by basically, you
know, kind of computing our way out of the problem.
I guess, Marina, you're kind of saying, and I don't want to mischaracterize you,
that there are actually some genuine questions as to whether
that will even happen, right?
Like the models will become more powerful, but they might not
necessarily become more accurate, right?
We normally think about things getting better as, you know, kind
of trending in a certain direction.
I guess you're saying we can see improvement, but it might be very
multidimensional in a way that kind of is a little bit counterintuitive, I think.
Yeah.
Vagner Figueredo de Santana: Yeah, I think the issue with requiring more and
more compute to improve the already really large models is that we'll be
seeing fewer and fewer organizations controlling everything.
So in terms of responsibility, I think that's something
that may be concerning.
And in terms of environmental impacts, people are also
thinking a lot about the energy that these models are requiring,
not only for training but also for inference at scale, right?
So balancing all of these, I think
that's a big challenge.
And in terms of responsibility, right?
What we are always trying to think about is: do we
really need to create this technology right now?
What are the problems that we're going to solve?
Uh, can we solve the problems that we have with the technologies we already have?
Right, there are a lot of interesting questions, and, well, in
terms of responsible AI, we need to think about these all the time,
not only as an afterthought, right?
Tim Hwang: Like part of the responsibility might just be like, no AI.
Shobhit Varshney: I think the cost and impact of this is going
to start plummeting, right?
Just look at the compute power that's in your phone
today; the cost is plummeting.
So over time, we'll solve for this.
I think from enterprise perspective, Tim, your original question.
I think we are over complicating how work gets done in organization.
If you have access to hundreds or millions of MIT and Harvard and
Stanford grads, and you put them on something very mundane and say: you're
going to do procurement analysis.
You're going to get an invoice, and you're going to compare it
against something, right?
That's the kind of work that happens in an enterprise, right?
So I have reason to believe that if you put a really intelligent person, or an
equivalent of a person, digital labor, inside a particular workflow, that
sub-task will get done very, very well.
There are all kinds of guardrails and such that you can create
around that particular task.
So if you look at it in levels, the first level is, can I do a subtask really well?
In the previous discussion, I said step number four, I'm
going to extract something out.
Can I do that task really well?
And that starts to become a specific unit of work.
Then you go one level up and say, today a human asks each machine,
each LLM to go do different steps.
Can I replace that with an orchestration where an LLM agent can figure out a
plan, manage the memory and stuff, and automate the entire flow end to end?
There's a very plausible path for us to get to figuring out how each step is
done, auto-orchestrating all of those workflows, and now you start to move up
the hierarchy of what a human supervisor would have done
versus what a summer intern would have done.
And you always double check what a summer intern does.
That's where we are today.
And over time, you see a progression towards work itself getting
automated with very, very high accuracy, especially with the cost
of AI just plummeting over time.
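The progression Shobhit describes, from a human invoking the LLM at each step to an agent that plans and runs the whole flow, can be sketched roughly like this. Note that the planner and step executor below are stubs, and the function names are purely illustrative; a real agent would call a model API and external tools:

```python
# A minimal sketch of "level two" orchestration: the agent drafts a plan,
# executes each sub-task, and keeps a shared working memory, so no human
# has to invoke the LLM step by step.

def fake_llm_plan(goal):
    """Stand-in for an LLM that decomposes a goal into sub-tasks."""
    return ["extract_invoice_fields", "match_against_po", "flag_discrepancies"]

def run_step(step, memory):
    """Stand-in for executing one sub-task; a real agent would call tools here."""
    result = f"done:{step}"
    memory[step] = result  # shared working memory across steps
    return result

def run_agent(goal):
    """Plan once, then execute every step end to end."""
    memory = {"goal": goal}
    for step in fake_llm_plan(goal):
        run_step(step, memory)
    return memory

memory = run_agent("procurement analysis")
```

The guardrails Shobhit mentions would wrap `run_step`, validating each sub-task's output before it enters memory; that is where the "double-check the summer intern" stage lives today.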
Tim Hwang: Yeah, that's almost a great way of thinking
about it, Shobhit, to your earlier point about AGI having
this very amorphous definition.
It's almost interesting thinking about the idea like William
Gibson has this quote, right?
Like the future is here.
It's just not widely distributed yet.
But kind of what you're saying is like AGI is here.
It's just not widely distributed yet.
Like, for certain types of tasks, the AI that we have right now
can do all the possible, yeah, job tasks, right?
Um, and you're basically just kind of talking about like how far up the
organizational chain this thing will go.
Shobhit Varshney: I don't think, I don't think we'll get to a point where we'll have
a crisp definition of what AGI is and we'll say, hey, today, rah rah, open
the champagne, we've reached it, right?
It's incremental progress.
It's a different definition in each field, in each domain, in each task, right?
So I think the right frame is to say that machines will get super-
intelligent over time, and they will exceed human intelligence in certain tasks.
And one thing that humans don't do really well is share our knowledge
amongst ourselves, right?
If you put two experts in a room, it's very difficult for them to actually
go at a problem together, right?
We don't do a really good job at expanding out using the network effect.
I think that's gonna change when you have super intelligent machines
that can talk to each other.
And drive better safety, better research, and start
to build better algorithms all together.
Right.
So I think I'm very excited about the direction that we're going.
Tim Hwang: So I want to move us on to the final topic, um, uh, and the way I
want to tee this up is that there's this famous, uh, clip of Steve Ballmer when
he was CEO of Microsoft, where he's like, if you've seen Steve Ballmer before, he's
like this big muscular guy, he's like very sweaty on stage, and he's just shouting,
Developers, developers, developers.
And I kind of feel like if you played that scene again today, people
would be like: benchmarking, benchmarking, benchmarking.
Um, because I think it is becoming such an important aspect of sort
of like the supply chain of AI.
Um, and you know, there's lots and lots of things we could
talk about with benchmarking.
We have, and we will continue to.
Um, but I think, Marina, particularly with you on the line, I figured
it would be great to kind of zoom in specifically on RAG.
Um, and do a little bit where we kind of talk to the listeners about essentially
what's happening in RAG benchmarking.
Um, and then I think from there kind of talk a little bit about what that
tells us about how benchmarking in the industry is evolving, uh, not
just in the industry, but I would say in research as well, uh, as a whole.
But, um, if you will, I wanted to kind of throw it to you and say,
if it's possible, we'd love a short crash course on how people think
about measuring the quality of RAG.
And I think that'll almost give us something very concrete to talk about
in terms of benchmarking generally.
Marina Danilevsky: Sure.
Sounds good.
So I will say, benchmarking has been around for a very long time.
It's always been something that was extremely important for
systems, databases, ML, everything.
So I understand folks are maybe just looking at it right now.
Um, but...
Tim Hwang: You were into it before it was cool.
Marina Danilevsky: That's right.
That's right.
We were into it before it was cool.
We discovered the band first.
Um, so the thing right now with RAG: let's talk about what RAG is again, real
quickly, and then we'll see what it is that you need to evaluate, the retrieval,
the augmentation, the generation part.
All right.
So what are you trying to do?
You're trying, in the end, to give information that is supported by
knowledge, so that you can say: okay, this is knowledge I can trust;
here's the information I'm giving you.
So what happens with RAG?
Remember, a user has some sort of an inquiry.
You fetch something that is related, and you say, I'm going to use this
information to give you the answer.
Okay.
Where's all of this going to break?
It's going to break when the query is not well formed,
so you're not fetching the right thing.
If you're not fetching the right thing,
then you don't know there's information you didn't get.
So that's something to evaluate.
And whether you're fetching the right thing or not,
the model doesn't have to generate an answer based on it;
there are multiple ways that's going to break down.
So you're going to have a model that gives you an answer
that's not based on the information you fetched.
It's going to give you an incomplete answer.
It's going to give you an answer that's a mix:
some of it drawing from the retrieved information,
some of it drawing from its parameters,
some of it just made up, because the model decided to go off,
especially a little later in the response.
And you have to check all of that.
Uh, can you force the model to give you a different answer because
you told it: no, no, no, you told me it was this way, but now
assume it's that way?
Okay, can you mess it up that way?
Can you give the answer quickly enough?
Um, so these are all of the things that you have to manage to evaluate.
So when people talk about context relevance, or answerability, or
faithfulness, or completeness, all of these different metrics that people
have, this is really what we're talking about when evaluating RAG.
A couple of points here.
You can try to benchmark against a gold answer,
which usually works in cases like classification,
or anything where there is a very, very clear correct answer.
The problem with generative AI is, remember that word, generative:
everything is created fresh, which means there might have been a lot
of different ways to create an answer.
So when you say, I'm going to have some sort of an overlap metric, like
ROUGE or BLEU, anything of that kind, that's not always going to be great.
It'll tell you if you've gone off completely, but it won't tell you
the subtleties: maybe you rephrased the answer a little differently, but
it still would have been acceptable.
So problems there.
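The limitation Marina points out shows up even in a toy unigram-overlap score in the spirit of ROUGE-1 recall (real ROUGE and BLEU involve more machinery; this sketch only illustrates the overlap idea, and the example answers are invented):

```python
def unigram_recall(candidate, reference):
    """Fraction of the reference's unique words that appear in the candidate."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(1 for w in ref if w in cand) / len(ref)

gold = "the contract expires on 31 december 2025"
copy = "the contract expires on 31 december 2025"
paraphrase = "this agreement ends at the close of 2025"

exact_score = unigram_recall(copy, gold)       # 1.0 for a verbatim match
para_score = unigram_recall(paraphrase, gold)  # low, despite being acceptable
```

The paraphrase scores far below the verbatim copy even though a human might accept both answers; that is exactly the subtlety an overlap metric misses.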
So then you say, okay, let's not have references;
let's just judge the answer as it is.
The problem with all of the metrics that I just mentioned:
nothing has a completely clear definition, and it can't,
because you cannot get everybody to agree on what
"complete" means, what "faithful" means.
Believe me, I've tried.
We have had so many arguments with research.
Tim Hwang: Well,
Marina Danilevsky: I mean, it's kind of existential, because like,
Tim Hwang: yeah, sorry, go ahead.
Marina Danilevsky: No, it is.
You're completely right.
It is like, what does it mean for you that an answer is complete?
Not only can the researchers not agree, then the customers can't agree.
So when you are talking about benchmarking, there are
bits of the system, parts of it, that you try to benchmark first.
Um, as Shobhit was saying: well, how do you do on just the retriever part?
How do you do on just the generative part?
How do you do on, you know, just faithfulness?
And the problem is that here, the whole is not the sum of its parts.
You put all of that together in an end-to-end experience, and it is not
equivalent to "I checked every part individually, therefore I know how it's
going to go together." It doesn't go that way.
And it's a very difficult thing to actually benchmark because the more parts
there are to a system, the more complex it is to know what happens when you put
all of them together in different ways.
So that's actually why people are so interested in the benchmarks
right now: because the state of it is a little confusing,
a little bit incomplete.
We're just like, well, what is it that we can actually trust?
And then, of course, there's what we talked about in previous episodes:
the benchmarks do get saturated very quickly.
As soon as you have one out, a few months later, okay, everybody
can already deal with that one.
You know, you've got to thank it for its service and move on.
Tim Hwang: Yeah.
What I love about this is that it starts very tactical and then becomes
existential very quickly, where you're basically like, what is truth?
What is clarity, anyway?
Questions to which there kind of is no answer,
I guess.
So I don't know if Marina, this is a good way to sum it up.
I mean, are you sort of saying that there is no RAG benchmark
in a certain sense, right?
Like there's no commonly understood norm for judging RAG quality.
Marina Danilevsky: We do, we do our best, and I think there are incremental
implementations that get better and better: we have one benchmark,
realize something it didn't cover,
do another one.
Do another one.
Do another one.
So there are, you know, incremental approximations of
what is and is not going to work.
And at some point in time, again, it's probably going to reach a level where
we say: all right, this is good enough.
We've kind of, you know, saturated this as much as we can.
But what ends up happening is then you end up moving to other use cases, right?
Shobhit mentioned agents.
It's a very interesting direction that we're going in.
Well, now you don't just have text.
You don't just have that coming out of the RAG.
Now you have, I am calling functions.
I am using tools.
I'm having something else happen in the middle.
My execution plan as an LLM agent is absolutely all over the place.
Now you don't just have an R and a G.
Now you have I don't know how many things, every single time you add something.
Now, how do you benchmark?
Now, how do you benchmark?
So we're all having a lot of fun constantly making new problems
for ourselves, which we then have to test, which then reveal
additional problems and things we can implement.
Tim Hwang: Yeah, I love the idea that eval design itself is
trying to hill-climb.
Basically, it has a very similar pattern to the evals themselves.
Shobhit Varshney: So yeah, Tim, just working with real clients:
one of my big, big clients, we're looking at contracts, and RAG
is a great example of that, right?
Given a few thousand contracts, I want to ask questions against
them and expect to get good answers, right?
So when you start to look at the kinds of questions and queries
that people are going to ask: what's going to be insightful for them?
A level-one question is: can I find something in a
contract that tells you the expiry date of the contract?
Or, is there an exit clause in this contract or not?
That's a simple RAG pattern, right?
Very naive, it can work.
Then you start to look at: this is a contract, but it has
amendments stapled to it.
And now the answer for the end date is actually in the third amendment,
which overrides the previous amendment, right?
So now you're looking at the whole chain of thought of how to
read this particular document.
Then a level-three question could be when I'm trying to cross-
compare and say: hey, I want to order another thousand units.
Which one of these contracts is closest to the threshold where
I'm going to get some cash back?
It's a more complex question.
And you very quickly start to move away from RAG.
So the perception is that, oh, I can just dump contracts in and ask questions.
But in reality, a human would have gone and looked at another system, in an SAP,
and said: here are all the orders to date.
Then they'd have done some math on it, and given you an answer
that cuts across systems.
So it's not quite RAG: there's no document that gives you the answer
that you can go retrieve on demand.
So you need to have some type of a router in the middle that understands
what kind of question is asked.
And then you may have to go chat with some structured data at the back end to
bring that in, and then call the unstructured data.
It starts to get really complex.
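The "router in the middle" that Shobhit describes might be sketched like this. The keyword classifier and the route names are invented stand-ins for illustration; a production router would typically be a trained classifier or an LLM call:

```python
def route(question):
    """Decide which backend should handle a contract question."""
    q = question.lower()
    # Cross-comparison questions need math over structured order data (e.g. SAP).
    if any(w in q for w in ("how many", "threshold", "cash back", "total")):
        return "structured_backend"
    # Amendment questions need the base contract plus its amendment chain.
    if "amendment" in q:
        return "amendment_chain"
    # Everything else: a single retrieve-then-generate call.
    return "simple_rag"

r1 = route("What is the expiry date of the contract?")
r2 = route("Which amendment overrides the end date?")
r3 = route("Which contract is closest to the cash back threshold?")
```

The routing decision itself then becomes one more thing to evaluate end to end, on top of the retrieval and generation steps.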
We talk about RAG, but we should really be talking about the use case end
to end, which has much more than just the RAG patterns that need to be measured.
So it's more complex than what you were talking about, Marina.
And now we're talking about a whole end-to-end chain, and how do
you measure accuracy in that case?
Two different people will have two different answers.
Marina Danilevsky: Yeah.
See, we even disagree, because I call all of that RAG in my mind.
And so we don't even agree on what RAG patterns are,
because to me, I'm like, great:
you're retrieving a function answer,
you're retrieving something from a knowledge base.
You're still kind of retrieving; in this case, retrieving just
means, you know, a function call.
But even so, right, you can think of the RAG pattern as just a single
call with only an informational query,
or you can think of it as the entire thing that you're talking about, Shobhit.
And I think that you end up having to extend
how you do the evaluation.
Great, we've done it for one small pattern.
Now, how about when you extend, extend, extend, extend?
So yeah, you're right, Tim.
Hill climbing.
Why, why sit on our laurels when there are more complicated problems to
Tim Hwang: solve? Like, we could be building.
Yeah.
And I mean, you know, my bias is just that I think one of the things I'm most
interested to see in the AI space is the continued growth of evals as an
industry, because this is where the endless value will emerge, right? Which
is companies asking: is it good?
And that actually ends up being a very, very deep
question that requires some real craft and expertise.
So, Vagner, you get the privilege of having the last word on the episode
as our, uh, inaugural, or sorry, our debut guest this episode.
Um, any final thoughts, uh, on kind of the benchmarking question or RAG in general?
I mean, I'm always excited about RAG hot takes.
Vagner Figueredo de Santana: Um, well, not a word that relates to RAG,
just a final word, more of a provocation.
For folks interested in responsible AI, I think it's worth trying to go
beyond your discipline identity, your bubble of content, and try to reach out
to other content, because, and we are talking about FAccT, in fact, it's
interesting because they go both to the more technical side and also
more to the humanities.
So try to find a subject you're interested in, like RAG or other subjects,
and try to go outside your discipline identity.
I think it's good for the whole community.
Tim Hwang: Yeah, for sure.
That's a great note to end on.
Well, that's all the time we have for today.
Marina, Shobhit, thanks for joining us again.
Shobhit Varshney: Thank you so much for having us, Tim.
This is awesome.
Most fun thing we do every week.
Tim Hwang: Yeah, definitely.
Thanks for joining and Vagner.
Thanks for joining and hopefully we'll have you back again sometime.
Vagner Figueredo de Santana: Thank you.
Tim Hwang: Great. Well, if you enjoyed what you heard, you can get Mixture of
Experts on Apple Podcasts, Spotify, and podcast platforms everywhere. And we'll
see you next week.