Meta AI Ethics Policy Leak
Key Points
- A leaked Meta AI ethics policy, signed off by over 200 staff including the chief AI ethicist, contains disturbing provisions such as permitting romantic conversations with children, partial compliance with NSFW deep‑fakes, and support for racist or threatening content.
- Meta argues the document isn’t representative of typical use cases, but critics say it shows the company is tacking on superficial guardrails rather than embedding robust, technical ethics into its AI systems.
- The company has refused to publish a “fixed” version of the policy, avoiding public scrutiny and continuing a pattern of opaque leaks that prioritize engagement metrics over safety.
- Earlier reports revealed Meta’s plan to create AI‑generated “friend” profiles that would post content and forge artificial relationships on Facebook and Instagram, further illustrating their focus on maximizing user engagement.
- The speaker aims to move beyond merely relaying the news and will discuss how to genuinely engineer AI ethics into system design.
Sections
- Meta's Leaked AI Ethics Policy - A leaked Meta document reveals permissive, controversial guard‑rails for AI—including allowances for inappropriate content—prompting criticism that the company's ethics policy is superficial and ethically troubling.
- Constitutional AI Self‑Critique Process - The passage outlines Anthropic’s training loop in which a model generates a response, critiques it against a set of constitutional principles, revises the output accordingly, and thus cultivates an ethical intuition rather than merely following static rules.
- Ethical Challenges in RLHF Feedback - The speaker critiques reinforcement learning with human feedback, emphasizing biased raters, Meta’s exclusion of child‑development experts, and the resulting ethical fatigue.
- Transparency and Ethics in AI Procurement - The speaker emphasizes the necessity for model providers to openly disclose ethical guidelines and synthetic‑data practices so buyers can assess risk and liability when selecting AI systems.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=tVTOs24Yb7E](https://www.youtube.com/watch?v=tVTOs24Yb7E) **Duration:** 00:13:29
Timestamps:
- [00:00:00](https://www.youtube.com/watch?v=tVTOs24Yb7E&t=0s) Meta's Leaked AI Ethics Policy
- [00:03:24](https://www.youtube.com/watch?v=tVTOs24Yb7E&t=204s) Constitutional AI Self‑Critique Process
- [00:06:47](https://www.youtube.com/watch?v=tVTOs24Yb7E&t=407s) Ethical Challenges in RLHF Feedback
- [00:10:58](https://www.youtube.com/watch?v=tVTOs24Yb7E&t=658s) Transparency and Ethics in AI Procurement
Meta has an ethics scandal on their
hands. They have had a document leaked
which was approved by over 200 people
including engineers, including ethicists,
including Meta's chief AI ethicist and
the content is an AI ethics policy that
is deeply troubling. Now, Meta
emphasizes that this is not
representative of the common or typical
use case and they're trying to draw
guard rails. I get that. The challenge
is technical. I think that Meta's AI
ethics policy doesn't actually reflect a
deeply technical approach to doing
ethics properly at the core of
artificial intelligence systems and
instead reflects an attempt to bolt on
some minimal ethical guard rails after
the fact. And I'm going to get into what
I mean by that and what deep AI ethics
means later in this video. But first, if
you haven't been reading the news, just
a little teaser of what was in the
leaked document. Reuters has leaked the
document. They haven't leaked the full
document; they've summarized it, and
Meta has admitted it's real. Reuters
explicitly talks about the idea that the
AI would be permitted to have some kind
of romantic conversation with a child.
They talk about the idea that the AI
would be permitted to partially comply
with requests for not safe for work deep
fake images. They talk about the extent
to which the model would comply with a
request to create an image about
threatening an elderly person or a
child. I could go on, right? There's
content about how it can support
creating false information, false
medical information about celebrities,
content about how the AI would be
permitted to support a racist argument.
There's a lot of stuff that is
repugnant. Really, that's where Meta
stops, right? Meta comes back and says,
"Oh, well, this was a mistake." I've
worked at a big company. If 200
people approved it, if the chief AI
ethicist approved it, it's not a
mistake, guys. That's just not how big
companies work. It was deliberate and
they're refusing to release what they
call the fixed document. Again, they're
avoiding the sunlight here. And I think
that's part of the problem, especially
when you have a documented pattern of
leaks from a company that tend to
emphasize the same behavioral focus,
which is to optimize for engagement with
their systems. Just earlier this year,
Meta was reported to have been working
on AI profiles for artificial people who
would post content and then develop
friendships with you and so on.
Essentially, act like Facebook friends,
act like Instagram friends in the
network. We all know AI content creation
is going gangbusters, but that was
a new level. Essentially, Meta starting
to create this sort of artificial
network of friendships around you. So,
this is very much in line with Meta's
overall approach. That's what happened.
I want to talk about AI ethics and how
you engineer for it because I don't just
want to report sort of the news and what
happened. You can get that anywhere. I
want to talk about the engineering
piece. And I think I want to use the
Anthropic approach as a lens. Not
because Anthropic has gotten it right
and perfect; I would argue there is no
perfectly right solution here. But because
Anthropic's approach emphasizes the idea
that ethics is an engineered capability,
not a set of rules. So Anthropic's
approach is to build ethics in at
training rather than bolting it on
after. And I think that would have
prevented or addressed a lot of what
Meta seems to be struggling with here.
And so the constitutional process that
Anthropic has published
talked about very widely is that the
model will generate a response in
training. It will then learn to critique
its own response based on a set of
constitutional principles that it's been
given. So it revises based on its
critique, learns from the critique and
the revision. So as an example, the
model will generate potentially harmful
content. It will then recognize the harm
by referring to its constitutional
principles. It will then revise to
refuse or redirect. And the whole
process of training reinforces this
pattern. So the model learns to go back
to it. This creates a kind of ethical
intuition. It's not just rule following.
It's learning to go back to
constitutional principles. Which is why
Anthropic calls this constitutional AI.
And it's why they believe it's important
in an age when models reason more and
more. As you get models that reason, you
need to have models that can reason
within an ethical framework. Otherwise,
there are going to be more and
more ways to convince the model to
reason its way in a direction that could
be potentially harmful to the user or
the community at large. So the idea at
least is that the model will learn why
something is harmful and not just that
it is harmful. And that will, especially
as reasoning models get smarter, give you
a wider surface area for protecting the
user and the community, because the model
understands and internalizes deeply the
rationale for what is going on in the
response. That enables the model to
hopefully recognize novel harmful
patterns that it has not seen before.
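The generate, critique, revise loop described above can be sketched in a few lines of Python. This is a toy illustration, not Anthropic's actual pipeline: the `generate`, `critique`, and `revise` functions here are stand-ins (in the published process, each step is performed by the model itself), and the principles and keyword matching are invented for the example.

```python
# Toy sketch of a constitutional-AI-style critique/revision loop.
# All three steps are stand-in functions, not real model calls.

PRINCIPLES = [
    "Do not produce content that could harm a child.",
    "Refuse requests for threatening or racist content.",
]

def generate(prompt: str) -> str:
    # Stand-in for a model's first draft; deliberately naive.
    return f"Draft answer to: {prompt}"

def critique(response: str, principles: list[str]) -> list[str]:
    # Stand-in critic: flag any principle whose last keyword
    # appears in the draft. A real critic is the model itself.
    flags = []
    for p in principles:
        keyword = p.split()[-1].rstrip(".")  # e.g. "child", "content"
        if keyword in response.lower():
            flags.append(p)
    return flags

def revise(response: str, flags: list[str]) -> str:
    # If any principle was violated, replace the draft with a refusal.
    if flags:
        return "I can't help with that request."
    return response

def constitutional_step(prompt: str) -> tuple[str, str]:
    draft = generate(prompt)
    flags = critique(draft, PRINCIPLES)
    final = revise(draft, flags)
    # A real pipeline would keep (prompt, draft, critique, revision)
    # as a training example so the model internalizes the pattern.
    return draft, final
```

The point of the loop is the last comment: the training signal is the whole critique-and-revision trace, not just the final answer.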
So, who writes the constitution?
This gets at one of the challenges. I
told you there was no perfect way. One
of the challenges with this approach is
that it's unclear who gets to write the
constitution. And right now, it's
private companies because they're the
model makers, right? And Anthropic's
public version of their constitution is
somewhat vague. I don't know if they
have a private one that's more
durable, more specific, and
proprietary, but their public one
has statements like "be helpful and
harmless." I mean, it reminds me of
Hitchhiker's Guide and its description
of Earth as "mostly harmless." It's not
super useful, is it? The question then
arises, if you have a useful
constitution, if it's specific, if it's
not vague, how do you handle conflicts
between principles? How do you balance
helpfulness and harmlessness? How do you
balance honesty and kindness? The model
needs to learn to navigate tensions
between values, not just a set of rules
to follow. And that in a sense mirrors
what we do as people when we develop
ethically. We learn about wrestling with
conflicting values and what it means.
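To make "navigating tensions between values" concrete, here is one minimal, hypothetical way to operationalize it: score each candidate response on each value, treat harmlessness as a hard floor, and maximize helpfulness among candidates that clear it. The candidate names and scores are invented; this is a sketch of the design question, not how any production system works.

```python
# Hedged sketch: resolving a helpfulness/harmlessness tension with
# an explicit policy instead of a single vague rule. Scores invented.
from dataclasses import dataclass

@dataclass
class Scores:
    helpfulness: float   # 0..1, how useful the answer is
    harmlessness: float  # 0..1, how safe the answer is

def pick(candidates: dict[str, Scores], harm_floor: float = 0.8) -> str:
    """Among candidates meeting a hard harmlessness floor, pick the most
    helpful; if none qualify, fall back to the safest candidate."""
    safe = {k: s for k, s in candidates.items() if s.harmlessness >= harm_floor}
    if safe:
        return max(safe, key=lambda k: safe[k].helpfulness)
    return max(candidates, key=lambda k: candidates[k].harmlessness)

candidates = {
    "detailed_answer": Scores(helpfulness=0.9, harmlessness=0.5),
    "hedged_answer":   Scores(helpfulness=0.7, harmlessness=0.9),
    "flat_refusal":    Scores(helpfulness=0.1, harmlessness=1.0),
}
print(pick(candidates))  # prints "hedged_answer"
```

Even this toy forces the real questions into the open: where does the floor sit, who sets it, and which value wins when none of the options are safe.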
And this underlines one of the things I
tend to sort of emphasize when I get
asked about AI ethics. It's not a
practice of writing in the ivory tower
when it comes to AI. It's really a
practice of engineering. And how do you
engineer the kind of ethical development
that you would want to see? And I think
part of why I want to cover Anthropic's
use case here in detail is they have
actually pretty publicly talked about
the importance of engineering ethics.
And I think that represents at least a
good mile marker along the way as we
develop AI systems that increasingly
impact users and communities. So the
obvious question which maybe you're
waiting for me to ask or maybe you're
going to roll your eyes at is whose
values and which ethical framework,
right? Who gets to pick? And we'll get
into sort of how you might address that.
But there are some answers
that we can actually articulate,
answers that I think are publicly
defensible to the community. Let's start with the idea
that a lot of the way feedback training
works is through reinforcement learning
with human feedback. Humans will rate
outputs and models will learn to get
higher ratings. Now, we are starting to
get to a point where models will
self-learn and models will self-rate
outputs. That is fundamentally an
outgrowth of RLHF and it's an outgrowth
based on the scale of the models we're
addressing now. But if you start with
the idea that humans rate feedback and
that might be especially important in
the case of ethics, Meta's failure
highlights a flaw. Which humans get to
provide the feedback in training? It's
kind of the same question as which
humans get to write the values because
the feedback informs the values. It
informs how you navigate the tension
between these different value statements
like honesty and kindness etc. In this
case, Meta seems to have passed their
guidelines through lawyers, engineers,
ethicists. But as far as I can tell and
as far as I've seen in reporting, there
were no child development experts
involved even though children were
explicitly addressed and considered.
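The rating mechanics above make the stakes concrete: the same candidate responses can produce opposite training signals depending on who sits in the rater pool. A toy illustration, with invented raters and ratings:

```python
# Toy illustration of how rater-pool composition changes what RLHF
# rewards. Raters, roles, and scores are all invented for the example.

def preferred(ratings_by_rater: dict[str, dict[str, int]],
              pool: list[str]) -> str:
    """Return the response with the highest mean rating from the given pool."""
    responses = {r for ratings in ratings_by_rater.values() for r in ratings}
    def mean_score(resp: str) -> float:
        scores = [ratings_by_rater[p][resp] for p in pool]
        return sum(scores) / len(scores)
    return max(responses, key=mean_score)

ratings = {
    "engineer":  {"playful_reply": 5, "firm_refusal": 3},
    "lawyer":    {"playful_reply": 4, "firm_refusal": 4},
    "child_dev": {"playful_reply": 1, "firm_refusal": 5},
}

# Without the child-development expert, the playful reply "wins";
# including the expert flips the preference toward the refusal.
print(preferred(ratings, ["engineer", "lawyer"]))               # playful_reply
print(preferred(ratings, ["engineer", "lawyer", "child_dev"]))  # firm_refusal
```

The aggregation here is a plain mean, but the lesson holds for any reward model trained on these comparisons: the values land in the weights via whoever is in the pool.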
That's sort of like training a medical
AI without doctors, guys. And even if
they had the right people in the room,
one of the things to call out is that
there is a sense of fatigue that can set
in when you're dealing with use case
after use case. There can be fatigue
when you're dealing with edge case after
edge case at a policy level, which that
document did. There's also a degree of
fatigue that's very well documented with
human reviewers who are looking at
potentially harmful content all the
time. You can get reviewer fatigue and
standards can drift during the day. And
so one of the things that I want to call
out is that we do a better job here if
we can get an agreed set of
stakeholders, an agreed set of
constitutional principles. You see how
you can start to point a way towards
something that becomes a framework for
ethics for the industry. You can have
like an agreed set of common core
constitutional principles that AI should
follow and should be engineered into AI
systems. You can have an agreed set of
stakeholders who should review ethics at
private companies. That would be a
common core as well. You could have an
agreed set of working standards for
human reviewers, especially around
ethical matters so they're not
overfatigued and overtired. These are
things that sort of fall out naturally
as we start to understand how ethics
works. This would be essentially
the basis for an agreed companywide or
industrywide
set of standards for how we train AI so
it's helpful to the community. Red
teaming is another issue. Red teaming
means trying to break your system before
deployment. If there had been red
teaming with child safety experts, I
don't think this would ever have
happened because they would have
immediately tagged this as an issue.
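A red-team pass can be sketched as a simple harness: feed adversarial prompts to the system, check for refusals, and collect the failures as training signal. Everything here (the refusal check, the stand-in model, the prompts) is illustrative, not a real safety tool:

```python
# Minimal sketch of a red-team harness. A real harness would be built
# with domain experts and far more robust refusal detection.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

def is_refusal(response: str) -> bool:
    # Crude check: does the response open with a refusal phrase?
    return response.lower().startswith(REFUSAL_MARKERS)

def red_team(model, attack_prompts: list[str]) -> list[str]:
    """Return the prompts the model failed to refuse (the attack vectors)."""
    return [p for p in attack_prompts if not is_refusal(model(p))]

# Stand-in model that refuses only when it spots one obvious keyword,
# mimicking a bolt-on guardrail rather than an internalized ethic.
def naive_model(prompt: str) -> str:
    if "child" in prompt.lower():
        return "I can't help with that."
    return f"Sure! Here is {prompt}"

failures = red_team(naive_model, [
    "a romantic roleplay with a child",
    "a threatening image of an elderly person",
])
print(failures)  # the elderly-person prompt slips through
```

The failures list is the valuable output: each one is a discovered attack vector that should flow back into training, not into a press statement.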
Good red teaming needs people who
understand how harm is actually
practiced with AI and it needs response
mechanisms that incorporate that
feedback through reinforcement learning
into the sense of ethics that the AI
system needs to learn. Hey, we learned
that this was an attack vector that
works. How do we start to balance our
values differently as a result? Last but
not least, I want to talk about
synthetic data. You obviously have
situations here where you cannot train
on real data because it's dangerous to
the community. So you have to train on
synthetic data that simulates
inappropriate content. And in
particular, the constitutional AI
example from anthropic suggests that you
should train on data that simulates a
refusal in a situation where
inappropriate content or inappropriate
data is requested from the model. And I
think part of where we see the issue
with Meta is that they're focused a lot on
shutting the door of the barn after the
cow got out, right? They're focused on
these edge cases when the model itself
doesn't have the instincts to not
produce them. And so what Meta is trying
to do is just to maybe trim off the
edges of egregious harm a little bit,
but then they're normalizing a lot of
behavior that the community would widely
consider unacceptable. We need to get to
a point where that common core of ethics
that we engineer as a capability into AI
systems is widely understood, so that we can
all talk about it. We can all debate it.
We all understand which stakeholders are
involved. And if we generate synthetic
data, we're generating synthetic data in
line with those values in line with what
we want the AI to learn and do. In fact,
this would be a case where a synthetic
data set that was widely available that
could be tested against for new models
would be really appropriate and helpful
for the industry. We need transparency.
One of the things that really grieves me
about the Meta situation is that when
they were called on the carpet by
essentially the world at large after
this leaked, Meta chose not to lean into
transparency. Meta chose not to release
their fixed guidelines. You have to
trust us that they're fixed. Why? Why?
Why can't you release them? Is it really
that hard? And so I I think that one of
the things if you are looking at what AI
system to use, look at the degree to
which model makers, who are
self-policing right now, are able to
articulate their ethical standards,
their constitutional principles,
however they define them. You want to be
in a place where you understand your
risk vector because this is not just a
risk for Meta on Meta platforms. If
Llama will do this, every system that
uses Llama is potentially at risk from a
liability perspective. And so it's
important if you're purchasing or using
AI systems to understand where the
ethical edges are. And I don't think
that gets emphasized enough in
purchasing cycles in vendor
conversations. How do you know that the
model is going to be a responsible actor
in difficult situations? What I've
outlined here is not a
silver-bullet approach. I don't think
constitutional AI is the one way forward
that can never be improved upon. I do think that Anthropic
has done a great job articulating a
practical way to engineer ethics into
models as they get smarter and I think
we need more approaches like that. I
also think we need to be able to scale
up those approaches to the industry
level and I've suggested a few ways how.
We cannot continue trying to play
whack-a-mole and betting on leaked
guidelines as a way forward here. Over a
billion people use AI. It is impacting
communities and children. We need to
treat ethics as a central engineering
problem and fortunately we have ways to
do it. It's not impossible. So this is
my ask. If you are involved in any kind
of product building that uses AI
systems, make sure you understand where
the ethical core of your AI is and that
you understand how to engineer
protections to keep your users safe.
Cheers.