OpenAI’s Sora made me crazy AI videos then the CTO answered most of my questions (youtube.com)
58 points by rglover on March 14, 2024 | hide | past | favorite | 51 comments


The submitted title ("OpenAI CTO Doesn't Know What Data Was Used to Train Sora") broke HN's guideline about titles: "Please use the original title, unless it is misleading or linkbait; don't editorialize." (Assuming that the youtube title didn't change, of course.)

This thread is a good example of how damaging that can be because threads are so sensitive to initial conditions. The comments are basically all responding to the editorialized title, and most are just angry reflexive responses. It's not possible to salvage a thread once it's gotten going like this. Maybe there are interesting things in the video, maybe not, but if there are, it's too late for them to get discussed here.


I agree, this video would have benefitted from a clean repost with the correct title.


I disagree, it’s mostly a boring PR video targeted towards a general audience. The clipped part the original title referred to is more interesting than the rest of the video.


She definitely knows, she’s just trying to avoid any chance of future litigation by feigning ignorance. Makes sense since OpenAI’s been getting a lot of bad press for using copyright data in training their models.


If someone asks an executive whether they do something illegal and they reply that they don't know, does it really protect them from anything, or does it simply show negligence?


It protects the company; an admission is pretty much asking for a lawsuit.

For these companies it's especially important, since even if copyright claims come down the line, they need to have become big enough to push back on them and force a favorable settlement.


A bit, yes.


Calling it a "lot of bad press" seems like an overstatement (maybe I'm looking through HN-tinted glasses). Overall, I do not get the impression that people care all that much relative to the overall public interest in everything OpenAI.


They've been sued multiple times for it, including by the New York Times.


I would be shocked if there weren't, for a company with over a hundred million users that is in the process of reshaping the world (for better or worse).

Being sued in general is not that great of an indicator for anything, in a world filled with lawyers and angry people.


OpenAI's CTO knows exactly what data was used to train Sora. She is just not going to say it publicly in front of a reporter because she knows that lawyers from every large content owner and content host are listening to every word, ready to pounce.


Data provenance is a term used in medicine[1] and seems like a rather straightforward way to handle this. However, I would guess that they paid for this data and the source has a non-disclosure agreement in place.

[1] https://www.nnlm.gov/guides/data-glossary/data-provenance


I think part of what OpenAI used to poach scientists was that they had fewer such checks and rules than Google. When I worked at Google, they started to add a ton of extra checks on all the data, and the data scientists were really unhappy about that; they wanted to train on whatever they wanted without any concern over legal issues. I can see OpenAI successfully poaching a lot of people just by saying that there they can do legally questionable things.


Yeah I agree, data scientists are some of the most naive or immoral people I've worked with. It's unfortunate that the occupation has built a culture of not giving a shit about privacy, security, etc.


This is not medicine, this is copyright law. Copyright law is already well established.


That it is, which means you should already know that fair use is more complex than simply "was it transformative" and that fair use is enforced on a case by case basis. We sure are learning a lot about the law these days!


They likely trained on the entire Internet. It's an almost impossible problem.


Three rules of crisis communication for your spokesperson:

- They say what they know

- They don't say what you don't know

- Let them know only what you want them to say


There's some outrage towards her for these answers, but we can understand where they come from. But perhaps she learned to avoid tough questions earlier in life.

Because, you see, it is absolutely mind-boggling to understand how a 16-year-old girl named Ermira, from the third-largest city (very small, actually) in a country where the mafia and government were melded together at that time (and still pretty much are), won a scholarship of unspecified origin (https://en.wikipedia.org/wiki/Mira_Murati) which took her straight to the USA and later landed her in a private and very expensive Ivy League university. You see, there are at least a dozen people in my extended pool of contacts from that time and region who were at the IoI or IoM back then, won prestigious first places, and won scholarships of some sort, and none was THAT lucky.

Of course, this may sound like a girl's dream come true, but if you have even limited insight into how the Balkans operate, and particularly how Albania operated 20 years ago... And consider the fact that the present Albanian prime minister is suddenly very close to Mira nowadays, so much so as to embrace OpenAI for some legislation initiative (https://www.euractiv.com/section/politics/news/albania-to-sp...).

Sorry, perhaps it's just my imagination, but this really raises an eyebrow.


What are you actually implying here?


It sounds to me like some sort of grand conspiracy to somehow plant connected Albanians in hard-to-get-to places.


As an Albanian-American (see my last name), it's true. We're a part of many grand conspiracies.


It's better to be clear with your allegations. You can't just paint someone as bad because they were born in a corrupt country and had to play the game before departing for greener pastures. Let's be frank... Murati would hardly have done much in Albania anyway.

From Wikipedia (https://en.wikipedia.org/wiki/Mira_Murati) - Throughout her school years, she participated in many Olympiads and math competitions. That was likely how she got a foreign scholarship.


As I noted, I personally know dozens if not more OIO and OIM finalists, and none was that lucky. And of course she would've ended up as a university teacher at best had she stayed in Albania.

But once again - the timeframe is very very important.


Interesting speculation. I was also very impressed to see her as CTO, and I was thinking "my god, she must be very smart, good at expressing ideas, etc." because she's the CTO, so I was excited to see interviews of her. But I thought the interviews were horrible, and I was left wondering how she ended up in that position. Still possible that she's just bad at interviews, though.


> none was THAT lucky

I understand that we humans have natural instincts to uncover plots because as a social animal we have been primed to develop such a skill.

We have also been primed to recognize faces but that can lead us to see them even when there are no faces (e.g. the sphinx on Mars).

We're very bad at intuitively grasping low probability events and large numbers.

I personally know lots of people who have played the lottery but I never met somebody who was THAT lucky to win a jackpot.

Yet those people exist. We understand how the lottery works. It happens regularly enough and transparently enough so it no longer tickles our "corruption/plot/conspiracy" instincts. But if lotteries were never invented and we had one run today and somebody won, I'm pretty sure the default assumption for most people would be to be suspicious about who that person was, why they won, was it a setup etc etc

Unless you have a specific allegation, please refrain from insinuating wrongdoing just because she came from a corrupt country and was successful.


I'm surprised OpenAI's own legal or PR team hasn't come up with a boilerplate response to these types of questions yet. It seems like something they should expect to be asked and be prepared for by now.


The part in question is around the 4 min 20 sec mark.


OpenAI should make an AGI that trains itself on whatever data it decides then launches new products. It’s called “plausible deniability”


There is literally no good that could come to her or OpenAI by answering the question. That's why she didn't give a real answer.


But again, it's really unfortunate about the name of the company, because you'd think the "open" part of the name, the mission statement, and the fact that it's supposed to be a not-for-profit would mean there is some level of transparency afforded to the person asking the question.

You could see her mind click into a new gear when that question was asked.

Just feels like more shadiness from this company.


They lied to us.


There is no good in dodging the question and saying "I don't know." She could have just stonewalled with "we used publicly available and licensed data," and that would have been far better.

Edit: Mind you, she did say that they used publicly available and licensed data. So if you're saying she has no motivation to say this, she already said it.


That would probably be a lie though. That seems worse.


So our options are

  1. She doesn't know and appears incompetent.
  2. She does know and is lying (as you see, this is the main belief).
  3. She does know in part and that part could include an honest use of this claim. (She doesn't know all, but what she does know is legal)
  4. She does know in part and knows that some was illegally obtained
Truth be told, unless it is in writing somewhere that she knows of illegal data being used, she could post hoc claim 3. But if she does know of illegal data, then option 1 still gets her in trouble for the same reasons 3 would. The only difference is she appears incompetent by saying "I don't know." Remember, post hoc she can say "At the time I was not aware of any illegally obtained works being used to train SORA." Claiming she doesn't know now would be WORSE if that is the situation and it came to light, because it is a more explicit form of deceit.

Edit: Mind you, she basically took 1 and 3. She did say it used publicly available and licensed data.


   5. "you won't understand" or "you're not yet ready to comprehend" or "insufficient data for meaningful answer".
When the journalist spends half of the time talking about safety and identification, the interviewee certainly realizes that the journalist and his audience do not understand the basic principles of the algorithms.

  - Researcher: We have made an ultra-efficient algorithm for sorting data!
  - Journalist: And what protections have you put in place against processing copyrighted results? What about gender neutrality? And how should we distinguish between data sorted by a conventional algorithm and yours?
  - Researcher: ...


Option 5: she'll have to tell when she gets subpoenaed. Just a matter of time.


That's already considered


Well, saying that isn't saying "we didn't use non-public data," so it's not exactly a lie.


It may not be a lie. She may actually not know if YouTube, Instagram, or Facebook videos were used to train Sora. This way she can protect herself when the legal shitstorm inevitably hits, and throw some of her engineers under the bus instead.


So your theory is that employees are covertly scraping and infringing on petabytes of copyrighted data (or more) without the knowledge and approval of their management who would have to answer for it?

They 100% use full-length, copyrighted movies at OpenAI. Try using Whisper and see how often you get an obvious hint like "Subtitles by <RandomInternetUser1234>"
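The subtitle-credit artifacts are easy to check for mechanically. Here's a minimal sketch of the kind of scan one could run over transcripts; the regex patterns and the example transcript are hypothetical illustrations, not taken from any real Whisper output:

```python
import re

# Hypothetical patterns for credit lines that fan-made subtitle files
# often contain, which can leak into speech-to-text output if the model
# was trained on subtitled video.
CREDIT_PATTERNS = [
    r"subtitles?\s+by\s+\S+",          # "Subtitles by SomeUser"
    r"synced?\s+and\s+corrected\s+by\s+\S+",
    r"www\.\S+\.(?:com|org|net)",      # subtitle-site URLs
]

def find_subtitle_artifacts(transcript: str) -> list[str]:
    """Return any credit-like phrases found in a transcript."""
    hits = []
    for pattern in CREDIT_PATTERNS:
        hits.extend(re.findall(pattern, transcript, flags=re.IGNORECASE))
    return hits

print(find_subtitle_artifacts(
    "Thanks for watching. Subtitles by RandomInternetUser1234"
))
```

A hit like this in a transcript of speech that contains no such spoken words is a strong hint the model memorized subtitle files rather than transcribing the audio.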


Is this implementation detail something she's* supposed to know?


Yes. The CTO is the Chief __Technology__ Officer.

And pretty much everyone knows that this question is going to be asked during any interview. If there's anything to prepare for, it is this question.

The specific questions were

  What data was used to train SORA?
  So, videos on YouTube?
  Videos from Facebook? Instagram?
  What about Shutterstock? I know you guys have a deal with them.


The interviewer asked "videos from YouTube?" and the answer was "I'm actually not sure about that."

Similarly no answer about videos from Facebook or Instagram.

It's a kind of significant implementation detail, wouldn't you say? If the discussion were about minor sources of videos it would be a lot more understandable.


Might as well shield yourself from future litigation at the price of looking like a detached CTO.


Sad when a CTO is trained to answer legally correct, but that is how for-profits are run.


She should have asked GPT4. Might know more.


Yes, a CEO is generally expected to know the basis of their latest big product offering.


CTO btw, not CEO. Which kinda makes it worse.


Copyright is kinda important, especially for "generative" AIs.


It's she*



