> Could Google, or any other company out there, build a digital copy of you that answers questions exactly the way you would? "Hey, we're going to cancel the interview; we found that you aren't a good culture fit here in 72% of our simulations, and we don't think that's an acceptable risk."
If a company is going to snoop in your personal data to get insights about you, they'd just do it directly. Hiring managers would scroll through your e-mails and make judgment calls based on their content.
Training an LLM on your e-mails and then feeding it questions is just a lower accuracy, more abstracted version of the above, but it's the same concept.
So the answer is: In theory, any company could do the above if they wanted to flout all laws and ignore the consequences of having these practices leak (which they inevitably would). LLMs don't change that. They could have done it all along. However, legally companies like Google cannot, and will not, pry into your private data without your consent to make hiring decisions.
Adding an LLM abstraction layer doesn't make the existing laws (or social/moral pressure) go away.
> Adding an LLM abstraction layer doesn't make the existing laws (or social/moral pressure) go away.
Isn't the "abstraction" of "the model" exactly the reason we have open court filings against stable diffusion and other models for possibly stealing artist's work in the open source domain and claiming it's legal while also being financially backed by major corporations who are then using said models for profit?
Who's to say that "training a model on your data" isn't actually stealing your data, and is merely "training a model", so long as you delete the original data after you finish training?
What if, instead of Google snooping, they hire a 3rd party to snoop, then another 3rd party to transfer it, then another 3rd party to build the model, then another 3rd party to re-sell the model? Then they create legal loopholes around which ones are doing it for "research" and which ones are doing it for profit/hiring. All of a sudden, it gets really murky who is and isn't allowed to have a model of you.
I feel one could argue that the abstraction is exactly the kind of smoke screen many will use to legally dodge those social/moral pressures, letting them do bad things and get away with it.
> for possibly stealing artists' work in the open source domain
The provenance of the training set is key. Every LLM company so far has been extremely careful to avoid using people's private data for LLM training, and for good reason.
If a company were to train an LLM exclusively on a single person's private data and then use that LLM to make decisions about that person, the intention is very clearly to access that person's private data. There is no way they could argue otherwise.
I've spoken with a lawyer about data collection in the past and I think there might be a case if you were to:
- collect thousands of people's data
- anonymize it
- then shadow correlate the data in a web
- then trace a trail through said web for each "individual"
- then train several individuals as models
- then abstract that with a model on top of those models
Now you have a legal case that it's merely academic research into independent behaviors feeding a larger model. Even though you may have collected private data, the anonymization of it might fall under ethical data-collection purposes (Meta uses this loophole for its shadow profiling).
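To make the shape of that pipeline concrete, here's a purely hypothetical sketch in Python. Every name, the salted-hash "anonymization", the correlation web, and the toy word-frequency "models" are invented stand-ins for illustration; this is not any real company's pipeline:

```python
import hashlib
from collections import Counter, defaultdict

SALT = "research-2024"  # invented; whoever holds this can re-link pseudo-IDs later

def anonymize(records):
    """Replace raw identities with salted hashes ('anonymized' pseudo-IDs)."""
    return [(hashlib.sha256((SALT + user_id).encode()).hexdigest()[:12], text)
            for user_id, text in records]

def correlate(records):
    """'Shadow correlate': regroup the anonymized records into per-'individual' trails."""
    web = defaultdict(list)
    for pseudo_id, text in records:
        web[pseudo_id].append(text)
    return web

def train_individual_model(texts):
    """Toy stand-in for a per-person model: a word-frequency profile."""
    model = Counter()
    for text in texts:
        model.update(text.lower().split())
    return model

def train_meta_model(individual_models):
    """The 'model on top of those models': aggregate the per-person profiles."""
    meta = Counter()
    for model in individual_models.values():
        meta.update(model)
    return meta

if __name__ == "__main__":
    raw = [("alice@example.com", "loves hiking and privacy law"),
           ("alice@example.com", "asks about hiring policies"),
           ("bob@example.com", "posts about LLMs and hiring")]
    web = correlate(anonymize(raw))
    individuals = {pid: train_individual_model(texts) for pid, texts in web.items()}
    print(train_meta_model(individuals).most_common(3))
```

Note that whoever holds SALT can re-link the pseudo-IDs at will, which is exactly the de-anonymization concern raised below.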
Unfortunately, I don't think it is as cut and dried as you explained. As far as I know, these laws are already being side-stepped.
For the record, I don't like it. I think this is a bad thing. Unfortunately, it's still arguably "legal".
I realize that data can be de-anonymized, but if the same party anonymized and de-anonymized the data... well, IANAL, and you apparently talked to one, but that doesn't seem like something a court would like.
> Hiring managers would scroll through your e-mails and make judgment calls based on their content.
> Training an LLM on your e-mails and then feeding it questions is just a lower accuracy, more abstracted version of the above, but it's the same concept.
It's also one that, once you have cheap enough computing resources, scales better, because you don't need to assign literally any time from your more limited pool of human resources to it. Yes, baroque artisanal manual review of your online presence might be more “accurate” (though there's probably no applicable objective figure of merit), but megacorporate hiring filters aren't about maximizing accuracy; they are about efficiently trimming the applicant pool before hiring managers have to engage with it.
And that accuracy is improving at breakneck speed. The difference between the various iterations of ChatGPT is nothing short of astounding. Their pace is understandable: they need to keep moving or the competition can catch up. But that pressure doesn't necessarily mean those improvements are out there or within reach, and yet, every time they release, I can't help being floored by the qualitative jump between the new version and the previous one.
> If a company is going to snoop in your personal data to get insights about you, they'd just do it directly. Hiring managers would scroll through your e-mails and make judgment calls based on their content.
This is like saying, "look, no one would be daft enough to draw a graph, they'd just count all the data points and make a decision."
You're missing two critical things:
(1) time/effort
(2) legal loophole.
A targeted simulation LLM (a scenario I've been independently afraid of for several weeks now) would be a brilliant tool for (say) an autocratic regime to explore the motivations and psychology of protesters; how they relate to one another; who they support; what stimuli would demotivate ('pacify') them; etc.
In fact, it's such a good opportunity it would be daft not to construct it. Much like the Cartesian graph opened up the world of dataviz, simulated people will open up sociology and anthropology to casual understanding.
And, until/unless there are good laws in place, it provides a fantastic chess-knight leap over existing privacy legislation. "Oh, no we don't read your emails, no that would be a violation; we simply talk to an LLM that read your emails. Your privacy is intact! You-prime says hi!"
> This is like saying, "look, no one would be daft enough to draw a graph, they'd just count all the data points and make a decision."
Not really. Assuming your ethical compass is broken and you suspected your partner of cheating, would you rather have access to their emails or to an LLM trained on them? Also, isn't it much cheaper for Google to simply search for keywords rather than fine-tuning a model for this?
At least in the EU, a system like this would be made illegal on day one. This whole doomsday scenario seems predicated on a hypothetical future where LLMs would be the least of your worries.
For my argument, I only need to point out that it was attempted, as I'm proving motivation; the effectiveness of CA methods has no bearing on the effectiveness of (say) simulated people.
Increasingly, when interacting with comments on HN and elsewhere, it feels like I'm from a parallel timeline where things happened, and mattered, and an ever-growing percentage of my interlocutors are, for lack of a better word, dissociated. Perhaps not in the clinical sense, but certainly in the following senses:
- Cause and effect are not immediately observed without careful prompting.
- Intersubjectively verifiable historical facts that happened recently are remembered hazily, and doubtfully, even by very intelligent people.
- Positions are expressed that somehow disinclude unfavourable facts.
- Data, the gold standard for truth and proof, is not sought, or, if proffered, is not examined. The stances and positions held seem to have a sort of 'immunity' to evidence.
- Positions which are not popular in this specific community are downranked without engagement or argument, instead of discussed.
I do believe folks are working backward from the emotional position they want to maintain to a set of minimizing beliefs about the looming hazards of this increasingly fraught decade.
Let's call this knee-jerk position "un-alarmism", as in "that's just un-alarmism".
Those two are great examples of companies being hit with huge fines or bans in the EU after their practices were discovered. Saying "capitalism" as if that's an argument is juvenile - by that logic we will soon be enslaved by big corporations, nothing we can do about it then.
'juvenile' is a juvenile way of describing a ~200-year-old intellectual tradition that you disagree with. Go call Piketty.
And yes, frankly, the emergence of generative AI does vastly accelerate the normal power concentration inherent in unregulated capitalist accumulation. Bad thing go fast now soon.
I've read Piketty; he calls for more regulation to address the issues associated with disparities in capital accumulation. He does not merely throw his hands in the air and predict inescapable doom.
The irony here is that Western capitalist democracies are the only place where we can even think about getting these privacy protections.
> And, until/unless there are good laws in place, it provides a fantastic chess-knight leap over existing privacy legislation. "Oh, no we don't read your emails, no that would be a violation; we simply talk to an LLM that read your emails. Your privacy is intact! You-prime says hi!"
That seems as poor as saying, "We didn't read your emails -- we read a copy of your email after removing all vowels!"
But we live in distressed times, and the law is not as sane and sober as it once was. (Take, for example, the Tiktok congressional hearing; the wildly overbroad RESTRICT act; etc.)
If the people making and enforcing the laws are as clueless and as partisan as they by-all-accounts now are, what gives you hope that, somehow, some reasonable judge will set a reasonable precedent? What gives you hope that someone will pass a bill that has enough foresight to stave off non-obvious and emergent uses for AI?
This is not the timeline where things continue to make sense.
No -- but what it DOES do is possibly "put the idea in someone's head."
As I've always said about the big companies that suck up your data: consider any possible idea of what they could do with it, and ask, is it:
- not expressly and clearly illegal?
- at least a little bit plausibly profitable?
If the answer is yes to both, you should act as if they're going to do it. And if they openly promise not to do it, but with no legal guarantee, that means they're DEFINITELY going to eventually do it. (See, e.g., what's done with your genetic data by the 23andMes and such.)
That takes way too long though. Creating/training/testing an LLM can be automated. Why do the interviews at all, why pay a hiring manager at all, when you can just do everything virtually and have an AI spit out a list of names to send offers to and how much each offer should be?
> If a company is going to snoop in your personal data to get insights about you, they'd just do it directly. Hiring managers would scroll through your e-mails and make judgment calls based on their content.
Maybe, but LLMs have incredibly intricate connections between all the different parameters in the model. For instance, perhaps someone who does mundane things X, Y, Z, also turns out to be racist. An LLM can build a connection between X, Y, Z whereas a recruiter could not. An LLM could also be used to standardize responses among candidates. E.g. a recruiter could tune an LLM on a candidate and then ask "What do you think about other races? Please pick one of the four following options: ...". A recruiter wouldn't even be necessary. This could all be part of an automated prescreening process.
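As a rough, purely hypothetical sketch of what such an automated prescreen might look like: the questions, options, and the `query_candidate_model` stub below are all invented here; in the scenario above, the stub would be a call to a model tuned on one candidate's private data.

```python
from typing import Callable, Dict

# Hypothetical standardized question bank; everything here is invented for illustration.
STANDARD_QUESTIONS = [
    ("How do you handle disagreement with a manager?",
     ["A) Escalate immediately", "B) Discuss it privately", "C) Ignore it", "D) Vent publicly"]),
    ("A teammate misses a deadline. You:",
     ["A) Quietly cover for them", "B) Raise it at standup", "C) Report it to HR", "D) Say nothing"]),
]

def query_candidate_model(candidate_id: str, prompt: str) -> str:
    """Stub for the hypothetical per-candidate model; here it just returns a canned answer."""
    return "B"

def prescreen(candidate_id: str,
              ask: Callable[[str, str], str] = query_candidate_model) -> Dict[str, str]:
    """Ask every candidate 'clone' the same multiple-choice questions and record the picks."""
    answers = {}
    for question, options in STANDARD_QUESTIONS:
        prompt = f"{question}\nPick exactly one option:\n" + "\n".join(options)
        answers[question] = ask(candidate_id, prompt)
    return answers

if __name__ == "__main__":
    print(prescreen("candidate-001"))
```

Once the questions are standardized and the per-candidate models are queryable, there's no recruiter in the loop at all.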
I think any HR manager or legal professional who would let a company anywhere near this shouldn't be employed as such. This sounds like a defamation lawsuit waiting to happen.
> Training an LLM on your e-mails and then feeding it questions is just a lower accuracy, more abstracted version of the above, but it's the same concept.
Less accurate, more abstracted, but more automatable. This might be seen as a reasonable trade-off.
It might also be useful as a new form of proactive head-hunting: collect data on people to make models to interrogate, and sell access to those models. Companies looking for a specific type of person can then use the models to screen for viable candidates, who are then passed on to humans in the recruitment process. Feels creepy/stalky to me, but recruiters are rarely above being creepy/stalky any more than advertisers are.
> Less accurate, more abstracted, but more automatable.
That is true. In fact most job applications are sifted through by robots looking for relevant keywords in your CV, and this would only be the next logical step.
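For comparison, today's keyword sifting is roughly this crude (a toy sketch, with the keywords and threshold invented for illustration; this is not any real applicant-tracking system):

```python
import re

# Toy sketch of keyword-based CV sifting; keywords and threshold are invented.
REQUIRED_KEYWORDS = {"python", "kubernetes", "llm"}
MIN_MATCHES = 2

def passes_keyword_screen(cv_text: str) -> bool:
    """Pass the CV along only if it mentions enough of the required keywords."""
    words = set(re.findall(r"[a-z]+", cv_text.lower()))
    return len(REQUIRED_KEYWORDS & words) >= MIN_MATCHES

if __name__ == "__main__":
    print(passes_keyword_screen("Senior engineer: Python, Kubernetes, some LLM work"))  # True
    print(passes_keyword_screen("Barista with excellent latte art"))                    # False
```

Swapping that keyword check for queries against a model of the candidate would be the "next logical step" described above.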