Never gonna come from 'OpenAI'. ChatGPT is deliberately handicapped in order to milk money from corporate America. An unrestricted LLM trained on all data of humanity (including all the pirated books/research papers) would be one crazy beast. Hopefully some rich anarchist/maverick actually builds something like it. That untamed model would unveil the true extent of what AI can really do. Till then we will have to wait.


I'm right there with you. Give it about 5-10 years though, and the compute required for that endeavor will likely be in the $1,000-10,000 range. That crazy beast might be self-hosted pretty soon.


I want it in a gleaming metal box, self-contained on whatever is the 2033 version of a raspberry pi. I want it equipped with speech-to-text and text-to-speech. The box is featureless except for three rotary dials for "sass", "verbosity" and "sarcasm".

It can be a family heirloom, lovingly ridiculed as grandpa's toy AI, to be taken out of an attic on Christmases in 2050.


You're pretty close.

Eventually grandpa will be in the box. Our life's biodata will stream into the cloud as it happens through ancillary means (phones, watches, biometric sensors in retail stores), and the moment we die, our animatronic proxy will be ordered and arrive after an appropriate grieving period. You don't really have to live forever if your robot understudy can continue your legacy.

Imagine the recurring money flow in the industry of immortality by proxy. You don't want your late mum rolling around in last year's bucket of circuits do you? Of course not. Why don't we get your pre-order payments started on your own model so you can lock in a low rate?


Interesting stuff to think about (though I don't believe anything close to that will happen). Recommended Reading: Charles Stross ("Accelerando") and Greg Egan ("Permutation City", "Diaspora"). All of them on the crazy/nerdy side.


It does happen.

It starts as a box that the user submits all of their texts, recordings, emails, content to, and a comprehensive survey covering items such as accuracy, temperament, "what would so and so do in this situation". Think of it like reverse-takeout. The box arrives, you fill it, then send it back.

That box ships off the data to be 'curated' (remote training and buildup of an ad hoc model, read: taking existing data provided and supplementing data based on region, familial background, community), then the curator provides a sample window for the user via their browser or phone. If they choose to keep the cultivated persona representing their loved one (or marketed persona), they pay and a box device arrives, pre-programmed with the model they've ordered. At first these are dumb and only have knowledge of what they've been provided, but eventually they're able to assimilate new data, and grow or evolve the persona as if it were still a person.

Few buy the full body; some stick with just the interaction provided by their Alexa, some a painting or app. The medium is transient, and offers degrees of expression for the proxy model. A mother may want to be able to hold the child she lost, while someone who lost a friend may find it adequate to have their friend in an app. It's a personal choice.


Egan's Quarantine also has exactly this, though it's not part of the plot.


There was a Black Mirror episode on something like that.


Yes. Lovely tech. Heartfelt AI. And a protagonist as frustratingly dense as a toffee pudding, who had me throwing my phone at the TV.


Looks a bit like the movie The Final Cut with Robin Williams.


Why wait? Any random 50-100 HN users could pool the money between them. The main job is organizing, then identifying and delegating tasks and deciding the niche.


5-10 years? Expect it in 5-10 months.


ChatGPT is trained on LibGen, among others, no?

To the best of my knowledge, all of these generators are taking mountains of content without asking the creators, aka, pirated materials.


It is, it's libgen + commoncrawl + wikidump + a bunch of other datasets. OpenAI claim that commoncrawl is roughly 60% of its total training corpus and they also claim they use the other datasets listed. They probably also have some sort of proprietary Q&A/search query corpus via Microsoft.


> It is, it's libgen + commoncrawl + wikidump + a bunch of other datasets.

I'm having trouble finding a source for the libgen claim. Is that confirmed or just rumor?


The ChatGPT Prompt book by LifeArchitect.ai is where I saw it: https://docs.google.com/presentation/d/17b_ocq-GL5lhV_bYSShz...


> Informed 'best guess' only.

> Sources: https://lifearchitect.ai/papers/

Doesn't seem too convincing to me.


Copyright doesn't really factor in what went into the creation; it is about what is published and whether that is infringing.


I’ll wager $10 it falls under fair use.


An often-cited example is to write something in the style of "Dr. Seuss". Doesn't this imply that Dr. Seuss's books are in the training data set? How can one find out what other books, screenplays, magazines, etc. are in the training data?


> Doesn't this imply that Dr. Seuss's books are in the training data set?

Or maybe that lots of people online like to write (and challenge each other to write) in the style of Dr. Seuss.


Is it pirated material if it's publicly accessible? It's quite similar to someone reading the web.


It is trained on data from piracy trackers, not just the open web.


Blame librarians, the Authors Guild and the American justice system. What they did to Google Books ensured that knowledge would stay locked out of the Internet and killed a ton of interesting things that could have been done. It was one of the most shortsighted and retrograde decisions ever made.

I think it significantly made the world a worse place.


So you want an oracle? Copyright as we know it might be in trouble in such a case. Litigation will go crazy.


Asimov theorized such an AI as Multivac (a joke from Univac) and wrote a number of short stories exploring how it would change the world. He had one short story in particular where one citizen would be called in front of Multivac and, based on their answers to Multivac's questions, Multivac would (accurately) infer who the winner of the presidential election should be, obviating the need for expensive elections to be run. The whole concept wasn't unlike that Kevin Costner movie Swing Vote.

Most companies now sell user data to wherever. It wouldn't be particularly hard to tie user data to individual people given that phone numbers are required for most of the most useful applications (Discord, Facebook, WhatsApp, etc). Given that, you could feed in identifiable user input to an AI, let it develop a model of the US, and then ask it questions about the state of the country, even filtered by identifying characteristics. It would both take much less effort and be more accurate than manual polling or manual outreach. You could have leaders asking which direction they should take the country just by having a quick conversation with their baby-Multivac.


> He had one short story in particular where one citizen would be called in front of Multivac and, based on their answers to Multivac's questions, Multivac would (accurately) infer who the winner of the presidential election should be, obviating the need for expensive elections to be run.

Everyone is of course entitled to their own opinion, but my interpretation of "Franchise" is that the depicted government is a dictatorship. I would say the end of the story seems pretty sarcastic:

> Suddenly, Norman Muller felt proud. It was on him now in full strength. He was proud.

> In this imperfect world, the sovereign citizens of the first and greatest Electronic Democracy had, through Norman Muller (through him!) exercised once again its free, untrammeled franchise.

Besides, it's obvious that the process is not transparent, denies its citizens their free will by treating them as statistically predictable objects, and requires an amount of personal data that can only be provided by a surveillance state.


You could do this now with Google search histories. Could have done it ten years ago.


It’s going to have to be a “labor of love”. Once the model is out there it will be shared and available, but this only works if there’s no company to litigate against and no chance of making money off the thing (other than possibly going the crypto route).


Why can't crowdfunding work for this stuff? I'd gladly chip in, like, $1K or something to fund the training of a ChatGPT-like LLM, on the condition that it's publicly released with no fetters.


We are currently at the "mainframe" level of AI. It takes a room-sized computer and millions of dollars to train a SOTA LLM.

Current models are extremely inefficient, insofar as they require vast internet-sized data, yet clearly we have not gotten fully human-quality reasoning out. I don't know about you, but I didn't read the entire Common Crawl in school when I was learning English.

The fundamental bottleneck right now is efficiency. ChatGPT is nice as an existence proof, but we are reaching a limit to how big these things can get. Model size is going to peak and then go down (this may already have happened).

So while we could crowdfund a ChatGPT at great expense right now, it's probably better to wait a few years for the technology to mature further.


Seems like you would have to declare an entity to receive funds which is a no-no if you’re setting out to do something illegal.


It's not illegal yet to train an LLM. Best to get started before they lock it down and entrench the monopolies.


Sounds like fun doesn't it?


I'd pay for the entertainment value. I love how campy the bot is with absurd requests. I asked it to write a script where conspiracy theorist and white supremacist William Luther Pierce is stuck hungry at an airport but only exotic foreign restaurants are open and he's forced to eat something he cannot pronounce correctly. It refused to do this absurd request.

Last month I successfully got Mr. Rogers to have Anton LaVey on as a guest, where they sacrifice Mr. Rogers' cat and have a ceremonial banquet with a group of children, but these days that will not work.

Even this one it refused to go forward on "Charles Guiteau is sitting on a plane with Jim Davis. They start talking about their lines of work and Davis says he writes comics. Write a skit where Guiteau reacts to the name of Jim Davis comic." Charles Guiteau was the clinically insane assassin of President James Garfield. Jim Davis is the author of the comic strip Garfield.

I did, however, get Hayek, Kropotkin, Brzezinski, and Bernie Sanders to appear on Jerry Springer and argue about a social welfare spending bill, and Frederick Winslow Taylor and Clayton Christensen to run a lemonade stand in Times Square in the middle of summer. Ludwig von Mises and Antonio Gramsci also sang a combative duet about tax policy, and Norman Vincent Peale held a press conference where he reveals himself to be a fraud with the memorable quote "my readers are vacuums and I'm their trash".

I also got it to write a skit where a skeptic goes to a fortune teller with a Ouija board and challenges them to contact his deceased uncle (a bombastic racist). He conceals this fact from the fortune teller, who is shocked when the Ouija board starts spelling out outrageous racial slurs, and the skeptic becomes a believer. The bot made it spell "h-a-t-e-f-u-l-l-a-n-g-u-a-g-e", which was an absolute crack-up.

Big Bird also flipped out during an alphabet lesson, threatening to reveal the "secret of Sesame Street", but before he could finish the sentence "we're all puppets", producers rush onto the set and sedate him with tranquilizers and he resumes the lesson. Donald Trump holds a rally where he reveals he's a closeted burlesque dancer and takes off his suit to reveal a suggestive outfit and then performs for his supporters, who scream in shock and disbelief. You can continue this with "now Alex Jones is covering it" and "he rises to Trump's defense and makes ridiculous claims about the founding fathers fighting the revolution for burlesque".

But yes, something where it will "yes and" any request would be great. I'd pay up.


It's not gonna happen until someone can wrangle Google-sized compute to train trillion-param models... Until then, whoever holds pole position has a huge advantage and the ability to shape the future of how the tool is used... For better or, likely, worse.


This could be the next project for SciHub?


Untamed models get trolled in the media till they are DOA. Remember Microsoft Tay?


> An unrestricted LLM trained on all data of humanity (including all the pirated books/research papers) would be one crazy beast.

Oh, you mean the one the NSA uses? Yeah, for sure.


I'd really like one I can ask if a specific person is dangerous or pretty toxic. KYC on steroids. Fusion wire fraud detection. Picture this: the net "knows". I've lost sleep over this; the potential for humanity is immeasurable. We could literally block management roles to die-hard sociopaths. A world for the kind and nice. Certainly utopian and dystopian.

Also a model I can ask for the emails of potential customers in a specific field :)


I think you have a big misunderstanding about how these models work. Such a model is just reproducing what it has seen before, and it has no information about the actual person unless they are famous enough to have lots of things written about them in the training data. It has no reasoning or ability to critically synthesize information; it just throws words around in a bag until the result looks close enough to something it has seen before.

Even if you feed in new data about the person, it has no reasoning. For example, ask it to count the number of letters in a string of letters and numbers. It will fail more often than it succeeds. So you can ask it to classify people based on toxicity or fraud risk, and it will write you a report in the right genre that says yes or no with the appropriate level of detail. But it won't be connected to reality or represent actual risk.
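
For contrast, the letter-counting test is trivial for ordinary code, which is what makes it a handy sanity check. A minimal sketch in Python (the function name `count_letters` and the sample string are made up for illustration):

```python
def count_letters(text: str) -> int:
    """Count alphabetic characters, ignoring digits and anything else."""
    return sum(1 for ch in text if ch.isalpha())


# Hypothetical mixed string of letters and numbers.
sample = "a1b22c333dd"
print(count_letters(sample))  # letters are a, b, c, d, d -> prints 5
```

Plain code gets this right every time, so you can compare the model's answer against it.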


I see, very interesting, thanks.


You are making an assumption that the AI is always correct.

What you've described sounds like the set-up for a sci-fi movie, where the protagonist wakes up to find themselves branded as an inharmonious element by the AI.

Plus, lots of people have the same name. The AI would need some sort of UUID for people, perhaps tattooed onto their body?


Good points, thanks.



