Hacker News | vagabund's comments

I'd push back on a couple things here.

The notion that Scale AI's data is of secondary value to Wang seems wrong. Data labeling in the era of agentic RL is more sophisticated than the pejorative picture of Mechanical Turk work outsourced at slave wages to third-world workers: it's about expert demonstrations and workflows, the shape of which is highly useful for deducing the sorts of RL environments frontier labs are using for post-training. This is likely the primary motivator.

> LLMs are pretty easy to make, lots of people know how to do it — you learn how in any CS program worth a damn.

This also doesn't cohere with my understanding. There are only a few hundred people in the world who can train competitive models at scale, and the process is laden with all sorts of technical tricks and trade secrets. That's what made the DeepSeek reports and results so surprising. I don't think the toy neural network one gets assigned in an undergrad course is a helpful comparison.

Relatedly, the idea that progress in ML is largely stochastic, so horizontal orgs are the only sensible structure, seems like a strange conclusion to draw from the record. Calling Schmidhuber a one-hit wonder, or saying "The LLM paper was written basically entirely by folks for whom 'Attention Is All You Need' is their singular claim to fame," neglects a long history of foundational contributions in the former case and misses Shazeer's prolific output in the latter. Alec Radford is another notable omission as a consistent superstar researcher. On organizational structure, OpenAI famously made concentrated bets, contra the decentralized experimentation of Google, and kicked off this whole race. DeepMind is significantly more hierarchical than Brain was, and from Pichai's comments, that seemed like part of the motivation for the merger.


- Could be wrong about Scale. I'm going off folks I know at client companies and at Scale itself.

- idk, I've trained a lot of models in my time. It's true that there's an arcane art to training LLMs, but it's wrong that it's somehow unlearnable. If I could do it out of undergrad with no prior training and 3 months of slamming my head into a wall, so can others. (Large LLMs are imo not that much different from small ones in terms of training complexity; the core loop is the same, as the sketch after these notes illustrates. Tools like PyTorch and libraries like Megatron make these things much easier, ofc.)

- there are a lot of fantastic researchers and I don't mean to disparage anyone, including anyone I didn't mention. Still, I stand by my beliefs on ML. Big changes in architecture, new learning techniques, and training tips and tricks come from a lot of people, all of whom are talking to each other in a very decentralized way.
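To make the "same core loop" point concrete, here's a minimal next-token-prediction training loop in PyTorch. It's a toy stand-in (random tokens, tiny model), not anything from a real run; scaling it up is mostly an infrastructure problem (parallelism, data pipelines, stability tricks), not a different algorithm.

    # Toy causal LM training loop; the loop itself is the same at any scale.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab, d_model, seq_len, batch = 256, 128, 64, 8
    embed = nn.Embedding(vocab, d_model)
    body = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
        num_layers=2,
    )
    head = nn.Linear(d_model, vocab)
    params = [*embed.parameters(), *body.parameters(), *head.parameters()]
    opt = torch.optim.AdamW(params, lr=3e-4)

    for step in range(100):
        tokens = torch.randint(0, vocab, (batch, seq_len + 1))  # fake token ids
        x, y = tokens[:, :-1], tokens[:, 1:]       # shift for next-token targets
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)  # causal
        logits = head(body(embed(x), mask=mask, is_causal=True))
        loss = F.cross_entropy(logits.reshape(-1, vocab), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()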

My opinions are my own, ymmv


Dude, you went to Columbia, you probably don't think people who went to state schools are even human.

Rest of the article was good


> you probably dont think people that went to state schools are even human

On the contrary: I've been quite vocal about why I felt my education was lacking and about the respect I have for those who have taken nontraditional paths.


Dude, going to a state school isn't a nontraditional path.


The Hugging Face Spaces link doesn't work, FYI.

Sounds awesome on the demo page, though.



We are in the process of fixing it! Thanks for letting us know :)


> I highly doubt you are getting novel, high quality data.

Why wouldn't you? Presumably the end user would try their use case on the existing model, and if it performs well, they wouldn't bother with the expense of setting up an RL environment specific to their task.

If it doesn't perform well, they do bother, and they have all the incentive in the world to get the verifier right, which is not an extraordinarily sophisticated task if you're only using rules-based outcome rewards (as R1 and R1-Zero do).
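For concreteness, here's a minimal sketch of what a rules-based outcome reward can look like for a math-style task. The "#### <answer>" format is illustrative (GSM8K-style), not R1's actual reward code:

    import re

    def outcome_reward(completion: str, gold_answer: str) -> float:
        # Score only the final outcome, not the reasoning, in the spirit of
        # R1-Zero. Assume the model is told to end with "#### <answer>".
        match = re.search(r"####\s*(-?[\d.,]+)\s*$", completion.strip())
        if match is None:
            return 0.0  # no parseable final answer -> no reward
        predicted = match.group(1).replace(",", "").rstrip(".")
        return 1.0 if predicted == gold_answer else 0.0

    print(outcome_reward("Half of 14 is 7.\n#### 7", "7"))  # 1.0

The hard part is covering your task distribution well, not the verifier logic itself.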


My understanding is that they built a performant suite of simulation tools from the ground up and then expose those tools via an API to an "agent" that can compose them to accomplish the user's ask. It's probably less general than the prompt interface implies, but it still seems incredibly useful.
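If that's right, the core mechanism is probably an ordinary tool-dispatch loop. A rough sketch, with invented tool names (I have no visibility into their actual API):

    import json

    def call_sim(tool: str, params: dict) -> dict:
        # Stub standing in for a real HTTP call to the simulation backend.
        print(f"sim <- {tool}({params})")
        return {"ok": True}

    TOOLS = {"set_gravity", "spawn_fluid", "advance_frames", "render_video"}

    def run_plan(plan_json: str) -> None:
        # plan_json would come from the LLM, constrained to the tool schema.
        for step in json.loads(plan_json):
            if step["tool"] not in TOOLS:
                raise ValueError(f"unknown tool: {step['tool']}")
            call_sim(step["tool"], step["params"])

    run_plan('[{"tool": "set_gravity", "params": {"g": 9.8}},'
             ' {"tool": "advance_frames", "params": {"n": 240}}]')

The generality would then come from the planner, not from the tools themselves.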


Still doesn't seem possible with current technology? It would have to access those APIs while it generates video.


It may just be my perception, but I seem to have noticed this steering becoming a lot more heavy-handed on Spotify.

If I try to play any music from a historical genre, it takes only about 3 or 4 autoplays before it has queued exclusively contemporary artists, usually performing a cheap pastiche of the original style. It's honestly made the algorithm unusable, to the point that I built a CLI tool that lets me get recommendations from Claude conversationally and adds them to my queue via the API. It's limited by Claude's relatively shallow ability to retrieve from the vast library on these streaming services, but it's still better than the alternative.
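The shape of it is roughly this. This is a simplified sketch, not my actual tool; it assumes ANTHROPIC_API_KEY is set, a Spotify OAuth token with the user-modify-playback-state scope in SPOTIFY_TOKEN, and the model name may need updating:

    import os, requests, anthropic

    def recommend(ask: str) -> list[str]:
        msg = anthropic.Anthropic().messages.create(
            model="claude-3-5-sonnet-latest",
            max_tokens=400,
            messages=[{"role": "user", "content":
                       f"Recommend 5 songs for: {ask}. "
                       "One 'Artist - Title' per line, nothing else."}],
        )
        return [l.strip() for l in msg.content[0].text.splitlines() if l.strip()]

    def queue(track: str) -> None:
        auth = {"Authorization": f"Bearer {os.environ['SPOTIFY_TOKEN']}"}
        hits = requests.get("https://api.spotify.com/v1/search",
                            params={"q": track, "type": "track", "limit": 1},
                            headers=auth).json()["tracks"]["items"]
        if hits:  # queue the top search hit on the active device
            requests.post("https://api.spotify.com/v1/me/player/queue",
                          params={"uri": hits[0]["uri"]}, headers=auth)

    for t in recommend("late-70s Jamaican dub, nothing contemporary"):
        queue(t)

The search step is where the shallow retrieval bites: Claude names a track, but Spotify's top hit for that string isn't always the right recording.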

Hoping someone makes a model specifically for conversational music DJing, it's really pretty magical when it's working well.


Spotify's recommendations are biased towards what you've listened to recently. Do you share the account with someone else?


No, but it's also biased toward their commercial partners. From this page [0], detailing their recommendation process:

> How do commercial considerations impact recommendations?

> [...] In some cases, commercial considerations, such as the cost of content or whether we can monetize it, may influence our recommendations. For example, Discovery Mode gives artists and labels the opportunity to identify songs that are a priority for them, and our system will add that signal to the algorithms that determine the content of personalized listening sessions. When an artist or label turns on Discovery Mode for a song, Spotify charges a commission on streams of that song in areas of the platform where Discovery Mode is active.

So Spotify is incentivized to coerce listening behavior toward contemporary artists that vaguely match your tastes, so it can collect the commission. This explains why it's essentially impossible to keep the algorithm in a historical era or genre, even one that's well defined and seeded with a playlist full of songs that fit the definition. It also explains why the "shuffle" button now defaults to "smart shuffle," letting them insert "recommended" (read: commission-generating) songs into your playlist.

[0]: https://www.spotify.com/ca-en/safetyandprivacy/understanding...


That's crazy; I'm skeptical of the legality here: I believe they're legally required to disclose when content is paid.

(I work in advertising, and we would never be allowed to introduce sponsored content into an organic stream like this without labeling it.)


The link they provided is the disclosure. You might be surprised to learn that this has been the business model of radio for years; it's why most radio stations that need profits play only recent songs, usually the same ones over and over until the labels push out new ones.


Is there a site that has hand-curated playlists? I would love that. Say I want to listen to Korean pop from the '90s, or minimal techno from the '00s.


Searching Spotify for user-created playlists is still probably your best bet. YouTube has some good results too.

Here are two that might fit what you're looking for:

'90s K-pop: https://open.spotify.com/playlist/6mnmq7HC68SVXcW710LsG0?si=...

'00s minimal techno: https://open.spotify.com/playlist/6mnmq7HC68SVXcW710LsG0?si=...

There are sites that convert playlists from Spotify to another service if you don't have it.


There are a few of them, like Filtr or Digster.

Usually I find them _by accident_ while browsing public playlists on Spotify.


Yeah, they absolutely do not use the Pile.


GPT-Neo and Llama were both trained on The Pile, and both of those were fairly influential releases. That's not to say they don't also use other resources, but I see no reason not to use The Pile; it's enormous.

It's also not everything there is, but for public preservation purposes I think the current archives are fine. If Google or Meta turn out to have been secretly stockpiling old training data without our knowledge, I'm not exactly sure what "we" would lose.


"Hawk-3B exceeds the reported performance of Mamba-3B (Gu and Dao, 2023) on downstream tasks, despite being trained on half as many tokens. Griffin-7B and Griffin-14B match the performance of Llama-2 (Touvron et al., 2023) despite being trained on roughly 7 times fewer tokens."



Are there any sites for viewing Twitter threads without signing up?




They're generating the audio. They use a series of techniques to automatically generate metadata for the speech samples in LibriSpeech, covering things like accent, recording quality, pitch, speed, and gender, then use an LLM to format these tags into comprehensive natural-language descriptions, yielding a more tunable model at inference time. This metadata-generation pipeline is the key insight, and it's what speech datasets were missing, unlike e.g. image datasets, which have obviously seen more rapid success.
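The pipeline shape is roughly the following. Every helper here is a stand-in stub, not the paper's code; the thresholds and prompt are invented for illustration:

    def mean_f0(audio): return 180.0          # stub: pitch tracker
    def words_per_sec(sample): return 2.8     # stub: ASR words / duration
    def snr(audio): return 30.0               # stub: recording-quality estimate
    def llm(prompt): return f"[LLM rewrite of: {prompt}]"  # stub

    def bucket(x, lo, hi):
        return "low" if x < lo else "high" if x > hi else "medium"

    def annotate(sample):
        tags = {
            "pitch": bucket(mean_f0(sample["audio"]), 120, 220),
            "speed": bucket(words_per_sec(sample), 2.0, 3.5),
            "quality": bucket(snr(sample["audio"]), 20, 40),
        }
        # The LLM turns discrete tags into the free-text descriptions the
        # TTS model is conditioned on, so it can be steered with prose later.
        sample["description"] = llm(
            f"Describe a voice in one natural sentence with properties {tags}."
        )
        return sample

    print(annotate({"audio": None})["description"])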

