(Posting as separate comment because wall of text)
Also - honestly, I don't even understand why so many people would need to scrape for training data.
Is it naïveté (thinking it's necessary) or arrogance (thinking their work is better than others', and thus justified)?
Aren't most advances now primarily focused on either the higher level (agents, layered retrieval) or the lower level (e.g. alternatives to transformers, which would be easier to prove useful on existing datasets)?
Genuine questions, all of these - if I'm off the mark I'm keen to learn!