Hacker News | jakosz's comments

Now we can start guessing what futures they are betting on: those in which open-sourcing the whole thing commoditises critical complements.

---

https://www.gwern.net/Complement


Seriously, why?


You can get very good improvements over Spark too. I've been using GNU Parallel + redis + Cython workers to calculate distance pairs for a disambiguation problem. But then again, if it fits into a few X1 instances, it's not big data!
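As a rough illustration of how a distance-pair job like this can be split into independent work units, here is a minimal sketch. It is not the commenter's code: the queue (redis) and the compiled workers (Cython) are replaced by plain in-process Python, and the names `pair_chunks` and `work` are hypothetical.

```python
from itertools import combinations, islice
from math import dist


def pair_chunks(n_items, chunk_size):
    """Yield lists of (i, j) index pairs with i < j, at most chunk_size pairs each."""
    pairs = combinations(range(n_items), 2)
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk


def work(points, chunk):
    """Stand-in for one worker: Euclidean distance for each pair in the chunk."""
    return [(i, j, dist(points[i], points[j])) for i, j in chunk]


# Toy data (assumption): four 2-D points, processed two pairs at a time.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
results = [row for chunk in pair_chunks(len(points), 2) for row in work(points, chunk)]
```

In a setup like the one described, each chunk would presumably be pushed to a queue and picked up by a separate worker process; here the chunks are simply consumed in a loop.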


The 'Some conclusions' section reminds me of a short piece PG once wrote on the regrets of the dying. I find it amazing how consistent some of the points are -- happiness is your choice, cultivate friendships, say what you think, and don't obsess over work.


Grep for the internet.

What I often want is not a search engine, not a recommender, but a filter: something that lets me look at the distribution of content on the Web rather than something that tries to answer my questions. I badly wanted to pay someone a few quid for a service like this, but had to build it myself.

Feel free to piggyback on the next batch job; use fBd7guQLDLx6RIm00GE7uH5h0Lk1CKKl as access key.

https://alpha.crawlfilter.com/
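To make the "filter, not search" idea concrete, here is a minimal sketch of grepping a set of crawled pages and reporting where the matches fall. It makes no claim about how the service above works: the function `filter_distribution`, the input format (URL, text pairs) and the sample data are all assumptions for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def filter_distribution(pages, pattern):
    """Count, per host, how many pages contain a match for `pattern`."""
    rx = re.compile(pattern)
    counts = Counter()
    for url, text in pages:
        if rx.search(text):
            counts[urlparse(url).netloc] += 1
    return counts


# Made-up crawl sample (assumption): (url, page text) pairs.
pages = [
    ("https://a.example/1", "spark and redis"),
    ("https://a.example/2", "nothing relevant"),
    ("https://b.example/x", "redis cluster notes"),
    ("https://a.example/3", "more redis"),
]
hits = filter_distribution(pages, r"\bredis\b")
```

The result is a distribution over hosts rather than a ranked list of answers, which is the distinction the comment draws.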


Cool.

Suggested secondary source: https://archive.org/details/alexacrawls?&sort=-publicdate&pa... (spotty; sometimes the crawls are dark and can't be read)

Also: when you get lucky with ACD: https://redd.it/5s7q04 (I've heard of other users getting hard-capped at 100TB, though)


I struggle to see how data-driven bias is unexpected.


This is already an issue in many areas, perhaps most troubling with regard to scientific data (e.g. http://www.sciencemag.org/content/331/6018/694.short).


If you're going to perform a comprehensive scan (i.e. not just sample Alexa's 1M), there's a lot of crap waiting for you in the long tail -- you may want to use my subset of Alexa's rankings instead, which contains only names that have been on the list for the last 322 days (it's ~700K rows): http://www.szejda.pl/pub/alexa-20130313-20140128.bz2
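A subset like this can be built by intersecting daily snapshots of the list. The sketch below assumes that "on the list for the last 322 days" means present in every daily snapshot, which the comment does not spell out; `persistent_names` and the sample data are hypothetical.

```python
def persistent_names(daily_lists):
    """Return names present in every daily list, sorted alphabetically."""
    if not daily_lists:
        return []
    common = set(daily_lists[0])
    for names in daily_lists[1:]:
        common &= set(names)  # keep only names seen on this day as well
    return sorted(common)


# Three made-up daily snapshots (assumption); a real run would load one file per day.
days = [
    ["example.com", "example.org", "spam.example"],
    ["example.org", "example.com", "other.example"],
    ["example.com", "example.org"],
]
stable = persistent_names(days)
```

Names that appear only briefly, as `spam.example` does here, drop out of the result.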


Dropbox is still my favourite: http://dropbox.com/404.html

