Hacker News | seiyak's comments

I've been working on a web crawler for a while. Once it's done, I'll start building a web search engine on top of the crawler. It hasn't been made public yet.

I use C, OpenMPI, Pthreads, and Valgrind on Linux.


My intention for YC would be different from yours, but you didn't start your project for YC, did you? I guess that you started working on it with the motivation to achieve your goal or prove your concept, besides becoming rich.

I got a rejection letter yesterday, but I won't give up on my idea until I prove the concept or decide it's worthless. It's OK even if YC rejects my concept.

So don't throw that away.


Can you tell us a little bit about your project to compete with Google, if that's OK with you? "try to compete with Google" is exactly what I'm doing right now, and I'm just curious.

So you were not able to crawl hundreds of thousands of websites because your crawler was disallowed by their robots.txt, while the major crawlers were allowed to do so?


"It turns out I'm not alone in adding these types of restrictions. Yelp blocks everybody but Google, Bing, ia_archiver (archive.org), ScoutJet (Blekko) and Yandex. LinkedIn also has a similar opt-in robots.txt, though they have whitelisted a larger number of bots than yelp."

At least, according to their robots.txt, we can contact/email Yelp and LinkedIn to ask whether our crawlers can be allowed to crawl. That's more generous than just allowing the big search engines such as Google and Bing. I'm not quite sure what actually happens if we ask them, though. I'll try that.
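
For anyone who wants to see how such an opt-in robots.txt plays out for an unlisted crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules below are a made-up example of the whitelist pattern, not Yelp's or LinkedIn's actual file, and the `MyCrawler` user agent and the `example.com` URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical opt-in robots.txt: named bots are allowed, everyone else is blocked.
# This is an illustration of the pattern, not a copy of any real site's file.
OPT_IN_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = OPT_IN_ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/biz/some-page"  # hypothetical URL
    for agent in ("Googlebot", "bingbot", "MyCrawler"):
        print(agent, is_allowed(agent, page))
```

With rules like these, a new crawler falls through to the `User-agent: *` group and is disallowed everywhere, which is why the only way in is to ask the site to add it to the list.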

