
I like it! It seems to have a problem with half inches for me though and generates an error message. Latest Firefox/Ubuntu.


Thanks!!! If you feel comfortable with it, feel free to send me the measurements that you used so I can try to troubleshoot the problem! (or email me at easysloper@gmail.com)


I was contorting with a tape measure and typing them in, so my apologies I can't. :)


Haha, no problem. I should probably add a way to let people send me measurements that don't work anyway.


A big thank you for enquiring on my behalf!

A 503 would still require a GAE instance to be running, so it wouldn't necessarily deal with my problem.

I have seen "noindex nofollow" kill a site stone dead in the past, so I am very wary indeed of using it. In my experience, once you've noindexed a page it is nigh-on impossible to get the engine to index it again.

My content is autogenerated, though I hope it has enough value to be considered useful. It's time-series data of word frequencies in politics, so for example you might use it to see how one candidate is doing relative to another in an election campaign.


FWIW I think the main problem is that you're essentially creating an "infinite space," meaning there's an extremely high number of URLs that are findable through crawling your pages, and the more pages we crawl, the more new ones we find. There's no general & trivial solution to crawling and indexing sites like that, so ideally you'd want to find a strategy that allows indexing of great content from your site, without overly taxing your resources on things that are irrelevant. Making those distinctions isn't always easy... but I'd really recommend taking a bit of time to work out which kinds of URLs you want crawled & indexed, and how they could be made discoverable through crawling without crawlers getting stuck elsewhere. It might even be worth blocking those pages from crawling completely (via robots.txt) until you come up with a strategy for that.
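For example, a temporary robots.txt along these lines would do it while you work out that strategy (the path here is only an illustration based on the trend URLs, so adjust it to whatever you actually want kept out of the crawl):

    # illustrative only -- block the autogenerated keyword pages for now
    User-agent: *
    Disallow: /politics/uk/trends/
    # anything not matched above stays crawlable by default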


And one more thing ... you have some paths that are generating more URLs on their own without showing different content, for example:

http://www.languagespy.com/politics/uk/trends/70th/70th-anni... http://www.languagespy.com/politics/uk/trends/70th/70th-anni... http://www.languagespy.com/politics/uk/trends/70th-anniversa...

I can't check at the moment, but my guess is that all of these generate the same content (and that you could add even more versions of those keywords in the path too). These were found through crawling, so somewhere within your site you're linking to them, and they're returning valid content, so we keep crawling deeper. That's essentially a normal bug worth fixing regardless of how you handle the rest.
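I don't know what your stack looks like, so treat this purely as a sketch of the kind of fix I mean: in Python/webapp2 (the stock GAE framework), with the handler and helper names invented for illustration, you'd collapse every variant of a keyword path onto one canonical URL with a 301:

    # hypothetical sketch, not your actual code
    import webapp2

    def canonical_slug(requested_slug):
        # placeholder: however you map a requested slug to the single
        # canonical slug for that keyword, e.g. '70th-anniversary'
        return requested_slug.rstrip('/').split('/')[-1]

    class TrendHandler(webapp2.RequestHandler):
        def get(self, country, slug):
            canon = canonical_slug(slug)
            canonical_path = '/politics/%s/trends/%s' % (country, canon)
            if self.request.path != canonical_path:
                # permanent redirect, so crawlers stop treating each
                # variant as a separate page
                return self.redirect(canonical_path, permanent=True)
            self.response.write('trend page for %s / %s' % (country, canon))

    app = webapp2.WSGIApplication([
        (r'/politics/(\w+)/trends/(.+)', TrendHandler),
    ])

If redirects are awkward, a rel=canonical link in the page head gets you most of the same indexing benefit, though it won't stop the crawler fetching the duplicates in the first place.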


> A 503 would still require a GAE instance to be running so wouldn't necessarily deal with my problem.

And persistence to track how many crawl requests have been served in the last N minutes. Even blindly serving a million 503s an hour could get really expensive.
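For illustration, that bookkeeping could be as cheap as a shared memcache counter keyed by the current hour; a rough sketch (Python on GAE, with the threshold and names made up):

    # illustrative only: count Googlebot hits per hour in memcache and
    # answer 503 + Retry-After once over a self-imposed budget
    import time
    import webapp2
    from google.appengine.api import memcache

    CRAWL_BUDGET_PER_HOUR = 5000  # invented number

    class ThrottledHandler(webapp2.RequestHandler):
        def get(self):
            ua = self.request.headers.get('User-Agent', '')
            if 'Googlebot' in ua:
                key = 'crawl-count-%d' % int(time.time() // 3600)
                count = memcache.incr(key, initial_value=0)
                if count is not None and count > CRAWL_BUDGET_PER_HOUR:
                    self.response.set_status(503)
                    self.response.headers['Retry-After'] = '3600'
                    return
            self.response.write('normal page content')

Every one of those 503s is still a request an instance has to answer, though, so at best it caps the damage rather than removing it.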


Having a page that goes nofollow/noindex and back is fine; when we recrawl it, we'll take the new state into account.
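(For reference, that's the robots meta tag in the page head:)

    <meta name="robots" content="noindex, nofollow">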


I'd be concerned about adverse effects on my indexing. But it wouldn't fix my problem, as each request would still need a GAE instance to handle it.

(edit) Yes, the bot is still hitting the GAE site atm even though it's returning a quota error.


But would the bot make the same number of requests if it got an error?


Absolutely, even I can't say I'm deserving of a free lunch.


Really big :)

Its source is the English language, so if there's a word or phrase that gets used, it has a result. Corpus linguistics is fun like that.


If you have such a big site, you should avoid resource-priced cloud hosting and go with your own VPS. It might take you more time to set up, but it will definitely be much cheaper ($50/month or less), and it can surely handle all your traffic until it grows really big...


Yeah, see here for example for a 1TB/month traffic VPS: http://iwstack.com/, or unmetered traffic with a dedicated box: http://www.online.net/en/dedicated-server/dedicated-server-o...

The cost effectiveness really depends on whether your data would fit into that 1TB or it'd require much more.


Even if his data doesn't fit into that 1TB, he can always delegate storage to different VPSes and use a load balancer.

Saying "I have infinite data" is not an excuse for not looking for alternatives.


Absolutely, it's only coincidence that Google are both the host and the spider. I really would appreciate the ability to throttle it, though.


See my reply to jacquesm above. Very wary of blocking, as sometimes persuading the engine you've unblocked it afterwards is nigh-on impossible.


Do you have any idea if this service will attract any actual users? Based on what I've read, I can't even figure out what it does, so I am certainly not a potential user. But do you have any actual demand?

What I'm hearing is that you built a massive application, you've run into a technical problem, and now you would rather wait on Google to fix it than take any suggestions on how to get it up for actual users to use. Seriously, don't do this: at your stage, it would be better to have 10 real users than a site that has been fully indexed by Google.

On your note about persuading Google to index your site after being excluded, do you have any actual experience with this happening? I've been doing this kind of stuff for years and years and have never had a problem. It can take five or six weeks at the outside, but that is still less of a problem than a product that can't be accessed...


I don't think you understand: right now your site is completely useless. Worrying about death-by-bot-blocking isn't worth much if the site isn't there when you need it.


Unblocking via robots.txt is fine and won't cause problems.


Google doesn't respect crawl-delay, sadly. They rely on the Webmaster Tools setting, which is unavailable to me as I've described.
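For reference, the directive in question looks like this; some other crawlers honour it, but Googlebot ignores it and only offers the Webmaster Tools crawl-rate setting:

    User-agent: *
    # ask compliant crawlers to leave 10 seconds between requests
    # (Googlebot ignores this line)
    Crawl-delay: 10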


I am a little concerned about doing that, though. Having seen sites killed stone dead by people blocking stuff by mistake in robots.txt, and the engines then never looking at them again, I'm very wary indeed of blocking stuff I intend to unblock later.


It sounds like you care more about Google traffic than you care about real users. If you want to do this without having more than a few entries in your robots.txt and your sitemap, you could simply remove the other pages until you're ready to have them spidered; alternatively, put them behind a login and hand out invitations.

In a nutshell, if you put up millions of pages and tell Google about them, it will index you. If you don't want that, you'll have to make choices about the quantity and/or switch to a different kind of host.

Also, this kind of 'bot trap' tends to attract penalties, so if this is not some ploy to get traffic out of Google you may want to reconsider how you've laid things out. The difference between a legitimate site with a lot of generated pages and a page-spammer is hard to determine, and Google tends to err on the side of caution.


Since your site is down I can't see how it's organized, but I would think hard about having two separate parts to it. One would be a site that changes slowly, perhaps a descriptive page, maybe with a 'best of' or examples that you cull. Then have the main page that is rapidly updated and keep the 'bots out of that. There's no reason to index the rapidly changing map, is there? Just index a slowly changing pointer to it.


But don't you think the bot has already stopped crawling the site due to it being unavailable? That's probably gonna hurt too.


The original Sinclair ZX Spectrum manual featured a listing to play a few bars of Mahler's First, with the reader asked to play the whole piece as an exercise. Here's Matt Westcott completing that exercise with the help of a few friends and a table covered in Spectrums. In the process, it is also believed that the record for the number of networked Spectrums in one place was broken.


I can't get YouTube (at work).

Is this the event in Oxford, with a bunch of speccy 48s hooked up with one of these? http://spectrum.alioth.net/doc/index.php/Spectranet

(I think there would have been a Raspberry Pi in the mix somewhere as a network metronome)


Indeed it is, and using the Spectranet board. The Pi simply provided synchronisation; each Spectrum had the whole piece in memory as BASIC code.

We had a bit of a discussion about how it might have been done back in the day, using 1-bit input ports and a Spectrum doling out sync pulses.

Edit: Though the Spectranet's auto-loading of the code made things far easier than loading 12 Spectrums from tape would have been!


Fabulous, I've been waiting for the video since I saw the photos on Facebook :)

