Cloud providers' pricing structures don't usually shine in edge cases, which I believe this project qualifies as. I would imagine the total cost to be prohibitive for the average hobby user, and that the author neglected to mention it either to hide this fact or because he received special pricing, whether because he works at Google or is closely affiliated with it.
Still really cool project. Just doesn't sell GCE very well for the use case it embodies, big-data hobby projects (although I'm sure it could be applied similarly to business problems).
In my experience, cloud is affordable, especially for bursty edge cases. It only gets unaffordable for hobbyists when you run instances 24/7 or push significant amounts of traffic, which gets really expensive.
But since the pricing is public (https://cloud.google.com/pricing/#pricing), we can check. Please double-check my calculations; I may have made a mistake somewhere (there's a rough Python sketch after the list below that redoes the arithmetic).
* "single 8-core Google Compute Engine (GCE) instance with a 2TB SSD persistent disk ... downloaded the books to the instance’s local disk" Unfortunately doesn't say how long it took. A n1-standard-8 instance costs $0.4 per hour without any discounts, plus a neglegible amount for 10 GB OS disk space. A 375 GB local SSD costs $0.113 per hour, so let's assume a total of about $1 per hour. Pretty affordable if you just run it for a day or so.
* "ten 16-core High Mem (100GB RAM) GCE instances (160 cores total) to process the books, each with a 50GB persistent SSD root disk" => 500 GB of persistent SSD root disk at $0.17/GB-month would be 85 per month, so about $3 per day for the storage. The instances are about $1 per hour each, so ~$10/hour. Affordable, but can cost a pretty penny if you need to let it run for more than a day.
* "single 32-core instance with 200GB of RAM, a 10TB persistent SSD disk, and four 375GB direct-attached Local SSD Disks" $2 per hour for the highmem instance, another ~$2.36/hour for the persistent SSD, plus the local SSDs, for a total of slightly under $5/hour. Again, dirt cheap if you need it for 1-2 hours, expensive if 1-2 hours turn into 10-20.
* cloud storage... no info on how much data it was, but $0.02/GB-month for durable reduced availability storage (which seems like a reasonable choice). For 10 TB, that would be ~$7/day ($200/month). There are additional costs for writing and access: 100,000 "Class A" operations (e.g. writes) cost $1, so with a few million files that's likely at least another $35 for writing them. Class B operations (reads) cost a tenth of that.
* traffic - inbound and internal traffic is free, so this is likely negligible if you just want to analyze a lot of data. However, getting the full data set out would likely be very expensive. $0.12 per GB quickly adds up - 1 TB would be $120, and 10 TB would be ~$1,110 (the first TB is billed at $0.12/GB, the next nine at $0.11/GB)! OTOH, if you just need 100 GB of results out, that's $12.
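To make this easier to poke at, here is a minimal Python sketch of the same arithmetic. It only encodes the per-unit rates quoted above (undiscounted, and they will drift over time) and my guessed runtimes, so treat the output as order-of-magnitude:

```python
# Rough re-check of the back-of-envelope numbers above, using the undiscounted
# rates quoted in this thread; check https://cloud.google.com/pricing/ for
# current values. The runtimes (a day here, two hours there) are my guesses.

HOURS_PER_MONTH = 730  # disks are billed per GB-month, prorated by time used

def persistent_ssd(gb, hours, rate=0.17):
    """Persistent SSD at $0.17/GB-month, prorated to the hours actually used."""
    return gb * rate * hours / HOURS_PER_MONTH

def egress(gb):
    """Internet egress: first TB at $0.12/GB, the next nine at $0.11/GB."""
    return min(gb, 1000) * 0.12 + max(gb - 1000, 0) * 0.11

# Step 2: ten ~$1/h highmem-16 instances plus 500 GB of root SSD, for one day
step2 = 10 * 1.00 * 24 + persistent_ssd(500, 24)
print(f"ten highmem-16 for a day: ~${step2:.0f}")

# Step 3: one ~$2/h 32-core highmem instance, a 10 TB persistent SSD and
# four 375 GB local SSDs at ~$0.113/h each, for two hours
step3 = 2 * (2.00 + persistent_ssd(10_000, 1) + 4 * 0.113)
print(f"32-core + 10 TB SSD, 2 h: ~${step3:.2f}")

# Egress: pulling the full 10 TB out vs. just 100 GB of results
print(f"egress, 10 TB vs 100 GB:  ~${egress(10_000):.0f} vs ~${egress(100):.0f}")
```

That prints roughly $243, $9.56, and $1110 vs $12, which lines up with the per-step numbers above.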
All in all, I'd say that "a couple hundred bucks" is a pretty reasonable estimate, which is still affordable for dedicated hobbyists (I just looked up the price of Märklin model locomotives, which also cost a couple hundred). Especially considering that you get $300 of free trial quota if you sign up - if you're fast, you may even be able to run this for free.
BigQuery pricing is highly dependent on what you query, but I'd call $5 per TB of data queried affordable (i.e. for $5, you can run 10 queries over a 100 GB dataset, and only the columns you actually touch in a query count toward the bytes scanned). And the performance is just insane.
Basically, in VM terms BigQuery lets you scale to thousands of cores in seconds, for just a few seconds, and you get per-second billing. Pretty cost-efficient :)
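As a quick sanity check on that $5/TB figure, here is a toy version of the on-demand query math (ignoring BigQuery's free monthly query quota, and treating a TB as 10^12 bytes):

```python
# On-demand BigQuery query cost at the $5/TB rate quoted above.
# Only the bytes in the columns a query actually scans are counted.
def query_cost(bytes_scanned, price_per_tb=5.0):
    return bytes_scanned / 1e12 * price_per_tb

# Ten queries that each scan ~100 GB worth of columns:
print(f"ten 100 GB queries: ~${10 * query_cost(100e9):.2f}")  # -> ~$5.00
```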
Great analysis - and if you are cost-sensitive, be sure to have a much smaller dataset to practice the pipeline and subsequent queries on too! Blowing $300 per failed attempt can get expensive.