timaelliott's comments (Hacker News)

What unit suffix is 'l'?


Twitter's attitude is laughable. Yes, you have a decent amount of traffic, but you're also only dealing with ~160 uncompressed bytes (plus whatever overhead) per event. The hurdles you've overcome aren't particularly amazing or challenging.


Your assumptions are laughable. Tweets have considerable metadata that pushes them far beyond 160 bytes. Fragmenting this into a secondary object is counterproductive due to the constant factor of 2 requests vs one larger payload.

Someone who's been around the block a few times understands that it's difficult to make pronouncements without informed observation. That you are not willing to extend twitter's engineering staff the benefit of the doubt considering your lack of visibility into their measurements speaks loudly.


What does content size matter? The challenge is that every single page of content except for each individual tweet is utterly unique for every user. That defeats the vast majority of straightforward caching implementations. You can't cache fully rendered pages ever because the chance that one random timeline view at a given time will be identical to any other view (even by the same person at a different time) is pretty much as close to zero as possible. Every view is dynamically generated content from up to several hundred or thousand different streams of data and needs to be put in order and have all of the per-user metadata set correctly.

Once you start looking into the actual mathematical constraints of the problem of twitter you realize that it's a scaling nightmare. Hundreds of millions of updates per day and tens of thousands of views per second (billions per day). There's only a few people in the world who have the right to look down on stats like that.
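The per-view merge described above can be sketched in a few lines. This is a hypothetical illustration (the function name and data shapes are mine, not Twitter's): each followed account is a stream of (timestamp, text) pairs sorted newest-first, and a timeline view lazily merges them on every request.

```python
import heapq
import itertools

def timeline(streams, limit=3):
    """Merge per-account streams of (timestamp, text) pairs, newest first.

    heapq.merge lazily merges already-sorted iterables; negating the
    timestamp key gives newest-first order across all streams.
    """
    merged = heapq.merge(*streams, key=lambda item: -item[0])
    return [text for _, text in itertools.islice(merged, limit)]

alice = [(30, "alice#3"), (10, "alice#1")]  # newest first within a stream
bob = [(20, "bob#2"), (5, "bob#0")]
assert timeline([alice, bob]) == ["alice#3", "bob#2", "alice#1"]
```

Every request re-runs a merge like this over hundreds of streams, with per-user metadata attached afterward, which is why a rendered page is never reusable for anyone else.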


Again, as the parent poster said, I don't think you have ever worked with large data. Twitter is like a big mailbox, except that every mail is only 160 bytes. This was solved 10 years ago.


If you don't understand that the request distribution matters more than payload size, you aren't even seeing the problems.

I encourage you to analyze infrastructure for a twitter style app using inbox duplication. Once you model this against hardware costs you'll learn something about how utterly expensive write amplification is in a hot data set that must be backed by RAM due to availability requirements.
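To make that concrete, here is a back-of-envelope sketch of inbox duplication (fan-out-on-write). Every number below is an illustrative assumption, not a measurement; only the 400M tweets/day figure appears elsewhere in this thread.

```python
# Back-of-envelope cost of fan-out-on-write ("inbox duplication"):
# every tweet is copied into each follower's materialized inbox.
tweets_per_day = 400e6    # figure cited elsewhere in the thread
avg_followers = 100       # assumed mean fan-out per tweet
bytes_per_entry = 300     # assumed tweet + metadata per inbox copy

inbox_writes = tweets_per_day * avg_followers   # write amplification
hot_bytes = inbox_writes * bytes_per_entry      # new RAM-backed data/day

print(f"{inbox_writes:.2e} inbox writes/day")        # 4.00e+10
print(f"{hot_bytes / 1e12:.1f} TB/day of hot data")  # 12.0
```

Under these assumptions, a single posted tweet turns into a hundred writes against a data set that has to stay in memory to meet read latency.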


Wait, see my comment below. Twitter received 15B (yes, B) API calls/day last July. How does that compare to your typical email client?

I don't want to argue that Twitter is astoundingly hard, but serving ~170K requests/sec can't really be that trivial, even if they're 160 bytes (they're not, since Twitter sends metadata, logs those messages, tracks service metrics, etc. for those messages)
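The arithmetic behind that rate is easy to check, using the 15B calls/day figure cited above:

```python
# 15 billion API calls per day, averaged over 86,400 seconds:
calls_per_day = 15e9
per_second = calls_per_day / 86_400
print(f"{per_second:,.0f} req/s")  # 173,611 req/s
```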


If you treat twitter like a big mailbox, things will work "ok". It's not the worst approach ever, that's for certain. But end-user perceptible performance would be a fraction of what twitter has today.

P.S. How many images does twitter serve up per day at present? That's a tad more than 160 characters of data.


Another key difference is that email users generally contribute directly to their provider's infrastructure costs in providing email as a service. Email infrastructure (and the user experience) is fragmented, and global funding generally scales with global load.

Instead twitter must monetize via advertising of some form, and so the percentage of folks who do not respond to ads acts as a really strong factor in your cost calculations. In this sense, email software has it easy, and can be extremely wasteful in the resources it consumes.

It's not just that the availability expectations of twitter are higher than email, it's also that the economic base of the infrastructure is far more sparse.


And yet another key difference is that there are few email server installations that support half a billion users. Saying that the scaling problem is "solved" because all you have to do is copy, say, gmail, is kind of silly.


> Yes, you have a decent amount of traffic but you're also only dealing with ~160 uncompressed bytes (plus whatever overhead) per event. The hurdles you've overcome aren't particularly amazing or challenging.

>42M uniques last month.[0] Are you really going to assert Twitter hasn't dealt with amazing or challenging hurdles in getting this far?

[0] http://siteanalytics.compete.com/twitter.com/

EDIT: this ignores that twitter.com is not the only Twitter client--they served 15B (!!) requests/day (!!!) as of a year ago.

Not to mention metadata, instrumentation for services, logging, DB backups, and managing configuration of all of those distributed resources. Are we still talking about the ease of 160B?

http://www.readwriteweb.com/hack/2011/07/twitter-serves-more...


Yes.

In 2008/2009, another engineer and I built an ad platform that received around 500M impressions per day and 5M clicks per day. And it wasn't just recording a tweet or publishing out to followers. We took the user's input query, had to do some keyword/relevancy targeting, geofiltering, and matching to advertisers, and deliver back a large result set of adverts. All within 100ms.

Our platform was also apache, mod_php, memcached, mysql and rabbitmq. So definitely not the most optimal of platforms by any means. We had two colos with ~20 servers (dell r410s) at each facility.

Twitter just recently announced 400M tweets/day. I'm not trying to brag about my experiences, because looking back now we made numerous amateur mistakes, but just showing that Twitter's "scale" is a joke compared to everyday challenges at any large internet ad network.


You understand that 400M tweets a day is the number of tweets posted to their system, right? That speaks not at all to the consumption of those tweets, which is the metric you're using for your ad platform.

Additionally, they don't just deal with 160 characters, because again, somehow you're still talking about data being posted, and not data being consumed. Data is consumed off their site via polling APIs, streaming APIs, and a website, all of which are pushing those 400M tweets a day out to plenty of consumers.

They may not have as ridiculous a scale as they act like they do. But let's be clear: it is nowhere near as trivial as you make it out to be, either. Armchair quarterbacking is always easy, because you aren't exposed to the complexity that arises when you've spent a few months and years hitting the corner cases of the problem you're commenting on.


So you had 500m reads on a relatively static data set + 5m writes on an unrelated log? Sounds like a fun problem, but I agree it doesn't sound like rocket science. On the other hand, it also doesn't sound like Twitter, which has 400m writes per day and 400*x million reads on that very dynamic data set. It just seems that's a slightly harder problem.


Adserving is not really static. Cachebusters are named so for good reason. Nowadays ad server developers are clever enough to separate click tracking and impression tracking (the non-Enterprise version of OpenX still deserves a lot of ಠ_ಠ though).

In an RTB environment, there is an additional constraint of having to serve up your ad (or decision) within 60ms (Google ADX sets a hard limit of 80ms), and the fastest best bid wins.

I don't think that's a less hard problem compared to Twitter, especially at high volumes. You can't just say "scale sideways!".

That said, the first link was totally misleading. I was actually quite shocked to see that Twitter only had 42M uniques per month, because a typical ad network does a lot more.

EDIT: ah.. 15B requests/day makes more sense. Wtf is with the wrong stats?


Are you talking about 15B vs. the visits chart I linked? If so, the 15B number comes from API calls, which do not have to happen through the website (think of all the Twitter clients).


Requests are requests. 15B is a gigantic amount.


Right I agree 100%. I just couldn't tell if you were trying to reconcile the 15B with the 45M number from Compete.


See my edit. Your ad impressions reached approximately 3% of Twitter's daily request load last year. Note that those requests can serve up to 200 tweets + metadata.

This doesn't account for Twitter's budding ad service, which one can assume has some of the same functionality (targeted advertising, information retrieval) as traditional ad networks.


You are off by nearly 2 orders of magnitude from twitter's scale. They have billions of views per day and each of those views is a stream comprised of hundreds of different sub-streams.


Add to that, the challenges of sub-60ms RTB. All the fun!


Decent amount of traffic ?

Sorry but the only thing laughable is that comment.



Yes, exactly. Travelling Salesman with distance being measured in dollars.


Need to add one more dimension. Distance is dollars, given a particular day of the year.


I use droplr for quick screen-sharing, pastes, file-transfers, etc. I do not use it for anything sensitive whatsoever.


They really should have more intelligent individuals manning public-facing support channels.


The person (Bruno) replying on behalf of Droplr is "in charge of the whole server-side circus — API server, system administration, web app backend —, the SDK libraries for third-party clients, the Windows app and the iOS app." http://biasedbit.com/about/

(I realize there is a risk of sending a mob by linking to his personal page, but I think there is evidence he is indeed "intelligent" and simply misunderstood this particular piece of the puzzle. He just needs a bit more humility.)


Redis is a key-value store. S3 is a distributed file system.

Can we stop labeling the set of "not an RDBMS" data storage mechanisms with the stupid fucking "NoSQL" moniker?


To be fair I've hardly seen an article on the benefits of NoSQL on the frontpage for a while.


First it was No SQL. Then it was Not Only SQL. Then it was, we have to get some real work done, Need Our SQL.


I really wish Dropbox would add the ability for arbitrary directory syncing. This single "dropbox" folder is so annoying.

Until then, I'm quite content with one of their competitors.


You could use symlinks to add arbitrary folders to your dropbox.


Yeah, just not cutting it for me. I understand the workarounds but the reality is they shouldn't be required. I should be able to right-click a directory and "Add to Dropbox"


They do have 'Selective Sync' but those folders do have to be in your Dropbox folder for that. So not exactly what you are looking for unless you re-organize to use that structure.


> For our geo-search API, we used PostgreSQL for many months, but once our Media entries were sharded, moved over to using Apache Solr. It has a simple JSON interface, so as far as our application is concerned, it’s just another API to consume.

Does anyone have particular insight to share on this? Last I checked, Solr's geospatial searching methods are rather inefficient -- haversine across all documents, bounding boxes that rely on haversine and Solr4 was looking into geohashes (better but have some serious edge-case problems where they fall apart).

Meanwhile PostgreSQL offers r-tree indexing for spatial queries and is blazing fast.

Am I missing some hidden power about Solr's geospatial lookups that make it faster/better than an r-tree implementation?


It probably was the database sharding. If the Solr setup could handle the geo-search-related data without the need for sharding, it could well beat out Postgres with sharding.

Having this exposed through an API that is standardized and maintained by someone else is also nothing to sneeze at. I'd trade a bit of performance for that kind of standardization and turnkey use in the right scenario.


The reason we use Solr for this specific task is because PostgreSQL cannot efficiently and quickly merge two index queries (time & space). It can do this to a limited degree, but both of these dimensions potentially match 10s of millions of documents, and PG falls over at this.


So you make the r-tree 3 dimensions (lat,lng,time). PostgreSQL supports this.

I dunno I can't envision Solr being more efficient than a properly designed RDBMS for these situations. If you were integrating a full-text search I'd absolutely believe that to be the case but...


We need independent time & geo searches as well. The indexes are vastly smaller in Solr. We use PostgreSQL extensively and prefer it, so it's not a matter of simply wanting to use something different.


That's very interesting. Could you share your story with the mailing list pgsql-hackers a little bit? The guys who work on indexing are quite active on those lists.

Also, there's some new thing I don't understand super well, sp-gist, do you have any thoughts on that?


I'm no Solr expert, but bug SOLR2155 has a patch [1] that does a geospatial search using geohash prefixes [2].

As far as I can tell, you take the point's latitude and longitude and interleave the binary bits - so if your record's latitude is 11111111 and your longitude is 10000000 your geohash is 1110101010101010. You index on that, then when you do a spatial search for the point nearest to 11111110,10000011 you look up key 1110101010101101 and a prefix search finds the closest value in the index is the record you inserted earlier. Presumably then you realize there could be an even closer record at 11111111,01111111 which would have got stuck at 1011111111111111 in the index so you look there too just in case, take the closer of the two search results, and bob's your mother's brother.

[1] https://issues.apache.org/jira/browse/SOLR-2155?focusedComme... [2] http://en.wikipedia.org/wiki/Geohash
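The bit interleaving described above is easy to demonstrate with the comment's own example values. A minimal sketch (real geohashes subdivide floating-point latitude/longitude ranges and encode to base-32 characters; this shows only the core interleaving idea, with latitude-bit-first ordering as the comment assumes):

```python
def interleave(lat_bits: str, lng_bits: str) -> str:
    """Interleave two equal-length bit strings, latitude bit first."""
    return "".join(a + b for a, b in zip(lat_bits, lng_bits))

# lat 11111111, lng 10000000 -> key 1110101010101010
assert interleave("11111111", "10000000") == "1110101010101010"
# the "just in case" neighbor: lat 11111111, lng 01111111
assert interleave("11111111", "01111111") == "1011111111111111"
```

Nearby points share long key prefixes, which is what makes the prefix search work - except across grid boundaries, hence the second lookup.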


Hrm, so for a proximity search it basically has to take a combination of all potential encompassing geohashes and then do a second-pass (substantially reduced data set) using a haversine approach or something.

I suppose that might work pretty well.
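For that second pass, the exact filter over the reduced candidate set is typically a haversine distance. A minimal sketch (the function name and mean Earth radius are my choices):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# One degree of longitude at the equator is roughly 111 km.
assert 111 < haversine_km(0, 0, 0, 1) < 112
```

Running this over every document is what's slow; running it only over the geohash-prefix candidates is what makes the two-pass approach viable.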


Paypal hates everyone, not just developers.


You just made me stop and think where this hate is coming from. I am imagining a probable scenario where Paypal employees hate their jobs, so their hate is transferred to everything they do for Paypal. The end product would be the hate we feel when using their services.


Probably not. The reason Paypal appears to hate its customers is because a not-insignificant number of them are trying to steal Paypal's money. A poor developer API, on the other hand, is probably due to bad programmers. I've worked with a lot of bad programmers at other jobs, and one thing I've noticed is that they're not very good at coming up with APIs. (Why does Paypal have bad programmers? I don't know: why don't you work for Paypal?)

Also, much of the complaining about Paypal seems to be in the form "I violated Paypal's ToS, and then they closed my account." Well, maybe you should read all those legal documents before entering into a business agreement?


Very true re the not-insignificant number trying to steal from them. PayPal's business is as much fraud detection and avoidance as it is payment processing, and they're very, very good at the fraud detection bit.

The payment processing part seems to be great until it goes wrong, but when it does, they have the minimum resources possible devoted to dealing with it.


I can absolutely confirm that paypal hates me. I'm just a regular customer that buys stuff online sometimes.

I've also had a lot of clients that used paypal to sell stuff. Paypal hated them too.


Author has either been making some ridiculously inappropriate system apps or is about to start making equally questionable web apps.

Obviously languages can serve multiple purposes but the intended uses of these have almost no overlap.


Why do you think Go is not suited for web apps? At the end of the day web apps are just servers, and Go is very well suited to create reliable and efficient servers. Go is also one of the languages supported by Google App Engine to create web apps.


> Why do you think Go is not suited for web apps?

Languages specialize. PHP is suited for the web because it is a dynamic language, stuff is compiled right when it is accessed, and there is no memory sharing between requests.

Go, C, C++, etc. are static languages, stuff needs to be recompiled for even a minor change, and of course, memory is shared for each request.

And that's just the tip of the iceberg.

Obviously you can bring all functionality that the web demands to Go. A JIT compiler might help, maybe some loose typing, a few other widgets ... and then, guess what? You turned Go into PHP.

For each job, a proper tool.


> PHP is suited for the web because it is a dynamic language. Stuff is compiled right when it is accessed. Great if you need to quickly make a change here or there

I don't think this is really a problem. Although dynamic vs. static typing advantages/disadvantages are a whole 'nother argument, the reason you might choose (say) Python over Clojure for a web app is the ecosystem (libraries, frameworks, deployment), performance (is this a concern for your design?), the syntax (do you like it?) and tool support (IDEs, etc.).

> Obviously you can bring all functionality that the web demands to Go. A jit compiler might help, maybe some loose typing, a few other widgets ... and then, guess what? You turned Go into PHP.

You could say the same about Ruby or Python. Both built up a lot of web-centric libraries and packages. But neither has "turned into" PHP.

Yes, some languages reduce the barrier to entry for certain tasks. But that doesn't mean other languages are a bad choice; they likely have other benefits worth considering.


I get the gist of what you're saying but Go's compile speeds are ridiculously fast. Switching into the terminal takes me longer for my projects.


Go, C, C++, etc. are static languages. Make a little change here? Gotta recompile. Big project? Might take a while.

Not with Go. Go compiles super fast.


How about using a scripting language inside your web app written in C/C++, just like game development has done for decades? It's not like a web app is much different from a computer game.


When you use Go with the AppEngine SDK your program is automatically recompiled on a request if it was changed. And you don't even realize it because Go compiles blazingly fast.


Ignoring the weird ideas about dynamic languages:

> memory is shared for each request

What does this even mean? Something like an Apache module (as an example in C) may or may not share memory depending on how it is written, but there is nothing that means it must share memory.

All sensible web programming environments have a stateless programming model (if that is what you meant). Some add state on top of that using sessions etc, but that can be in any language.


    All sensible web programming environments have a stateless programming model
By default, Go doesn't. The default HTTP server in Go works by calling into your code via a callback, and you are free to maintain any state you wish across requests.[1]

I don't consider this a problem in Go, because the good module system and scoping rules mean that global state is explicit and obvious, and a good concurrency model means that accessing global state can be done safely. Global state can be useful and performant, so I'm glad Go lets you take advantage of it, but it might rule it out as a "sensible web programming environment" by your definition.

[1] http://golang.org/pkg/net/http/#ListenAndServe


Yes, that is a fair point. Perhaps I could have better said that all sensible web programming environments give you access to a stateless programming model.

