Analyzing 1.1B NYC Taxi and Uber Trips

minimaxir · on Nov 17, 2015

Yesterday, I made a post about how to reconstruct the NYC map visualization from the 1.1B taxi dataset using ggplot2: http://minimaxir.com/2015/11/nyc-ggplot2-howto/

Looking at the code for the visualization, the author independently took a similar approach (with the same tools), but one that turned out slightly different, which is what makes things interesting.

It's worth noting that back in August, only the 2014 and 2015 datasets were released by the NYC TLC. I'm not entirely sure why they decided to release 2009-2012 now.

If you're looking to just play with the data, I recommend using the BigQuery approach as noted in my article, since downloading and processing ~300GB might take a while. However, the Shapefile approach used in the original article is the next logical step after that, and one that is put to very good use in the article.

aw3c2 · on Nov 17, 2015

If you are already using PostgreSQL it makes no sense to involve Shapefiles in analysis. Instead use PostGIS.

minimaxir · on Nov 17, 2015

You need the Shapefiles to tell where the districts in NYC actually are, though. (The GitHub repository in the OP's post contains the Shapefiles.)
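To make concrete what the district boundaries are needed for, here is a minimal sketch of assigning a pickup coordinate to a district with a ray-casting point-in-polygon test. The district names and rectangular outlines are made up for illustration; real boundaries would come from the Shapefiles, and a spatial database would normally do this lookup for you.

```python
# Hypothetical district outlines as (longitude, latitude) vertex lists.
# These rectangles are invented; real outlines would be read from Shapefiles.
DISTRICTS = {
    "district_a": [(-74.00, 40.70), (-73.95, 40.70), (-73.95, 40.75), (-74.00, 40.75)],
    "district_b": [(-73.95, 40.70), (-73.90, 40.70), (-73.90, 40.75), (-73.95, 40.75)],
}


def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray going east crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_district(lon, lat):
    """Return the first district containing the point, or None."""
    for name, polygon in DISTRICTS.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


print(find_district(-73.98, 40.72))
print(find_district(-73.80, 40.72))
```

This only shows the idea on a handful of points; at a billion rows you would want a spatial index rather than a Python loop.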

hadley · on Nov 17, 2015

Nice work!

apaprocki · on Nov 17, 2015

Can spikes be seen in the late-night data when new businesses open around the pickup address? In the Williamsburg section, the reason why the N 11th St block area is bright red in the observation is mostly due to the opening of a hotel/restaurant and two very popular electronic music clubs. Prior to those 3 businesses opening I can't think of any reason why anyone would be in that 1 block area late at night. Is there anything city agencies could do with this feedback loop of data after businesses open to assess their impact on an area? Liquor licenses? MTA?

lil_tee · on Nov 17, 2015

Sure, there are undoubtedly lots of examples of businesses that opened in desolate areas and created new taxi activity where there had been none previously. I just happened to focus on Williamsburg for the post.

Another idea that I didn't get around to doing was to look at concert venues and measure taxi traffic around particular concerts to see if it would correlate with bands' overall popularity.

untog · on Nov 17, 2015

Certainly, there was a huge change when the Barclays Center opened in Brooklyn.

superuser2 · on Nov 17, 2015

I'm happy to see that Uber is not releasing dropoff data. It's not terribly difficult to de-anonymize someone in a dump like this.

kctess5 · on Nov 17, 2015

This is some seriously scary data. It would not be hard to de-anonymize this data if you know a person's address, and I'm sure there's all sorts of nefarious activity that famous people/politicians/other people might not want to be public knowledge.

On that note, I think that "with a Vengeance" is a bit disingenuous, considering that this data could be used to personally attack people (but wasn't)... That's almost certainly not a bad thing, though.

mseebach · on Nov 17, 2015

I think "with a Vengeance" was a reference to the Die Hard scene analysed in the post.

samstave · on Nov 17, 2015

With 19,000,000 rides in a 6-month period for Uber, and an assumed average ride cost of ~$7, that would mean $133,000,000 in revenues. If Uber takes ~40%, that would be about $53,000,000, or roughly $9,000,000 per month, that Uber made in that period alone.

Wow.

mahyarm · on Nov 17, 2015

As a driver you get 70%-80% of the fare, so Uber & co. do not take 40%.
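For anyone who wants to play with the numbers in this subthread, here is a quick back-of-the-envelope sketch. The 19,000,000 rides and 6 months come from the comment above; the $7 average fare and the driver shares are guesses from the thread, not measured values.

```python
RIDES = 19_000_000   # rides over the period, as quoted in the thread
MONTHS = 6           # length of the period
AVG_FARE = 7.0       # assumed average fare in dollars (a guess, not data)


def platform_take(rides, avg_fare, driver_share):
    """Gross fares minus the share that goes to drivers."""
    gross = rides * avg_fare
    return gross * (1.0 - driver_share)


gross = RIDES * AVG_FARE
print(f"gross fares: ${gross:,.0f}")

# Driver shares: 60% matches the 40% guess, 70%-80% is the range quoted above.
for driver_share in (0.60, 0.70, 0.80):
    take = platform_take(RIDES, AVG_FARE, driver_share)
    print(
        f"driver share {driver_share:.0%}: "
        f"take ${take:,.0f}, about ${take / MONTHS:,.0f} per month"
    )
```

Swapping in a different `AVG_FARE` (say 15.0, as suggested further down) scales every figure proportionally.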

astazangasta · on Nov 17, 2015

If Uber is taking 40%, that leaves a tiny amount for the driver. The margins are hopefully much smaller.

samstave · on Nov 17, 2015

It was a wild guess on my part... I had no idea what their take was. I have no idea what the average cost of a ride is either; mine are likely higher than $7, but most of my rides are between $6 and $8.

thetwentyone · on Nov 17, 2015

Wouldn't $7 be on the low end? I would imagine that the average is somewhere closer to $15, as the distribution would be skewed to the higher fares.

ghaff · on Nov 17, 2015

A lot of interesting data here. Just eyeballing it, it appears as if Uber (plus green taxis) have significantly grown the number of taxi-like rides in Brooklyn and Queens relative to yellow taxis alone. However, in addition to Uber and green taxis being a small part of the Manhattan mix, it looks as if the number of rides they take may have come largely at the expense of yellow taxis.

1986 · on Nov 17, 2015

AFAIK, this data does not include livery (black) car trips. Based on personal experience, I think it's more likely that Uber trips have largely come at the expense of black car trips, which prior to the introduction of ridesharing companies and green cabs were the go-to method for getting a car ride in the outer boroughs. A lot of the car service companies have been rolling out app-based systems, see: http://www.nytimes.com/2015/08/12/nyregion/neighborhood-car-...

lil_tee · on Nov 17, 2015

My dataset does not include livery cabs.

FiveThirtyEight has some additional for-hire vehicle data in their GitHub which they obtained via FOIL request: https://github.com/fivethirtyeight/uber-tlc-foil-response/tr...

Intuitively, I would think livery cabs have lost significant market share to Uber, but I don't actually know.

zobzu · on Nov 17, 2015

The data is interesting, but the git repo is actually nice as well. Easy to read through and replicate for other kinds of stats. Thanks!

thro1237 · on Nov 17, 2015

Was the 300GB of data processed on a MacBook Air? How was the response time for queries?

lil_tee · on Nov 17, 2015

Yep, all on a 2012 MacBook Air. Data size was over 400GB with indexes.

Simple queries on indexed columns of the trips table take a minute or two. More complicated queries that require a full sequential scan can take up to a few hours.
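To illustrate the difference between an indexed lookup and a full scan, here is a small sketch using Python's built-in sqlite3 module rather than Postgres. The table layout, column names, and index name are hypothetical stand-ins, not the actual trips schema, and SQLite's planner output differs from Postgres's EXPLAIN, so treat it as an analogy only.

```python
import sqlite3

# In-memory toy table; the real trips table and its columns will differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (id INTEGER PRIMARY KEY, pickup_hour INTEGER, fare REAL)"
)
conn.executemany(
    "INSERT INTO trips (pickup_hour, fare) VALUES (?, ?)",
    [(i % 24, 5.0 + (i % 40)) for i in range(10_000)],
)
# Index only pickup_hour; fare stays unindexed on purpose.
conn.execute("CREATE INDEX idx_trips_pickup_hour ON trips (pickup_hour)")


def query_plan(sql):
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


# Filter on the indexed column: the planner can search the index.
indexed_plan = query_plan("SELECT COUNT(*) FROM trips WHERE pickup_hour = 3")
# Filter on the unindexed column: every row has to be visited.
full_scan_plan = query_plan("SELECT COUNT(*) FROM trips WHERE fare > 30")

print(indexed_plan)
print(full_scan_plan)
```

On a toy table both queries are instant; the gap only becomes minutes versus hours at the scale described above.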

ufmace · on Nov 17, 2015

All on internal storage? What was the max internal storage then, 512GB? Sounds like that's pretty tight, barely possible if you aren't doing anything else that takes up much space.

I wonder if it would be easier to stick Postgres on a server, maybe AWS or local, and just do the queries from the laptop. Or maybe in tmux on the server, so you can let a long query run without having to keep the laptop up.

lil_tee · on Nov 17, 2015

Yes, the database is all on the machine's local 512 GB hard drive. I did store the downloaded flat text files to an external drive and loaded them into the db from there.

yunti · on Nov 17, 2015

That is phenomenal analysis, well done!