Crunching subway data – a New Yorker’s busiest stations

ibejoeb · on June 5, 2013

I noticed some funny stuff about the data, but I haven't tracked down the source. I'd expect the entry/exit counts to increase over the course of the day, and they generally do. Sometimes, though, the latest record doesn't have the greatest number. I suspect it's due to odd cutoff times, i.e., readings not aligned to the day.

If anyone wants to play with the data, here's some stuff to get started with PostgreSQL:

  create table stats (
      ca varchar(10),
      unit varchar(10),
      scp varchar(10),
      dt timestamp,
      "desc" varchar(20),
      entries integer,
      exits integer
  );

  copy stats from '<path>/output.csv' delimiter ',' csv header;

Here's a query that will show the entire set of exit counts aligned with both their greatest and latest values for the day:

  select
      unit, scp, dt, exits,
      max(exits) over (
          partition by unit, scp, date_trunc('day', dt)
          rows between unbounded preceding and unbounded following
      ) as largest_exits,
      last_value(exits) over (
          partition by unit, scp, date_trunc('day', dt)
          order by dt 
          rows between unbounded preceding and unbounded following
      ) as latest_exits
  from stats
  order by 1, 2, 3;

If you want to see the discrepancies I described above, just wrap it up and find where latest <> greatest:

  with x as (
  select
      unit, scp, dt, exits,
      max(exits) over (
          partition by unit, scp, date_trunc('day', dt)
          rows between unbounded preceding and unbounded following
      ) as largest_exits,
      last_value(exits) over (
          partition by unit, scp, date_trunc('day', dt)
          order by dt 
          rows between unbounded preceding and unbounded following
      ) as latest_exits
  from stats
  )
  select *
  from x
  where largest_exits <> latest_exits
  order by unit, scp, dt;
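
A possible next step, as a rough sketch: compare each reading to the previous one for the same turnstile, so you can see exactly where a counter goes backwards instead of only knowing that the day's latest value isn't its largest. This assumes, like the queries above, that (unit, scp) identifies one turnstile and that exits is a cumulative counter; prev_exits is just a name made up for illustration.

  with y as (
  select
      unit, scp, dt, exits,
      -- previous reading for the same turnstile, across day boundaries
      lag(exits) over (
          partition by unit, scp
          order by dt
      ) as prev_exits
  from stats
  )
  select *
  from y
  -- keep only readings where the counter decreased
  where exits < prev_exits
  order by unit, scp, dt;

Partitioning by turnstile only (no date_trunc) means a drop that happens across midnight shows up too, which might help tell counter resets apart from cutoff-time effects.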

chimeracoder · on June 5, 2013

I'm a bit surprised to see Penn Station below Grand Central.

Penn Station is the most trafficked train station in North America[0], which I would imagine would lead to more subway entrances/exits, especially during rush hours.

Also, Penn Station has the A, C, E, 1, 2, and 3. Grand Central only has the 4/5/6[1]. The 4/5/6 are the only lines on the east side and are therefore fairly busy, but I find this surprising nonetheless.

[0] This includes non-subway trains.

[1] Don't even get me started about the T (i.e., the Second Ave. Line)! :)

bradleyjg · on June 5, 2013

Not every commuter takes the subway to and from Penn Station. The surrounding area has a lot of offices, so many people take light rail there and then just walk to and from work.

natesm · on June 5, 2013

NJ Transit and the LIRR are heavy rail (as are the subway and the PATH train).

mathattack · on June 5, 2013

As others mention, Grand Central has the S and 7 (crosstown) which connect to Times Square.

Grand Central gets 800,000 visitors per day, getting many commuters from Long Island and Connecticut.

Grand Central is also a much bigger stopping point. It's near many more office buildings. Penn Station is in an urban area, and next to the Garden, but I don't think it's quite as dense. Many folks would exit at Times Square.

edit: And OP - Thank you for sharing the data!

jwoah12 · on June 5, 2013

Grand Central also has the 7 and S.

benihana · on June 5, 2013

The 4/5/6 line is the busiest line in the country. It gets more traffic than the entirety of Washington's, Chicago's, Boston's and San Francisco's lines.

dcalacci · on June 5, 2013

I didn't know this data was available! I wish we had the same breadth available through the MBTA. I'd be interested in using this data to:

- plan travel directions based on past congestion patterns

- pair this with any data found for NY's bus system or taxi system to map out hubs vs destination stations

- examine stations' frequency of repair and see if congestion at a station correlates with the frequency of repair; try to predict dates of repair

very cool!