Exploring NYC Transit Data

This project uses the NYC Transit turnstile data I thought it would be fun to use this data to answer a question for a hypothetical client.

My hypothetical client is the Manhattan School of Music, and they have asked me to find where are the best locations in NYC for their students to perform to gain exposure and hopefully make some money. Assuming that musicians perform at subway stations and dancers perform on subway lines, I'm looking to answer two questions:

  1. What are the best stations for musicians to perform?
  2. Which are the best lines for dancers to perform?
  3. The MTA website provides data on the number of Entries and Exits for each turnstile in each station by day. Each file contains data for one week, and data is updated weekly. Since class at the MSM started last week, I looked at the most recent week's data. I used pandas to read and analyze the data and seaborn to visualize it.

    First I import the libraries, and read in the data.

        import matplotlib.pyplot as plt
        import pandas as pd
        import numpy as np
        import seaborn as sb
        import datetime as dt
        df = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_170114.txt')

    This data is unique at the turnstile level. Each record contains a cumulative count of Entries and Exits for a turnstile at a station on a single day of the week and is updated every 4 hours. There is metadata about each station including how many lines it serves and the line the station originally belonged to when the subway lines were first built. I want to test the the assumption that a station that is serves more lines will have more entries than a station with less. We'll look at entries because we want to know when people will be in the subway station not leavingy

    Cleaning the Data

    This data is a bit messy. The entry counts are cumulative, so I need to calculate their deltas. It is also hard to aggregate in a uniform way because different turnstiles are audited at different times, and the counters will randomly reset which screws up the deltas. In order to resolve this issue, I make an assumption that no more than a person per second can enter a turnstile, so I filter my deltas down to 86400.

    I then extracted some new features from the data like the number of lines per station, and whether or not the counts took place on a weekday or weekend. I normalized the delta (entry counts) by the number of lines per station for future analysis of entries per line.

        df = pd.read_csv('http://web.mta.info/developers/data/nyct/turnstile/turnstile_170114.txt')
        #Calculate # of Entries
        df['DELTA'] = df['ENTRIES'].diff().abs()
        df.fillna(0, inplace=True)
        #filter out outliers
        mask_entry = (df['DELTA'] > 86400)
        df.ix[mask_entry, 'DELTA'] = 0
        #Calculate Number of lines
        df['LINE_COUNT'] = df['LINENAME'].map(len)
        #Normalize station counts by # of lines
        df['ENTRIES_PER_LINE'] = df['DELTA']/df['LINE_COUNT']
        df['DATE'] = pd.to_datetime(df['DATE'], format = '%m/%d/%Y')
        #Aggregate Total Entries and Entries per Line by Station and Date
        df_station=df.groupby(['STATION', 'DATE', 'LINENAME', 'LINE_COUNT'])['DELTA','ENTRIES_PER_LINE'].sum().reset_index()
        #Get weekend boolean field
        def is_weekend(row):
        weekend = ['2017-01-07', '2017-01-08']
        weekend_days = [dt.datetime.strptime(x, "%Y-%m-%d") for x in weekend]
        if row in weekend_days:
            return True
        else:return False
        df_station['is_weekend'] = df_station['DATE'].map(is_weekend)   

    Visualizing the Data

    I wanted to see how the number of Entrances vary across the number of lines a station has.

    This chart shows the median number of entries in the week across all stations by their "line number group". I created this plot to answer the question- do stations with more lines see a greater number of entrances than the stations with less lines? I looked at the median to make sure that outliers like Grand Central Station weren't skewing the distribution.

    Intrestingly enough, this bar chart shows that the answer to that question is not necessarily. I would show my client this chart and say "I'd recommend musicians stick to stations with 4, 7, or 8 lines."

    When I split this view by weekdays and weekends in a scatterplot, a couple of other insights become clear:

    • Overall ridership is down on the weekends
    • There is not much of a signal that the number of lines a station serves is related to the number of entrances to that station
    • Most stations serve 5 or less lines

    Finding the Best Lines

    Now let's take a look at the best subwah lines for dancers to perform on to gain for the biggest audience. I split this data up by weekday vs weekend as well.