Using GeoPy to Filter MTA Subway Stations

For our first project at Metis we analyzed MTA turnstile data. Our task was to recommend stations for a fictitious non-profit to place canvassers to collect email addresses for an upcoming event. My group decided to recommend stations that were among the busiest in the city and in proximity to the largest tech companies and a handful of upcoming events.

I have experience using Geographic Information Systems (GIS) software so my mind went immediately to coordinate pairs. I volunteered to filter the stations for proximity and map the results after my group members compiled the necessary data. To do this I decided to use the Google Maps API to collect coordinates for all of the stations, companies, and event venues of interest. I then used the geopy Python library to calculate the distance from each station to each company and event. As a group we decided a tenth of a mile was a reasonable distance to filter by, so the result was a list of all stations within a tenth of a mile of either a company or an event. I then wrote the coordinates of the stations, companies, and events to a csv file to import into QGIS, an open-source GIS software.

Once I registered for a Google Maps API key, I started to play around with the API using the Python requests library. The result was a function get_loc that constructed a URL request from strings and extracted the latitude and longitude coordinates from the API query.

import requests

def get_loc(places,req1,req2):

    locs = {}
    counter = 0
    
    for place in places:
        r = requests.get(req1 + place + req2)
        
        try:
            results = r.json()['results'][0][u'geometry'][u'location']
            lat = results[u'lat']
            lng = results[u'lng']
            locs[place] = [lat,lng]
            
        except:
            pass
            
    return locs

get_loc takes in a list of locations to search for with the API and a string for each half of the URL request. The locations should be strings formatted to work well with URLs. This required swapping white-space characters for +, for instance. The function constructs the full request by placing the location in between each half. Here’s an example of a full request that queries for the Fulton St station.

https://maps.googleapis.com/maps/api/geocode/json?address=FULTON+ST+mta+subway+NY+NY&key=API_KEY

The output of the Google Maps API is a .json file. A portion of the output is below. You can see the full output here.

{
   "results" : [
      {
         "formatted_address" : "Fulton St Subway Station, New York, NY 10038, USA",
         "geometry" : {
            "location" : {
               "lat" : 40.709373,
               "lng" : -74.00832579999999
         },
         "place_id" : "ChIJZRVRExhawokR_OhyrnV9EtM",
         "types" : [
            "establishment",
            "point_of_interest",
            "subway_station",
            "transit_station"
         ]
      }
   ],
}

The .json file confirms that we are indeed looking at the Fulton St subway station. get_loc digs through the .json structure to access the "lat" (latitude) and "lng" (longitude) entries. It ignores any requests that fail. The requests rarely failed so I manually entered the coordinates for any that did. After gathering all of the coordinates for the subway stations I used get_loc to do the same for the companies and events.

Once I had all of the necessary coordinate pairs I was ready to start measuring distances. I came across the geopy library, which makes calculating distances between coordinate pairs very easy. I knew I was going to need to calculate the distance between many coordinate pairs so I wrote a function calculate_distance to automate the calculation.

from geopy.distance import vincenty
from geopy.point import Point

def calculate_distance(a,b):

    pointA = Point(a)
    pointB = Point(b)
    
    return vincenty(pointA, pointB).miles

calculate_distance simply takes two coordinate pairs and calculates the distance between them using geopy’s vincenty function. calculate_distance first converts each coordinate pair into a geopy point object using the Point function. It then calls the vincenty function on the two points. The vincenty function uses the Vincenty method of calculating the distance between two points by approxmating the earth as a spheroid, which it is. This method is more accurate than modeling the earth as a simple sphere. The vincenty function allows you to request the unit of distance with a simple appendage to the function call, which is remarkably convenient.

The next step was to write a function, in_proximity, to use calculate_distance to calculate the distance between several coordinate pairs.

def in_proximity(location,coords,radius):
   
    proximates = {}
    
    for entry in coords.keys():
        coord = coords[entry]
        distance = calculate_distance(coord,location)
        
        if distance < float(radius):
            proximates[entry] = "%.3f" % distance
            
    return proximates

in_proximity takes a single coordinate pair, location, a list of coordinate pairs, coords, and a distance, radius, in miles. The function cycles through each set of coordinates in coords and calculates the distance between those coordinates and location. If the distance is less than radius, the function stores the coordinates. in_proximity returns the set of coordinates within coords that fall within the specified distance from location.

The last step was to scale in_proximity up to take a list of locations. Instead of altering in_proximity I wrote a new function collect_proximates that calls in_proximity on a list of locations.

def collect_proximates(locations,coordinates,distance=0.1):
    
    proximate_collection ={}
    
    for name, coord in locations.items():
        proximate_collection[name] = in_proximity(coord,coordinates,distance)
        
    return proximate_collection

collect_proximates returns a dictionary of the locations with each set of coordinates in coordinates that is within the specified distance of the location.

In the context of the project, I passed collect_projects a list of the companies and event locations and the coordinates of every subway station. The result was the set of subway stations within a tenth of a mile of either an event or a company of interest.

Here’s a sample of the output.

'Aol': {'8+ST+NYU': '0.060',
        'ASTOR+PL': '0.050'},
'Bloomberg': {'GRD+CNTRL+42+ST': '0.081'},
'Facebook': {'8+ST+NYU': '0.055'},
'Google': {'8+AV': '0.100'},
'Microsoft': {'42+ST+PORT+AUTH': '0.069'},
'Oscar': {"B'WAY+LAFAYETTE": '0.053'}

From the output of collect_projects I was able to extract the list of stations that were within reasonable proximity to either an event venue or company of interest. I cross-referenced this list with the list of the busiest stations to find the intersecting set of stations. Our recommendation consisted of the list of these stations.

The final step was to map our results. I imported the coordinates of the events, companies, and recommended stations into QGIS using csv files. I was then able to plot the coordinates as points on a map of NYC.

MTA Map

The results are not suprising if you are at all familiar with the NYC subway. It’s always good to check if your results are reasonable, though.

Written on April 11, 2017