The Traveling Hacker


Chronicles of an international programmer

Monitoring your Uber prices

From a tiny ETL using Python and Uber's public API to data visualization using D3.js

Have you ever wondered how your Uber prices evolve throughout the day, week, month or even year?
You might have already noticed that prices can go from "Awesome, that's cheap, I love this app!" to "Oh hell no! I ain't paying for that sh*t!" in a matter of minutes. This is because of Uber's surge pricing: fare rates automatically increase when the demand for rides is higher than the number of drivers around you.

Uber's prices surge to ensure reliability and availability for those who agree to pay a bit more. Surge pricing also encourages more drivers to get back on the road and earn more money. Usually, a surge only lasts for a few minutes, depending on the demand and the number of available drivers in your area. Uber used to show surge rates in the app. However, that has changed (at least in Paris): nowadays you can feel the surge, but you cannot see it anymore, like the Force...

In this tutorial, I will show you how to use Python to extract data from Uber's public API, transform it, and load it into a csv file that will be read by D3.js for visualization.

Getting a Personal Access Token

First, you need to create an Uber app. Go to your developer dashboard and click on NEW APP. Pick the Rides API, give your app a name and a description, agree to the terms of use and click on CREATE.

Now that you have access to your app, you need to give it access to your profile: go to AUTHORIZATIONS, put http://localhost under REDIRECT_URL, and put any URL to your app's privacy policy under PRIVACY POLICY URL, something like this: https://gist.github.com/Perados/f8c231151c67e8a02b75b2a4b2967268 (any URL would work). You will not actually use either of these, but the fields still need to be filled. Finally, under GENERAL SCOPES pick only profile and click on SAVE.

The next thing you need to do is get a Personal Access Token. Since you are going to access your own data, you do not need to configure anything else or bother with OAuth 2. Under TEST WITH A PERSONAL ACCESS TOKEN, click on GENERATE A NEW ACCESS TOKEN. Keep this access token in a safe place and do not share it with anybody.

Now you are ready to make authenticated calls to Uber's public API. For the next part of this tutorial, you will need your CLIENT ID, CLIENT SECRET and the ACCESS TOKEN you just generated.
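If you want to sanity-check your token before writing the full script, you can call the /v1.2/me profile endpoint yourself. Here is a minimal stdlib sketch; `build_request` is a helper name of my own, and the commented-out call at the bottom requires network access and a valid token:

```python
import json
import urllib.request

API_BASE = 'https://api.uber.com'  # Uber's public API root

def build_request(path, token):
    """Build an authenticated GET request for Uber's API (Bearer token auth)."""
    req = urllib.request.Request(API_BASE + path)
    req.add_header('Authorization', 'Bearer ' + token)
    req.add_header('Accept-Language', 'en_US')
    return req

# To check your token, fetch your own profile (requires network access):
# profile = json.load(urllib.request.urlopen(build_request('/v1.2/me', 'your_access_token')))
# print(profile)
```

If the token is valid, the response contains your first name, last name and email.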

Extract, Transform and Load

You will need to install two dependencies: geopy, to get your addresses' GPS coordinates, and uber_rides, to make API calls easily:

$ pip install geopy uber_rides
Successfully installed geopy-1.11.0 pyyaml-3.12 requests-2.12.5 uber-rides-0.3.1

Now to the interesting part: the code.
The script will be composed of a few imports, some global variables and three functions: the first instantiates an API client, the second gets the product id you want to monitor (in my case, uberX; yes, that is how it is called in Paris...), and the third asks Uber for the price of your ride (Extract), adapts the output (Transform) and writes it into a csv file (Load).

Configuration

The configuration could go in a configuration file, a JSON or YAML file for example, but in order to keep all the code in one file, I decided to put it at the top. Please, if other people are going to read your code, do not forget to use environment variables to hide your personal information.
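For example, a minimal sketch of the environment-variable approach (the UBER_* variable names are my own choice, not something Uber requires):

```python
import os

# Read secrets from the environment; the second argument is only a
# placeholder fallback so the script still runs during local testing.
ACCESS_TOKEN = os.environ.get('UBER_ACCESS_TOKEN', 'your_access_token')
CLIENT_ID = os.environ.get('UBER_CLIENT_ID', 'your_client_id')
CLIENT_SECRET = os.environ.get('UBER_CLIENT_SECRET', 'your_client_secret')
```

You would then export the variables in your shell (or in your crontab) before running the script, and the secrets never end up in version control.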

#!/usr/bin/env python

import csv
import datetime
import os

from geopy.geocoders import GoogleV3
from uber_rides.session import Session, OAuth2Credential
from uber_rides.client import UberRidesClient

# fill the following global variables with your values
ACCESS_TOKEN = 'your_access_token'
CLIENT_ID = 'your_client_id'
CLIENT_SECRET = 'your_client_secret'
PRODUCT_NAME = 'uberX'  # change this according to your city

# define your addresses and give them a short name
START_PLACE = ('short_name_1', 'start_address')
END_PLACE = ('short_name_2', 'end_address')

GEOLOCATOR = GoogleV3(timeout=5)  # we will use Google maps API

# use the addresses to get the gps coordinates
START_GEOCODE = GEOLOCATOR.geocode(START_PLACE[1])
END_GEOCODE = GEOLOCATOR.geocode(END_PLACE[1])

# define the output file and its headers
OUTPUT_FILE_PATH = 'squirrel_monitoring.csv'
CSV_HEADERS = [
    'date',
    'start_place',
    'start_latitude',
    'start_longitude',
    'end_place',
    'end_latitude',
    'end_longitude',
    'distance_estimation',
    'duration_estimation',
    'price',
]

...

Authenticate

There is no magic here. You will just use the credentials you defined in the configuration part to instantiate an Uber API client.

def authenticate():
    credentials = {
        'access_token': ACCESS_TOKEN,
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'expires_in_seconds': 2592000,  # Uber's default value
        'grant_type': 'authorization_code',
        'scopes': None
    }
    oauth2_credential = OAuth2Credential(**credentials)
    session = Session(oauth2credential=oauth2_credential)
    client = UberRidesClient(session)
    return client

...

Get the product id

You could use the client by hand to find the product id and then hardcode it in the configuration, but I thought it was less confusing this way.
Uber has different products in each city. For the sake of simplicity, we will only monitor one product for one ride in this tutorial, but you could go crazy and monitor all the products for 100 rides if you want...

def get_product_id(client, product_name):
    response = client.get_products(
        START_GEOCODE.latitude,
        START_GEOCODE.longitude,
    )
    products = response.json.get('products')

    for product in products:
        if product['display_name'] == product_name:
            product_id = product['product_id']
            return product_id

...
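Note that get_product_id returns None silently when no product matches the name, which makes a typo in PRODUCT_NAME hard to diagnose. A variant that fails loudly could look like this; find_product_id is my own name, written as a pure function over the decoded products list so it is easy to test:

```python
def find_product_id(products, product_name):
    """Return the product_id whose display_name matches, or fail loudly.

    `products` is the decoded list from client.get_products(...).json['products'].
    """
    for product in products:
        if product['display_name'] == product_name:
            return product['product_id']
    available = ', '.join(p['display_name'] for p in products)
    raise ValueError(
        'product %r not found; available products: %s' % (product_name, available))

# Example with the shape the /products endpoint returns:
sample = [{'display_name': 'uberX', 'product_id': 'abc-123'},
          {'display_name': 'UberBLACK', 'product_id': 'def-456'}]
# find_product_id(sample, 'uberX') → 'abc-123'
```

The error message lists the products actually available at your start location, which is exactly what you need when the name differs from city to city.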

Write to csv

This part performs the ETL. Again, there is no magic here, this function uses a client to estimate the price of a ride and write the formatted output into a csv file.

def write_to_csv(client, product_id):
    file_exists = os.path.isfile(OUTPUT_FILE_PATH)

    with open(OUTPUT_FILE_PATH, 'a') as f:
        writer = csv.writer(f)

        if not file_exists:  # this allows you to write the headers only once
            writer.writerow(CSV_HEADERS)

        now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M')
        # extract
        estimated_ride = client.estimate_ride(
            product_id=product_id,
            start_latitude=START_GEOCODE.latitude,
            start_longitude=START_GEOCODE.longitude,
            end_latitude=END_GEOCODE.latitude,
            end_longitude=END_GEOCODE.longitude,
        ).json

        # transform
        row = [
            now,
            START_PLACE[0],
            START_GEOCODE.latitude,
            START_GEOCODE.longitude,
            END_PLACE[0],
            END_GEOCODE.latitude,
            END_GEOCODE.longitude,
            estimated_ride['trip']['distance_estimate'],
            estimated_ride['trip']['duration_estimate'],
            estimated_ride['fare']['value'],
        ]

        # load
        writer.writerow(row)

...

The whole script

I just added a main function here, which uses the functions you just defined in order to instantiate a client, get the product id, and write to the csv file.

#!/usr/bin/env python

import csv
import datetime
import os

from geopy.geocoders import GoogleV3
from uber_rides.session import Session, OAuth2Credential
from uber_rides.client import UberRidesClient

# fill the following global variables with your values
ACCESS_TOKEN = 'your_access_token'
CLIENT_ID = 'your_client_id'
CLIENT_SECRET = 'your_client_secret'
PRODUCT_NAME = 'uberX'  # change this according to your city

# define your addresses and give them a short name
START_PLACE = ('short_name_1', 'start_address')
END_PLACE = ('short_name_2', 'end_address')

GEOLOCATOR = GoogleV3(timeout=5)  # we will use Google maps API

# use the addresses to get the gps coordinates
START_GEOCODE = GEOLOCATOR.geocode(START_PLACE[1])
END_GEOCODE = GEOLOCATOR.geocode(END_PLACE[1])

# define the output file and its headers
OUTPUT_FILE_PATH = 'squirrel_monitoring.csv'
CSV_HEADERS = [
    'date',
    'start_place',
    'start_latitude',
    'start_longitude',
    'end_place',
    'end_latitude',
    'end_longitude',
    'distance_estimation',
    'duration_estimation',
    'price',
]


def authenticate():
    credentials = {
        'access_token': ACCESS_TOKEN,
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'expires_in_seconds': 2592000,  # Uber's default value
        'grant_type': 'authorization_code',
        'scopes': None
    }
    oauth2_credential = OAuth2Credential(**credentials)
    session = Session(oauth2credential=oauth2_credential)
    client = UberRidesClient(session)
    return client


def get_product_id(client, product_name):
    response = client.get_products(
        START_GEOCODE.latitude,
        START_GEOCODE.longitude,
    )
    products = response.json.get('products')

    for product in products:
        if product['display_name'] == product_name:
            product_id = product['product_id']
            return product_id


def write_to_csv(client, product_id):
    file_exists = os.path.isfile(OUTPUT_FILE_PATH)

    with open(OUTPUT_FILE_PATH, 'a') as f:
        writer = csv.writer(f)

        if not file_exists:  # this allows you to write the headers only once
            writer.writerow(CSV_HEADERS)

        now = datetime.datetime.now().strftime('%Y-%m-%d %H:%M')
        # extract
        estimated_ride = client.estimate_ride(
            product_id=product_id,
            start_latitude=START_GEOCODE.latitude,
            start_longitude=START_GEOCODE.longitude,
            end_latitude=END_GEOCODE.latitude,
            end_longitude=END_GEOCODE.longitude,
        ).json

        # transform
        row = [
            now,
            START_PLACE[0],
            START_GEOCODE.latitude,
            START_GEOCODE.longitude,
            END_PLACE[0],
            END_GEOCODE.latitude,
            END_GEOCODE.longitude,
            estimated_ride['trip']['distance_estimate'],
            estimated_ride['trip']['duration_estimate'],
            estimated_ride['fare']['value'],
        ]

        # load
        writer.writerow(row)


def main():
    print('Starting script...')
    client = authenticate()
    product_id = get_product_id(client, PRODUCT_NAME)
    write_to_csv(client, product_id)
    print('Successfully wrote line into csv file...')


if __name__ == '__main__':
    main()

Great! Now you can run the script and see what it does. I saved it under squirrel_script.py.

$ python squirrel_script.py
Starting script...
Successfully wrote line into csv file...

And here is what I got inside the csv file:

date,start_place,start_latitude,start_longitude,end_place,end_latitude,end_longitude,distance_estimation,duration_estimation,price
2017-01-13 23:55,home,48.84712099999999,2.3058490000000003,office,48.8340459,2.2648741,4.9567672,540,9.35
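Before wiring up any visualization, you can also sanity-check the file from Python itself. A small stdlib sketch (load_prices is a helper name of my own):

```python
import csv

def load_prices(path):
    """Read the monitoring csv and return a list of (date, price) pairs."""
    with open(path) as f:
        return [(row['date'], float(row['price'])) for row in csv.DictReader(f)]

# e.g. print the price range collected so far:
# prices = [price for _, price in load_prices('squirrel_monitoring.csv')]
# print(min(prices), max(prices))
```

Once a few hundred rows have accumulated, min and max already tell you how hard the surge hits your route.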

Alright, now you need a way to launch the script periodically, so that your csv file accumulates enough rows to perform data visualization on it. My preferred way to do this is crontab on *NIX systems, but you can also do it in Python with a while loop and sleep.

Automating the ETL

Using crontab on *NIX systems

Crontab allows you to run a list of commands on a regular schedule. By default, your crontab is empty:

$ crontab -l

To add a scheduled command to your crontab, run:

$ crontab -e

This opens an editor. Enter the following line:

*/5 * * * * python /absolute/path/to/your/script/squirrel_script.py

This basically tells your system to run squirrel_script.py every 5 minutes. If you want to learn more about crontab, RTFM.

Now, if you check your crontab again, your task should be there:

$ crontab -l
*/5 * * * * python /absolute/path/to/your/script/squirrel_script.py

Your script will run every five minutes. If you did everything right, a new row should appear in your csv file every five minutes:

date,start_place,start_latitude,start_longitude,end_place,end_latitude,end_longitude,distance_estimation,duration_estimation,price
2017-01-13 23:55,home,48.84712099999999,2.3058490000000003,office,48.8340459,2.2648741,4.9567672,540,9.35
2017-01-14 00:00,home,48.84712099999999,2.3058490000000003,office,48.8340459,2.2648741,4.9567672,540,9.44
2017-01-14 00:05,home,48.84712099999999,2.3058490000000003,office,48.8340459,2.2648741,4.9567672,540,9.52
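If you would rather stay in Python than use cron, the while-and-sleep option mentioned earlier can be sketched like this. run_periodically is my own helper name, and the iterations parameter exists only so the loop can be tested; pass None to run forever:

```python
import time

def run_periodically(job, interval_seconds=300, iterations=None):
    """Call job() every interval_seconds; iterations=None means run forever."""
    count = 0
    while iterations is None or count < iterations:
        job()
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)

# In the script's main() you could then write:
# run_periodically(lambda: write_to_csv(client, product_id), interval_seconds=300)
```

Keep in mind that cron restarts the job even after a crash or a reboot, while a Python loop dies with its process, which is why I prefer crontab for long-running monitoring.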

Visualization using D3.js

Now that you are extracting data from Uber, you might want to see what that data looks like. There are many tools out there for visualizing data: Chart.js, Tableau, Bokeh, D3.js and many others. I had never used D3.js before, which is exactly why I decided to use it for this article: I love learning new things.

So what do you do when you have limited knowledge of JavaScript but still want to use D3.js for a real-time data visualization? You are right, you Google it! I found this example online and adapted it to our case. The following code loads the csv file and reloads it every 5 minutes, giving you a real-time visualization of the file that the Python script is feeding every 5 minutes thanks to the cron job.
You can name the following file index.html, for example:

<!DOCTYPE html>
<meta charset="utf-8">
<style> /* set the CSS */

body { font: 12px Arial;}

path {
    stroke: steelblue;
    stroke-width: 2;
    fill: none;
}

.axis path,
.axis line {
    fill: none;
    stroke: grey;
    stroke-width: 1;
    shape-rendering: crispEdges;
}

</style>
<body>

<!-- load the d3.js library -->
<script src="https://d3js.org/d3.v3.min.js"></script>

<script>

// Set the dimensions of the canvas / graph
var margin = {top: 30, right: 20, bottom: 30, left: 50},
    width = 600 - margin.left - margin.right,
    height = 270 - margin.top - margin.bottom;

// Parse the date / time
var parseDate = d3.time.format("%Y-%m-%d %H:%M").parse;

// Set the ranges
var x = d3.time.scale().range([0, width]);
var y = d3.scale.linear().range([height, 0]);

// Define the axes
var xAxis = d3.svg.axis().scale(x)
    .orient("bottom").ticks(5);

var yAxis = d3.svg.axis().scale(y)
    .orient("left").ticks(5);

// Define the line
var valueline = d3.svg.line()
    .x(function(d) { return x(d.date); })
    .y(function(d) { return y(d.price); });

// Adds the svg canvas
var svg = d3.select("body")
    .append("svg")
        .attr("width", width + margin.left + margin.right)
        .attr("height", height + margin.top + margin.bottom)
    .append("g")
        .attr("transform",
              "translate(" + margin.left + "," + margin.top + ")");

// Get the data
d3.csv("/squirrel_monitoring.csv", function(error, data) {
    data.forEach(function(d) {
        d.date = parseDate(d.date);
        d.price = +d.price;
    });

    // Scale the range of the data
    x.domain(d3.extent(data, function(d) { return d.date; }));
    y.domain([0, d3.max(data, function(d) { return d.price; })]);

    // Add the valueline path.
    svg.append("path")
        .attr("class", "line")
        .attr("d", valueline(data));

    // Add the X Axis
    svg.append("g")
        .attr("class", "x axis")
        .attr("transform", "translate(0," + height + ")")
        .call(xAxis);

    // Add the Y Axis
    svg.append("g")
        .attr("class", "y axis")
        .call(yAxis);

});

// Update data section
function updateData() {
    // Get the data again
    d3.csv("/squirrel_monitoring.csv", function(error, data) {
        data.forEach(function(d) {
            d.date = parseDate(d.date);
            d.price = +d.price;
        });

        // Scale the range of the data again
        x.domain(d3.extent(data, function(d) { return d.date; }));
        y.domain([0, d3.max(data, function(d) { return d.price; })]);

    // Select the section we want to apply our changes to
    var svg = d3.select("body").transition();

    // Make the changes
        svg.select(".line")   // change the line
            .duration(750)
            .attr("d", valueline(data));
        svg.select(".x.axis") // change the x axis
            .duration(750)
            .call(xAxis);
        svg.select(".y.axis") // change the y axis
            .duration(750)
            .call(yAxis);

    });
}

// This updates the data every 5 minutes
var inter = setInterval(function() {
    updateData();
}, 1000*60*5);

</script>
</body>

You need to serve the file over HTTP for the reloading to work. You can do this easily with Python 3's built-in server:

$ python -m http.server 8080
Serving HTTP on 0.0.0.0 port 8080 (http://0.0.0.0:8080/) ...

Now go to http://localhost:8080/index.html in any web browser and, if you did everything well, you should see something like this:

Yes, the chart above is monitoring the Uber prices from my apartment to the office in Paris in real time.

You can already perform some basic analysis on what you see. Obviously, prices go very high in the mornings when people are going to work and in the evenings when people are going back home. Now that you know how to automate ETL processes on public APIs, the sky is the limit. You can even extract tons of data from Uber's public API and run advanced Machine Learning on it, like I did here...

Subscribe!

So you like traveling? Hacking? Squirrels? Subscribe to my newsletter!


