15 Jun 2018

Neo4j: Querying the Strava Graph using Py2neo

Last week Nigel released v4 of Py2neo and given I was just getting ready to write some queries against my Strava activity graph I thought I’d give it a try.

If you want to learn how to create your own Strava graph you should read my previous post, but just to recap, this is the graph model that we created:

Let’s get to it!

tl;dr

the code in this post is available as a Jupyter notebook so if you want the code and nothing but the code head over there!

Importing libraries

Importing and creating a Graph object in Py2neo is as simple as executing the following lines of code:

from py2neo import Graph
graph = Graph("bolt://localhost:7687", auth=("neo4j", "neo"))

Find the most recent activities

We’ll start with a simple query where we find the most recent activities that I’ve done.

recent_runs_df = graph.run("""\
MATCH (r:Run)
WITH r { .id, .startDate, .name, .movingTime, .distance,
         pace: duration({seconds: r.movingTime.seconds / r.distance * 1000}) }
RETURN r.name,
       apoc.date.format(r.startDate.epochSeconds, 's', 'MMM d yyyy') AS dateOfRun,
       r.distance,
       apoc.date.format(r.movingTime.milliseconds, 'ms', 'HH:mm:ss') AS time,
       apoc.date.format(r.pace.milliseconds, "ms", "mm:ss") AS pacePerKm,
       r.startDate AS startDate
""").to_data_frame()

recent_runs_df \
  .sort_values(by = ["startDate"], ascending=False) \
  .drop(["startDate"], axis=1) \
  .head(10)

This query converts the Cypher output into a DataFrame with the to_data_frame function. There are other formats that we can use instead but I want to post process the results using Pandas so this format works best for me.

If we run that query we’ll get the following output:

	dateOfRun	pacePerKm	r.distance	r.name	time
605	Jun 13 2018	04:33	10780.2	Morning Run	00:49:11
604	Jun 11 2018	04:34	10035.8	Morning Run	00:45:53
603	Jun 9 2018	04:23	11043.2	Morning Run	00:48:29
602	Jun 8 2018	04:36	11281.2	Morning Run	00:51:57
586	Jun 6 2018	04:28	10884.9	Morning Run	00:48:38
585	Jun 4 2018	04:24	10030.5	Morning Run	00:44:14
584	Jun 2 2018	04:25	13039.8	Morning Run	00:57:41
601	Jun 1 2018	04:29	10701.7	Morning Run	00:48:02
600	May 30 2018	04:41	9828.4	Morning Run	00:46:06
599	May 28 2018	04:34	10193.0	Morning Run	00:46:42

We could also do the sorting of the data directly in Cypher, but I wanted to have all the activities available in the DataFrame so that we could play around with it directly in Pandas without having to re-run the query each time.

Find the longest activities

For example, we could find the activities ordered by distance:

recent_runs_df \
  .sort_values(by = ["r.distance"], ascending=False) \
  .drop(["startDate"], axis=1) \
  .head(10)

If we run that query we’ll get the following output:

	dateOfRun	pacePerKm	r.distance	r.name	time
594	May 19 2018	04:30	13635.2	Morning Run	01:01:27
584	Jun 2 2018	04:25	13039.8	Morning Run	00:57:41
590	May 12 2018	04:40	12963.9	Morning Run	01:00:33
597	May 25 2018	04:37	11690.0	Morning Run	00:54:02
572	May 5 2018	04:31	11674.4	Morning Run	00:52:48
531	Mar 17 2018	04:37	11299.0	Afternoon Run	00:52:19
602	Jun 8 2018	04:36	11281.2	Morning Run	00:51:57
548	Jan 27 2018	04:37	11064.7	Afternoon Run	00:51:05
603	Jun 9 2018	04:23	11043.2	Morning Run	00:48:29
586	Jun 6 2018	04:28	10884.9	Morning Run	00:48:38

We can also visualise the distances over time using matplotlib:

%matplotlib inline
recent_runs_df.plot(x="startDate", y="r.distance")

Estimated runs

In our graph we also have estimated attempts at different distances.

When you submit an activity Strava generates predicted best times for a range of distances less than the total distance for that activity. For example if you submit an activity of 6,000 metres, Strava will generated estimated times for 5k, 2 miles, 1 mile, 1k, and 400m.

We can write the following query to find my best 10k attempts:

estimated_effort_query = """\
MATCH (distance:Distance {name: {distance}})<-[:DISTANCE]-(effort),
      (effort)<-[:DISTANCE_EFFORT]-(run)

WITH run { .id, .startDate, .distance, pace: duration({
           seconds: run.elapsedTime.seconds / run.distance * 1000})},
     effort { .elapsedTime, pace: duration({
              seconds: effort.elapsedTime.seconds / distance.distance * 1000 })}

RETURN apoc.date.format(run.startDate.epochSeconds, 's', 'MMM d yyyy') AS dateOfRun,
       apoc.date.format(effort.elapsedTime.milliseconds, 'ms', 'mm:ss') AS time,
       apoc.date.format(effort.pace.milliseconds, "ms", "mm:ss") AS pacePerKm,
       apoc.math.round(toFloat(effort.pace.seconds) / 60, 2) AS pacePerKmFloat,
       apoc.date.format(run.pace.milliseconds, "ms", "mm:ss") AS overallPacePerKm,
       run.distance AS totalDistance,
       run.startDate AS startDate
ORDER BY effort.elapsedTime
LIMIT {limit}
"""

df_10k = graph \
  .run(estimated_effort_query, {"distance": "10k", "limit": 1000}) \
  .to_data_frame()

df_10k \
  .drop(["startDate", "pacePerKmFloat"], axis=1) \
  .head(10)

	dateOfRun	overallPacePerKm	pacePerKm	time	totalDistance
0	Jun 9 2018	04:24	04:24	44:05	11043.2
1	Jun 2 2018	04:25	04:25	44:12	13039.8
2	Jun 4 2018	04:27	04:27	44:35	10030.5
3	Jun 6 2018	04:28	04:28	44:49	10884.9
4	Jun 1 2018	04:29	04:29	44:59	10701.7
5	May 19 2018	04:31	04:30	45:02	13635.2
6	May 5 2018	04:33	04:32	45:26	11674.4
7	May 21 2018	04:33	04:34	45:41	10147.0
8	Jun 13 2018	04:34	04:34	45:48	10780.2
9	May 16 2018	04:34	04:34	45:49	10148.8

We can also create a matplotlib plot to see my 10k runs over time:

%matplotlib inline
df_10k.sort_values(by=["startDate"]).plot(x="startDate", y="pacePerKmFloat")

Combining estimated runs

What if we want to combine my efforts for multiple distances in a single row? The pandas merge function comes in handy here.

For example we could write the following code to create a DataFrame that combines my 10k and 5k runs:

import pandas as pd
cols = ["run.id", "dateOfRun_x", "time_x", "pacePerKm_x", "time_y", "pacePerKm_y"]

df_5k = graph \
  .run(estimated_effort_query, {"distance": "5k", "limit": 1000}) \
  .to_data_frame()

df_10k = graph \
  .run(estimated_effort_query, {"distance": "10k", "limit": 1000}) \
  .to_data_frame()

pd.merge(df_5k, df_10k, on="run.id")[cols].head(10)

	run.id	dateOfRun_x	time_x	pacePerKm_x	time_y	pacePerKm_y
0	1611168962	Jun 2 2018	21:24	04:16	44:12	04:25
1	1626636004	Jun 9 2018	21:34	04:18	44:05	04:24
2	1620188065	Jun 6 2018	22:00	04:24	44:49	04:28
3	1616050069	Jun 4 2018	22:01	04:24	44:35	04:27
4	1581401227	May 19 2018	22:15	04:27	45:02	04:30
5	1609355524	Jun 1 2018	22:16	04:27	44:59	04:29
6	1635264102	Jun 13 2018	22:20	04:28	45:48	04:34
7	1585759604	May 21 2018	22:27	04:29	45:41	04:34
8	1566256778	May 12 2018	22:30	04:30	46:11	04:37
9	1624450599	Jun 8 2018	22:30	04:30	46:11	04:37

That’s all for now. Hopefully that’s given you some ideas of how you can use the new version of Py2neo to query your graph data.

I only covered some examples in this blog post. If you want to see the full set you can look at the accompanying Jupyter notebook.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.