
DuckDB: Ingest a bunch of CSV files from GitHub

Jeff Sackmann’s tennis_atp repository is one of the best collections of tennis data, and I wanted to ingest the ATP Tour singles matches using the DuckDB CLI. In this blog post we’ll learn how to do that.

Usually when I’m ingesting data into DuckDB I’ll specify the files that I want to ingest using the wildcard syntax. In this case that would mean running a query like this:

CREATE OR REPLACE TABLE matches AS
SELECT *
FROM "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_*.csv"

But we can’t use that technique for files on GitHub, because raw.githubusercontent.com isn’t a file system that DuckDB can glob over. If we run the query above, we’ll get the following error message:

Error: Invalid Error: HTTP Error: Unable to connect to URL "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_*.csv": 404 (Not Found)
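
Reading a single, explicit URL does work, though: DuckDB’s httpfs extension handles HTTP(S) reads, and recent versions of the CLI load it automatically (if not, run INSTALL httpfs; LOAD httpfs; first). A quick sanity check along those lines:

-- No wildcard here, so this works: read a single year's file over HTTP
SELECT count(*)
FROM 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1968.csv';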

So, instead, we need to construct a list of URLs, which we can do using a combination of the range function (a close cousin of the generate_series function that I covered in my last blog post) and the list_transform function.

We’ll use range to create a list of years, like this:

SELECT range(1968, 1970);
Table 1. Output
range(1968, 1970)

[1968, 1969]
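
Note that range excludes its upper bound, which is why 1970 doesn’t appear. If you’d rather have an inclusive upper bound, generate_series works the same way except that it keeps the final value:

SELECT generate_series(1968, 1970);
-- returns [1968, 1969, 1970]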

And then we can use list_transform to map or project over each value to construct a URL:

SELECT list_transform(
  range(1968, 1970),
  y ->  'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_' || y || '.csv'
) AS files;
Table 2. Output
files

[https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1968.csv, https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1969.csv]
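
If it’s easier to eyeball the URLs one per row rather than as a single list value, we can wrap the same expression in unnest:

-- Unnest the generated list so each URL appears on its own row
SELECT unnest(
  list_transform(
    range(1968, 1970),
    y -> 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_' || y || '.csv'
  )
) AS file;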

And then finally we can bring it all together to ingest the files from 1968 until 2023 (again, range excludes its upper bound, so the last file read is atp_matches_2022.csv):

CREATE OR REPLACE TABLE matches AS
SELECT * FROM read_csv_auto(
    list_transform(
      range(1968, 2023),
      y ->  'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_' || y || '.csv'
    ),
    types={'winner_seed': 'VARCHAR', 'loser_seed': 'VARCHAR'}
);
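
The types override pins winner_seed and loser_seed to VARCHAR, presumably so that the CSV sniffer’s per-file guesses for those columns can’t conflict with each other. If other columns ever drift between years, read_csv_auto also accepts union_by_name=true, which matches columns by name rather than by position; a sketch of that variant:

CREATE OR REPLACE TABLE matches AS
SELECT * FROM read_csv_auto(
    list_transform(
      range(1968, 2023),
      y ->  'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_' || y || '.csv'
    ),
    types={'winner_seed': 'VARCHAR', 'loser_seed': 'VARCHAR'},
    union_by_name=true  -- align columns by name across files
);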

It’ll take a few seconds, but once it’s done we can run an exploratory query to make sure everything was ingested:

SELECT count(*)
FROM matches;
Table 3. Output
count(*)

188191
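
And from there we can start exploring properly. For example, a quick look at who has won the most matches (winner_name is one of the columns in Sackmann’s files):

-- Top 10 players by number of match wins across the whole dataset
SELECT winner_name, count(*) AS wins
FROM matches
GROUP BY winner_name
ORDER BY wins DESC
LIMIT 10;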
