
DuckDB: Ingest a bunch of CSV files from GitHub

Jeff Sackmann’s tennis_atp repository is one of the best collections of tennis data, and I wanted to ingest the ATP Tour singles matches using the DuckDB CLI. In this blog post we’ll learn how to do that.

Usually when I’m ingesting data into DuckDB I’ll specify the files that I want to ingest using the wildcard syntax. In this case that would mean running a query like this:

CREATE OR REPLACE TABLE matches AS
SELECT *
FROM "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_*.csv"

But we can’t use that technique for files on GitHub, because raw.githubusercontent.com isn’t a file system that DuckDB can glob over. If we run the query above, we’ll get the following error message:

Error: Invalid Error: HTTP Error: Unable to connect to URL "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_*.csv": 404 (Not Found)
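
Reading a single, explicit URL does work, though: DuckDB’s httpfs extension handles HTTP(S) reads, and recent versions of the CLI load it automatically (if not, run INSTALL httpfs; LOAD httpfs; first). A quick sanity check along those lines:

-- No wildcard here, so this works: read a single year's file over HTTP
SELECT count(*)
FROM 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1968.csv';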

So, instead, we need to construct a list of URLs, which we can do using a combination of the range function (a close cousin of the generate_series function that I covered in my last blog post) and the list_transform function.

We’ll use range to create a list of years, like this:

SELECT range(1968, 1970);
Table 1. Output
range(1968, 1970)

[1968, 1969]
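
Note that range excludes its upper bound, which is why 1970 doesn’t appear. If you’d rather have an inclusive upper bound, generate_series works the same way except that it keeps the final value:

SELECT generate_series(1968, 1970);
-- returns [1968, 1969, 1970]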

And then we can use list_transform to map or project over each value to construct a URL:

SELECT list_transform(
  range(1968, 1970),
  y ->  'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_' || y || '.csv'
) AS files;
Table 2. Output
files

[https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1968.csv, https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_1969.csv]
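
If it’s easier to eyeball the URLs one per row rather than as a single list value, we can wrap the same expression in unnest:

-- Unnest the generated list so each URL appears on its own row
SELECT unnest(
  list_transform(
    range(1968, 1970),
    y -> 'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_' || y || '.csv'
  )
) AS file;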

And then finally we can bring it all together to ingest the files from 1968 until 2023 (again, range excludes its upper bound, so the last file read is atp_matches_2022.csv):

CREATE OR REPLACE TABLE matches AS
SELECT * FROM read_csv_auto(
    list_transform(
      range(1968, 2023),
      y ->  'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_' || y || '.csv'
    ),
    types={'winner_seed': 'VARCHAR', 'loser_seed': 'VARCHAR'}
);
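
The types override pins winner_seed and loser_seed to VARCHAR, presumably so that the CSV sniffer’s per-file guesses for those columns can’t conflict with each other. If other columns ever drift between years, read_csv_auto also accepts union_by_name=true, which matches columns by name rather than by position; a sketch of that variant:

CREATE OR REPLACE TABLE matches AS
SELECT * FROM read_csv_auto(
    list_transform(
      range(1968, 2023),
      y ->  'https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_' || y || '.csv'
    ),
    types={'winner_seed': 'VARCHAR', 'loser_seed': 'VARCHAR'},
    union_by_name=true  -- align columns by name across files
);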

It’ll take a few seconds, but once it’s done we can run an exploratory query to make sure everything was ingested:

SELECT count(*)
FROM matches;
Table 3. Output
count(*)

188191
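
And from there we can start exploring properly. For example, a quick look at who has won the most matches (winner_name is one of the columns in Sackmann’s files):

-- Top 10 players by number of match wins across the whole dataset
SELECT winner_name, count(*) AS wins
FROM matches
GROUP BY winner_name
ORDER BY wins DESC
LIMIT 10;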
