· duckdb parquet

Diffing Apache Parquet schemas with DuckDB

I’ve been playing around with DuckDB, the new hotness in the analytics space, over the last month, and my friend Michael Hunger asked whether you could use it to compute a diff of Apache Parquet schemas.

Challenge accepted!

Challenge Accepted!
Note

I’ve created a video showing how to do this on my YouTube channel, Learn Data with Mark, so if you prefer to consume content through that medium, I’ve embedded it below:

Querying the Parquet Schema

Let’s say we have three Parquet files that represent ATP tennis players - players1.parquet, players2.parquet, and players3.parquet. We can get the list of column names along with their data type and file name by running the following query in the DuckDB CLI:

SELECT file_name, name, type
FROM parquet_schema('players*.parquet')
ORDER BY file_name;

If we run this query, we’ll see the following results:

Table 1. Results
file_name name type

players1.parquet

duckdb_schema

BOOLEAN

players1.parquet

player_id

INT32

players1.parquet

name_first

BYTE_ARRAY

players1.parquet

name_last

BYTE_ARRAY

players1.parquet

hand

BYTE_ARRAY

players1.parquet

dob

INT32

players1.parquet

ioc

BYTE_ARRAY

players1.parquet

height

INT32

players1.parquet

wikidata_id

BYTE_ARRAY

players2.parquet

duckdb_schema

BOOLEAN

players2.parquet

player_id

INT32

players2.parquet

first_name

BYTE_ARRAY

players2.parquet

last_name

BYTE_ARRAY

players2.parquet

hand

BYTE_ARRAY

players2.parquet

dob

INT32

players2.parquet

ioc

BYTE_ARRAY

players2.parquet

height

INT32

players2.parquet

wikidata_id

BYTE_ARRAY

players3.parquet

duckdb_schema

BOOLEAN

players3.parquet

player_id

BYTE_ARRAY

players3.parquet

first_name

BYTE_ARRAY

players3.parquet

last_name

BYTE_ARRAY

players3.parquet

hand

BYTE_ARRAY

players3.parquet

dob

BYTE_ARRAY

players3.parquet

ioc

BYTE_ARRAY

players3.parquet

height

BYTE_ARRAY

From scanning this list, we can see that there are some differences in the schemas, but it’s not entirely obvious what those differences are. Let’s see if we can write some queries to figure it out.

Temporary Tables

But first, we’re going to create a temporary table per schema file to simplify the queries that we have to write. Temporary tables are stored in memory for the current session and will be gone should we exit the DuckDB CLI.

The following code creates the temporary tables:

CREATE OR REPLACE TEMP TABLE p1Schema AS
SELECT * FROM parquet_schema('players1.parquet');

CREATE OR REPLACE TEMP TABLE p2Schema AS
SELECT * FROM parquet_schema('players2.parquet');

CREATE OR REPLACE TEMP TABLE p3Schema AS
SELECT * FROM parquet_schema('players3.parquet');

Missing fields

Now we’re going to write a query that finds the fields that are in p1Schema, but not in p2Schema. We’ll use SQL’s EXCEPT clause to do this:

SELECT name, type FROM p1Schema
EXCEPT
SELECT name, type FROM p2Schema;

If we run this query, we’ll see the following results:

Table 2. Results
name type

name_first

BYTE_ARRAY

name_last

BYTE_ARRAY

There are a couple of fields missing in p2Schema that are in p1Schema. Let’s now run the inverse query to discover fields that are in p2Schema, but not in p1Schema:

SELECT name, type FROM p2Schema
EXCEPT
SELECT name, type FROM p1Schema;

The results are as follows:

Table 3. Results
name type

first_name

BYTE_ARRAY

last_name

BYTE_ARRAY

It looks like we did some basic field renaming between these two schemas! Let’s put both those queries together, along with a description that explains what’s going on for missing column:

WITH schema1 AS (
    SELECT name, type FROM p1Schema
), schema2 AS (
    SELECT name, type FROM p2Schema
), notInSchema1 AS (
    SELECT * FROM schema1 EXCEPT SELECT * FROM schema2
), notInSchema2 AS (
    SELECT * FROM  schema2 EXCEPT SELECT * FROM schema1
)

SELECT *, 'Missing in players2.parquet' AS description
FROM notInSchema1
UNION ALL
SELECT *, 'Missing in players1.parquet' AS description
FROM notInSchema2;

If we run this query, we’ll see this outpu

Table 4. Results
name type description

name_first

BYTE_ARRAY

Missing in players2.parquet

name_last

BYTE_ARRAY

Missing in players2.parquet

first_name

BYTE_ARRAY

Missing in players1.parquet

last_name

BYTE_ARRAY

Missing in players1.parquet

Now we can clearly see that the first two fields are missing in players2.parquet and the latter ones are missing in players1.parquet.

How about if we do the same thing for p2Schema and p3Schema?

WITH schema1 AS (
    SELECT name, type FROM p2Schema
), schema2 AS (
    SELECT name, type FROM p3Schema
), notInSchema1 AS (
    SELECT * FROM schema1 EXCEPT SELECT * FROM schema2
), notInSchema2 AS (
    SELECT * FROM  schema2 EXCEPT SELECT * FROM schema1
)

SELECT *, 'Missing in players3.parquet' AS description
FROM notInSchema1
UNION ALL
SELECT *, 'Missing in players2.parquet' AS description
FROM notInSchema2;

And let’s run that:

Table 5. Results
name type description

player_id

INT32

Missing in players3.parquet

dob

INT32

Missing in players3.parquet

height

INT32

Missing in players3.parquet

wikidata_id

BYTE_ARRAY

Missing in players3.parquet

player_id

BYTE_ARRAY

Missing in players2.parquet

dob

BYTE_ARRAY

Missing in players2.parquet

height

BYTE_ARRAY

Missing in players2.parquet

wikidata_id is missing in players3.parquet, but the other 3 fields are clearly in both schemas, so why are they showing up?! It turns out they have a different data type in each schema.

Same columns, different type

To return an easier to read representation of schemas that have the same fields/columns, with different types, we can write the following query:

SELECT schema1.name AS s1Name, schema1.type AS s1Type, schema2.type AS s2Type,
    schema1.file_name AS s1File, schema2.file_name AS s2File
FROM p2Schema AS schema1
INNER JOIN p3Schema AS schema2 ON schema2.name = schema1.name
WHERE s1Type <> s2Type;

And let’s see those results…​

Table 6. Results
s1Name s1Type s2Type s1File s2File

player_id

INT32

BYTE_ARRAY

players2.parquet

players3.parquet

dob

INT32

BYTE_ARRAY

players2.parquet

players3.parquet

height

INT32

BYTE_ARRAY

players2.parquet

players3.parquet

Conclusion

It turns out that computing Parquet schema diffs is another thing that DuckDB is really good at! So give it a try the next time you have some mismatched schemas, but you don’t know exactly what’s not matching!

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket