· duckdb til

DuckDB: Generate dummy data with user defined functions (UDFs)

In the 0.8 release of DuckDB, they added functionality that lets you add your own functions when using the Python package I wanted to see if I could use it to generate dummy data so that’s what we’re going to do in this blog post.

Note

I’ve created a video showing how to do this on my YouTube channel, Learn Data with Mark, so if you prefer to consume content through that medium, I’ve embedded it below:

We’re going to do this using DuckDB’s Python package. We’ll install that, along with the Faker library, by running the following:

pip install duckdb faker

I then created a function to generate a fake person:

import faker
fake = faker.Faker()

def generate_person():
    person = {
        'name': fake.name(),
        'city': fake.city(),
        'state': fake.state(),
        'zip_code': fake.zipcode(),
        'country': fake.country(),
        'email': fake.email(),
        'job': fake.job(),
        'company': fake.company(),
        'ssn': fake.ssn(),
        'birthdate': fake.date_of_birth(),
        'phone_number': fake.phone_number()
    }
    return person

Now we need to create a DuckDB database and register the function, which we’ll do with the following code:

import duckdb
from duckdb.typing import *

con.create_function(
    'generate_person',
    generate_person,
    [],
    duckdb.struct_type({
        'name': 'VARCHAR',
        'city': 'VARCHAR',
        'state': 'VARCHAR',
        'zip_code': 'VARCHAR',
        'country': 'VARCHAR',
        'email': 'VARCHAR',
        'job': 'VARCHAR',
        'company': 'VARCHAR',
        'ssn': 'VARCHAR',
        'birthdate': 'DATE',
        'phone_number': 'VARCHAR'
    })
)

A dictionary in Python maps to the duckdb.struct_type type in DuckDB. We can then pass in a map of the fields and their data types.

We can then use that function in a query like this:

con.sql("""
CREATE OR REPLACE TABLE people AS
SELECT person.* FROM (
    SELECT generate_person(random()) AS person
    FROM generate_series(1,10000)
)
""")

When I ran this script it was returning the same person repeatedly, which makes me think the function was being cached. To work around that, I add a seed parameter to the function, resulting in the following code:

def generate_person(seed):
    person = {
        'name': fake.name(),
        'city': fake.city(),
        'state': fake.state(),
        'zip_code': fake.zipcode(),
        'country': fake.country(),
        'email': fake.email(),
        'job': fake.job(),
        'company': fake.company(),
        'ssn': fake.ssn(),
        'birthdate': fake.date_of_birth(),
        'phone_number': fake.phone_number()
    }
    return person

con.create_function(
    'generate_person',
    generate_person,
    [DOUBLE],
    duckdb.struct_type({
        'name': 'VARCHAR',
        'city': 'VARCHAR',
        'state': 'VARCHAR',
        'zip_code': 'VARCHAR',
        'country': 'VARCHAR',
        'email': 'VARCHAR',
        'job': 'VARCHAR',
        'company': 'VARCHAR',
        'ssn': 'VARCHAR',
        'birthdate': 'DATE',
        'phone_number': 'VARCHAR'
    })
)

con.sql("""
CREATE OR REPLACE TABLE people AS
SELECT person.* FROM (
    SELECT generate_person(random()) AS person
    FROM generate_series(1,10000)
)
""")

And now it works!

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket