ClickHouse: A hacky way to default parameters in a view
ClickHouse recently added support for runtime provided parameters in views, so I wanted to try it when querying the MidJourney messages dataset. It worked pretty well, but I ran into problems when trying to define default parameters, which is what we’re going to explore in this blog post.
Let’s launch ClickHouse Local:
clickhouse -m --max_http_get_redirects=10 --output_format_pretty_row_numbers=0
We need to set max_http_get_redirects
so that it can handle redirects in the Hugging Face URL, and output_format_pretty_row_numbers
is so that it won’t put numbers in front of each result row.
I initially wrote the following view to query the data:
CREATE OR REPLACE VIEW midJourneyRemote AS
SELECT *, _file, _path
FROM url(
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' || {file:String},
Parquet
);
If we pass in a file name or wild card pattern, it works:
SELECT count()
FROM midJourneyRemote(file = '00000{0..1}.parquet');
┌─count()─┐
│ 2000000 │ -- 2.00 million
└─────────┘
1 row in set. Elapsed: 1.109 sec. Processed 1.00 million rows, 159.31 MB (901.97 thousand rows/s., 143.69 MB/s.)
Peak memory usage: 9.94 MiB.
But it doesn’t work if I don’t want to provide a file
parameter:
SELECT count()
FROM midJourneyRemote();
Received exception:
Code: 456. DB::Exception: Substitution `file` is not set. (UNKNOWN_QUERY_PARAMETER)
Or if I set file
to null
:
SELECT count()
FROM midJourneyRemote(file=null);
Received exception:
Code: 86. DB::HTTPException: Received error from remote server https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data. HTTP status code: 404 'Not Found', body length: 18446744073709551615 bytes: The table structure cannot be extracted from a Parquet format file. You can specify the structure manually. (RECEIVED_ERROR_FROM_REMOTE_IO_SERVER)
But it occurred to me that if I made the parameter Nullable
, I could then use the coalesce
function to set a default value.
This resulted in the following function definition:
CREATE OR REPLACE VIEW midJourneyRemote AS
SELECT *, _file, _path
FROM url(
'https://huggingface.co/datasets/vivym/midjourney-messages/resolve/main/data/' || coalesce({file:Nullable(String)}, '0000{00..55}.parquet'),
Parquet
);
Now, if we call the view with file
set to null
, it counts the rows in all the Parquet files:
SELECT count()
FROM midJourneyRemote(file=null);
┌──count()─┐
│ 55082563 │ -- 55.08 million
└──────────┘
1 row in set. Elapsed: 7.352 sec. Processed 54.08 million rows, 8.48 GB (7.36 million rows/s., 1.15 GB/s.)
Peak memory usage: 25.82 MiB.
And if we want to only query a subset of the Parquet files, we can still do that:
SELECT _file, count()
FROM midJourneyRemote(file='00000{0..9}.parquet')
GROUP BY ALL
ORDER BY _file;
┌─_file──────────┬─count()─┐
│ 000000.parquet │ 1000000 │
│ 000001.parquet │ 1000000 │
│ 000002.parquet │ 1000000 │
│ 000003.parquet │ 1000000 │
│ 000004.parquet │ 1000000 │
│ 000005.parquet │ 1000000 │
│ 000006.parquet │ 1000000 │
│ 000007.parquet │ 1000000 │
│ 000008.parquet │ 1000000 │
│ 000009.parquet │ 1000000 │
└────────────────┴─────────┘
10 rows in set. Elapsed: 0.388 sec. Processed 9.00 million rows, 0.00 B (23.21 million rows/s., 0.00 B/s.)
Peak memory usage: 1.30 MiB.
Not perfect, but job done!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.