· clickhouse til

ClickHouse: Float equality

I’ve been playing around with NumPy data in ClickHouse this week and wanted to share what I learnt when checking for equality of float values. Let’s get going!

Creating arrays

We’re going to use Python’s NumPy library to create 5 arrays containing 10 values each:

import numpy as np
rng = np.random.default_rng(seed=42)
rng.random(size=(5, 5))
Output
array([[0.28138389, 0.29359376, 0.66191651, 0.55703215, 0.78389821],
       [0.66431354, 0.40638686, 0.81402038, 0.16697292, 0.02271207],
       [0.09004786, 0.72235935, 0.46187723, 0.16127178, 0.50104478],
       [0.1523121 , 0.69632038, 0.44615628, 0.38102123, 0.30151209],
       [0.63028259, 0.36181261, 0.08764992, 0.1180059 , 0.96189766]])

By default, the type of the values is float64:

rng.random(size=(5, 10)).dtype
Output
dtype('float64')

But we’re going to cast that to float32 since that’s the format of the data in the file that I’ve been working with. We’ll then save the arrays to a file:

values = rng.random(size=(5, 10))
np.save('arrays.npy', values.astype(np.float32))

Querying in ClickHouse

Next, let’s download and launch ClickHouse Local:

curl https://clickhouse.com/ | sh
./clickhouse local -m

We can then read the arrays.npy file:

FROM file('arrays.npy', Npy)
SELECT array, toTypeName(array);
Output
┌─array────────────────────────────────────────────────────┬─toTypeName(array)─┐
│ [0.2813839,0.29359376,0.6619165,0.55703217,0.78389823]   │ Array(Float32)    │
│ [0.66431355,0.40638685,0.8140204,0.16697292,0.022712072] │ Array(Float32)    │
│ [0.09004786,0.72235936,0.46187723,0.16127178,0.50104475] │ Array(Float32)    │
│ [0.1523121,0.69632035,0.44615626,0.38102123,0.3015121]   │ Array(Float32)    │
│ [0.6302826,0.36181262,0.08764992,0.1180059,0.9618977]    │ Array(Float32)    │
└──────────────────────────────────────────────────────────┴───────────────────┘

Next, let’s see if we can return only the rows where the first item of the array is 0.99237555. My first attempt looked like this:

FROM file('arrays.npy', Npy)
SELECT array, toTypeName(array)
WHERE array[1] = 0.66431355
FORMAT Vertical;
Output
0 rows in set. Elapsed: 0.002 sec.

Computer says no! Let’s write a query that returns the type of our float literal:

SELECT 0.66431355 AS literal, toTypeName(literal);
Output
┌────literal─┬─toTypeName(0.66431355)─┐
│ 0.66431355 │ Float64                │
└────────────┴────────────────────────┘

Aha, it’s a Float64, which explains why it didn’t match our array of Float32 values. There are at least two ways that I know of to rectify this.

We can use the toFloat32 function to cast the float literal to the right type:

FROM file('arrays.npy', Npy)
SELECT array, toTypeName(array)
WHERE array[1] = toFloat32(0.66431355);

Or, we can cast the value to Float32

FROM file('arrays.npy', Npy)
SELECT array, toTypeName(array)
WHERE array[1] = CAST(0.66431355 AS Float32);

Either way, we get the following result.

Output
Row 1:
──────
┌─array────────────────────────────────────────────────────┬─toTypeName(array)─┐
│ [0.66431355,0.40638685,0.8140204,0.16697292,0.022712072] │ Array(Float32)    │
└──────────────────────────────────────────────────────────┴───────────────────┘

My takeaway when working with floats (or even decimals) is to make sure that you know the underlying data types that you’re working with.

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket