pyarrow: pyarrow.lib.ArrowNotImplementedError: Filter argument must be boolean type
I wanted to filter a table in pyarrow table recently and ran into troubles when trying to use the filter syntax that I’m used to from DuckDB. In this blog post I’ll explain my mistake and how to fix it.
First, let’s install
pip install pyarrow
And now we’re going to create a table that has a few countries and their corresponding continents:
import pyarrow as pa countries = pa.Table.from_arrays( [ pa.array(['India', 'Pakistan', 'Belgium', 'Finland'], pa.string()), pa.array(['Asia', 'Asia', 'Europe', 'Europe'], pa.string()) ], names=['Country', 'Continent'] )
Let’s say we want to find just the rows where the continent is Europe. I initially tried to do that using the following syntax:
countries.filter("Continent = 'Europe'")
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "pyarrow/table.pxi", line 3154, in pyarrow.lib.Table.filter File "/Users/markhneedham/Library/Caches/pypoetry/virtualenvs/ch07-YVG4Qrie-py3.11/lib/python3.11/site-packages/pyarrow/compute.py", line 259, in wrapper return func.call(args, options, memory_pool) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "pyarrow/_compute.pyx", line 367, in pyarrow._compute.Function.call File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status pyarrow.lib.ArrowNotImplementedError: Filter argument must be boolean type
Hmmm, that didn’t work so well.
Instead we need to construct a filter predicate using some functions from the
pyarrow.compute module, so let’s import that:
import pyarrow.compute as pc
And now we have (at least) two ways to write the filtering statement.
We could use
pc.equal like this:
pc.field like this:
countries.filter(pc.field("Continent") == "Europe")
Either way we get the same result:
pyarrow.Table Country: string Continent: string ---- Country: [["Belgium","Finland"]] Continent: [["Europe","Europe"]]
About the author
I'm currently working on real-time user-facing analytics with Apache Pinot at StarTree. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.