Trying to filter in dask.read_parquet tries to compare NoneType and str

Question:

I have a project where I pass the following load_args to read_parquet:

filters = {'filters': [('itemId', '=', '9403cfde-7fe5-4c9c-916c-41ff0b595c5c')]}

According to the documentation, a List[Tuple] like this should be accepted and I should get all partitions which match the predicate (or equivalently, filter out those that do not).
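To make the expected behavior concrete, here is a minimal pure-Python sketch of partition pruning with a List[Tuple] filter. The partition dictionaries, file names, statistics, and the `keep_partition` helper are all hypothetical and only mimic the idea; this is not dask's implementation.

```python
# Hypothetical partitions, each carrying (min, max) statistics per column.
partitions = [
    {"path": "part.0.parquet", "stats": {"itemId": ("0000", "4fff")}},
    {"path": "part.1.parquet", "stats": {"itemId": ("5000", "9fff")}},
    {"path": "part.2.parquet", "stats": {"itemId": ("a000", "ffff")}},
]

# Same shape as the filter in the question: a list of (column, op, value) tuples.
filters = [("itemId", "=", "9403cfde-7fe5-4c9c-916c-41ff0b595c5c")]


def keep_partition(stats, filters):
    """Keep a partition only if every predicate could match some row in it."""
    for column, op, value in filters:
        if op not in ("=", "=="):
            raise NotImplementedError(op)  # this sketch only handles equality
        lo, hi = stats[column]
        # Strings compare lexicographically, so a range check is well defined.
        if not (lo <= value <= hi):
            return False
    return True


kept = [p["path"] for p in partitions if keep_partition(p["stats"], filters)]
print(kept)
```

Only the partition whose min/max range can contain the value survives; the others are skipped without being read.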

However, it gives me the following error:

/home/user/project/venv/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py:1275 in apply_conjunction

  1264     for part, stats in zip(parts, statistics):
  1265         if "filter" in stats and stats["filter"]:
  1266             continue  # Filtered by engine
  1267         try:
  1268             c = toolz.groupby("name", stats["columns"])[column][0]
  1269             min = c["min"]
  1270             max = c["max"]
  1271         except KeyError:
  1272             out_parts.append(part)
  1273             out_statistics.append(stats)
  1274         else:
❱ 1275             if (
  1276                 operator in ("==", "=")
  1277                 and min <= value <= max
  1278                 or operator == "!="

TypeError: '<=' not supported between instances of 'NoneType' and 'str'

It seems that read_parquet tries to compute min and max values for my str value that I wish to filter on, but I’m not sure that makes sense in this case. Even so, str values should be comparable (though it might not make a huge amount of sense in this case, seeing how the itemId is a random UUID).
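The error message itself can be reproduced without dask. The sketch below shows that str values compare fine, and that the TypeError appears only when a min/max statistic is None. The `may_contain` helper is hypothetical and illustrates one way a missing statistic could be tolerated; it is not taken from dask.

```python
def may_contain(min_stat, max_stat, value):
    """Return True if a partition with these statistics could hold `value`."""
    # Missing statistics: the partition cannot be ruled out, so keep it.
    if min_stat is None or max_stat is None:
        return True
    return min_stat <= value <= max_stat


value = "9403cfde-7fe5-4c9c-916c-41ff0b595c5c"

# Comparing str with str works (lexicographic order).
in_range = may_contain("00000000", "ffffffff", value)

# Comparing None with str raises the same TypeError as in the traceback.
try:
    None <= value <= None
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(in_range)
print(error_message)
```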

Still, I expected this to work. What am I doing wrong?

Asked By: filpa


Answers:

The problem probably arises because min and max haven’t been redefined at that point, so they still refer to the built-in functions that compute the minimum and maximum of their arguments, which obviously can’t be compared with a string. Try using a different name for these variables (as a rule of thumb, avoid overly generic variable names that may already be defined by the standard library).

Answered By: pasthec

As discovered by aywandji in the aforementioned GitHub issue, the problem comes from the way dask accesses the min/max metadata.

The metadata is accessed with an integer (the index of the ith column), but the index of a given column name can change from one file to another in the same directory (i.e. the filtered column is not at the same position in every file).
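The effect of a positional lookup can be illustrated with a small pure-Python sketch. The two per-file column lists, their values, and both helper functions are hypothetical; they only show why an index taken from one file can point at a different column (here one without statistics) in another file, and they are not dask's actual code.

```python
# Hypothetical per-file column statistics; the column order differs between files.
file_a = [
    {"name": "itemId", "min": "0a", "max": "9f"},
    {"name": "comment", "min": None, "max": None},
]
file_b = [
    {"name": "comment", "min": None, "max": None},
    {"name": "itemId", "min": "1b", "max": "8e"},
]


def stats_by_position(columns, index):
    """Look up min/max by column position (fragile across files)."""
    return columns[index]["min"], columns[index]["max"]


def stats_by_name(columns, name):
    """Look up min/max by column name (independent of column order)."""
    for col in columns:
        if col["name"] == name:
            return col["min"], col["max"]
    raise KeyError(name)


index_in_a = 0  # position of "itemId" in file_a, reused for every file

wrong = stats_by_position(file_b, index_in_a)  # hits "comment" in file_b
right = stats_by_name(file_b, "itemId")

print(wrong)
print(right)
```

With the positional lookup, the statistics for file_b come back as None, which is the kind of value that then fails the `min <= value <= max` comparison.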

A patch is in progress, and we hope it will be included in the next dask release!

From @filpa

It is fixed starting with the dask=2023.1.1 release, which was released on 2023-01-28.
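One way to check whether a given version includes the fix is to compare version tuples. The `has_fix` helper below is a hypothetical sketch that assumes plain calendar-style version strings such as "2023.1.1"; pre-release or local version suffixes are not handled.

```python
FIXED_IN = (2023, 1, 1)  # first release containing the fix, per the answer above


def has_fix(version: str) -> bool:
    """Return True if `version` (e.g. "2023.1.1") is at or after the fixed release."""
    # Assumes the first three dot-separated fields are plain integers.
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= FIXED_IN


print(has_fix("2022.12.1"))
print(has_fix("2023.1.1"))
```

In practice the version string could be taken from `dask.__version__`.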

Answered By: guillaume latour