Python Dask – how to get row content on string match

Question:

I have a very large dataset (more than 1 million entries) and a list of postcodes. I want to loop through the postcodes and build a list of the matching output area codes from the dataset.

The dataset source: https://geoportal.statistics.gov.uk/datasets/06938ffe68de49de98709b0c2ea7c21a/about

The code:

import dask.dataframe as dd
df= dd.read_csv("PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv")
zipcodes = ["AB1 5YP","AB1 7FH"]

oa11cd_output = []
for zipcode in zipcodes:
    entry = df[df['pcds'] == zipcode]
    oa11cd_output.append(entry['oa11cd'])

However, when I try to print the entry, I do not get the actual row content but something that looks like this:

Name: oa11cd, dtype: object
Dask Name: getitem, 5 graph layers
dd.Scalar<size-ag..., dtype=int32>
Dask Series Structure:
npartitions=6
    object
       ...
     ...  
       ...
       ...

Any idea how to get the actual content? Thank you

Asked By: Sam333


Answers:

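The encoding seems to be "iso-8859-1". On top of that, type inference fails for two of the columns in this particular file, so you have to set their dtypes explicitly. Note also that a Dask DataFrame is lazy: filtering only builds a task graph, and you have to call .compute() to get the actual rows. See the code below: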

import dask.dataframe as dd
df= dd.read_csv("PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv", 
                dtype={'doterm': 'float64', 'ladnmw': 'object'}, 
                encoding="iso-8859-1")
zipcodes = ["AB1 5YP","AB1 7FH"]

oa11cd_output = []
for zipcode in zipcodes:
    entry = df[df['pcds'] == zipcode].compute().to_dict()
    oa11cd_output.append(entry['oa11cd'])

Some explanation of how I determined the encoding. My first try was:

> file -i PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv
PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv: text/csv; charset=us-ascii

OK, so file says it is us-ascii. But file does not read the whole file to make that evaluation (see the -P parameter). I tried to increase how much file reads, but the process ran out of memory. Let's try to convert it:

> iconv -f US-ASCII -t UTF-8 PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv > converted.csv
iconv: illegal input sequence at position 198311200

OK, so it is clearly not us-ascii. Let's output a few lines starting at that position:

> tail -c +198311200 PCD_OA_LSOA_MSOA_LAD_AUG19_UK_LU.csv| head -n 10 > problem_lines.txt
> file -i problem_lines.txt
problem_lines.txt: text/plain; charset=iso-8859-1

Problem solved!
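
If iconv and file are not available, roughly the same check can be done in Python by scanning the file in chunks for the first byte outside the ASCII range. This is only a sketch: the function name first_non_ascii and the demo file content are made up, and note that decoding as iso-8859-1 never raises an error, so a readable result is a hint rather than proof of the encoding.

```python
import os
import tempfile


def first_non_ascii(path, chunk_size=1 << 20, context=20):
    """Return (offset, nearby_bytes) for the first byte >= 0x80, or None."""
    offset = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return None  # reached end of file: everything was ASCII
            if not chunk.isascii():
                # Locate the first byte with the high bit set in this chunk.
                pos = next(i for i, b in enumerate(chunk) if b >= 0x80)
                start = max(0, offset + pos - context)
                fh.seek(start)
                return offset + pos, fh.read(2 * context)
            offset += len(chunk)


# Demo on a small made-up file containing one Latin-1 byte (0xF4).
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"pcds,ladnmw\nAB1 5YP,Ynys M\xf4n\n")
    demo_path = tmp.name

found = first_non_ascii(demo_path)
os.remove(demo_path)

position, snippet = found
text = snippet.decode("iso-8859-1")
print(position, repr(text))
```

Reading in fixed-size chunks keeps memory use flat, which avoids the out-of-memory problem mentioned above.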

Answered By: vladmihaisima