How to treat NULL as a normal string with pandas?

Question:

I have a CSV file containing a column of strings, and I want to read it with pandas. In this file the string "null" occurs as an actual value and should not be treated as a missing value.

Example:

import pandas as pd
from io import StringIO

data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
print(pd.read_csv(StringIO(data)))

This gives the following output:

  strings  numbers
0     foo        1
1     bar        2
2     NaN        3

What can I do to get the value null as it is (and not as NaN) into the DataFrame? The file can be assumed not to contain any actually missing values.

Asked By: piripiri


Answers:

You can specify a converters argument for the string column.

pd.read_csv(StringIO(data), converters={'strings' : str})

  strings  numbers
0     foo        1
1     bar        2
2    null        3

This bypasses pandas' automatic NA-value parsing for that column.
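A minimal, self-contained check of the snippet above: the converter keeps the literal string 'null' in the column.

```python
import pandas as pd
from io import StringIO

data = 'strings,numbers\nfoo,1\nbar,2\nnull,3'

# The converter forces every cell in 'strings' through str(),
# so 'null' never enters pandas' NA-detection path.
df = pd.read_csv(StringIO(data), converters={'strings': str})
print(df['strings'].tolist())  # ['foo', 'bar', 'null']
```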


Another option is setting na_filter=False:

pd.read_csv(StringIO(data), na_filter=False)

  strings  numbers
0     foo        1
1     bar        2
2    null        3

This works for the entire DataFrame, so use it with caution. I recommend the first option if you want to apply this surgically to select columns instead.
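To illustrate the caution: na_filter=False disables NA detection for every column, so a genuinely empty field (shown here in a hypothetical second column) comes back as the empty string '' rather than NaN, and other sentinels like 'NA' survive too.

```python
import pandas as pd
from io import StringIO

# na_filter=False switches NA detection off for the whole frame:
# 'null' survives, but so do the empty field and 'NA' in the other column.
data = 'strings,other\nnull,\nfoo,NA'
df = pd.read_csv(StringIO(data), na_filter=False)
print(df['other'].tolist())  # ['', 'NA']
```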

Answered By: cs95

The reason this happens is that the string 'null' is treated as NaN during parsing. You can turn this off by passing keep_default_na=False, in addition to @coldspeed’s answer:

In[49]:
import io
data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
df = pd.read_csv(io.StringIO(data), keep_default_na=False)
df

Out[49]: 
  strings  numbers
0     foo        1
1     bar        2
2    null        3

The full list is:

na_values : scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values. By default the following values are interpreted
as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’,
‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’,
‘null’.
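Since keep_default_na=False clears that entire default list, it can be combined with na_values to re-add only the sentinels you still want treated as missing (the data and the 'NA' choice here are illustrative):

```python
import pandas as pd
from io import StringIO

# keep_default_na=False drops all default sentinels ('null', 'NA', ...);
# na_values=['NA'] then re-adds 'NA' alone, so 'null' survives as a string
# while 'NA' is still parsed as a missing value.
data = 'strings,numbers\nfoo,1\nnull,2\nNA,3'
df = pd.read_csv(StringIO(data), keep_default_na=False, na_values=['NA'])
print(df['strings'].tolist())
```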

Answered By: EdChum

UPDATE: 2020-03-23 for Pandas 1+:

Many thanks to @aiguofer for the adapted solution:

na_vals = pd.io.parsers.STR_NA_VALUES.difference({'NULL','null'})
df = pd.read_csv(io.StringIO(data), na_values=na_vals, keep_default_na=False)

Old answer:

We can dynamically exclude 'NULL' and 'null' from the set of default _NA_VALUES:

In [4]: na_vals = pd.io.common._NA_VALUES.difference({'NULL','null'})

In [5]: na_vals
Out[5]:
{'',
 '#N/A',
 '#N/A N/A',
 '#NA',
 '-1.#IND',
 '-1.#QNAN',
 '-NaN',
 '-nan',
 '1.#IND',
 '1.#QNAN',
 'N/A',
 'NA',
 'NaN',
 'n/a',
 'nan'}

and use it in read_csv():

df = pd.read_csv(io.StringIO(data), na_values=na_vals)

Other answers are better for reading in a CSV without "null" being interpreted as NaN, but if you already have a DataFrame that you want "fixed", this code will do so: df = df.fillna('null')
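A runnable sketch of that after-the-fact fix. Note it assumes, per the question, that the file has no genuinely missing values, since those would also become the string 'null':

```python
import pandas as pd
from io import StringIO

data = 'strings,numbers\nfoo,1\nbar,2\nnull,3'
df = pd.read_csv(StringIO(data))   # 'null' has already been read as NaN
df = df.fillna('null')             # restore it as a literal string
print(df['strings'].tolist())  # ['foo', 'bar', 'null']
```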

Answered By: Acccumulation