How to treat NULL as a normal string with pandas?
Question:
I have a CSV file with a column of strings and I want to read it with pandas. In this file the string null occurs as an actual value and should not be regarded as a missing value.
Example:
import pandas as pd
from io import StringIO
data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
print(pd.read_csv(StringIO(data)))
This gives the following output:
  strings  numbers
0     foo        1
1     bar        2
2     NaN        3
What can I do to get the value null as it is (and not as NaN) into the DataFrame? The file can be assumed not to contain any actually missing values.
Answers:
You can specify a converters argument for the strings column.
pd.read_csv(StringIO(data), converters={'strings' : str})
  strings  numbers
0     foo        1
1     bar        2
2    null        3
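As a minimal runnable sketch of this (same toy data as above), note that a converter applied to the strings column leaves NA detection and dtype inference untouched for the other columns:

```python
import pandas as pd
from io import StringIO

data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'

# The converter runs on the raw field text, so 'null' never reaches
# pandas' NA detection; the numbers column is still parsed as integers.
df = pd.read_csv(StringIO(data), converters={'strings': str})
print(df['strings'].tolist())       # ['foo', 'bar', 'null']
print(str(df['numbers'].dtype))     # int64
```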
This bypasses pandas' automatic NA parsing for that column.
Another option is setting na_filter=False:
pd.read_csv(StringIO(data), na_filter=False)
  strings  numbers
0     foo        1
1     bar        2
2    null        3
This works for the entire DataFrame, so use it with caution. I recommend the first option if you want to apply this surgically to select columns.
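To see the whole-DataFrame effect, here is a small sketch with a genuinely empty field added to the toy data; with na_filter=False that field comes through as an empty string rather than NaN:

```python
import pandas as pd
from io import StringIO

# Second row's strings field is empty on purpose.
data = u'strings,numbers\nfoo,1\n,2\nnull,3'

# na_filter=False disables NA detection everywhere: the empty field
# becomes '' and 'null' stays a literal string.
df = pd.read_csv(StringIO(data), na_filter=False)
print(df['strings'].tolist())  # ['foo', '', 'null']
```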
The reason this happens is that the string 'null' is treated as NaN during parsing; you can turn this off by passing keep_default_na=False, in addition to @coldspeed's answer:
In [48]: import io

In [49]: data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
    ...: df = pd.read_csv(io.StringIO(data), keep_default_na=False)
    ...: df
Out[49]:
  strings  numbers
0     foo        1
1     bar        2
2    null        3
The full list is:
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values. By default the following values are interpreted
as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’,
‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’,
‘null’.
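If you still need some sentinels treated as missing, keep_default_na=False can be combined with an explicit na_values list. A sketch with a hypothetical 'N/A' row added to the toy data:

```python
import pandas as pd
from io import StringIO

data = u'strings,numbers\nfoo,1\nN/A,2\nnull,3'

# Drop the whole default NA list, then re-add only the sentinel
# we actually want interpreted as missing.
df = pd.read_csv(StringIO(data), keep_default_na=False, na_values=['N/A'])
print(df['strings'].isna().tolist())  # [False, True, False]
print(df['strings'].iloc[2])          # null
```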
UPDATE: 2020-03-23 for Pandas 1+:
many thanks to @aiguofer for the adapted solution:
import io
import pandas as pd

na_vals = pd.io.parsers.STR_NA_VALUES.difference({'NULL', 'null'})
df = pd.read_csv(io.StringIO(data), na_values=na_vals, keep_default_na=False)
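Since the import path of that constant has moved between pandas versions, a version-independent sketch is to spell out the default sentinel set yourself (copied from the documentation list above) and subtract the two values:

```python
import pandas as pd
from io import StringIO

# Default NA sentinels as listed in the read_csv docs, minus the two
# we want to keep as literal strings.
default_na = {'', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN',
              '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA',
              'NULL', 'NaN', 'n/a', 'nan', 'null'}
na_vals = default_na - {'NULL', 'null'}

data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
df = pd.read_csv(StringIO(data), na_values=na_vals, keep_default_na=False)
print(df['strings'].tolist())  # ['foo', 'bar', 'null']
```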
Old answer: we can dynamically exclude 'NULL' and 'null' from the set of default _NA_VALUES:
In [4]: na_vals = pd.io.common._NA_VALUES.difference({'NULL','null'})
In [5]: na_vals
Out[5]:
{'',
'#N/A',
'#N/A N/A',
'#NA',
'-1.#IND',
'-1.#QNAN',
'-NaN',
'-nan',
'1.#IND',
'1.#QNAN',
'N/A',
'NA',
'NaN',
'n/a',
'nan'}
and use it in read_csv():
df = pd.read_csv(io.StringIO(data), na_values=na_vals)
Other answers are better for reading in a CSV without "null" being interpreted as NaN, but if you have a DataFrame that you want "fixed", this code will do so: df = df.fillna('null')
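A sketch of that post-hoc fix on the toy data; note that fillna('null') rewrites every NaN in the frame, including genuinely missing values in other columns:

```python
import pandas as pd
from io import StringIO

data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
df = pd.read_csv(StringIO(data))  # 'null' is read as NaN here
df = df.fillna('null')            # write the literal string back
print(df['strings'].tolist())     # ['foo', 'bar', 'null']
```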