pandas – convert string into list of strings
Question:
I have this 'file.csv' file to read with pandas:
Title|Tags
T1|"[Tag1,Tag2]"
T1|"[Tag1,Tag2,Tag3]"
T2|"[Tag3,Tag1]"
using
df = pd.read_csv('file.csv', sep='|')
the output is:
Title Tags
0 T1 [Tag1,Tag2]
1 T1 [Tag1,Tag2,Tag3]
2 T2 [Tag3,Tag1]
I know that the Tags column holds plain strings, since:
In [64]: df['Tags'][0][0]
Out[64]: '['
I need to read it as a list of strings like ["Tag1","Tag2"]. I tried the solution provided in this question but had no luck, since the [ and ] characters in my data get in the way.
The expected output should be:
In [64]: df['Tags'][0][0]
Out[64]: 'Tag1'
Answers:
I think you could use the json module.
import json
import pandas as pd
df = pd.read_csv('file.csv', sep='|')
df['Tags'] = df['Tags'].apply(json.loads)
This loads your dataframe as before, then applies json.loads() to each item in the Tags column, converting the string representation of a list into an actual list. Note that json.loads() only accepts valid JSON, so the elements must be double-quoted, as in ["Tag1","Tag2"]; an unquoted string such as [Tag1,Tag2] raises json.JSONDecodeError.
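To make the quoting requirement concrete, here is a minimal sketch using only the standard library. The variable names are illustrative and not part of the original answer.

```python
import json

quoted = '["Tag1","Tag2"]'  # valid JSON: every element is a double-quoted string
unquoted = '[Tag1,Tag2]'    # the format stored in the question's file: not valid JSON

# The quoted form parses into a real Python list.
parsed = json.loads(quoted)

# The unquoted form is rejected, so json.loads alone cannot handle the question's data.
try:
    json.loads(unquoted)
    unquoted_ok = True
except json.JSONDecodeError:
    unquoted_ok = False
```

If the file can be regenerated with quoted elements, the json approach applies directly; otherwise one of the string-splitting answers is likely a better fit.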
You can convert the string to a list using strip and split.
df_out = df.assign(Tags=df.Tags.str.strip('[]').str.split(','))
df_out.Tags[0][0]
Output:
'Tag1'
You can split the string manually:
>>> df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))
>>> df.Tags[0]
['Tag1', 'Tag2']
Or:
df.Tags = df.Tags.str[1:-1].str.split(',').tolist()
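The strip-and-split approaches above assume there are no spaces around the commas and that no row holds an empty list. A small helper can cover both cases; this is only a sketch, and parse_tags is a hypothetical name rather than anything from pandas.

```python
def parse_tags(raw):
    """Turn '[Tag1, Tag2]' into ['Tag1', 'Tag2']; '[]' becomes []."""
    # Remove surrounding whitespace, then the enclosing brackets.
    inner = raw.strip().strip('[]')
    # ''.split(',') returns [''], so empty pieces are dropped explicitly.
    return [piece.strip() for piece in inner.split(',') if piece.strip()]


plain = parse_tags('[Tag1,Tag2]')
spaced = parse_tags('[ Tag3 , Tag1 ]')
empty = parse_tags('[]')
```

It could then be applied to the column with df['Tags'].apply(parse_tags).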
Your df['Tags'] is a Series of strings. If you print its values you should get ["[Tag1,Tag2]", "[Tag1,Tag2,Tag3]", "[Tag3,Tag1]"]. This is why indexing the first element of the first element returns the first character of the string rather than the first tag.
You need to parse each string after reading it. Splitting on commas alone, something like
tags = df['Tags'][0].split(',')
leaves the brackets in place, as you saw in your cited example:
in: tags[0]
out: '[Tag1'
What you need is a way to parse the string while dropping several characters at once. You can use a simple regular expression to do this. Something like:
import re
tags = re.findall(r"[\w']+", df['Tags'][0])
print(tags[0])
will print:
Tag1
Using the other answer involving pandas converters, you might write a converter like this:
def clean(seq_string):
    return re.findall(r"[\w']+", seq_string)
Regular expressions are quite powerful, but they can also be unpredictable if you're not sure about the content of your input strings. The expression used here, r"[\w']+", matches each run of word characters (letters, digits and underscores) or apostrophes; re.findall returns those runs as a list, so every other character, including the brackets and commas, acts as a separator.
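As a rough sketch of how such a converter could be wired in, the snippet below passes clean to the converters argument of pd.read_csv. The file contents are inlined through io.StringIO purely so the example is self-contained; that part is an assumption and not how the question reads its data.

```python
import io
import re

import pandas as pd


def clean(seq_string):
    # Collect every run of word characters or apostrophes; brackets and commas are skipped.
    return re.findall(r"[\w']+", seq_string)


# Same rows as the question's file.csv, inlined for illustration.
csv_text = (
    'Title|Tags\n'
    'T1|"[Tag1,Tag2]"\n'
    'T1|"[Tag1,Tag2,Tag3]"\n'
    'T2|"[Tag3,Tag1]"\n'
)

# converters maps a column name to a function applied to each raw cell string.
df = pd.read_csv(io.StringIO(csv_text), sep='|', converters={'Tags': clean})
first_tag = df['Tags'][0][0]
```

Doing the conversion at read time avoids a second pass over the column, at the cost of tying the parsing rule to the read_csv call.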
You could use the built-in ast.literal_eval; it works for tuples as well as lists, provided the string is a valid Python literal (string elements must be quoted):
import ast
import pandas as pd
df = pd.DataFrame({"mytuples": ["(1,2,3)"]})
print(df.iloc[0, 0])
# >> (1,2,3)      (still a str)
df["mytuples"] = df["mytuples"].apply(ast.literal_eval)
print(df.iloc[0, 0])
# >> (1, 2, 3)    (now a tuple)
EDIT: eval should be avoided! If the string being evaluated is os.system('rm -rf /'), it will start deleting all the files on your computer. For ast.literal_eval, the string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None. Thanks @TrentonMcKinney 🙂
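The difference can be illustrated with a small wrapper that returns None when the input is not a plain literal. This is a sketch only; parse_literal is a hypothetical helper name.

```python
import ast


def parse_literal(text):
    """Return the parsed literal, or None if text is not a plain Python literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


as_tuple = parse_literal("(1,2,3)")
as_list = parse_literal("['Tag1','Tag2']")
# A function call is not a literal, so it is rejected without ever being executed.
rejected_call = parse_literal("os.system('rm -rf /')")
# Bare names such as Tag1 are not literals either, so unquoted tags are rejected too.
rejected_bare = parse_literal("[Tag1,Tag2]")
```

The last case shows that the unquoted format from the question would still need one of the string-splitting approaches.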