Searching for values in large dataframe with unnamed columns
Question:
I have a dataframe with ~300 columns in the following format:
| Column1      | Column2      | Column3    | Column5        |
| ------------ | ------------ | ---------- | -------------- |
| Color=Blue   | Location=USA | Name=Steve | N/A            |
| Location=USA | ID=123       | Name=Randy | Color=Purple   |
| ID=987       | Name=Gary    | Color=Red  | Location=Italy |
What is the best way to process such a huge and irregular dataset if I’m only interested in specific attributes, such as ‘Color’ and ‘ID’?
An example output if I only wanted to see ‘ID’ could be something like:
| Column1 | Column2 | Column3 | Column5 |
| ------- | ------- | ------- | ------- |
| ID=987  | ID=123  |         |         |
Or maybe even a list of results would work:
ID=[987, 123]
Answers:
A possible solution:
import pandas as pd

# Keep only the cells starting with "ID" (DataFrame.map requires pandas >= 2.1;
# on older versions use df.applymap instead)
a = df.where(df.map(lambda x: str(x).startswith("ID"))).values.flatten()
a[~pd.isnull(a)].tolist()
Alternatively:
import re
pattern = re.compile(r"^ID")
a = df.where(df.map(lambda x: bool(pattern.match(str(x))))).values.flatten()
a[~pd.isnull(a)].tolist()
Output:
['ID=123', 'ID=987']
For future processing, you might reorganize your dataframe into one column per attribute:
# Split each "key=value" cell, stack all columns, then pivot the keys into columns
piv = (pd.concat([df[c].str.split('=', expand=True) for c in df.columns])
         .set_index(0, append=True)[1].dropna()
         .unstack().rename_axis(columns=None))
Output:
>>> piv
Color ID Location Name
0 Blue NaN USA Steve
1 Purple 123 USA Randy
2 Red 987 Italy Gary
Usage:
>>> piv['ID'].dropna()
1 123
2 987
Name: ID, dtype: object
>>> piv['ID'].dropna().tolist()
['123', '987']
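Note that the pivoted values are strings; if numeric IDs are needed, a conversion step (sketched here with `pd.to_numeric`) can follow:

```python
import pandas as pd

# Example data from the question (N/A cell as NaN)
df = pd.DataFrame({
    "Column1": ["Color=Blue", "Location=USA", "ID=987"],
    "Column2": ["Location=USA", "ID=123", "Name=Gary"],
    "Column3": ["Name=Steve", "Name=Randy", "Color=Red"],
    "Column5": [float("nan"), "Color=Purple", "Location=Italy"],
})

# Same pivot as above: one column per attribute
piv = (pd.concat([df[c].str.split('=', expand=True) for c in df.columns])
         .set_index(0, append=True)[1].dropna()
         .unstack().rename_axis(columns=None))

# Convert the string values to numbers
ids = pd.to_numeric(piv['ID'].dropna()).tolist()
print(ids)  # [123, 987]
```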
With the quick numpy.char.find method, keeping only the cells where the pattern is found at a valid index (i.e. not -1):
import numpy as np

# astype(str) turns NaN into 'nan'; find returns -1 where 'ID=' is absent
df.values[np.char.find(df.values.astype(str), 'ID=') != -1].tolist()
['ID=123', 'ID=987']
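The same idea generalizes to any attribute. A hypothetical helper (`extract_attr` is not from the answers above, just an illustration) that anchors the match at the start of the cell and returns the bare values:

```python
import numpy as np
import pandas as pd

def extract_attr(df, key):
    # astype(str) turns NaN into 'nan'; == 0 anchors the match at the cell start
    vals = df.values.astype(str)
    hits = vals[np.char.find(vals, f"{key}=") == 0]
    return [v.split("=", 1)[1] for v in hits]

# Example data from the question (N/A cell as NaN)
df = pd.DataFrame({
    "Column1": ["Color=Blue", "Location=USA", "ID=987"],
    "Column2": ["Location=USA", "ID=123", "Name=Gary"],
    "Column3": ["Name=Steve", "Name=Randy", "Color=Red"],
    "Column5": [float("nan"), "Color=Purple", "Location=Italy"],
})

print(extract_attr(df, "ID"))     # ['123', '987']
print(extract_attr(df, "Color"))  # ['Blue', 'Purple', 'Red']
```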