Searching for values in large dataframe with unnamed columns
Question:
I have a dataframe with ~300 columns in the following format:
| Column1      | Column2      | Column3    | Column5        |
| ------------ | ------------ | ---------- | -------------- |
| Color=Blue   | Location=USA | Name=Steve | N/A            |
| Location=USA | ID=123       | Name=Randy | Color=Purple   |
| ID=987       | Name=Gary    | Color=Red  | Location=Italy |
What is the best way to process such a huge and irregular dataset if I’m only interested in specific attributes, such as ‘Color’ and ‘ID’?
An example output if I only wanted to see ‘ID’ could be something like:
| Column1 | Column2 | Column3 | Column5 |
| ------- | ------- | ------- | ------- |
| ID=987  | ID=123  |         |         |
Or maybe even a list of results would work:
ID=[987, 123]
Answers:
A possible solution:
import pandas as pd

# Keep only the cells starting with "ID" (DataFrame.map requires pandas >= 2.1;
# on older versions use df.applymap instead)
a = df.where(df.map(lambda x: str(x).startswith("ID"))).values.flatten()
a[~pd.isnull(a)].tolist()
Alternatively:
import re
pattern = re.compile(r"^ID")
a = df.where(df.map(lambda x: bool(pattern.match(str(x))))).values.flatten()
a[~pd.isnull(a)].tolist()
Output:
['ID=123', 'ID=987']
For future processing, you might reorganize your dataframe into one column per attribute:
# Split each "key=value" cell, stack all columns, then pivot the keys into columns
piv = (pd.concat([df[c].str.split('=', expand=True) for c in df.columns])
         .set_index(0, append=True)[1].dropna()
         .unstack().rename_axis(columns=None))
Output:
>>> piv
Color ID Location Name
0 Blue NaN USA Steve
1 Purple 123 USA Randy
2 Red 987 Italy Gary
Usage:
>>> piv['ID'].dropna()
1 123
2 987
Name: ID, dtype: object
>>> piv['ID'].dropna().tolist()
['123', '987']
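Note that the pivoted values are strings; if numeric IDs are needed, a conversion step (sketched here with `pd.to_numeric`) can follow:

```python
import pandas as pd

# Example data from the question (N/A cell as NaN)
df = pd.DataFrame({
    "Column1": ["Color=Blue", "Location=USA", "ID=987"],
    "Column2": ["Location=USA", "ID=123", "Name=Gary"],
    "Column3": ["Name=Steve", "Name=Randy", "Color=Red"],
    "Column5": [float("nan"), "Color=Purple", "Location=Italy"],
})

# Same pivot as above: one column per attribute
piv = (pd.concat([df[c].str.split('=', expand=True) for c in df.columns])
         .set_index(0, append=True)[1].dropna()
         .unstack().rename_axis(columns=None))

# Convert the string values to numbers
ids = pd.to_numeric(piv['ID'].dropna()).tolist()
print(ids)  # [123, 987]
```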
With the quick numpy.char.find method, keeping only the cells where the pattern is found at a valid index (i.e. not -1):
import numpy as np

# astype(str) turns NaN into 'nan'; find returns -1 where 'ID=' is absent
df.values[np.char.find(df.values.astype(str), 'ID=') != -1].tolist()
['ID=123', 'ID=987']
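The same idea generalizes to any attribute. A hypothetical helper (`extract_attr` is not from the answers above, just an illustration) that anchors the match at the start of the cell and returns the bare values:

```python
import numpy as np
import pandas as pd

def extract_attr(df, key):
    # astype(str) turns NaN into 'nan'; == 0 anchors the match at the cell start
    vals = df.values.astype(str)
    hits = vals[np.char.find(vals, f"{key}=") == 0]
    return [v.split("=", 1)[1] for v in hits]

# Example data from the question (N/A cell as NaN)
df = pd.DataFrame({
    "Column1": ["Color=Blue", "Location=USA", "ID=987"],
    "Column2": ["Location=USA", "ID=123", "Name=Gary"],
    "Column3": ["Name=Steve", "Name=Randy", "Color=Red"],
    "Column5": [float("nan"), "Color=Purple", "Location=Italy"],
})

print(extract_attr(df, "ID"))     # ['123', '987']
print(extract_attr(df, "Color"))  # ['Blue', 'Purple', 'Red']
```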