Select columns which contain a string in PySpark
Question:
I have a PySpark DataFrame with a lot of columns, and I want to select the ones whose names contain a certain string, plus some other specific columns. For example:
df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index']
I want to select the ones which contain ‘hello’ and also the column named ‘index’, so the result will be:
['hello_world','hello_country','hello_everyone','index']
I want something like df.select('hello*','index')
Thanks in advance 🙂
EDIT:
I found a quick way to solve it, so I answered my own question, Q&A style. If someone sees my solution and can provide a better one, I will appreciate it.
Answers:
This sample code does what you want:
hello_cols = []
for col in df.columns:
    if 'index' in col or 'hello' in col:
        hello_cols.append(col)
df.select(*hello_cols)
I’ve found a quick and elegant way:
selected = [s for s in df.columns if 'hello' in s]+['index']
df.select(selected)
With this solution I can add more columns to the list without editing the for loop that Ali AzG suggested.
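If you want something closer to the df.select('hello*','index') syntax from the question, here is a minimal sketch using shell-style wildcards from Python's standard fnmatch module. The helper name expand_columns is my own, not part of PySpark, and the column list is the one from the question.

```python
from fnmatch import fnmatchcase

def expand_columns(columns, *patterns):
    """Return the names in `columns` that match any shell-style pattern,
    keeping the original column order. (Hypothetical helper, not a PySpark API.)"""
    return [c for c in columns if any(fnmatchcase(c, p) for p in patterns)]

# Column names from the question
columns = ['hello_world', 'hello_country', 'hello_everyone', 'byebye', 'ciao', 'index']

selected = expand_columns(columns, 'hello*', 'index')
print(selected)
# ['hello_world', 'hello_country', 'hello_everyone', 'index']
```

With a real DataFrame you would pass df.columns and hand the result to select, e.g. df.select(expand_columns(df.columns, 'hello*', 'index')). Note that 'hello*' only matches names that start with hello; use '*hello*' to match names that contain it anywhere.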
You can also try the colRegex function, introduced in Spark 2.3, which lets you specify the column name as a regular expression.
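As a rough sketch of what that could look like for the columns in the question: colRegex takes the regular expression wrapped in backticks. The snippet below only checks the pattern locally with Python's re module, so it runs without a Spark session. I am assuming that Spark matches the pattern against the whole column name and that this simple pattern behaves the same in Java's regex engine, so treat it as an illustration rather than a guarantee.

```python
import re

# Column names from the question
columns = ['hello_world', 'hello_country', 'hello_everyone', 'byebye', 'ciao', 'index']

# colRegex expects the regular expression quoted in backticks
pattern = "`(hello.*|index)`"

# Local check of the pattern: strip the backticks and require a full-name match
# (assumption: this mirrors how Spark applies the pattern to column names)
regex = re.compile(pattern.strip("`"))
matched = [c for c in columns if regex.fullmatch(c)]
print(matched)

# With a real DataFrame `df` on Spark 2.3+, the corresponding call would be:
# df.select(df.colRegex(pattern))
```

The local check should pick the three hello_ columns plus index; it is worth confirming against your own DataFrame, since settings such as case sensitivity can affect how Spark matches names.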
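I used Manrique's answer and built on it to keep all columns and also add renamed copies of the matching ones:
from pyspark.sql import functions as F
sel_cols = [i for i in df.columns if i.startswith("colName")]
df = df.select('*', *(F.col(x).alias('rename_text' + x) for x in sel_cols))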