Loop over two variables to create multiple year columns

Question:

If I have table

|a      | b     | c|
|"hello"|"world"| 1|

and the variables

start = 2000
end = 2015

How do I, in PySpark, add 16 columns, the first named m2000, the second m2001, and so on up to m2015, all filled with 0, so that the new dataframe is

|a      | b     | c|m2000 | m2001 | m2002 | ... | m2015|
|"hello"|"world"| 1| 0    | 0     | 0     | ... |   0  |

I have tried the following, but

df = df.select(
    '*',
    *["0".alias(f'm{i}') for i in range(2000, 2016)]
)
df.show()

I get the error

AttributeError: 'str' object has no attribute 'alias'
Asked By: lunbox


Answers:

In pandas, you can do the following:

import pandas as pd

df = pd.Series({'a': 'Hello', 'b': 'World', 'c': 1}).to_frame().T
df[['m{}'.format(x) for x in range(2000, 2016)]] = 0
print(df)

I am not very familiar with the Spark syntax, but the approach should be near-identical.

What is happening:
The term ['m{}'.format(x) for x in range(2000, 2016)] is a list comprehension that creates the list of desired column names. We assign the value 0 to these columns; since the columns do not yet exist, pandas adds them.
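For reference, the nearest PySpark analogue I can sketch (an assumption on my part: Spark 3.3+, where DataFrame.withColumns accepts a dict of new columns, and an existing SparkSession named spark):

from pyspark.sql.functions import lit

df = spark.createDataFrame([("hello", "world", 1)], ["a", "b", "c"])
# Build the same name -> value mapping as the pandas list comprehension.
df = df.withColumns({f"m{x}": lit(0) for x in range(2000, 2016)})
df.show()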

Answered By: C Hecht

You can simply use withColumn to add the relevant columns.

from pyspark.sql.functions import col,lit

df = spark.createDataFrame(data=[("hello","world",1)],schema=["a","b","c"])

df.show()

+-----+-----+---+
|    a|    b|  c|
+-----+-----+---+
|hello|world|  1|
+-----+-----+---+

for i in range(2000, 2016):
    df = df.withColumn("m" + str(i), lit(0))

df.show()

+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a|    b|  c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world|  1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
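Since the question defines start and end variables, the loop can use them directly; range's upper bound is exclusive, hence the + 1. A minimal sketch reusing the df above:

start = 2000
end = 2015

for i in range(start, end + 1):
    df = df.withColumn("m" + str(i), lit(0))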
Answered By: Grisha Weintraub

You can use a one-liner:

df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])

Full example:

import pyspark.sql.functions as F

df = spark.createDataFrame([["hello","world",1]], ["a","b","c"])
df = df.select(df.columns + [F.lit(0).alias(f"m{i}") for i in range(2000, 2016)])
df.show()

[Out]:
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a|    b|  c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world|  1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
Answered By: Azhar Khan

Your code for generating the extra columns is almost right; you just need to wrap the "0" in the lit function, like this:

from pyspark.sql.functions import lit

df.select('*', *[lit("0").alias(f'm{i}') for i in range(2000, 2016)]).show()

+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a|    b|  c|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|world|  1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
+-----+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
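One detail worth noting: lit("0") produces string columns, while lit(0) produces integer columns. If the zeros should be numeric, pass an integer. A quick sketch to verify the resulting types:

from pyspark.sql.functions import lit

df.select(lit("0").alias("as_string"), lit(0).alias("as_int")).printSchema()
# root
#  |-- as_string: string (nullable = false)
#  |-- as_int: integer (nullable = false)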

I would be cautious about calling the withColumn method repeatedly: every call adds a new projection to Spark's query execution plan, which can become very expensive computationally. Using a single select is always the better approach.
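To see the difference yourself, you can compare the plans with explain. A minimal sketch (assuming an existing SparkSession named spark): the analyzed logical plan of the loop version contains one Project node per withColumn call, which the optimizer must later collapse, while the select version starts from a single projection.

from pyspark.sql.functions import lit

df = spark.createDataFrame([("hello", "world", 1)], ["a", "b", "c"])

# Repeated withColumn: the analyzed logical plan grows by one Project per call.
df_loop = df
for i in range(2000, 2016):
    df_loop = df_loop.withColumn(f"m{i}", lit(0))
df_loop.explain(True)

# Single select: one projection covering all new columns at once.
df_sel = df.select("*", *[lit(0).alias(f"m{i}") for i in range(2000, 2016)])
df_sel.explain(True)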

Answered By: Bartosz Gajda