Use a specific column's values as a checker to change other column values in pyspark/pandas

Question:

If I have the table below:

| a       | id | year | m2000 | m2001 | m2002 | ... | m2015 |
| "hello" | 1  | 2001 | 0     | 0     | 0     | ... | 0     |
| "hello" | 1  | 2015 | 0     | 0     | 0     | ... | 0     |
| "hello" | 2  | 2002 | 0     | 0     | 0     | ... | 0     |
| "hello" | 2  | 2015 | 0     | 0     | 0     | ... | 0     |

How do I change the dataframe so that it checks the year column in each row and sets the matching year column to 1? In the example above, m2001 and m2015 would become 1, and since id is 1 in both rows, the rows are merged so the new table looks like the one below:

| a       | id | m2000 | m2001 | m2002 | ... | m2015 |
| "hello" | 1  | 0     | 1     | 0     | ... | 1     |
| "hello" | 2  | 0     | 0     | 1     | ... | 1     |
Asked By: lunbox


Answers:

from pyspark.sql import functions as F

new = df.select('a', 'id', 'year', *[F.when(F.size(F.array_distinct(F.array(F.col('year').cast('string'), F.lit(x[1:])))) == 1, 1).otherwise(0).alias(x) for x in df.columns if x not in ['a', 'id', 'year']])

new.groupBy('a', 'id').agg(*[F.max(x).alias(x) for x in new.columns if x not in ['a', 'id', 'year']]).show()

How it works

For each mXXXX column, build an array that pairs the row's year value with the column's year suffix (e.g. for a row with year 2001, m2001 becomes ["2001", "2001"] while m2000 becomes ["2001", "2000"])

df.select('a', 'id', 'year', *[F.array(F.col('year').cast('string'), F.lit(x[1:])).alias(x) for x in df.columns if x not in ['a', 'id', 'year']])

Find the distinct elements of the array in each column (when the row's year matches the column's year, the two elements collapse to one)

df.select('a', 'id', 'year', *[F.array_distinct(F.array(F.col('year').cast('string'), F.lit(x[1:]))).alias(x) for x in df.columns if x not in ['a', 'id', 'year']])

Find the size of each array in the individual columns (size 1 means the column's year matches the row's year)

df.select('a', 'id', 'year', *[F.size(F.array_distinct(F.array(F.col('year').cast('string'), F.lit(x[1:])))).alias(x) for x in df.columns if x not in ['a', 'id', 'year']])

Where the size is not 1, the column's year and the row's year didn't agree, so set the value to 0; where it is 1, set it to 1
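In code, this wraps the size expression from the previous step in when/otherwise, which is the select used to build new above:

df.select('a', 'id', 'year', *[F.when(F.size(F.array_distinct(F.array(F.col('year').cast('string'), F.lit(x[1:])))) == 1, 1).otherwise(0).alias(x) for x in df.columns if x not in ['a', 'id', 'year']])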

Finally, group by 'a' and 'id' and take the max of each column, so the rows sharing an id collapse into one

Answered By: wwnde

You can pivot by year, then generate the full range of year columns from a start and end year, filling zeros for any year that has no data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data=[["hello", 1, 2001], ["hello", 1, 2015], ["hello", 2, 2002], ["hello", 2, 2015]], schema=["a", "id", "year"])

start = 2000
end = 2015

# tag each row with the name of its year column, pivot to one row per (a, id),
# fill missing combinations with 0, and add zero columns for years with no data
df = df.withColumn("myear", F.concat(F.lit("m"), "year"))
df = df.groupBy("a", "id").pivot("myear").agg((F.count("myear") > 0).cast("integer"))
df = df.fillna({c: 0 for c in df.columns if c not in ["a", "id"]})
df = df.select(["a", "id"] + [F.col(f"m{i}") if f"m{i}" in df.columns else F.lit(0).alias(f"m{i}") for i in range(start, end + 1)])

[Out]:
+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a| id|m2000|m2001|m2002|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|m2015|
+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|  1|    0|    1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    1|
|hello|  2|    0|    0|    1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    1|
+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
Answered By: Azhar Khan
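
The question also mentions pandas; here is a minimal sketch of the same idea, assuming the example data from the question and that the zero-filled m2000..m2015 columns still need to be created:

import pandas as pd

pdf = pd.DataFrame({
    "a": ["hello", "hello", "hello", "hello"],
    "id": [1, 1, 2, 2],
    "year": [2001, 2015, 2002, 2015],
})

# create the zero-filled year columns (assumed range m2000..m2015)
for y in range(2000, 2016):
    pdf[f"m{y}"] = 0

# set the column named after each row's year to 1
for y in pdf["year"].unique():
    pdf.loc[pdf["year"] == y, f"m{y}"] = 1

# collapse to one row per (a, id) by taking the max of every m column
out = pdf.drop(columns="year").groupby(["a", "id"], as_index=False).max()
print(out)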