Use a specific column's value as a checker to change other column values in PySpark/pandas
Question:
If I have the table below:

| a       | id | year | m2000 | m2001 | m2002 | ... | m2015 |
|---------|----|------|-------|-------|-------|-----|-------|
| "hello" | 1  | 2001 | 0     | 0     | 0     | ... | 0     |
| "hello" | 1  | 2015 | 0     | 0     | 0     | ... | 0     |
| "hello" | 2  | 2002 | 0     | 0     | 0     | ... | 0     |
| "hello" | 2  | 2015 | 0     | 0     | 0     | ... | 0     |

how do I change the dataframe so that it checks the `year` column in each row and sets the matching year column (in this example `m2001` and `m2015`) to 1? Since both of those rows have `id` 1, after collapsing per `id` the new table will look like this:

| a       | id | m2000 | m2001 | m2002 | ... | m2015 |
|---------|----|-------|-------|-------|-----|-------|
| "hello" | 1  | 0     | 1     | 0     | ... | 1     |
| "hello" | 2  | 0     | 0     | 1     | ... | 1     |
Answers:
from pyspark.sql import functions as F

new = df.select('a', 'id', 'year', *[F.when(F.size(F.array_distinct(F.array(F.col('year').astype('string'), F.lit(x[1:])))) == 1, 1).otherwise(0).alias(x) for x in df.columns if x not in ['a', 'id', 'year']])
new.groupBy('a', 'id').agg(*[F.max(x).alias(x) for x in new.columns if x not in ['a', 'id', 'year']]).show()
How it works

Pair each year column's year (its name with the leading "m" stripped) with the row's `year` value in a two-element array:

df.select('a', 'id', 'year', *[F.array(F.col('year').astype('string'), F.lit(x[1:])).alias(x) for x in df.columns if x not in ['a', 'id', 'year']])

Find the distinct elements of the array in each column:

df.select('a', 'id', 'year', *[F.array_distinct(F.array(F.col('year').astype('string'), F.lit(x[1:]))).alias(x) for x in df.columns if x not in ['a', 'id', 'year']])

Find the size of each array in the individual columns:

df.select('a', 'id', 'year', *[F.size(F.array_distinct(F.array(F.col('year').astype('string'), F.lit(x[1:])))).alias(x) for x in df.columns if x not in ['a', 'id', 'year']])

Where the size is not 1, the column's year and the row's year didn't agree, so make the value 0, else 1.

Finally, group by 'a' and 'id' and take the max value in each column.
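The per-cell logic above can be sketched in plain Python. This is only an illustration of the array/array_distinct/size trick, not the Spark code itself; `cell_value` is a hypothetical helper name:

```python
# Minimal pure-Python sketch of the per-cell logic: pair the row's year
# with the column's year, deduplicate, and check whether one value remains.
def cell_value(year, column_name):
    pair = [str(year), column_name[1:]]  # strip the leading "m" from the column name
    return 1 if len(set(pair)) == 1 else 0  # a distinct size of 1 means the two years agreed

print(cell_value(2001, "m2001"))  # 1: the years match
print(cell_value(2001, "m2015"))  # 0: the years differ
```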
You can generate the full range of columns from start and end year values. Pivot by year, fill the nulls, and finally add all-zero columns for the years missing from the data:
from pyspark.sql import functions as F

df = spark.createDataFrame(data=[["hello", 1, 2001], ["hello", 1, 2015], ["hello", 2, 2002], ["hello", 2, 2015]], schema=["a", "id", "year"])
start = 2000
end = 2015
df = df.withColumn("myear", F.concat(F.lit("m"), "year"))
df = df.groupBy("a", "id").pivot("myear").agg((F.count("myear") > 0).cast("integer"))
df = df.fillna({c: 0 for c in df.columns if c not in ["a", "id"]})
# add all-zero columns only for years absent from the pivot, to avoid duplicate column names
df = df.select(["*"] + [F.lit(0).alias(f"m{i}") for i in range(start, end + 1) if f"m{i}" not in df.columns])
[Out]:
+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|    a| id|m2001|m2002|m2015|m2000|m2003|m2004|m2005|m2006|m2007|m2008|m2009|m2010|m2011|m2012|m2013|m2014|
+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|hello|  1|    1|    0|    1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
|hello|  2|    0|    1|    1|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|    0|
+-----+---+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
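Since the question also asks about pandas, here is a sketch of an equivalent approach using `pd.get_dummies` on the same sample data. This is an assumption-laden illustration (the variable names and the 2000–2015 range are taken from the example above), not code from either answer:

```python
import pandas as pd

# Same sample data as the Spark example
df = pd.DataFrame({"a": ["hello", "hello", "hello", "hello"],
                   "id": [1, 1, 2, 2],
                   "year": [2001, 2015, 2002, 2015]})

# One-hot encode the year column into m<year> columns,
# then take the max per (a, id) group to collapse the rows
dummies = pd.get_dummies(df["year"], prefix="m", prefix_sep="", dtype=int)
out = pd.concat([df[["a", "id"]], dummies], axis=1).groupby(["a", "id"], as_index=False).max()

# Add all-zero columns for years absent from the data
for y in range(2000, 2016):
    if f"m{y}" not in out.columns:
        out[f"m{y}"] = 0
```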