Sum of values per column in Spark

Question:

I could use some help with Spark.

What I have been doing is converting the dataframe I get in Spark to a Pandas dataframe (with the Spark2Pandas command) and then working on it as follows:

Basically, I have a Pandas dataframe with 100 columns, each of them called "FirstP XX SecondP", where XX is the column number, going from 00 to 99.
First of all, I remove the "FirstP " and " SecondP" parts of each column name (so I am left with just the number).
After that, with a for loop I create a new column for each original column, containing the sum of all of that column's values, and then I drop the original column that held the data.

The code I was using is as follows:

data.columns = data.columns.str.replace('FirstP ', '')
data.columns = data.columns.str.replace(' SecondP', '')
data = data.dropna(how='all')  # Remove rows where every value is NaN
data = data.astype('float')

for column in data.columns:
    column_name = f'New {column}'
    data[column_name] = data[column].sum()
    data[column_name].fillna(method='ffill', inplace=True)
    data[column_name].fillna(value=0, inplace=True)
    data = data.drop([column], axis=1)

My problem is that converting the dataframe from Spark to Pandas, using the toDF or Spark2Pandas commands, takes a long time because the dataframe is huge.
So I would like to do the same work directly in Spark, and then only convert the resulting dataframe, containing the column names and the sum per column, to Pandas.

That is, instead of converting a Spark dataframe with 100 columns and a huge number of rows and then working on it, I would like to do the work directly in Spark and then convert a dataframe that still has 100 columns but just 1 row.

My problem is that I'm not really familiar with Spark, and although I have tried, I can't make it work.

I tried changing the names of the columns using this:

def change_names(x):
    for column in x.columns:
        column = column.replace('FirstP ', '')
        column = column.replace(' SecondP', '')
    return x

spark_change_name = F.udf(change_names)
df1 = spark_change_name(res)

and also this:

res = res.select(
    [F.col(col).alias(col.replace('FirstP ', '')) for col in res.columns])

None of them seem to work.

Could any of you please give me a hand with that?

Asked By: Sara.SP92


Answers:

If only one row with totals is required, each original column can be replaced with sum(column), aliased with the number obtained by splitting the original name on spaces. In Scala:

import org.apache.spark.sql.functions._
import spark.implicits._

// original data
val df = Seq(
  (1, 2),
  (3, 4)
).toDF("FirstP 00 SecondP", "FirstP 01 SecondP")

// one sum(...) expression per column, aliased with the numeric part of the name
val totalColumns = df.columns.map(colName => sum(colName).alias(colName.split(" ")(1)))

val result = df.select(totalColumns: _*)

Result:

+---+---+
|00 |01 |
+---+---+
|4  |6  |
+---+---+
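
Since the question uses PySpark, a rough Python equivalent of the same idea might look like the sketch below (assuming `res` is the original Spark DataFrame from the question and `spark` is an active SparkSession; the variable names are only illustrative):

from pyspark.sql import functions as F

# Build one sum(...) expression per column, aliased with just the numeric part
# of the original "FirstP XX SecondP" name.
total_columns = [
    F.sum(F.col(c)).alias(c.split(" ")[1]) for c in res.columns
]

# Aggregate down to a single-row Spark DataFrame, then convert only that
# small result to Pandas.
sums_df = res.select(total_columns)
pandas_sums = sums_df.toPandas()

Because the aggregation happens in Spark, the toPandas step only has to move a single row of 100 sums, so it stays cheap no matter how many rows the original dataframe has.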
Answered By: pasha701