Add new rows to a PySpark DataFrame
Question:
I am very new to PySpark but familiar with pandas.
I have a PySpark DataFrame:
from pyspark.sql import SparkSession

# instantiate Spark
spark = SparkSession.builder.getOrCreate()
# make some test data
columns = ['id', 'dogs', 'cats']
vals = [
(1, 2, 0),
(2, 0, 1)
]
# create DataFrame
df = spark.createDataFrame(vals, columns)
I want to add a new row (4, 5, 7) so that it outputs:
df.show()
+---+----+----+
| id|dogs|cats|
+---+----+----+
| 1| 2| 0|
| 2| 0| 1|
| 4| 5| 7|
+---+----+----+
Answers:
From something I did, using union, showing a partial block of code (in Scala) – you will of course need to adapt it to your own situation:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// start from an empty DataFrame with a single string column
val dummySchema = StructType(
  StructField("phrase", StringType, true) :: Nil)
var dfPostsNGrams2 = spark.createDataFrame(sc.emptyRDD[Row], dummySchema)
// union the exploded contents of each n-gram column onto the accumulator
for (i <- i_grams_Cols) {
  val nameCol = col(i)
  dfPostsNGrams2 = dfPostsNGrams2.union(dfPostsNGrams.select(explode(nameCol).as("phrase")).toDF)
}
A union of the DataFrame with another DataFrame is the way to go.
As thebluephantom has already said, union is the way to go. I'm just answering your question to give you a PySpark example:
from pyspark.sql import SparkSession

# if not already created automatically, instantiate the SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['id', 'dogs', 'cats']
vals = [(1, 2, 0), (2, 0, 1)]
df = spark.createDataFrame(vals, columns)
# build a one-row DataFrame with the same columns, then union it on
newRow = spark.createDataFrame([(4, 5, 7)], columns)
appended = df.union(newRow)
appended.show()
Please also have a look at the Databricks FAQ: https://kb.databricks.com/data/append-a-row-to-rdd-or-dataframe.html
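Note that union() matches columns by position, not by name, so if the appended DataFrame lists its columns in a different order, values can silently land in the wrong columns. A minimal sketch (my addition, assuming Spark 2.3+ where unionByName is available):
# unionByName aligns columns by name rather than by position
newRow = spark.createDataFrame([(7, 5, 4)], ['cats', 'dogs', 'id'])
appended = df.unionByName(newRow)
appended.show()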
Another alternative would be to use the partitioned Parquet format, adding an extra Parquet file for each DataFrame you want to append. This way you can create (hundreds, thousands, millions of) Parquet files, and Spark will just read them all as a union when you read the directory later.
This example uses pyarrow.
Note I also show how to write a single Parquet file (example.parquet) that isn't partitioned, if you already know where you want to put the single file.
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
headers = ['A', 'B', 'C']
row1 = ['a1', 'b1', 'c1']
row2 = ['a2', 'b2', 'c2']
df1 = pd.DataFrame([row1], columns=headers)
df2 = pd.DataFrame([row2], columns=headers)
# DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent
df3 = pd.concat([df1, df2], ignore_index=True)
table = pa.Table.from_pandas(df3)
pq.write_table(table, 'example.parquet', flavor='spark')
pq.write_to_dataset(table, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')
# Add a new partition (B=b3/C=c3)
row3 = ['a3', 'b3', 'c3']
df4 = pd.DataFrame([row3], columns=headers)
table2 = pa.Table.from_pandas(df4)
pq.write_to_dataset(table2, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')
# Add another parquet file to the B=b2/C=c2 partition
# Note this does not overwrite existing partitions, it just appends a new .parquet file.
# If files already exist, then you will get a union result of the two (or multiple) files when you read the partition
row5 = ['a5', 'b2', 'c2']
df5 = pd.DataFrame([row5], columns=headers)
table3 = pa.Table.from_pandas(df5)
pq.write_to_dataset(table3, root_path="test_part_file", partition_cols=['B', 'C'], flavor='spark')
Reading the output afterwards:
from pyspark.sql import SparkSession
spark = (SparkSession
.builder
.appName("testing parquet read")
.getOrCreate())
df_spark = spark.read.parquet('test_part_file')
df_spark.show(25, False)
You should see something like this:
+---+---+---+
|A |B |C |
+---+---+---+
|a5 |b2 |c2 |
|a2 |b2 |c2 |
|a1 |b1 |c1 |
|a3 |b3 |c3 |
+---+---+---+
If you run the same thing end to end again, you should see duplicates like this (since all of the previous Parquet files are still there, Spark unions them):
+---+---+---+
|A |B |C |
+---+---+---+
|a2 |b2 |c2 |
|a5 |b2 |c2 |
|a5 |b2 |c2 |
|a2 |b2 |c2 |
|a1 |b1 |c1 |
|a1 |b1 |c1 |
|a3 |b3 |c3 |
|a3 |b3 |c3 |
+---+---+---+
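If those duplicates are unwanted, one option (my addition, not part of the original answer) is to deduplicate at read time:
# drop rows that are identical across all columns after reading the dataset
df_spark = spark.read.parquet('test_part_file').dropDuplicates()
df_spark.show(25, False)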
To append a row to a DataFrame, one can also use the collect() method. collect() converts the DataFrame to a Python list; you can append data to that list directly and then convert the list back to a DataFrame.
My Spark DataFrame, called df, looks like this:
+---+----+------+
| id|name|gender|
+---+----+------+
| 1| A| M|
| 2| B| F|
| 3| C| M|
+---+----+------+
Convert this DataFrame to a list using collect():
collect_df = df.collect()
print(collect_df)
[Row(id=1, name='A', gender='M'),
Row(id=2, name='B', gender='F'),
Row(id=3, name='C', gender='M')]
Append the new row to this list:
collect_df.append({"id" : 5, "name" : "E", "gender" : "F"})
print(collect_df)
[Row(id=1, name='A', gender='M'),
Row(id=2, name='B', gender='F'),
Row(id=3, name='C', gender='M'),
{'id': 5, 'name': 'E', 'gender': 'F'}]
Convert this list back to a DataFrame:
added_row_df = spark.createDataFrame(collect_df)
added_row_df.show()
+---+----+------+
| id|name|gender|
+---+----+------+
| 1| A| M|
| 2| B| F|
| 3| C| M|
| 5| E| F|
+---+----+------+
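One caveat worth adding: collect() pulls the entire DataFrame onto the driver, so this approach only makes sense for small DataFrames. A small variation (my addition) that appends a Row instead of a plain dict, keeping the collected list homogeneous:
from pyspark.sql import Row

collect_df = df.collect()                          # list of Row objects on the driver
collect_df.append(Row(id=5, name='E', gender='F')) # same shape as the existing Rows
added_row_df = spark.createDataFrame(collect_df)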