I am using pyspark to read a parquet file like below:
my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')
Then when I do
my_df.take(5), it shows
[Row(...)] instead of a table format like we get with a pandas data frame.
Is it possible to display the data frame in a table format like a pandas data frame? Thanks!
Yes: call the toPandas method on your dataframe and you’ll get an actual pandas dataframe!
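For example, a minimal sketch using the my_df from the question (the limit() call is optional; it just keeps the transfer to the driver small):

    # toPandas() collects the data to the driver, so trim the dataframe first
    pdf = my_df.limit(5).toPandas()
    print(pdf)  # pandas prints it as an aligned table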
The show method does what you’re looking for.
For example, given the following dataframe of 3 rows, I can print just the first two rows like this:
    df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
    df.show(n=2)

    +---+---+
    |  k|  v|
    +---+---+
    |foo|  1|
    |bar|  2|
    +---+---+
    only showing top 2 rows
As mentioned by @Brent in the comment of @maxymoo’s answer, you can try

    df.limit(10).toPandas()

to get a prettier table in Jupyter. But this can take some time to run if you are not caching the spark dataframe. Also,
.limit() will not keep the order of the original spark dataframe.
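As a sketch of that caching tip, assuming the dataframe will be displayed more than once (note that cache() is lazy, so the first action still pays the full cost):

    df.cache()               # keep the dataframe in memory after the first action
    df.limit(10).toPandas()  # subsequent conversions reuse the cached data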
Let’s say we have the following Spark DataFrame:
    df = sqlContext.createDataFrame(
        [
            (1, "Mark", "Brown"),
            (2, "Tom", "Anderson"),
            (3, "Joshua", "Peterson")
        ],
        ('id', 'firstName', 'lastName')
    )
There are typically three ways to print the content of the dataframe:
Print Spark DataFrame
The most common way is to use show():

    >>> df.show()
    +---+---------+--------+
    | id|firstName|lastName|
    +---+---------+--------+
    |  1|     Mark|   Brown|
    |  2|      Tom|Anderson|
    |  3|   Joshua|Peterson|
    +---+---------+--------+
Print Spark DataFrame vertically
Say that you have a fairly large number of columns and your dataframe doesn’t fit on the screen. You can print the rows vertically. For example, the following command will print the top two rows, vertically, without any truncation.
    >>> df.show(n=2, truncate=False, vertical=True)
    -RECORD 0-------------
     id        | 1
     firstName | Mark
     lastName  | Brown
    -RECORD 1-------------
     id        | 2
     firstName | Tom
     lastName  | Anderson
    only showing top 2 rows
Convert to Pandas and print Pandas DataFrame
    >>> df_pd = df.toPandas()
    >>> print(df_pd)
       id firstName  lastName
    0   1      Mark     Brown
    1   2       Tom  Anderson
    2   3    Joshua  Peterson
Note that this is not recommended when you have to deal with fairly large dataframes, as Pandas needs to load all the data into memory. If this is the case, the following configuration will help when converting a large spark dataframe to a pandas one:

    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

For more details you can refer to my blog post Speeding up the conversion between PySpark and Pandas DataFrames.
If you are using Jupyter, this is what worked for me:
    dsp = users

and then, in a separate cell:

    %%display
    dsp
This shows a well-formatted HTML table; you can also draw some simple charts on it straight away. For more documentation of %%display, type %%help.
Maybe something like this is a tad more elegant:

    df.display()
    # OR
    df.select('column1').display()

(Note that display() is available in Databricks notebooks, not in plain PySpark.)
By default, the show() function prints 20 records of the DataFrame. You can define the number of rows you want to print by providing an argument to show(). You never know the total number of rows a DataFrame will have, so you can pass df.count() as the argument to show(), which will print all records of the DataFrame.

    df.show()            # prints 20 records by default
    df.show(30)          # prints 30 records, per the argument
    df.show(df.count())  # gets the total row count and passes it as the argument