Show distinct column values in pyspark dataframe
Question:
With a PySpark dataframe, how do you do the equivalent of Pandas df['col'].unique()?
I want to list all the unique values in a PySpark dataframe column.
Not the SQL way (registerTempTable, then a SQL query for distinct values).
I also don’t need groupby followed by countDistinct; I want to see the distinct VALUES in that column.
Answers:
Let’s assume we’re working with the following representation of data (two columns, k and v, where k contains three entries, two of them unique):
+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+
With a Pandas dataframe:
import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()
This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object).
You asked for a “pyspark dataframe alternative for pandas df['col'].unique()”. Now, given the following Spark dataframe:
s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))
If you want the same result from Spark, i.e. an ndarray, use toPandas():
s_df.toPandas()['k'].unique()
Alternatively, if you don’t need an ndarray specifically and just want a list of the unique values of column k:
s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()
Finally, you can also use a list comprehension as follows:
[i.k for i in s_df.select('k').distinct().collect()]
You can use df.dropDuplicates(['col1','col2']) to keep only the rows that are distinct with respect to the columns listed in the array.
This should help to get distinct values of a column:
df.select('column1').distinct().collect()
Note that .collect() has no built-in limit on how many values it returns, so it may be slow on a large result; use .show() instead, or add .limit(20) before .collect() to keep it manageable.
collect_set can help to get the unique values from a given column of a pyspark.sql.DataFrame:
from pyspark.sql import functions as F
df.select(F.collect_set("column").alias("column")).first()["column"]
In addition to the dropDuplicates option, there is a method with the name we know from pandas, drop_duplicates:
drop_duplicates() is an alias for dropDuplicates().
Example
s_df = sqlContext.createDataFrame([("foo", 1),
("foo", 1),
("bar", 2),
("foo", 3)], ('k', 'v'))
s_df.show()
+---+---+
|  k|  v|
+---+---+
|foo|  1|
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+
Drop by subset
s_df.drop_duplicates(subset = ['k']).show()
+---+---+
|  k|  v|
+---+---+
|bar|  2|
|foo|  1|
+---+---+
s_df.drop_duplicates().show()
+---+---+
|  k|  v|
+---+---+
|bar|  2|
|foo|  3|
|foo|  1|
+---+---+
If you want to select the distinct rows across ALL columns of a DataFrame (df), then:
df.select('*').distinct().show(10,truncate=False)
You could do:
distinct_column = 'somecol'
distinct_column_vals = df.select(distinct_column).distinct().collect()
distinct_column_vals = [v[distinct_column] for v in distinct_column_vals]
Run this first
df.createOrReplaceTempView('df')
Then run
spark.sql("""
SELECT DISTINCT column_name
FROM df
""").show()
If you want to see the distinct values of a specific column in your dataframe, you just need the following code. It shows up to 100 distinct values (if that many are available) for the colname column in the df dataframe, without truncating them.
df.select('colname').distinct().show(100, False)
If you want to do something fancy with the distinct values, you can save them in a variable (note that the result is still a DataFrame, not a local list):
a = df.select('colname').distinct()
Let us suppose that your original DataFrame is called df. Then, you can use:
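from pyspark.sql import functions as F
df1 = df.groupBy('column_1').agg(F.count('column_1').alias('trip_count'))
# show() prints the table and returns None, so its result is not assigned
df1.sort(df1.trip_count.desc()).show()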
Similar to the other answers, but the question seems to want actual values returned rather than Row objects.
The ideal one-liner is
df.select('column').distinct().toPandas()['column'].to_list()
assuming that the result of .toPandas(), which is collected onto the driver, isn’t going to be too big for memory.
I recommend running df.select('column').distinct().count() first to estimate the size and make sure it’s not too large.