window-functions

Create ranking within a set of rows resulting from GROUP BY

Create ranking within a set of rows resulting from GROUP BY Question: I have the following table: CREATE TABLE "results" ( "player" INTEGER, "tournament" INTEGER, "year" INTEGER, "course" INTEGER, "round" INTEGER, "score" INTEGER ); With the following data sample for a single tournament / year / round combination: 1 33 2016 895 1 20 2 33 2016 …

Total answers: 1
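None of the answer text survives the excerpt; below is a minimal sketch of the standard window-function approach (sample values are my own, and SQLite 3.25+ is assumed for window-function support):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE results (
    player INTEGER, tournament INTEGER, year INTEGER,
    course INTEGER, round INTEGER, score INTEGER)""")
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 33, 2016, 895, 1, 20), (2, 33, 2016, 895, 1, 18)],
)
# Rank players within each tournament/year/round group by score
# (ascending: the lowest score gets rank 1).
for row in con.execute("""
    SELECT player, score,
           RANK() OVER (PARTITION BY tournament, year, round
                        ORDER BY score) AS rnk
    FROM results
"""):
    print(row)  # (2, 18, 1) then (1, 20, 2)
```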

PySpark – assigning group id based on group member count

PySpark – assigning group id based on group member count Question: I have a dataframe where I want to assign an id within each window partition, one per 5 rows. That is, the id should increase/change when the partition value changes or when the current group within a partition reaches 5 rows. …

Total answers: 2
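A hedged sketch of one way to do this (the column name grp and the row ordering are my assumptions, not taken from the question): number the rows per partition, bucket them in fives, then dense_rank the (partition, bucket) pairs:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",)] * 7 + [("b",)] * 3, ["grp"])

# A stable per-row sequence so the window ordering is deterministic.
df = df.withColumn("seq", F.monotonically_increasing_id())
w = Window.partitionBy("grp").orderBy("seq")

df = (df.withColumn("rn", F.row_number().over(w))
        .withColumn("bucket", ((F.col("rn") - 1) / 5).cast("int"))
        # One id per (grp, bucket) pair. The unpartitioned window is
        # fine for small data but a single-partition bottleneck at scale.
        .withColumn("id", F.dense_rank().over(Window.orderBy("grp", "bucket"))))
df.show()
```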

Order results by number of other rows with the same column value?

Order results by number of other rows with the same column value? Question: I have a table with the columns id, GENUS, SPECIES. The table may have multiple rows with the same GENUS, but each SPECIES is unique. id, GENUS, SPECIES 0 , Homo, Sapiens 1 , Homo, Habilis 2 , Canis, Familiaris …

Total answers: 1
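A minimal sketch of the usual window-count approach (table name and sample values adapted from the excerpt; SQLite 3.25+ assumed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, genus TEXT, species TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (0, "Homo", "Sapiens"),
    (1, "Homo", "Habilis"),
    (2, "Canis", "Familiaris"),
])
# COUNT(*) OVER (PARTITION BY genus) attaches each genus's row count
# to every row, which can then drive the ORDER BY.
for row in con.execute("""
    SELECT id, genus, species,
           COUNT(*) OVER (PARTITION BY genus) AS genus_count
    FROM t
    ORDER BY genus_count DESC, genus, species
"""):
    print(row)
```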

How to calculate running total per customer for previous 365 days in pandas

How to calculate running total per customer for previous 365 days in pandas Question: I am trying to calculate a running total per customer for the previous 365 days using pandas but my code isn’t working. My intended output would be something like this: date customer daily_total_per_customer rolling_total 2016-07-29 1 100 100 2016-08-01 1 50 …

Total answers: 1
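One plausible pandas approach, sketched under the assumption of at most one row per customer per date (column names follow the excerpt):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2016-07-29", "2016-08-01", "2017-08-15"]),
    "customer": [1, 1, 1],
    "daily_total_per_customer": [100, 50, 25],
})

# Groups must be contiguous and date-sorted so the positional
# assignment below lines up with the flattened rolling result.
df = df.sort_values(["customer", "date"]).reset_index(drop=True)
df["rolling_total"] = (
    df.groupby("customer")
      .rolling("365D", on="date")["daily_total_per_customer"]
      .sum()
      .to_numpy()
)
print(df)  # rolling_total: 100, 150, 25 (the 2016 rows age out of the last window)
```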

How to write pandas’ merge_asof equivalent in PySpark

How to write pandas’ merge_asof equivalent in PySpark Question: I am trying to write the equivalent of pandas’ merge_asof in Spark. Here is a sample example: from datetime import datetime df1 = spark.createDataFrame( [ (datetime(2019,2,3,13,30,0,23),"GOOG",720.5,720.93), (datetime(2019,2,3,13,30,0,23),"MSFT",51.95,51.96), (datetime(2019,2,3,13,30,0,20),"MSFT",51.97,51.98), (datetime(2019,2,3,13,30,0,41),"MSFT",51.99,52.0), (datetime(2019,2,3,13,30,0,48),"GOOG",720.5,720.93), (datetime(2019,2,3,13,30,0,49),"AAPL",97.99,98.01), (datetime(2019,2,3,13,30,0,72),"GOOG",720.5,720.88), (datetime(2019,2,3,13,30,0,75),"MSFT",52.1,52.03) ], ("time", "ticker", "bid", "ask") ) df2 = spark.createDataFrame( [ (datetime(2019,2,3,13,30,0,23),"MSFT",51.95,75), (datetime(2019,2,3,13,30,0,38),"MSFT",51.95,155), (datetime(2019,2,3,13,30,0,48),"GOOG",720.77,100), (datetime(2019,2,3,13,30,0,48),"GOOG",720.92,100), …

Total answers: 1
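The snippet is cut off before df2 is fully defined; a rough sketch of one common translation, treating df1 as quotes and df2 as trades per the excerpt (trade column names beyond time/ticker are assumed, and this join-plus-filter is nowhere near as efficient as pandas' merge_asof):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# df1 = quotes, df2 = trades, as defined in the question's snippet.
quotes = df1.withColumnRenamed("time", "quote_time")
trades = df2.withColumn("trade_id", F.monotonically_increasing_id())

# Pair every trade with all quotes for the same ticker at or before the
# trade time. Note this inner join drops trades with no prior quote,
# whereas merge_asof would keep them with nulls.
joined = (trades.join(quotes, "ticker")
                .where(F.col("quote_time") <= F.col("time")))

# Keep only the latest qualifying quote per trade row.
w = Window.partitionBy("trade_id").orderBy(F.col("quote_time").desc())
asof = (joined.withColumn("rn", F.row_number().over(w))
              .where(F.col("rn") == 1)
              .drop("rn", "quote_time", "trade_id"))
asof.show()
```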

How to write SQL window functions in pandas

How to write SQL window functions in pandas Question: Is there an idiomatic equivalent to SQL’s window functions in Pandas? For example, what’s the most compact way to write the equivalent of this in Pandas? SELECT state_name, state_population, SUM(state_population) OVER() AS national_population FROM population ORDER BY state_name Or this?: SELECT state_name, state_population, region, SUM(state_population) OVER(PARTITION …

Total answers: 2
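A compact sketch of the usual mapping (toy rows are my own): a plain column-wide sum covers OVER (), and groupby().transform covers OVER (PARTITION BY ...):

```python
import pandas as pd

population = pd.DataFrame({
    "state_name": ["Alabama", "Alaska", "Arizona"],
    "state_population": [4_874_747, 739_795, 7_016_270],
    "region": ["South", "West", "West"],
})

# SUM(state_population) OVER ()  ->  a scalar broadcast to every row.
population["national_population"] = population["state_population"].sum()

# SUM(state_population) OVER (PARTITION BY region)  ->  transform keeps
# the group aggregate aligned with the original row index.
population["regional_population"] = (
    population.groupby("region")["state_population"].transform("sum")
)
print(population.sort_values("state_name"))
```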

Spark SQL Row_number() PartitionBy Sort Desc

Spark SQL Row_number() PartitionBy Sort Desc Question: I’ve successfully created a row_number() partitionBy window in Spark, but would like to sort it by descending instead of the default ascending. Here is my working code: from pyspark import HiveContext from pyspark.sql.types import * from pyspark.sql import Row, functions as F from pyspark.sql.window import Window …

Total answers: 6
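A minimal sketch of the usual fix (toy data of my own): order the window by a descending column expression rather than a bare column name:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("a", 3), ("a", 2), ("b", 5)], ["grp", "score"])

# orderBy(F.col(...).desc()) flips the window's default ascending sort,
# so the highest score in each group gets row number 1.
w = Window.partitionBy("grp").orderBy(F.col("score").desc())
df.withColumn("rn", F.row_number().over(w)).show()
```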

Pandas get topmost n records within each group

Pandas get topmost n records within each group Question: Suppose I have a pandas DataFrame like this: df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]}) which looks like: id value 0 1 1 1 1 2 2 1 3 3 2 1 4 2 2 5 2 3 6 2 4 7 3 1 8 4 1 I want to …

Total answers: 6
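A sketch of the shortest common idiom, with the DataFrame reproduced from the excerpt; head(n) keeps the first n rows per group in their existing order, so sort first if "top" means largest by value:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   "value": [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# First two rows of every id group, original order preserved.
print(df.groupby("id").head(2))

# If "top" should mean the largest values instead, sort beforehand.
print(df.sort_values("value", ascending=False).groupby("id").head(2))
```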