How to create unique index in Dask DataFrame?
Question:
Imagine I have a Dask DataFrame, created from read_csv or in some other way. How can I make a unique index for the Dask DataFrame?
Note: reset_index builds a monotonically ascending index within each partition. That means (0, 1, 2, 3, 4, 5, …) for Partition 1, (0, 1, 2, 3, 4, 5, …) for Partition 2, (0, 1, 2, 3, 4, 5, …) for Partition 3, and so on.
I would like a unique index for every row in the dataframe (across all partitions).
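To see the problem concretely, here is a small pandas-only sketch (plain pandas DataFrames standing in for Dask partitions) showing how a per-partition reset_index produces duplicate labels:

```python
import pandas as pd

# Three pandas DataFrames stand in for three Dask partitions.
parts = [pd.DataFrame({"x": range(i * 5, i * 5 + 5)}) for i in range(3)]

# reset_index restarts at 0 inside every "partition", so the
# concatenated frame repeats the labels 0..4 three times.
combined = pd.concat(p.reset_index(drop=True) for p in parts)

print(combined.index.is_unique)  # False
```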
Answers:
This is my approach: a function that builds a unique index with map_partitions and cryptographically strong random numbers, since a plain reset_index only creates a monotonically ascending index within each partition.
import sys
import random

from dask.distributed import Client

client = Client()

def createDDF_u_idx(ddf):
    def create_u_idx(df):
        # Give each partition a (very likely unique) random prefix,
        # then append the row number within the partition.
        rng = random.SystemRandom()
        p_id = str(rng.randint(0, sys.maxsize))
        df['idx'] = [p_id + 'a' + str(x) for x in range(df.index.size)]
        return df

    cols_meta = {c: str(ddf[c].dtype) for c in ddf.columns}
    ddf = ddf.map_partitions(create_u_idx, meta={**cols_meta, 'idx': 'str'})
    ddf = client.persist(ddf)  # compute up to here, keep results in memory
    ddf = ddf.set_index('idx')
    return ddf
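The uniqueness of this scheme rests on each partition drawing a different random prefix; with SystemRandom over 0..sys.maxsize a collision is astronomically unlikely. Here is a small stdlib-only sketch of the label format (the helper name is made up for illustration):

```python
import sys
import random

def partition_labels(n_rows):
    # One random prefix per "partition", with the row number appended
    # after an 'a' separator, as in create_u_idx above.
    p_id = str(random.SystemRandom().randint(0, sys.maxsize))
    return [p_id + "a" + str(i) for i in range(n_rows)]

labels = partition_labels(5) + partition_labels(5)
print(len(set(labels)))  # 10 unless the two random prefixes collide
```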
The accepted answer creates a random index, while the approach below creates a monotonically increasing index:
import dask.dataframe as dd
import pandas as pd

# save some data into an unindexed csv
num_rows = 15
df = pd.DataFrame(range(num_rows), columns=['x'])
df.to_csv('dask_test.csv', index=False)

# read it back from csv
ddf = dd.read_csv('dask_test.csv', blocksize=10)

# assume the rows are already ordered (so no sorting is needed);
# then each partition's index can be shifted by the cumulative
# lengths of the preceding partitions
cumlens = ddf.map_partitions(len).compute().cumsum()

# since processing is done on a partition-by-partition basis,
# collect the re-indexed partitions individually
new_partitions = [ddf.partitions[0]]
for npart, partition in enumerate(ddf.partitions[1:].partitions):
    partition.index = partition.index + cumlens[npart]
    new_partitions.append(partition)

# this is our new ddf
ddf = dd.concat(new_partitions)
This code is based on an answer to a different question: Process dask dataframe by chunks of rows
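The same offset arithmetic can be checked with plain pandas, using a list of DataFrames in place of Dask partitions:

```python
import pandas as pd

# Plain pandas chunks stand in for Dask partitions of lengths 4, 5, 6.
chunks = [
    pd.DataFrame({"x": range(0, 4)}),
    pd.DataFrame({"x": range(4, 9)}),
    pd.DataFrame({"x": range(9, 15)}),
]

# The cumulative lengths say how far to shift each later chunk.
cumlens = pd.Series([len(c) for c in chunks]).cumsum()
for i, chunk in enumerate(chunks[1:]):
    chunk.index = chunk.index + cumlens[i]

combined = pd.concat(chunks)
print(combined.index.tolist())  # [0, 1, ..., 14]: unique and monotonic
```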