How to create unique index in Dask DataFrame?

Question:

Imagine I have a Dask DataFrame created with read_csv or in some other way.

How can I create a unique index for the Dask DataFrame?

Note:

reset_index builds a monotonically ascending index within each partition. That means (0, 1, 2, 3, 4, 5, …) for partition 1, (0, 1, 2, 3, 4, 5, …) for partition 2, (0, 1, 2, 3, 4, 5, …) for partition 3, and so on.

I would like a unique index for every row in the dataframe (across all partitions).
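
To illustrate what I mean with a small sketch (toy data, three partitions):

import dask.dataframe as dd
import pandas as pd

# six rows split into three partitions of two rows each
pdf = pd.DataFrame({'x': range(6)})
ddf = dd.from_pandas(pdf, npartitions=3).reset_index(drop=True)

# each partition restarts its index at 0, so the index values repeat
print(ddf.compute().index.tolist())  # -> [0, 1, 0, 1, 0, 1]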

Asked By: Spar


Answers:

This is my approach: a function that builds a unique index with map_partitions and truly random numbers, since reset_index alone only creates a monotonically ascending index within each partition.

import sys
import random

from dask.distributed import Client

client = Client()

def createDDF_u_idx(ddf):

    def create_u_idx(df):
        # draw a random prefix per partition so indices cannot collide across partitions
        rng = random.SystemRandom()
        p_id = str(rng.randint(0, sys.maxsize))

        # prefix + separator + row position within the partition
        df['idx'] = [p_id + 'a' + str(x) for x in range(df.index.size)]

        return df

    # keep the original dtypes and declare the new 'idx' column in the meta
    cols_meta = {c: str(ddf[c].dtype) for c in ddf.columns}
    ddf = ddf.map_partitions(create_u_idx, meta={**cols_meta, 'idx': 'str'})
    ddf = client.persist(ddf)  # compute up to here, keep results in memory
    ddf = ddf.set_index('idx')

    return ddf
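
For example, a hypothetical usage sketch (the CSV path is just a placeholder):

import dask.dataframe as dd

ddf = dd.read_csv('some_data.csv')  # placeholder path; any Dask DataFrame works here
ddf = createDDF_u_idx(ddf)          # the index is now a unique random string per row
print(ddf.head())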
Answered By: Spar

The accepted answer creates a random index, while the approach below creates a monotonically increasing index:

import dask.dataframe as dd
import pandas as pd

# save some data into unindexed csv
num_rows = 15
df = pd.DataFrame(range(num_rows), columns=['x'])
df.to_csv('dask_test.csv', index=False)

# read from csv
ddf = dd.read_csv('dask_test.csv', blocksize=10)

# assume that the rows are already ordered (so no sorting is needed);
# then each partition's index can be shifted by the cumulative lengths
# of the preceding partitions
cumlens = ddf.map_partitions(len).compute().cumsum()

# processing is done partition by partition, so collect the shifted
# partitions individually (each one is a single-partition Dask DataFrame)
new_partitions = [ddf.partitions[0]]
for npart, partition in enumerate(ddf.partitions[1:].partitions):
    # offset the index by the total number of rows in all earlier partitions
    partition.index = partition.index + cumlens[npart]
    new_partitions.append(partition)

# this is our new ddf
ddf = dd.concat(new_partitions)

This code is based on an answer to a different question: Process dask dataframe by chunks of rows
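
The same idea can also be sketched without the explicit loop over partitions, using map_partitions with its partition_info keyword (only a sketch; add_global_index is a made-up helper name):

import numpy as np

def add_global_index(ddf):
    # lengths of all partitions, computed once
    lens = ddf.map_partitions(len).compute()
    # starting offset of each partition = rows in all preceding partitions
    offsets = np.concatenate([[0], np.asarray(lens.cumsum())[:-1]])

    def reindex(df, partition_info=None):
        # dask passes partition_info at execution time; fall back to 0
        # (e.g. during meta inference) if it is absent
        number = partition_info['number'] if partition_info else 0
        df = df.copy()
        df.index = np.arange(offsets[number], offsets[number] + len(df))
        return df

    return ddf.map_partitions(reindex, meta=ddf)

If the original frame had known divisions, they will no longer match the new index, so they may need to be cleared with clear_divisions() or set explicitly afterwards.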

Answered By: SultanOrazbayev