How ignore specific range of rows in a dataframe

Question:

I have a dataframe with 1000000 rows and I want to ignore 8000 rows in first 40000 rows and then ignore 80000 rows in next 40000 rows. How can I achieve this ?

As an example:
Drop 1 to 8000, 40001 to 48000, 80001 to 88000 rows and so on.

Asked By: user_scc

||

Answers:

Approach

  • Adapted Numpy slicing function: Dynamically create slice indices np.r an answer that uses a mask rather than np.r_ so can be done dynamically
  • Two solutions
    • Looping solution (to illustrate method)
    • Vectorized solution (for performance) using
      • [numpy.ma.masked_where(https://numpy.org/doc/stable/reference/generated/numpy.ma.masked_where.html) to generate mask array and
      • numpy.ma.getmask to get mask
  • Note: Dataframes are 0-index, so first 40,000 rows has indexing 0 to 39,999 rather than 1 to 40,000

Looping Solution

def drop_rows(df, chunksize, separation):
    '''
        Drops rows in dataframe
        
        rows 0               to    chunksizse
        rows 1*separation    to    1*separation + chunksize
        rows 2*separation    to    2*separation + chunksizse
                 ...
    '''
    # Create mask which is True for rows we want to drop
    n = len(df.index)                     # number of rows in dataframe
    mask = np.zeros(n, dtype=bool)
    for start in np.arange(0, n, separation):
        stop = start + chunksize
        mask[start:stop] = True
    
    return df.drop(df[mask].index)             # drop rows based upon indexes where mask is true

Vectorized Solution

def drop_rows_vect(df, chunksize, separation):
    '''
        Drops rows in dataframe
        
        rows 0               to    chunksizse
        rows 1*separation    to    1*separation + chunksize
        rows 2*separation    to    2*separation + chunksizse
                 ...
    '''
    # Create mask which is True for rows we want to drop
    mask = np.ma.getmaskarray(np.ma.masked_where(df.index % separation < chunksize, df.index))
    
    return df.drop(df[mask].index)   # drop rows based upon indexes where mask is true

Test

Create random dataframe with two columns

data = np.random.randint(100, size=(40, 2))
df = pd.DataFrame(data, columns = ['A', 'B'])

Drop rows using the two methods

# Drop first 8 rows in chunks with a separation of 10
df_looping = drop_rows(df, chunksize = 8, separation = 10)
df_vect = drop_rows_vect(df, chunksize = 8, separation = 10)

Two methods produces the same result

print(df_looping.equals(df_vect )
# Output: True

Show result

print(df_vect)
# Output

    A   B
2   69  48
3   61  45
4   15  29
7   30  42
8   54  46
9   22  78
Answered By: DarrylG
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.