How to remove duplicates in a row that is separated by a space | Pandas

Question:

I just finished cleaning my data and I noticed that there are columns that have duplicate values in the rows that are separated by a space. I was wondering what is the best way to remove the space after the first value and then all values after it.

There are some columns that have multiple duplicates such as 42.35 42.25 23.12 in the same cell.

The data is large with 1500 columns and around 4000 rows so there is no way I could manually fix the issue.

Thanks!

Here is an example

    PAG
0   54.36
1   50.3
2   46.81
3   47.71
4   42.35 42.35          <------ This is one of the many rows (cell) that have this duplicate. 
5   43.91
6   43.54

Here is what the data looks like and the dtypes
Dtype

MthCalDt      int64
A           float64
AA          float64
AAL         float64
AAON        float64
             ...   
ZD          float64
ZEN         float64
ZION        float64
ZTS         float64
ZWS         float64
Length: 1583, dtype: object
cleaned_data = pd.read_csv("Wrds_Datawrds_data_clean.csv")
cleaned_data.head()

    MthCalDt    A         AA    AAL     AAON    AAP     AAPL    AAT     AAWW    AB  ...     YUMC    YY         Z      ZBH     ZBRA    ZD    ZEN     ZION    ZTS     ZWS
0   20170131    48.97   36.45   44.25   33.95   164.24  121.35  42.93   52.75   23.35   ... 27.48   41.08   35.38   118.33  83.67   83.81   23.93   42.19   54.94   22.09
1   20170228    51.30   34.59   46.36   33.65   156.61  136.99  44.00   56.85   23.70   ... 26.59   44.29   33.94   117.08  90.71   81.42   27.23   44.90   53.31   22.17
2   20170331    52.87   34.40   42.30   35.35   148.26  143.66  41.84   55.45   22.85   ... 27.20   46.11   33.67   122.11  91.25   83.91   28.04   42.00   53.37   23.08
3   20170428    55.05   33.73   42.62   36.65   142.14  143.65  42.83   58.00   22.90   ... 34.12   48.97   39.00   119.65  94.27   90.24   28.75   40.03   56.11   24.40
4   20170531    60.34   32.94   48.41   36.18   133.63  152.76  39.05   48.70   22.55   ... 38.41   58.34   43.52   119.21  104.34  84.62   25.98   40.07   62.28   22.80

What I tried So Far Comes up with the error could not convert a string into float. The error is caused by the duplicate with a space in one of the rows (Cell)

cleaned_data = pd.read_csv("Wrds_Datawrds_data_clean.csv")
cleaned_data.astype(float)
cleaned_data.applymap(lambda x: x.split(" ")[0])


error: 
could not convert string to float: '126.49 126.49'    <--- This is referencing a duplicate cell that is considered a string since there is a space. 

Here is the newer code with regex that tries to replace cells with spaces but there are still cells with duplicates.

cleaned_data = pd.read_csv("Wrds_Datawrds_data_clean.csv")
columns = cleaned_data.columns.copy().drop('MthCalDt')
cleaned_data[columns] = cleaned_data[columns].replace(r"^(d+.d+) .*", r"1", regex=True).astype(float)
cleaned_data



Error: 

Cell In [114], line 3
      1 cleaned_data = pd.read_csv("Wrds_Datawrds_data_clean.csv")
      2 columns = cleaned_data.columns.copy().drop('MthCalDt')
----> 3 cleaned_data[columns] = cleaned_data[columns].replace(r"^(d+.d+) .*", r"1", regex=True).astype(float)
      4 cleaned_data


169     # Explicit copy, or required since NumPy can't view from / to object.
--> 170     return arr.astype(dtype, copy=True)
    172 return arr.astype(dtype, copy=copy)

ValueError: could not convert string to float: '60 55.66'

Here is the Step By Step Guide for what I did when cleaning the data to show if I made a mistake when cleaning the data.

Step 1. Showing what the original Data looked like.

df1 = pd.read_csv("Wrds_Datawrds_data_raw.csv")

output:
    Ticker  MthCalDt    MthPrc
0   JJSF    20170131    127.57
1   JJSF    20170228    133.8
2   JJSF    20170331    135.56
3   JJSF    20170428    134.58
4   JJSF    20170531    130.1

Step 2: Making the Tickers into columns and MthPrc as the Values indexed in MthCalDt

df1 = pd.read_csv("Wrds_Datawrds_data_raw.csv")
df2 = df1.pivot_table(index='MthCalDt', columns="Ticker", values="MthPrc", aggfunc=lambda x: ' '.join(x.dropna()))
df2.to_csv("Wrds_Datawrds_data_clean.csv")
Output:

Data Output

Asked By: Trevor Seibert

||

Answers:

  • applymap: apply in every cell
  • x.split(" ")[0]: choose the first value split by a space
df.applymap(lambda x: float(str(x).split(" ")[0]))
Answered By: Lowin Li

You could replace with a regex that looks for digits followed by a space.

df.replace(r"^(d+.d+) .*", r"1", regex=True, inplace=True)

You could also take the opportunity to change the values to floats while you are at it. In your example, you have many columns that should be floats and one that looks like it should stay like it is. So, just to keep life complicated, I’ll drop the columns that should not be visited in the conversion.

columns = df.columns.copy().drop('MthCalDt')
df[columns] = df[columns].replace(r"^(d+.d+) .*", r"1", regex=True).astype(float)
Answered By: tdelaney
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.