Drop rows in a pandas DataFrame up to a certain value
Question:
I’m currently working with a pandas data frame, with approximately 80000 rows, like the following one:
artist | date |
---|---|
Drake | 2014-10-12 |
Kendrick Lamar | 2014-10-12 |
Ed Sheeran | 2014-10-12 |
Maroon 5 | 2014-10-12 |
Rihanna | 2014-10-19 |
Foo Fighters | 2014-10-19 |
Bad Bunny | 2014-10-19 |
Eminem | 2014-10-19 |
Drake | 2014-10-26 |
Eminem | 2014-10-26 |
Taylor Swift | 2014-10-26 |
Kendrick Lamar | 2014-10-26 |
Rihanna | 2014-11-02 |
Ed Sheeran | 2014-11-02 |
Kanye West | 2014-11-02 |
Lime Cordiale | 2014-11-02 |
I only want to keep the rows that have a date greater than or equal to 2014-10-26. The result should be something like the following table:
artist | date |
---|---|
Drake | 2014-10-26 |
Eminem | 2014-10-26 |
Taylor Swift | 2014-10-26 |
Kendrick Lamar | 2014-10-26 |
Rihanna | 2014-11-02 |
Ed Sheeran | 2014-11-02 |
Kanye West | 2014-11-02 |
Lime Cordiale | 2014-11-02 |
I tried using the pandas .drop() method, as in the following lines:
dataset = pd.read_csv("charts.csv")
dataset = pd.DataFrame(dataset)
dataset = dataset.drop(dataset.loc[dataset['date'] <= "2014-10-19", :])
but after executing I get the following error:
KeyError: "['track_id', 'name', 'country', 'date', 'position', 'streams', 'artists', 'artist_genres', 'duration', 'explicit'] not found in axis"
Answers:
I'm not sure exactly which error you hit; it helps to include the full error log.
Anyway, you can drop rows by index: filter the data to get the index of the rows you want to remove, then pass that index to drop().
indexx = dataset[ dataset['date'] <= "2014-10-19" ].index
dataset.drop(indexx , inplace=True)
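As a rough, self-contained sketch of the index-based approach above: the four-row frame below is hypothetical sample data standing in for charts.csv, and the name rows_to_drop is only illustrative. It compares the dates as strings, which should behave correctly here only because they are in ISO YYYY-MM-DD format.

```python
import pandas as pd

# Hypothetical sample data standing in for charts.csv
dataset = pd.DataFrame({
    "artist": ["Drake", "Rihanna", "Eminem", "Kanye West"],
    "date": ["2014-10-12", "2014-10-19", "2014-10-26", "2014-11-02"],
})

# Index labels of the rows to remove
# (ISO-formatted date strings sort the same way as the dates they represent)
rows_to_drop = dataset[dataset["date"] <= "2014-10-19"].index

# drop() expects labels, so pass the index rather than the filtered DataFrame
dataset = dataset.drop(rows_to_drop)

print(dataset)
```

Assigning the result back instead of using inplace=True is a style choice; both leave the same rows.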
You could use:
last_date_to_drop = pd.to_datetime("2014-10-19")
dataset["date"] = pd.to_datetime(dataset["date"])
dataset = dataset.loc[dataset["date"].gt(last_date_to_drop)].copy()
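You don’t need to sort or drop. Just subset the DataFrame and copy, as above.
Also, drop() does not do what you expect here. It does not drop rows by value; it drops by column or index labels. When it is given a DataFrame, it looks up that DataFrame’s column names as row labels, which is why the KeyError lists every column.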
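To make the difference concrete, here is a minimal sketch on hypothetical sample data (the real file has more columns). It first reproduces the failing call from the question, then filters with a boolean mask using >= and the cutoff from the question; the names cutoff, filtered and drop_failed are illustrative only.

```python
import pandas as pd

# Hypothetical sample data; only the two columns shown in the question
dataset = pd.DataFrame({
    "artist": ["Drake", "Rihanna", "Eminem", "Kanye West"],
    "date": ["2014-10-12", "2014-10-19", "2014-10-26", "2014-11-02"],
})

# Passing a DataFrame to drop() makes pandas treat its column names
# ('artist', 'date') as row labels, which do not exist in the index
try:
    dataset.drop(dataset.loc[dataset["date"] <= "2014-10-19", :])
    drop_failed = False
except KeyError as err:
    drop_failed = True
    print(f"KeyError: {err}")

# Boolean mask instead: keep rows dated on or after the cutoff
dataset["date"] = pd.to_datetime(dataset["date"])
cutoff = pd.Timestamp("2014-10-26")
filtered = dataset.loc[dataset["date"] >= cutoff].copy()

print(filtered)
```

Converting the column with pd.to_datetime first means the comparison is done on real timestamps, so it should still hold if the dates are ever stored in a format that does not sort correctly as text.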
import pandas as pd
df = pd.DataFrame({'artist': ['Drake', 'Kendrick Lamar', 'Kendrick Lamar', 'Drake'],
                   'date': ['2014-10-12', '2014-10-12', '2014-10-26', '2014-10-26']})
# Be cautious: sort first
# Sort by date, then keep only the latest row for each artist
df = (df.sort_values(by='date', key=lambda t: pd.to_datetime(t, format='%Y-%m-%d'))
        .drop_duplicates(subset=['artist'], keep='last'))
print(df)
#            artist        date
# 2  Kendrick Lamar  2014-10-26
# 3           Drake  2014-10-26