Pandas adding new columns performance issue
Question:
I'm trying to add two new columns that extract the day and the month from a full date. My data set currently has about 1.2 million records and is expected to exceed 20 million by the end of the year, and adding the columns takes a very long time. What is the best practice here?
I'm using SQLite, and here is my code:
import sqlite3
import numpy as np
import pandas as pd

cnx = sqlite3.connect('data/firstline.db')
df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)
df['day'] = pd.DatetimeIndex(df['Open_Date']).day
df['month'] = pd.DatetimeIndex(df['Open_Date']).month
df['Product_Name'].replace('', np.nan, inplace=True)
df['Product_Name'].fillna("N", inplace = True)
df['product_Type'].replace('', np.nan, inplace=True)
df['product_Type'].fillna("A", inplace = True)
df['full_path'] = df['Type'] + "/" + df['Area'] + "/" + df['Sub_Area'] + "/" + df['product_Type'] + "/" + df['Product_Name']
Answers:
If there are no missing values in the original DataFrame, the solution can be simplified a bit.
Also, I think inplace is not good practice, check this and this.
Combining all the columns with + is a nice solution and one of the fastest, check this.
df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)
df['Open_Date'] = pd.to_datetime(df['Open_Date'])
df['day'] = df['Open_Date'].dt.day
df['month'] = df['Open_Date'].dt.month
df['Product_Name'] = df['Product_Name'].replace('', 'N')
df['product_Type'] = df['product_Type'].replace('', 'A')
df['full_path'] = df['Type'] + "/" + df['Area'] + "/" + df['Sub_Area'] + "/" + df['product_Type'] + "/" + df['Product_Name']
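As a small illustration of why missing values need separate handling, here is a sketch on a hypothetical two-row frame (the `toy` name and its values are invented for the example, not taken from the question's data). Replacing only empty strings leaves NaN in place, and concatenation with + then yields NaN for the whole path in that row:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the second row has a missing Product_Name.
toy = pd.DataFrame({
    "Type": ["T1", "T2"],
    "Product_Name": ["Name1", np.nan],
})

# Replacing only empty strings leaves the NaN untouched ...
toy["Product_Name"] = toy["Product_Name"].replace("", "N")

# ... and string concatenation with + propagates it into the result.
toy["full_path"] = toy["Type"] + "/" + toy["Product_Name"]
print(toy)
```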
If there are missing values:
df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)
df['Open_Date'] = pd.to_datetime(df['Open_Date'])
df['day'] = df['Open_Date'].dt.day
df['month'] = df['Open_Date'].dt.month
df['Product_Name'] = df['Product_Name'].replace('', np.nan).fillna("N")
df['product_Type'] = df['product_Type'].replace('', np.nan).fillna("A")
df['full_path'] = df['Type'] + "/" + df['Area'] + "/" + df['Sub_Area'] + "/" + df['product_Type'] + "/" + df['Product_Name']
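To try the steps above without the real database, the following is a minimal end-to-end sketch. It assumes an in-memory SQLite database and two made-up rows standing in for firstline_srs (the column values and the ISO date strings are assumptions for illustration only); the pandas calls are the same as in the answer.

```python
import sqlite3

import numpy as np
import pandas as pd

# Assumption: an in-memory database with made-up rows replaces data/firstline.db.
cnx = sqlite3.connect(":memory:")
cnx.execute(
    "CREATE TABLE firstline_srs ("
    "Open_Date TEXT, Type TEXT, Area TEXT, Sub_Area TEXT, "
    "product_Type TEXT, Product_Name TEXT)"
)
rows = [
    ("2019-03-15 10:30:00", "T1", "A1", "S1", "P1", "Name1"),
    ("2019-11-02 08:00:00", "T2", "A2", "S2", "", None),  # empty and missing values
]
cnx.executemany("INSERT INTO firstline_srs VALUES (?, ?, ?, ?, ?, ?)", rows)

df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)

# Convert once, then reuse the datetime column for both extractions.
df['Open_Date'] = pd.to_datetime(df['Open_Date'])
df['day'] = df['Open_Date'].dt.day
df['month'] = df['Open_Date'].dt.month

# Treat empty strings as missing, then fill with the default, assigning back
# instead of using inplace.
df['Product_Name'] = df['Product_Name'].replace('', np.nan).fillna("N")
df['product_Type'] = df['product_Type'].replace('', np.nan).fillna("A")

df['full_path'] = (df['Type'] + "/" + df['Area'] + "/" + df['Sub_Area'] + "/"
                   + df['product_Type'] + "/" + df['Product_Name'])
print(df[['day', 'month', 'full_path']])
```

Converting Open_Date once and reusing it avoids parsing the date strings twice, which is what the two pd.DatetimeIndex calls in the question do.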