Pandas adding new columns performance issue

Question:

I’m trying to add two new columns that extract the day and the month from a full date. My data set currently has about 1.2 million records and is expected to exceed 20 million by the end of the year, and adding the columns takes a very long time. What is the best practice for doing this?

I’m using SQLite, and here is my code:

import sqlite3
import numpy as np
import pandas as pd

cnx = sqlite3.connect('data/firstline.db')
df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)
df['day'] = pd.DatetimeIndex(df['Open_Date']).day
df['month'] = pd.DatetimeIndex(df['Open_Date']).month

df['Product_Name'].replace('', np.nan, inplace=True)
df['Product_Name'].fillna("N", inplace = True) 

df['product_Type'].replace('', np.nan, inplace=True)
df['product_Type'].fillna("A", inplace = True) 

df['full_path'] = df['Type'] + "/" + df['Area'] + "/" + df['Sub_Area'] + "/" + df['product_Type'] + "/" + df['Product_Name']
Asked By: Islam Fahmy


Answers:

If there is no missing data in the original DataFrame, the solution can be simplified a bit.

Also, I think inplace is not good practice; assign the result back to the column instead.

Also, combining all the columns with + is a nice solution and one of the fastest.

df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)
df['Open_Date'] = pd.to_datetime(df['Open_Date'])

df['day'] = df['Open_Date'].dt.day
df['month'] = df['Open_Date'].dt.month

df['Product_Name'] = df['Product_Name'].replace('', 'N')
df['product_Type'] = df['product_Type'].replace('', 'A')


df['full_path'] = df['Type'] + "/" + df['Area'] + "/" + df['Sub_Area'] + "/" + df['product_Type'] + "/" + df['Product_Name']
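
To try the date part of this without the real database, here is a minimal sketch that uses an in-memory SQLite table. The sample rows and the date format string are assumptions made for illustration; only the table and column names come from the question. The idea is to parse Open_Date once and reuse the parsed column for both new columns. Passing an explicit format to pd.to_datetime may also help on large data, but that only applies if every value really has that format.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory table standing in for firstline_srs.
cnx = sqlite3.connect(":memory:")
cnx.execute("CREATE TABLE firstline_srs (Open_Date TEXT, Type TEXT)")
cnx.executemany(
    "INSERT INTO firstline_srs VALUES (?, ?)",
    [("2019-01-15 08:30:00", "A"), ("2019-12-03 17:45:00", "B")],
)
cnx.commit()

df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)

# Parse the text column a single time. The format string is an assumption
# about how the dates are stored; drop it if the stored format differs.
df["Open_Date"] = pd.to_datetime(df["Open_Date"], format="%Y-%m-%d %H:%M:%S")

# Both columns reuse the already parsed datetime column.
df["day"] = df["Open_Date"].dt.day
df["month"] = df["Open_Date"].dt.month

print(df[["Open_Date", "day", "month"]])
```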

If there are missing values:

df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)
df['Open_Date'] = pd.to_datetime(df['Open_Date'])

df['day'] = df['Open_Date'].dt.day
df['month'] = df['Open_Date'].dt.month

df['Product_Name'] = df['Product_Name'].replace('', np.nan).fillna("N")
df['product_Type'] = df['product_Type'].replace('', np.nan).fillna("A")


df['full_path'] = df['Type'] + "/" + df['Area'] + "/" + df['Sub_Area'] + "/" + df['product_Type'] + "/" + df['Product_Name']
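
To see why the fillna step matters when values are missing, here is a small sketch on made-up data (the demo frame and its values are hypothetical, and only two of the columns are used to keep it short). Concatenating with + propagates a missing value into the whole result, so the missing entries are filled before the path is built.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one normal value, one empty string, one missing value.
demo = pd.DataFrame({
    "Type": ["T1", "T2", "T3"],
    "Product_Name": ["Widget", "", None],
})

# Without filling, the row with the missing value yields a missing path.
raw_path = demo["Type"] + "/" + demo["Product_Name"]

# Turn empty strings into NaN, then fill every missing value with "N".
filled = demo["Product_Name"].replace("", np.nan).fillna("N")
full_path = demo["Type"] + "/" + filled

print(raw_path)
print(full_path)
```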
Answered By: jezrael