How can I check if a record exists when passing a dataframe to SQL in pandas?

Question:

Background

I’m building an application that passes data from a CSV to an MS SQL database. This database serves as the repository for all of my enterprise’s records of this type (phone calls). When the application runs, it reads the CSV into a pandas dataframe, which I then append to my SQL table using SQLAlchemy and pyodbc.

However, due to the nature of the content I’m working with, some of the data has often already been imported to the table. I am looking for a way to check whether each record’s primary key (a column present in both my SQL table and my dataframe) already exists before appending that record to the table.

Current code

# save dataframe to mssql DB
import sqlalchemy as sql

engine = sql.create_engine('mssql+pyodbc://CTR-HV-DEVSQL3/MasterCallDb')
df.to_sql('Calls', engine, if_exists='append')

Sample data

My CSV is imported as a pandas dataframe (the primary key is FileName, which is always unique), then passed to MS SQL. This is my dataframe (df):

+---+------------+-------------+
|   |  FileName  |    Name     |
+---+------------+-------------+
| 1 | 123.flac   | Robert      |
| 2 | 456.flac   | Michael     |
| 3 | 789.flac   | Joesph      |
+---+------------+-------------+

Any ideas? Thanks!

Asked By: rchav9


Answers:

What if you iterated through the rows with DataFrame.iterrows() and, on each iteration, used ON DUPLICATE for your key value FileName so that it is not added again? A per-row sketch of this idea follows.
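Note that ON DUPLICATE KEY UPDATE is MySQL syntax; on MS SQL the closest per-row equivalent is a MERGE statement or an IF NOT EXISTS guard. A minimal sketch of the per-row idea, assuming the Calls table, columns, and engine from the question (the IF NOT EXISTS guard is one way, not the only way, to express "skip duplicates" on MS SQL):

from sqlalchemy import create_engine, text

engine = create_engine('mssql+pyodbc://CTR-HV-DEVSQL3/MasterCallDb')

# Insert each row only if its FileName is not already in the table
with engine.begin() as conn:
    for _, row in df.iterrows():
        conn.execute(
            text(
                "IF NOT EXISTS (SELECT 1 FROM Calls WHERE FileName = :fn) "
                "INSERT INTO Calls (FileName, Name) VALUES (:fn, :name)"
            ),
            {"fn": row["FileName"], "name": row["Name"]},
        )

This trades speed for simplicity: one round trip per row is fine for small batches but slow for large ones.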

Answered By: Benloper

Assuming you have no memory constraints and you’re not inserting null values, you could:

# Pull the existing keys, then use concat + drop_duplicates(keep=False)
# as an anti-join: rows whose keys appear in both frames are dropped
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
df = pd.concat((df, sql_df)).drop_duplicates(subset=['pk_1', 'pk_2', 'pk_3'], keep=False)
# leftover sql_df rows carry NaN in the non-key columns, so dropna() discards them
df = df.dropna()
df.to_sql('my_table', con=con, if_exists='append')

Depending on the application, you could also reduce the size of sql_df by changing the query, as sketched below.
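For example, you might fetch only the existing keys that could collide with the incoming batch (a sketch, assuming a single key column pk_1; the expanding bind parameter keeps the IN list safely parameterized):

from sqlalchemy import bindparam, text

# Fetch only the existing keys that also appear in the incoming dataframe
stmt = text("SELECT pk_1 FROM my_table WHERE pk_1 IN :keys").bindparams(
    bindparam("keys", expanding=True)
)
sql_df = pd.read_sql(sql=stmt, con=con, params={"keys": df["pk_1"].unique().tolist()})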

Update – this approach is better overall and can insert null values:

pks = ['pk_1', 'pk_2', 'pk_3']  # the primary-key columns
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
# Left anti-join: keep only the rows whose keys are absent from the table
df = df.loc[df[pks].merge(sql_df[pks], on=pks, how='left', indicator=True)['_merge'] == 'left_only']
# df = df.drop_duplicates(subset=pks)  # uncomment to drop duplicates within df itself
df.to_sql('my_table', con=con, if_exists='append')
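
The indicator=True merge tags each row of df as 'left_only' (not in the table) or 'both' (already in the table), so filtering on 'left_only' keeps exactly the new rows, including rows with nulls in their non-key columns.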
Answered By: Lucas

You can check if it is empty, like this:

sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)

if sql_df.empty:
    print("Is empty")
else:
    print("Is not empty")

You can set the parameter index=False; see the example below:

data.to_sql('book_details', con=engine, if_exists='append', chunksize=1000, index=False)

If it is not set, then the command automatically adds the index column.

book_details is the name of the table we want to insert our dataframe into.

Result without index=False:

[SQL: INSERT INTO book_details (`index`, book_id, title, price) VALUES (%(index)s, %(book_id)s, %(title)s, %(price)s)]
[parameters: ({'index': 0, 'book_id': 55, 'title': 'Programming', 'price': 29},
{'index': 1, 'book_id': 66, 'title': 'Learn', 'price': 23},
{'index': 2, 'book_id': 77, 'title': 'Data Science', 'price': 27})]

Therefore, the index column would have to exist in the table for the insert to succeed.

Answered By: Master EnergyNet