Most efficient way of storing and updating data points / CSV tables in a database using Python, maybe SQL?
Question:
I’m just learning as I go, so I’m not really sure how to achieve what I want because I don’t know all the tools at my disposal. I’m not sure how to ask this question, but here goes.
I’m collecting data from various sensors/sources using google cloud compute engines.
The data is saved into CSV files using Python and pandas in the following format.
Data1.csv
Date Data1
01-Jan-2000 122.1
...
09-Oct-2020 991.2
Data2.csv
Date Data2
01-Jan-2000 101.1
...
09-Oct-2020 331.2
There’s always a Date column, and there’s always a Data column whose name is unique to each file.
Sometimes the dates are daily, sometimes they’re weekly. Different data sets have different resolutions, frequencies, etc.
When I fetch a data table, I always get the entire history, because sometimes the historical data changes and needs to be updated.
Right now, what I do is save all the CSV files on the servers, download them to my computer one by one, and merge them in Python using full outer joins.
df = pd.merge(Data1df, Data2df, how='outer', left_index=True, right_index=True)
I do this one by one in a loop until I create one big dataframe.
bigframe.csv
Date Data1 Data2 ......
01-Jan-2000 122.1 101.1 .....
...
09-Oct-2020 991.2 331.2 ....
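The merge loop that produces this kind of bigframe can be sketched as follows; the two frames below are small illustrative stand-ins for the real per-source CSVs, not actual data:

```python
import pandas as pd
from functools import reduce

# Stand-ins for the frames loaded from Data1.csv, Data2.csv, ...
# (names and values are illustrative only).
frames = [
    pd.DataFrame({"Data1": [122.1, 991.2]},
                 index=pd.to_datetime(["2000-01-01", "2020-10-09"])),
    pd.DataFrame({"Data2": [101.1, 331.2]},
                 index=pd.to_datetime(["2000-01-01", "2020-10-09"])),
]

# Apply the full outer join on the date index pairwise across all frames.
bigframe = reduce(
    lambda left, right: pd.merge(left, right, how="outer",
                                 left_index=True, right_index=True),
    frames,
)
print(bigframe)
```

Using `reduce` just folds the same pairwise `pd.merge` call over the whole list, so it behaves exactly like the one-by-one loop.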
I need everything in one bigframe table to work on.
When I update the data, which happens daily, I redo all the steps above one by one and recreate the bigframe table.
I feel what I’m doing is very inefficient.
I need advice on how to make this more efficient, and I have no idea where to start.
I have never used SQL or any kind of database before. Someone told me that it’s a more efficient way of storing / manipulating / updating my data.
I want to know if I can do everything that I was doing previously in the example I described.
Can a database be created where I can save dataframes/csv tables as columns in SQL tables?
Is there a way to add the csv data as a new column into an SQL table directly without saving it to my computer?
I’m also worried about different servers accessing the database and updating values at the same time: will there be a clash when two different "users" update the same table (but different columns, never the same column)?
If the column already exists, is there a way to update it easily?
Answers:
That is a lot of questions in one go. I will try to provide some answers.
1. Yes, you can move directly from CSV to the database. You want COPY, or in the psql client, \copy. \copy works from the point of view of the client user, while COPY works from the server user. Since you are using Python and presumably psycopg2, you have access to COPY from the Python client as well. In either case you will need to CREATE TABLE for the data ahead of time. That leads to 2).
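As a sketch of 1), assuming PostgreSQL and a table shaped like the Data1.csv example (the table name, column names, and file path below are made up for illustration):

```sql
-- Create a table matching the CSV layout ahead of time.
CREATE TABLE data1_holding (
    obs_date date PRIMARY KEY,
    data1    numeric
);

-- Server-side load: the file path is read by the server process.
COPY data1_holding FROM '/path/on/server/Data1.csv'
    WITH (FORMAT csv, HEADER true);

-- Client-side equivalent from psql, reading a local file:
-- \copy data1_holding FROM 'Data1.csv' WITH (FORMAT csv, HEADER true)
```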
2. I would CREATE TABLE some holding tables for the daily data dumps. You would TRUNCATE the table and then COPY the entire new set into it. COPY is very fast. A caveat: it is all or nothing; if there is an error, the entire COPY will roll back.
3. CREATE TABLE the final tables for the data. Then you could use joins to INSERT, UPDATE, and DELETE on the final tables using the information in the holding tables.
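For 2) and 3), the daily refresh plus the holding-to-final merge can be written as a PostgreSQL upsert. The table and column names are again illustrative, and this assumes the date column is the primary key of both tables:

```sql
-- Refresh the holding table with the latest full-history dump.
TRUNCATE data1_holding;
COPY data1_holding FROM '/path/on/server/Data1.csv'
    WITH (FORMAT csv, HEADER true);

-- Insert new dates; update values where the history changed.
INSERT INTO data1_final (obs_date, data1)
SELECT obs_date, data1
FROM data1_holding
ON CONFLICT (obs_date)
DO UPDATE SET data1 = EXCLUDED.data1;
```

Because each source feed gets its own holding and final table, two servers loading different feeds never write to the same table, which addresses the concurrency worry in the question.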