Most efficient way of storing and updating CSV data tables in a database using Python, maybe SQL?

Question:

I’m learning as I go, so I’m not really sure how to achieve what I want because I don’t know all the tools at my disposal. I’m not even sure how to ask this question, but here goes.

I’m collecting data from various sensors/sources using Google Cloud Compute Engine instances.

The data is saved into CSV files using Python and pandas, in the following format.

Data1.csv

Date         Data1
01-Jan-2000  122.1
...
09-Oct-2020  991.2

Data2.csv

Date         Data2
01-Jan-2000  101.1
...
09-Oct-2020  331.2

There’s always a Date column, and there’s always one data column whose name is unique to that file.

Sometimes the dates are daily, sometimes weekly; different data sets have different resolutions, frequencies, etc.

When I get a data table, I always get the entire history, because sometimes there are changes in the historical data which need to be updated.

Right now, I save all the CSV files on the servers, download them to my computer one by one, and merge them in Python using full outer joins.

# full outer join on the Date index
df = pd.merge(Data1df, Data2df, how='outer', left_index=True, right_index=True)

I do this one by one in a loop until I create one big dataframe.

bigframe.csv

Date         Data1 Data2 ......
01-Jan-2000  122.1 101.1 .....
...
09-Oct-2020  991.2 331.2 ....
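
The loop that builds it looks roughly like this (a sketch of my current approach; the file names are placeholders, and in reality there are many more files):

import pandas as pd

files = ['Data1.csv', 'Data2.csv']  # one CSV per sensor/source

bigframe = None
for f in files:
    df = pd.read_csv(f, parse_dates=['Date'], index_col='Date')
    if bigframe is None:
        bigframe = df
    else:
        # full outer join on the Date index, one file at a time
        bigframe = pd.merge(bigframe, df, how='outer',
                            left_index=True, right_index=True)

bigframe.to_csv('bigframe.csv')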

I need all of this data in the one bigframe table to work on.

When I update the data, which I do on a daily basis, I redo all the steps above one by one and recreate the bigframe table from scratch.

I feel what I’m doing is very inefficient.

I need advice on how to make this more efficient, and I have no idea where to start.

I have never used SQL or any kind of database before. Someone told me that it’s a more efficient way of storing / manipulating / updating my data.

I want to know if I can do everything I was doing in the example above, but with a database.

Can a database be created where I can save dataframes/CSV tables as columns in SQL tables?

Is there a way to add the CSV data as a new column in an SQL table directly, without saving it to my computer first?

I’m also worried about different servers accessing the database and updating values at the same time. Will there be a clash when two different "users" update the same table (but different columns, never the same column)?

If a column already exists, is there a way to update it easily?

Asked By: anarchy


Answers:

That is a lot of questions in one go. I will try to provide some answers.

  1. Yes, you can move directly from CSV to the database. You want COPY, or in the psql client \copy. \copy works from the point of view of the client user, while COPY works from the server user. Since you are using Python, and presumably psycopg2, you have access to COPY from the Python client as well. In either case you will need to CREATE TABLE for the data ahead of time. That leads to 2).

  2. I would CREATE TABLE some holding tables for the daily data dumps. You would TRUNCATE the table and then COPY the entire new set into it. COPY is very fast. A caveat is that it is all or nothing: if there is an error, the entire COPY will roll back.

  3. CREATE TABLE the final tables for the data. Then you can use joins to INSERT, UPDATE and DELETE on the final tables using the information in the holding tables. A sketch of the whole flow follows this list.
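
To make that concrete, here is a minimal sketch in Python, assuming PostgreSQL and psycopg2. All names here (the sensors database, the data1_staging holding table, the readings final table) are placeholders, and the final merge uses an INSERT ... ON CONFLICT upsert as one way of doing step 3):

import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect("dbname=sensors user=me")
cur = conn.cursor()

# 1) One-time setup: a holding table for the daily dump and a final table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS data1_staging (date date PRIMARY KEY, data1 numeric);
    CREATE TABLE IF NOT EXISTS readings (date date PRIMARY KEY, data1 numeric, data2 numeric);
""")

# 2) Empty the holding table, then COPY the full CSV history into it.
#    COPY is all or nothing: any error rolls the whole load back.
cur.execute("TRUNCATE data1_staging")
with open("Data1.csv") as f:
    cur.copy_expert(
        "COPY data1_staging (date, data1) FROM STDIN WITH (FORMAT csv, HEADER true)",
        f)

# 3) Merge holding into final: insert new dates, overwrite revised history.
cur.execute("""
    INSERT INTO readings (date, data1)
    SELECT date, data1 FROM data1_staging
    ON CONFLICT (date) DO UPDATE SET data1 = EXCLUDED.data1
""")

conn.commit()
cur.close()
conn.close()

With this in place, the daily update is just steps 2) and 3) per source, rather than re-downloading and re-merging everything, and the big combined table can be pulled out at any time with SELECT * FROM readings (or pandas.read_sql).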

Answered By: Adrian Klaver