How to import a mysqldump into Pandas
Question:
I am interested if there is a simple way to import a mysqldump into Pandas.
I have a few small (~110MB) tables and I would like to have them as DataFrames.
I would like to avoid having to put the data back into a database, since that would require installing and connecting to such a database. I have the .sql files and want to import the contained tables into Pandas. Does any module exist to do this?
If versioning matters, the .sql files all list "MySQL dump 10.13 Distrib 5.6.13, for Win32 (x86)" as the system the dump was produced on.
Background in hindsight
I was working locally on a computer with no database connection. The normal flow for my work was to be given a .tsv, .csv or JSON file from a third party and to do some analysis which would be given back. A new third party gave all their data in .sql format and this broke my workflow, since I would need a lot of overhead to get it into a format which my programs could take as input. We ended up asking them to send the data in a different format, but for business/reputation reasons wanted to look for a workaround first.
Edit: Below is a sample mysqldump file with two tables.
/*
MySQL - 5.6.28 : Database - ztest
*********************************************************************
*/
/*!40101 SET NAMES utf8 */;
/*!40101 SET SQL_MODE=''*/;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`ztest` /*!40100 DEFAULT CHARACTER SET latin1 */;
USE `ztest`;
/*Table structure for table `food_in` */
DROP TABLE IF EXISTS `food_in`;
CREATE TABLE `food_in` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Cat` varchar(255) DEFAULT NULL,
  `Item` varchar(255) DEFAULT NULL,
  `price` decimal(10,4) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL,
  KEY `ID` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;
/*Data for the table `food_in` */
insert into `food_in`(`ID`,`Cat`,`Item`,`price`,`quantity`) values
(2,'Liq','Beer','2.5000','300'),
(7,'Liq','Water','3.5000','230'),
(9,'Liq','Soda','3.5000','399');
/*Table structure for table `food_min` */
DROP TABLE IF EXISTS `food_min`;
CREATE TABLE `food_min` (
  `Item` varchar(255) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/*Data for the table `food_min` */
insert into `food_min`(`Item`,`quantity`) values
('Pizza','300'),
('Hotdogs','200'),
('Beer','300'),
('Water','230'),
('Soda','399'),
('Soup','100');
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
Answers:
One way is to convert the mysqldump to SQLite (e.g. by running this shell script), then read the resulting SQLite file/database.
See the SQL section of the docs:
pd.read_sql_table(table_name, sqlite_file)
Another option is just to run read_sql on the MySQL database directly…
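For instance, a minimal sketch of the SQLite route, assuming the dump has already been converted by the shell script above (the file name is hypothetical; the table name is taken from the sample dump above):
import sqlite3
import pandas as pd

# Assumes the mysqldump was already converted to an SQLite file,
# e.g. 'dump.sqlite' (hypothetical name), by the conversion script.
conn = sqlite3.connect('dump.sqlite')
# read_sql also accepts a plain DB-API connection for SQLite
df_food_min = pd.read_sql('SELECT * FROM food_min', conn)
conn.close()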
No
Pandas has no native way of reading a mysqldump without it passing through a database.
There is a possible workaround, but it is in my opinion a very bad idea.
Workaround (Not recommended for production use)
Of course you could parse the data from the mysqldump file using a preprocessor.
mysqldump files often contain a lot of extra data that we are not interested in when loading a pandas DataFrame, so we need to preprocess the file, remove the noise, and even reformat lines so that they conform.
Using StringIO we can read the file and process the data before it is fed to the pandas.read_csv function:
from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO
import re

def read_dump(dump_filename, target_table):
    sio = StringIO()

    fast_forward = True
    with open(dump_filename, 'rb') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('insert') and target_table in line:
                fast_forward = False
            if fast_forward:
                continue
            # grab the parenthesised value tuple, e.g. (2,'Liq','Beer','2.5000','300')
            data = re.findall(r'\([^\)]*\)', line)
            try:
                newline = data[0]
                newline = newline.strip(' ()')
                newline = newline.replace('`', '')
                sio.write(newline)
                sio.write('\n')
            except IndexError:
                pass
            if line.endswith(';'):
                break
    sio.seek(0)  # rewind so pandas reads from the start
    return sio
Now that we have a function that reads and formats the data to look like a CSV file, we can read it with pandas.read_csv():
import pandas as pd
food_min_filedata = read_dump('mysqldumpexample', 'food_min')
food_in_filedata = read_dump('mysqldumpexample', 'food_in')
df_food_min = pd.read_csv(food_min_filedata)
df_food_in = pd.read_csv(food_in_filedata)
Results in:
        Item quantity
0    'Pizza'    '300'
1  'Hotdogs'    '200'
2     'Beer'    '300'
3    'Water'    '230'
4     'Soda'    '399'
5     'Soup'    '100'
and
   ID    Cat     Item     price quantity
0   2  'Liq'   'Beer'  '2.5000'    '300'
1   7  'Liq'  'Water'  '3.5000'    '230'
2   9  'Liq'   'Soda'  '3.5000'    '399'
Note on stream processing
This approach is called stream processing and is extremely lean, using almost no memory at all. In general it is a good idea to use this approach to read CSV files more efficiently into pandas. It is the parsing of a mysqldump file that I advise against.
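As an aside, pandas itself supports this streaming style for ordinary CSV files through the chunksize parameter; a minimal sketch (the file name is hypothetical):
import pandas as pd

# Read the CSV in pieces instead of loading it all at once;
# each chunk is a regular DataFrame.
for chunk in pd.read_csv('big_file.csv', chunksize=100000):
    print(chunk.shape)  # replace with your own per-chunk processing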
I found myself in a similar situation to yours, and the answer from @firelynx was really helpful!
But since I had only limited knowledge of the tables included in the file, I extended the script by adding header generation (pandas picks the header up automatically) and by searching for all the tables within the dump file. As a result, I ended up with the following script, which indeed works extremely fast. I switched to io.StringIO, and save the resulting tables as table_name.csv files.
P.S. I also support the advice against relying on this approach, and provide the code just for illustration purposes 🙂
So, first things first, we can augment the read_dump function like this:
from io import StringIO
import re, shutil

def read_dump(dump_filename, target_table):
    sio = StringIO()

    read_mode = 0  # 0 - skip, 1 - header, 2 - data
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('insert') and target_table in line:
                read_mode = 2
            if line.lower().startswith('create table') and target_table in line:
                read_mode = 1
                continue

            if read_mode == 0:
                continue
            # Filling up the headers
            elif read_mode == 1:
                if line.lower().startswith('primary'):
                    # add more conditions here for different cases
                    # (e.g. when simply a key is defined, or no key is defined)
                    read_mode = 0
                    sio.seek(sio.tell() - 1)  # step back onto the trailing comma...
                    sio.write('\n')           # ...and overwrite it with a newline
                    continue
                colheader = re.findall(r'`([\w_]+)`', line)
                for col in colheader:
                    sio.write(col.strip())
                    sio.write(',')
            # Filling up the data - same as @firelynx's code
            elif read_mode == 2:
                data = re.findall(r'\([^\)]*\)', line)
                try:
                    newline = data[0]
                    newline = newline.strip(' ()')
                    newline = newline.replace('`', '')
                    sio.write(newline)
                    sio.write('\n')
                except IndexError:
                    pass
                if line.endswith(';'):
                    break

    sio.seek(0)
    with open(target_table + '.csv', 'w') as fd:
        shutil.copyfileobj(sio, fd, -1)
    return  # or simply return sio itself
To find the list of tables we can use the following function:
def find_tables(dump_filename):
    table_list = []
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('create table'):
                table_name = re.findall(r'create table `([\w_]+)`', line.lower())
                table_list.extend(table_name)
    return table_list
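Run against the sample dump from the question, this should return both table names:
tables = find_tables('mysqldumpexample')  # the dump file name used above
print(tables)  # expected: ['food_in', 'food_min']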
Then just combine the two, for example in a .py script that you’ll run like
python this_script.py mysqldump_name.sql [table_name]
import sys
import os.path

def main():
    try:
        if len(sys.argv) >= 2 and os.path.isfile(sys.argv[1]):
            if len(sys.argv) == 2:
                print('Table name not provided, looking for all tables...')
                table_list = find_tables(sys.argv[1])
                if len(table_list) > 0:
                    print('Found tables: ', str(table_list))
                    for table in table_list:
                        read_dump(sys.argv[1], table)
            elif len(sys.argv) == 3:
                read_dump(sys.argv[1], sys.argv[2])
    except KeyboardInterrupt:
        sys.exit(0)

if __name__ == '__main__':
    main()
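The generated files can then be loaded as usual; with the sample dump from the question this would look like:
import pandas as pd

# read_dump() above writes one <table_name>.csv per table
df_food_min = pd.read_csv('food_min.csv')
df_food_in = pd.read_csv('food_in.csv')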
I would like to share my solution to this problem and ask for feedback:
import pandas as pd
import re
import os.path
import csv
import logging
import sys
def convert_dump_to_intermediate_csv(dump_filename, csv_header, csv_out_put_file, delete_csv_file_after_read=True):
    """
    :param dump_filename: path to a MySQL export dump (mysqldump... syntax)
    :param csv_header: the very first line that should appear in the csv file; give a comma-separated string
    :param csv_out_put_file: the name of the csv file
    :param delete_csv_file_after_read: if set to False the csv file is kept; on a later run new records are appended to it.
    :return: returns a pandas dataframe for further analysis.
    """
    # matches the VALUES part of single-line INSERT statements
    pre_compiled_all_values_per_line = re.compile(r'(?:INSERT\sINTO\s\S[a-z\S]+\sVALUES\s+)(?P<values>.*)(?=;)')
    # matches each parenthesised value tuple inside the VALUES part
    value_compile = re.compile(r'\(.*?\)')
    with open(dump_filename, 'r') as f:
        for line in f:
            result = pre_compiled_all_values_per_line.finditer(line)
            for element in result:
                values_only = element.group('values')
                all_identified = value_compile.finditer(values_only)
                for single_values in all_identified:
                    string_to_split = single_values.group(0)[1:-1]  # strip the surrounding parentheses
                    string_array = string_to_split.split(",")
                    if not os.path.exists(csv_out_put_file):
                        with open(csv_out_put_file, 'w', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow(csv_header.split(","))
                            writer.writerow(string_array)
                    else:
                        with open(csv_out_put_file, 'a', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow(string_array)
    df = pd.read_csv(csv_out_put_file)
    if delete_csv_file_after_read:
        os.remove(csv_out_put_file)
    return df
if __name__ == "__main__":
    log_name = 'test.log'
    LOGGER = logging.getLogger(log_name)
    LOGGER.setLevel(logging.DEBUG)
    LOGGER.addHandler(logging.NullHandler())
    FORMATTER = logging.Formatter(
        fmt='%(asctime)s %(levelname)-8s %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    SCREEN_HANDLER = logging.StreamHandler(stream=sys.stdout)
    SCREEN_HANDLER.setFormatter(FORMATTER)
    LOGGER.addHandler(SCREEN_HANDLER)

    dump_filename = 'test_sql.sql'
    header_of_csv_file = "A,B,C,D,E,F,G,H,I"  # I did not identify the columns in the table definition...
    csv_output_file = 'test.csv'
    pandas_df = convert_dump_to_intermediate_csv(dump_filename, header_of_csv_file, csv_output_file, delete_csv_file_after_read=False)
    LOGGER.debug(pandas_df)
Of course, the logger part can be removed.
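For reference, here is what the INSERT-matching pattern captures on a single-line statement. Note that it appears to assume uppercase keywords and statements without a column list (my reading of the regex, not something stated in the answer):
import re

pattern = re.compile(r'(?:INSERT\sINTO\s\S[a-z\S]+\sVALUES\s+)(?P<values>.*)(?=;)')
line = "INSERT INTO mytable VALUES (1,'a','b'),(2,'c','d');"
print(pattern.search(line).group('values'))
# -> (1,'a','b'),(2,'c','d')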
"I was working locally on a computer with no database connection. The normal flow for my work was to be given a .tsv…"
Try the mysqltotsv module from PyPI:
pip3 install --user mysqltotsv
python3 mysql-to-tsv.py --file dump.sql --outdir out1
This will produce multiple .tsv files in the out1 directory (one .tsv file for each table found in the MySQL dump). From there you can continue your normal workflow with Pandas by loading the TSV files.
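A sketch of loading one of them (the exact file names depend on the tables in your dump):
import pandas as pd

# One .tsv per table is written to out1/, named after the table (assumption).
df = pd.read_csv('out1/food_min.tsv', sep='\t')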