How to import a mysqldump into Pandas

Question:

I would like to know whether there is a simple way to import a mysqldump into Pandas.

I have a few small (~110MB) tables and I would like to have them as DataFrames.

I would like to avoid having to put the data back into a database, since that would require installing and connecting to one. I have the .sql files and want to import the contained tables into Pandas. Does any module exist to do this?

If versioning matters the .sql files all list "MySQL dump 10.13 Distrib 5.6.13, for Win32 (x86)" as the system the dump was produced in.

Background in hindsight

I was working locally on a computer with no database connection. The normal flow for my work was to be given a .tsv, .csv or json from a third party and to do some analysis which would be given back. A new third party gave all their data in .sql format, and this broke my workflow since I would need a lot of overhead to get it into a format which my programs could take as input. We ended up asking them to send the data in a different format, but for business/reputation reasons wanted to look for a workaround first.

Edit: Below is a sample mysqldump file with two tables.

/*
MySQL - 5.6.28 : Database - ztest
*********************************************************************
*/


/*!40101 SET NAMES utf8 */;

/*!40101 SET SQL_MODE=''*/;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`ztest` /*!40100 DEFAULT CHARACTER SET latin1 */;

USE `ztest`;

/*Table structure for table `food_in` */

DROP TABLE IF EXISTS `food_in`;

CREATE TABLE `food_in` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Cat` varchar(255) DEFAULT NULL,
  `Item` varchar(255) DEFAULT NULL,
  `price` decimal(10,4) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL,
  KEY `ID` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;

/*Data for the table `food_in` */

insert  into `food_in`(`ID`,`Cat`,`Item`,`price`,`quantity`) values 

(2,'Liq','Beer','2.5000','300'),

(7,'Liq','Water','3.5000','230'),

(9,'Liq','Soda','3.5000','399');

/*Table structure for table `food_min` */

DROP TABLE IF EXISTS `food_min`;

CREATE TABLE `food_min` (
  `Item` varchar(255) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

/*Data for the table `food_min` */

insert  into `food_min`(`Item`,`quantity`) values 

('Pizza','300'),

('Hotdogs','200'),

('Beer','300'),

('Water','230'),

('Soda','399'),

('Soup','100');

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
Asked By: Keith


Answers:

One way is to export mysqldump to sqlite (e.g. run this shell script) then read the sqlite file/database.

See the SQL section of the docs:

pd.read_sql_table(table_name, 'sqlite:///' + sqlite_file)  # needs SQLAlchemy; the connection must be a database URI, not a bare file path

Another option is just to run read_sql on the mysql database directly…
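
Once the dump has been converted, reading the SQLite file needs no server. A minimal sketch, assuming the conversion already happened: an in-memory database stands in for the converted file, the rows are copied from the sample dump in the question, and `fetch_table` is a hypothetical helper, not part of any library.

```python
import sqlite3

def fetch_table(conn, table_name):
    """Return (column_names, rows) for one table of an open SQLite connection."""
    # Table names cannot be bound as parameters, so quote the identifier by hand.
    quoted = '"' + table_name.replace('"', '""') + '"'
    cur = conn.execute(f"SELECT * FROM {quoted}")
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()

# Stand-in for the converted dump; a real file would be sqlite3.connect("path.sqlite").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE food_min (Item TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO food_min VALUES (?, ?)",
    [("Pizza", 300), ("Hotdogs", 200), ("Beer", 300)],
)

columns, rows = fetch_table(conn, "food_min")
print(columns)
print(rows[0])
```

With pandas installed, `pd.read_sql_query("SELECT * FROM food_min", conn)` accepts the same `sqlite3` connection and returns a DataFrame directly.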

Answered By: Andy Hayden

No

Pandas has no native way of reading a mysqldump without it passing through a database.

There is a possible workaround, but it is in my opinion a very bad idea.

Workaround (Not recommended for production use)

Of course you could parse the data from the mysqldump file using a preprocessor.

MySQLdump files often contain a lot of extra data we are not interested in when loading a pandas dataframe, so we need to preprocess it and remove noise and even reformat lines so that they conform.

Using StringIO we can read the file and process the data before it is fed to the pandas.read_csv function

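from io import StringIO
import re

def read_dump(dump_filename, target_table):
    """Collect the INSERT statement of target_table as CSV text in a StringIO."""
    sio = StringIO()

    fast_forward = True
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            # Stop skipping once the INSERT statement for the table is reached
            if line.lower().startswith('insert') and target_table in line:
                fast_forward = False
            if fast_forward:
                continue
            # Take the first parenthesised group on the line: the column list
            # on the INSERT line (it becomes the CSV header), then one row per line
            data = re.findall(r'\([^\)]*\)', line)
            try:
                newline = data[0]
                newline = newline.strip(' ()')
                newline = newline.replace('`', '')
                sio.write(newline)
                sio.write("\n")
            except IndexError:
                # Line without a parenthesised group (e.g. blank): nothing to write
                pass
            if line.endswith(';'):
                break
    sio.seek(0)  # rewind so read_csv starts at the beginning
    return sio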

Now that we have a function that reads and formats the data to look like a CSV file, we can read it with pandas.read_csv()

import pandas as pd

food_min_filedata = read_dump('mysqldumpexample', 'food_min')
food_in_filedata = read_dump('mysqldumpexample', 'food_in')

df_food_min = pd.read_csv(food_min_filedata)
df_food_in = pd.read_csv(food_in_filedata)

Results in:

        Item quantity
0    'Pizza'    '300'
1  'Hotdogs'    '200'
2     'Beer'    '300'
3    'Water'    '230'
4     'Soda'    '399'
5     'Soup'    '100'

and

   ID    Cat     Item     price quantity
0   2  'Liq'   'Beer'  '2.5000'    '300'
1   7  'Liq'  'Water'  '3.5000'    '230'
2   9  'Liq'   'Soda'  '3.5000'    '399'
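
The values in both frames still carry their single quotes, because the dump quotes strings with `'` while CSV readers default to `"`. A small sketch with the standard `csv` module, using text shaped like what `read_dump` produces for `food_min` (the sample text is typed in by hand here, not read from a file):

```python
import csv
from io import StringIO

# Text shaped like the output of read_dump() for the food_min table.
csv_text = "Item,quantity\n'Pizza','300'\n'Hotdogs','200'\n"

# quotechar="'" tells the reader that single quotes delimit a field,
# so they are removed instead of ending up inside the values.
rows = list(csv.reader(StringIO(csv_text), quotechar="'"))
print(rows)
```

`pandas.read_csv` takes a `quotechar` parameter with the same meaning, so `pd.read_csv(food_min_filedata, quotechar="'")` should strip the quotes in the same way.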

Note on Stream processing

This approach is called stream processing: the file is read line by line, so only the rows of the target table are held in memory, never the whole dump. In general it is a good idea to use this approach to read csv files more efficiently into pandas.

It is the parsing of a mysqldump file that I advise against.
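
One reason the regex approach is fragile: a string value may itself contain a comma or a parenthesis, which a pattern like "everything up to the next `)`" cuts in the wrong place. The sketch below walks the VALUES text character by character instead. `split_value_tuples` is a hypothetical helper, and it covers only quoted strings, doubled quotes, backslash escapes and unquoted NULL, not the full MySQL grammar.

```python
def split_value_tuples(values_sql):
    """Split "(1,'a'),(2,'b')" into [['1', 'a'], ['2', 'b']], honoring quoted strings."""
    rows, row, field = [], [], []
    in_string = in_tuple = was_quoted = False
    i = 0
    while i < len(values_sql):
        ch = values_sql[i]
        if in_string:
            if ch == "\\" and i + 1 < len(values_sql):
                # Backslash escape: keep the escaped character as-is
                # (sequences such as \n are not decoded in this sketch).
                field.append(values_sql[i + 1])
                i += 1
            elif ch == "'" and values_sql[i + 1:i + 2] == "'":
                field.append("'")  # a doubled quote is an escaped quote
                i += 1
            elif ch == "'":
                in_string = False
            else:
                field.append(ch)
        elif ch == "'":
            in_string = was_quoted = True
        elif ch == "(" and not in_tuple:
            in_tuple = True
        elif ch in ",)" and in_tuple:
            text = "".join(field)
            # Only an unquoted NULL is a missing value; 'NULL' is a string.
            row.append(None if text.upper() == "NULL" and not was_quoted else text)
            field, was_quoted = [], False
            if ch == ")":
                rows.append(row)
                row, in_tuple = [], False
        elif in_tuple and not ch.isspace():
            field.append(ch)
        i += 1
    return rows

sample = "(2,'Liq','Be,er (cold)',NULL),(7,'It''s','x');"
parsed = split_value_tuples(sample)
print(parsed)
```

Every value comes back as a string (or `None`), so numeric columns would still need converting after the rows are loaded into a DataFrame.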

Answered By: firelynx

I found myself in a similar situation to yours, and the answer from @firelynx was really helpful!

But since I had only limited knowledge of the tables included in the file, I extended the script by adding header generation (pandas picks the header up automatically), as well as a search for all the tables within the dump file. As a result, I ended up with the following script, which indeed works extremely fast. I switched to io.StringIO, and save the resulting tables as table_name.csv files.

P.S. I also support the advice against relying on this approach, and provide the code just for illustration purposes 🙂

So, first things first, we can augment the read_dump function like this

from io import StringIO
import re, shutil

def read_dump(dump_filename, target_table):
    sio = StringIO()

    read_mode = 0 # 0 - skip, 1 - header, 2 - data
    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('insert') and target_table in line:
                read_mode = 2
            if line.lower().startswith('create table') and target_table in line:
                read_mode = 1
                continue

            if read_mode==0:
                continue

            # Filling up the headers
            elif read_mode==1:
                if line.lower().startswith('primary'):
                    # add more conditions here for different cases 
                    #(e.g. when simply a key is defined, or no key is defined)
                    read_mode=0
                    sio.seek(sio.tell()-1) # delete last comma
                    sio.write('\n')
                    continue
                colheader = re.findall(r'`([\w_]+)`',line)
                for col in colheader:
                    sio.write(col.strip())
                    sio.write(',')

            # Filling up the data -same as @firelynx's code
            elif read_mode ==2:
                data = re.findall(r'\([^\)]*\)', line)
                try:
                    newline = data[0]
                    newline = newline.strip(' ()')
                    newline = newline.replace('`', '')
                    sio.write(newline)
                    sio.write('\n')
                except IndexError:
                    pass
                if line.endswith(';'):
                    break
    sio.seek(0)
    # Save the collected header and rows as <table_name>.csv
    with open (target_table+'.csv', 'w') as fd:
        shutil.copyfileobj(sio, fd,-1)
    return # or simply return sio itself
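
The header step above ends only when a line starting with PRIMARY is seen, which the sample dump in the question never has (`food_in` has a plain KEY, `food_min` has no key at all). One possible way to handle those cases is sketched below. `columns_from_create_table` is a hypothetical helper, and it assumes the usual mysqldump layout of one column definition per line with the statement closed by a line starting with `)`.

```python
import re

def columns_from_create_table(lines):
    """Collect column names from the lines of one CREATE TABLE statement."""
    columns = []
    inside = False
    for line in lines:
        line = line.strip()
        if line.lower().startswith("create table"):
            inside = True
            continue
        if not inside:
            continue
        if line.startswith(")"):
            break  # closing line of the statement, e.g. ") ENGINE=InnoDB ...;"
        # Column definitions start with a backtick-quoted name; KEY, PRIMARY KEY
        # and CONSTRAINT lines start with a keyword and are skipped.
        match = re.match(r"`([^`]+)`", line)
        if match:
            columns.append(match.group(1))
    return columns

ddl = """CREATE TABLE `food_in` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Cat` varchar(255) DEFAULT NULL,
  `Item` varchar(255) DEFAULT NULL,
  `price` decimal(10,4) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL,
  KEY `ID` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;""".splitlines()

header = columns_from_create_table(ddl)
print(",".join(header))
```

Keying on the leading backtick rather than on a specific keyword means the same check covers tables with a primary key, a plain key, or no key.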

To find the list of tables we can use the following function:

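def find_tables(dump_filename):
    table_list=[]

    with open(dump_filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line.lower().startswith('create table'):
                # Match case-insensitively but keep the table name's original case,
                # since read_dump() looks for it with a case-sensitive `in` test
                table_name = re.findall(r'create table `([\w_]+)`', line, re.IGNORECASE)
                table_list.extend(table_name)

    return table_list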

Then just combine the two, for example in a .py script that you’ll run like

python this_script.py mysqldump_name.sql [table_name]

import os.path
import sys

def main():
    try:
        if len(sys.argv)>=2 and os.path.isfile(sys.argv[1]):
            if len(sys.argv)==2:
                print('Table name not provided, looking for all tables...')
                table_list = find_tables(sys.argv[1])
                if len(table_list)>0:
                    print('Found tables: ',str(table_list))
                    for table in table_list:
                        read_dump(sys.argv[1], table)
            elif len(sys.argv)==3:
                read_dump(sys.argv[1], sys.argv[2])
    except KeyboardInterrupt:
        sys.exit(0)

if __name__ == '__main__':
    main()
Answered By: Tony S.

I would like to share my solution to this problem and ask for feedback:

import pandas as pd
import re
import os.path
import csv
import logging
import sys


def convert_dump_to_intermediate_csv(dump_filename, csv_header, csv_out_put_file, delete_csv_file_after_read=True):
    """
    :param dump_filename: a MySQL export dump (mysqldump syntax)
    :param csv_header: the first line to write to the csv file, given as a comma-separated string
    :param csv_out_put_file: the name of the csv file
    :param delete_csv_file_after_read: if False, the csv file is kept after reading; if it already exists, new rows are appended to it.
    :return: returns a pandas dataframe for further analysis.
    """
    # Compile both patterns once, outside the loop.
    # First: everything between "INSERT INTO <table> VALUES " and the final ";"
    pre_compiled_all_values_per_line = re.compile(r'(?:INSERT\sINTO\s\S[a-z\S]+\sVALUES\s+)(?P<values>.*)(?=;)')
    # Second: one parenthesised row, matched non-greedily
    value_compile = re.compile(r'\(.*?\)')
    with open(dump_filename, 'r') as f:
        for line in f:
            result = pre_compiled_all_values_per_line.finditer(line)
            for element in result:
                values_only = element.group('values')
                all_identified = value_compile.finditer(values_only)
                for single_values in all_identified:
                    # Drop the surrounding parentheses, then split on commas
                    string_to_split = single_values.group(0)[1:-1]
                    string_array = string_to_split.split(",")

                    if not os.path.exists(csv_out_put_file):
                        # First row: create the file and write the header
                        with open(csv_out_put_file, 'w', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow(csv_header.split(","))
                            writer.writerow(string_array)
                    else:
                        with open(csv_out_put_file, 'a', newline='') as file:
                            writer = csv.writer(file)
                            writer.writerow(string_array)
    df = pd.read_csv(csv_out_put_file)
    if delete_csv_file_after_read:
        os.remove(csv_out_put_file)
    return df


if __name__ == "__main__":
    log_name = 'test.log'
    LOGGER = logging.getLogger(log_name)
    LOGGER.setLevel(logging.DEBUG)
    LOGGER.addHandler(logging.NullHandler())
    FORMATTER = logging.Formatter(
        fmt='%(asctime)s %(levelname)-8s %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S')
    SCREEN_HANDLER = logging.StreamHandler(stream=sys.stdout)
    SCREEN_HANDLER.setFormatter(FORMATTER)
    LOGGER.addHandler(SCREEN_HANDLER)

    dump_filename = 'test_sql.sql'
    header_of_csv_file = "A,B,C,D,E,F,G,H,I" # i did not identify the columns in the table definition...
    csv_output_file = 'test.csv'
    pandas_df = convert_dump_to_intermediate_csv(dump_filename, header_of_csv_file, csv_output_file, delete_csv_file_after_read=False)
    LOGGER.debug(pandas_df)

Of course, the logger part can be removed.

Answered By: Peter Ebelsberger

I was working locally on a computer with no data base connection. The normal flow for my work was to be given a .tsv

Try the mysqltotsv pypi module:

pip3 install --user mysqltotsv
python3 mysql-to-tsv.py --file dump.sql --outdir out1

This will produce multiple .tsv files in the out1 directory (one .tsv file for each table found in the MySQL dump). And from there on you can continue your normal workflow with Pandas by loading the TSV files.

Answered By: wsdookadr