How do I split a huge text file in Python

Question:

I have a huge text file (~1GB) and sadly the text editor I use won’t read such a large file. However, if I can just split it into two or three parts I’ll be fine, so, as an exercise I wanted to write a program in python to do it.

What I think I want the program to do is to find the size of a file, divide that number into parts, and for each part, read up to that point in chunks, writing to a filename.nnn output file, then read up to the next line-break and write that, then close the output file, etc. Obviously the last output file just copies to the end of the input file.

Can you help me with the key filesystem related parts: filesize, reading and writing in chunks and reading to a line-break?

I’ll be writing this code test-first, so there’s no need to give me a complete answer, unless it’s a one-liner 😉

Asked By: quamrana


Answers:

Check out os.stat() for file size and file.readlines([sizehint]). Those two functions should be all you need for the reading part, and hopefully you know how to do the writing 🙂
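
For instance, a minimal sketch of that approach (the per-part byte budget and the .001-style output naming here are just placeholders):

import os

def split_file(path, num_parts):
    part_size = os.stat(path).st_size // num_parts   # rough byte budget per part
    with open(path) as infile:
        part = 0
        while True:
            # readlines(hint) returns whole lines totalling roughly 'hint' bytes,
            # so each part ends on a line break.
            lines = infile.readlines(part_size)
            if not lines:
                break
            part += 1
            with open("%s.%03d" % (path, part), "w") as outfile:
                outfile.writelines(lines)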

Answered By: Kamil Kisiel

You can use wc and split (see the respective manpages) to get the desired effect. In bash:

split -dl$((`wc -l 'filename'|sed 's/ .*$//'` / 3 + 1)) filename filename-chunk.

produces 3 parts with the same line count (give or take a rounding error in the last, of course), named filename-chunk.00 to filename-chunk.02.

Answered By: Svante

Or, a Python version of wc and split:

lines = 0
for l in open(filename): lines += 1

Then some code to read the first lines/3 into one file, the next lines/3 into another, etc.
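
A minimal sketch of that idea, assuming the output files are simply named filename.0, filename.1, and so on:

def split_by_lines(filename, num_parts=3):
    # First pass: count the lines.
    with open(filename) as f:
        total = sum(1 for _ in f)
    per_part = total // num_parts + 1
    # Second pass: write per_part lines into each output file.
    with open(filename) as f:
        for part in range(num_parts):
            with open("%s.%d" % (filename, part), "w") as out:
                for _ in range(per_part):
                    line = f.readline()
                    if not line:
                        break
                    out.write(line)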

Answered By: Claudiu

I’ve written the program and it seems to work fine. So thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here)
Later I may get round to doing a version that does a binary read to see if it’s any quicker.

import os

def Split(inputFile,numParts,outputName):
    fileSize=os.stat(inputFile).st_size
    parts=FileSizeParts(fileSize,numParts)  # list of byte counts, one per part (not shown here)
    openInputFile = open(inputFile, 'r')
    outPart=1
    for part in parts:
        if openInputFile.tell()<fileSize:
            fullOutputName=outputName+os.extsep+str(outPart)
            outPart+=1
            openOutputFile=open(fullOutputName,'w')
            # readlines(part) reads whole lines totalling roughly 'part' bytes
            openOutputFile.writelines(openInputFile.readlines(part))
            openOutputFile.close()
    openInputFile.close()
    return outPart-1
Answered By: quamrana

Linux has a split command:

split -l 100000 file.txt

would split file.txt into files of 100,000 lines each.

Answered By: James

Don’t forget seek() and mmap() for random access to files.

import mmap

def getSomeChunk(filename, start, length):
    fobj = open(filename, 'r+b')
    m = mmap.mmap(fobj.fileno(), 0)   # map the whole file into memory
    return m[start:start+length]      # slicing reads only the requested bytes
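
Hypothetical usage, grabbing one megabyte starting one megabyte into the file:

chunk = getSomeChunk("huge.txt", 2**20, 2**20)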
Answered By: Joe Koberg

This generator method is a (slow) way to get a slice of lines without blowing up your memory.

import itertools

def slicefile(filename, start, end):
    lines = open(filename)
    return itertools.islice(lines, start, end)

out = open("/blah.txt", "w")
for line in slicefile("/python27/readme.txt", 10, 15):
    out.write(line)
Answered By: Ryan Ginstrom

As an alternative method, using the logging library:

>>> import logging.handlers
>>> log = logging.getLogger()
>>> fh = logging.handlers.RotatingFileHandler("D://filename.txt", 
     maxBytes=2**20*100, backupCount=100) 
# 100 MB each, up to a maximum of 100 files
>>> log.addHandler(fh)
>>> log.setLevel(logging.INFO)
>>> f = open("D://biglog.txt")
>>> for line in f:
...     log.info(line.strip())

Your files will appear as follows:

filename.txt (end of file)
filename.txt.1
filename.txt.2

filename.txt.10 (start of file)

This is a quick and easy way to make a huge log file match your RotatingFileHandler implementation.

Answered By: Alex L

This worked for me

import os

fil = "inputfile"
outfil = "outputfile"

f = open(fil,'r')

numbits = 1000000000

for i in range(0, os.stat(fil).st_size // numbits + 1):
    o = open(outfil + str(i), 'w')
    # readlines(numbits) returns whole lines totalling roughly numbits bytes,
    # with their newline characters intact
    segment = f.readlines(numbits)
    o.writelines(segment)
    o.close()
Answered By: Ryan

I had a requirement to split csv files for import into Dynamics CRM, since the file size limit for import is 8MB and the files we receive are much larger. This program allows the user to input FileNames and LinesPerFile, and then splits the specified files into files with the requested number of lines each. I can’t believe how fast it works!

# user input FileNames and LinesPerFile
FileCount = 1
FileNames = []
while True:
    FileName = raw_input('File Name ' + str(FileCount) + ' (enter "Done" after last File):')
    FileCount = FileCount + 1
    if FileName == 'Done':
        break
    else:
        FileNames.append(FileName)
LinesPerFile = raw_input('Lines Per File:')
LinesPerFile = int(LinesPerFile)

for FileName in FileNames:
    File = open(FileName)

    # get Header row
    for Line in File:
        Header = Line
        break

    FileCount = 0
    Linecount = 1
    for Line in File:

        #skip Header in File
        if Line == Header:
            continue

        #create NewFile with Header every [LinesPerFile] Lines
        if Linecount % LinesPerFile == 1:
            FileCount = FileCount + 1
            NewFileName = FileName[:FileName.find('.')] + '-Part' + str(FileCount) + FileName[FileName.find('.'):]
            NewFile = open(NewFileName,'w')
            NewFile.write(Header)

        NewFile.write(Line)
        Linecount = Linecount + 1

    NewFile.close()
Answered By: Ron Smith

While Ryan Ginstrom’s answer is correct, it does take longer than it should (as he has already noted). Here’s a way to circumvent the multiple calls to itertools.islice by successively iterating over the open file descriptor:

def splitfile(infilepath, chunksize):
    fname, ext = infilepath.rsplit('.',1)
    i = 0
    written = False
    with open(infilepath) as infile:
        while True:
            outfilepath = "{}{}.{}".format(fname, i, ext)
            with open(outfilepath, 'w') as outfile:
                for line in (infile.readline() for _ in range(chunksize)):
                    outfile.write(line)
                written = bool(line)
            if not written:
                break
            i += 1
Answered By: inspectorG4dget

usage – split.py filename splitsizeinkb

import os
import sys

def getfilesize(filename):
   with open(filename,"rb") as fr:
       fr.seek(0,2) # move to end of the file
       size=fr.tell()
       print("getfilesize: size: %s" % size)
       return fr.tell()

def splitfile(filename, splitsize):
   # Open original file in read only mode
   if not os.path.isfile(filename):
       print("No such file as: "%s"" % filename)
       return

   filesize=getfilesize(filename)
   with open(filename,"rb") as fr:
    counter=1
    orginalfilename = filename.split(".")
    readlimit = 5000 #read 5kb at a time
    n_splits = filesize//splitsize
    print("splitfile: No of splits required: %s" % str(n_splits))
    for i in range(n_splits+1):
        chunks_count = int(splitsize)//int(readlimit)
        data_5kb = fr.read(readlimit) # read
        # Create split files
        print("chunks_count: %d" % chunks_count)
        with open(orginalfilename[0]+"_{id}.".format(id=str(counter))+orginalfilename[1],"ab") as fw:
            fw.seek(0) 
            fw.truncate()# truncate original if present
            while data_5kb:                
                fw.write(data_5kb)
                if chunks_count:
                    chunks_count-=1
                    data_5kb = fr.read(readlimit)
                else: break            
        counter+=1 

if __name__ == "__main__":
   if len(sys.argv) < 3: print("Filename or splitsize not provided: Usage:     filesplit.py filename splitsizeinkb ")
   else:
       filesize = int(sys.argv[2]) * 1000  # convert KB into bytes
       filename = sys.argv[1]
       splitfile(filename, filesize)
Answered By: Mudit Verma

Here is a Python script you can use for splitting large files using subprocess:

"""
Splits the file into the same directory and
deletes the original file
"""

import subprocess
import sys
import os

SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2'  # subprocess expects a string, i.e. 2 = aa, ab, ac etc..

if __name__ == "__main__":

    file_path = sys.argv[1]
    # i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
    subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
                     os.path.dirname(file_path) + '/'])

    # Remove the original file once done splitting
    try:
        os.remove(file_path)
    except OSError:
        pass

You can call it externally:

import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))

You can also import subprocess and run it directly in your program.

The issue with this approach is high memory usage: subprocess creates a fork with the same memory footprint as your process, so if your process is already memory-heavy, it doubles that for as long as it runs. The same goes for os.system.

Here is another pure Python way of doing this. I haven’t tested it on huge files; it will be slower but leaner on memory:

CHUNK_SIZE = 5000

def yield_csv_rows(reader, chunk_size):
    """
    Opens file to ingest, reads each line to return list of rows
    Expects the header is already removed
    Replacement for ingest_csv
    :param reader: dictReader
    :param chunk_size: int, chunk size
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk

import unicodecsv

with open(local_file_path, 'rb') as f:
    header = f.readline().strip().replace('"', '')
    reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
    chunks = yield_csv_rows(reader, CHUNK_SIZE)
    for chunk in chunks:
        if not chunk:
            break
        # Do something with your chunk here

Here is another example using readlines():

"""
Simple example using readlines()
where the 'file' is generated via:
seq 10000 > file
"""
CHUNK_SIZE = 5


def yield_rows(reader, chunk_size):
    """
    Yield row chunks
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk


def batch_operation(data):
    for item in data:
        print(item)


with open('file', 'r') as f:
    chunks = yield_rows(f.readlines(), CHUNK_SIZE)
    for _chunk in chunks:
        batch_operation(_chunk)

The readlines example demonstrates how to chunk your data so you can pass chunks to a function that expects them. Unfortunately readlines reads the whole file into memory, so it’s better to use the reader example for performance. Although if you can easily fit what you need into memory and need to process it in chunks, this should suffice.

Answered By: radtek

Now there is a PyPI module available that you can use to split files of any size into chunks. Check it out:

https://pypi.org/project/filesplit/
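
As a sketch only: the filesplit API has changed between releases, so the class and method names below are assumptions based on a recent version; check the project page for the exact API of the version you install.

from filesplit.split import Split

# Split huge.txt into pieces of at most 100 MB each, written to out_dir
# (class/method names assumed from a recent filesplit release; older versions differ).
split = Split("huge.txt", "out_dir")
split.bysize(100 * 2**20)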

Answered By: Ram

You can split any file into chunks as below, where CHUNK_SIZE is 500,000 bytes (500 KB) and content is the file’s contents; a usage sketch follows the code:

def get_chunk(content, size):
    for i in range(0, len(content), size):
        yield content[i:i+size]

for idx, val in enumerate(get_chunk(content, CHUNK_SIZE)):
    data = val
    index = idx
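
For instance, content could come from reading the whole file into memory first, and each chunk can then be written straight out (the file names here are placeholders):

CHUNK_SIZE = 500000

with open("huge.txt", "rb") as f:
    content = f.read()   # note: this loads the entire file into memory

for idx, chunk in enumerate(get_chunk(content, CHUNK_SIZE)):
    with open("huge.txt.part%d" % idx, "wb") as out:
        out.write(chunk)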
Answered By: Ajith

import subprocess
subprocess.run('split -l number_of_lines file_path', shell = True)

For example, if you want 50,000 lines in one file and the path is /home/data, then you can run the command below:

subprocess.run('split -l 50000 /home/data', shell = True)

If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook or shell you can check the total number of lines using the command below and then divide by the number of splits you want:

! wc -l file_path

in this case

! wc -l /home/data

Just so you know, the output files will not have a file extension even though the input file does; you can rename them manually if you are on Windows.

Answered By: Manoj Kumar Singh