Randomly mix lines of 3 million-line file

Question:

Everything is in the title. I’m wondering if anyone knows a quick way, with reasonable memory demands, of randomly mixing all the lines of a 3-million-line file. I guess it is not possible with a simple vim command, so any simple Python script would do. I tried in Python using a random number generator, but did not manage to find a simple way out.

Asked By: Nigu


Answers:

Takes only a few seconds in Python:

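import random

# Read every line into memory, shuffle in place, then write back to the same file.
with open('3mil.txt') as f:
    lines = f.readlines()
random.shuffle(lines)
with open('3mil.txt', 'w') as f:
    f.writelines(lines)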
Answered By: John Kugelman

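On many systems, the sort shell command accepts -R to randomize its input. Note that GNU sort -R orders lines by a random hash of their content, so identical lines end up next to each other; it is a true shuffle only when all lines are distinct.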

Answered By: fuzzyTew

import random
with open('the_file','r') as source:
    data = [ (random.random(), line) for line in source ]
data.sort()
with open('another_file','w') as target:
    for _, line in data:
        target.write( line )

That should do it. 3 million lines will fit into most machines’ memory unless the lines are huge (over 512 characters each, which would come to roughly 1.5 GB of text).
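
If the lines really are huge, one option is to keep only each line's byte offset in memory and shuffle those instead of the lines themselves. The following is a minimal sketch under that assumption; the function name shuffle_by_offsets and its parameters are made up for illustration, and the random seeks it performs may be slow on spinning disks.

```python
import random


def shuffle_by_offsets(src_path, dst_path, seed=None):
    """Write the lines of src_path to dst_path in random order.

    Only one integer per line (its starting byte offset) is held in memory.
    """
    rng = random.Random(seed)
    offsets = []
    with open(src_path, "rb") as src:
        # First pass: record where each line starts.
        pos = 0
        for line in src:
            offsets.append(pos)
            pos += len(line)

        # Shuffle the offsets, not the lines.
        rng.shuffle(offsets)

        # Second pass: seek to each offset and copy that line out.
        with open(dst_path, "wb") as dst:
            for off in offsets:
                src.seek(off)
                line = src.readline()
                # A final line without a newline must not fuse with the next one.
                if not line.endswith(b"\n"):
                    line += b"\n"
                dst.write(line)
```

Calling shuffle_by_offsets('the_file', 'another_file') would mirror the file names used above.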

Answered By: S.Lott

Here’s another version

At the shell, use this.

python decorate.py < the_file | sort | python undecorate.py > another_file

decorate.py

import sys
import random
for line in sys.stdin:
    sys.stdout.write( "{0}|{1}".format( random.random(), line ) )

undecorate.py

import sys
for line in sys.stdin:
    # Drop the random key and the "|" separator; keep only the original line.
    _, _, data = line.partition("|")
    sys.stdout.write(data)

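The Python scripts use almost no memory; sort does the heavy lifting and spills to temporary files when the input does not fit in RAM.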

Answered By: S.Lott

This is the same as Mr. Kugelman’s, but using vim’s built-in python interface:

:py import vim, random as r; cb = vim.current.buffer ; l = cb[:] ; r.shuffle(l) ; cb[:] = l
Answered By: sleepynate

If you do not want to load everything into memory and shuffle it there, you have to keep the lines on disk while shuffling. That will be very slow.

Here is a very simple, stupid and slow version. Note that it may take a surprising amount of disk space (one temporary file per line), and it will be very slow. I ran it with 300,000 lines, and it took several minutes; 3 million lines could very well take an hour. So: do it in memory. Really. It’s not that big.

import os
import tempfile
import shutil
import random
tempdir = tempfile.mkdtemp()
print(tempdir)

files = []
# Split the lines:
with open('/tmp/sorted.txt', 'rt') as infile:
    counter = 0    
    for line in infile:
        outfilename = os.path.join(tempdir, '%09i.txt' % counter)
        with open(outfilename, 'wt') as outfile:
            outfile.write(line)
        counter += 1
        files.append(outfilename)

with open('/tmp/random.txt', 'wt') as outfile:
    while files:
        index = random.randint(0, len(files) - 1)
        filename = files.pop(index)
        # Close each temporary file as soon as its line has been copied.
        with open(filename, 'rt') as infile:
            outfile.write(infile.read())

shutil.rmtree(tempdir)

Another version would be to store the lines in an SQLite database and pull them from that database in random order. That is probably going to be faster than this.
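
The SQLite idea could look roughly like the sketch below. It is an illustration rather than tested advice: the function name shuffle_with_sqlite, the table and column names, and the db_path parameter are all assumptions. Each line is stored with a random key, and the lines are read back ordered by that key, so SQLite does the sorting on disk.

```python
import random
import sqlite3


def shuffle_with_sqlite(src_path, dst_path, db_path, seed=None):
    """Shuffle the lines of src_path into dst_path via an SQLite table."""
    rng = random.Random(seed)
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("DROP TABLE IF EXISTS lines")
        conn.execute("CREATE TABLE lines (sort_key REAL, line TEXT)")

        # Stream the file into the table, tagging each line with a random key.
        with open(src_path, "r") as src:
            conn.executemany(
                "INSERT INTO lines (sort_key, line) VALUES (?, ?)",
                ((rng.random(), line) for line in src),
            )
        conn.commit()

        # Read the lines back ordered by their random key.
        with open(dst_path, "w") as dst:
            query = "SELECT line FROM lines ORDER BY sort_key"
            for (line,) in conn.execute(query):
                # A final line without a newline must not fuse with the next one.
                dst.write(line if line.endswith("\n") else line + "\n")
    finally:
        conn.close()
```

The database file at db_path is left behind and can be deleted once the output has been written.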

Answered By: Lennart Regebro

I just tried this on a file with 4.3 million lines, and the fastest option was the shuf command on Linux. Use it like this:

shuf huge_file.txt -o shuffled_lines_huge_file.txt

It took 2-3 seconds to finish.

Answered By: Drag0

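Here is another way using random.choice. It may relieve memory gradually as lines are written out, but at the cost of a worse Big-O: each lines.remove(l) is a linear scan, so the whole loop is quadratic in the number of lines 🙂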

from random import choice

with open('data.txt', 'r') as r:
    lines = r.readlines()

with open('shuffled_data.txt', 'w') as w:
    while lines:
        l = choice(lines)
        lines.remove(l)
        w.write(l)
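
The quadratic cost comes from removing an element from the middle of a list. A possible variation, sketched here with a hypothetical helper named drain_in_random_order, swaps the chosen element to the end and pops it, which is constant time per line while still shrinking the list as it goes.

```python
import random


def drain_in_random_order(items, rng=random):
    """Yield the elements of items in random order, emptying the list."""
    while items:
        i = rng.randrange(len(items))
        # Swap the chosen element to the end, then pop it off in O(1).
        items[i], items[-1] = items[-1], items[i]
        yield items.pop()
```

With the variables from the code above, the write loop would become w.writelines(drain_in_random_order(lines)).
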
Answered By: Aziz Alto

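The following Vimscript moves randomly chosen lines to random positions. It relies on the external shuf command; adjust nswaps, firstline and lastline (hard-coded here to 100, 1 and 10) to match your buffer: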

function! Random()                                                       
  let nswaps = 100                                                       
  let firstline = 1                                                     
  let lastline = 10                                                      
  let i = 0                                                              
  while i <= nswaps                                                      
    exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
    exe line.'d'                                                         
    exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
    exe "normal! " . line . 'Gp'                                         
    let i += 1                                                           
  endwhile                                                               
endfunction

Select the function in visual mode, yank it, and type :@" to define it; then execute it with :call Random()

Answered By: builder-7000

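This will do the trick. My solution doesn’t even use random, and it also removes duplicates. Note that the order comes from iterating over a set, which is arbitrary rather than uniformly random.

import sys

# Building a set drops duplicate lines; its iteration order is arbitrary.
with open(sys.argv[1]) as f:
    lines = list(set(f.readlines()))
# Each line already ends with a newline, so join without a separator.
sys.stdout.write(''.join(lines))

In the shell: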

python shuffler.py nameoffilestobeshuffled.txt > shuffled.txt
Answered By: Kumaresp

This is not necessarily a solution to your problem; I am keeping it here for people who come looking for a way to shuffle a much bigger file. It works for smaller files as well: change split -C 1GB to a smaller size, e.g. split -C 100MB, to produce many text files of at most 100 MB each.

I had a 20GB file containing more than 1.5 billion sentences. Calling the shuf command in the Linux terminal simply overwhelmed both my 16GB of RAM and a swap area of the same size. This is a bash script I wrote to get the job done. It assumes that you keep the bash script in the same folder as your big text file. Note that each chunk is shuffled on its own, so lines never move between chunks; the result is shuffled locally, not across the whole file.

#!/bin/bash

# Create a temporary folder named "splitted"
mkdir ./splitted

# Split the input file into multiple smaller files (at most 1GB each).
# -C (GNU split) keeps lines whole; -b would cut lines at chunk boundaries.
echo "Splitting big txt file..."
split -C 1GB ./your_big_file.txt ./splitted/file --additional-suffix=.txt
echo "Done."

# Shuffle each small file in place
echo "Shuffling splitted txt files..."
for entry in ./splitted/*.txt
do
  shuf "$entry" -o "$entry"
done
echo "Done."

# Concatenate the shuffled pieces into one big text file
echo "Concatenating shuffled txt files into 1 file..."
cat ./splitted/* > ./your_big_file_shuffled.txt
echo "Done."

# Delete the temporary "splitted" folder
rm -rf ./splitted
echo "Complete."
Answered By: Akib Sadmanee
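
For a file too large for memory, a shuffle across the whole file is also possible by scattering lines into random bucket files first and then shuffling each bucket in memory. The sketch below illustrates that idea in Python; the name shuffle_large_file and the n_buckets parameter are assumptions, and n_buckets has to be chosen so that every bucket fits in RAM while staying under the open-file limit.

```python
import os
import random
import tempfile


def shuffle_large_file(src_path, dst_path, n_buckets=64, seed=None):
    """Shuffle a file larger than memory using temporary bucket files."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, "bucket_%04d" % i)
                 for i in range(n_buckets)]
        buckets = [open(p, "wb") for p in paths]
        try:
            # Pass 1: send every line to a bucket chosen uniformly at random.
            with open(src_path, "rb") as src:
                for line in src:
                    if not line.endswith(b"\n"):
                        line += b"\n"
                    buckets[rng.randrange(n_buckets)].write(line)
        finally:
            for b in buckets:
                b.close()

        # Pass 2: shuffle each bucket in memory and append it to the output.
        with open(dst_path, "wb") as dst:
            for p in paths:
                with open(p, "rb") as f:
                    chunk = f.readlines()
                rng.shuffle(chunk)
                dst.writelines(chunk)
```

Because the bucket of each line is random and each bucket is shuffled before being written, any line can end up anywhere in the output, unlike a split by position.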