Read random lines from huge CSV file

Question:

I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it.
As far as I can see – and implement – the CSV utility in Python only allows to iterate sequentially in the file.

It’s very memory consuming to read the all file into memory to use some random choosing and it’s very time consuming to go trough all the file and discard some values and choose others, so is there any way to choose some random line from the CSV file and read only that line?

I tried without success:

import csv

with open('linear_e_LAN2A_F_0_435keV.csv') as file:
    reader = csv.reader(file)
    print reader[someRandomInteger]

A sample of the CSV file:

331.093,329.735
251.188,249.994
374.468,373.782
295.643,295.159
83.9058,0
380.709,116.221
352.238,351.891
183.809,182.615
257.277,201.302
61.4598,40.7106
Asked By: jbssm

||

Answers:

import random

filesize = 1500                 #size of the really big file
offset = random.randrange(filesize)

f = open('really_big_file')
f.seek(offset)                  #go to random position
f.readline()                    # discard - bound to be partial line
random_line = f.readline()      # bingo!

# extra to handle last/first line edge cases
if len(random_line) == 0:       # we have hit the end
    f.seek(0)
    random_line = f.readline()  # so we'll grab the first line instead

As @AndreBoos pointed out, this approach will lead to biased selection. If you know min and max length of line you can remove this bias by doing the following:

Let’s assume (in this case) we have min=3 and max=15

1) Find the length (Lp) of the previous line.

Then if Lp = 3, the line is most biased against. Hence we should take it 100% of the time
If Lp = 15, the line is most biased towards. We should only take it 20% of the time as it is 5* more likely selected.

We accomplish this by randomly keeping the line X% of the time where:

X = min / Lp

If we don’t keep the line, we do another random pick until our dice roll comes good. 🙂

Answered By: Maria Zverina

If the lines are truly .csv format and NOT fixed field, then no, there’s not. You can crawl through the file once, indexing the byte offsets for each line, then when later needed only use the index set, but there’s no way to a priori predict the exact location of the line-terminating n character for arbitrary csv files.

Answered By: parselmouth

I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it

Assuming you don’t need exactly 1 million lines and know then number of lines in your CSV file beforehand, you can use reservoir sampling to retrieve your random subset. Simply iterate through your data and for each line determine the chances of the line being selected. That way you only need a single pass of your data.

This works well if you need to extract the random samples often but the actual dataset changes infrequently (since you’ll only need to keep track of the number of entries each time the dataset changes).

chances_selected = desired_num_results / total_entries
for line in csv.reader(file):
   if random() < chances_selected:
        result.append(line)
Answered By: Shawn Chin

Another solution is possible if you know the total number of lines – generate 1 million random numbers (random.sample(xrange(n), 1000000)) up to the total number of lines as a set, then use:

for i, line in enumerate(csvfile):
    if i in lines_to_grab:
        yield line

This will get you exactly 1 million lines in an unbiased way, but you need to have the number of lines beforehand.

Answered By: Thomas K

You can rewrite the file with fixed-length records, and then perform random access on the intermediate file later:

ifile = file.open("inputfile.csv")
ofile = file.open("intermediatefile.csv",'w')
for line in ifile:
    ofile.write(line.rstrip('n').ljust(15)+'n')

Then, you can do:

import random
ifile = file.open("intermediatefile.csv")
lines = []
samples = random.sample(range(nlines))
for sample in samples:
    ifile.seek(sample)
    lines.append(ifile.readline())

Requires more disk space, and the first program may take some time to run, but it allows unlimited later random access to records with the second.

Answered By: Andrew Buss
# pass 1, count the number of rows in the file
rowcount = sum(1 for line in file)
# pass 2, select random lines
file.seek(0)
remaining = 1000000
for row in csv.reader(file):
    if random.randrange(rowcount) < remaining:
        print row
        remaining -= 1
    rowcount -= 1
Answered By: Mark Ransom

You can use a variation of the probabilistic method for choosing a random line in a file.

Instead of just keeping a single number that gets chosen, you can keep a buffer of size C. For each line number, n, in the file with N lines, you want to choose that line with probability C/n (rather than the original 1/n. If the number is selected, you then choose a random location from the C-length buffer to evict.

Here’s how it works:

import random

C = 2
fpath = 'somelines.txt'
buffer = []

f = open(fpath, 'r')
for line_num, line in enumerate(f):
    n = line_num + 1.0
    r = random.random()
    if n <= C:
        buffer.append(line.strip())
    elif r < C/n:
        loc = random.randint(0, C-1)
        buffer[loc] = line.strip()

This requires a single pass through the file (so it’s linear time) and returns exactly C lines from the file. Each line will have probability C/N of being selected.

To verify that the above works, I created a file with 5 lines containing a,b,c,d,e. I ran the code 10,000 times with C=2. This should produce about an even distribution of the 5 choose 2 (so 10) possible choices. The results:

a,b: 1046
b,c: 1018
b,e: 1014
a,c: 1003
c,d: 1002
d,e: 1000
c,e: 993
a,e: 992
a,d: 985
b,d: 947
Answered By: jterrace

If you can place this data in a sqlite3 database, selecting some number of random rows is trivial. You will not need to pre-read or pad lines in the file. Since sqlite data files are binary, you data file will be 1/3 to 1/2 smaller than CSV text.

You can use a script like THIS to import the CSV file or, better still, just write your data to a database table in the first place. SQLITE3 is part of the Python distribution.

Then use these statements to get 1,000,000 random rows:

mydb='csv.db'
con=sqlite3.connect(mydb)

with con:
    cur=con.cursor()
    cur.execute("SELECT * FROM csv ORDER BY RANDOM() LIMIT 1000000;")

    for row in cur.fetchall():
        # now you have random rows...
Answered By: dawg

In this method, we generate a random number set whose number of elements is equal to the number of lines to be read, with its range being the number of rows present in the data. It is then sorted from smallest to largest and stored.

Then the csv file is read line by line and a line_counter is in place to denote the row number. This line_counter is then checked with the first element of the sorted random number list and if they are same then that specific line is written into the new csv file and the first element is removed from the list and the previously second element takes the place of the first and the cycle continues.

import random
k=random.sample(xrange(No_of_rows_in_data),No_of_lines_to_be_read)
Num=sorted(k)    
line_counter = 0

with open(input_file,'rb') as file_handle:
    reader = csv.reader(file_handle)
    with open(output_file,'wb') as outfile:
            a=csv.writer(outfile)
            for line in reader:
                line_counter += 1
                if line_counter == Num[0]:
                a.writerow(line)
                Num.remove(Num[0])
                if len(Num)==0:
                break    
Answered By: Kothandaraman R

If you want to grab random lines many times (e.g., mini-batches for machine learning), and you don’t mind scanning through the huge file once (without loading it into memory), then you can create a list of line indeces and use seek to quickly grab the lines (based off of Maria Zverina’s answer).

# Overhead:
# Read the line locations into memory once.  (If the lines are long,
# this should take substantially less memory than the file itself.)
fname = 'big_file'
s = [0]
linelocs = [s.append(s[0]+len(n)) or s.pop(0) for n in open(fname)]
f = open(fname) # Reopen the file.

# Each subsequent iteration uses only the code below:
# Grab a 1,000,000 line sample
# I sorted these because I assume the seeks are faster that way.
chosen = sorted(random.sample(linelocs, 1000000))
sampleLines = []
for offset in chosen:
  f.seek(offset)
  sampleLines.append(f.readline())
# Now we can randomize if need be.
random.shuffle(sampleLines)
Answered By: Marctar

If you can use pandas and numpy, I have posted a solution in another question that is pandas specific but very efficient:

import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 1000000
batch_size = 5000

rng = np.random.default_rng()

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

sample = sample_reader.get_chunk(sample_size)

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk

For more details, please see the other answer.

Answered By: LeoRochael
def random_line(path, hint=1):
    with open(path, mode='rb') as file:
        import random
        while file.seek(random.randrange(file.seek(-2, 2))) and not file.readline(hint).endswith(b'n'):
            pass
        return file.readline().decode().strip()

This is what i wrote for reading a random line from a very large file .

The time complexity is O(k) ,k is the average length of the lines in the text file.

The hint argument is the minimal length of the lines in the text file ,if you know it beforehand ,use it to speed up the function .

Answered By: iouvxz

Always works for me

import csv
import random
randomINT = random.sample(range(1, 72655), 40000)
with open(file.csv,"rU") as fp:
    reader = csv.reader(fp, delimiter=",", quotechar='"', dialect=csv.excel_tab)
    data_read = [row for idx, row in enumerate(reader) if idx in randomINT]
    for idx, line in enumerate(data_read):
        pass
Answered By: joydeba
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.