How to efficiently use CountVectorizer to get ngram counts for all files in a directory combined?

Question:

I have around 10k .bytes files in my directory and I want to use CountVectorizer to get n-gram counts (i.e. fit on the train set and transform the test set).
Of those 10k files, 8k are train and 2k are test.

files = 
['bfiles/GhHS0zL9cgNXFK6j1dIJ.bytes',
 'bfiles/8qCPkhNr1KJaGtZ35pBc.bytes',
 'bfiles/bLGq2tnA8CuxsF4Py9RO.bytes',
 'bfiles/C0uidNjwV8lrPgzt1JSG.bytes',
 'bfiles/IHiArX1xcBZgv69o4s0a.bytes',
    ...............................
    ...............................]

print(open(files[0]).read())
    'A4 AC 4A 00 AC 4F 00 00 51 EC 48 00 57 7F 45 00 2D 4B 42 45 E9 77 51 4D 89 1D 19 40 30 01 89 45 E7 D9 F6 47 E7 59 75 49 1F ....'

I can’t do something like below and pass everything to CountVectorizer.

file_content = []
for file in files:
    file_content.append(open(file).read())

I can’t append each file’s text to one big list and then pass it to CountVectorizer, because the combined size of all the files exceeds 150 GB. I don’t have the resources to do that, because CountVectorizer uses a huge amount of memory.

I need a more efficient way of solving this. Is there some other way I can achieve what I want without loading everything into memory at once? Any help is much appreciated.

All I could manage was to read one file and then use CountVectorizer, but I don’t know how to scale that to all the files.

cv = CountVectorizer(ngram_range=(1, 4))
temp = cv.fit_transform([open(files[0]).read()])
temp
<1x451500 sparse matrix of type '<class 'numpy.int64'>'
    with 335961 stored elements in Compressed Sparse Row format>
Asked By: user_12


Answers:

The sklearn documentation states that .fit_transform can take an iterable which yields either str, unicode or file objects. So you can create a generator which yields your files’ contents one by one and pass it to the fit method. You can create such a generator from the path to your directory as shown below:

import os

def gen(path):
    # Yield one file's contents at a time so that only a single
    # file is ever held in memory
    for name in os.listdir(path):
        with open(os.path.join(path, name)) as f:
            yield f.read()

Now you can create your generator and pass it on to CountVectorizer as follows:

q = gen("/path/to/your/file/")

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 4))
cv.fit_transform(q)
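
A related option, not part of the original answer and offered only as a minimal sketch: CountVectorizer can also open the files itself if you pass input='filename', so you only ever hand it a list of paths. The bfiles directory and .bytes extension here are taken from the question.

import glob
from sklearn.feature_extraction.text import CountVectorizer

# Collect the paths; with input='filename' the vectorizer reads each file on demand
files = glob.glob('bfiles/*.bytes')

cv = CountVectorizer(input='filename', ngram_range=(1, 4))
X = cv.fit_transform(files)  # one sparse row of n-gram counts per file

Note that the vocabulary is still built in memory either way, so a wide ngram_range can remain the limiting factor.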

You can build a solution using the following flow:

1) Loop through your files and create a set of all tokens in your files. In the example below this is done using Counter, but you can use Python sets to achieve the same result (a short sketch of the set-based variant follows this list). The bonus here is that Counter will also give you the total number of occurrences of each term.

2) Fit CountVectorizer with the set/list of tokens. You can instantiate CountVectorizer with ngram_range=(1, 4); below, this is avoided in order to limit the number of features in df_new_data.

3) Transform new data as usual.
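
As a minimal sketch of the set-based variant mentioned in step 1 (assuming the whitespace-separated hex tokens and the files list of paths from the question), the vocabulary can be collected without keeping counts:

# Build the vocabulary one file at a time; only the set of distinct
# tokens is kept in memory, never the combined text of all files
vocab = set()
for path in files:
    with open(path) as f:
        vocab.update(f.read().split())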

The example below works on small data. I hope you can adapt the code to suit your needs.

import glob
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Create a list of file names
pattern = r'C:\Bytes\*.csv'
csv_files = glob.glob(pattern)

# Instantiate Counter and loop through the files chunk by chunk
# to create a dictionary of all tokens and their number of occurrences
counter = Counter()
c_size = 1000
for file in csv_files:
    for chunk in pd.read_csv(file, chunksize=c_size, index_col=0, header=None):
        counter.update(chunk[1])

# Fit the CountVectorizer to the counter keys
vectorizer = CountVectorizer(lowercase=False)
vectorizer.fit(list(counter.keys()))

# Loop through your files chunk by chunk and accumulate the counts
counts = np.zeros((1, len(vectorizer.get_feature_names_out())))
for file in csv_files:
    for chunk in pd.read_csv(file, chunksize=c_size,
                             index_col=0, header=None):
        new_counts = vectorizer.transform(chunk[1])
        counts += new_counts.A.sum(axis=0)

# Generate a data frame with the total counts
df_new_data = pd.DataFrame(counts, columns=vectorizer.get_feature_names_out())

df_new_data
Out[266]: 
      00     01     0A     0B     10     11     1A     1B     A0     A1  
0  258.0  228.0  286.0  251.0  235.0  273.0  259.0  249.0  232.0  233.0   

      AA     AB     B0     B1     BA     BB  
0  248.0  227.0  251.0  254.0  255.0  261.0  

Code for the generation of the data:

import numpy as np
import pandas as pd

def gen_data(n): 
    numbers = list('01')
    letters = list('AB')
    numlet = numbers + letters
    x = np.random.choice(numlet, size=n)
    y = np.random.choice(numlet, size=n)
    df = pd.DataFrame({'X': x, 'Y': y})
    return df.sum(axis=1)

n = 2000
df_1 = gen_data(n)
df_2 = gen_data(n)

df_1.to_csv(r'C:\Bytes\df_1.csv')
df_2.to_csv(r'C:\Bytes\df_2.csv')

df_1.head()
Out[218]: 
0    10
1    01
2    A1
3    AB
4    1A
dtype: object
Answered By: KRKirov

By using a generator instead of a list, your code won’t store the contents of all your files in memory. Instead, it will yield one value, forget it, then yield the next, and so on. Here, I’ll take your code and make one simple tweak to turn the list into a generator expression: just use () instead of [].

cv = CountVectorizer(ngram_range=(1, 4))
temp = cv.fit_transform((open(file).read() for file in files))
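
The same lazy approach covers the train/test split from the question; a sketch, assuming train_files and test_files are lists holding the 8k and 2k paths (those names are not from the original answer):

# Fit the vocabulary on the training files, reading them lazily,
# then reuse that vocabulary to transform the test files
cv = CountVectorizer(ngram_range=(1, 4))
X_train = cv.fit_transform(open(f).read() for f in train_files)
X_test = cv.transform(open(f).read() for f in test_files)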
Answered By: Darren Christopher