Best Way to Count Occurrences of Each Character in a Large Dataset

Question:

I am trying to count the number of occurrences of each character within a large dataset. For example, if the data was the numpy array ['A', 'AB', 'ABC'] then I would want {'A': 3, 'B': 2, 'C': 1} as the output. I currently have an implementation that looks like this:

import string
import numpy as np

char_count = {}
for c in string.printable:
    char_count[c] = np.char.count(data, c).sum()

The issue I am having is that this takes too long for my data. I have ~14,000,000 different strings that I would like to count and this implementation is not efficient for that amount of data. Any help is appreciated!

Asked By: damp_floor_sign


Answers:

One approach:

import numpy as np
from collections import defaultdict

data = np.array(['A', 'AB', 'ABC'])

counts = defaultdict(int)
for e in data:
    for c in e:
        counts[c] += 1

print(counts)

Output

defaultdict(<class 'int'>, {'A': 3, 'B': 2, 'C': 1})

Note that your code iterates over data len(string.printable) times; in contrast, my proposal iterates over it only once.

One alternative using a dictionary:

data = np.array(['A', 'AB', 'ABC'])

counts = dict()
for e in data:
    for c in e:
        counts[c] = counts.get(c, 0) + 1

print(counts)
Answered By: Dani Mesejo

Another way.

import collections

c = collections.Counter()
for thing in data:  # data as defined in the question
    c.update(thing)

Same basic advantage: it only iterates over the data once.
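If the per-element Python loop is still too slow for ~14 million strings, one variant worth trying (a sketch, not benchmarked here) is to concatenate everything into a single string first and hand that to Counter, so the character counting happens in one pass at the C level rather than one Counter.update call per string:

```python
from collections import Counter

import numpy as np

data = np.array(['A', 'AB', 'ABC'])

# Join all strings into one, then count every character in a single pass.
counts = Counter(''.join(data))

print(counts)  # Counter({'A': 3, 'B': 2, 'C': 1})
```

The trade-off is memory: ''.join(data) materializes a copy of all the characters at once, which may matter for a very large dataset.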

Answered By: wwii