Counting bigrams (pair of two words) in a file using Python
Question:
I want to count the number of occurrences of all bigrams (pairs of adjacent words) in a file using Python. I am dealing with very large files, so I am looking for an efficient way. I tried using the count method with the regex “\w+\s\w+” on the file contents, but it did not prove efficient.
E.g., let’s say I want to count the number of bigrams from a file a.txt, which has the following content:
"the quick person did not realize his speed and the quick person bumped "
For the above file, the bigrams and their counts will be:
(the, quick) = 2
(quick, person) = 2
(person, did) = 1
(did, not) = 1
(not, realize) = 1
(realize, his) = 1
(his, speed) = 1
(speed, and) = 1
(and, the) = 1
(person, bumped) = 1
I have come across an example of Counter objects in Python, which are used to count unigrams (single words). It also uses a regex approach.
The example goes like this:
>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> Counter(words).most_common(10)
The output of the above code is:
[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1),
('realize', 1), ('his', 1), ('speed', 1), ('and', 1), ('bumped', 1)]
I was wondering if it is possible to use the Counter object to get counts of bigrams.
Any approach other than Counter objects or regex would also be appreciated.
Answers:
Some itertools magic:
>>> import re
>>> from collections import Counter
>>> from itertools import islice
>>> words = re.findall(r"\w+",
...     "the quick person did not realize his speed and the quick person bumped")
>>> print(Counter(zip(words, islice(words, 1, None))))
Output:
Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1,
('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1,
('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1,
('realize', 'his'): 1})
Bonus: get the frequency of any n-gram:
from itertools import tee, islice

def ngrams(lst, n):
    tlst = lst
    while True:
        # Duplicate the iterator: `a` is consumed to build the current n-gram,
        # `b` is advanced by one to become the start of the next window.
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break
>>> Counter(ngrams(words, 3))
Output:
Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1,
('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1,
('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1,
('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1,
('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})
This works with lazy iterables and generators too. So you can write a generator which reads a file line by line, yielding words, and pass it to ngrams to consume lazily without reading the whole file into memory.
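For instance, a minimal sketch of such a streaming setup (words_from_file is an illustrative helper name, and a.txt stands in for your file):

import re
from collections import Counter

def words_from_file(path):
    # Yield words one line at a time, so the whole file never sits in memory.
    with open(path) as fh:
        for line in fh:
            yield from re.findall(r"\w+", line)

# ngrams() is the generator defined above; it consumes the word stream lazily.
print(Counter(ngrams(words_from_file("a.txt"), 2)))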
How about zip()?
import re
from collections import Counter

words = re.findall(r'\w+', open('a.txt').read())
print(Counter(zip(words, words[1:])))
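Note that words[1:] is the word list offset by one, so zip pairs each word with its successor; since zip stops at the shorter of its two arguments, the final word is dropped without any special-casing.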
It has been a long time since this question was asked and successfully answered. I benefited from the responses to create my own solution, which I would like to share:
import regex
bigrams_tst = regex.findall(r"\b\w+\s\w+", open(myfile).read(), overlapped=True)
This will find all bigrams that are not interrupted by punctuation.
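To turn those overlapped matches into the same counts as the other answers (a small follow-up sketch, not part of the original snippet), each matched string can be split into a word pair and passed to Counter:

from collections import Counter

# Each match is a string like "the quick"; split it into a (word, word) tuple.
bigram_counts = Counter(tuple(pair.split()) for pair in bigrams_tst)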
You can simply use Counter for any n-gram, like so:
from collections import Counter
from nltk.util import ngrams
text = "the quick person did not realize his speed and the quick person bumped "
n_gram = 2
Counter(ngrams(text.split(), n_gram))
Output:
Counter({('and', 'the'): 1,
('did', 'not'): 1,
('his', 'speed'): 1,
('not', 'realize'): 1,
('person', 'bumped'): 1,
('person', 'did'): 1,
('quick', 'person'): 2,
('realize', 'his'): 1,
('speed', 'and'): 1,
('the', 'quick'): 2})
For 3-grams, just change n_gram to 3:
n_gram = 3
Counter(ngrams(text.split(), n_gram))
Output:
Counter({('and', 'the', 'quick'): 1,
('did', 'not', 'realize'): 1,
('his', 'speed', 'and'): 1,
('not', 'realize', 'his'): 1,
('person', 'did', 'not'): 1,
('quick', 'person', 'bumped'): 1,
('quick', 'person', 'did'): 1,
('realize', 'his', 'speed'): 1,
('speed', 'and', 'the'): 1,
('the', 'quick', 'person'): 2})
Starting in Python 3.10, the new pairwise function provides a way to slide through pairs of consecutive elements, so your use case simply becomes:
from itertools import pairwise
import re
from collections import Counter
# text = "the quick person did not realize his speed and the quick person bumped "
Counter(pairwise(re.findall(r'\w+', text)))
# Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, ('did', 'not'): 1, ('not', 'realize'): 1, ('realize', 'his'): 1, ('his', 'speed'): 1, ('speed', 'and'): 1, ('and', 'the'): 1, ('person', 'bumped'): 1})
Details for intermediate results:
re.findall(r'\w+', text)
# ['the', 'quick', 'person', 'did', 'not', 'realize', 'his', ...]
pairwise(re.findall(r'\w+', text))
# [('the', 'quick'), ('quick', 'person'), ('person', 'did'), ...]
One can use CountVectorizer from scikit-learn (pip install scikit-learn) to generate the bigrams (or, more generally, any n-gram).
Example (tested with Python 3.6.7 and scikit-learn 0.24.2).
import sklearn.feature_extraction.text
ngram_size = 2
train_set = ['the quick person did not realize his speed and the quick person bumped']
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
vectorizer.fit(train_set) # build ngram dictionary
ngram = vectorizer.transform(train_set) # get ngram
print('ngram: {0}\n'.format(ngram))
print('ngram.shape: {0}'.format(ngram.shape))
print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))
Output:
>>> print('ngram: {0}\n'.format(ngram)) # Shows the bigram counts
ngram: (0, 0) 1
(0, 1) 1
(0, 2) 1
(0, 3) 1
(0, 4) 1
(0, 5) 1
(0, 6) 2
(0, 7) 1
(0, 8) 1
(0, 9) 2
>>> print('ngram.shape: {0}'.format(ngram.shape))
ngram.shape: (1, 10)
>>> print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))
vectorizer.vocabulary_: {'the quick': 9, 'quick person': 6, 'person did': 5, 'did not': 1,
'not realize': 3, 'realize his': 7, 'his speed': 2, 'speed and': 8, 'and the': 0,
'person bumped': 4}
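If you also want the bigram-to-count mapping that the other answers produce, the sparse matrix can be paired with the learned vocabulary. A sketch, assuming scikit-learn >= 1.0 (older versions spell the method get_feature_names instead of get_feature_names_out):

# Pair each learned bigram with its count in the (single) training document.
counts = dict(zip(vectorizer.get_feature_names_out(), ngram.toarray()[0]))
print(counts)  # {'and the': 1, 'did not': 1, ..., 'the quick': 2}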