Find a specific concordance index using nltk
Question:
I use the code below to get a concordance from nltk and then show the index of each match. I get the results shown below. So far so good.
How do I look up the index of just one specific concordance line? It is easy enough to match each line to its index in this small example, but if I have 300 lines, I want to find the index for just one of them. .index doesn’t take multiple items in a list as an argument.
Can someone point me to the command/structure I should be using to get the indices to display with the concordance lines? I’ve attached an example below of a more useful result that goes outside nltk to get a separate list of indices. I’d like to combine those into one result, but how do I get there?
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.text import Text
moby = open('mobydick.txt', 'r')
moby_read = moby.read()
moby_text = nltk.Text(nltk.word_tokenize(moby_read))
moby_text.concordance("monstrous")
moby_indices = [index for (index, item) in enumerate(moby_text) if item == "monstrous"]
print(moby_indices)
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
N OF THE PSALMS . `` Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
of Radney . ' '' CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u
[858, 1124, 9359, 9417, 32173, 94151, 122253, 122269, 162203, 205095]
I’d ideally like to have something like this.
Displaying 11 of 11 matches:
[858] ong the former , one was of a most monstrous size . ... This came towards us ,
[1124] N OF THE PSALMS . `` Touching that monstrous bulk of the whale or ork we have r
[9359] ll over with a heathenish array of monstrous clubs and spears . Some were thick
[9417] d as you gazed , and wondered what monstrous cannibal and savage could ever hav
[32173] that has survived the flood ; most monstrous and most mountainous ! That Himmal
[94151] they might scout at Moby Dick as a monstrous fable , or still worse and more de
[122253] of Radney . ' '' CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
[122269] ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
[162203] ere to enter upon those still more monstrous stories of them which are to be fo
[162203] ght have been rummaged out of this monstrous cabinet there is no telling . But
[205095] e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u
Answers:
We can use the concordance_list function (https://www.nltk.org/api/nltk.text.html), which lets us specify the width and the number of lines, and then iterate over the lines, taking each line's offset (i.e. the token index of the match, not a line number), wrapping it in brackets '[' ']', and putting roi (i.e. 'monstrous') between the left and right words of each line:
some_text = open('/content/drive/My Drive/Colab Notebooks/DATA_FOLDERS/TEXT/mobydick.txt', 'r')
roi = 'monstrous'
moby_read = some_text.read()
moby_text = nltk.Text(nltk.word_tokenize(moby_read))
moby_text = moby_text.concordance_list(roi, width=22, lines=1000)
for line in moby_text:
    print('[' + str(line.offset) + '] ' + ' '.join(line.left) + ' ' + roi + ' ' + ' '.join(line.right))
or if you find this more readable (import numpy as np):
for line in moby_text:
    print('[' + str(line.offset) + '] ', np.append(' '.join(np.append(np.array(line.left), roi)), np.array(' '.join(line.right))))
Outputs (my offsets don’t match yours because I used this source: https://gist.github.com/StevenClontz/4445774, which has different spacing and line breaks and so tokenizes differently):
[494] 306 LV . OF THE monstrous PICTURES OF WHALES .
[1385] one was of a most monstrous size . * *
[1652] the Psalms. ' Touching that monstrous bulk of the whale
[9874] with a heathenish array of monstrous clubs and spears .
[9933] gazed , and wondered what monstrous cannibal and savage could
[32736] survived the Flood ; most monstrous and most mountainous !
[95115] scout at Moby-Dick as a monstrous fable , or still
[121328] '' CHAPTER LV OF THE monstrous PICTURES OF WHALES I
[121991] this bookbinder 's fish an monstrous PICTURES OF WHALES 333
[122749] same field , Desmarest , monstrous PICTURES OF WHALES 335
[123525] SCENES IN connection with the monstrous pictures of whales ,
[123541] enter upon those still more monstrous stories of them which
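To pick out one specific line rather than reading through all of them, the list can be filtered on its context. This is only a minimal sketch: Line is a hypothetical stand-in for the objects returned by concordance_list (just the left, query, right and offset fields used above), and offsets_followed_by is an illustrative helper name, not part of nltk.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by concordance_list;
# only the fields used in this example are modelled.
Line = namedtuple("Line", ["left", "query", "right", "offset"])


def offsets_followed_by(lines, next_word):
    """Return the offsets of lines whose first right-context token
    equals next_word (compared case-insensitively)."""
    wanted = next_word.lower()
    return [ln.offset for ln in lines
            if ln.right and ln.right[0].lower() == wanted]


# Sample data copied from the output above.
lines = [
    Line(["a", "most"], "monstrous", ["size", "."], 1385),
    Line(["Touching", "that"], "monstrous", ["bulk", "of"], 1652),
    Line(["as", "a"], "monstrous", ["fable", ","], 95115),
]

print(offsets_followed_by(lines, "fable"))  # [95115]
```

With real data, passing the list returned by concordance_list in place of lines should behave the same way, since the helper only reads .right and .offset.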
If we want punctuation attached to the preceding word instead of floating on its own, we can do something like:
for line in moby_text:
    left_words = [left_word for left_word in line.left]
    right_words = [right_word for right_word in line.right]
    return_text = '[' + str(line.offset) + '] '
    for word in left_words:
        if any([word == '.', word == ',', word == ';', word == '!']):
            # punctuation is glued to the previous word
            return_text += word
        else:
            return_text += ' ' + word if return_text[-1] != ' ' else word
    # make sure exactly one space separates the left context from the query word
    return_text += (roi + ' ') if return_text[-1] == ' ' else (' ' + roi + ' ')
    for word in right_words:
        if any([word == '.', word == ',', word == ';', word == '!']):
            return_text += word
        else:
            return_text += ' ' + word if return_text[-1] != ' ' else word
    print(return_text)
Outputs:
[494] 306 LV. OF THE monstrous PICTURES OF WHALES.
[1385] one was of a most monstrous size. * *
[1652] the Psalms.' Touching that monstrous bulk of the whale
[9874] with a heathenish array of monstrous clubs and spears.
[9933] gazed, and wondered what monstrous cannibal and savage could
[32736] survived the Flood; most monstrous and most mountainous!
[95115] scout at Moby-Dick as a monstrous fable, or still
[121328] '' CHAPTER LV OF THE monstrous PICTURES OF WHALES I
[121991] this bookbinder 's fish an monstrous PICTURES OF WHALES 333
[122749] same field, Desmarest, monstrous PICTURES OF WHALES 335
[123525] SCENES IN connection with the monstrous pictures of whales,
[123541] enter upon those still more monstrous stories of them which
but you may have to tweak it, as I didn’t put a lot of thought into the different contexts that may arise (e.g. '*', numbers, chapter titles in ALL-CAPS, roman numerals, etc.). How the output text should look is really up to you; I’m just providing an example.
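The same punctuation handling can be factored into a small helper so the left context, the query word and the right context are all joined by one rule. This is a sketch under my own assumptions: join_tokens, format_line and the ATTACH_LEFT set are hypothetical names, and the set of punctuation marks is just a starting point to extend.

```python
# Punctuation that should attach to the previous token (an assumed, extendable set).
ATTACH_LEFT = {".", ",", ";", ":", "!", "?"}


def join_tokens(tokens):
    """Join tokens with single spaces, gluing closing punctuation
    to the token before it."""
    out = ""
    for tok in tokens:
        if out and tok not in ATTACH_LEFT:
            out += " "
        out += tok
    return out


def format_line(offset, left, query, right):
    """Format one concordance line as '[offset] left query right'."""
    return f"[{offset}] {join_tokens(list(left) + [query] + list(right))}"


print(format_line(9874, ["array", "of"], "monstrous",
                  ["clubs", "and", "spears", "."]))
# [9874] array of monstrous clubs and spears.
```

Joining everything in one pass avoids having to special-case the space around the query word.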
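Note: width in the concordance_list function controls how much context is kept on each side of the match. It is not the length of a single word: as far as I can tell from the NLTK source, it is the target width of the printed line in characters, and the left and right token lists hold roughly width // 4 tokens (treat the exact formula as an assumption and check your NLTK version). So if we set it to 4, only one token of left context survives and the first line would print:
[494] THE monstrous
and setting it to 3 leaves no context at all, cutting off 'THE', the word just left of 'monstrous':
[494] monstrous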
lines in the concordance_list function is the maximum number of lines returned, so if we want only the first two lines containing 'monstrous' (i.e. moby_text.concordance_list(..., lines=2)):
[494] 306 LV . OF THE monstrous PICTURES OF WHALES .
[1385] one was of a most monstrous size . * *
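One way to picture what the two parameters do is a plain slice around each matching token, followed by a cap on how many matches are kept. The sketch below is a simplified illustration and not NLTK's actual code; context_window, first_matches and the context parameter are hypothetical names.

```python
def context_window(tokens, i, context):
    """Return (left, right): up to `context` tokens on each side of position i."""
    left = tokens[max(0, i - context):i]
    right = tokens[i + 1:i + 1 + context]
    return left, right


def first_matches(tokens, word, lines):
    """Return the positions of at most `lines` case-insensitive matches of word."""
    hits = [i for i, tok in enumerate(tokens) if tok.lower() == word.lower()]
    return hits[:lines]


tokens = "OF THE monstrous PICTURES OF WHALES ; a monstrous size".split()

print(context_window(tokens, 2, 1))            # (['THE'], ['PICTURES'])
print(context_window(tokens, 2, 0))            # ([], [])
print(first_matches(tokens, "monstrous", 1))   # [2]
```

A smaller context gives shorter left and right lists, and a smaller lines value simply stops after the first few matches.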
Each object returned by concordance_list carries an offset attribute:
from itertools import zip_longest
from nltk.book import text1

def pad_int(number, length=8, pad_with=" "):
    # left-pad the number with pad_with up to `length` characters
    return "".join(reversed([ch if ch else pad_with for i, ch in zip_longest(range(length), reversed(str(number)))]))

width = 50
for con_line in text1.concordance_list("monstrous", width=width):
    left, right = " ".join(con_line.left).strip()[-width:], " ".join(con_line.right).strip()[:width]
    offset = pad_int(con_line.offset)
    print(f"[{offset}]\t{left} {con_line.query} {right}")
[out]:
[ 899] , appeared . Among the former , one was of a most monstrous size . ... This came towards us , open - mouthed
[ 1176] BACON ' S VERSION OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have received nothing
[ 9530] entry was hung all over with a heathenish array of monstrous clubs and spears . Some were thickly set with glit
[ 9594] r . You shuddered as you gazed , and wondered what monstrous cannibal and savage could ever have gone a death -
[ 32717] t animated mass that has survived the flood ; most monstrous and most mountainous ! That Himmalehan , salt - se
[ 96103] f the fishery , they might scout at Moby Dick as a monstrous fable , or still worse and more detestable , a hid
[ 122521] lt since the death of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere long paint to you
[ 124761] Pictures of Whaling Scenes . In connexion with the monstrous pictures of whales , I am strongly tempted here to
[ 124777] rongly tempted here to enter upon those still more monstrous stories of them which are to be found in certain b
[ 165681] other marvels might have been rummaged out of this monstrous cabinet there is no telling . But a sudden stop wa
[ 209645] which are made of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead upon that shore .
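The padding helper can also be written with Python's built-in string alignment, which may be easier to read. A minimal sketch, reusing the pad_int name from above and three of the offsets from the output:

```python
def pad_int(number, length=8, pad_with=" "):
    """Right-align the number in a field of `length` characters."""
    return str(number).rjust(length, pad_with)


for offset in (899, 1176, 122521):
    # The f-string format spec ">8" produces the same result as pad_int(offset).
    print(f"[{offset:>8}]", f"[{pad_int(offset)}]")
```

Both forms leave numbers longer than the field untouched rather than truncating them.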
What does the offset refer to?
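It is the index of the query token in the text's list of tokens, so indexing text1.tokens with it returns the matched word itself:
from nltk.book import text1

width = 50
for con_line in text1.concordance_list("monstrous", width=width):
    # print the two tokens before the match, then the match itself
    print(text1.tokens[con_line.offset - 2], text1.tokens[con_line.offset - 1], text1.tokens[con_line.offset])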
[out]:
a most monstrous
Touching that monstrous
array of monstrous
wondered what monstrous
; most monstrous
as a monstrous
Of the Monstrous
with the monstrous
still more monstrous
of this monstrous
of a monstrous
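This also suggests why the list comprehension in the question returns 10 indices for 11 matches: the concordance output includes the capitalized "Monstrous", while a comparison with == is case-sensitive. A small sketch on a made-up token list (the tokens here are illustrative, not taken from the corpus):

```python
tokens = ["Of", "the", "Monstrous", "Pictures", ";", "a", "monstrous", "size"]

# Case-sensitive comparison, as in the question: misses "Monstrous".
exact = [i for i, tok in enumerate(tokens) if tok == "monstrous"]

# Lower-casing each token first also catches the capitalized form.
folded = [i for i, tok in enumerate(tokens) if tok.lower() == "monstrous"]

print(exact)   # [6]
print(folded)  # [2, 6]
```

Using the offset attribute avoids the problem, since it comes from the same matching step that produced the concordance line.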