Find a specific concordance index using nltk

Question

I use the code below to get a concordance from nltk and then show the index of each match, and I get the results shown below. So far so good.

How do I look up the index of just one specific concordance? It is easy enough to match the concordance to the index in this small example, but if I have 300 concordances, I want to find the index for one.

.index doesn’t take multiple items in a list as an argument.

Can someone point me to the command/structure I should be using to get the indices to display with the concordances? I’ve attached an example below of a more useful result that goes outside nltk to get a separate list of indices. I’d like to combine those into one result, but how do I get there?

import nltk
from nltk.text import Text
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models needed by word_tokenize

# Read the whole novel; the with-block closes the file afterwards.
with open('mobydick.txt', 'r') as moby:
    moby_read = moby.read()

moby_text = Text(word_tokenize(moby_read))

moby_text.concordance("monstrous")

# Token positions of every exact (case-sensitive) match
moby_indices = [index for (index, item) in enumerate(moby_text) if item == "monstrous"]

print(moby_indices)

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
N OF THE PSALMS . `` Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
of Radney . ' '' CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u

[858, 1124, 9359, 9417, 32173, 94151, 122253, 122269, 162203, 205095]

I’d ideally like to have something like this.

Displaying 11 of 11 matches:
[858] ong the former , one was of a most monstrous size . ... This came towards us , 
[1124] N OF THE PSALMS . `` Touching that monstrous bulk of the whale or ork we have r
[9359] ll over with a heathenish array of monstrous clubs and spears . Some were thick
[9417] d as you gazed , and wondered what monstrous cannibal and savage could ever hav
[32173] that has survived the flood ; most monstrous and most mountainous ! That Himmal
[94151] they might scout at Moby Dick as a monstrous fable , or still worse and more de
[122253] of Radney . ' '' CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
[122269] ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
[162203] ere to enter upon those still more monstrous stories of them which are to be fo
[162203] ght have been rummaged out of this monstrous cabinet there is no telling . But 
[205095] e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u

Asked By: David Beales

Answer 1

We can use the concordance_list function (https://www.nltk.org/api/nltk.text.html), which lets us specify the width and the number of lines. We then iterate over the returned lines, read each line's 'offset' (the index of the matched token in the token list), wrap it in brackets '[' ']', and place the roi (i.e. 'monstrous') between the left and right words of each line:

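import nltk

roi = 'monstrous'

# The with-block closes the file once it has been read.
with open('/content/drive/My Drive/Colab Notebooks/DATA_FOLDERS/TEXT/mobydick.txt', 'r') as some_text:
    moby_read = some_text.read()

moby_text = nltk.Text(nltk.word_tokenize(moby_read))
# The name is reused for the list of concordance lines, which the loops below iterate over.
moby_text = moby_text.concordance_list(roi, width=22, lines=1000)
for line in moby_text:
    # line.offset is the token index of the match; line.left and line.right are token lists
    print('[' + str(line.offset) + '] ' + ' '.join(line.left) + ' ' + roi + ' ' + ' '.join(line.right))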

or if you find this more readable (import numpy as np):

for line in moby_text:
    print('[' + str(line.offset) + '] ', np.append(' '.join(np.append(np.array(line.left), roi)), np.array(' '.join(line.right))))

Outputs (my offsets don’t match yours because I used this source: https://gist.github.com/StevenClontz/4445774 which has different spacing and therefore different token positions):

[494] 306 LV . OF THE monstrous PICTURES OF WHALES .
[1385] one was of a most monstrous size . * *
[1652] the Psalms. ' Touching that monstrous bulk of the whale
[9874] with a heathenish array of monstrous clubs and spears .
[9933] gazed , and wondered what monstrous cannibal and savage could
[32736] survived the Flood ; most monstrous and most mountainous !
[95115] scout at Moby-Dick as a monstrous fable , or still
[121328] '' CHAPTER LV OF THE monstrous PICTURES OF WHALES I
[121991] this bookbinder 's fish an monstrous PICTURES OF WHALES 333
[122749] same field , Desmarest , monstrous PICTURES OF WHALES 335
[123525] SCENES IN connection with the monstrous pictures of whales ,
[123541] enter upon those still more monstrous stories of them which

If we want to consider punctuation and all that, we can do something like:

for line in moby_text:
    return_text = '[' + str(line.offset) + '] '
    for word in line.left:
        if word in ('.', ',', ';', '!'):
            # punctuation attaches directly to the previous word
            return_text += word
        else:
            return_text += ' ' + word if return_text[-1] != ' ' else word
    # separate the query from the left context unless a space is already there
    return_text += (roi if return_text[-1] == ' ' else ' ' + roi) + ' '
    for word in line.right:
        if word in ('.', ',', ';', '!'):
            return_text += word
        else:
            return_text += ' ' + word if return_text[-1] != ' ' else word
    print(return_text)

Outputs:

[494] 306 LV. OF THE monstrous PICTURES OF WHALES.
[1385] one was of a most monstrous size. * *
[1652] the Psalms.' Touching that monstrous bulk of the whale
[9874] with a heathenish array of monstrous clubs and spears.
[9933] gazed, and wondered what monstrous cannibal and savage could
[32736] survived the Flood; most monstrous and most mountainous!
[95115] scout at Moby-Dick as a monstrous fable, or still
[121328] '' CHAPTER LV OF THE monstrous PICTURES OF WHALES I
[121991] this bookbinder 's fish an monstrous PICTURES OF WHALES 333
[122749] same field, Desmarest, monstrous PICTURES OF WHALES 335
[123525] SCENES IN connection with the monstrous pictures of whales,
[123541] enter upon those still more monstrous stories of them which

You may have to tweak it, though, as I didn’t put much thought into the different contexts that may arise (e.g. '*', numbers, chapter titles in ALL-CAPS, roman numerals, etc.). How the output text should look is up to you; I’m just providing an example.
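
The spacing rule in the loop above can also be pulled into a small helper. This is only a sketch: join_tokens, format_line and NO_SPACE_BEFORE are names made up for illustration, not part of NLTK, and the punctuation set is an assumption you would extend for your own text.

```python
# Hypothetical helpers (not part of NLTK): join tokens so that
# closing punctuation attaches to the preceding token.
NO_SPACE_BEFORE = {'.', ',', ';', '!'}


def join_tokens(tokens):
    """Join tokens with single spaces, except before punctuation."""
    text = ''
    for token in tokens:
        if text and token not in NO_SPACE_BEFORE:
            text += ' '
        text += token
    return text


def format_line(offset, left, query, right):
    """Build '[offset] left query right' from lists of tokens."""
    return '[{}] {}'.format(offset, join_tokens(list(left) + [query] + list(right)))


# Tokens copied from one of the output lines above.
print(format_line(32736,
                  ['survived', 'the', 'Flood', ';', 'most'],
                  'monstrous',
                  ['and', 'most', 'mountainous', '!']))
```

With a concordance line from concordance_list, the call would be format_line(line.offset, line.left, roi, line.right).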

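Note: width in the concordance_list function is the character width of the formatted line, but it also limits how many tokens end up in line.left and line.right: as far as I can tell from the NLTK source, about width // 4 tokens are kept on each side. So if we set it to 4, the first line would print:

[494] THE monstrous

because 4 // 4 leaves one token of left context, while setting it to 3 leaves none and drops 'THE':

[494] monstrous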

lines in the concordance_list function is the maximum number of lines returned, so if we want only the first two lines containing 'monstrous' (i.e. moby_text.concordance_list(..., lines=2)), we get:

[494] 306 LV . OF THE monstrous PICTURES OF WHALES .
[1385] one was of a most monstrous size . * *
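
The effect of width on the context can be reproduced on a plain token list. The sketch below assumes the slicing rule I see in the NLTK source (about width // 4 tokens per side); context_for is a made-up name, and the token list is taken from the first output line above plus one invented trailing token.

```python
def context_for(tokens, i, width):
    """Return (left, right) context tokens around position i.

    Assumption: mirrors the slicing in NLTK's concordance code, where
    roughly width // 4 tokens of context are kept on each side.
    """
    context = width // 4
    left = tokens[max(0, i - context):i]
    right = tokens[i + 1:i + context]
    return left, right


# First output line above, with a hypothetical extra token at the end.
tokens = '306 LV . OF THE monstrous PICTURES OF WHALES . 307'.split()
i = tokens.index('monstrous')

for width in (22, 4, 3):
    left, right = context_for(tokens, i, width)
    print(width, left, right)
```

For width=22 this keeps five tokens on the left and four on the right, which matches the shape of the output lines shown earlier.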

Answered By: Ori Yarden

Answer 2

There is an offset number in each concordance list object:

from nltk.book import text1

def pad_int(number, length=8, pad_with=" "):
    # Right-align the number in a field of `length` characters.
    return str(number).rjust(length, pad_with)

width = 50
for con_line in text1.concordance_list("monstrous", width=width):
    # Trim the joined context to at most `width` characters on each side.
    left = " ".join(con_line.left).strip()[-width:]
    right = " ".join(con_line.right).strip()[:width]
    offset = pad_int(con_line.offset)
    print(f"[{offset}]\t{left}  {con_line.query}  {right}")

[out]:

[     899]  , appeared . Among the former , one was of a most  monstrous  size . ... This came towards us , open - mouthed
[    1176]   BACON ' S VERSION OF THE PSALMS . " Touching that  monstrous  bulk of the whale or ork we have received nothing 
[    9530]  entry was hung all over with a heathenish array of  monstrous  clubs and spears . Some were thickly set with glit
[    9594]  r . You shuddered as you gazed , and wondered what  monstrous  cannibal and savage could ever have gone a death -
[   32717]  t animated mass that has survived the flood ; most  monstrous  and most mountainous ! That Himmalehan , salt - se
[   96103]  f the fishery , they might scout at Moby Dick as a  monstrous  fable , or still worse and more detestable , a hid
[  122521]  lt since the death of Radney .'" CHAPTER 55 Of the  Monstrous  Pictures of Whales . I shall ere long paint to you
[  124761]  Pictures of Whaling Scenes . In connexion with the  monstrous  pictures of whales , I am strongly tempted here to
[  124777]  rongly tempted here to enter upon those still more  monstrous  stories of them which are to be found in certain b
[  165681]  other marvels might have been rummaged out of this  monstrous  cabinet there is no telling . But a sudden stop wa
[  209645]  which are made of Whale - Bones ; for Whales of a  monstrous  size are oftentimes cast up dead upon that shore .

What does the offset refer to?

It refers to the index of the position of the query token.

from nltk.book import text1

width = 50
for con_line in text1.concordance_list("monstrous", width=width):
  print(text1.tokens[con_line.offset - 2], text1.tokens[con_line.offset - 1], text1.tokens[con_line.offset])

[out]:

a most monstrous
Touching that monstrous
array of monstrous
wondered what monstrous
; most monstrous
as a monstrous
Of the Monstrous
with the monstrous
still more monstrous
of this monstrous
of a monstrous
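
Because the offset is just a token index, the same numbers can be computed without concordance_list, as long as the comparison ignores case the way concordance does. A case-sensitive == "monstrous" skips the capitalised "Monstrous", which probably explains why the list in the question has ten entries for eleven matches. The sketch below is only an illustration: find_offsets and offset_for_context are hypothetical helpers, and the token list is a short excerpt typed in by hand.

```python
def find_offsets(tokens, word):
    """Token positions of word, ignoring case."""
    word = word.lower()
    return [i for i, tok in enumerate(tokens) if tok.lower() == word]


def offset_for_context(tokens, word, context_words, window=5):
    """Offsets of word whose neighbourhood contains all context_words.

    Hypothetical helper: window is counted in tokens on each side.
    """
    wanted = {w.lower() for w in context_words}
    hits = []
    for i in find_offsets(tokens, word):
        nearby = {t.lower() for t in tokens[max(0, i - window):i + window + 1]}
        if wanted <= nearby:
            hits.append(i)
    return hits


tokens = ('Of the Monstrous Pictures of Whales . In connexion with the '
          'monstrous pictures of whales , I am tempted').split()

# Case-sensitive matching misses the capitalised chapter title.
case_sensitive = [i for i, tok in enumerate(tokens) if tok == 'monstrous']

print(case_sensitive)
print(find_offsets(tokens, 'monstrous'))
print(offset_for_context(tokens, 'monstrous', ['connexion']))
```

With hundreds of matches, passing one or two distinctive neighbouring words to offset_for_context narrows the result down to the single concordance you are after.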

Answered By: alvas
