Can spaCy's text categorizer learn the logic of recognizing two words in order?

Question:

I’m trying to determine whether spaCy’s text categorizer can learn a simple rule: detecting the presence of two consecutive words in a given order, as in "jhon died". For this experiment, the only results that matter after training are the outputs for the same texts used as training samples, but I have been unable to get it to match only "jhon died" and not "died jhon". Is spaCy’s textcat unable to consider the order of the tokens during categorization?

The training, dev, and test sets are repetitions of these 4 samples:

    rows.append(["jhon died", 1])
    rows.append(["died jhon", 0])
    rows.append(["died", 0])
    rows.append(["jhon", 0])

These are the set sizes:
Total: 76 – Train: 57 – Dev: 11 – Test: 8
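For reference, splits of those sizes can be produced from the 4 rows with something like this (the repetition count and slice points are assumptions inferred from the reported totals):

import random

rows = []
for _ in range(19):  # 19 repetitions * 4 samples = 76 in total
    rows.append(["jhon died", 1])
    rows.append(["died jhon", 0])
    rows.append(["died", 0])
    rows.append(["jhon", 0])

random.shuffle(rows)
train_data, dev_data, test_data = rows[:57], rows[57:68], rows[68:]  # 57 / 11 / 8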

I populate all sets with:

db = spacy.tokens.DocBin()
for doc, label in nlp.pipe(data, as_tuples=True):
    # Store the gold label as two mutually exclusive category scores.
    doc.cats["POS"] = float(label == 1)
    doc.cats["NEG"] = float(label == 0)
    db.add(doc)

db.to_disk(outfile)
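Presumably this snippet runs once per split; a minimal sketch of the surrounding driver code, continuing the hypothetical splits above (the blank pipeline and the .spacy file names are assumptions):

import spacy

nlp = spacy.blank("en")  # assumption: a blank English pipeline for tokenization

# Hypothetical driver: write one DocBin per split.
for data, outfile in [(train_data, "train.spacy"),
                      (dev_data, "dev.spacy"),
                      (test_data, "test.spacy")]:
    db = spacy.tokens.DocBin()
    for doc, label in nlp.pipe(data, as_tuples=True):
        doc.cats["POS"] = float(label == 1)
        doc.cats["NEG"] = float(label == 0)
        db.add(doc)
    db.to_disk(outfile)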

The config is generated with:

python -m spacy init config --lang en --pipeline textcat --optimize efficiency --force config.cfg
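The actual training run is not shown above, but with the generated config it would presumably be invoked like this (the output directory and data paths are assumptions):

python -m spacy train config.cfg --output ./model --paths.train ./train.spacy --paths.dev ./dev.spacy

The testing code below loads ./model/model-best, which is where spacy train writes its best-scoring checkpoint.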

When testing this:

texts = ["jhon", "jhon died", "died", "died jhon", "died fast", "fast jhon"]

nlp = spacy.load("./model/model-best")
for text in texts:
    doc = nlp(text)
    diff = doc.cats['POS'] - doc.cats['NEG']
    print("yes" if diff > 0 else ("no" if diff < 0 else "neither") ,  "-",  text, doc.cats)

I get:

no - jhon {'POS': 0.1631753146648407, 'NEG': 0.8368247151374817}
no - jhon died {'POS': 0.4730854034423828, 'NEG': 0.5269145965576172}
no - died {'POS': 0.1631753146648407, 'NEG': 0.8368247151374817}
no - died jhon {'POS': 0.4730854034423828, 'NEG': 0.5269145965576172}
no - died fast {'POS': 0.1631753146648407, 'NEG': 0.8368247151374817}
no - fast jhon {'POS': 0.1631753146648407, 'NEG': 0.8368247151374817}

If I change the classification of the "died jhon" sample (rows.append(["died jhon", 0])), then I get this:

no - jhon {'POS': 0.21423980593681335, 'NEG': 0.785760223865509}
yes - jhon died {'POS': 0.8561566472053528, 'NEG': 0.1438433676958084}
no - died {'POS': 0.21423980593681335, 'NEG': 0.785760223865509}
yes - died jhon {'POS': 0.8561566472053528, 'NEG': 0.1438433676958084}
no - died fast {'POS': 0.21423980593681335, 'NEG': 0.785760223865509}
no - fast jhon {'POS': 0.21423980593681335, 'NEG': 0.785760223865509}

The results I’m expecting should match the original samples, like this:

no - jhon {...}
yes - jhon died {...}
no - died {...}
no - died jhon {...}
no - died fast {...} // Result doesn't matter here.
no - fast jhon {...} // Result doesn't matter here.

Here is the colab I’m working on for reference:
https://colab.research.google.com/drive/1rnYhc-h4e0VlgatWzy1Z3-1rNbd0bGvM#scrollTo=tzXLe-IahuA5

Asked By: jacmkno


Answers:

Yes, it can, though it seems impractical to use the train command for trivial examples like this. (A likely reason the CLI run failed: with --optimize efficiency, init config typically selects a bag-of-words textcat architecture, which ignores token order, while the default textcat factory used below is an ensemble model that can capture it.)

The following code does exactly what is requested, using just the default optimizer and basic updates to the model:

import spacy
from spacy.training import Example

samples = [
  ["jhon died", 1],
  ["died jhon", 0],
  ["died", 0],
  ["jhon", 0]
]

for r in samples:
  print(r)

def train(samples, repetitions):
  nlp = spacy.blank("en")

  # Add a fresh text categorizer with the two labels.
  textcat = nlp.add_pipe("textcat")
  textcat.add_label("POS")
  textcat.add_label("NEG")

  # initialize() returns the default optimizer.
  optimizer = nlp.initialize()
  for i in range(repetitions):
    for raw_text, label in samples:
      # "predicted" is the model's current output; "reference" carries the gold labels.
      predicted = nlp(raw_text)
      reference = nlp(raw_text)
      reference.cats["POS"] = float(label == 1)
      reference.cats["NEG"] = float(label == 0)
      example = Example(predicted=predicted, reference=reference)
      nlp.update([example], sgd=optimizer)

  return nlp

# It seems 10 iterations gives better results than 5 for new words
train(samples, 10).to_disk('./test-model')
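As a side note, the reference Doc can also be built with Example.from_dict instead of setting doc.cats by hand; a minimal sketch of the equivalent inner-loop step:

# Inside the training loop: build the Example from a dict of gold annotations.
predicted = nlp.make_doc(raw_text)
example = Example.from_dict(
    predicted,
    {"cats": {"POS": float(label == 1), "NEG": float(label == 0)}},
)
nlp.update([example], sgd=optimizer)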

Testing the model:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe("textcat")
nlp.from_disk("./test-model")

texts = ["jhon died", "died jhon", "died", "jhon", "jhon walked", "died smiled"]
for text in texts:
    doc = nlp(text)
    diff = doc.cats['POS'] - doc.cats['NEG']
    print("yes" if diff > 0 else ("no" if diff < 0 else "neither") ,  "-",  text, doc.cats)

Final Output:

yes - jhon died {'POS': 0.997473418712616, 'NEG': 0.002526533557102084}
no - died jhon {'POS': 0.0009508572984486818, 'NEG': 0.9990491271018982}
no - died {'POS': 0.0012573363492265344, 'NEG': 0.9987426400184631}
no - jhon {'POS': 0.0008163611637428403, 'NEG': 0.9991835951805115}
no - jhon walked {'POS': 0.44277048110961914, 'NEG': 0.5572295188903809}
no - died smiled {'POS': 0.014941525645554066, 'NEG': 0.9850584268569946}

Here is the working colab for reference: https://colab.research.google.com/drive/1rnYhc-h4e0VlgatWzy1Z3-1rNbd0bGvM

Answered By: jacmkno