How to get domain of words using WordNet in Python?

Question:

How can I find the domain of words using the nltk Python module and WordNet?

Suppose I have words like (transaction, Demand Draft, cheque, passbook), and the domain for all of these words is “BANK”. How can I get this using nltk and WordNet in Python?

I have been trying the hypernym and hyponym relationships:

For example:

from nltk.corpus import wordnet as wn
sports = wn.synset('sport.n.01')
sports.hyponyms()
[Synset('judo.n.01'), Synset('athletic_game.n.01'), Synset('spectator_sport.n.01'),    Synset('contact_sport.n.01'), Synset('cycling.n.01'), Synset('funambulism.n.01'), Synset('water_sport.n.01'), Synset('riding.n.01'), Synset('gymnastics.n.01'), Synset('sledding.n.01'), Synset('skating.n.01'), Synset('skiing.n.01'), Synset('outdoor_sport.n.01'), Synset('rowing.n.01'), Synset('track_and_field.n.01'), Synset('archery.n.01'), Synset('team_sport.n.01'), Synset('rock_climbing.n.01'), Synset('racing.n.01'), Synset('blood_sport.n.01')]

and

bark = wn.synset('bark.n.02')
bark.hypernyms()
[Synset('noise.n.01')]
Asked By: Madhusudan


Answers:

There is no explicit domain information in either the Princeton WordNet or NLTK’s WN API.

I would recommend getting a copy of the WordNet Domains resource and then linking your synsets through its domains; see http://wndomains.fbk.eu/

After you’ve registered and completed the download, you will find a wn-domains-3.2-20070223 text file. It is tab-delimited: the first column is the offset-PartOfSpeech identifier, and the second column contains the domain tags separated by spaces, e.g.

00584282-v  military pedagogy
00584395-v  military school university
00584526-v  animals pedagogy
00584634-v  pedagogy
00584743-v  school university
00585097-v  school university
00585271-v  pedagogy
00585495-v  pedagogy
00585683-v  psychological_features

Then you can use the following script to access synsets’ domain(s) (note the tab separator in the split, and that recent NLTK versions use the Synset.offset() method rather than a property):

from collections import defaultdict
from nltk.corpus import wordnet as wn

# Load the WordNet Domains mappings.
domain2synsets = defaultdict(list)
synset2domains = defaultdict(list)
for i in open('wn-domains-3.2-20070223', 'r'):
    ssid, doms = i.strip().split('\t')
    doms = doms.split()
    synset2domains[ssid] = doms
    for d in doms:
        domain2synsets[d].append(ssid)

# Get domains given a synset.
for ss in wn.all_synsets():
    ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
    if synset2domains[ssid]:  # not all synsets are in WordNet Domains.
        print(ss, ssid, synset2domains[ssid])

# Get synsets given a domain.
for dom in sorted(domain2synsets):
    print(dom, domain2synsets[dom][:3])
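To sanity-check the parsing step without downloading the full resource, the three sample entries shown earlier can be inlined; this is a minimal sketch using only the standard library:

```python
from collections import defaultdict

# Sample entries from wn-domains-3.2-20070223 (tab-separated),
# inlined here for illustration.
sample = (
    "00584282-v\tmilitary pedagogy\n"
    "00584395-v\tmilitary school university\n"
    "00584526-v\tanimals pedagogy\n"
)

domain2synsets = defaultdict(list)
synset2domains = {}
for line in sample.strip().splitlines():
    ssid, doms = line.split('\t')          # tab-separated columns
    synset2domains[ssid] = doms.split()    # space-separated domain tags
    for d in doms.split():
        domain2synsets[d].append(ssid)

print(synset2domains['00584395-v'])  # ['military', 'school', 'university']
print(domain2synsets['military'])    # ['00584282-v', '00584395-v']
```

The same two dictionaries are what the full script builds from the complete file.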

Also look at wn-affect, which is very useful for disambiguating words by sentiment within the WordNet Domains resource.


NLTK v3.0 ships with the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/), and since the French synsets share the same offset IDs, you can simply use WND as a crosslingual resource. The French lemma names can be accessed like this:

# Get domains given a synset.
for ss in wn.all_synsets():
    ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
    if synset2domains[ssid]:  # not all synsets are in WordNet Domains.
        print(ss, ss.lemma_names('fra'), ssid, synset2domains[ssid])  # 'fra' = ISO 639-3 code for French

Note that the most recent version of NLTK changes synset properties to “get” functions: Synset.offset -> Synset.offset()
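The 8-digit key format used throughout can be reproduced with a tiny helper, independent of NLTK; this is a sketch, using the verb offset 584282 from the sample file as an illustrative input:

```python
def wnd_key(offset: int, pos: str) -> str:
    """Build the zero-padded 'offset-pos' identifier used by WordNet Domains."""
    return str(offset).zfill(8) + "-" + pos

print(wnd_key(584282, "v"))  # 00584282-v
```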

Answered By: alvas

As @alvas suggests, you can use WordNetDomains. You have to download both WordNet 2.0 (in its current state, WordNetDomains does not support the sense inventory of WordNet 3.0, which is the default version of WordNet used by NLTK) and WordNetDomains.

  • WordNet2.0 can be downloaded from here

  • WordNetDomains can be downloaded from here (after having been granted permission).

I have created a very simple Python API that loads both resources in Python3.x and provides some common routines you might need (such as getting a set of domains linked to a given term, or to a given synset, etc.). The data load of WordNetDomains is from @alvas.

This is what it looks like (with most comments omitted):

from collections import defaultdict
from nltk.corpus import WordNetCorpusReader
from os.path import exists


class WordNetDomains:
    def __init__(self, wordnet_home):
        #This class assumes you have downloaded WordNet2.0 and WordNetDomains and that they are on the same data home.
        assert exists(f'{wordnet_home}/WordNet-2.0'), f'error: missing WordNet-2.0 in {wordnet_home}'
        assert exists(f'{wordnet_home}/wn-domains-3.2'), f'error: missing WordNetDomains in {wordnet_home}'

        # load WordNet2.0
        self.wn = WordNetCorpusReader(f'{wordnet_home}/WordNet-2.0/dict', 'WordNet-2.0/dict')

        # load WordNetDomains (based on https://stackoverflow.com/a/21904027/8759307)
        self.domain2synsets = defaultdict(list)
        self.synset2domains = defaultdict(list)
        for i in open(f'{wordnet_home}/wn-domains-3.2/wn-domains-3.2-20070223', 'r'):
            ssid, doms = i.strip().split('\t')
            doms = doms.split()
            self.synset2domains[ssid] = doms
            for d in doms:
                self.domain2synsets[d].append(ssid)

    def get_domains(self, word, pos=None):
        word_synsets = self.wn.synsets(word, pos=pos)
        domains = []
        for synset in word_synsets:
            domains.extend(self.get_domains_from_synset(synset))
        return set(domains)

    def get_domains_from_synset(self, synset):
        # Default to an empty list, matching the list values stored above.
        return self.synset2domains.get(self._askey_from_synset(synset), [])

    def get_synsets(self, domain):
        return [self._synset_from_key(key) for key in self.domain2synsets.get(domain, [])]

    def get_all_domains(self):
        return set(self.domain2synsets.keys())

    def _synset_from_key(self, key):
        offset, pos = key.split('-')
        return self.wn.synset_from_pos_and_offset(pos, int(offset))

    def _askey_from_synset(self, synset):
        return self._askey_from_offset_pos(synset.offset(), synset.pos())

    def _askey_from_offset_pos(self, offset, pos):
        return str(offset).zfill(8) + "-" + pos
Answered By: Alex Moreo

You can also use the spaCy library via the spacy-wordnet extension; see the code below.

The code is taken from the official spacy-wordnet page, https://pypi.org/project/spacy-wordnet/:

import spacy

from spacy_wordnet.wordnet_annotator import WordnetAnnotator

# Load a spacy model (supported languages are "es" and "en")
nlp = spacy.load('en')
nlp.add_pipe(WordnetAnnotator(nlp.lang), after='tagger')
token = nlp('prices')[0]

# The wordnet object links the spacy token with the nltk wordnet interface,
# giving access to synsets and lemmas
token._.wordnet.synsets()
token._.wordnet.lemmas()

# And automatically tags with wordnet domains
token._.wordnet.wordnet_domains()

# Imagine we want to enrich the following sentence with synonyms
sentence = nlp('I want to withdraw 5,000 euros')

# spaCy WordNet lets you find synonyms by domain of interest,
# for example economy
economy_domains = ['finance', 'banking']
enriched_sentence = []

# For each token in the sentence
for token in sentence:
    # We get those synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(economy_domains)
    if synsets:
        lemmas_for_synset = []
        for s in synsets:
            # If we found a synset in the economy domains
            # we get the variants and add them to the enriched sentence
            lemmas_for_synset.extend(s.lemma_names())
        enriched_sentence.append('({})'.format('|'.join(set(lemmas_for_synset))))
    else:
        enriched_sentence.append(token.text)

# Let's see our enriched sentence
print(' '.join(enriched_sentence))
# >> I (need|want|require) to (draw|withdraw|draw_off|take_out) 5,000 euros
Answered By: sel

Branching off of @sel’s answer, I used spacy_wordnet (which uses nltk.wordnet under the hood).

import spacy
from spacy_wordnet.wordnet_annotator import WordnetAnnotator  # must be imported for pipe creation

nlp = spacy.load("en_core_web_md")  # I was using medium, but may be able to get away with small

# this adds `wordnet` capabilities to your tokens when processed by the `nlp` pipeline
nlp.add_pipe("spacy_wordnet", after="tagger", config={"lang": nlp.lang})

# your words
words = ["transaction", "Demand Draft", "cheque", "passbook"]

for word in words:
    # process text with spacy
    doc: spacy.tokens.Doc = nlp(word)
    
    for token in doc:
        # get all wordnet domains for token
        token_wordnet_domains = token._.wordnet.wordnet_domains()
        print(token, token_wordnet_domains)

As an example for the word "transaction", this will print out:

transaction ['social', 'diplomacy', 'book_keeping', 'money', 'finance', 'industry', 'economy', 'telephony', 'tax', 'exchange', 'betting', 'law', 'commerce', 'insurance', 'banking', 'enterprise']

You can check if "banking" is in the domains with a conditional:

for word in words:
    # convert each word into a spacy.tokens.Doc
    doc: spacy.tokens.Doc = nlp(word)
    
    for token in doc:
        # get all wordnet domains for token
        token_wordnet_domains = token._.wordnet.wordnet_domains()
        # print(token, token_wordnet_domains)
        print(token, "banking" in token_wordnet_domains)

Output:

transaction True
Demand True
Draft True
cheque True
passbook True
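If you want a single label shared by all the words rather than a per-word boolean, you could intersect the domain sets. Here is a minimal sketch with plain Python sets; the domain lists are abbreviated, illustrative versions of the spacy-wordnet output above, not the full lists:

```python
# Abbreviated, illustrative domain sets per word (not the full lists
# printed by spacy-wordnet above).
domains = {
    "transaction": {"finance", "banking", "commerce", "economy"},
    "cheque": {"banking", "money", "finance"},
    "passbook": {"banking"},
}

# The domains common to every word in the group.
shared = set.intersection(*domains.values())
print(shared)  # {'banking'}
```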
Answered By: Ian Thompson