How to match text in two different files and extract values

Question:

I have two files: a YAML file that maps Tibetan words to their meanings, and a CSV file that contains only each word and its POS tag, as shown below:

YAML file:

ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།
ད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།
ད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།
ད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།

CSV file:

ད་ཆུ PART
ད་གདོད DET

Desired output:

ད་ཆུ PART དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་གདོད DET ད་གཟོད་དང་དོན་འདྲ།

Any idea how to match the words in the CSV file against the YAML file and add each word's meaning to the CSV output?

Asked By: lungsang


Answers:

The easiest solution that comes to mind is to iterate over all lines in the YAML file and check whether the word is in the CSV file:

# Sample data standing in for the contents of the two files.
YAML_LINES = "ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།\nད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན།\nད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།\nད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།\nད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།".split("\n")
CSV_LINES = "ད་ཆུ\nད་གདོད".split("\n")


for line in YAML_LINES:
    # Each YAML line has the form "word: meaning"; split on the first ": " only.
    word, meaning = line.split(": ", 1)

    if word in CSV_LINES:
        output = word + " " + meaning
        print(output)

The YAML_LINES and CSV_LINES lists are only to provide a quick and dirty example.

Answered By: noah

Assuming your files are called dict.yml and input.csv, you can start by turning the YAML file into a dictionary with

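import yaml  # provided by the PyYAML package

# Load the word -> meaning mapping from the YAML file.
with open('dict.yml', 'r', encoding='utf-8') as file:
    trans_dict = yaml.safe_load(file)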

This should give you

>>> trans_dict

{'ད་གདོད': 'ད་གཟོད་དང་དོན་འདྲ།',
 'ད་ཆུ': 'དངུལ་ཆུ་ཡི་མིང་གཞན།',
 'ད་ཕྲུག': 'དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།',
 'ད་བེར': 'སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།',
 'ད་མེ་དུམ་མེ': 'དམ་དུམ་ལ་ལྟོས།'}

Then, you can iterate over the lines in the CSV and use the dictionary to get the definition:

outputs = []
with open('input.csv', 'r', encoding='utf-8') as file:
    for line in file:
        term = line.strip()
        # Look up the meaning; None means the term is not in the dictionary.
        definition = trans_dict.get(term)
        outputs.append(
            term if definition is None
            else f"{term} {definition}"
        )

From here, assuming each line of the CSV holds only the word, your outputs variable should contain ['ད་ཆུ དངུལ་ཆུ་ཡི་མིང་གཞན།', 'ད་གདོད ད་གཟོད་དང་དོན་འདྲ།']. If you also wanted to write this out to a file, you could do

with open('output.txt', 'w', encoding='utf-8') as file:
    file.write('\n'.join(outputs))

If you had more tokens on each line of the CSV (unclear from your post), you could iterate over those tokens within a line, but you’d be able to apply basically the same approach.
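The sample in the question does show a POS tag after each word, so here is a minimal sketch of that variant. It assumes each line is a word followed by whitespace and a tag, and that the word itself contains no whitespace. The helper name `annotate` is made up for illustration, and the small `trans_dict` here just stands in for the one loaded from the YAML file.

```python
def annotate(lines, trans_dict):
    """Yield 'word TAG meaning' for each 'word TAG' line (hypothetical helper)."""
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]  # assumption: the word is the first field
        meaning = trans_dict.get(word)
        # Keep the original fields and append the meaning when one is found.
        yield " ".join(parts if meaning is None else parts + [meaning])


# Small in-memory example using data from the question.
trans_dict = {
    "ད་གདོད": "ད་གཟོད་དང་དོན་འདྲ།",
    "ད་ཆུ": "དངུལ་ཆུ་ཡི་མིང་གཞན།",
}
csv_lines = ["ད་ཆུ PART", "ད་གདོད DET"]
result = list(annotate(csv_lines, trans_dict))
```

Since `annotate` accepts any iterable of lines, you could pass it the open file object for input.csv instead of the list.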

Answered By: RagingRoosevelt

From a functional point of view, you have:

  • a dictionary, meaning here a key: value mapping
  • a list of words to look up in that dictionary, each of which will produce a record

If everything can fit in memory, you can first read the YAML file to produce a Python dictionary, and then read the words file one line at a time and use that dictionary to generate the expected line. If the YAML file is too large, you could use the dbm (or shelve) module as an on-disk dictionary.
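As a rough sketch of the on-disk option, the standard-library shelve module can hold the word-to-meaning pairs in a file and be queried like a dictionary. The function names `build_shelf` and `lookup` and the shelf name "tibetan_dict" are made up for illustration, and the temporary directory is only there to keep the example self-contained.

```python
import os
import shelve
import tempfile


def build_shelf(path, pairs):
    """Store (word, meaning) pairs in an on-disk shelf (hypothetical helper)."""
    with shelve.open(path) as db:
        for word, meaning in pairs:
            db[word] = meaning


def lookup(path, words):
    """Return the stored meaning for each word, or None when it is absent."""
    with shelve.open(path, flag="r") as db:
        return [db.get(word) for word in words]


# Self-contained demo: the shelf lives in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    shelf_path = os.path.join(tmp, "tibetan_dict")
    build_shelf(shelf_path, [("ད་ཆུ", "དངུལ་ཆུ་ཡི་མིང་གཞན།")])
    found = lookup(shelf_path, ["ད་ཆུ", "missing"])
```

In real use you would build the shelf once from the YAML file and reopen it for later lookups, so the whole dictionary never has to sit in memory.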

As you have not shown any code, I cannot show any either… I can only say that you can simply process the second file as plain text and read it one line at a time. For the first one, you can either look for a YAML module on PyPI or, if the syntax is always as simple as the lines you have shown, process it as text one line at a time and use split to extract the key and the value.
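The plain-text route could look something like the sketch below. It is not a general YAML parser: it assumes one `key: value` entry per line, with no quoting, nesting or comments, as in the lines shown in the question. The function name `parse_simple_yaml` is made up for illustration, and it uses `str.partition` rather than `split` so that only the first colon separates key from value.

```python
def parse_simple_yaml(lines):
    """Parse flat 'key: value' lines into a dict (not a general YAML parser)."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only, so colons in the value are kept.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon: not a 'key: value' line
        result[key.strip()] = value.strip()
    return result


# Example with two lines from the question plus a blank line.
sample = ["ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།", "ད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན།", ""]
parsed = parse_simple_yaml(sample)
```

Passing an open file object instead of the list reads the YAML file one line at a time.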

Answered By: Serge Ballesta