Compare strings with a data frame

Question:

I am trying to compare strings with elements inside a dataframe.
My strings in the file are like this:

0100000000
0001000000

I would like to check every line from my file against the position column of the dataframe and, if a line matches, print the line and its corresponding vector from the dataframe. Something like this:

0100000000 01
0001000000 01

I have this code so far; it’s basic, and I don’t know how to continue:

import pandas as pd

data_f = pd.DataFrame(
    {'position':
        {0: '1000000000',
         1: '0100000000',
         2: '0010000000',
         3: '0001000000',
         4: '0000100000',
         5: '0000010000'},
     'vector': {0: '10', 1: '01', 2: '10', 3: '01', 4: '01', 5: '01'}})

with open("/test_2vec/example_vec61.txt", "r") as f1:
    for vec in f1:
        print(vec)
Asked By: Vykov


Answers:

Approach #1 (iterative)

Iterate over the file’s lines and check whether each line occurs in the position column of the dataframe df (data_f in the question):

with open('yourfile.txt') as fin:
    for line in fin:
        line = line.strip()  # drop the trailing newline before comparing
        # vectors of all rows whose position equals this line
        vec = df.loc[df['position'].eq(line), 'vector'].values
        if vec.size:
            print(line, vec[0])
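
As a possible variation on the loop above, here is a minimal sketch that builds a plain dict from the two columns once, so each line is a single lookup instead of a scan of the whole column. It assumes the position values are unique. The io.StringIO object and its three sample lines are stand-ins for the real file, and the names lookup and matches are illustrative only.

```python
import io
import pandas as pd

data_f = pd.DataFrame({
    'position': ['1000000000', '0100000000', '0010000000',
                 '0001000000', '0000100000', '0000010000'],
    'vector': ['10', '01', '10', '01', '01', '01'],
})

# Build the position -> vector mapping once (assumes positions are unique).
lookup = dict(zip(data_f['position'], data_f['vector']))

# Stand-in for the real file; the last line has no match in the dataframe.
fin = io.StringIO("0100000000\n0001000000\n1111111111\n")

matches = []
for line in fin:
    line = line.strip()
    if line in lookup:
        matches.append((line, lookup[line]))
        print(line, lookup[line])
```

Lines are reported in the order they appear in the file, and lines with no matching position are skipped.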

Approach #2 (merging, the shorter one)

Load the text file with lines to another dataframe to merge it with the initial one on matched lines.

df2 = pd.read_table('yourfile.txt', header=None, dtype=str)
matched_df = df.merge(df2, left_on='position', right_on=0)
print(matched_df.to_string(columns=['position', 'vector'], header=False, index=False))

The output (for the initial input):

0000000000000000000000000001000000000000000000000000000000000 01
0000000000010000000000000000000000000000000000000000000000000 01
0000000000000000000000000000000000000010000000000000000000000 01
0000000000000000000000000000000000000000000000000000000000100 10
0000000000000000000000000000000010000000000000000000000000000 10
0000000000100000000000000000000000000000000000000000000000000 10
0000000000001000000000000000000000000000000000000000000000000 01
0100000000000000000000000000000000000000000000000000000000000 01
0000000000000100000000000000000000000000000000000000000000000 10
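
The dtype=str argument matters for the merge: without it, pandas parses the lines as integers and drops their leading zeros, so nothing would match the string positions. A small self-contained sketch of this, assuming the short ten-character sample data from the question and using io.StringIO in place of the real file (the names text, as_int and as_str are illustrative):

```python
import io
import pandas as pd

df = pd.DataFrame({
    'position': ['1000000000', '0100000000', '0010000000',
                 '0001000000', '0000100000', '0000010000'],
    'vector': ['10', '01', '10', '01', '01', '01'],
})

# Stand-in for the contents of the text file.
text = "0100000000\n0001000000\n"

# Default parsing turns the lines into integers, losing the leading zeros.
as_int = pd.read_table(io.StringIO(text), header=None)
# dtype=str keeps every line exactly as written.
as_str = pd.read_table(io.StringIO(text), header=None, dtype=str)
print(as_int[0].tolist())
print(as_str[0].tolist())

# Merge on the string column; the file's single column is labelled 0.
matched_df = df.merge(as_str, left_on='position', right_on=0)
print(matched_df.to_string(columns=['position', 'vector'],
                           header=False, index=False))
```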
Answered By: RomanPerekhrest

You could read the file into a set, then select the rows whose position is an element of it. The one drawback is that this doesn’t preserve the order of the lines in the file: the result follows the dataframe’s row order.

with open(...) as f1:
    pos = set(line.rstrip('\n') for line in f1)

df_out = data_f.loc[data_f['position'].isin(pos)]
df_out
     position vector
1  0100000000     01
3  0001000000     01

Then to print it like you want:

print(df_out.to_string(header=False, index=False))
0100000000 01
0001000000 01
Answered By: wjandrea
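
If the file's line order does matter, one option is to index the vectors by position and reindex them with the list of lines. This is only a sketch under two assumptions: the position values are unique, and io.StringIO with made-up lines stands in for the real file. The names lines, vectors and df_out are illustrative.

```python
import io
import pandas as pd

data_f = pd.DataFrame({
    'position': ['1000000000', '0100000000', '0010000000',
                 '0001000000', '0000100000', '0000010000'],
    'vector': ['10', '01', '10', '01', '01', '01'],
})

# Stand-in for the real file; its order differs from the dataframe's rows.
f1 = io.StringIO("0010000000\n0100000000\n1111111111\n")
lines = [line.rstrip('\n') for line in f1]

# Series of vectors keyed by position (requires unique positions).
vectors = data_f.set_index('position')['vector']

# reindex follows the order of `lines`; unmatched lines become NaN and are dropped.
df_out = vectors.reindex(lines).dropna()

for position, vector in df_out.items():
    print(position, vector)
```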