Finding whether a string is present in a pandas data frame column, and creating a column with that string if it is

Question:

I have a list of 6000 files and a pandas data frame that contains a list of URLs. Some of those URLs match the names of those 6000 files. While I am iterating through the list of the files for some other purpose (extracting text), I am also looking for matching names in the URLs column. If there is a match, I write the matching file path in a new column.

It does not sound complicated, but my code does not work:

import glob
import os

import pandas as pd

files = glob.glob("materials/*.html")
data = pd.read_csv("file.csv")

def match_name(row):
    if filename in row['URL']:
        return file

for file in files:
    filename = os.path.basename(f'{file[:-5]}')
    extractor = open(file, 'rb')
    ...
    full = [p_text, os.path.basename(file)]
    df_full = pd.DataFrame(full)

    data['Path'] = data.apply(lambda x: match_name(x), axis=1)

However, it does not work: every value in the new column comes back as None. I also tried:

data['Path'] = data.apply(lambda x: file if filename in x else None, axis=1)


The data frame looks like this:

|Name | Value | URL                         |
|-----|-------|-----------------------------|
|Name1|Value1 |http://example.com/LALAC.html|
|Name2|Value2 |http://example.com/ABASW.html|
|Name3|Value3 |http://example.com/4421C.html|

The files are LALAC.txt, SDDSA1.txt, 4421C.html, etc. The output that I want to get is:

|Name | Value | URL                         |Path               |
|-----|-------|-----------------------------|-------------------|
|Name1|Value1 |http://example.com/LALAC.html|materials/LALAC.txt|
|Name2|Value2 |http://example.com/ABASW.html|None               |
|Name3|Value3 |http://example.com/4421C.html|materials/4421C.txt|

The path does exist in the folder, but I am missing the reason why I keep getting None. Any ideas?
Asked By: Octner


Answers:

If you have all of the file names in a set, and all of the URLs in a dataframe, you can do:

import pandas as pd
filenames = {"LALAC", "ABASW", "4421C"}

df = pd.DataFrame({'URL': [
    "http://example.com/LALAC.html",
    "http://example.com/ABASW.html",
    "http://example.com/4421C.html",
    "http://example.com/12345.html"
]})

df["Path"] = "materials/" + df["URL"].str.findall('|'.join(filenames)).str[0] + ".txt"

result:

                             URL                 Path
0  http://example.com/LALAC.html  materials/LALAC.txt
1  http://example.com/ABASW.html  materials/ABASW.txt
2  http://example.com/4421C.html  materials/4421C.txt
3  http://example.com/12345.html                  NaN
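
One caveat when joining the names into a single pattern: a name containing regex metacharacters (a dot or a plus sign, say) would be read as a pattern, and a short name can match inside a longer one. A possible variant, offered only as a sketch and assuming the file name is always the last URL segment before `.html`, escapes each name and anchors the match. The `XLALAC` URL is a made-up example to show the difference:

```python
import re

import pandas as pd

filenames = {"LALAC", "ABASW", "4421C"}

df = pd.DataFrame({"URL": [
    "http://example.com/LALAC.html",
    "http://example.com/XLALAC.html",  # hypothetical: contains "LALAC" but is another name
    "http://example.com/12345.html",
]})

# Escape every name so it is matched literally, then require it to be the
# whole last path segment: preceded by "/" and followed by ".html" at the end.
alternatives = "|".join(re.escape(name) for name in sorted(filenames))
pattern = rf"/({alternatives})\.html$"

# str.extract returns the captured group, or NaN where nothing matched.
stem = df["URL"].str.extract(pattern, expand=False)
df["Path"] = "materials/" + stem + ".txt"
```

With the unanchored pattern, the second URL would also be reported as `materials/LALAC.txt`; here it stays NaN.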
Answered By: Tom McLean
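
As for why the loop in the question ends up with None: as far as the posted code shows, `data['Path']` is reassigned on every pass through `files`, so each file overwrites whatever the previous one matched, and only rows matching the last file could keep a value. Below is a minimal sketch of one way around this, assuming the last URL segment (without its extension) should equal the file's base name. The sample paths and the names `by_stem` and `url_stem` are made up for illustration:

```python
import os
from urllib.parse import urlparse

import pandas as pd

# Hypothetical stand-in for glob.glob("materials/*.html").
files = ["materials/LALAC.html", "materials/SDDSA1.html", "materials/4421C.html"]

data = pd.DataFrame({
    "Name": ["Name1", "Name2", "Name3"],
    "URL": [
        "http://example.com/LALAC.html",
        "http://example.com/ABASW.html",
        "http://example.com/4421C.html",
    ],
})

# Build the lookup once: base name without extension -> full path.
by_stem = {os.path.splitext(os.path.basename(f))[0]: f for f in files}


def url_stem(url):
    """Return the last path segment of a URL without its extension."""
    return os.path.splitext(os.path.basename(urlparse(url).path))[0]


# Assign the column a single time, outside any loop over files.
# Series.map with a dict leaves NaN where the key is missing.
data["Path"] = data["URL"].map(url_stem).map(by_stem)
```

The text extraction can stay in its own loop over `files`; the column assignment only needs the finished dictionary, and a dictionary lookup avoids rescanning every row once per file.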