Unable to return required strings from XML files

Question:

I have written this code so that a user can point at a directory and the program walks through it looking for .xml files. Once they are found, the program is supposed to search each file for strings that are 32 characters long (what I call 32 bit strings). This is the only requirement; the content is not important at this time, only that it returns the 32-character strings.

I have tried using the regex module in Python, as below. When run, the program iterates over the available files and returns all the file names, but the string_recovery function returns only empty lists. I have visually confirmed that the XML files contain 32-character strings.

import os
import re
import tkinter as tk
from tkinter import filedialog



def string_recovery(data):
    short_string = re.compile(r"^[a-zA-Z0-9-._]{32}$")
    strings = re.findall(short_string, data)
    print(strings)


def xml_search(directory):
    xml_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".xml"):
                xml_files.append(os.path.join(root, file))
    print("The following XML files have been found.")
    print(xml_files)

    for xml_file in xml_files:
        with open(xml_file, "r") as f:
            string_recovery(f.read())


def key_finder():
    directory = filedialog.askdirectory()
    xml_search(directory)


key_finder()
Asked By: LHenne300

Answers:

Maybe you should go over each line instead of passing the whole file at once, as you do here:

    for xml_file in xml_files:
        with open(xml_file, "r") as f:
            string_recovery(f.read())

First check that your string_recovery works properly by trying it with a single line. I cannot reproduce your example, but you can create a variable line = ... and put there a line which should be recovered.
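A minimal sketch of that check, assuming a made-up test line (the value of line is hypothetical, not taken from your files). The function here returns the matches instead of printing them, which is my change to make the result easier to inspect:

```python
import re


def string_recovery(data):
    # Same pattern as in the question: the whole text must be exactly
    # 32 characters from the allowed set
    short_string = re.compile(r"^[a-zA-Z0-9-._]{32}$")
    return short_string.findall(data)


# Hypothetical test line, 32 characters plus the newline a file line ends with
line = "abcdefghijklmnopqrstuvwxyz012345\n"
matches = string_recovery(line)
print(matches)

# The same line is no longer found once other lines come before it
whole_file = "foo\n" + line
print(string_recovery(whole_file))
```

If the first call finds the string and the second does not, the pattern itself is fine and the problem is that the whole file is passed in at once.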

And go over each line instead of the whole file:

    for xml_file in xml_files:
        with open(xml_file, "r") as f:
            for line in f.readlines():
                string_recovery(line)
Answered By: 3dSpatialUser

By default, Python patterns are not "multiline", so ^ and $ match the start and end of your whole text block, not of each line. You need to set the flag re.M (aka re.MULTILINE).

Compare:

import re

text = """
foo
12345678901234567890123456789011
12345678901234567890123456789011
"""
pattern = r"^[a-zA-Z0-9-._]{32}$"
print(re.findall(pattern, text, re.M))  ## <--- flag

Giving:

[
    '12345678901234567890123456789011',
    '12345678901234567890123456789011'
]

with:

import re

text = """
foo
12345678901234567890123456789011
12345678901234567890123456789011
"""
pattern = r"^[a-zA-Z0-9-._]{32}$"
print(re.findall(pattern, text))

Giving:

[]
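Applied to the function from the question, this could look roughly like the sketch below. The names CHARS, WHOLE_LINE and EMBEDDED and the sample text are my own, not from the question. The second pattern is an assumption on my part: if your 32-character values sit inside tags on the same line, ^ and $ will not match even with re.M, so it uses lookarounds to find runs of exactly 32 allowed characters anywhere in the text.

```python
import re

CHARS = r"[a-zA-Z0-9._-]"

# re.MULTILINE makes ^ and $ anchor at every line, not only at the
# start and end of the whole text
WHOLE_LINE = re.compile(rf"^{CHARS}{{32}}$", re.MULTILINE)

# Assumption: values may be embedded in tags. Match a run of exactly 32
# allowed characters that is not part of a longer run.
EMBEDDED = re.compile(rf"(?<!{CHARS}){CHARS}{{32}}(?!{CHARS})")


def string_recovery(data, pattern=WHOLE_LINE):
    # Return the matches so the caller decides what to do with them
    return pattern.findall(data)


# Hypothetical sample: one value on its own line, one inside a tag
sample = (
    "<keys>\n"
    "12345678901234567890123456789011\n"
    "<key>abcdefghijklmnopqrstuvwxyz012345</key>\n"
    "</keys>\n"
)

whole_line_hits = string_recovery(sample)
embedded_hits = string_recovery(sample, EMBEDDED)
print(whole_line_hits)
print(embedded_hits)
```

With the default pattern only the value on its own line is found; with EMBEDDED both values are found.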
Answered By: JonSG