Why aren't regex matches being stored in the dictionary properly?

Question:

For my assignment I have to use regex and file chunking to find all URLs inside a raw data file. For some reason, when I look for matches, it is not matching URLs but finding numbers: the dictionary is getting filled with numbers instead of matches. I believe the problem is somewhere in the "for URL in fileContents" loop. I have been trying to troubleshoot this for hours. It loops and looks for matches using urlPattern, but it finds numbers and not actual URLs. The regex isn't the issue, because I used it in a simplified test and it found the URLs. Another issue I noticed is that if I input 50 for chunkSize, it only ever checks the first 50 bytes. What is the best approach to start with that chunk but continue through the whole file?

Here is the code I have:

import re
import os
import sys
from prettytable import PrettyTable

largeFile = input("Enter the name of a large File: ")
chunkSize = int(input("What size chunks?  "))

urlPattern = re.compile(rb'\w+://[\w@][\w.:@]+/?[\w.?=%&=\-@/$,]*')
matches = {}

try:
    if os.path.isfile(largeFile):
        with open(largeFile, 'rb') as targetFile:
            fileContents = targetFile.read(chunkSize)


            


            print("\nURLs")

            for URL in fileContents:
                try:
                    urlMatches   = urlPattern.findall(fileContents)
                    cnt = matches[URL]
                    cnt += 1
                    matches[URL] = cnt
                    print(urlMatches)
                    print(URL)
                except:
                    matches[URL] = 1


            tbl = PrettyTable(["Words", "Occurrences"])
            for word, cnt in matches.items():
                tbl.add_row([word, cnt])
                tbl.align = 'l'
                for link, count in matches:
                    tbl.add_row([link, count])
                print(tbl.get_string(sortby="Occurrences", reversesort=True))
                break
    else:
        print(largeFile, " is not a valid file")
        sys.exit("Script Aborted")

except Exception as err:
    print(err)
Asked By: Java_Assembly55


Answers:

Your "for URL in fileContents" loop is simply iterating over a bunch of numbers.

To demonstrate, let's assume that the first chunk read from the file contains b"https://www.example.website.com/page/1":

>>> fileContents = b"https://www.example.website.com/page/1"
>>> for URL in fileContents:
...    print(URL)
...
104
116
116
112
115
58
47
47
119

As you can see (only the first few values are shown), iterating over a bytes object yields each byte as an integer, not a URL. That is why the keys in your dictionary end up being numbers.
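The difference can be seen in a small standalone sketch. The sample data and the simplified pattern here are made up for illustration and are not the ones from the question; the point is only that indexing or iterating bytes gives integers, while findall gives the matched byte strings, which are what should be counted.

```python
import re
from collections import Counter

# Hypothetical sample data, standing in for one chunk of the file.
data = b"visit http://www.google.com or http://www.google.com today"

# Indexing (or iterating) a bytes object yields ints ...
first = data[0]
# ... while slicing yields a bytes object of length 1.
first_slice = data[0:1]

# Simplified pattern, assumed for this example only.
url_pattern = re.compile(rb'\w+://[\w.:@]+')

# findall returns the matched byte strings; Counter tallies them.
url_counts = Counter(url_pattern.findall(data))

print(type(first), type(first_slice))
print(url_counts)
```

Looping over the result of findall, rather than over the chunk itself, is what makes the dictionary keys URLs.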

I think this is closer to what you are trying to achieve. It keeps reading chunks until the file is exhausted and counts every URL that findall returns for each chunk:

import re
import os
import sys
from prettytable import PrettyTable

largeFile = input("Enter the name of a large File: ")
chunkSize = int(input("What size chunks?  "))

print(largeFile, chunkSize)

urlPattern = re.compile(rb'\w+://[\w@][\w.:@]+/?[\w.?=%&=\-@/$,]*')
matches = {}

try:
    if os.path.isfile(largeFile):
        with open(largeFile, 'rb') as targetFile:
            fileContents = targetFile.read(chunkSize)

            # while there is data left in the file
            while len(fileContents) > 0:
                # find all the urls inside this filecontents chunk
                urlMatches = urlPattern.findall(fileContents) 

                # for each of the found urls
                for match in urlMatches:

                    # create a dictionary key if one doesn't exist
                    matches.setdefault(match, 0)

                    # increase the count value 
                    matches[match] += 1
                 
                # read in the next chunk of data from the file
                fileContents = targetFile.read(chunkSize)

            tbl = PrettyTable(["Words", "Occurrences"])
            for word, cnt in matches.items():
                tbl.add_row([word, cnt])
            tbl.align = 'l'
            print(tbl.get_string(sortby="Occurrences", reversesort=True))
    else:
        print(largeFile, " is not a valid file")
        sys.exit("Script Aborted")

except Exception as err:
    print(err)

This is my output:

+--------------------------+-------------+
| Words                    | Occurrences |
+--------------------------+-------------+
| b'https://www.dod.mil'   | 1           |
| b'https://treasury.gov'  | 1           |
| b'https://microsoft.com' | 1           |
| b'http://www.google.com' | 1           |
| b'http://www.amazon.com' | 1           |
+--------------------------+-------------+
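
One caveat with fixed-size chunks: a URL that straddles a chunk boundary is split, so it is either missed or recorded in truncated form. The following is a minimal sketch of one way to handle that, not a definitive implementation. The function name count_urls and the keep parameter are hypothetical, the pattern is an assumed cleaned-up version of the one above, and the approach assumes URLs are shorter than keep bytes.

```python
import io
import re
from collections import Counter

# Assumed pattern: a cleaned-up guess at the URL regex used above.
URL_PATTERN = re.compile(rb'\w+://[\w@][\w.:@]+/?[\w.?=%&=\-@/$,]*')


def count_urls(stream, chunk_size, keep=2048):
    """Count URLs in a binary stream read chunk_size bytes at a time.

    keep is a hypothetical cap on how many unmatched trailing bytes are
    carried into the next chunk; URLs longer than that may be truncated.
    """
    counts = Counter()
    carry = b""
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buffer = carry + chunk
        last_end = 0
        for m in URL_PATTERN.finditer(buffer):
            # A match touching the end of the buffer may continue in the
            # next chunk, so defer it unless the stream is exhausted.
            if m.end() == len(buffer) and not at_eof:
                break
            counts[m.group()] += 1
            last_end = m.end()
        if at_eof:
            return counts
        # Keep only the bytes after the last counted match (bounded by keep).
        carry = buffer[max(last_end, len(buffer) - keep):]


# Hypothetical sample data; an in-memory stream stands in for the file.
sample = b"see http://www.google.com and https://microsoft.com then http://www.google.com"
print(count_urls(io.BytesIO(sample), chunk_size=7))
```

With a real file, the stream would be the object returned by open(largeFile, 'rb'), and the resulting counts can be fed to PrettyTable exactly as in the code above.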

Answered By: Alexander