File Names Chain in Python

Question:

I cannot use any imported library. My task involves directories that contain files; besides some words, each file contains, on its first line, the name of the next file to open. Once every word of every file in a directory has been read, the words have to be combined into a single string: its first character is the most frequent first letter among all the words read, its second character is the most frequent second letter, and so on. I have managed to do this for a directory containing 3 files, but my code does not use any chain-like mechanism; it just passes local variables around. Some of my classmates suggested I use list slicing, but I can't figure out how.
This is what I have so far:

'''
    The objective of the homework assignment is to design and implement a function
    that reads some strings contained in a series of files and generates a new
    string from all the strings read.
    The strings to be read are contained in several files, linked together to
    form a closed chain. The first string in each file is the name of another
    file that belongs to the chain: starting from any file and following the
    chain, you always return to the starting file.
    
    Example: the first line of file "A.txt" is "B.txt," the first line of file
    "B.txt" is "C.txt," and the first line of "C.txt" is "A.txt," forming the 
    chain "A.txt"-"B.txt"-"C.txt".
    
    In addition to the string with the name of the next file, each file also
    contains other strings separated by spaces, tabs, or carriage return 
    characters. The function must read all the strings in the files in the chain
    and construct the string obtained by concatenating the characters with the
    highest frequency in each position. That is, in the string to be constructed,
    at position p, there will be the character with the highest frequency at 
    position p of each string read from the files. In the case where there are
    multiple characters with the same frequency, consider the alphabetical order.
    The generated string has a length equal to the maximum length of the strings
    read from the files.
    
    Therefore, you must write a function that takes as input a string "filename"
    representing the name of a file and returns a string.
    The function must construct the string according to the directions outlined
    above and return the constructed string.
    
    Example: if the contents of the three files A.txt, B.txt, and C.txt in the
    directory test01 are as follows
    
    
    test01/A.txt          test01/B.txt         test01/C.txt                                                                 
    -------------------------------------------------------------------------------
    test01/B.txt          test01/C.txt         test01/A.txt
    house                 home                 kite                                                                       
    garden                park                 hello                                                                       
    kitchen               affair               portrait                                                                     
    balloon                                    angel                                                                                                                                               
                                               surfing                                                               
    
    the function most_frequent_chars ("test01/A.txt") will return "hareennt".
    '''

        def file_names_list(filename):
            intermezzo = []
            lista_file = []
        
            a_file = open(filename)
        
            lines = a_file.readlines()
            for line in lines:
                intermezzo.extend(line.split())
            del intermezzo[1:]
            lista_file.append(intermezzo[0])
            intermezzo.pop(0)
            return lista_file
        
        
        def words_list(filename):
            lista_file = []
            a_file = open(filename)
        
            lines = a_file.readlines()[1:]
            for line in lines:
                lista_file.extend(line.split())
            return lista_file
        
        def stuff_list(filename):
            file_list = file_names_list(filename)
            the_rest = words_list(filename)
            second_file_name = file_names_list(file_list[0])
            
            
            the_lists = words_list(file_list[0]) and \
            words_list(second_file_name[0])
            the_rest += the_lists[0:]
            return the_rest
            
        
        def most_frequent_chars(filename):
            huge_words_list = stuff_list(filename)
            maxOccurs = ""
            list_of_chars = []
            for i in range(len(max(huge_words_list, key=len))):
                for item in huge_words_list:
                    try:
                        list_of_chars.append(item[i])
                    except IndexError:
                        pass
                    
                maxOccurs += max(sorted(set(list_of_chars)), key = list_of_chars.count)
                list_of_chars.clear()
            return maxOccurs
        print(most_frequent_chars("test01/A.txt"))
Asked By: youngsoyuz


Answers:

This assignment becomes fairly easy once the code is well structured. Here is a full implementation:

def read_file(fname):
    with open(fname, 'r') as f:
        return list(filter(None, [y.rstrip(' \n').lstrip(' ') for x in f for y in x.split()]))

def read_chain(fname):
    seen   = set()
    new    =  fname
    result = []
    while not new in seen:
        A          = read_file(new)
        seen.add(new)
        new, words = A[0], A[1:]
        result.extend(words)
    return result

def most_frequent_chars (fname):
    all_words = read_chain(fname)
    result    = []
    for i in range(max(map(len,all_words))):
        chars = [word[i] for word in all_words if i<len(word)]
        result.append(max(sorted(set(chars)), key = chars.count))
    return ''.join(result)

print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"

In the code above, we define 3 functions:

  1. read_file: simple function to read the contents of a file and return a list of strings. The command x.split() takes care of any spaces or tabs used to separate words. The final command list(filter(None, arr)) makes sure that empty strings are erased from the solution.

  2. read_chain: Simple routine to iterate through the chain of files, and return all the words contained in them.

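  3. most_frequent_chars: for each position i, collects the character at that position from every word long enough to have one, then picks the most frequent. Sorting the candidate characters first means ties are broken alphabetically.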
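
To try the chain-following idea without touching the disk, here is a minimal sketch that replaces the files with a dictionary. The names FAKE_FILES and walk_chain are made up for illustration and are not part of the code above; the way the words are laid out on each line is also an assumption.

```python
# Hypothetical in-memory stand-in for the files: name -> file contents.
FAKE_FILES = {
    "A.txt": "B.txt\nhouse garden\nkitchen\tballoon\n",
    "B.txt": "C.txt\nhome park affair\n",
    "C.txt": "A.txt\nkite hello portrait angel surfing\n",
}

def walk_chain(start, files):
    """Follow the 'next file' links until a name repeats; collect the words."""
    seen = set()
    order = []   # file names in the order they were visited
    words = []   # every word except the first token of each file
    name = start
    while name not in seen:
        seen.add(name)
        order.append(name)
        tokens = files[name].split()   # split() handles spaces, tabs, newlines
        name, rest = tokens[0], tokens[1:]
        words.extend(rest)
    return order, words

order, words = walk_chain("A.txt", FAKE_FILES)
print(order)        # ['A.txt', 'B.txt', 'C.txt']
print(len(words))   # 12
```

Because the chain is closed, starting from "B.txt" instead visits the same three files in a different order and collects the same words.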


PS. This line of code you had is very interesting:

maxOccurs += max(sorted(set(list_of_chars)), key = list_of_chars.count)

I edited my code to include it.
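
As a small illustration of why that line picks the alphabetically first character on a tie (the sample list below is invented): max returns the first element with the largest key, so sorting the candidates beforehand turns "first encountered" into "first alphabetically".

```python
# Invented sample: 'k' and 'h' both occur twice, 'a' occurs once.
chars = ["k", "h", "h", "k", "a"]

candidates = sorted(set(chars))            # ['a', 'h', 'k']
best = max(candidates, key=chars.count)    # tie between 'h' and 'k'
print(best)                                # h

# Without the sort, the tie goes to whichever tied character comes first
# in the sequence being scanned, here the original list.
unsorted_best = max(chars, key=chars.count)
print(unsorted_best)                       # k
```

Note that chars.count rescans the list for every candidate, which is fine for short lists like these.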


Space complexity optimization

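The previous code stores every word before counting. Memory use can be reduced substantially by scanning the files and updating per-position character counts directly, so that memory grows with the length of the longest word and the number of distinct characters rather than with the total number of words: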

def scan_file(fname, database):
    with open(fname, 'r') as f:
        next_file = None
        for x in f:
            for y in x.split():
                if next_file is None:
                    next_file = y
                else:
                    for i,c in enumerate(y):
                        while len(database) <= i:
                            database.append({})
                        if c in database[i]:
                            database[i][c] += 1
                        else:
                            database[i][c]  = 1
        return next_file

def most_frequent_chars (fname):
    database  =  []
    seen      =  set()
    new       =  fname
    while not new in seen:
        seen.add(new)
        new  =  scan_file(new, database)
    return ''.join(max(sorted(d.keys()),key=d.get) for d in database)
print(most_frequent_chars("test01/A.txt"))
# output: "hareennt"

Now we scan the files tracking the frequency of the characters in database, without storing intermediate arrays.
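
To make the shape of database concrete, here is a minimal sketch that builds it for three invented words. The helper count_positions is hypothetical and only mirrors the inner loop of scan_file, without any file handling.

```python
def count_positions(words):
    """Return a list where entry i maps each character seen at position i to its count."""
    database = []
    for word in words:
        for i, c in enumerate(word):
            if i == len(database):      # first time any word reaches position i
                database.append({})
            database[i][c] = database[i].get(c, 0) + 1
    return database

db = count_positions(["home", "hot", "pat"])
print(db)
# [{'h': 2, 'p': 1}, {'o': 2, 'a': 1}, {'m': 1, 't': 2}, {'e': 1}]

# Same selection rule as in the answer: most frequent, ties broken alphabetically.
result = ''.join(max(sorted(d), key=d.get) for d in db)
print(result)   # hote
```

The list has one dictionary per position, so its length equals the length of the longest word seen.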

Answered By: C-3PO

Ok, here’s my solution:

def parsi_file(filename):
    
    visited_files = set()
    words_list = []
    
    # Getting words from all files
    while filename not in visited_files:
        visited_files.add(filename)
        with open(filename) as f:
            filename = f.readline().strip()
            words_list += f.read().split()  # words may be separated by spaces, tabs, or newlines
    
    # Creating dictionaries of letters:count for each index
    letters_dicts = []
    for word in words_list:
        for i in range(len(word)):    
            if i > len(letters_dicts)-1:
                letters_dicts.append({})
            letter = word[i]
            if letters_dicts[i].get(letter):
                letters_dicts[i][letter] += 1
            else:
                letters_dicts[i][letter] = 1
        
    # Sorting dicts and getting the "best" letter
    code = ""
    for dic in letters_dicts:
        sorted_letters = sorted(dic, key = lambda letter: (-dic[letter],letter))
        code += sorted_letters[0]
        
    return code

  • We first collect the words from all the files into words_list.
  • Then, for each index, we build a dictionary mapping each letter found at that index to its count.
  • Next we sort the dictionary keys by descending count (-count), then alphabetically.
  • Finally we take the first letter (the one with the highest count) and append it to the "code" string.
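
The sort key from the steps above can be tried in isolation. This is a minimal sketch with an invented counts dictionary; the min variant at the end is an alternative not used in the answer, shown because it avoids sorting every key when only the best one is needed.

```python
# Invented counts for one position: 'h' and 'k' tie, 'z' wins outright.
counts = {"k": 2, "h": 2, "a": 1, "z": 3}

# Highest count first; among equal counts, alphabetical order.
ranked = sorted(counts, key=lambda letter: (-counts[letter], letter))
print(ranked)       # ['z', 'h', 'k', 'a']
print(ranked[0])    # z

# Alternative: min with the same key returns the same letter without a full sort.
best = min(counts, key=lambda letter: (-counts[letter], letter))
print(best)         # z
```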

Edit: in terms of efficiency, scanning all the words once per index gets slower as the number of words grows, so I changed the code to build the dictionaries for every index in a single pass over the word list.

Answered By: Swifty