Why is appending a dict to a list not working here?

Question:

I’m trying to load the contents of two pickled files in a directory into dicts, which are then appended to a list. For reference, there are only two .pkl files in the directory, and each pickled object is loaded as a list. However, when I append the dicts to the list, I get duplicate results. Any idea why?

import os
import pickle
import pandas as pd


y_labels = ('anime.pkl', 'manga.pkl')


def process_docs(path, label):
    docs = os.listdir(path)
    data = []
    for doc in docs:
        with open(f'{path}/{doc}', 'rb') as f:
            text = pickle.load(f)
            data.append({'label': label, 'text': ' '.join(text)})
    return data


data = []
for label in y_labels:
    data.extend(process_docs('keywords', label))
df = pd.DataFrame(data)

ACTUAL OUTPUT:

[{'label': 'anime.pkl', 'text': 'a b c'}, {'label': 'anime.pkl', 'text': 'a b c'}] 
[{'label': 'manga.pkl', 'text': '1 2 3'}, {'label': 'manga.pkl', 'text': '1 2 3'}]

EXPECTED OUTPUT:

[{'label': 'anime.pkl', 'text': 'a b c'}, {'label': 'anime.pkl', 'text': '1 2 3'}]
Asked By: ariyasas94

Answers:

That’s because you are reading the same directory twice. Your code calls process_docs('keywords', label) once per label, so twice in total. Each call runs docs = os.listdir(path) with path set to 'keywords', so docs is the same both times. The function then loops over docs and appends the contents of the same files under whichever label was passed in, which is why you get duplicated results.
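The effect can be reproduced in isolation. The following is a minimal sketch, assuming a temporary directory (tmp_dir, a hypothetical stand-in for 'keywords') that holds two small pickled lists; process_docs follows essentially the same logic as in the question.

```python
import os
import pickle
import tempfile


def process_docs(path, label):
    # Every file in `path` is read on each call;
    # `label` is only copied into the dicts and never selects a file.
    data = []
    for doc in sorted(os.listdir(path)):
        with open(os.path.join(path, doc), 'rb') as f:
            text = pickle.load(f)
        data.append({'label': label, 'text': ' '.join(text)})
    return data


# Hypothetical stand-in for the 'keywords' directory.
tmp_dir = tempfile.mkdtemp()
samples = {'anime.pkl': ['a', 'b', 'c'], 'manga.pkl': ['1', '2', '3']}
for name, words in samples.items():
    with open(os.path.join(tmp_dir, name), 'wb') as f:
        pickle.dump(words, f)

rows = []
for label in ('anime.pkl', 'manga.pkl'):
    rows.extend(process_docs(tmp_dir, label))

# 2 labels x 2 files: each file is read once per label.
print(len(rows))  # 4
```

Two labels times two files gives four rows, so each file's content appears once per label.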

To get your expected result, iterate over the label and file pairs only once; you do not need two for loops. For example, you can do the following.

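data = []
path = 'keywords'
# os.listdir returns names in arbitrary order; sort so they line up with y_labels
docs = sorted(os.listdir(path))
# pair each label with one file and read every file exactly once
for label, doc in zip(y_labels, docs):
    with open(os.path.join(path, doc), 'rb') as f:
        text = pickle.load(f)
    data.append({'label': label, 'text': ' '.join(text)})
df = pd.DataFrame(data)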
Answered By: Wai
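
Another option, not taken from the answer above, is to skip os.listdir entirely and open each label's file directly. This is only a sketch and assumes that every entry in y_labels is also the name of a file in the directory, as the question's data suggest. The helper name load_labelled and the temporary directory are hypothetical.

```python
import os
import pickle
import tempfile


def load_labelled(path, labels):
    """Read one pickle per label, assuming each label is a file name in `path`."""
    data = []
    for label in labels:
        with open(os.path.join(path, label), 'rb') as f:
            words = pickle.load(f)
        data.append({'label': label, 'text': ' '.join(words)})
    return data


# Hypothetical stand-in for the 'keywords' directory.
tmp_dir = tempfile.mkdtemp()
samples = {'anime.pkl': ['a', 'b', 'c'], 'manga.pkl': ['1', '2', '3']}
for name, words in samples.items():
    with open(os.path.join(tmp_dir, name), 'wb') as f:
        pickle.dump(words, f)

result = load_labelled(tmp_dir, ('anime.pkl', 'manga.pkl'))
print(result)
```

Because the label picks the file, the result does not depend on directory listing order, and a missing file raises FileNotFoundError instead of being silently mispaired.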

The dictionaries built in process_docs use the same keys for both files, so the two files cannot be told apart once their entries are combined. One option is to build a unique key from the file name instead of a fixed ‘text’ key.

def process_docs(path, label):
    docs = os.listdir(path)
    data = []
    for doc in docs:
        with open(f'{path}/{doc}', 'rb') as f:
            text = pickle.load(f)
            # use filename as key for uniqueness
            data.append({'label': label, f'file_{doc}': ' '.join(text)})
    return data

Once the keys are unique, you can merge the per-file dicts with update and append the result to the list:

data = []
for label in y_labels:
    docs_data = process_docs('keywords', label)
    # combine the dictionaries by updating the label key
    combined_data = {}
    for doc_data in docs_data:
        combined_data.update(doc_data)
    data.append(combined_data)
df = pd.DataFrame(data)
Answered By: tamarajqawasmeh
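
To see what the merge step does, here is a small sketch of dict.update on two dicts shaped like the ones this answer builds. The sample values are assumptions chosen to mirror the question.

```python
first = {'label': 'anime.pkl', 'file_anime.pkl': 'a b c'}
second = {'label': 'anime.pkl', 'file_manga.pkl': '1 2 3'}

combined = {}
for doc_data in (first, second):
    # Shared keys ('label') are overwritten; unique keys are kept side by side.
    combined.update(doc_data)

print(combined)
# {'label': 'anime.pkl', 'file_anime.pkl': 'a b c', 'file_manga.pkl': '1 2 3'}
```

Each label therefore yields a single dict with one column per file, which is a different shape from the one-row-per-file layout in the question's expected output.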