Why is appending a dict to a list not working here?
Question:
I’m trying to append the contents of two pickled files in a directory to a dict, which is then appended to a list. For reference, there are only two .pkl files in the directory, and the pickled objects are returned as lists. However, when I try to append the dicts to the list, I get duplicate results. Any idea why?
import os
import pickle
import pandas as pd

y_labels = ('anime.pkl', 'manga.pkl')

def process_docs(path, label):
    docs = os.listdir(path)
    data = []
    for doc in docs:
        with open(f'{path}/{doc}', 'rb') as f:
            text = pickle.load(f)
        data.append({'label': label, 'text': ' '.join(text)})
    return data

data = []
for label in y_labels:
    data.extend(process_docs('keywords', label))
df = pd.DataFrame(data)
ACTUAL OUTPUT:
[{'label': 'anime.pkl', 'text': 'a b c'}, {'label': 'anime.pkl', 'text': 'a b c'}]
[{'label': 'manga.pkl', 'text': '1 2 3'}, {'label': 'manga.pkl', 'text': '1 2 3'}]
EXPECTED OUTPUT:
[{'label': 'anime.pkl', 'text': 'a b c'}, {'label': 'anime.pkl', 'text': '1 2 3'}]
Answers:
That’s because you are reading the same directory twice. Your code calls process_docs('keywords', label) once per label, so twice in total. Each call runs docs = os.listdir(path) with path set to 'keywords', so docs is the same both times. The function then loops over docs and appends the contents of the same files, tagged with whichever label was passed in. As a result, you get duplicated results.
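The effect described above can be reproduced in isolation. The following is a minimal sketch, assuming a temporary directory as a stand-in for 'keywords' and the sample word lists from the question; the listing is sorted here only to make the row order predictable, which the original code does not do.

```python
import os
import pickle
import tempfile


def process_docs(path, label):
    """Same logic as in the question: load every file in `path`, tag it with `label`."""
    data = []
    # sorted() is an addition for a predictable order; the question uses os.listdir(path) as-is
    for doc in sorted(os.listdir(path)):
        with open(os.path.join(path, doc), 'rb') as f:
            text = pickle.load(f)
        data.append({'label': label, 'text': ' '.join(text)})
    return data


# Hypothetical stand-in for the 'keywords' directory, filled with sample data.
with tempfile.TemporaryDirectory() as path:
    samples = (('anime.pkl', ['a', 'b', 'c']), ('manga.pkl', ['1', '2', '3']))
    for name, words in samples:
        with open(os.path.join(path, name), 'wb') as f:
            pickle.dump(words, f)

    rows = []
    for label in ('anime.pkl', 'manga.pkl'):
        # Each call reads BOTH files, regardless of the label passed in.
        rows.extend(process_docs(path, label))

# Two labels x two files = four rows; each file's text shows up under both labels.
for row in rows:
    print(row)
```

Every call walks the whole directory, so the number of rows is the number of labels times the number of files.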
To get your expected result, pair each file with its label and iterate over those pairs exactly once. You do not need two nested for loops. For example, you can do the following.
data = []
path = 'keywords'
# os.listdir returns entries in arbitrary order, so sort the listing
# to make docs[i] line up with y_labels[i]
docs = sorted(os.listdir(path))
for i, label in enumerate(y_labels):
    doc = docs[i]
    with open(f'{path}/{doc}', 'rb') as f:
        text = pickle.load(f)
    data.append({'label': label, 'text': ' '.join(text)})
df = pd.DataFrame(data)
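Pairing docs[i] with y_labels[i] depends on the directory listing lining up with the labels, and os.listdir makes no promise about order. Since the labels in the question are themselves the file names, one alternative is to open each file by its label directly. This is only a sketch under that assumption; load_labeled is a hypothetical helper name, and the temporary directory stands in for 'keywords'.

```python
import os
import pickle
import tempfile


def load_labeled(path, labels):
    """Load one pickled word list per label, assuming each label is also a file name."""
    data = []
    for label in labels:
        with open(os.path.join(path, label), 'rb') as f:
            text = pickle.load(f)
        data.append({'label': label, 'text': ' '.join(text)})
    return data


# Hypothetical stand-in for the 'keywords' directory, filled with sample data.
with tempfile.TemporaryDirectory() as path:
    samples = (('anime.pkl', ['a', 'b', 'c']), ('manga.pkl', ['1', '2', '3']))
    for name, words in samples:
        with open(os.path.join(path, name), 'wb') as f:
            pickle.dump(words, f)

    rows = load_labeled(path, ('anime.pkl', 'manga.pkl'))

print(rows)
```

The resulting list can be passed to pd.DataFrame(rows) as in the question. Extra files in the directory are ignored, and a label with no matching file raises FileNotFoundError instead of being silently mismatched.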
The dictionaries created in process_docs have the same ‘label’ key and value for both files, which is why the rows look duplicated. You should give each file’s text a unique key instead, for example one built from the file name.
def process_docs(path, label):
    docs = os.listdir(path)
    data = []
    for doc in docs:
        with open(f'{path}/{doc}', 'rb') as f:
            text = pickle.load(f)
        # use filename as key for uniqueness
        data.append({'label': label, f'file_{doc}': ' '.join(text)})
    return data
Once each file has a unique key, you can merge the per-file dictionaries with update and append the combined result to the list:
data = []
for label in y_labels:
    docs_data = process_docs('keywords', label)
    # merge the per-file dictionaries into a single row for this label
    combined_data = {}
    for doc_data in docs_data:
        combined_data.update(doc_data)
    data.append(combined_data)
df = pd.DataFrame(data)
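To make the merge step concrete, here is a small sketch of what dict.update does with rows of that shape. The docs_data list is hypothetical sample data, written by hand to mimic what the revised process_docs would return for the 'anime.pkl' label.

```python
# Hypothetical rows, shaped like the return value of the revised process_docs.
docs_data = [
    {'label': 'anime.pkl', 'file_anime.pkl': 'a b c'},
    {'label': 'anime.pkl', 'file_manga.pkl': '1 2 3'},
]

combined_data = {}
for doc_data in docs_data:
    # update() overwrites keys that repeat ('label') and adds keys that are new
    # (the per-file keys), so nothing is lost as long as the file keys differ.
    combined_data.update(doc_data)

print(combined_data)
# {'label': 'anime.pkl', 'file_anime.pkl': 'a b c', 'file_manga.pkl': '1 2 3'}
```

With this shape, each label becomes one row with one column per file, rather than one row per file.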