Why is this For Loop overwriting the contents of my dictionary strangely?

Question:

I am trying to convert my pandas DataFrame data into a structure that is easily represented as JSON. I have chosen to do this by turning it into Python dictionaries and then converting those into JSON.

The problem I am encountering is that the formatted data does not come out as expected: the values for every month are being replaced by the values from the last iteration of my for loop.

Here is a reproducible example, which is split between 2 files:

    import re
    import pandas as pd
    import json
    
    from help import Model  # Note! this is another file help.py
    
    jan = {'Month': ["January", "January", "January", "January"],
           'Date': ['1st', '2nd', '28th', '29th'],
           'a': ["j1a", "a3x", "d9c", "h9c"],
           'b': ["X1", "SG", "DV", "XP"]}
    
    dec = {'Month': ["December", "December", "December", "December"],
           'Date': ['1st', '2nd', '28th', '29th'],
           'a': ["d1a", "o3x", "j9c", "h9c"],
           'b': ["X2", "SG", "DV", "XP"]}
    
    a = pd.DataFrame.from_dict(jan)
    b = pd.DataFrame.from_dict(dec)
    
    dfs = [a, b]
    df = pd.concat(dfs)
    
    DateNum = []
    for values in df['Date']:
        DateNum.append(re.search(r'\d+', values).group())
    df['Date Num'] = DateNum
    df.reset_index(drop=True, inplace=True)
    
    dfl = df.Month.tolist()
    months = []
    for data in dfl:
        if data not in months:
            months.append(data)
    # months = ['January', 'December']
    models = []
    for month in months:
        models.append(Model(month))
    
    
    calendar = {}
    for month in models:
        datacopy = df.copy()
        datacopy = datacopy[datacopy.Month == month.name]
        month.data = datacopy
    
        month.update(debug=True)
        calendar[month.name] = month.days
    
    print(json.dumps(calendar, indent=4))

Here is the other file – help.py contains the classes Model and Day

    class Model:
        """
        model for months
        """
        name = ""
        data = None
    
        days = {}
    
        def __init__(self, monthname):
            self.name = monthname
    
        def update(self, debug=False):
    
            edit = self.data  # a copy of a slice from the df
            edit = edit.drop("Month", axis=1)  # drop Month column
            edit = edit.set_index('Date Num').T.to_dict('list')  # set Date Num column to be the index and make dict
    
            data_formatted = {self.name: edit}  # save the dict with key as month name as data_formatted
    
            for k, v in data_formatted[self.name].items():  # data_formatted [month] = (day number : data)
    
                if debug:
                    print(k, v)  # e.g. k=1 v=['1st', 'a', 'n']
    
                day_object = Day(v)  # make a day object out of the values (formatting in initializer)
                self.days[k] = day_object.data_formatted  # assign the formatted value e.g. days[1] = (formatted data)
    
                # print(self.days[k])  # shows correct data e.g. {'date': '25th', 'a': 'a', 'b': 'n', 'c': 'x'}
    
    
    class Day:
        date = ""
        a = ""
        b = ""
    
        data_formatted = {}
    
        def __init__(self, data):
            self.date = data[0]
            self.a = data[1]
            self.b = data[2]
    
            self.format_data()
    
        def format_data(self):
            self.data_formatted = {
                "date": self.date,
                "a": self.a,
                "b": self.b,
            }

The debug output shows that the data is being processed in the expected order:

1 ['1st', 'j1a', 'X1']
2 ['2nd', 'a3x', 'SG']
28 ['28th', 'd9c', 'DV']
29 ['29th', 'h9c', 'XP']
1 ['1st', 'd1a', 'X2']
2 ['2nd', 'o3x', 'SG']
28 ['28th', 'j9c', 'DV']
29 ['29th', 'h9c', 'XP']

But the output of json.dumps is different (every month is identical to the last month in months):

{
"January": {
    "1": {
        "date": "1st",
        "a": "d1a", - Should be j1a
        "b": "X2" - should be X1
    },
    "2": {
        "date": "2nd",
        "a": "o3x", - Should be a3x
        "b": "SG"
    } ...

Thank you for reading this and I hope you can help me.

Here are some other notes:

  • The code without the Model class is being run in an interactive Python notebook – could this change things?
  • The code I have provided only shows 2 months. In my case, the data from the last month (which I assume to be the last iteration) is being saved as the data for ALL the months.

Asked By: mfm

||

Answers:

The problem is here:

    month.data = datacopy

    month.update(debug=True)
    calendar[month.name] = month.days

That’s fine the first time around, but in the next iteration you set the data and run .update on a different month object, and its .days is still the same dictionary as before. So you’re not just filling in the dictionary for the current month, you’re also overwriting it for all previous months.
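One way to confirm this is to check whether the values stored in calendar are the same object. The following is a minimal sketch, assuming a stripped-down Model without pandas; the rows argument and the sample dictionary are hypothetical stand-ins for the DataFrame slices in the question:

```python
class Model:
    days = {}  # defined on the class, as in the question's help.py

    def __init__(self, monthname):
        self.name = monthname

    def update(self, rows):
        # rows is a hypothetical stand-in for the formatted DataFrame slice
        for k, v in rows.items():
            self.days[k] = v


# hypothetical sample data, loosely based on the question's values
sample = {
    "January": {"1": "j1a", "2": "a3x"},
    "December": {"1": "d1a", "2": "o3x"},
}

calendar = {}
for name, rows in sample.items():
    month = Model(name)
    month.update(rows)
    calendar[month.name] = month.days

# Both keys refer to one and the same dictionary object
print(calendar["January"] is calendar["December"])  # True
print(calendar["January"]["1"])  # d1a, overwritten by December
```

The `is` check is the quickest diagnostic here: equal contents alone could be a coincidence, but identical objects cannot be.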

Edit: you asked for some clarification in the comments – that’s fine, it’s perhaps not immediately obvious.

The problem starts here, in your Model class:

    class Model:
        ...
        # this is the only place a new dictionary is created
        days = {}

        def __init__(self, monthname):
            # __init__ never assigns self.days, so on every instance
            # self.days resolves to the one days dictionary on the class
            ...

        def update(self, debug=False):
            ...
            for k, v in data_formatted[self.name].items():
                ...
                day_object = Day(v)
                # so here, you just update that one dictionary
                self.days[k] = day_object.data_formatted

I’ve removed the code that doesn’t contribute to the problem and added some comments to explain. The key problem is that you defined days as an attribute of Model – that means it’s a class attribute, to which all instances of the class have access, but there’s only one of it.
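To see where each attribute actually lives, you can inspect the instance and class namespaces with vars(). This is a small sketch using a cut-down Model (only name and days are kept); it also shows the difference between mutating the shared dictionary and rebinding the attribute on one instance:

```python
class Model:
    days = {}  # stored on the class

    def __init__(self, monthname):
        self.name = monthname  # assignment stores this on the instance


jan = Model("January")
dec = Model("December")

print("name" in vars(jan))    # True: set in __init__
print("days" in vars(jan))    # False: not on the instance
print("days" in vars(Model))  # True: found on the class instead

# Mutating through one instance changes the dictionary every instance sees
jan.days["1"] = "j1a"
print(dec.days)  # {'1': 'j1a'}

# Assigning to the attribute creates a separate instance attribute
# that shadows the class attribute for that instance only
dec.days = {}
print("days" in vars(dec))  # True
print(jan.days)             # {'1': 'j1a'}
```

That distinction is also why Day.data_formatted in your code does not show the same symptom: format_data assigns a new dictionary to self.data_formatted, which rebinds it on the instance, whereas Model.update only ever mutates the existing days dictionary.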

If you need each instance of Model to have a unique instance of .days, you should just create it in __init__ (and you don’t need it on the class body at all):

    def __init__(self, monthname):
        self.name = monthname
        self.days = {}
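As a quick check that this change is enough, here is a minimal sketch of the corrected class, assuming a simplified update that takes a plain dictionary (the rows parameter is hypothetical and replaces the DataFrame handling from the question):

```python
class Model:
    def __init__(self, monthname):
        self.name = monthname
        self.data = None
        self.days = {}  # a fresh dictionary for each instance

    def update(self, rows):
        # rows is a hypothetical stand-in for the formatted DataFrame slice
        for k, v in rows.items():
            self.days[k] = v


jan = Model("January")
dec = Model("December")
jan.update({"1": "j1a"})
dec.update({"1": "d1a"})

calendar = {m.name: m.days for m in (jan, dec)}
print(calendar)
# {'January': {'1': 'j1a'}, 'December': {'1': 'd1a'}}
print(jan.days is dec.days)  # False
```

Setting data in __init__ as well is optional, but it keeps all per-instance state in one place and avoids relying on class-level defaults.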

So the problem is not really to do with loops; it is the difference between a class attribute and an instance attribute.

Answered By: Grismar