Dynamically generate JSON file keys and write to S3

Question:

I am generating a JSON file with a Python script, but after the for loop finishes, the output contains only the last value written. The code is below.

1. Read watermark file:

watermark_file = config_dict["watermark_file"] + "watermark.json"
current_date, flag = read_watermark_file(config_dict.get("out_bucket"), watermark_file)
contents = list_s3_files(opt={'Bucket': config_dict['inp_bucket'], 'Prefix': config_dict['inp_location']})
print("contents :", contents)
for n in range(len(contents)):
    watermark_json = {}
    loop = {}
    zipped_fileName = contents[n].split("/")[-1]
    therapeutic_area = re.match("(.*?)_(.*)", zipped_fileName)[1]
    indication = re.match("(.*?)_(.*?)_(.*)", zipped_fileName)[2]
    print("value of n:", n)
    loop['item_' + str(n)] = {"therapeutic_area": therapeutic_area,
                              "indication": indication,
                              "s3_path": config_dict["inp_location"] + therapeutic_area + "/" + indication + "/"}
    print("loop :", loop)
    watermark_json.update(loop)
    print("watermark_json :", watermark_json)
# update water mark file
print("watermark_file :", watermark_file)
watermark_json['date_dir'] = datetime.datetime.now().strftime("%Y/%m/%d/%H") + "/"
watermark_json['processed_flag'] = False
print("final watermark file ", watermark_json)
# refresh watermark file
write_to_s3(config_dict['out_bucket'], watermark_file, watermark_json, config_dict)

Logs:

2020-08-23T23:00:43.055+05:30
contents : ['mdit/cord/data/inbox/Immunology_COVID-19_Data_202008061200_09.zip', 'mdit/cord/data/inbox/Immunology_SLE_Data_202008131800_01.zip', 'mdit/cord/data/inbox/Neurology_ALZ_Data_202008031800_01.zip']
value of n: 0
loop : {'item_0': {'therapeutic_area': 'Immunology', 'indication': 'COVID-19', 's3_path': 'mdit/cord/data/inbox/Immunology/COVID-19/'}}
watermark_json : {'item_0': {'therapeutic_area': 'Immunology', 'indication': 'COVID-19', 's3_path': 'mdit/cord/data/inbox/Immunology/COVID-19/'}}
value of n: 1
loop : {'item_1': {'therapeutic_area': 'Immunology', 'indication': 'SLE', 's3_path': 'mdit/cord/data/inbox/Immunology/SLE/'}}
watermark_json : {'item_1': {'therapeutic_area': 'Immunology', 'indication': 'SLE', 's3_path': 'mdit/cord/data/inbox/Immunology/SLE/'}}
value of n: 2
loop : {'item_2': {'therapeutic_area': 'Neurology', 'indication': 'ALZ', 's3_path': 'mdit/cord/data/inbox/Neurology/ALZ/'}}
watermark_json : {'item_2': {'therapeutic_area': 'Neurology', 'indication': 'ALZ', 's3_path': 'mdit/cord/data/inbox/Neurology/ALZ/'}}
watermark_file : mdit/cord/technical_metadata/watermark/watermark.json
final watermark file  {'item_2': {'therapeutic_area': 'Neurology', 'indication': 'ALZ', 's3_path': 'mdit/cord/data/inbox/Neurology/ALZ/'}, 'date_dir': '2020/08/23/17/', 'processed_flag': False}

Expected Watermark.json file:

{
    "loop": {
        "item_0":{
                "therapeutic_area": "Immunology",
                "indication": "SLE",
                "s3_path": "mdit/cord/data/inbound/Immunology/SLE/"
            },
        "item_1":{
                "therapeutic_area": "Immunology",
                "indication": "COVID-19",
                "s3_path": "mdit/cord/data/inbound/Immunology/COVID-19/"
            },
        "item_2":{
                "therapeutic_area": "Neurology",
                "indication": "ALZ",
                "s3_path": "mdit/cord/data/inbound/Immunology/ALZ/"
            }
    },
    "date_dir": "2020/08/23/12/",
    "processed_flag": false
}

JSON file actually generated by the code:

{
    "item_2": {
        "therapeutic_area": "Neurology",
        "indication": "ALZ",
        "s3_path": "mdit/cord/data/inbox/Neurology/ALZ/"
    },
    "date_dir": "2020/08/23/17/",
    "processed_flag": false
}

What am I doing wrong in the code?

Asked By: Harshit Kakkar


Answers:

The cause is that watermark_json = {} sits inside the for n in range(len(contents)): loop, so the dictionary is reset to empty on every iteration and only the last item survives. Move that line before the for loop.
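
The reset can be seen in isolation with a minimal sketch. The list names and the variables inside and outside are hypothetical stand-ins, not part of the original script:

```python
names = ["a", "b", "c"]  # hypothetical stand-ins for the S3 keys

# Buggy shape: the dict is re-created on every iteration,
# so earlier entries are thrown away.
for n, name in enumerate(names):
    inside = {}
    inside["item_" + str(n)] = name

# Fixed shape: the dict is created once, before the loop,
# so every iteration adds to the same object.
outside = {}
for n, name in enumerate(names):
    outside["item_" + str(n)] = name

print(inside)   # {'item_2': 'c'}
print(outside)  # {'item_0': 'a', 'item_1': 'b', 'item_2': 'c'}
```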

To get the output you want, the items also need to be nested under a 'loop' key instead of being added at the top level of watermark_json.

You can try the following code:

watermark_file = config_dict["watermark_file"] + "watermark.json"
current_date, flag = read_watermark_file(config_dict.get("out_bucket"), watermark_file)
contents = list_s3_files(opt={'Bucket': config_dict['inp_bucket'], 'Prefix': config_dict['inp_location']})
print("contents :", contents)
watermark_json = {'loop': {}}  # <- This line is changed
for n in range(len(contents)):
    loop = {}
    zipped_fileName = contents[n].split("/")[-1]
    therapeutic_area = re.match("(.*?)_(.*)", zipped_fileName)[1]
    indication = re.match("(.*?)_(.*?)_(.*)", zipped_fileName)[2]
    print("value of n:", n)
    loop['item_' + str(n)] = {"therapeutic_area": therapeutic_area,
                              "indication": indication,
                              "s3_path": config_dict["inp_location"] + therapeutic_area + "/" + indication + "/"}
    print("loop :", loop)
    watermark_json['loop'].update(loop)  # <- This line is changed
    print("watermark_json :", watermark_json)
# update water mark file
print("watermark_file :", watermark_file)
watermark_json['date_dir'] = datetime.datetime.now().strftime("%Y/%m/%d/%H") + "/"
watermark_json['processed_flag'] = False
print("final watermark file ", watermark_json)
# refresh watermark file
write_to_s3(config_dict['out_bucket'], watermark_file, watermark_json, config_dict)
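
To check the structure without touching S3, the same logic can be sketched as a standalone function. This is only an illustration: build_watermark is a hypothetical helper, the keys are copied from the logs in the question, and it assumes every file name follows the <therapeutic_area>_<indication>_<rest> pattern.

```python
import datetime
import json
import re


def build_watermark(keys, inp_location):
    """Build the watermark dict from a list of S3 object keys (hypothetical helper)."""
    watermark = {"loop": {}}  # created once, before the loop
    for n, key in enumerate(keys):
        file_name = key.split("/")[-1]
        # Assumes the name looks like <therapeutic_area>_<indication>_<rest>
        match = re.match(r"(.*?)_(.*?)_(.*)", file_name)
        therapeutic_area, indication = match[1], match[2]
        watermark["loop"]["item_" + str(n)] = {
            "therapeutic_area": therapeutic_area,
            "indication": indication,
            "s3_path": inp_location + therapeutic_area + "/" + indication + "/",
        }
    watermark["date_dir"] = datetime.datetime.now().strftime("%Y/%m/%d/%H") + "/"
    watermark["processed_flag"] = False
    return watermark


# Keys taken from the "contents" line in the logs
keys = [
    "mdit/cord/data/inbox/Immunology_COVID-19_Data_202008061200_09.zip",
    "mdit/cord/data/inbox/Immunology_SLE_Data_202008131800_01.zip",
    "mdit/cord/data/inbox/Neurology_ALZ_Data_202008031800_01.zip",
]
result = build_watermark(keys, "mdit/cord/data/inbox/")
print(json.dumps(result, indent=4))
```

json.dumps serializes the Python False as false, which matches the processed_flag value in the expected file.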
Answered By: Gorisanson