Accessing data in blob object from download_as_string in Python

Question:

I am trying to access and modify data in a newline JSON file pulled from Google Cloud Storage in Google Cloud Functions. The results always show up as numbers despite that not being the data in the JSON.

I see that download_as_string() for blob object returns Bytes (https://googleapis.github.io/google-cloud-python/latest/_modules/google/cloud/storage/blob.html#Blob.download_as_string) but in any references I see, everyone is able to access their data just fine.

I am doing this in Cloud Functions but I think my question would apply in any GCP tool.

My example below should simply load the newline JSON data, add it to a list, select the first two dictionary entries, convert them back to newline JSON, and write the result to a JSON file on GCS. Samples, code, and the bad output are listed below.

Sample newline JSON input

{"Website": "Google", "URL": "Google.com", "ID": 1}
{"Website": "Bing", "URL": "Bing.com", "ID": 2}
{"Website": "Yahoo", "URL": "Yahoo.com", "ID": 3}
{"Website": "Yandex", "URL": "Yandex.com", "ID": 4}

Code in Cloud Function

import requests
import json
import csv
from datetime import datetime, timedelta
import sys
from collections import OrderedDict
import os
import random

from google.cloud import bigquery
from google.cloud import storage

def importData(request, execution):
    # Read the data from Google Cloud Storage
    read_storage_client = storage.Client()

    # Set buckets and filenames
    bucket_name = "sample_bucket"
    filename = 'sample_json_output.json'

    # get bucket with name
    bucket = read_storage_client.get_bucket('sample_bucket')
    # get bucket data as blob
    blob = bucket.get_blob('sample_json.json')
    # download as string
    json_data = blob.download_as_string()

    # create list 
    website_list = []
    for u,y in enumerate(json_data):
        website_list.append(y)

    # select first two
    website_list = website_list[0:2]

    # Create new-line JSON
    results_ready = '\n'.join(json.dumps(item) for item in website_list)

    # Write the data to Google Cloud Storage
    write_storage_client = storage.Client()

    write_storage_client.get_bucket(bucket_name) \
        .blob(filename) \
        .upload_from_string(results_ready)

Current output in sample_json_output.json file

123
34

Expected output

{"Website": "Google", "URL": "Google.com", "ID": 1}
{"Website": "Bing", "URL": "Bing.com", "ID": 2}

Update 6/6: If I write a file straight from the download_as_string blob, then it writes the JSON file perfectly, but I need to access the contents first.

import requests
import json
import csv
from datetime import datetime, timedelta
import sys
from collections import OrderedDict
import os
import random

from google.cloud import bigquery
from google.cloud import storage

def importData(request, execution):

    # Read the data from Google Cloud Storage
    read_storage_client = storage.Client()

    # Set buckets and filenames
    bucket_name = "sample_bucket"
    filename = 'sample_json_output.json'

    # get bucket with name
    bucket = read_storage_client.get_bucket('sample_bucket')

    # get bucket data as blob
    blob = bucket.get_blob('sample_json.json')

    # convert to string
    json_data = blob.download_as_string()


    # Write the data to Google Cloud Storage
    write_storage_client = storage.Client()

    write_storage_client.get_bucket(bucket_name) \
        .blob(filename) \
        .upload_from_string(json_data)

Update 6/6 Output

{"Website": "Google", "URL": "Google.com", "ID": 1}
{"Website": "Bing", "URL": "Bing.com", "ID": 2}
{"Website": "Yahoo", "URL": "Yahoo.com", "ID": 3}
{"Website": "Yandex", "URL": "Yandex.com", "ID": 4}
Asked By: AngryWhopper

Answers:

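When you read the blob into json_data you get a bytes object, and iterating over a bytes object yields the integer value of each byte, not lines or characters. That is where 123 and 34 come from: they are the byte values of { and ". Below is an example that creates a list of dicts from the bytes object by splitting it on newlines and decoding each row.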

json_data
b'{"Website": "Google", "URL": "Google.com", "ID": 1}\n{"Website": "Bing", "URL": "Bing.com", "ID": 2}\n{"Website": "Yahoo", "URL": "Yahoo.com", "ID": 3}\n{"Website": "Yandex", "URL": "Yandex.com", "ID": 4}\n'

type(json_data)                                                                                                                                                                                           
bytes

website_list = [json.loads(row.decode('utf-8')) for row in json_data.split(b'\n') if row]

website_list                                                                                                                                                                                              
[{'Website': 'Google', 'URL': 'Google.com', 'ID': 1},
 {'Website': 'Bing', 'URL': 'Bing.com', 'ID': 2},
 {'Website': 'Yahoo', 'URL': 'Yahoo.com', 'ID': 3},
 {'Website': 'Yandex', 'URL': 'Yandex.com', 'ID': 4}]
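
To tie this back to the question, here is a minimal sketch that reproduces the 123 and 34 output locally and then produces the expected two-line result. It assumes the bytes below stand in for what blob.download_as_string() returns (hard-coded so the snippet runs without GCS access); first_n_records is a hypothetical helper name, not part of any library.

```python
import json

# Stand-in for the value returned by blob.download_as_string().
# Assumption: hard-coded here so the example runs without GCS credentials.
raw = (
    b'{"Website": "Google", "URL": "Google.com", "ID": 1}\n'
    b'{"Website": "Bing", "URL": "Bing.com", "ID": 2}\n'
    b'{"Website": "Yahoo", "URL": "Yahoo.com", "ID": 3}\n'
)

# Iterating over bytes yields one integer per byte, which is what the
# original loop appended to website_list.
first_two_bytes = list(raw)[0:2]


def first_n_records(data: bytes, n: int) -> str:
    """Decode newline-delimited JSON bytes, keep n records, re-serialize."""
    lines = data.decode("utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    return "\n".join(json.dumps(record) for record in records[:n])


result = first_n_records(raw, 2)
print(first_two_bytes)
print(result)
```

The string in result can then be passed to upload_from_string as in the question.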
Answered By: Diego Rodríguez

I was able to get the result you wanted using an approach similar to yours, with the ndjson library parsing the newline JSON. The code is below.

import requests
import json
import ndjson
import csv
from datetime import datetime, timedelta
import sys
from collections import OrderedDict
import os
import random

from google.cloud import bigquery
from google.cloud import storage

def importData(request, execution):

    # Read the data from Google Cloud Storage
    read_storage_client = storage.Client()

    # Set buckets and filenames
    bucket_name = "bucket-name"
    filename = "sample_json_output.json"

    # get bucket with name
    bucket = read_storage_client.get_bucket(bucket_name)

    # get bucket data as blob
    blob = bucket.get_blob("sample_json.json")

    # download the blob contents (returned as bytes)
    json_data_string = blob.download_as_string()

    json_data = ndjson.loads(json_data_string)

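    # collect the parsed records into a list
    records = []
    for item in json_data:
        records.append(item)

    # keep the first two records
    first_two = records[0:2]

    # serialize each record back to JSON, one record per line
    result = ""
    for item in first_two:
        result = result + json.dumps(item) + "\n"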


    # Write the data to Google Cloud Storage
    write_storage_client = storage.Client()

    write_storage_client.get_bucket(bucket_name) \
        .blob(filename) \
        .upload_from_string(result)
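
One detail worth checking when turning records back into newline JSON: str() on a dict gives Python's repr with single quotes, which is not valid JSON, while json.dumps() gives a line that a JSON parser can read back. A small sketch of the difference, using a sample record from the question:

```python
import json

record = {"Website": "Google", "URL": "Google.com", "ID": 1}

as_repr = str(record)         # Python repr, uses single quotes
as_json = json.dumps(record)  # valid JSON, uses double quotes

# The repr form cannot be parsed back as JSON.
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

print(as_repr)
print(as_json)
print(repr_is_valid_json)
```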
Answered By: Corinne White

The text would be a regular string (str) instead of bytes if you replaced

json_data = blob.download_as_string()

by

json_data = blob.download_as_text()
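
As a rough sketch of how the rest of the function could look with this change: read_ndjson and to_ndjson are hypothetical helper names, and _FakeBlob is only a stand-in so the snippet runs without GCS access. With a real google.cloud.storage blob you would pass the blob itself, assuming your installed version provides download_as_text().

```python
import json


def read_ndjson(blob, limit=None):
    """Parse a newline-delimited JSON blob into a list of dicts."""
    text = blob.download_as_text()  # str, so no manual decode is needed
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    return records if limit is None else records[:limit]


def to_ndjson(records):
    """Serialize records as one JSON document per line."""
    return "\n".join(json.dumps(record) for record in records)


class _FakeBlob:
    """Stand-in for a storage blob (assumption: used for local runs only)."""

    def __init__(self, text):
        self._text = text

    def download_as_text(self):
        return self._text


sample = _FakeBlob(
    '{"Website": "Google", "URL": "Google.com", "ID": 1}\n'
    '{"Website": "Bing", "URL": "Bing.com", "ID": 2}\n'
    '{"Website": "Yahoo", "URL": "Yahoo.com", "ID": 3}\n'
)
output = to_ndjson(read_ndjson(sample, limit=2))
print(output)
```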
Answered By: Vincent