Parsing data from JSON (tweepy) into a pandas dataframe

Question

I’ve streamed tweets with Tweepy and stored them in a text file. Now I want to convert this into a pandas dataframe, but I don’t know how. I’ve looked for similar posts here on Stack Overflow and in the pandas documentation, but I’m still not sure how to start parsing all of this data.

Answer: I solved this by reading the JSON file line by line into a list and then turning that list into a dataframe. Thank you to everyone who helped.

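    import json
    import pandas as pd

    # Each line of the file holds one tweet as a JSON object.
    tweets = []
    with open('tweets.txt', 'r') as f:
        for line in f:
            tweets.append(json.loads(line))

    df = pd.DataFrame(tweets)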
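
A slightly more defensive sketch of the same idea, assuming the file may contain blank lines between tweets (streamed files sometimes do). `load_json_lines` is a hypothetical helper name, not part of Tweepy or pandas:

```python
import json
import pandas as pd

def load_json_lines(path):
    """Read one JSON object per line, skipping blank lines (hypothetical helper)."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # ignore empty lines between records
                records.append(json.loads(line))
    return pd.DataFrame(records)
```

Calling `load_json_lines('tweets.txt')` should then give one row per tweet, with the top-level JSON keys as columns.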

Asked By: Philip Liu

Answer 1

You don’t have to convert your text file to JSON yourself in order to read it as a pandas dataframe; just do:

pd.read_json('yourfile.txt')

and it should work. This assumes that your format is:

{"name": "first json"}

and not:

{"name": "first json"}{"name": "second json"}

However, if you do have the second format, you can use any of these methods (there are many more):

Iterate through the file -> track the open brackets -> create json objects on the go -> append them to a list -> feed the list into pandas.

def parseMultipleJSON(lines):
    # Track brace depth; depth returning to 0 marks the end of one object.
    # Note: braces inside string values are counted too and will break this.
    depth = prev = 0
    data = []
    text = ''.join(lines)
    for idx, char in enumerate(text):
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                data.append(json.loads(text[prev:idx+1]))
                prev = idx+1
    return data

Or split on '}{' and add back the removed brackets:

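def parseMultipleJSON2(lines):
    pieces = ''.join(lines).split('}{')
    data = []
    for idx, piece in enumerate(pieces):
        # split() removed '}{' between objects, so restore braces by position
        # rather than by inspecting the text (nested objects can end in '}').
        if idx > 0:
            piece = '{' + piece
        if idx < len(pieces) - 1:
            piece += '}'
        data.append(json.loads(piece))
    return data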

This is the same as the second solution but abbreviated:

def parseMultipleJSON3(lines):
    pieces = ''.join(lines).split('}{')
    last = len(pieces) - 1
    # Restore '{' on every piece but the first and '}' on every piece but the last.
    return [json.loads(('{' if idx > 0 else '') + piece + ('}' if idx < last else ''))
            for idx, piece in enumerate(pieces)]

Then call whichever one you prefer, like so:

import pandas as pd
import json

with open('yourfile.txt','r') as json_file:
    lines = json_file.readlines()
    lines = [line.strip("\n") for line in lines]
    #data = parseMultipleJSON(lines)
    #data = parseMultipleJSON2(lines)
    data = parseMultipleJSON3(lines)

df = pd.DataFrame(data)
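
The helpers above work on raw characters, so a `{`, `}` or `}{` inside a string value (tweet text, for instance) can throw them off. As a possible alternative, here is a minimal sketch that lets the standard library's `json.JSONDecoder.raw_decode` find where each object ends. The function name `parse_concatenated_json` and the sample string are made up for illustration:

```python
import json
import pandas as pd

def parse_concatenated_json(text):
    """Decode back-to-back JSON values from one string (hypothetical helper)."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not accept leading whitespace, so skip it manually.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Returns the decoded value and the index just past its end.
        record, pos = decoder.raw_decode(text, pos)
        records.append(record)
    return records

# Illustrative input: braces inside a string and a nested object.
sample = '{"name": "first {json}"}{"name": "second json", "user": {"id": 7}}'
df = pd.DataFrame(parse_concatenated_json(sample))
```

Because the decoder itself tracks strings and nesting, no manual brace bookkeeping is needed; the trade-off is that the whole file is held in memory as one string.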

Answered By: chargingupfor

Answer 2

If you have multiple tweets in your JSON file (yourfile.txt), one JSON object per line, and you want to read them all into your dataframe:

df = pd.read_json('yourfile.txt', lines=True)
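
To see what `lines=True` does without touching a real file, here is a small self-contained sketch. The two records are invented sample data, and an in-memory `io.StringIO` buffer stands in for `yourfile.txt`:

```python
import io
import pandas as pd

# Two records, one JSON object per line (hypothetical sample data).
sample = io.StringIO(
    '{"id": 1, "text": "hello"}\n'
    '{"id": 2, "text": "world"}\n'
)

# lines=True tells pandas to parse each line as its own JSON object.
df = pd.read_json(sample, lines=True)
print(df)
```

Each line becomes one row, and the keys of the objects become the columns.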

Answered By: hsn
