How to extract data from a field in a JSON Lines file and store it in a new text file in Python
Question:
I have a JSON file that looks like this:
{"reviewerID": "A11N155CW1UV02", "asin": "B000H00VBQ", "reviewerName": "AdrianaM", "helpful": [0, 0], "reviewText": "I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all.", "overall": 2.0, "summary": "A little bit boring for me", "unixReviewTime": 1399075200, "reviewTime": "05 3, 2014"}
{"reviewerID": "A3BC8O2KCL29V2", "asin": "B000H00VBQ", "reviewerName": "Carol T", "helpful": [0, 0], "reviewText": "I highly recommend this series. It is a must for anyone who is yearning to watch \"grown up\" television. Complex characters and plots to keep one totally involved. Thank you Amazin Prime.", "overall": 5.0, "summary": "Excellent Grown Up TV", "unixReviewTime": 1346630400, "reviewTime": "09 3, 2012"}
{"reviewerID": "A60D5HQFOTSOM", "asin": "B000H00VBQ", "reviewerName": "Daniel Cooper \"dancoopermedia\"", "helpful": [0, 1], "reviewText": "This one is a real snoozer. Don't believe anything you read or hear, it's awful. I had no idea what the title means. Neither will you.", "overall": 1.0, "summary": "Way too boring for me", "unixReviewTime": 1381881600, "reviewTime": "10 16, 2013"}
I need to extract the data from the "summary" and "reviewText" fields and store it in two new files for further analysis, such as tokenization.
I am trying this:
import json
rt = open("review.txt", "a")  # opens (or creates) the file for appending
su = open("summary.txt", "a")
with open("/Users/anano/Desktop/MAXWELL/SPRING/NLP/Amazon_Instant_Video_5.json") as json_file:
    for line in json_file:  # loop over the file, one JSON record per line
        data = json.loads(line)
        rt.write(data['reviewText'])
        su.write(data['summary'])
rt.close()
su.close()
Because the summaries do not end with a period, all the strings are saved as one run-on sentence, like this:
A little bit boring for meExcellent Grown Up TVWay too boring for meRobson Green is mesmerizing
This makes tokenization impossible. How can I solve this problem?
Answers:
All you need to do is add \n to the end of each string you write. (\n is an escape sequence in string literals that stands for the newline character; write() does not add one for you.)
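To see the effect in isolation, here is a minimal sketch that writes two of the summaries from the question into in-memory buffers instead of real files. Using io.StringIO and the variable names below is an assumption made for illustration only; the behaviour of write() is the same for a file opened with open().

```python
import io

summaries = ["A little bit boring for me", "Excellent Grown Up TV"]

# write() emits exactly the string it is given; it never adds a line break.
joined = io.StringIO()
for s in summaries:
    joined.write(s)

# Appending "\n" puts each summary on its own line.
separated = io.StringIO()
for s in summaries:
    separated.write(s + "\n")

print(repr(joined.getvalue()))     # 'A little bit boring for meExcellent Grown Up TV'
print(repr(separated.getvalue()))  # 'A little bit boring for me\nExcellent Grown Up TV\n'
```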
So your code becomes:
import json
rt = open("review.txt", "a")  # opens (or creates) the file for appending
su = open("summary.txt", "a")
with open("/Users/anano/Desktop/MAXWELL/SPRING/NLP/Amazon_Instant_Video_5.json") as json_file:
    for line in json_file:  # loop over the file, one JSON record per line
        data = json.loads(line)
        rt.write(data['reviewText'] + '\n')  # newline keeps each review on its own line
        su.write(data['summary'] + '\n')
rt.close()
su.close()
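A variation on the same idea, offered as a sketch rather than the only way to do it: all three files can be opened in one with statement so they are closed even if a line fails to parse. The function name extract_fields and its parameters are hypothetical, UTF-8 encoding is an assumption about the data, and mode "w" is used instead of "a" so that rerunning the script does not append duplicate lines. Collapsing whitespace inside each text is also an addition, made so that a review containing its own line breaks still occupies exactly one line.

```python
import json


def extract_fields(src_path, review_path, summary_path):
    """Write reviewText and summary from each JSON line to two files, one record per line."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(review_path, "w", encoding="utf-8") as rt, \
         open(summary_path, "w", encoding="utf-8") as su:
        for line in src:
            line = line.strip()
            if not line:  # skip blank lines instead of failing in json.loads
                continue
            data = json.loads(line)
            # Collapse runs of whitespace (including embedded newlines) into
            # single spaces so that each record stays on one output line.
            rt.write(" ".join(data["reviewText"].split()) + "\n")
            su.write(" ".join(data["summary"].split()) + "\n")
            count += 1
    return count  # number of records written
```

It could be called as extract_fields("Amazon_Instant_Video_5.json", "review.txt", "summary.txt"); each output file can then be read back line by line for tokenization.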