How can I clean this string and leave only the text (Python)?

Question:

I have the following string in Python:

"n[[["guns",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["china chinese spy balloon",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["aris hampers grand rapids",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["mountain lion p 22",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["real estate housing market",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["hunter biden",46,[143,362,396,357],{"lm":[],"zf":33,"zh":"Hunter Biden","zi":"American attorney","zl":8,"zp":{"gs_ssp":"eJzj4tLP1TcwycrOK88xYPTiySjNK0ktUkjKTEnNAwBulQip"},"zs":"https://encrypted-tbn0.gstatic.com/images?q\u003dtbn:ANd9GcQaO4eyFc6sDCa7A26Y_9g71clgC0Ot11Elt0KxAFiQo0Ey7Tp69FWxS8o\u0026s\u003d10"}],["maui firefighter tre evans dumaran",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["pope francis benedict",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["coast guard rescue stolen boat",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["lauren boebert",46,[143,362,396,357],{"lm":[],"zf":33,"zh":"Lauren Boebert","zi":"United States Representative","zl":8,"zp":{"gs_ssp":"eJzj4tVP1zc0zDIqMzCrMCswYPTiy0ksLUrNU0jKT01KLSoBAJDsCeg"},"zs":"https://encrypted-tbn0.gstatic.com/images?q\u003dtbn:ANd9GcS1qLJyZQJkVxsOTuP4gnADPLG5oBWe0LWSFClElzhcVrwVCfnNa_s64Zs\u0026s\u003d10"}]],{"ag":{"a":{"8":["Trending searches"]}}}"

How can I clean it using Python so that it outputs only the text:

"guns",
"china chinese spy balloon",
"aris hampers grand rapids",
"mountain lion p 22",
….

Asked By: Sundios


Answers:

I am assuming you left off the last ] character and that the leading n is a stray artifact of copying. With the n removed and the ] added, you have a valid JSON string. You can just parse it and grab the things you want. Here I am assuming you want the strings from the lists:

import json

# Raw, single-quoted literal: the payload contains double quotes and \u escapes.
# The stray leading "n" is dropped and the missing final "]" is added.
s = r'[[["guns",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["china chinese spy balloon",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["aris hampers grand rapids",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["mountain lion p 22",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["real estate housing market",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["hunter biden",46,[143,362,396,357],{"lm":[],"zf":33,"zh":"Hunter Biden","zi":"American attorney","zl":8,"zp":{"gs_ssp":"eJzj4tLP1TcwycrOK88xYPTiySjNK0ktUkjKTEnNAwBulQip"},"zs":"https://encrypted-tbn0.gstatic.com/images?q\u003dtbn:ANd9GcQaO4eyFc6sDCa7A26Y_9g71clgC0Ot11Elt0KxAFiQo0Ey7Tp69FWxS8o\u0026s\u003d10"}],["maui firefighter tre evans dumaran",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["pope francis benedict",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["coast guard rescue stolen boat",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["lauren boebert",46,[143,362,396,357],{"lm":[],"zf":33,"zh":"Lauren Boebert","zi":"United States Representative","zl":8,"zp":{"gs_ssp":"eJzj4tVP1zc0zDIqMzCrMCswYPTiy0ksLUrNU0jKT01KLSoBAJDsCeg"},"zs":"https://encrypted-tbn0.gstatic.com/images?q\u003dtbn:ANd9GcS1qLJyZQJkVxsOTuP4gnADPLG5oBWe0LWSFClElzhcVrwVCfnNa_s64Zs\u0026s\u003d10"}]],{"ag":{"a":{"8":["Trending searches"]}}}]'

obj = json.loads(s)

def get_strings(item):
    """Recursively yield every string found in nested lists (dicts are skipped)."""
    if isinstance(item, str):
        yield item
    elif isinstance(item, list):
        for subitem in item:
            yield from get_strings(subitem)

list(get_strings(obj))

This will give you:

['guns',
 'china chinese spy balloon',
 'aris hampers grand rapids',
 'mountain lion p 22',
 'real estate housing market',
 'hunter biden',
 'maui firefighter tre evans dumaran',
 'pope francis benedict',
 'coast guard rescue stolen boat',
 'lauren boebert']
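
Since every suggestion in this payload sits at the same position, a direct lookup is a possible alternative to the recursive walk. The sketch below assumes the layout shown in the question (the suggestions are in the first element of the outer list, and the text is the first element of each entry) and uses an abbreviated copy of the data. The names raw and extract_suggestions are illustrative, and slicing from the first "[" is just one way to drop a stray prefix such as the leading n.

```python
import json

# Abbreviated sample with the same shape as the string in the question,
# including the stray leading "n" and with the final "]" added.
raw = 'n[[["guns",0,[143,362,396,357],{"zf":33,"zl":8,"zp":{"gs_ss":"1"}}],["hunter biden",46,[143,362,396,357],{"lm":[],"zf":33,"zh":"Hunter Biden","zi":"American attorney","zl":8}]],{"ag":{"a":{"8":["Trending searches"]}}}]'

def extract_suggestions(text):
    """Return the first element of each entry in the first top-level list."""
    payload = json.loads(text[text.index("["):])  # skip anything before the first "["
    return [entry[0] for entry in payload[0]]

suggestions = extract_suggestions(raw)

# Print in the format asked for in the question: quoted, one per line.
print(",\n".join(json.dumps(s) for s in suggestions))
```

This is shorter than the generator, but it breaks if the layout ever changes, whereas the recursive version does not care how deeply the strings are nested.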

This assumes there’s nothing you want in those dictionaries (such as {"zf":33,"zl":8,"zp":{"gs_ss":"1"}}). If there is, it’s simple enough to add another clause to handle dicts, but you will need to work out which text is junk and which is real (it all looked like junk to me).
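
One way the extra clause could look, as a sketch rather than a definitive answer: recurse into dict values as well, but only for keys on an allow-list. The keep_keys parameter and the choice of "zh" and "zi" are assumptions based on the sample data, where those keys appear to hold a display name and a short description.

```python
import json

def get_strings(item, keep_keys=("zh", "zi")):
    """Yield strings from nested lists, plus dict values whose key is in keep_keys."""
    if isinstance(item, str):
        yield item
    elif isinstance(item, list):
        for subitem in item:
            yield from get_strings(subitem, keep_keys)
    elif isinstance(item, dict):
        for key, value in item.items():
            # Hypothetical allow-list: every other key is treated as junk.
            if key in keep_keys:
                yield from get_strings(value, keep_keys)

# Abbreviated entry taken from the question's data.
sample = json.loads('[["hunter biden",46,[143,362,396,357],{"lm":[],"zf":33,"zh":"Hunter Biden","zi":"American attorney","zl":8}]]')
result = list(get_strings(sample))
print(result)
```

Passing an empty tuple for keep_keys reproduces the original behaviour of ignoring dictionaries entirely.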

Answered By: Mark