How to process a list of sets with multiple elements per set

Question:

When I scrape websites for all the emails on each website and output the result, I get a data frame in which the emails for each website are a list of sets, each set holding multiple elements:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': [{'[email protected]', '[email protected]'}, set(),{'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}]}, 
                                    {'main_url': 'http://kirsebaergaarden.com', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}]},
                                     {'main_url': 'http://koglernes.dk', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'},set(), set(), {'[email protected]'}, {'[email protected]'}]},
                                      {'main_url': 'http://kongehojensbornehave.dk', 'emails': [set()]}
                                   ])


However, I want to process the data frame to look like the following:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]','[email protected]', '[email protected]', '[email protected]', '[email protected]',  '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']},                                        
                                     {'main_url': 'http://kirsebaergaarden.com', 'emails': ['[email protected]']},
                                     {'main_url': 'http://koglernes.dk', 'emails': ['[email protected]']},
                                      {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                                   ])


How can it be achieved?

I have tried the following code, but it only manages to return the first element of the first set, and it runs into an error when there are no elements in the email list for a given website:

URL_WITH_EMAILS_DF['emails'] = [', '.join(x.pop()) if not None else "" for x in URL_WITH_EMAILS_DF['emails'].values]

PS: In the first data frame, I collect a set of emails per web page because a single website can have multiple web pages, and I do not want to take duplicate emails from each web page.

Asked By: Shihab Ullah


Answers:

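chain.from_iterable from itertools can solve this problem: it flattens the list of sets into a single sequence of emails, and wrapping the result in set() removes the duplicates.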
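To see what chain.from_iterable does on its own, here is a minimal sketch without pandas. The addresses and the name `pages` are made up for illustration; they are not taken from the question's data.

```python
from itertools import chain

# Hypothetical data: one set of emails per scraped page of a single website.
pages = [
    {"info@example.com", "sales@example.com"},
    set(),                      # a page without any emails
    {"info@example.com"},       # a duplicate of an address seen earlier
]

# chain.from_iterable walks through every set in turn and yields its elements.
flat = list(chain.from_iterable(pages))

# Converting to a set drops the duplicates across pages.
unique = set(flat)

print(len(flat))    # 3 addresses in total, duplicates included
print(len(unique))  # 2 distinct addresses
```

Empty sets contribute nothing to the flattened sequence, so they need no special handling.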

from itertools import chain

import pandas as pd
URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': [{'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]', '[email protected]'}]}, 
                                    {'main_url': 'http://kirsebaergaarden.com', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]', '[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}]},
                                     {'main_url': 'http://koglernes.dk', 'emails': [{'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}, {'[email protected]'}]},
                                      {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                                   ])


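# Flatten each row's list of sets into one list of unique emails (the order of the result is arbitrary)
URL_WITH_EMAILS_DF['emails'] = URL_WITH_EMAILS_DF.emails.apply(lambda x: list(set(chain.from_iterable(x))))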
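
Because list(set(...)) returns the emails in arbitrary order, a variant that sorts the result may be easier to compare between runs. The sketch below is one possible way to do that, not a required change; the helper name `flatten_emails`, the `rows` dict and the addresses are hypothetical. It uses plain Python so it runs without pandas, and it covers both a row containing only set() and a row with an empty list.

```python
from itertools import chain


def flatten_emails(sets_per_page):
    """Merge the per-page sets into one sorted list of unique emails."""
    return sorted(set(chain.from_iterable(sets_per_page)))


# Hypothetical stand-in for the 'main_url' / 'emails' columns.
rows = {
    "http://site-a.example": [
        {"info@site-a.example", "sales@site-a.example"},
        set(),
        {"info@site-a.example"},
    ],
    "http://site-b.example": [set()],  # pages scraped, no emails found
    "http://site-c.example": [],       # no pages at all
}

result = {url: flatten_emails(sets) for url, sets in rows.items()}

print(result["http://site-a.example"])  # ['info@site-a.example', 'sales@site-a.example']
print(result["http://site-b.example"])  # []
print(result["http://site-c.example"])  # []
```

With the data frame from the question, the same helper could be applied as URL_WITH_EMAILS_DF['emails'].apply(flatten_emails). In the attempt from the question, the condition `if not None` is always true, so it never guards against empty input, and x.pop() takes only a single set out of each list.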
Answered By: Chris