Python on data analysis from CSV file

Question:

I’m a Python beginner. I had inspired by some Python courses.
This is the example CSV file below.

Name Location Number
Andrew Platt Andrew A B C 100
Steven Thunder Andrew A B C 50
Jeff England Steven A B C 30
Andrew England Jeff A B C 30

I want to get a result like that

['Andrew', 180
'Platt', 100
'Steven', 50
'Jeff', 60
'England', 60
'Andrew Platt', 100
'Platt Andrew', 100
'Steven Thunder', 50
'Thunder Andrew', 50
........]

Logic:

  1. One-word name, e.g. ‘Andrew’, as it shows rows 1, 2 and 4, so the result is 180 (100+50+30)
  2. Two-word name, e.g. ‘Andrew Platt’, as it shows row 1 only, so the result is 100
  3. Export result to a new CSV file
import csv
#from itertools import chain

#find one-word
filename=open('sample.csv', 'r')
file = csv.DictReader(filename)
one_word=[]
for col in file:
    one_word.append(col['Name'].split()) #find one-word
print(one_word)
#list(chain.from_iterable(one_word)) #this is another code I learned

#get result
#find two-word
#get result
#combine
#sorted by value
#export to a new CSV file

My problem is how to get value, i.e. 180…, which means I need to match the word, then get ‘Number’ and sum them all?

Note: the location is useless, it’s just a coding practice.

Updated:
Maybe make 2 lists, i.e. one-word and two-word, then zip them

Asked By: Peter

||

Answers:

You need to get the unique names, and find combinations of two names. Then you can find if each name (1 or 2 words) is included in the first column.

import pandas as pd
import numpy as np
import itertools
#this is your data
df = pd.DataFrame([['Andrew Platt Andrew', 'Steven Thunder Andrew', 'Jeff England Steven',
              'Andrew England Jeff'], [100,50,30,30]] ).transpose()
df.columns = ['names','x']

#get the unique names that appear in the columns
names = df.names.apply(lambda x : x.split(' '))
one_words = np.unique(names.sum())

#get all combinations of two names
two_words = [a+' '+b for a,b in itertools.combinations(one_words, 2)]


#fill the dictionnaries with the values 
d_1 = {w : df.loc[df.names.str.contains(w),'x'].sum() for w in one_words}
d_2 = {w : df.loc[df.names.str.contains(w),'x'].sum() for w in two_words}

d = d_1 | d_2 #merge the disctionnaries

The output :

{'Andrew': 180,
 'England': 60,
 'Jeff': 60,
 'Platt': 100,
 'Steven': 80,
 'Thunder': 50,
 'Andrew England': 30,
 'Andrew Jeff': 0,
 'Andrew Platt': 100,
 'Andrew Steven': 0,
 'Andrew Thunder': 0,
 'England Jeff': 30,
 'England Platt': 0,
 'England Steven': 30,
 'England Thunder': 0,
 'Jeff Platt': 0,
 'Jeff Steven': 0,
 'Jeff Thunder': 0,
 'Platt Steven': 0,
 'Platt Thunder': 0,
 'Steven Thunder': 50}
Answered By: endive1783

Looking at your expected result, I’m not sure how you get:

'Andrew Platt', 100
'Platt Andrew', 50

I see "Andrew Platt" and "Platt Andrew" in the first row, but both two-word combos should have the same value of 100, yes?

import csv
from collections import Counter
from itertools import combinations
from pprint import pprint

one_words = Counter()
two_words = Counter()

with open("input.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        items = row["Name"].split(" ")

        # Unique one-word
        for item in set(items):
            one_words[item] += int(row["Number"])

        for two_word in combinations(items, 2):
            # Skip combos like [Andrew Andrew]
            if len(set(two_word)) == 1:
                continue

            print(f"row is {type(row)}")
            print(f"two_word is {type(two_word)}")
            print(f"two_words is {type(two_words)}")

            two_words[" ".join(two_word)] += int(row["Number"])


pprint(one_words)
pprint(two_words)

I got:

Counter({'Andrew': 180,
         'Platt': 100,
         'Steven': 80,
         'England': 60,
         'Jeff': 60,
         'Thunder': 50})
Counter({'Andrew Platt': 100,
         'Platt Andrew': 100,
         'Steven Thunder': 50,
         'Steven Andrew': 50,
         'Thunder Andrew': 50,
         'Jeff England': 30,
         'Jeff Steven': 30,
         'England Steven': 30,
         'Andrew England': 30,
         'Andrew Jeff': 30,
         'England Jeff': 30})

My debug-print statements print:

row is <class 'dict'>
two_word is <class 'tuple'>
two_words is <class 'collections.Counter'>
Answered By: Zach Young
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.