Python – openpyxl – Use openpyxl to get number of rows that contain a specific value

Question

I’m newer to Python. I’m using openpyxl for a SEO project for my brother and I’m trying to get a number of rows that contain a specific value in them.

I have a spreadsheet that looks something like this:

I want to write a program that will get the keywords and parse them to a string by state, so like:
Missouri = "search item 1, search item 2, search item 5, search item 6"
Illinois = "search item 3, search item 4"

I have thus far created a program like this:

    #first, import openpyxl
    import openpyxl

    #next, give location of file
    path = "testExcel.xlsx"

    #Open workbook by creating object
    wb_object = openpyxl.load_workbook(path)

    #Get workbook active sheet object
    sheet_object = wb_object.active

    #Getting the value of maximum rows
    #and column
    row = sheet_object.max_row
    column = sheet_object.max_column
    print("Total Rows:", row)
    print("Total Columns:", column)

    #printing the value of forth column, state
    #Loop will print all values
    #of first column
    print("nValue of fourth column")
    for i in range(4, row + 1):
        cell_object = sheet_object.cell(row=i, column=4)
        split_item_test = cell_object.value.split(",")
        split_item_test_result = split_item_test[0]
        state = split_item_test_result
        print(state)
        if (state == 'Missouri'):
            print(state.count('Missouri'))
    print("All good")

The problem is after doing this, I see that it prints 1 repeatedly, but not a total number for Missouri. I would like a total number of mentions of the state, and then eventually get it to a string with each search criteria.

Is this possible with openpyxl? Or will I need a different library?

Asked By: KatieJRise

||

Source

Answer 1

ranemirusG is right, there are several ways to obtain the same result. Here’s another option…I attempted to preserve your thought process, good luck.

print("nValue of fourth column")

missouri_list = [] # empty list
illinois_list = [] # empty list

for i in range(2, row+1): # It didn't look like "4, row+1" captured the full sheet, try (2, row+1)
    cell_object = sheet_object.cell(row=i, column=4)
    keyword = sheet_object.cell(row=i, column=1)
    keyword_fmt = keyword.value # Captures values in Keyword column
    split_item_test = cell_object.value.split(",")
    split_item_test_result = split_item_test[1] # 1 captures states
    state = split_item_test_result
    print(state)

    # simple if statement to capture results in a list
    if 'Missouri' in state:
        missouri_list.append(keyword_fmt)
    if 'Illinois' in state:
        illinois_list.append(keyword_fmt)
print(missouri_list)
print(len(missouri_list)) # Counts the number of occurances
print(illinois_list)
print(len(illinois_list)) # Counts the number of occurances
print("All good")

Answered By: j-p

Answer 2

Yes, it’s possible with openpyxl.
To achieve your real goal try something like this:

states_and_keywords  = {}
for i in range(4, row + 1):
    cell_object = sheet_object.cell(row=i, column=4)
    split_item_test = cell_object.value.split(",")
    split_item_test_result = split_item_test[1] #note that the element should be 1 for the state
    state = split_item_test_result.strip(" ") #trim whitespace (after comma)
    keyword = cell_object.offset(0,-3).value #this gets the value of the keyword for that row
    if state not in states_and_keywords:
        states_and_keywords[state] = [keyword]
    else:
        states_and_keywords[state].append(keyword) 
print(states_and_keywords)

Answered By: ranemirusG

Answer 3

Ok another option
This will create a dictionary ‘state_dict’ in the format per your question

Missouri = "search item 1, search item 2, search item 5, search item
6"
Illinois = "search item 3, search item 4"

...
print("nValue of fourth column")
state_dict = {}
for row in sheet_object.iter_rows(min_row=2, max_row=sheet_object.max_row):
    k = row[3].value.split(',')[1].strip()
    v = row[0].value
    if k in state_dict:
        state_dict[k] += [v]
    else:
        state_dict[k] = [v]

### Print values
for key, value in state_dict.items():
    print(f'{key}, Total {len(value)}', end='; ')
    for v in value:
        print(f'{v}', end=', ')
    print('')

Will create the dictionary ‘state_dict’ as so;

'Missouri' = {list: 4} ['search item 1', 'search item 2', 'search item 5', 'search item 6']
'Illinois' = {list: 2} ['search item 3', 'search item 4']
'Alabama' = {list: 1} ['search item 7']
'Colorado' = {list: 1} ['search item 8']

Print output

Value of fourth column
Missouri = Total 4; search item 1, search item 2, search item 5, search item 6, 
Illinois = Total 2; search item 3, search item 4, 
Alabama = Total 1; search item 7, 
Colorado = Total 1; search item 8,

###————–Additional Information ———————–###
Updated the state_dict to include the rank details for each item.
The output display now shows each items for each state in rank order. You have two options on how the data may be restricted;
Maximum rank to show, the variable rank_max = 100 determines the highest rank the output will display so if it’s set to 5 then only ranks 1, 2, 3, 4 and 5 will be displayed if the State has items with those ranks.
The total_ranks_to_display determines number of ranks to display. So regardless of the rank_max value this will restrict the number of ranks shows with rank 1 as top. Example; if you set rank_max to 10 and this would show 8 rows for Missouri then setting the total_ranks_to_display to 4 will mean only the top 4 ranks will show.
You can use either or both to achieve what you need I think.

...
print("nValue of fourth column")
state_dict = {}
for row in sheet_object.iter_rows(min_row=2, max_row=sheet_object.max_row):
    k = row[3].value.split(',')[1].strip()
    v = row[0].value
    r = row[2].value
    if k in state_dict:
        if r in state_dict[k]:
            state_dict[k][r] += [v]
        else:
            state_dict[k].update({r: [v]})
    else:
        state_dict[k] = {r: [v]}


rank_max = 10
total_ranks_to_display = 4
for key, value in state_dict.items():
    print(f'{key}')
    top_count = 0
    for i in range(1, rank_max):
        if i in state_dict[key]:
            top_count += 1
            cur_rank = state_dict[key][i]
            total = len(cur_rank)
            print(f'Rank: {i} Total: {total} {cur_rank}')
        if top_count == total_ranks_to_display:
            break

Example Output
Max rank is 10 and the total ranks to display is 4

Value of fourth column
Missouri
Rank: 1 Total: 1 ['search item 19']
Rank: 2 Total: 3 ['search item 5', 'search item 13', 'search item 18']
Rank: 3 Total: 2 ['search item 14', 'search item 20']
Rank: 4 Total: 1 ['search item 22']
Alabama
Rank: 1 Total: 2 ['search item 26', 'search item 28']
Rank: 2 Total: 2 ['search item 3', 'search item 12']
Rank: 3 Total: 1 ['search item 6']
Rank: 5 Total: 1 ['search item 15']
Illinois
Rank: 1 Total: 2 ['search item 11', 'search item 17']
Rank: 2 Total: 1 ['search item 24']
Rank: 3 Total: 1 ['search item 4']
Rank: 6 Total: 1 ['search item 23']
Colorado
Rank: 2 Total: 1 ['search item 21']
Rank: 3 Total: 3 ['search item 25', 'search item 27', 'search item 29']
Rank: 4 Total: 1 ['search item 30']
Rank: 6 Total: 2 ['search item 8', 'search item 16']

Answered By: moken

Python – openpyxl – Use openpyxl to get number of rows that contain a specific value

Question:

Answers: