Python pandas: how to create a new row based on missing value from a column?

Question:

Suppose I have a dataframe like this:

country year value
A 2008 1
A 2011 1
B 2008 1
B 2011 1

I want to add missing year per country, in this case 2009 and 2010, with desired output like this:

country year value
A 2008 1
A 2009
A 2010
A 2011 1
B 2008 1
B 2009
B 2010
B 2011 1

How can I do that?
Thanks in advance!

Asked By: newbiedataanalyst

||

Answers:

arr1 = [['A', 2008, 1],['A', 2011, 1],['B', 2008, 1],['B', 2011, 1]]

arr2 = [['A', 2008, 1],['A', 2009, None],['A', 2010, None],à['A', 2011, 1],['B', 2008, 1],['B', 2009, None],['B', 2010, None],['B', 2011, 1]]

for elm in arr2:
    if elm not in arr1:
        arr1.append(elm)

First let’s create your dataset for the MCVE sake:

import pandas as pd

frame = pd.DataFrame([
    {"country": "A", "year": 2008, "value": 1},
    {"country": "A", "year": 2011, "value": 1},
    {"country": "B", "year": 2008, "value": 1},
    {"country": "B", "year": 2011, "value": 1},
])

Then we create the missing data by ruling from min(year) to max(year):

extension = frame.groupby("country")["year"].agg(["min", "max"]).reset_index()
extension["year"] = extension.apply(lambda x: list(range(x["min"], x["max"] + 1)), axis=1)

#   country   min   max                      year
# 0       A  2008  2011  [2008, 2009, 2010, 2011]
# 1       B  2008  2011  [2008, 2009, 2010, 2011]

Exploding the structure gives the correct format but without values:

extension = extension.explode("year")[["country", "year"]]
extension["year"] = extension["year"].astype(int)

#   country  year
# 0       A  2008
# 0       A  2009
# 0       A  2010
# 0       A  2011
# 1       B  2008
# 1       B  2009
# 1       B  2010
# 1       B  2011

Then we merge back with the original data to get the values:

results = frame.merge(extension, how="right", on=["country", "year"])

#   country  year  value
# 0       A  2008    1.0
# 1       A  2009    NaN
# 2       A  2010    NaN
# 3       A  2011    1.0
# 4       B  2008    1.0
# 5       B  2009    NaN
# 6       B  2010    NaN
# 7       B  2011    1.0

The advantage of this method – in addition of being purely pandas – is that it is robust against data variation:

frame = pd.DataFrame([
    {"country": "A", "year": 2008, "value": 1},
    {"country": "A", "year": 2011, "value": 2},
    {"country": "B", "year": 2005, "value": 1},
    {"country": "B", "year": 2009, "value": 2},
    {"country": "C", "year": 2008, "value": 1},
    {"country": "C", "year": 2010, "value": 2},
    {"country": "C", "year": 2012, "value": 3},
])

#    country  year  value
# 0        A  2008    1.0
# 1        A  2009    NaN
# 2        A  2010    NaN
# 3        A  2011    2.0
# 4        B  2005    1.0
# 5        B  2006    NaN
# 6        B  2007    NaN
# 7        B  2008    NaN
# 8        B  2009    2.0
# 9        C  2008    1.0
# 10       C  2009    NaN
# 11       C  2010    2.0
# 12       C  2011    NaN
# 13       C  2012    3.0
Answered By: jlandercy

Let’s create a dataframe first as follows :

import pandas as pd
data = {'country' : ['A', 'A', 'B', 'B'], 
        'year' : ['2008', '2011', '2008', '2011'], 
        'value':[1,1,1,1]}
df = pd.DataFrame(data=data)

Created dataset :

  country  year  value
0       A  2008      1
1       A  2011      1
2       B  2008      1
3       B  2011      1

Lets define the years we need to consider :

yr_list = ['2008', '2009', '2010', '2011']

Lets modify the dataset based on our requirement :

for country in df['country'].unique() : 
  for yr in yr_list :
    if yr not in list(df.loc[df['country'] == country, 'year']): 
      update_data = {'country' : country, 'year' : yr}
      df = df.append(update_data, ignore_index = True)

final_df = df.sort_values(by = ['country', 'year'],ignore_index=True)
print(final_df) 

The final output :

  country  year  value
0       A  2008    1.0
1       A  2009    NaN
2       A  2010    NaN
3       A  2011    1.0
4       B  2008    1.0
5       B  2009    NaN
6       B  2010    NaN
7       B  2011    1.0
Answered By: Kovarthanan Kesavan

One option is with the complete function from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

Create a dictionary, with an anonymous function, containing all possible years:

new_years = {'year': lambda year: range(year.min(), year.max() + 1)}

Use the dictionary within complete, with the by parameter, so it is applied per group:

df.complete(new_years, by = 'country')
  country  year  value
0       A  2008    1.0
1       A  2009    NaN
2       A  2010    NaN
3       A  2011    1.0
4       B  2008    1.0
5       B  2009    NaN
6       B  2010    NaN
7       B  2011    1.0
Answered By: sammywemmy
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.