How do I properly call a function and return an updated dataframe?
Question:
I am trying to process and update rows in a dataframe through a function, and return the dataframe to finish using it. When I try to return the dataframe to the original function call, it returns a series and not the expected column updates. A simple example is below:
import pandas as pd

df = pd.DataFrame(['adam', 'ed', 'dra', 'dave', 'sed', 'mike'],
                  index=['a', 'b', 'c', 'd', 'e', 'f'], columns=['A'])

def get_item(data):
    comb = pd.DataFrame()
    comb['Newfield'] = data  # create new columns
    comb['AnotherNewfield'] = 'y'
    return pd.DataFrame(comb)
Calling the function using apply:
>>> newdf = df['A'].apply(get_item)
>>> newdf
a A Newfield AnotherNewfield
a adam st...
b A Newfield AnotherNewfield
e sed st...
c A Newfield AnotherNewfield
d dave st...
d A Newfield AnotherNewfield
d dave st...
e A Newfield AnotherNewfield
s NaN st...
f A Newfield AnotherNewfield
m NaN str(...
Name: A, dtype: object
>>> type(newdf)
<class 'pandas.core.series.Series'>
I assume that apply() is bad here, but am not quite sure how I ‘should’ be updating this dataframe via function otherwise.
Edit: I apologize, but it seems I accidentally deleted the sample function during an edit. I have added it back here while I try a few other things I found in other posts.
Testing in a slightly different manner, with individual variables and returning multiple Series, seems to work, so I will see whether I can do this in my actual case and update.
def get_item(data):
    value = data  # create new columns
    AnotherNewfield = 'y'
    return pd.Series(value), pd.Series(AnotherNewfield)

df['B'], df['C'] = zip(*df['A'].apply(get_item))
Answers:
For anyone looking for a potential answer to this: I got the desired result with the code below, which I found in another post. I will post the author's name to give credit. It let me edit the function and get the values it creates into separate columns via apply:
def get_item(data):
    value = data  # create new columns using variables
    AnotherNewfield = 'y'
    return pd.Series(value), pd.Series(AnotherNewfield)

>>> df['B'], df['C'] = zip(*df['A'].apply(get_item))
>>> df
A B C
a adam (adam,) (y,)
b ed (ed,) (y,)
c dra (dra,) (y,)
d dave (dave,) (y,)
e sed (sed,) (y,)
f mike (mike,) (y,)
>>>
The only problem is that the parentheses and comma come along with the data. I intend to strip them in code outside the function, perhaps like this:
>>> import re
>>> df['B'] = df['B'].apply(lambda x: re.sub(r"[^a-zA-Z0-9-]+", ' ', str(x)))
>>> df
A B C
a adam adam (y,)
b ed ed (y,)
c dra dra (y,)
d dave dave (y,)
e sed sed (y,)
f mike mike (y,)
>>> df['C'] = df['C'].apply(lambda x: re.sub(r"[^a-zA-Z0-9-]+", ' ', str(x)))
>>> df
A B C
a adam adam y
b ed ed y
c dra dra y
d dave dave y
e sed sed y
f mike mike y
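The parentheses appear to come from wrapping each value in its own pd.Series before returning it. A minimal sketch of one way to skip the regex cleanup, assuming the function only needs to hand back plain values: return a bare tuple instead. The name get_item_plain is hypothetical and not from the original post.

```python
import pandas as pd

df = pd.DataFrame(
    ['adam', 'ed', 'dra', 'dave', 'sed', 'mike'],
    index=['a', 'b', 'c', 'd', 'e', 'f'],
    columns=['A'],
)

def get_item_plain(data):
    # Hypothetical variant of get_item: return bare values, not pd.Series objects
    value = data
    another_new_field = 'y'
    return value, another_new_field

# apply() yields one (value, 'y') tuple per row; zip(*...) regroups them by position
col_b, col_c = zip(*df['A'].apply(get_item_plain))
df['B'] = list(col_b)
df['C'] = list(col_c)
```

With plain values in the tuple, columns B and C hold ordinary strings, so no re.sub pass should be needed afterwards.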
You could use groupby with apply to get a dataframe back from the apply call, like this:
import pandas as pd

# add new column B for groupby - we need a single group only to do the trick
df = pd.DataFrame(
    {'A': ['adam', 'ed', 'dra', 'dave', 'sed', 'mike'], 'B': [1, 1, 1, 1, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f'])

def get_item(data):
    # create empty dataframe to be returned
    comb = pd.DataFrame(columns=['Newfield', 'AnotherNewfield'], data=None)
    # copy the series data into the column with a fresh integer index
    # (Series.append was removed in pandas 2.0, so reset_index is used instead)
    comb['Newfield'] = data['A'].reset_index(drop=True)
    comb['AnotherNewfield'] = 'y'
    # return complete dataframe
    return comb

# group on column B so that apply passes the single group to get_item as a dataframe
newdf = df.groupby('B').apply(get_item)
# after processing, newdf has a MultiIndex - simply remove the 0-level (index col B with value 1 gained from the groupby operation)
newdf = newdf.droplevel(0)
Output:
Newfield AnotherNewfield
0 adam y
1 ed y
2 dra y
3 dave y
4 sed y
5 mike y
This will work:
df = pd.DataFrame(['adam', 'ed', 'dra', 'dave', 'sed', 'mike'],
                  index=['a', 'b', 'c', 'd', 'e', 'f'], columns=['A'])

def get_item(data):
    comb = pd.DataFrame()
    comb['Newfield'] = data  # create new columns
    comb['AnotherNewfield'] = 'y'
    return comb

new_df = get_item(df)
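If the per-row apply style is still wanted, another option is to have the function return one labelled pd.Series per value. Series.apply then assembles those into a DataFrame rather than a Series of objects. This is only a sketch: the name get_item_row and the final join step are assumptions for illustration, not part of the original code.

```python
import pandas as pd

df = pd.DataFrame(
    ['adam', 'ed', 'dra', 'dave', 'sed', 'mike'],
    index=['a', 'b', 'c', 'd', 'e', 'f'],
    columns=['A'],
)

def get_item_row(value):
    # Hypothetical variant of get_item: one labelled Series per input value;
    # the labels become the column names of the result
    return pd.Series({'Newfield': value, 'AnotherNewfield': 'y'})

# Because the function returns a Series, apply() builds a DataFrame
# that keeps the original index of df
new_cols = df['A'].apply(get_item_row)

# Attach the new columns to the original frame by index
result = df.join(new_cols)
```

For large frames this is likely slower than building the columns directly, since the function runs once per row and creates a Series each time.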