How to unpack a Series of tuples in Pandas?

Question:

Sometimes I end up with a series of tuples/lists when using Pandas. This is common when, for example, doing a group-by and passing a function that has multiple return values:

import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame(dict(x=np.random.randn(100),
                       y=np.repeat(list("abcd"), 25)))
out = df.groupby("y").x.apply(stats.ttest_1samp, 0)
print(out)

y
a       (1.3066417476, 0.203717485506)
b    (0.0801133382517, 0.936811414675)
c      (1.55784329113, 0.132360504653)
d     (0.267999459642, 0.790989680709)
dtype: object

What is the correct way to “unpack” this structure so that I get a DataFrame with two columns?

A related question is how I can unpack either this structure or the resulting dataframe into two Series/array objects. This almost works:

t, p = zip(*out)

but then t is

 (array(1.3066417475999257),
 array(0.08011333825171714),
 array(1.557843291126335),
 array(0.267999459641651))

and one needs to take the extra step of squeezing it.

Asked By: mwaskom

||

Answers:

Maybe this:

>>> pd.DataFrame(out.tolist(), columns=['out-1','out-2'], index=out.index)
                  out-1     out-2
y                                
a   -1.9153853424536496  0.067433
b     1.277561889173181  0.213624
c  0.062021492729736116  0.951059
d    0.3036745009819999  0.763993

[4 rows x 2 columns]
Answered By: behzad.nouri
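
For anyone who wants to try the approach above without scipy, here is a minimal sketch. The Series `out` is a hand-made stand-in for the group-by result (the numbers are rounded from the question, and the column names "t" and "p" are just illustrative), so treat it as an assumption rather than real test output.

```python
import pandas as pd

# Hypothetical stand-in for the groupby result:
# one (statistic, p-value) tuple per group, indexed by group label.
out = pd.Series(
    [(1.31, 0.20), (0.08, 0.94), (1.56, 0.13), (0.27, 0.79)],
    index=pd.Index(list("abcd"), name="y"),
)

# tolist() gives a plain list of tuples; the DataFrame constructor
# turns each tuple position into its own column.
# Passing index=out.index keeps the group labels instead of 0..3.
unpacked = pd.DataFrame(out.tolist(), columns=["t", "p"], index=out.index)
print(unpacked)
```

Leaving out `index=out.index` still works, but the rows are then labelled 0 to 3 rather than a to d.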

I believe you want this:

df=pd.DataFrame(out.tolist())
df.columns=['KS-stat', 'P-value']

result:

           KS-stat   P-value
0   -2.12978778869  0.043643
1    3.50655433879  0.001813
2    -1.2221274198  0.233527
3  -0.977154419818  0.338240
Answered By: CT Zhu

Maybe this is the most straightforward (and, I guess, the most Pythonic) way:

out = out.apply(pd.Series)

If you want to rename the columns to something more meaningful, then:

out.columns=['Kstats','Pvalue']

If you do not want the default name for the index:

out.index.name=None
Answered By: Siraj S.
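
A small self-contained sketch of the same idea, assuming a hand-made Series of tuples in place of the real group-by result. The point it illustrates is that `apply(pd.Series)` returns a new DataFrame, so the result has to be captured before its columns can be renamed. The name `expanded` and the column labels are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the groupby result.
out = pd.Series(
    [(1.31, 0.20), (0.08, 0.94)],
    index=pd.Index(["a", "b"], name="y"),
)

# Each tuple becomes a row; the new columns are labelled 0 and 1.
expanded = out.apply(pd.Series)

# Give the columns meaningful names.
expanded.columns = ["Kstats", "Pvalue"]

# Drop the index name "y" on the new frame.
expanded = expanded.rename_axis(None)
print(expanded)
```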

I ran into a similar problem. The two ways I found to solve it are exactly the answers of @CT ZHU and @Siraj S.

Here is some supplementary information you might be interested in:
I compared the two approaches and found that @CT ZHU's performs much faster as the size of the input grows.

Example:

#Python 3
import time
from statistics import mean
import pandas as pd
df_a = pd.DataFrame({'a':range(1000),'b':range(1000)})

#function to test
def func1(x):
    c = str(x)*3
    d = int(x)+100
    return c,d

# Siraj S's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = df_a['b'].apply(lambda x: func1(x)).apply(pd.Series)
    end = time.time()
    time_difference.append(end-start)

print(mean(time_difference))    
# 0.14907703161239624

# CT ZHU's way
time_difference = []
for i in range(100):
    start = time.time()
    df_b = pd.DataFrame(df_a['b'].apply(lambda x: func1(x)).tolist())
    end = time.time()
    time_difference.append(end-start)    

print(mean(time_difference)) 
# 0.0014058423042297363

PS: Please forgive my ugly code.

Answered By: Jeremy Z
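
The same comparison can also be sketched with the standard-library `timeit` module, which handles the repetition for you. This is only a rough sketch: the helper names `via_apply_series` and `via_tolist` are made up for illustration, and the absolute timings will depend on your machine and pandas version, so none are quoted here.

```python
import timeit

import pandas as pd


def func1(x):
    # Same toy function as above: returns a (str, int) tuple.
    return str(x) * 3, int(x) + 100


# Build the Series of tuples once, so only the unpacking step is timed.
tuples = pd.Series(range(1000)).apply(func1)


def via_apply_series():
    # Siraj S.'s way: builds one pd.Series per row.
    return tuples.apply(pd.Series)


def via_tolist():
    # CT Zhu's way: hands the whole list of tuples to the constructor.
    return pd.DataFrame(tuples.tolist(), index=tuples.index)


for f in (via_apply_series, via_tolist):
    # Best of 3 repeats, 3 calls each, reported as seconds per call.
    best = min(timeit.repeat(f, number=3, repeat=3)) / 3
    print(f"{f.__name__}: {best:.6f} s per call")
```

Both helpers produce a two-column frame with the same values, although the column dtypes may differ between the two approaches.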

Not sure whether t and r are predefined somewhere, but if not, I get the two tuples assigned to t and r by:

>>> t, r = zip(*out)
>>> t
(-1.776982300308175, 0.10543682705459552, -1.7206831272759038, 1.0062163376448068)
>>> r
(0.08824925924534484, 0.9169054844258786, 0.09817788453771065, 0.3243492942246433)

Thus, you could do this:

>>> df = pd.DataFrame(columns=['t', 'r'])
>>> df.t, df.r = zip(*out)
>>> df
          t         r
0 -1.776982  0.088249
1  0.105437  0.916905
2 -1.720683  0.098178
3  1.006216  0.324349
Answered By: Roy
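
On the second part of the question (getting two separate Series/array objects), here is a minimal sketch under the assumption that `out` is a Series of 2-tuples; the data below is a hand-made stand-in. Wrapping each zipped column in `np.asarray` gives 1-D arrays, and as far as I know it also flattens elements that are 0-d arrays, which would cover the extra squeezing step mentioned in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the groupby result.
out = pd.Series(
    [(1.31, 0.20), (0.08, 0.94), (1.56, 0.13), (0.27, 0.79)],
    index=pd.Index(list("abcd"), name="y"),
)

# Option 1: straight from the Series of tuples to two NumPy arrays.
t, p = (np.asarray(col) for col in zip(*out))

# Option 2: go through a DataFrame and pull out the columns as Series,
# which keeps the group labels as the index.
unpacked = pd.DataFrame(out.tolist(), columns=["t", "p"], index=out.index)
t_series, p_series = unpacked["t"], unpacked["p"]

print(t)
print(p_series)
```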