How do I Pass a List of Series to a Pandas DataFrame?
Question:
I realize DataFrame takes a map of {'series_name': Series(data, index)}. However, it automatically sorts that map even if the map is an OrderedDict().
Is there a simple way to pass a list of Series(data, index, name=name) such that the order is preserved and the column names are the series.name? Is there an easy way if all the indices are the same for all the series?
I normally do this by just passing a numpy column_stack of series.values and specifying the column names. However, this is ugly and in this particular case the data is strings not floats.
Answers:
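You could use pandas.concat:
import random
import string
import pandas as pd
def rands(n):
    # Stand-in for pandas.util.testing.rands (not importable in pandas 2.0+): random n-character string.
    return ''.join(random.choices(string.ascii_letters + string.digits, k=n))
data = [pd.Series([rands(4) for j in range(6)],
                  index=pd.date_range('1/1/2000', periods=6),
                  name='col' + str(i)) for i in range(4)]
# keys sets the column labels; here they simply repeat each Series' own name.
df = pd.concat(data, axis=1, keys=[s.name for s in data])
print(df)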
yields
col0 col1 col2 col3
2000-01-01 GqcN Lwlj Km7b XfaA
2000-01-02 lhNC nlSm jCYu XLVb
2000-01-03 sSRz PFby C1o5 0BJe
2000-01-04 khZb Ny9p crUY LNmc
2000-01-05 hmLp 4rVp xF2P OmD9
2000-01-06 giah psQb T5RJ oLSh
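For a quick check with fixed values, here is a minimal sketch (the Series names and values are made up for illustration). It suggests that pd.concat with axis=1 keeps the order of the list and uses each Series' name as the column label, so keys should be optional when every Series is already named:

```python
import pandas as pd

# Illustrative data: names are deliberately not in alphabetical order.
idx = pd.date_range('1/1/2000', periods=3)
series_list = [
    pd.Series(['x', 'y', 'z'], index=idx, name='zeta'),
    pd.Series(['a', 'b', 'c'], index=idx, name='alpha'),
    pd.Series(['m', 'n', 'o'], index=idx, name='mid'),
]

# axis=1 places each Series side by side as a column.
df = pd.concat(series_list, axis=1)
print(list(df.columns))
```

Because the data are strings, nothing here depends on numeric dtypes, which was the concern with the column_stack approach in the question.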
a = pd.Series(data=[1,2,3])
b = pd.Series(data=[4,5,6])
a.name = 'a'
b.name= 'b'
pd.DataFrame(list(zip(a, b)), columns=[a.name, b.name])
or just concat dataframes
pd.concat([pd.DataFrame(a),pd.DataFrame(b)], axis=1)
In [53]: %timeit pd.DataFrame(zip(a,b), columns=[a.name, b.name])
1000 loops, best of 3: 362 us per loop
In [54]: %timeit pd.concat([pd.DataFrame(a),pd.DataFrame(b)], axis=1)
1000 loops, best of 3: 808 us per loop
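Speed is not the only difference between the two approaches. The sketch below (with an illustrative non-default index that is not in the answer above) shows that zip() iterates over the values only, so the resulting DataFrame gets a fresh integer index, while pd.concat carries the original index through:

```python
import pandas as pd

# Illustrative Series with a non-default index.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'], name='a')
b = pd.Series([4, 5, 6], index=['x', 'y', 'z'], name='b')

# zip() pairs up values positionally; the index labels are not used.
from_zip = pd.DataFrame(list(zip(a, b)), columns=[a.name, b.name])

# concat aligns on the index and keeps it.
from_concat = pd.concat([a, b], axis=1)

print(list(from_zip.index))
print(list(from_concat.index))
```

If the index matters, as with the date index in the question, the concat form is probably the safer choice even though it is slower in the timings above.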
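Check out DataFrame.from_items too. Note that it was deprecated in pandas 0.23 and removed in 1.0, where DataFrame.from_dict takes its place.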
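Simply passing the list of Series to DataFrame and then transposing seems to work too. It will also fill in any indices that are missing from one or the other Series.
import random
import string
import pandas as pd
def rands(n):
    # Stand-in for pandas.util.testing.rands (not importable in pandas 2.0+): random n-character string.
    return ''.join(random.choices(string.ascii_letters + string.digits, k=n))
data = [pd.Series([rands(4) for j in range(6)],
                  index=pd.date_range('1/1/2000', periods=6),
                  name='col' + str(i)) for i in range(4)]
# Each Series becomes a row, so transpose to get one column per Series.
df = pd.DataFrame(data).T
print(df)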
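The claim about missing indices can be checked with a small sketch. The two Series below are hypothetical and deliberately have different indices; the label missing from the second one should come out as NaN:

```python
import pandas as pd

# 'second' has no entry for label 'a'.
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='first')
s2 = pd.Series([10.0, 20.0], index=['b', 'c'], name='second')

# Rows are built from the union of both indices, then transposed into columns.
df = pd.DataFrame([s1, s2]).T
print(df)
```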
Build the list of series:
import pandas as pd
import numpy as np
> series = [pd.Series(np.random.rand(3), name=c) for c in list('abcdefg')]
First method, pd.DataFrame.from_items (deprecated in pandas 0.23 and removed in 1.0, so this only runs on older versions):
> pd.DataFrame.from_items([(s.name, s) for s in series])
a b c d e f g
0 0.071094 0.077545 0.299540 0.377555 0.751840 0.879995 0.933399
1 0.538251 0.066780 0.415607 0.796059 0.718893 0.679950 0.502138
2 0.096001 0.680868 0.883778 0.210488 0.642578 0.023881 0.250317
Second method, pd.concat:
> pd.concat(series, axis=1)
a b c d e f g
0 0.071094 0.077545 0.299540 0.377555 0.751840 0.879995 0.933399
1 0.538251 0.066780 0.415607 0.796059 0.718893 0.679950 0.502138
2 0.096001 0.680868 0.883778 0.210488 0.642578 0.023881 0.250317
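On recent pandas versions, where from_items is gone, a plain dict is one possible substitute. This is a sketch under the assumption of Python 3.7+ and a reasonably current pandas, where dict insertion order is kept as the column order; the reversed letter order is only there to make that visible:

```python
import numpy as np
import pandas as pd

# Names in reverse alphabetical order, so any sorting would be noticeable.
series = [pd.Series(np.random.rand(3), name=c) for c in list('gfedcba')]

# Build a name -> Series mapping; insertion order follows the list order.
df = pd.DataFrame({s.name: s for s in series})
print(list(df.columns))
```

pd.DataFrame.from_dict accepts the same mapping, so either spelling should give the same columns.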
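You can first create an empty DataFrame and then append the list of Series to it. Each Series becomes a row, so transpose the result if you want one column per Series.
df = pd.DataFrame()
then:
# DataFrame.append() was removed in pandas 2.0; on older versions this line was df = df.append(list_series)
df = pd.concat([df, pd.DataFrame(list_series)])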
I also like to make sure the previous script that created list_series won’t mess my dataframe up:
df.drop_duplicates(inplace=True)
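As a rough illustration of that clean-up step, the sketch below uses a hypothetical list_series in which one Series was produced twice. It builds the rows directly with the DataFrame constructor rather than with append(), then drops the repeated row:

```python
import pandas as pd

# Hypothetical list_series: the first and third entries are identical.
list_series = [
    pd.Series({'x': 1, 'y': 2}, name='row0'),
    pd.Series({'x': 3, 'y': 4}, name='row1'),
    pd.Series({'x': 1, 'y': 2}, name='row0'),
]

# Each Series becomes one row; the Series names become the row labels.
df = pd.DataFrame(list_series)

# Rows are compared by value; the first occurrence is kept.
df = df.drop_duplicates()
print(df)
```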
This one is simpler:
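import random
import string
import pandas as pd
def rands(n):
    # Stand-in for pandas.util.testing.rands (not importable in pandas 2.0+): random n-character string.
    return ''.join(random.choices(string.ascii_letters + string.digits, k=n))
data = [pd.Series([rands(4) for j in range(6)],
                  index=pd.date_range('1/1/2000', periods=6),
                  name='col' + str(i)) for i in range(4)]
# Each Series becomes one row, labelled by its name.
df = pd.DataFrame(data)
print(df)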
which yields
2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05 2000-01-06
col0 oPg5 9Af9 SNfq vnCb ArCU 8Bhy
col1 IKmX xS0c yqCQ sVov 92CN WIyH
col2 1x2s JBk7 Z5vh km7k ed1F pIDt
col3 m9M3 mxil 1v72 Fkme YooA 5H5b
Or try this one:
df = pd.DataFrame(data).T
print(df)
to yield
col0 col1 col2 col3
2000-01-01 6zbm UfrI isNy wVv0
2000-01-02 Kgej 0SN4 thDS 7BP2
2000-01-03 mcTx BGDI 5BJC mUdg
2000-01-04 iVSP 6Rim 6gg9 fY2A
2000-01-05 HzEU giJ6 HFD1 dE98
2000-01-06 wYCi nWmp jqLz GwKz
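One thing to keep in mind with the transpose route is dtypes. When the Series have different types, the intermediate row-wise DataFrame mixes them in every column, so after transposing the columns appear to come out as object. The sketch below uses two made-up Series to compare this with pd.concat, which keeps each Series' own dtype:

```python
import pandas as pd

# Illustrative Series of different types.
nums = pd.Series([1, 2, 3], name='nums')
text = pd.Series(['a', 'b', 'c'], name='text')

# Row-wise construction followed by a transpose.
via_transpose = pd.DataFrame([nums, text]).T

# Column-wise concatenation.
via_concat = pd.concat([nums, text], axis=1)

print(via_transpose.dtypes)
print(via_concat.dtypes)
```

For all-string data like the example in the question this makes no practical difference, but for mixed numeric and string Series the concat form avoids a later type conversion.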