Reshaping DataFrame with pandas

Question:

So I’m working with pandas on python. I collect data indexed by timestamps with multiple ways.

This means I can have one index with 2 features available (and the others with NaN values, it’s normal) or all features, it depends.

So my problem is when I add some data with multiple values for the same indices, see the example below :

Imagine this is the set we’re adding new data :

Index col1 col2
    1   a    A
    2   b    B
    3   c    C

This the data we will add:

Index new col 
    1      z    
    1      y    

Then the result is this :

Index col1 col2 new col
    1   a    A    NaN
    1   NaN  NaN  z
    1   NaN  NaN  y
    2   b    B    NaN
    3   c    C    NaN

So instead of that, I would like the result to be :

Index col1 col2 new col1 new col2
    1   a    A    z        y
    2   b    B    NaN      NaN
    3   c    C    NaN      NaN

I want that instead of having multiples indexes in 1 feature, there will be 1 index for multiple features.

I don’t know if this is understandable. Another way is to say that I want this : number of values per timestamp=number of features instead of =numbers of indices.

Asked By: 3V's

||

Answers:

This solution assumes the data that you need to add is a series.

Original df:

df = pd.DataFrame(np.random.randint(0,3,size=(3,3)),columns = list('ABC'),index = [1,2,3])

Data to add (series):

s = pd.Series(['x','y'],index = [1,1])

Solution:

df.join(s.to_frame()
        .assign(cc = lambda x: x.groupby(level=0)
                .cumcount().add(1))
        .set_index('cc',append=True)[0]
        .unstack()
        .rename('New Col{}'.format,axis=1))

Output:

   A  B  C New Col1 New Col2
1  1  2  2        x        y
2  0  1  2      NaN      NaN
3  2  2  0      NaN      NaN
Answered By: rhug123

Alternative answer (maybe more simplistic, probably less pythonic). I think you need to look at converting wide data to long data and back again in general (pivot and transpose might be good things to look up for this), but I also think there are some possible problems in your question. You don’t mention new col 1 and new col 2 in the declaration of the subsequent arrays.

Here’s my declarations of your data frames:

d = {'index': [1, 2, 3],'col1': ['a', 'b', 'c'], 'col2': ['A', 'B', 'C']}
df = pd.DataFrame(data=d)

e1 = {'index': [1], 'new col1': ['z']}
dfe1 = pd.DataFrame(data=e1)

e2 = {'index': [1], 'new col2': ['y']}
dfe2 = pd.DataFrame(data=e2)

They look like this:

index   new col1
1       z

and this:

index   new col2
1       y

Notice that I declare your new columns as part of the data frames. Once they’re declared like that, it’s just a matter of merging:

dfr = pd.merge(df, dfe, on='index', how="outer")
dfr1 = pd.merge(df, dfe1, on='index', how="outer")
dfr2 = pd.merge(dfr1, dfe2, on='index', how="outer")

And the output looks like this:

    index   col1    col2    new col1    new col2
    1       a       A       z           y
    2       b       B       NaN         NaN
    3       c       C       NaN         NaN
Answered By: Pam

I think one problem may arise in the way you first create your second data frame.
Actually, expanding the number of feature depending on its content is what makes this reformatting a bit annoying here (as you could see for yourself, when writing two new column names out of the bare assumption that this reflect the number of feature observed at every timestamps).

Here is yet another solution, this tries to be a bit more explicit in the step taken than rhug123’s answer.

# Initial dataFrames
a = pd.DataFrame({'col1':['a', 'b', 'c'], 'col2':['A', 'B', 'C']}, index=range(1, 4))
b = pd.DataFrame({'new col':['z', 'y']}, index=[1, 1])

Now the only important step is basically transposing your second DataFrame, while here you also need to intorduce two new column names.
We will do this grouping of the second dataframe according to its content (y, z, …):

c = b.groupby(b.index)['new col'].apply(list) # this has also one index per timestamp, but all features are grouped in a list

# New column names:
cols = ['New col%d'%(k+1) for in range(b.value_counts().sum())]
# Expanding dataframe "c" for each new column
d = pd.DataFrame(c.to_list(), index=b.index.unique(), columns=cols)

# Merge
a.join(d, how='outer')

Output:

  col1 col2 New col1 New col2
1    a    A        z        y
2    b    B      NaN      NaN
3    c    C      NaN      NaN

Finally, one problem encountered with both my answer and the one from rhug123, is that as for now it won’t deal with another feature at a different timestamp correctly. Not sure what the OP expects here.
For example if b is:

  new col
1       z
1       y
2       x

The merged output will be:

  col1 col2 New col1 New col2
1    a    A        z        y
2    b    B        x     None
3    c    C      NaN      NaN
Answered By: H. Rev.
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.