How to plot multiple dataframes with different lengths into one plot

Question:

the question

I have a lot of dataframes (approx. 300) that I would like to plot in one line chart. The issue is that they all have a different number of values (lengths). How can I normalise the dataframes so they can be plotted on the same chart?

I can convert the dataframes to lists, series, or whatever else is needed.

example

  • df1: [5, 3, 10, 7, 10...]
  • df2: [2, 4, 5, 7, 2, 1, 3, 0, 1]
  • adjusted df2: that's the desired state, same length as df1
df1 df2 adjusted df2
5   2   2
3       2
10      2
7   4   4
10      4
5       4
8   6   6
6       6
5       6
6   7   7
2       7
1       7
5   2   2
3       2
6       2
9   1   1
9       1
7       1
10  3   3
2       3
7       3
7   0   0
6       0
1       0
6   1   1
9       1

adjusted example – with datetimes

I also have datetimes/timestamps for each record, but I left them out of the example because I thought they were not relevant

                datetime       value
5448 2020-01-19 22:05:00  166.300003
5449 2020-01-19 22:10:00  165.259995
5450 2020-01-19 22:15:00  164.699997
5451 2020-01-19 22:20:00  165.380005
5452 2020-01-19 22:25:00  166.179993
5453 2020-01-19 22:30:00  162.630005
5424 2020-01-19 22:35:00  162.550003
5425 2020-01-19 22:40:00  161.990005
5426 2020-01-19 22:45:00  161.750000
5427 2020-01-10 22:50:00  161.440002
                 datetime       value
15900 2020-02-25 11:55:00  262.510010
15901 2020-02-25 12:00:00  263.179993
15902 2020-02-25 12:05:00  262.260010
15903 2020-02-25 12:10:00  261.959991
15904 2020-02-25 12:15:00  262.179993
15905 2020-02-25 12:20:00  261.299988
15906 2020-02-25 12:25:00  261.579987
15907 2020-02-25 12:30:00  261.890015
15908 2020-02-25 12:35:00  262.820007
15909 2020-02-25 12:40:00  262.010010
15910 2020-02-25 12:45:00  261.630005
15911 2020-02-25 12:50:00  261.109985
15912 2020-02-25 12:55:00  261.149994
15913 2020-02-25 13:00:00  260.679993
15914 2020-02-25 13:05:00  261.929993
15915 2020-02-25 13:10:00  260.880005
15916 2020-02-25 13:15:00  259.929993
                 datetime       value
16407 2020-02-27 06:10:00  224.860001
16408 2020-02-27 06:15:00  224.240005
16409 2020-02-27 06:20:00  223.610001
16410 2020-02-27 06:25:00  223.490005
16411 2020-02-27 06:30:00  223.199997

So the plot will look like this
example chart

possible options

interpolate values

The idea is similar to the example above, but I am not sure how to "shift" the values so that the dataframe lengths match

use a different X axis for each dataframe

Not sure exactly how to do this yet

I am going to keep digging and trying more options. I would be glad for any help

Asked By: FN_

||

Answers:

You can do something like this

import numpy as np
import matplotlib.pyplot as plt

df1 = [5, 3, 10, 7, 10]
df2 = [2, 4, 5, 7, 2, 1, 3, 0, 1, 3]

arr1 = []
for i in range(len(df1)):
    arr1.append([i,df1[i]])
    
arr1 = np.array(arr1, dtype='int')
x1 = arr1[:,0]
y1 = arr1[:,1]

arr2 = []
for j in range(len(df2)):
    arr2.append([j,df2[j]])
    
arr2 = np.array(arr2, dtype='int')
x2 = arr2[:,0]
y2 = arr2[:,1]

fig = plt.figure(figsize=(9,4))
ax1 = fig.add_subplot(111)
ax1.plot(x1,y1, color='red')
ax1.plot(x2,y2, color='blue')
plt.xlabel('Index')
ax1.legend(["Series A", "Series B"], loc="upper right")
plt.title('Series A and Series B', loc='left', color='grey')
plt.grid()
plt.show()
Answered By: RaDmAn2222

In case you just want to make them same size:

If you already have pandas dataframes:

import pandas as pd
import math

# generate sample data
adf = pd.DataFrame({"df1" : [5,3,10,7,10,5,8,6,5,6,2,1,5,3,6,9,9,9,7,10,2,7,7,6,1,6,9]})
bdf = pd.DataFrame({"df2" : [2,4,6,7,2,1,3,0,1]})

# add the bdf values to adf dataframe
adf['bdf'] = [bdf['df2'].iloc[math.floor(i/len(adf.df1)*len(bdf.df2))] for i in range(len(adf.df1))]

adf
#Out[3]: 
#    df1  bdf
#0     5    2
#1     3    2
#2    10    2
#3     7    4
#4    10    4
#5     5    4
#6     8    6
#7     6    6
#8     5    6
#9     6    7
#10    2    7
#11    1    7
#12    5    2
#13    3    2
#14    6    2
#15    9    1
#16    9    1
#17    9    1
#18    7    3
#19   10    3
#20    2    3
#21    7    0
#22    7    0
#23    6    0
#24    1    1
#25    6    1
#26    9    1

If you have just a list of values:

import math

a = [5,3,10,7,10,5,8,6,5,6,2,1,5,3,6,9,9,9,7,10,2,7,7,6,1,6,9]
b = [2,4,6,7,2,1,3,0,1]
b_adjusted = [b[math.floor(i/len(a)*len(b))] for i in range(len(a))]
b_adjusted
#Out[5]: 
#[2,
# 2,
# 2,
# 4,
# 4,
# 4,
# 6,
# 6,
# 6,
# 7,
# 7,
# 7,
# 2,
# 2,
# 2,
# 1,
# 1,
# 1,
# 3,
# 3,
# 3,
# 0,
# 0,
# 0,
# 1,
# 1,
# 1]
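
For ~300 series it may be handier to wrap the same index mapping in a small helper. This is only a sketch of the idea above using NumPy; the name stretch_to_length is made up for illustration, and integer division replaces math.floor:

```python
import numpy as np

def stretch_to_length(values, target_len):
    """Repeat source values step-wise so the result has target_len items."""
    values = np.asarray(values)
    # Target position i takes the value at source position floor(i * len(values) / target_len)
    src_idx = np.arange(target_len) * len(values) // target_len
    return values[src_idx]

b = [2, 4, 6, 7, 2, 1, 3, 0, 1]
b_adjusted = stretch_to_length(b, 27)
print(b_adjusted.tolist())
# [2, 2, 2, 4, 4, 4, 6, 6, 6, 7, 7, 7, 2, 2, 2, 1, 1, 1, 3, 3, 3, 0, 0, 0, 1, 1, 1]
```

The target length does not have to be a multiple of the source length; the repeats are then just spread unevenly.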
Answered By: Ehsan Hamzei

This assumes that the index/x values have no meaning for the individual dataframes.
Then you can normalize the indices via the MinMaxScaler and plot them like this into a single plot:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df1 = pd.DataFrame(np.arange(16)*np.random.random(16)) # length 16
df2 = pd.DataFrame(np.arange(7)*np.random.random(7)) 
df3 = pd.DataFrame(np.arange(2)*np.random.random(2)) # length 2

fig, ax = plt.subplots()
for df in [df1, df2, df3]:
    df = df.copy()
    # rescale the index to the range [0, 1] so all frames share one x axis
    df.index = MinMaxScaler().fit_transform(df.index.values.reshape(-1, 1)).flatten()
    df.plot(ax=ax)

plt.show()
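
If you would rather not pull in scikit-learn just for this, the same min-max rescaling of the index can be written by hand. A minimal sketch, assuming a numeric index; the helper name scale_index_to_unit is hypothetical:

```python
import numpy as np
import pandas as pd

def scale_index_to_unit(df):
    """Return a copy of df whose index is rescaled linearly to the range [0, 1]."""
    out = df.copy()
    idx = out.index.to_numpy(dtype=float)
    span = idx.max() - idx.min()
    # A single-row frame has zero span; place it at 0 to avoid dividing by zero
    out.index = (idx - idx.min()) / span if span > 0 else np.zeros_like(idx)
    return out

df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})
scaled = scale_index_to_unit(df)
print(scaled.index.tolist())
# [0.0, 0.25, 0.5, 0.75, 1.0]
```

Calling scaled.plot(ax=ax) then works the same way as in the loop above.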


Answered By: Daraan

Updated Answer

After you’ve clarified the problem in your question and comments, your actual intention is not to combine the datasets but simply to plot them such that the first and last points match up. This can easily be done by creating an x array for each data frame that goes from 0 to 1 and has N points, where N is the length of the data frame; I did this using np.linspace.

Shown below is the code and plots for the original and time series data. Since you want to see the trends, for the time series data I subtracted the first value from each series so they all start at 0.

Using the original data:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

plt.close("all")

df1 = pd.DataFrame({"data": [5, 3, 10, 7, 10, 5, 8, 6, 5, 6, 2, 1, 5, 3, 6, 9, 9, 9, 7, 10, 2, 7, 7, 6, 1, 6, 9]})
df2 = pd.DataFrame({"data": [2, 4, 6, 7, 2, 1, 3, 0, 1]})

fig, ax = plt.subplots()
for i, df in enumerate([df1, df2], start=1):
    x = np.linspace(0, 1, len(df))
    ax.plot(x, df["data"], label=f"df{i}")
ax.legend()
fig.show()

Using the time-series data:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

plt.close("all")

df1 = pd.DataFrame({"datetime": ["2020-01-19 22:05:00",
                                 "2020-01-19 22:10:00",
                                 "2020-01-19 22:15:00",
                                 "2020-01-19 22:20:00",
                                 "2020-01-19 22:25:00",
                                 "2020-01-19 22:30:00",
                                 "2020-01-19 22:35:00",
                                 "2020-01-19 22:40:00",
                                 "2020-01-19 22:45:00",
                                 "2020-01-10 22:50:00",],
                    "value": [166.300003,
                              165.259995,
                              164.699997,
                              165.380005,
                              166.179993,
                              162.630005,
                              162.550003,
                              161.990005,
                              161.750000,
                              161.440002,
                              ]})
df2 = pd.DataFrame({"datetime": ["2020-02-25 11:55:00",
                                 "2020-02-25 12:00:00",
                                 "2020-02-25 12:05:00",
                                 "2020-02-25 12:10:00",
                                 "2020-02-25 12:15:00",
                                 "2020-02-25 12:20:00",
                                 "2020-02-25 12:25:00",
                                 "2020-02-25 12:30:00",
                                 "2020-02-25 12:35:00",
                                 "2020-02-25 12:40:00",
                                 "2020-02-25 12:45:00",
                                 "2020-02-25 12:50:00",
                                 "2020-02-25 12:55:00",
                                 "2020-02-25 13:00:00",
                                 "2020-02-25 13:05:00",
                                 "2020-02-25 13:10:00",
                                 "2020-02-25 13:15:00",],
                    "value": [262.510010,
                              263.179993,
                              262.260010,
                              261.959991,
                              262.179993,
                              261.299988,
                              261.579987,
                              261.890015,
                              262.820007,
                              262.010010,
                              261.630005,
                              261.109985,
                              261.149994,
                              260.679993,
                              261.929993,
                              260.880005,
                              259.929993,]})
df3 = pd.DataFrame({"datetime": ["2020-02-27 06:10:00",
                                 "2020-02-27 06:15:00",
                                 "2020-02-27 06:20:00",
                                 "2020-02-27 06:25:00",
                                 "2020-02-27 06:30:00",],
                    "value": [224.860001,
                              224.240005,
                              223.610001,
                              223.490005,
                              223.199997,]})

fig, ax = plt.subplots()
for i, df in enumerate([df1, df2, df3], start=1):
    x = np.linspace(0, 1, len(df))
    ax.plot(x, df["value"] - df["value"].iloc[0], label=f"df{i}")
ax.legend()
fig.show()
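
Since the question mentions roughly 300 dataframes, here is a rough sketch of how the same loop might look at that scale. The data is randomly generated stand-in data (not yours), and the styling choices (one colour, low alpha, no legend) are just my assumption about what stays readable with that many lines:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 300 frames with random lengths between 5 and 50
frames = [
    pd.DataFrame({"value": 100 + rng.normal(size=n).cumsum()})
    for n in rng.integers(5, 51, size=300)
]

fig, ax = plt.subplots()
for df in frames:
    x = np.linspace(0, 1, len(df))         # same 0..1 axis for every frame
    y = df["value"] - df["value"].iloc[0]  # start every line at 0
    ax.plot(x, y, color="tab:blue", alpha=0.15, linewidth=0.8)
ax.set_xlabel("relative position (0 = first record, 1 = last)")
ax.set_ylabel("change from first value")
# fig.savefig("all_series.png") or plt.show() to view the result
```

A legend with 300 entries is not useful, so it is left out here.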



Original Answer

Although I think this is not the right way to deal with your data, I will show you how to do what you ask.

Before we implement any of the methods, you will need to combine the data with the alignment you showed. To do this, you can reindex df2 in steps of len(df1)//len(df2) (this works here because 27 is an exact multiple of 9). I do this using the following code:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"df1": [5, 3, 10, 7, 10, 5, 8, 6, 5, 6, 2, 1, 5, 3, 6, 9, 9, 9, 7, 10, 2, 7, 7, 6, 1, 6, 9]})
df2 = pd.DataFrame({"df2": [2, 4, 6, 7, 2, 1, 3, 0, 1]})
df2.index = np.arange(0, len(df1), len(df1)//len(df2))

The two DataFrames can then be joined on the index column using this next line.

df_combined = df1.join(df2, how="outer")

So, after those two steps, we have df_combined.

    df1  df2
0     5  2.0
1     3  NaN
2    10  NaN
3     7  4.0
4    10  NaN
5     5  NaN
6     8  6.0
7     6  NaN
8     5  NaN
9     6  7.0
10    2  NaN
11    1  NaN
12    5  2.0
13    3  NaN
14    6  NaN
15    9  1.0
16    9  NaN
17    9  NaN
18    7  3.0
19   10  NaN
20    2  NaN
21    7  0.0
22    7  NaN
23    6  NaN
24    1  1.0
25    6  NaN
26    9  NaN

Note: Because NaNs were added, the df2 column is now dtype=float.

You seem to mention two methods: filling the gaps in df2 by repeating the previous value, and interpolating. For filling, you can simply call the ffill() method, which will forward-fill NaNs, i.e. if the value is 2 before a NaN, it will fill that in with a 2.

df_combined["df2_ffill"] = df_combined["df2"].ffill()

To interpolate, you can call the interpolate() method, which, by default, will linearly interpolate the NaNs based on the values before and after the NaN (or group of NaNs).

df_combined["df2_interpolate"] = df_combined["df2"].interpolate()

So, we have df_combined:

    df1  df2  df2_ffill  df2_interpolate
0     5  2.0        2.0         2.000000
1     3  NaN        2.0         2.666667
2    10  NaN        2.0         3.333333
3     7  4.0        4.0         4.000000
4    10  NaN        4.0         4.666667
5     5  NaN        4.0         5.333333
6     8  6.0        6.0         6.000000
7     6  NaN        6.0         6.333333
8     5  NaN        6.0         6.666667
9     6  7.0        7.0         7.000000
10    2  NaN        7.0         5.333333
11    1  NaN        7.0         3.666667
12    5  2.0        2.0         2.000000
13    3  NaN        2.0         1.666667
14    6  NaN        2.0         1.333333
15    9  1.0        1.0         1.000000
16    9  NaN        1.0         1.666667
17    9  NaN        1.0         2.333333
18    7  3.0        3.0         3.000000
19   10  NaN        3.0         2.000000
20    2  NaN        3.0         1.000000
21    7  0.0        0.0         0.000000
22    7  NaN        0.0         0.333333
23    6  NaN        0.0         0.666667
24    1  1.0        1.0         1.000000
25    6  NaN        1.0         1.000000
26    9  NaN        1.0         1.000000
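
As a side note, when the lengths are not exact multiples of each other, one alternative is to resample directly with np.interp on a shared 0 to 1 axis. This is a sketch of a different alignment than the table above (it pins the first and last values to both ends, so the in-between numbers will not match df2_interpolate); resample_linear is a made-up helper name:

```python
import numpy as np

def resample_linear(values, target_len):
    """Linearly interpolate values onto target_len evenly spaced points."""
    values = np.asarray(values, dtype=float)
    x_old = np.linspace(0, 1, len(values))  # positions of the original points
    x_new = np.linspace(0, 1, target_len)   # positions of the resampled points
    return np.interp(x_new, x_old, values)

df2_values = [2, 4, 6, 7, 2, 1, 3, 0, 1]
resampled = resample_linear(df2_values, 27)
print(len(resampled), resampled[0], resampled[-1])
# 27 2.0 1.0
```
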

Here is the data plotted:

Now, as I said in the beginning, this is not the best way to deal with the data because there is no "right" way to make up data. The best thing to do is what you mention last, which is to plot against some other variable; the DataFrame index is not a good option to plot against because it has no meaning.

Assuming this data is some dependent data, i.e. it was computed or derived by some process, there is probably an independent variable associated with it. Maybe it’s time, such as seconds, days, or years, or it’s distance, such as feet, meters, or miles. That is what you should actually plot against. If you have that information for df1 and df2, then plotting each against the corresponding independent data will line up the graphs properly and you wouldn’t have to combine them and make up data.
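
For the datetime data you added, one possible independent variable is the time elapsed since each frame's first record. A minimal sketch, assuming the column names datetime and value from your example; elapsed_minutes is a hypothetical helper:

```python
import pandas as pd

def elapsed_minutes(df, time_col="datetime"):
    """Minutes elapsed since the first record of df."""
    t = pd.to_datetime(df[time_col])
    return (t - t.iloc[0]).dt.total_seconds() / 60

df3 = pd.DataFrame({
    "datetime": ["2020-02-27 06:10:00", "2020-02-27 06:15:00", "2020-02-27 06:20:00",
                 "2020-02-27 06:25:00", "2020-02-27 06:30:00"],
    "value": [224.860001, 224.240005, 223.610001, 223.490005, 223.199997],
})
x = elapsed_minutes(df3)
print(x.tolist())
# [0.0, 5.0, 10.0, 15.0, 20.0]
```

Plotting each frame with ax.plot(elapsed_minutes(df), df["value"]) would then line the series up by real elapsed time, without stretching the shorter ones.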

Answered By: jared