Pandas merge two datasets with same number of rows
Question:
I have two tables with same number of rows (second table is computed from first one by processing of text inside T1). I have both of them stored as pandas dataframe. T2 is no common column with T1. This is example because my tables are huge:
T1:
| name | street | city |
|-------|---------|--------|
| David | street1 | Prague |
| John | street2 | Berlin |
| Joe | street3 | London |
T2:
| computed1 | computed2 |
|-----------|-----------|
| 0.5 | 0.3 |
| 0.2 | 0.8 |
| 0.1 | 0.6 |
Merged:
| name | street | city | computed1 | computed2 |
|-------|---------|--------|-----------|-----------|
| David | street1 | Prague | 0.5 | 0.3 |
| John | street2 | Berlin | 0.2 | 0.8 |
| Joe | street3 | London | 0.1 | 0.6 |
I tried these commands:
pd.concat([T1,T2])
pd.merge([T1,T2])
result=T1.join(T1)
With concat and merge I will get only first thousand combined and rest is filled with nan (I double checked that both are same size), and with .join it not combine them because there is nothing in common.
Is there any way how to combine these two tables in pandas?
Thanks
Answers:
You need reset_index()
before concat
for default indices:
df = pd.concat([T1.reset_index(drop=True),T2.reset_index(drop=True)], axis=1)
Another way would be to merge on the index values:
df = T1.reset_index().merge(T2.reset_index(), left_index=True, right_index=True, how='left)
I want to add that pd.concat can do what you want by just providing the axis as columns. like this:
pd.concat([T1,T2],axis=1)
Assuming the names of my dataframes that I want to merge is
X_Numerical #shape = (4055, 5) #This dataframe has no NaN rows
X_Categorical #shape = (4055, 13) #This dataframe has no NaN rows
I think the following methods could work but it has a downside:
X_features_main = pd.concat([X_Categorical, X_Numerical.reindex(X_Categorical.index)], axis=1) #shape = (4055, 18) & will have NaN vaules
Downside:
- You will end up with various NAN values as showing in the Pandas documentations at https://pandas.pydata.org/docs/user_guide/merging.html
The following method overcome that downside.
- Adding a new column called index for each of the DataFrames
X_Categorical['index'] = range(1, len(X_Categorical) + 1)
X_Numerical['index'] = range(1, len(X_Numerical) + 1)
- Merging the two DataFrames based on that common column
X_features_main = X_Categorical.merge(X_Numerical, on = 'index', how = 'left')
- Dropping the column the merging was done at
X_features_main = X_features_main.drop(['index'], axis=1)
I have two tables with same number of rows (second table is computed from first one by processing of text inside T1). I have both of them stored as pandas dataframe. T2 is no common column with T1. This is example because my tables are huge:
T1:
| name | street | city |
|-------|---------|--------|
| David | street1 | Prague |
| John | street2 | Berlin |
| Joe | street3 | London |
T2:
| computed1 | computed2 |
|-----------|-----------|
| 0.5 | 0.3 |
| 0.2 | 0.8 |
| 0.1 | 0.6 |
Merged:
| name | street | city | computed1 | computed2 |
|-------|---------|--------|-----------|-----------|
| David | street1 | Prague | 0.5 | 0.3 |
| John | street2 | Berlin | 0.2 | 0.8 |
| Joe | street3 | London | 0.1 | 0.6 |
I tried these commands:
pd.concat([T1,T2])
pd.merge([T1,T2])
result=T1.join(T1)
With concat and merge I will get only first thousand combined and rest is filled with nan (I double checked that both are same size), and with .join it not combine them because there is nothing in common.
Is there any way how to combine these two tables in pandas?
Thanks
You need reset_index()
before concat
for default indices:
df = pd.concat([T1.reset_index(drop=True),T2.reset_index(drop=True)], axis=1)
Another way would be to merge on the index values:
df = T1.reset_index().merge(T2.reset_index(), left_index=True, right_index=True, how='left)
I want to add that pd.concat can do what you want by just providing the axis as columns. like this:
pd.concat([T1,T2],axis=1)
Assuming the names of my dataframes that I want to merge is
X_Numerical #shape = (4055, 5) #This dataframe has no NaN rows
X_Categorical #shape = (4055, 13) #This dataframe has no NaN rows
I think the following methods could work but it has a downside:
X_features_main = pd.concat([X_Categorical, X_Numerical.reindex(X_Categorical.index)], axis=1) #shape = (4055, 18) & will have NaN vaules
Downside:
- You will end up with various NAN values as showing in the Pandas documentations at https://pandas.pydata.org/docs/user_guide/merging.html
The following method overcome that downside.
- Adding a new column called index for each of the DataFrames
X_Categorical['index'] = range(1, len(X_Categorical) + 1)
X_Numerical['index'] = range(1, len(X_Numerical) + 1)
- Merging the two DataFrames based on that common column
X_features_main = X_Categorical.merge(X_Numerical, on = 'index', how = 'left')
- Dropping the column the merging was done at
X_features_main = X_features_main.drop(['index'], axis=1)