How to create a DataFrame while preserving order of the columns?
Question:
How can I create a DataFrame from multiple numpy
arrays, Pandas
Series, or Pandas
DataFrame’s while preserving the order of the columns?
For example, I have these two numpy
arrays and I want to combine them as a Pandas
DataFrame.
foo = np.array( [ 1, 2, 3 ] )
bar = np.array( [ 4, 5, 6 ] )
If I do this, the bar
column would come first because dict
doesn’t preserve order.
pd.DataFrame( { 'foo': pd.Series(foo), 'bar': pd.Series(bar) } )
bar foo
0 4 1
1 5 2
2 6 3
I can do this, but it gets tedious when I need to combine many variables.
pd.DataFrame( { 'foo': pd.Series(foo), 'bar': pd.Series(bar) }, columns = [ 'foo', 'bar' ] )
EDIT: Is there a way to specify the variables to be joined and to organize the column order in one operation? That is, I don’t mind using multiple lines to complete the entire operation, but I’d rather not having to specify the variables to be joined multiple times (since I will be changing the code a lot and this is pretty error prone).
EDIT2: One more point. If I want to add or remove one of the variables to be joined, I only want to add/remove in one place.
Answers:
After having created your dataframe, you can simply reorder the columns the way you want by using
df= df[['foo','bar']]
Original Solution: Incorrect Usage of collections.OrderedDict
In my original solution, I proposed to use OrderedDict
from the collections
package in python’s standard library.
>>> import numpy as np
>>> import pandas as pd
>>> from collections import OrderedDict
>>>
>>> foo = np.array( [ 1, 2, 3 ] )
>>> bar = np.array( [ 4, 5, 6 ] )
>>>
>>> pd.DataFrame( OrderedDict( { 'foo': pd.Series(foo), 'bar': pd.Series(bar) } ) )
foo bar
0 1 4
1 2 5
2 3 6
Right Solution: Passing Key-Value Tuple Pairs for Order Preservation
However, as noted, if a normal dictionary is passed to OrderedDict
, the order may still not be preserved since the order is randomized when constructing the dictionary. However, a work around is to convert a list of key-value tuple pairs into an OrderedDict
, as suggested from this SO post:
>>> import numpy as np
>>> import pandas as pd
>>> from collections import OrderedDict
>>>
>>> a = np.array( [ 1, 2, 3 ] )
>>> b = np.array( [ 4, 5, 6 ] )
>>> c = np.array( [ 7, 8, 9 ] )
>>>
>>> pd.DataFrame( OrderedDict( { 'a': pd.Series(a), 'b': pd.Series(b), 'c': pd.Series(c) } ) )
a c b
0 1 7 4
1 2 8 5
2 3 9 6
>>> pd.DataFrame( OrderedDict( (('a', pd.Series(a)), ('b', pd.Series(b)), ('c', pd.Series(c))) ) )
a b c
0 1 4 7
1 2 5 8
2 3 6 9
I couldn’t comment to ask, but how will you specify the order of the columns in the first place (since you can’t with a regular dictionary)?
If you want to maintain an ordered dictionary:
from collections import OrderedDict
import numpy as np
import pandas as pd
data = OrderedDict()
data['foo'] = np.array([1, 2, 3])
data['bar'] = np.array([4, 5, 6])
df = pd.DataFrame(data)
If you just have a list of keys for order:
data = {key: value for key, value in data.iteritems()}
df = pd.concat(data.values(), keys=['foo', 'bar'], axis=1)
@tfv’s answer is likely the most concise way to do what you want.
This may be an other way to approach it:
foo = np.array( [ 1, 2, 3 ] )
bar = np.array( [ 4, 5, 6 ] )
stacked = np.vstack((x,y)).T
stacked
array([[1, 4],
[2, 5],
[3, 6]])
new_df = pd.DataFrame(stacked, columns = ['foo', 'bar'] )
new_df
foo bar
0 1 4
1 2 5
2 3 6
>>> pd.concat([pd.Series(eval(col), name=col) for col in ['foo', 'bar']], axis=1)
foo bar
0 1 4
1 2 5
2 3 6
This works using eval
. Your list of column names must match the corresponding variable name.
>>> eval('foo')
array([1, 2, 3])
To preserve column order pass in your numpy arrays as a list of tuples to DataFrame.from_items
:
>>> df = pd.DataFrame.from_items([('foo', foo), ('bar', bar)])
foo bar
0 1 4
1 2 5
2 3 6
Update
From pandas 0.23 from_items
is deprecated and will be removed. So pass the numpy
arrays using from_dict
. To use from_dict
you need to pass the items as a dictionary:
>>> from collections import OrderedDict as OrderedDict
>>> df = pd.DataFrame.from_dict(OrderedDict(zip(['foo', 'bar'], [foo, bar])))
From python 3.7 you can depend on insertion order being preserved (see https://mail.python.org/pipermail/python-dev/2017-December/151283.html) so:
>>> df = pd.DataFrame.from_dict(dict(zip(['foo', 'bar'], [foo, bar])))
or simply:
>>> df = pd.DataFrame(dict(zip(['foo', 'bar'], [foo, bar])))
Make the dataframe with just the data in it, and transpose it.
Then add the columns.
>>> foo = np.array( [ 1, 2, 3 ] )
>>> bar = np.array( [ 4, 5, 6 ] )
>>>
>>> df = pd.DataFrame([foo, bar]).T
>>> df.columns = ['foo','bar']
>>> df
foo bar
0 1 4
1 2 5
2 3 6
Another sketchy solution might be to pass a X_ to the title of the column where X is the number of the order of the column:
pd.DataFrame( { '2_foo': pd.Series(foo), '1_bar': pd.Series(bar) } )
And after that you can use columns or something to rename the columns again!
The less pythonic code in the world!!!
Good luck mates!
Use the columns
keyword when creating the DataFrame
:
pd.DataFrame({'foo': foo, 'bar': bar}, columns=['foo', 'bar'])
Also, note that you don’t need to create the Series.
What I did is as follow:
# Creating list of dict
list_of_dicts = ({'key1':'valueA', 'key2':'valueB},{'key1':'valueC', 'key2':'valueD}')
#getting list of keys from the dict
keys_list = list(list_of_dicts.keys())
# and finally
df = pd.DataFrame(list_of_dicts, columns = keys_list)
Worked perfectly for me.
The question is very old but wish to provide my solution to the question of “Preserving the order of columns while reading a csv file into pandas data frame”:
import numpy as np
import pandas as pd
# Get column count as a list
cols = np.arange(0, hmprice.shape[1])
df = pd.read_csv('train.csv', usecols=cols)
df.head()
Default Order of dataframe:
Preserved order of dataframe:
NOTE: The attribute usecols can take either column names or column indices; but pandas doesn’t honor “any other order” of column names or column indices.
For example,
df = pd.read_csv('train.csv', usecols=[1, 2, 3])<br/>
or
df = pd.read_csv('train.csv', usecols=[3, 2, 1])<br/>
gives the same result.
Since Python3.7, dicts are ordered by default, so you can just use your dict source to preserve the column ordering:
data = [{'foo': 1, 'bar': 4}, {'foo': 2, 'bar': 5}, {'foo': 3, 'bar': 6}]
pd.DataFrame(data, columns=data[0])
How can I create a DataFrame from multiple numpy
arrays, Pandas
Series, or Pandas
DataFrame’s while preserving the order of the columns?
For example, I have these two numpy
arrays and I want to combine them as a Pandas
DataFrame.
foo = np.array( [ 1, 2, 3 ] )
bar = np.array( [ 4, 5, 6 ] )
If I do this, the bar
column would come first because dict
doesn’t preserve order.
pd.DataFrame( { 'foo': pd.Series(foo), 'bar': pd.Series(bar) } )
bar foo
0 4 1
1 5 2
2 6 3
I can do this, but it gets tedious when I need to combine many variables.
pd.DataFrame( { 'foo': pd.Series(foo), 'bar': pd.Series(bar) }, columns = [ 'foo', 'bar' ] )
EDIT: Is there a way to specify the variables to be joined and to organize the column order in one operation? That is, I don’t mind using multiple lines to complete the entire operation, but I’d rather not having to specify the variables to be joined multiple times (since I will be changing the code a lot and this is pretty error prone).
EDIT2: One more point. If I want to add or remove one of the variables to be joined, I only want to add/remove in one place.
After having created your dataframe, you can simply reorder the columns the way you want by using
df= df[['foo','bar']]
Original Solution: Incorrect Usage of collections.OrderedDict
In my original solution, I proposed to use OrderedDict
from the collections
package in python’s standard library.
>>> import numpy as np
>>> import pandas as pd
>>> from collections import OrderedDict
>>>
>>> foo = np.array( [ 1, 2, 3 ] )
>>> bar = np.array( [ 4, 5, 6 ] )
>>>
>>> pd.DataFrame( OrderedDict( { 'foo': pd.Series(foo), 'bar': pd.Series(bar) } ) )
foo bar
0 1 4
1 2 5
2 3 6
Right Solution: Passing Key-Value Tuple Pairs for Order Preservation
However, as noted, if a normal dictionary is passed to OrderedDict
, the order may still not be preserved since the order is randomized when constructing the dictionary. However, a work around is to convert a list of key-value tuple pairs into an OrderedDict
, as suggested from this SO post:
>>> import numpy as np
>>> import pandas as pd
>>> from collections import OrderedDict
>>>
>>> a = np.array( [ 1, 2, 3 ] )
>>> b = np.array( [ 4, 5, 6 ] )
>>> c = np.array( [ 7, 8, 9 ] )
>>>
>>> pd.DataFrame( OrderedDict( { 'a': pd.Series(a), 'b': pd.Series(b), 'c': pd.Series(c) } ) )
a c b
0 1 7 4
1 2 8 5
2 3 9 6
>>> pd.DataFrame( OrderedDict( (('a', pd.Series(a)), ('b', pd.Series(b)), ('c', pd.Series(c))) ) )
a b c
0 1 4 7
1 2 5 8
2 3 6 9
I couldn’t comment to ask, but how will you specify the order of the columns in the first place (since you can’t with a regular dictionary)?
If you want to maintain an ordered dictionary:
from collections import OrderedDict
import numpy as np
import pandas as pd
data = OrderedDict()
data['foo'] = np.array([1, 2, 3])
data['bar'] = np.array([4, 5, 6])
df = pd.DataFrame(data)
If you just have a list of keys for order:
data = {key: value for key, value in data.iteritems()}
df = pd.concat(data.values(), keys=['foo', 'bar'], axis=1)
@tfv’s answer is likely the most concise way to do what you want.
This may be an other way to approach it:
foo = np.array( [ 1, 2, 3 ] )
bar = np.array( [ 4, 5, 6 ] )
stacked = np.vstack((x,y)).T
stacked
array([[1, 4],
[2, 5],
[3, 6]])
new_df = pd.DataFrame(stacked, columns = ['foo', 'bar'] )
new_df
foo bar
0 1 4
1 2 5
2 3 6
>>> pd.concat([pd.Series(eval(col), name=col) for col in ['foo', 'bar']], axis=1)
foo bar
0 1 4
1 2 5
2 3 6
This works using eval
. Your list of column names must match the corresponding variable name.
>>> eval('foo')
array([1, 2, 3])
To preserve column order pass in your numpy arrays as a list of tuples to DataFrame.from_items
:
>>> df = pd.DataFrame.from_items([('foo', foo), ('bar', bar)])
foo bar
0 1 4
1 2 5
2 3 6
Update
From pandas 0.23 from_items
is deprecated and will be removed. So pass the numpy
arrays using from_dict
. To use from_dict
you need to pass the items as a dictionary:
>>> from collections import OrderedDict as OrderedDict
>>> df = pd.DataFrame.from_dict(OrderedDict(zip(['foo', 'bar'], [foo, bar])))
From python 3.7 you can depend on insertion order being preserved (see https://mail.python.org/pipermail/python-dev/2017-December/151283.html) so:
>>> df = pd.DataFrame.from_dict(dict(zip(['foo', 'bar'], [foo, bar])))
or simply:
>>> df = pd.DataFrame(dict(zip(['foo', 'bar'], [foo, bar])))
Make the dataframe with just the data in it, and transpose it.
Then add the columns.
>>> foo = np.array( [ 1, 2, 3 ] )
>>> bar = np.array( [ 4, 5, 6 ] )
>>>
>>> df = pd.DataFrame([foo, bar]).T
>>> df.columns = ['foo','bar']
>>> df
foo bar 0 1 4 1 2 5 2 3 6
Another sketchy solution might be to pass a X_ to the title of the column where X is the number of the order of the column:
pd.DataFrame( { '2_foo': pd.Series(foo), '1_bar': pd.Series(bar) } )
And after that you can use columns or something to rename the columns again!
The less pythonic code in the world!!!
Good luck mates!
Use the columns
keyword when creating the DataFrame
:
pd.DataFrame({'foo': foo, 'bar': bar}, columns=['foo', 'bar'])
Also, note that you don’t need to create the Series.
What I did is as follow:
# Creating list of dict
list_of_dicts = ({'key1':'valueA', 'key2':'valueB},{'key1':'valueC', 'key2':'valueD}')
#getting list of keys from the dict
keys_list = list(list_of_dicts.keys())
# and finally
df = pd.DataFrame(list_of_dicts, columns = keys_list)
Worked perfectly for me.
The question is very old but wish to provide my solution to the question of “Preserving the order of columns while reading a csv file into pandas data frame”:
import numpy as np
import pandas as pd
# Get column count as a list
cols = np.arange(0, hmprice.shape[1])
df = pd.read_csv('train.csv', usecols=cols)
df.head()
Default Order of dataframe:
Preserved order of dataframe:
NOTE: The attribute usecols can take either column names or column indices; but pandas doesn’t honor “any other order” of column names or column indices.
For example,
df = pd.read_csv('train.csv', usecols=[1, 2, 3])<br/>
or
df = pd.read_csv('train.csv', usecols=[3, 2, 1])<br/>
gives the same result.
Since Python3.7, dicts are ordered by default, so you can just use your dict source to preserve the column ordering:
data = [{'foo': 1, 'bar': 4}, {'foo': 2, 'bar': 5}, {'foo': 3, 'bar': 6}]
pd.DataFrame(data, columns=data[0])