When to use pandas series, numpy ndarrays or simply python dictionaries?
Question:
I am new to learning Python, and some of its libraries (numpy, pandas).
I have found a lot of documentation on how numpy ndarrays, pandas series and python dictionaries work.
But owing to my inexperience with Python, I have had a really hard time determining when to use each one of them. And I haven’t found any best-practices that will help me understand and decide when it is better to use each type of data structure.
As a general matter, are there any best practices for deciding which, if any, of these three data structures a specific data set should be loaded into?
Answers:
Pandas is generally used for financial time-series and economics data (it has a lot of built-in helpers for handling financial data).
Numpy is a fast way to handle large multidimensional arrays for scientific computing (scipy helps here too). It also makes it easy to work with what are called sparse arrays (large arrays with very little data in them).
One of the key advantages of numpy is its C bindings, which allow for massive speed-ups in large array computations, along with built-in functions for things like linear algebra and signal processing.
Both packages address some of the deficiencies of Python's built-in data types. As a general rule of thumb, with incomplete real-world data (NaNs, outliers, etc.), you will end up needing to write all kinds of functions to address these issues; with the above packages you can build on the work of others. If your program generates its data internally, you can probably get by with the simpler native data structures (not just Python dictionaries).
See the post by the author of Pandas for a comparison.
Numpy is very fast with arrays, matrices, and math.
Pandas Series have indexes; sometimes it's very useful for sorting or joining data.
Dictionaries are a slow beast, but sometimes they are very handy too.
So, as was already mentioned, which data types and tools to use depends on the use case.
The rule of thumb that I usually apply: use the simplest data structure that still satisfies your needs. If we rank the data structures from most simple to least simple, it usually ends up like this:
- Dictionaries / lists
- Numpy arrays
- Pandas series / dataframes
So first consider dictionaries / lists. If these allow you to do all data operations that you need, then all is fine. If not, start considering numpy arrays. Some typical reasons for moving to numpy arrays are:
- Your data is 2-dimensional (or higher). Although nested dictionaries/lists can be used to represent multi-dimensional data, in most situations numpy arrays will be more efficient.
- You have to perform a bunch of numerical calculations. As already pointed out by zhqiat, numpy will give a significant speed-up in this case. Furthermore numpy arrays come bundled with a large amount of mathematical functions.
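As a rough (and unscientific) illustration of that speed-up, here is a small timing sketch comparing a pure-Python loop over a list with the equivalent vectorized numpy operation; the exact numbers will vary by machine:

```python
import timeit

import numpy as np

data = list(range(100_000))
arr = np.array(data)

# Pure-Python loop: interpreted, one element at a time
t_list = timeit.timeit(lambda: sum(x * x for x in data), number=50)

# Vectorized numpy: one call, the loop runs in compiled C
t_numpy = timeit.timeit(lambda: (arr * arr).sum(), number=50)

print(f"python list: {t_list:.3f}s   numpy array: {t_numpy:.3f}s")
```

On typical hardware the numpy version is one to two orders of magnitude faster for arrays of this size.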
Then there are also some typical reasons for going beyond numpy arrays and to the more-complex but also more-powerful pandas series/dataframes:
- You have to merge multiple data sets with each other, or do reshaping/reordering of your data. This diagram gives a nice overview of all the ‘data wrangling’ operations that pandas allows you to do.
- You have to import data from or export data to a specific file format like Excel, HDF5 or SQL. Pandas comes with convenient import/export functions for this.
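As a small taste of the kind of wrangling pandas makes easy, here is a sketch of merging two data sets on a shared key (the column names are made up for the example):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# Inner merge on the shared key: keeps only rows present in both frames
# (ids 2 and 3 here), combining their columns
merged = left.merge(right, on="id", how="inner")
print(merged)
```

Doing the same thing with nested lists or plain numpy arrays would require writing the join logic yourself.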
If you want an answer that tells you to stick with just one type of data structure, here is one: use pandas Series/DataFrame structures.
A pandas Series can be seen as an enhanced numpy 1D array, and a pandas DataFrame can be seen as an enhanced numpy 2D array. The main difference is that pandas Series and DataFrames have an explicit index, while numpy arrays are indexed implicitly. So, in any Python code where you would write something like
import numpy as np
a = np.array([1,2,3])
you can just use
import pandas as pd
a = pd.Series([1,2,3])
Nearly all the numpy functions and methods that work on arrays will also work with pandas Series. Analogously, the same holds for DataFrames and numpy 2D arrays.
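For instance, numpy's universal functions accept a Series directly and return a Series, preserving its index:

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])

# A numpy ufunc applied to a Series gives back a Series
roots = np.sqrt(a)
print(roots)

# Array-style reductions work the same way as on ndarrays
print(a.sum(), a.mean())
```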
A further question you might have can be about the performance differences between a numpy array and pandas series. Here is a post that shows the differences in performance using these two tools: performance of pandas series vs numpy arrays.
Please note that, even though a pandas Series is subtly slower than a numpy array, you can get around this by accessing the underlying numpy array via the values attribute of the Series:
a.values
The result of accessing values on a pandas Series is a plain numpy array!
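A quick sketch of that escape hatch; note that more recent pandas versions also offer the to_numpy() method, which the pandas documentation now recommends over values:

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])

arr = a.values        # the underlying numpy array
arr2 = a.to_numpy()   # newer, recommended spelling of the same idea

print(type(arr), type(arr2))
```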
I would say that pandas lets you index and slice off of strings and create data frames directly from dictionaries, whereas numpy is mostly nested lists. Other than that, they are pretty much exactly the same (pandas is built on top of numpy). So pandas "feels" more natural to use for database-like data (e.g. csv, excel, and sql files), whereas numpy "feels" more natural for numeric processing of data (e.g. signals, images, etc.). Granted, you can do many of the same things in both libraries; you can even create pandas data frames from numpy arrays and vice versa.
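A minimal sketch of those two points, building a frame from a dictionary and round-tripping between a DataFrame and an ndarray (column names are made up for the example):

```python
import numpy as np
import pandas as pd

# DataFrame built directly from a dictionary, with string column labels
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
print(df["y"])  # index by string column name

# Round trip: DataFrame -> ndarray -> DataFrame
arr = df.to_numpy()
df2 = pd.DataFrame(arr, columns=df.columns)
```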
One major difference (something to watch out for) is that label-based slicing in pandas (with .loc) is inclusive, whereas numpy slicing is exclusive (i.e. 0:10 in a pandas .loc slice means "0 up to and including 10", whereas it means "0 up to, but not including, 10" in numpy). Intuitively, this is because pandas permits slicing on strings, and it doesn't make much sense to slice, say, "up to but not including a column of name x" (shout out to Corey Schafer for that insight (see about 30 mins in): Python Pandas Tutorial (Part 2)).
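A quick sketch of that difference with toy data: positional slicing of an ndarray excludes the endpoint, while label slicing of a Series with .loc includes it:

```python
import numpy as np
import pandas as pd

arr = np.arange(10)
s = pd.Series(arr, index=list("abcdefghij"))

# numpy: endpoint excluded -> elements 0, 1, 2
print(arr[0:3])

# pandas .loc: label endpoint included -> rows 'a', 'b', and 'c'
print(s.loc["a":"c"])
```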
Other than that, pandas uses the same slicing, indexing, and fancy-indexing notation as numpy (minus the ability to use strings), and has the same kinds of "gotchas" with respect to different operations creating views vs. copies of data. (An excellent numpy tutorial is the Numpy lecture from SciPy 2019 by Alex Chabot-Leclerc.)
Ultimately, I would say pandas is a database analyst's best friend, while numpy is a data scientist's friend. Personally, I use pandas to pull data from the real world, sort it, and preprocess it. Then I convert this data into numpy arrays where necessary to do more serious/intensive numeric computing. PLEASE NOTE: This is purely opinion. There is no right answer.
That being said, I highly recommend getting to know and understand numpy first (the Alex Chabot-Leclerc video is a great place to start). Afterwards, pandas will make a lot more sense.