
Python - What are the major improvements of Pandas over NumPy/SciPy?


Pandas is not particularly revolutionary: it uses the NumPy and SciPy ecosystem, along with some key Cython code, to accomplish its goals. It can be seen as a simpler API to that functionality, with the addition of key utilities such as joins and a simpler group-by capability that are particularly useful for people with table-like data or time series. But, while not revolutionary, Pandas does have key benefits.

For a while I had also perceived Pandas as just utilities on top of NumPy for those who liked the DataFrame interface. However, I now see Pandas as providing these key features (this is not comprehensive):

  1. Structure of arrays (each column is stored independently with its own type, instead of the contiguous record-by-record storage of structured arrays in NumPy); this allows faster processing in many cases.
  2. Simpler interfaces to common operations (file loading, plotting, selection, and joining / aligning data) make it easy to do a lot of work in little code.
  3. Index arrays, which mean that operations are aligned on labels automatically instead of you having to keep track of alignment yourself.
  4. Split-apply-combine, which is a powerful way of thinking about and implementing data processing.
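
To make the points about index alignment and storage layout concrete, here is a minimal sketch. The labels, values and variable names are made up for illustration; only the pandas and NumPy calls themselves are real.

```python
import numpy as np
import pandas as pd

# --- Index alignment ---
# Two Series whose labels only partly overlap and appear in a different order.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "x", "w"])

# pandas matches values by label before adding; a label that is missing
# on either side produces NaN instead of a silently wrong number.
aligned_sum = a + b

# The same values as bare NumPy arrays are added purely by position.
positional_sum = a.to_numpy() + b.to_numpy()

# --- Storage layout ---
# A NumPy structured array stores whole records one after another, so a
# single field is a strided view that skips over the other fields.
records = np.array(
    [(1, 0.5), (2, 1.5), (3, 2.5)],
    dtype=[("id", "i8"), ("score", "f8")],
)
score_view = records["score"]

# A DataFrame keeps each column as its own typed array.
frame = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 1.5, 2.5]})

print(aligned_sum)
print(positional_sum)
print(score_view.strides, frame.dtypes.to_dict())
```

In the aligned result, "x" and "z" hold the sums of the matching labels, while "y" and "w" are NaN because each exists on one side only. The positional result simply pairs the first element with the first, whatever the labels were.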

However, there are downsides to Pandas:

  1. Pandas is basically a user-interface library and not particularly suited for writing library code. The "automatic" features can lull you into using them repeatedly even when you don't need them, which slows down code that gets called over and over again.
  2. Pandas typically takes up more memory, as it is generous with the creation of object arrays to solve otherwise sticky problems such as string handling.
  3. If your use-case is outside the realm of what Pandas was designed to do, it gets clunky quickly. But, within the realms of what it was designed to do, Pandas is powerful and easy to use for quick data analysis.
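
The memory point can be checked with a rough sketch like the one below. The word list and variable names are invented for the example, and `dtype=object` is requested explicitly so the comparison does not depend on which string dtype a given pandas version picks by default. Exact byte counts vary by Python and pandas version, so none are quoted here.

```python
import numpy as np
import pandas as pd

# 10,000 short strings drawn from four distinct words (illustrative data).
words = ["alpha", "beta", "gamma", "delta"] * 2500

# Object dtype: an array of pointers, each to a separate Python str object.
as_object = pd.Series(words, dtype=object)

# NumPy fixed-width unicode: one contiguous buffer, 4 bytes per character.
as_fixed = np.array(words)

# Categorical: small integer codes plus a single copy of each distinct label.
as_category = as_object.astype("category")

# deep=True asks pandas to count the Python string objects themselves.
object_bytes = as_object.memory_usage(index=False, deep=True)
fixed_bytes = as_fixed.nbytes
category_bytes = as_category.memory_usage(index=False, deep=True)

print(object_bytes, fixed_bytes, category_bytes)
```

When the number of distinct strings is small, converting to a categorical is usually the cheapest of the three, which is one way to claw back some of the memory that object arrays cost.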


I feel like characterising Pandas as "improving on" Numpy/SciPy misses much of the point. Numpy/Scipy are quite focussed on efficient numeric calculation and solving numeric problems of the sort that scientists and engineers often solve. If your problem starts out with formulae and involves numerical solution from there, you're probably good with those two.

Pandas is much more aligned with problems that start with data stored in files or databases and which contain strings as well as numbers. Consider the problem of reading data from a database query. In Pandas, you can call read_sql_query directly and have a usable version of the data in one line. There is no equivalent functionality in Numpy/SciPy.
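
As a small sketch of that one-liner, the example below uses an in-memory SQLite database from the standard library so that it runs without any external server. The table name, columns and rows are hypothetical.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database (table and data are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("a", 3.0)],
)
conn.commit()

# The one line that matters: query in, DataFrame with named, typed columns out.
df = pd.read_sql_query("SELECT sensor, value FROM readings ORDER BY value", conn)
conn.close()

print(df)
```

The string column and the numeric column arrive side by side with their names taken from the query, which is the part that has no direct counterpart in plain NumPy.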

For data featuring strings, or discrete rather than continuous values, Numpy/SciPy has no equivalent to the groupby capability or to the database-like joining of tables on matching values.
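
A minimal sketch of both operations, using two small made-up tables (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical order lines, keyed by a string customer name.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cho"],
    "amount": [10.0, 5.0, 7.5, 2.0],
})

# Hypothetical lookup table mapping each customer to a region.
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cho"],
    "region": ["north", "south", "north"],
})

# Split-apply-combine: split rows by customer, sum each group, combine.
totals = orders.groupby("customer")["amount"].sum()

# Database-style join on matching values of the "customer" column.
joined = orders.merge(customers, on="customer", how="left")

# Group again on the column that the join brought in.
by_region = joined.groupby("region")["amount"].sum()

print(totals)
print(by_region)
```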

For time series, there is the massive benefit of handling time series data using a datetime index, which allows you to resample smoothly to different intervals, fill in values and plot your series incredibly easily.
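
Here is a short sketch of resampling and filling with a datetime index. The dates and values are invented, and plotting is left out so the example needs nothing beyond pandas.

```python
import pandas as pd

# Six made-up readings taken every 12 hours.
index = pd.date_range("2024-01-01", periods=6, freq="12h")
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index)

# Downsample: one mean value per calendar day.
daily_mean = series.resample("D").mean()

# Upsample to 6-hourly and fill the new gaps by linear interpolation.
six_hourly = series.resample("6h").interpolate()

print(daily_mean)
print(six_hourly.head())
```

Each day contains two readings, so the daily means come out as 1.5, 3.5 and 5.5, and the interpolated series gains one new point halfway between each original pair.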

Since many of my problems start their lives in spreadsheets, I am also very grateful for the relatively transparent handling of Excel files in both .xls and .xlsx formats with a uniform interface.

There is also a greater ecosystem, with packages like seaborn enabling more fluent statistical analysis and model fitting than is possible with the base numpy/scipy stuff.


A main point is that it introduces new data structures like DataFrames, Panels (removed in later pandas releases) etc., and has good interfaces to other structures and libraries. So in general it's more a great extension to the Python ecosystem than an improvement over other libraries. For me it's a great tool among others like numpy and bcolz. Often I use it to reshape my data and get an overview before starting to do data mining etc.