Sample datasets in Pandas


Since I originally wrote this answer, I have updated it with the many ways that are now available for accessing sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt seems to be the only option.

Seaborn

The brilliant plotting package seaborn has several built-in sample data sets.

    import seaborn as sns

    iris = sns.load_dataset('iris')
    iris.head()

       sepal_length  sepal_width  petal_length  petal_width species
    0           5.1          3.5           1.4          0.2  setosa
    1           4.9          3.0           1.4          0.2  setosa
    2           4.7          3.2           1.3          0.2  setosa
    3           4.6          3.1           1.5          0.2  setosa
    4           5.0          3.6           1.4          0.2  setosa

Pandas

If you do not want to import seaborn but still want to access its sample data sets, you can use @andrewwowens's approach for the seaborn sample data:

    import pandas as pd

    iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

Note that sns.load_dataset() modifies the column type of sample data sets that contain categorical columns, so the result might not be the same as what you get by reading the URL directly. The iris and tips sample data sets are also available in the pandas GitHub repo.
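To illustrate the difference in column types, here is a minimal sketch of converting text columns to categoricals by hand after pd.read_csv(). The rows are hand-typed in the layout of the tips data set so that the example runs without network access, and the list of columns to convert is my own assumption. It only approximates what sns.load_dataset() does and is not a copy of seaborn's logic.

```python
import io

import pandas as pd

# Hand-typed rows in the layout of the tips data set (illustrative only),
# read from a string instead of the URL so no network access is needed.
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
"""

tips = pd.read_csv(io.StringIO(csv_text))
print(tips.dtypes)  # the text columns are not categorical after read_csv

# Assumed list of columns to treat as categorical.
categorical_columns = ["sex", "smoker", "day", "time"]
tips = tips.astype({column: "category" for column in categorical_columns})
print(tips.dtypes)  # the listed columns now have the category dtype
```

The same astype() call should work on a data frame read from the URL, as long as the column names match.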

R sample datasets

Since any dataset can be read via pd.read_csv(), it is possible to access all of R's sample data sets by copying the URLs from this R data set repository.

Additional ways of loading the R sample data sets include statsmodels

    import statsmodels.api as sm

    iris = sm.datasets.get_rdataset('iris').data

and PyDataset

    from pydataset import data

    iris = data('iris')

scikit-learn

By default, scikit-learn returns sample data as NumPy arrays rather than a pandas DataFrame.

    from sklearn.datasets import load_iris

    iris = load_iris()
    # `iris.data` holds the numerical values
    # `iris.feature_names` holds the numerical column names
    # `iris.target` holds the categorical (species) values (as ints)
    # `iris.target_names` holds the unique categorical names
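If you want a DataFrame anyway, the pieces listed above can be assembled by hand. This is one possible sketch rather than an official recipe; the names `bunch` and `iris_df` and the `species` column name are my own choices.

```python
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris()

# Numerical measurements become the columns of the frame.
iris_df = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# Map the integer targets back to their species names as a categorical column.
iris_df["species"] = pd.Categorical.from_codes(bunch.target, bunch.target_names)

print(iris_df.head())
```

Recent scikit-learn versions also accept `load_iris(as_frame=True)`, which returns the data as pandas objects directly.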

Quilt

Quilt is a dataset manager. It includes many common sample datasets, such as several from the uciml sample repository. The quick start page shows how to install and import the iris data set:

    # In your terminal
    $ pip install quilt
    $ quilt install uciml/iris

After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.

    import quilt.data.uciml.iris as ir

    iris = ir.tables.iris()

       sepal_length  sepal_width  petal_length  petal_width        class
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa

Quilt also supports dataset versioning and includes a short description of each dataset.


The rpy2 module is made for this:

    from rpy2.robjects import r, pandas2ri

    pandas2ri.activate()
    r['iris'].head()

yields

       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    1           5.1          3.5           1.4          0.2  setosa
    2           4.9          3.0           1.4          0.2  setosa
    3           4.7          3.2           1.3          0.2  setosa
    4           4.6          3.1           1.5          0.2  setosa
    5           5.0          3.6           1.4          0.2  setosa

Up to pandas 0.19 you could use pandas' own rpy interface:

    import pandas.rpy.common as rcom

    iris = rcom.load_data('iris')
    print(iris.head())

yields

       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    1           5.1          3.5           1.4          0.2  setosa
    2           4.9          3.0           1.4          0.2  setosa
    3           4.7          3.2           1.3          0.2  setosa
    4           4.6          3.1           1.5          0.2  setosa
    5           5.0          3.6           1.4          0.2  setosa

rpy2 also provides a way to convert R objects into Python objects:

    import pandas as pd
    import rpy2.robjects as ro
    import rpy2.robjects.conversion as conversion
    from rpy2.robjects import pandas2ri

    pandas2ri.activate()
    R = ro.r
    # Note: rpy2 3.x renamed ri2py to rpy2py.
    df = conversion.ri2py(R['mtcars'])
    print(df.head())

yields

        mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
    0  21.0    6   160  110  3.90  2.620  16.46   0   1     4     4
    1  21.0    6   160  110  3.90  2.875  17.02   0   1     4     4
    2  22.8    4   108   93  3.85  2.320  18.61   1   1     4     1
    3  21.4    6   258  110  3.08  3.215  19.44   1   0     3     1
    4  18.7    8   360  175  3.15  3.440  17.02   0   0     3     2


Any publicly available .csv file can be loaded into pandas quickly using its URL. Here is an example using the iris dataset originally from the UCI archive.

    import pandas as pd

    file_name = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
    df = pd.read_csv(file_name)
    df.head()

The output is the first five rows of the .csv file you just loaded from the given URL.

    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width species
    0           5.1          3.5           1.4          0.2  setosa
    1           4.9          3.0           1.4          0.2  setosa
    2           4.7          3.2           1.3          0.2  setosa
    3           4.6          3.1           1.5          0.2  setosa
    4           5.0          3.6           1.4          0.2  setosa

A memorable short URL for the same is https://j​.mp/iriscsv. This short URL will work only if it's typed and not if it's copy-pasted.
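If you load the same URL repeatedly, one possible approach is to keep a local copy after the first download. The helper below is a hypothetical sketch: `read_csv_cached` and the cache path are names I made up, not part of pandas. It reads the local file when it exists and only goes to the URL otherwise.

```python
from pathlib import Path

import pandas as pd


def read_csv_cached(url, cache_path):
    """Read a CSV from cache_path if present, else download it from url and save a copy."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # A local copy exists, so no network access is needed.
        return pd.read_csv(cache_path)
    df = pd.read_csv(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df


# Example call (the first run needs network access, later runs do not):
# df = read_csv_cached(
#     "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv",
#     "data/iris.csv",
# )
```

Note that the cached copy is never refreshed, so delete the file if the remote data changes.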