Are there any example data sets for Python?
You can use the rpy2 package to access all R datasets from Python.
Set up the interface:
>>> from rpy2.robjects import r, pandas2ri
>>> def data(name):
...     # Note: rpy2 3.x renamed ri2py to rpy2py
...     return pandas2ri.ri2py(r[name])
Then call data() with the name of any available dataset (just like in R):
>>> df = data('iris')
>>> df.describe()
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
To see a list of the available datasets with a description for each:
>>> print(r.data())
Note: rpy2 requires an R installation with the R_HOME environment variable set, and pandas must be installed as well.
UPDATE
I just created PyDataset, a simple module that makes loading a dataset from Python as easy as it is in R (and it does not require an R installation, only pandas).
To start using it, install the module:
$ pip install pydataset
Then just load up any dataset you wish (currently around 757 datasets available):
from pydataset import data
titanic = data('titanic')
There are also datasets available from the Scikit-Learn library.
from sklearn import datasets
There are multiple datasets within this package. Some of the Toy Datasets are:
load_boston(): Load and return the boston house-prices dataset (regression). Note that this loader was removed in scikit-learn 1.2.
load_iris(): Load and return the iris dataset (classification).
load_diabetes(): Load and return the diabetes dataset (regression).
load_digits([n_class]): Load and return the digits dataset (classification).
load_linnerud(): Load and return the linnerud dataset (multivariate regression).
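Since the set of loaders changes between scikit-learn versions, it can help to check which ones your installation actually provides. A minimal sketch, assuming the convention that the bundled loaders are functions named load_* (the list also picks up a few file helpers such as load_files):

```python
from sklearn import datasets

# Collect every public name in sklearn.datasets that starts with "load_".
# These loaders read data shipped with the package, so no download is needed.
loaders = sorted(name for name in dir(datasets) if name.startswith("load_"))
print(loaders)
```

The fetch_* functions in the same module are the ones that download larger datasets on first use.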
I originally posted this over at the related question Sample Datasets in Pandas, but since it is relevant outside pandas I am including it here as well.
There are now many ways to access sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt seems to be the only option.
Seaborn
The brilliant plotting package seaborn has several built-in sample data sets.
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
Pandas
If you do not want to import seaborn, but still want to access its sample data sets, you can read the seaborn sample data from its URL:
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
Note that sns.load_dataset() modifies the column type of sample data sets that contain categorical columns, so the result may differ from what you get by reading the URL directly. The iris and tips sample data sets are also available in the pandas GitHub repo.
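If you read a CSV directly and want a categorical column anyway, you can convert it yourself. The sketch below uses a few illustrative inline rows shaped like the tips data set so that it runs without network access; the category order in day_order is an assumption, not necessarily what seaborn applies.

```python
import io

import pandas as pd

# Illustrative stand-in rows (not the full tips data set).
csv_text = """total_bill,tip,day
16.99,1.01,Sun
10.34,1.66,Sun
20.65,3.35,Sat
27.20,4.00,Thur
"""

tips = pd.read_csv(io.StringIO(csv_text))
dtype_before = tips["day"].dtype  # plain string/object dtype after read_csv

# Assumed ordering of the categories; adjust to your own needs.
day_order = ["Thur", "Fri", "Sat", "Sun"]
tips["day"] = pd.Categorical(tips["day"], categories=day_order)

print(dtype_before, "->", tips["day"].dtype)
```

Declaring the categories explicitly keeps levels that do not occur in the sample (here "Fri"), which matters for plots and group-by operations.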
R sample datasets
Since any dataset can be read via pd.read_csv(), it is possible to access all of R's sample data sets by copying the URLs from this R data set repository.
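As a rough sketch of the URL approach: the helper names below are hypothetical, and the base URL assumes the repository meant is the Rdatasets collection on GitHub, so check the address against the repository page before relying on it.

```python
import pandas as pd

# Assumed location of the CSV files (Rdatasets collection); verify before use.
RDATASETS_BASE = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv"


def rdataset_url(name, package="datasets"):
    """Build the CSV URL for an R data set (hypothetical helper)."""
    return f"{RDATASETS_BASE}/{package}/{name}.csv"


def load_rdataset(name, package="datasets"):
    """Download the CSV and return it as a DataFrame (needs network access)."""
    # Assumption: the first CSV column holds R's row names, so use it as the index.
    return pd.read_csv(rdataset_url(name, package), index_col=0)


print(rdataset_url("iris"))
```

Calling load_rdataset('iris') would then fetch the file; it is left uncalled here so that the snippet runs offline.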
Additional ways of loading the R sample data sets include statsmodels
import statsmodels.api as sm
iris = sm.datasets.get_rdataset('iris').data
and PyDataset
from pydataset import data
iris = data('iris')
scikit-learn
scikit-learn returns sample data as numpy arrays rather than a pandas dataframe.
from sklearn.datasets import load_iris
iris = load_iris()
# `iris.data` holds the numerical values
# `iris.feature_names` holds the numerical column names
# `iris.target` holds the categorical (species) values (as ints)
# `iris.target_names` holds the unique categorical names
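If you would rather work with a dataframe, the four attributes above can be combined by hand. A minimal sketch (the column name species is my own choice, not something scikit-learn provides):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Numerical columns, labelled with the feature names.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer codes in `iris.target` back to the species names.
df["species"] = pd.Categorical.from_codes(iris.target, categories=iris.target_names)

print(df.head())
```

scikit-learn 0.23 and later can also do this for you: load_iris(as_frame=True) returns the data as pandas objects.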
Quilt
Quilt is a dataset manager created to facilitate dataset management. It includes many common sample datasets, such as several from the uciml sample repository. The quick start page shows how to install and import the iris data set:
# In your terminal
$ pip install quilt
$ quilt install uciml/iris
After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.
import quilt.data.uciml.iris as ir
iris = ir.tables.iris()
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
Quilt also supports dataset versioning and includes a short description of each dataset.