
loading EMNIST-letters dataset


Because of the way the dataset is structured, the array of image arrays can be accessed with mat['dataset'][0][0][0][0][0][0] and the array of label arrays with mat['dataset'][0][0][0][0][0][1]. For instance, print(mat['dataset'][0][0][0][0][0][0][0]) will print out the pixel values of the first image, and print(mat['dataset'][0][0][0][0][0][1][0]) will print the first image's label.
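As a minimal sketch (assuming the MATLAB-format file is named emnist-letters.mat and is loaded with scipy.io.loadmat), that indexing looks like this:

    from scipy import io as sio

    # Load the MATLAB-format file (filename assumed here)
    mat = sio.loadmat('emnist-letters.mat')

    # Nested struct arrays: images and labels of the training set
    images = mat['dataset'][0][0][0][0][0][0]
    labels = mat['dataset'][0][0][0][0][0][1]

    print(images[0])  # pixel values of the first image
    print(labels[0])  # label of the first image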

For a less...convoluted dataset, I'd actually recommend using the CSV version of the EMNIST dataset on Kaggle: https://www.kaggle.com/crawford/emnist, where each row is a separate image with 785 columns: the first column is the class label and each of the remaining 784 columns is one pixel value (for a 28 x 28 image).
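If you go that route, here's a minimal sketch with pandas (the filename emnist-letters-train.csv and the lack of a header row are assumptions about the Kaggle download; adjust to match your files):

    import pandas as pd

    # Filename assumed from the Kaggle download
    train = pd.read_csv('emnist-letters-train.csv', header=None)

    y_train = train.iloc[:, 0].to_numpy()   # first column: class label
    X_train = train.iloc[:, 1:].to_numpy()  # remaining 784 columns: pixel values

    # Optionally reshape each row into a 28 x 28 image
    X_train = X_train.reshape((-1, 28, 28))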


@Josh Payne's answer is correct, but I'll expand on it for those who want to use the .mat file with an emphasis on typical data splits.

The data itself has already been split into a training and a test set. Here's how I accessed the data:

    from scipy import io as sio
    mat = sio.loadmat('emnist-letters.mat')
    data = mat['dataset']
    X_train = data['train'][0,0]['images'][0,0]
    y_train = data['train'][0,0]['labels'][0,0]
    X_test = data['test'][0,0]['images'][0,0]
    y_test = data['test'][0,0]['labels'][0,0]

There is an additional field 'writers' (e.g. data['train'][0,0]['writers'][0,0]) that distinguishes the original sample writer. Finally, there is another field data['mapping'], but I'm not sure what it is mapping the digits to.
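If you want to peek at those fields, the same [0,0] indexing pattern applies; a small sketch:

    # Writer IDs for the training images
    writers_train = data['train'][0,0]['writers'][0,0]
    print(writers_train[:5])

    # Print the mapping array to inspect what each class label corresponds to
    print(data['mapping'][0,0])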

In addition, in Section II D, the EMNIST paper states that "the last portion of the training set, equal in size to the testing set, is set aside as a validation set". Strangely, the training/testing sizes in the .mat file do not match the numbers listed in Table II, but they do match the sizes in Fig. 2.

    val_start = X_train.shape[0] - X_test.shape[0]
    X_val = X_train[val_start:X_train.shape[0],:]
    y_val = y_train[val_start:X_train.shape[0]]
    X_train = X_train[0:val_start,:]
    y_train = y_train[0:val_start]

If you don't want a validation set, it is fine to leave these samples in the training set.

Also, if you would like to reshape the data into 2D, 28x28-pixel images instead of a 1D array of 784 values, you'll need to do a numpy reshape using Fortran ordering to get the correct image orientation (MATLAB uses column-major ordering, just like Fortran), e.g.:

    X_train = X_train.reshape( (X_train.shape[0], 28, 28), order='F')
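As a quick sanity check (purely illustrative, using matplotlib), you can display the first reshaped image and confirm the letter comes out upright:

    import matplotlib.pyplot as plt

    # With order='F' the letter should appear upright rather than flipped
    plt.imshow(X_train[0], cmap='gray')
    plt.title('label: {}'.format(y_train[0]))
    plt.show()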


An alternative solution is to use the EMNIST python package. (Full details at https://pypi.org/project/emnist/)

This lets you pip install emnist in your environment and then import the datasets (they will be downloaded the first time you run the program).

Example from the site:

    >>> from emnist import extract_training_samples
    >>> images, labels = extract_training_samples('digits')
    >>> images.shape
    (240000, 28, 28)
    >>> labels.shape
    (240000,)

You can also list the available datasets:

    >>> from emnist import list_datasets
    >>> list_datasets()
    ['balanced', 'byclass', 'bymerge', 'digits', 'letters', 'mnist']

And replace 'digits' in the first example with your choice.
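For instance, for the letters split (the package also provides extract_test_samples for the test partition; the shapes shown assume the standard letters split of 124,800 training and 20,800 test images):

    >>> from emnist import extract_training_samples, extract_test_samples
    >>> images, labels = extract_training_samples('letters')
    >>> images.shape
    (124800, 28, 28)
    >>> test_images, test_labels = extract_test_samples('letters')
    >>> test_images.shape
    (20800, 28, 28)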

This gives you all the data as numpy arrays, which I have found makes things easy to work with.