How to create an image dataset just like the MNIST dataset?


You can either write a function that loads all your images and stacks them into a NumPy array, if everything fits in RAM, or use Keras' ImageDataGenerator (https://keras.io/preprocessing/image/), which provides a flow_from_directory method that reads images from disk in batches. You can find an example here: https://gist.github.com/fchollet/0830affa1f7f19fd47b06d4cf89ed44d.
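As a rough illustration of the first option, here is a minimal sketch of such a loading function. It assumes a layout that is not stated in the question: one subfolder per class under a root folder, with every image converted to 28x28 grayscale as in MNIST. The name load_image_folder and its parameters are hypothetical.

```python
import os

import numpy as np
from PIL import Image


def load_image_folder(root, size=(28, 28)):
    """Load images from root/<class_name>/<file> into stacked arrays.

    Returns (x, y, class_names): x has shape (N, height, width) and dtype
    uint8, y holds integer class indices, and class_names[i] is the folder
    name for index i. `size` is (width, height), as PIL expects.
    """
    # Assumed layout: every subfolder of root is one class.
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    images, labels = [], []
    for index, name in enumerate(class_names):
        class_dir = os.path.join(root, name)
        for file_name in sorted(os.listdir(class_dir)):
            with Image.open(os.path.join(class_dir, file_name)) as img:
                # Convert to 8-bit grayscale and a fixed size so that every
                # array has the same shape and can be stacked.
                img = img.convert("L").resize(size)
                images.append(np.asarray(img, dtype=np.uint8))
            labels.append(index)
    return np.stack(images), np.array(labels, dtype=np.uint8), class_names
```

It could then be called as x, y, names = load_image_folder("./images"). Everything is held in memory at once, so this only suits datasets that fit in RAM.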


You should write your own function to load all the images, or do it like this:

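import cv2
import numpy as np
from imutils import paths  # assumption: `paths` is imutils.paths
from tensorflow.keras.utils import img_to_array

# args["testset"] (the image directory) and IMAGE_DIMS (height, width, ...)
# are assumed to be defined earlier in the script
data = []
imagePaths = sorted(list(paths.list_images(args["testset"])))

# loop over the input images
for imagePath in imagePaths:
    # load the image, pre-process it, and store it in the data list
    image = cv2.imread(imagePath)
    image = cv2.resize(image, (IMAGE_DIMS[1], IMAGE_DIMS[0]))
    image = img_to_array(image)
    data.append(image)
    # extract the class label from the image path and update the
    # labels list (omitted here)

# scale the raw pixel intensities to the range [0, 1]
data = np.array(data, dtype="float") / 255.0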


I might be late, but I am posting my answer to help others who visit this question in search of an answer. In this answer, I will be explaining the dataset type, how to generate such datasets, and how to load those files.

What is the file format

These datasets are already vectorized and stored in NumPy format; see the Keras Datasets Documentation for reference. Specifically, they are stored in the .npz file format; see "MNIST digits classification dataset". Here is a code block copied from the documentation for reference:

tf.keras.datasets.mnist.load_data(path="mnist.npz")

Once you generate a .npz file, you can load its arrays with NumPy and use them the same way you use the arrays of the default MNIST dataset.
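To make the "MNIST-like" layout concrete, here is a small sketch that saves four arrays under the key names x_train, y_train, x_test and y_test (the names the Keras loader reads from mnist.npz) and loads them back in the same nested-tuple shape that load_data returns. The helper names save_mnist_like and load_mnist_like are hypothetical, and the data is random stand-in data, not real images. A plain np.load is used instead of load_data, whose path argument is documented as a cache location relative to ~/.keras/datasets.

```python
import os
import tempfile

import numpy as np


def save_mnist_like(path, x_train, y_train, x_test, y_test):
    """Save four arrays under the key names found in mnist.npz."""
    np.savez_compressed(
        path, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test
    )


def load_mnist_like(path):
    """Return (x_train, y_train), (x_test, y_test), like mnist.load_data()."""
    with np.load(path) as data:
        return (
            (data["x_train"], data["y_train"]),
            (data["x_test"], data["y_test"]),
        )


# Random stand-in data with MNIST's per-image shape (28, 28) and dtype uint8.
rng = np.random.default_rng(0)
x_train = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)
y_train = rng.integers(0, 10, size=8, dtype=np.uint8)
x_test = rng.integers(0, 256, size=(2, 28, 28), dtype=np.uint8)
y_test = rng.integers(0, 10, size=2, dtype=np.uint8)

# Write to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mnistlikedataset.npz")
    save_mnist_like(path, x_train, y_train, x_test, y_test)
    (x_tr, y_tr), (x_te, y_te) = load_mnist_like(path)

print(x_tr.shape, y_tr.shape, x_te.shape, y_te.shape)
# (8, 28, 28) (8,) (2, 28, 28) (2,)
```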

How to generate a .npz file

Here is how to generate such a dataset from all the images in a folder:

# generate and save file
from PIL import Image
import os
import numpy as np

path_to_files = "./images/"
vectorized_images = []

# sorted() keeps the image order reproducible between runs
for file in sorted(os.listdir(path_to_files)):
    # the context manager closes each image file after it is read
    with Image.open(os.path.join(path_to_files, file)) as image:
        image_array = np.array(image)
    vectorized_images.append(image_array)

# Save as DataX or any other name, but the same element name
# must be used when loading it back.
# Note: all images must have the same shape to form one regular array.
np.savez("./mnistlikedataset.npz", DataX=vectorized_images)

If you want to save more than one element, you can do something like this, with the appropriate changes to the rest of the code:

np.savez("./mnistlikedataset.npz",DataX=vectorized_images_x,DataY=vectorized_images_Y)

How to load the data file

# load and use file
import numpy as np

path = "./mnistlikedataset.npz"
with np.load(path) as data:
    # load DataX as train_data
    train_data = data['DataX']
    print(train_data)

Similar to saving multiple elements, if you want to load multiple elements from a file, you can do something like this, with other appropriate changes:

with np.load(path) as data:
    train_data = data['DataX']
    print(train_data)
    test_data = data['DataY']
    print(test_data)
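MNIST ships as separate training and test sets, so if the file holds a single pool of images and labels, one possible next step is to shuffle and split it. The sketch below assumes that DataX holds the images and DataY holds one label per image, which the answer does not state; shuffle_and_split and its parameters are hypothetical, and small stand-in arrays replace the loaded data.

```python
import numpy as np


def shuffle_and_split(images, labels, test_fraction=0.2, seed=0):
    """Shuffle images and labels together, then split into train and test."""
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    rng = np.random.default_rng(seed)
    # A single permutation applied to both arrays keeps each image
    # paired with its own label.
    order = rng.permutation(len(images))
    images, labels = images[order], labels[order]
    n_test = int(round(len(images) * test_fraction))
    n_train = len(images) - n_test
    return (
        (images[:n_train], labels[:n_train]),
        (images[n_train:], labels[n_train:]),
    )


# Stand-in arrays playing the role of data['DataX'] and data['DataY'].
images = np.arange(10 * 4 * 4, dtype=np.uint8).reshape(10, 4, 4)
labels = np.arange(10, dtype=np.uint8)

(x_train, y_train), (x_test, y_test) = shuffle_and_split(images, labels)
print(x_train.shape, x_test.shape)  # (8, 4, 4) (2, 4, 4)
```

Before training, the pixel values can be scaled to [0, 1] with x_train.astype("float32") / 255.0, in the same spirit as the division by 255.0 in the OpenCV snippet above.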