
Mahout : To read a custom input file


One way to do this is by creating an extension of FileDataModel. You'll need to override the readUserIDFromString(String value) method to use some kind of resolver to do the conversion. You can use one of the implementations of IDMigrator, as Sean suggests.

For example, assuming you have an initialized MemoryIDMigrator, you could do this:

@Override
protected long readUserIDFromString(String stringID) {
    long result = memoryIDMigrator.toLongID(stringID);
    memoryIDMigrator.storeMapping(result, stringID);
    return result;
}

This way you could use memoryIDMigrator to do the reverse mapping, too. If you don't need that, you can just hash it the way it's done in their implementation (it's in AbstractIDMigrator).
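To illustrate the hashing alternative, here is a minimal Python sketch of that idea, assuming the migrator hashes the string with MD5 and keeps the first 8 bytes as a signed long (this mirrors the spirit of AbstractIDMigrator; the exact byte handling in Mahout may differ):

```python
import hashlib
import struct

def to_long_id(string_id: str) -> int:
    """Hash a string ID to a deterministic 64-bit long.

    Sketch of the one-way hashing approach: no reverse mapping is
    kept, so you cannot recover the original string from the ID.
    """
    digest = hashlib.md5(string_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the 16-byte digest as a
    # big-endian signed 64-bit integer.
    return struct.unpack(">q", digest[:8])[0]
```

The trade-off is exactly the one mentioned above: hashing is stateless and needs no memory, but if you need to map recommended IDs back to strings you must store the mapping, as MemoryIDMigrator does.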


userId and itemId can be strings, so this is the CustomFileDataModel that will convert your strings into integers and keep the (String, ID) map in memory; after computing recommendations you can map each ID back to its string.
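As a language-neutral sketch of that bidirectional mapping (the class and method names here are illustrative, not Mahout's API):

```python
class StringIDMap:
    """Minimal two-way String <-> long ID map, kept in memory."""

    def __init__(self):
        self._to_id = {}      # string -> assigned integer ID
        self._to_string = {}  # integer ID -> original string

    def to_long_id(self, s: str) -> int:
        """Return the ID for s, assigning a new one on first sight."""
        if s not in self._to_id:
            new_id = len(self._to_id) + 1
            self._to_id[s] = new_id
            self._to_string[new_id] = s
        return self._to_id[s]

    def to_string_id(self, long_id: int) -> str:
        """Reverse lookup: recover the string for a previously assigned ID."""
        return self._to_string.get(long_id)
```

After the recommender returns numeric IDs, to_string_id gives you back the original user or item strings.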


Assuming that your input fits in memory, loop through it. Track the ID for each string in a dictionary. If it does not fit in memory, use sort and then group by to accomplish the same idea.

In python:

import sys

next_id = 0
str_to_id = {}
for line in sys.stdin:
    fields = line.strip().split(',')
    this_id = str_to_id.get(fields[0])
    if this_id is None:
        next_id += 1
        this_id = next_id
        str_to_id[fields[0]] = this_id
    fields[0] = str(this_id)
    print(','.join(fields))
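The same logic can be packaged as a reusable function instead of a stdin filter, which makes the transformation easy to test (the function name is illustrative):

```python
def assign_ids(lines):
    """Replace the string key in the first CSV field of each line
    with a small integer ID, assigning IDs in order of first sight."""
    next_id = 0
    str_to_id = {}
    out = []
    for line in lines:
        fields = line.strip().split(',')
        if fields[0] not in str_to_id:
            next_id += 1
            str_to_id[fields[0]] = next_id
        fields[0] = str(str_to_id[fields[0]])
        out.append(','.join(fields))
    return out
```

For example, the input lines `alice,1,5.0`, `bob,2,3.0`, `alice,3,4.0` become `1,1,5.0`, `2,2,3.0`, `1,3,4.0`, with `alice` consistently mapped to 1.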