Hadoop - Analyze log file (Java)

I recommend computing the raw sums, as you are doing, as the output of a first Hadoop job, so that at the end of that job you have a result like this:

```
User1234     Prdsum: 58
User45687    Prdsum: 0
```

and then have a second Hadoop job (or standalone job) that compares the various values and produces another report.

Do you need "state" as part of the first Hadoop job? If so, you will need to keep a HashMap or Hashtable in your mapper or reducer that stores the values for all the keys (users, in this case) to compare - but that is not a good setup, IMHO. You are better off just doing an aggregate in one Hadoop job and doing the comparison in another.
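
For illustration, here is a minimal sketch of what the first job's reducer could look like, assuming the mapper emits (userid, amount) pairs; the class and field names are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer for the first job: sums the per-user amounts
// emitted by the mapper, producing one "userid -> total" record per user.
public class ProductSumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text userId, Iterable<LongWritable> amounts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable amount : amounts) {
            sum += amount.get();
        }
        // One aggregate record per user; the second job compares these.
        context.write(userId, new LongWritable(sum));
    }
}
```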


One way to achieve this is by using a composite key. The mapper output key is a combination of userid and event id (reminder -> 0, order -> 1). Partition the data by userid, and write your own comparator. Here is the gist.

Mapper

```
for every event, check the event type
    if event type is "reminder"
        emit : <User1234, 0> <reminder id>
    if event type is "order"
        split if you have multiple orders
        for every order
            emit : <User1234, 1> <prd, count * amount, other fields of interest>
```
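
As a sketch, such a composite key could be implemented as a custom WritableComparable; the class name `UserEventKey` and its natural order (userid first, then event id) are illustrative, not a fixed API:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Composite key: userid plus event id (0 = reminder, 1 = order).
// Sorting on both fields makes reminders arrive before orders.
public class UserEventKey implements WritableComparable<UserEventKey> {
    private String userId;
    private int eventId;

    public UserEventKey() {}                      // no-arg constructor required by Hadoop

    public UserEventKey(String userId, int eventId) {
        this.userId = userId;
        this.eventId = eventId;
    }

    public String getUserId() { return userId; }
    public int getEventId()   { return eventId; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(userId);
        out.writeInt(eventId);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readUTF();
        eventId = in.readInt();
    }

    @Override
    public int compareTo(UserEventKey other) {
        int cmp = userId.compareTo(other.userId);
        return cmp != 0 ? cmp : Integer.compare(eventId, other.eventId);
    }
}
```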

Partition using userid so all entries with the same userid go to the same reducer.
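
A partitioner that hashes on userid alone might look like this (again a sketch, reusing the hypothetical `UserEventKey` and assuming Text values):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on userid only, so reminders and orders for the same user
// land on the same reducer regardless of event id.
public class UserIdPartitioner extends Partitioner<UserEventKey, Text> {
    @Override
    public int getPartition(UserEventKey key, Text value, int numPartitions) {
        return (key.getUserId().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```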

Reducer

At the reducer, all entries will be grouped by userid and sorted by event id (i.e. first you will get all reminders for a given userid, followed by its orders).
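
That grouping does not happen by itself: it needs a grouping comparator that compares only the userid, registered with `job.setGroupingComparatorClass(...)`. A sketch, still using the hypothetical `UserEventKey`:

```java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Group reducer input by userid only, so a single reduce() call sees
// every event for a user, with reminders (event id 0) sorted first.
public class UserIdGroupingComparator extends WritableComparator {
    protected UserIdGroupingComparator() {
        super(UserEventKey.class, true);          // true -> create key instances
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        UserEventKey left  = (UserEventKey) a;
        UserEventKey right = (UserEventKey) b;
        return left.getUserId().compareTo(right.getUserId());
    }
}
```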

```
if eventid is 0
    add the reminder id to a set (reminderSet)
if eventid is 1 && prd is in reminderSet
    emit : <userid> <prdsum>
else
    emit : <userid> <0>
```
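
Translated into Java, the reducer might look like the sketch below, assuming each mapper value is a Text holding either a reminder's product id (event id 0) or a "prd,amount" pair (event id 1); all names are illustrative:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Thanks to the grouping comparator, one reduce() call receives all
// events for a user: reminders (event id 0) first, then orders.
public class ReminderOrderReducer extends Reducer<UserEventKey, Text, Text, LongWritable> {

    @Override
    protected void reduce(UserEventKey key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Set<String> reminderSet = new HashSet<>();
        long prdSum = 0;

        for (Text value : values) {
            // key.getEventId() reflects the current value's event id,
            // because Hadoop reuses the key object as it iterates.
            if (key.getEventId() == 0) {
                reminderSet.add(value.toString());            // reminder product id
            } else {
                String[] parts = value.toString().split(","); // "prd,amount"
                if (reminderSet.contains(parts[0])) {
                    prdSum += Long.parseLong(parts[1]);
                }
            }
        }
        // prdSum stays 0 if no ordered product was previously reminded.
        context.write(new Text(key.getUserId()), new LongWritable(prdSum));
    }
}
```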

More details on composite keys can be found in 'Hadoop: The Definitive Guide'.