How do I use Elastic MapReduce to run an XSLT transformation on millions of small XML files in S3?
See the Hadoop FAQ entry "How do I process files, one per map?" The approach:
- Upload your data to an S3 bucket.
- Generate a file containing the full `s3n://` path to each input file, one path per line. This file becomes the job input, so each mapper receives file paths rather than file contents.
- Write a mapper script that:
  - pulls `mapred_work_output_dir` out of the environment (*), and
  - performs the XSLT transform on each file named on stdin, saving the result to that output directory.
- Write an identity reducer that does nothing (or configure the job with zero reducers).
- Upload your mapper and reducer scripts to an S3 bucket.
- Test the job via the AWS EMR console.
(*) Hadoop Streaming exports the jobconf into each task process's environment, with dots in property names replaced by underscores, so `mapred.work.output.dir` appears as `mapred_work_output_dir`.
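A minimal sketch of the mapper described above, in Python. The stylesheet name `transform.xsl`, the use of `hadoop fs` for reading and writing, and the availability of `lxml` on the nodes are all assumptions for illustration, not part of the original answer:

```python
#!/usr/bin/env python
"""Hypothetical Hadoop Streaming mapper: each input line is an s3n:// path;
fetch the file, apply an XSLT, and write the result into the task's
working output directory."""
import os
import posixpath
import subprocess
import sys


def output_path(work_dir, s3_path):
    """Derive a per-file output path from the input file's basename."""
    name = posixpath.basename(s3_path.rstrip("/"))
    return posixpath.join(work_dir, name)


def main():
    # Assumed installed on every node (e.g. via a bootstrap action).
    from lxml import etree

    # Set by Streaming from the mapred.work.output.dir jobconf property.
    work_dir = os.environ["mapred_work_output_dir"]
    # Hypothetical stylesheet shipped alongside the mapper script.
    xslt = etree.XSLT(etree.parse("transform.xsl"))

    for line in sys.stdin:
        s3_path = line.strip()
        if not s3_path:
            continue
        # Read the source document through the Hadoop filesystem layer.
        xml_bytes = subprocess.check_output(["hadoop", "fs", "-cat", s3_path])
        result = bytes(xslt(etree.fromstring(xml_bytes)))
        # Write into the task's output directory; Hadoop promotes files
        # there into the job output when the task commits.
        dest = output_path(work_dir, s3_path)
        proc = subprocess.Popen(["hadoop", "fs", "-put", "-", dest],
                                stdin=subprocess.PIPE)
        proc.communicate(result)
        # Emit a record per file so Streaming sees progress and does not
        # time the task out on long batches.
        print("%s\t%s" % (s3_path, dest))


if __name__ == "__main__":
    # Only run the loop when actually launched by Hadoop Streaming,
    # so the helpers above can be imported standalone.
    if "mapred_work_output_dir" in os.environ:
        main()
```

Because the mapper writes its real output directly to `mapred_work_output_dir` and only emits bookkeeping records on stdout, the identity reducer (or a zero-reducer job) leaves the transformed files untouched.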