
HDFS to S3 DistCp - Access Keys


I also faced the same situation, and solved it after getting temporary credentials from the instance metadata. (In case you're using an IAM user's credentials, please note that the temporary credentials mentioned here come from an IAM role, which is attached to the EC2 instance, not to a human; refer to http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html.)

I found that only specifying the credentials in the hadoop distcp command will not work. You also have to set the config property fs.s3a.aws.credentials.provider. (Refer to http://hortonworks.github.io/hdp-aws/s3-security/index.html#using-temporary-session-credentials.)

The final command will look like this:

hadoop distcp \
  -Dfs.s3a.aws.credentials.provider="org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" \
  -Dfs.s3a.access.key="{AccessKeyId}" \
  -Dfs.s3a.secret.key="{SecretAccessKey}" \
  -Dfs.s3a.session.token="{SessionToken}" \
  s3a://bucket/prefix/file /path/on/hdfs
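If you'd rather not pass the secrets on the command line (they end up in shell history and process listings), the same fs.s3a.* properties can instead go into core-site.xml. A minimal sketch, reusing the placeholder values above and assuming you lock down the file's permissions:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>{AccessKeyId}</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>{SecretAccessKey}</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>{SessionToken}</value>
</property>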


Amazon allows you to generate temporary credentials that you can retrieve from http://169.254.169.254/latest/meta-data/iam/security-credentials/

As you can read there:

An application on the instance retrieves the security credentials provided by the role from the instance metadata item iam/security-credentials/role-name. The application is granted the permissions for the actions and resources that you've defined for the role through the security credentials associated with the role. These security credentials are temporary and we rotate them automatically. We make new credentials available at least five minutes prior to the expiration of the old credentials.

The following command retrieves the security credentials for an IAM role named s3access.

$ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/s3access

The following is example output.

{
  "Code" : "Success",
  "LastUpdated" : "2012-04-26T16:39:16Z",
  "Type" : "AWS-HMAC",
  "AccessKeyId" : "AKIAIOSFODNN7EXAMPLE",
  "SecretAccessKey" : "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "Token" : "token",
  "Expiration" : "2012-04-27T22:39:16Z"
}
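Tying this together with the accepted answer, here is a rough sketch of fetching the role credentials from the metadata endpoint and feeding them to distcp. It assumes jq is installed, the role is named s3access as in the example above, and the bucket/paths are the same placeholders as before:

ROLE=s3access
CREDS=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE)

# pull the three fields out of the JSON response shown above
ACCESS_KEY=$(echo "$CREDS" | jq -r .AccessKeyId)
SECRET_KEY=$(echo "$CREDS" | jq -r .SecretAccessKey)
SESSION_TOKEN=$(echo "$CREDS" | jq -r .Token)

hadoop distcp \
  -Dfs.s3a.aws.credentials.provider="org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" \
  -Dfs.s3a.access.key="$ACCESS_KEY" \
  -Dfs.s3a.secret.key="$SECRET_KEY" \
  -Dfs.s3a.session.token="$SESSION_TOKEN" \
  s3a://bucket/prefix/file /path/on/hdfs

Keep in mind that these credentials rotate (see the Expiration field), so re-fetch them for each run rather than caching them.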

For applications, AWS CLI, and Tools for Windows PowerShell commands that run on the instance, you do not have to explicitly get the temporary security credentials — the AWS SDKs, AWS CLI, and Tools for Windows PowerShell automatically get the credentials from the EC2 instance metadata service and use them. To make a call outside of the instance using temporary security credentials (for example, to test IAM policies), you must provide the access key, secret key, and the session token. For more information, see Using Temporary Security Credentials to Request Access to AWS Resources in the IAM User Guide.
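For the "outside of the instance" case, the usual way to hand those three values to the AWS CLI is via environment variables; a minimal sketch, assuming you've copied the values out of the JSON above:

export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_SESSION_TOKEN=token

# any CLI call now runs with the temporary credentials until they expire
aws s3 ls s3://bucket/prefix/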


Recent (2.8+) versions let you hide your credentials in a JCEKS file; there's some documentation on this on the Hadoop S3A page. That way there's no need to put any secrets on the command line at all: you just share the file across the cluster and then, in the distcp command, set hadoop.security.credential.provider.path to its path, e.g. jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks.
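A rough sketch of that approach, reusing the jceks path above and example source/target paths; hadoop credential create prompts for each value, so nothing lands in shell history:

hadoop credential create fs.s3a.access.key \
  -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
hadoop credential create fs.s3a.secret.key \
  -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks

hadoop distcp \
  -Dhadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
  /path/on/hdfs s3a://bucket/prefix/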

Fan: if you are running in EC2, the IAM role credentials should be picked up automatically from the default chain of credential providers: after looking for the config options and env vars, it tries a GET of the EC2 HTTP endpoint, which serves up the session credentials. If that's not happening, make sure that com.amazonaws.auth.InstanceProfileCredentialsProvider is on the list of credential providers. It's a bit slower than the others (and can get throttled), so it's best to put it near the end of the list.
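For reference, a minimal sketch of running distcp on an EC2 instance with the instance-profile provider configured explicitly, so no keys have to be passed at all (paths are placeholders; in a real setup you'd list this provider after the others, as noted above):

hadoop distcp \
  -Dfs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
  /path/on/hdfs s3a://bucket/prefix/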