Tuesday, September 11, 2018

Hive Query on Amazon S3 fails intermittently

SYMPTOM

A Hive query fails intermittently. The application log or hiveserver2.log shows errors like the following while running task attempts or while moving data to storage:

2016-03-02 13:28:23,459 INFO  [HiveServer2-Background-Pool: Thread-52002]: SessionState (SessionState.java:printInfo(824)) - Map 1: 2(+6)/16    Map 4: 0(+2)/16 Map 5: 9(+0)/17 Reducer 2: 0/1009       Reducer 3: 0/1009
2016-03-02 13:28:23,642 ERROR [HiveServer2-Background-Pool: Thread-51679]: exec.Task (SessionState.java:printError(833)) - Job Commit failed with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(org.apache.http.NoHttpResponseException: The target server failed to respond)'
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.http.NoHttpResponseException: The target server failed to respond
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1031)
        at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:650)
        at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:655)
        at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:655)
        at org.apache.hadoop.hive.ql.exec.tez.TezTask.close(TezTask.java:403)
        ...
Caused by: org.apache.http.NoHttpResponseException: The target server failed to respond
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
        ...
        at org.jets3t.service.StorageService.copyObject(StorageService.java:871)
        at org.jets3t.service.StorageService.copyObject(StorageService.java:916)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:323)
        at sun.reflect.GeneratedMethodAccessor203.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at org.apache.hadoop.fs.s3native.$Proxy52.copy(Unknown Source)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:717)
        at org.apache.hadoop.hive.ql.exec.Utilities.renameOrMoveFiles(Utilities.java:1566)
        at org.apache.hadoop.hive.ql.exec.Utilities.mvFileToFinalPath(Utilities.java:1806)
        at org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1027)


ROOT CAUSE
Intermittent Amazon S3 access failure: the S3 endpoint failed to respond to a request (org.apache.http.NoHttpResponseException), so the job commit and the move of the output files to S3 could not complete.

RESOLUTION
Work with Amazon to resolve the access issue, reporting the complete error message from hiveserver2.log or the YARN application log.
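
To capture the complete error before opening a case, the relevant logs can be pulled as follows; this is a minimal sketch assuming a typical HDP log location and a known YARN application ID (adjust both for your environment):

# Pull the job-commit failure with surrounding context from the HiveServer2 log
# (log path assumes a typical HDP layout; adjust for your installation)
grep -A 40 'Job Commit failed with exception' /var/log/hive/hiveserver2.log

# Collect the full YARN application log for the failed query
# (replace the application ID with the one reported for your query)
yarn logs -applicationId application_1456940000000_0001 > /tmp/failed-query.log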

How do you access the Amazon Simple Storage Service (Amazon S3) filesystem by passing properties on the command line?

There is no need to change core-site.xml to access Amazon S3; you can simply pass the configuration as Java properties.

This article provides instructions for accessing Amazon S3 by passing parameters on the `hadoop` command line. This is helpful for testing access before hardcoding the configuration parameters in the HDP cluster (which requires a restart), or for writing scripts that perform a one-off task. To provide the Access Key and Secret Key, see the examples below:

Option 1 (Secure): Generate a JCEKS credential store, then use it to list and distcp files.

Create the JCEKS keystore and add the access key:

hadoop credential create fs.s3a.access.key -value '<Access-key>' \
    -provider jceks:///tmp/aws.jceks

Add the secret key to the same keystore:

hadoop credential create fs.s3a.secret.key -value '<Secret-key>' \
    -provider jceks:///tmp/aws.jceks
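
Optionally, verify that both aliases were stored in the keystore (same path as above):

hadoop credential list -provider jceks:///tmp/aws.jceks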

Now use the JCEKS keystore to list files and run distcp:

hadoop fs -Dhadoop.security.credential.provider.path=jceks:///tmp/aws.jceks \
    -ls s3a://your-bucket/

hadoop distcp -Dhadoop.security.credential.provider.path=jceks:///tmp/aws.jceks \
    /tmp/hello.txt s3a://your-bucket/
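
Once the command-line test succeeds, the same setting can be hardcoded in core-site.xml (which, as noted above, requires a restart); a minimal sketch, assuming the keystore has been copied to a location accessible to all nodes (for example, the default filesystem):

<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks:///tmp/aws.jceks</value>
</property>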

Option 2 (Less Secure): Provide the access key and secret key as clear text Java properties, without generating a keystore.

Use the following commands to list and distcp:

hadoop fs -Dfs.s3a.access.key=<Access-key> -Dfs.s3a.secret.key=<Secret-key> \
    -ls s3a://your-bucket/

hadoop distcp -Dfs.s3a.access.key=<Access-key> -Dfs.s3a.secret.key=<Secret-key> \
    /tmp/hello.txt s3a://your-bucket/
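
To confirm that the copy succeeded, the uploaded file can be read back with the same properties (file name taken from the distcp example above):

hadoop fs -Dfs.s3a.access.key=<Access-key> -Dfs.s3a.secret.key=<Secret-key> \
    -cat s3a://your-bucket/hello.txt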

Extra Option: If the bucket is associated with a different endpoint, you can override the endpoint with a Java property. Add the following property to the command line:

-Dfs.s3a.endpoint=s3.us-east-2.amazonaws.com

Example:

hadoop fs -Dfs.s3a.endpoint=s3.us-east-2.amazonaws.com \
    -Dhadoop.security.credential.provider.path=jceks:///tmp/aws.jceks \
    -ls s3a://your-bucket/