Impala ODBC/JDBC bad performance - rows fetch is very slow from a remote server compared with NN


The question does not mention which JDBC connector version is in use.

More than one server in your cluster may be running Impala daemons. Change the host in your JDBC connection URL accordingly and verify the performance against those servers too.

In case you have not yet consulted the documentation (https://www.cloudera.com/documentation/enterprise/5-12-x/topics/impala_jdbc.html), pay attention to this extract:

The latest JDBC driver, corresponding to Hive 0.13, provides substantial performance improvements for Impala queries that return large result sets. Impala 2.0 and later are compatible with the Hive 0.13 driver. If you already have an older JDBC driver installed, and are running Impala 2.0 or higher, consider upgrading to the latest Hive JDBC driver for best performance with JDBC applications.

Since you are using a remote machine to access Impala, refer to this information also:

If you are using JDBC-enabled applications on hosts outside the CDH cluster, you cannot use the CDH install procedure on the non-CDH hosts. Install the JDBC driver on at least one CDH host .... Then download the JAR files to each client machine that will use JDBC with Impala...

If you have not already done so, update the JDBC connector and make sure that all the impalad instances are running. Then compare the performance results of ODBC and JDBC.

This link is also worth referring to: https://www.cloudera.com/documentation/enterprise/5-12-x/topics/impala_troubleshooting.html

Update 1:

Reference#1: https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Impala-JDBC-10x-Slower-Vs-Shell/m-p/51779

As recommended in the reference, try adding the parameters below to the JDBC connection string and check the log:

;LogLevel=6;LogPath=/path/to/directory
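To make the string handling concrete, here is a minimal Java sketch that appends those two parameters to an existing connection URL. The host name, the `jdbc:impala://` prefix, port 21050, and the class and method names are assumptions for illustration only; use whatever URL your driver already accepts, and keep `/path/to/directory` as a placeholder for a directory the client can write to.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ImpalaJdbcUrlSketch {

    // Appends ";key=value" pairs to a JDBC URL, preserving insertion order.
    static String withParams(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(';').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Adds the two logging parameters quoted from the referenced thread.
    static String withDriverLogging(String baseUrl, String logDir) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("LogLevel", "6");
        params.put("LogPath", logDir);
        return withParams(baseUrl, params);
    }

    public static void main(String[] args) {
        // Hypothetical host and database; replace with your own impalad address.
        String base = "jdbc:impala://impalad-host.example.com:21050/default";
        System.out.println(withDriverLogging(base, "/path/to/directory"));
    }
}
```

The resulting string can be passed to `DriverManager.getConnection(...)` as usual. Remove the logging parameters again once you have the trace, since verbose logging itself slows down fetching.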

Reference#2: https://issues.apache.org/jira/browse/IMPALA-2651. You might consider the setting below:

SET disable_codegen=true;

Update 2: I guess you already have the jars listed below on at least one Impala server in your cluster.

commons-logging-X.X.X.jar
hadoop-common.jar
hive-common-X.XX.X-cdhX.X.X.jar
hive-jdbc-X.XX.X-cdhX.X.X.jar
hive-metastore-X.XX.X-cdhX.X.X.jar
hive-service-X.XX.X-cdhX.X.X.jar
httpclient-X.X.X.jar
httpcore-X.X.X.jar
libfb303-X.X.X.jar
libthrift-X.X.X.jar
log4j-X.X.XX.jar
slf4j-api-X.X.X.jar
slf4j-logXjXX-X.X.X.jar

Please copy these jars to the machine from which you access Impala through JDBC code. Make sure that these jars are on your classpath, then execute the JDBC code.


Finally, after almost 6 months, I have found the solution!

It was about my 1024 limitation remark all along: the row batch limit came from the maximum value of BATCH_SIZE (1024). In recent versions (CDH 5.14 / Impala 2.11) the effective range is 1-65536.

1-1024: https://www.cloudera.com/documentation/enterprise/5-12-x/topics/impala_batch_size.html
1-65536: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_batch_size.html

So when I increase it through odbc.ini with SSP_BATCH_SIZE, I can benefit from increasing the other ODBC parameters (RowsFetchedPerBlock / TSaslTransportBufSize), and the rows can be fetched in seconds (~45 secs) instead of tens of minutes.
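To give a feel for why the cap mattered, here is a rough back-of-the-envelope sketch in Java. It assumes a deliberately simplified model in which each fetch call returns at most min(RowsFetchedPerBlock, BATCH_SIZE) rows. The 10,000,000-row result set and the class and method names are hypothetical; this is not a measurement of my actual workload.

```java
final class FetchRoundTripSketch {

    // Simplified model (an assumption): a fetch call never returns more rows
    // than the smaller of the client block size and the server batch size.
    static long fetchCalls(long totalRows, int rowsFetchedPerBlock, int batchSize) {
        long effective = Math.min(rowsFetchedPerBlock, batchSize);
        // Ceiling division: the last call may return a partial block.
        return (totalRows + effective - 1) / effective;
    }

    public static void main(String[] args) {
        long rows = 10_000_000L; // hypothetical result set size
        // Old cap: BATCH_SIZE at most 1024, so a larger RowsFetchedPerBlock has no effect.
        System.out.println(fetchCalls(rows, 65536, 1024));  // prints 9766
        // New range: BATCH_SIZE raised to 65536 lets the larger block size take effect.
        System.out.println(fetchCalls(rows, 65536, 65536)); // prints 153
    }
}
```

Under that model the number of round trips drops by a factor of about 64, which is in line with the kind of improvement I observed, although the real speedup also depends on the network and on TSaslTransportBufSize.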

Link: http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Impala-ODBC-JDBC-bad-performance-rows-fetch-is-very-slow-from-a/m-p/61152

Thanks all for your replies.