How does Hive 'alter table <table name> concatenate' work?


As per the Hive documentation on Alter Table/Partition Concatenate:

If the table or partition contains many small RCFiles or ORC files, then the above command will merge them into larger files. In case of RCFile the merge happens at block level whereas for ORC files the merge happens at stripe level thereby avoiding the overhead of decompressing and decoding the data.
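To make the "stripe level" point concrete, here is a toy model in plain Scala. It is only a sketch under stated assumptions: `Stripe`, `OrcFileModel`, and `concatenate` are hypothetical names invented for illustration and are not Hive or ORC APIs. The sketch ignores file-level metadata and always produces a single output file, which the real command need not do. What it shows is that the stripes are carried over as opaque, still-compressed blobs, so nothing has to be decompressed or decoded.

```scala
object StripeMergeSketch {
  // Hypothetical stand-in: a stripe is kept as an opaque, still-compressed byte blob.
  final case class Stripe(compressed: Vector[Byte], rowCount: Long)

  // Hypothetical stand-in for an ORC file: just an ordered list of stripes.
  final case class OrcFileModel(stripes: Vector[Stripe]) {
    def rowCount: Long = stripes.map(_.rowCount).sum
  }

  // Merge small files by appending their stripes unchanged.
  // No stripe is decompressed or decoded; only the container changes.
  def concatenate(files: Seq[OrcFileModel]): OrcFileModel =
    OrcFileModel(files.flatMap(_.stripes).toVector)
}
```

A row-by-row rewrite, by contrast, would have to decode every value and re-encode it into new stripes, which is the overhead the quoted passage says is avoided.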

Also, from the ORC specification on stripes:

The body of ORC files consists of a series of stripes. Stripes are large (typically ~200MB) and independent of each other and are often processed by different tasks. The defining characteristic for columnar storage formats is that the data for each column is stored separately and that reading data out of the file should be proportional to the number of columns read.

In ORC files, each column is stored in several streams that are stored next to each other in the file. For example, an integer column is represented as two streams: PRESENT, which uses one bit per value to record whether the value is non-null, and DATA, which records the non-null values. If all of a column's values in a stripe are non-null, the PRESENT stream is omitted from the stripe. For binary data, ORC uses three streams: PRESENT, DATA, and LENGTH, which stores the length of each value. The details of each type will be presented in the following subsections.
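The PRESENT/DATA split for an integer column can be sketched in a few lines of plain Scala. This is a conceptual illustration only: `IntColumnStreams`, `encode`, and `decode` are hypothetical names, and the real ORC streams are additionally run-length encoded and compressed, which is not modelled here.

```scala
object OrcStreamSketch {
  // Hypothetical container for the streams of one integer column in one stripe.
  // `present` is None when the PRESENT stream is omitted.
  final case class IntColumnStreams(present: Option[Vector[Boolean]], data: Vector[Int])

  // Split a nullable integer column into PRESENT (one flag per value)
  // and DATA (only the non-null values, in order).
  def encode(values: Vector[Option[Int]]): IntColumnStreams = {
    val data = values.flatten
    // PRESENT is omitted when every value in the stripe is non-null.
    val present =
      if (values.forall(_.isDefined)) None
      else Some(values.map(_.isDefined))
    IntColumnStreams(present, data)
  }

  // Rebuild the column by walking PRESENT and pulling from DATA for each set flag.
  def decode(streams: IntColumnStreams): Vector[Option[Int]] =
    streams.present match {
      case None => streams.data.map(v => Option(v))
      case Some(flags) =>
        val it = streams.data.iterator
        flags.map(isSet => if (isSet) Some(it.next()) else None)
    }
}
```

For example, encoding `Vector(Some(1), None, Some(3))` yields a PRESENT stream of `true, false, true` and a DATA stream of `1, 3`, while a column with no nulls yields no PRESENT stream at all.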

To run Hive queries from Spark, you can use SparkSQL through a HiveContext created from the SparkContext (sc):

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("Your_hive_query_here")


Please note that a number of Hive SQL commands are not supported in SparkSQL.

ALTER TABLE <tableIdentifier> [partitionSpec] CONCATENATE is on that list, and has been in Spark 1, 2, and 3. It will likely remain unsupported in Spark until the day the Hadoop ecosystem ships Hive with Spark as its default engine, and even then the command may be deprecated.