Can we get all the column names from an HBase table?

hadoop hbase

You can use a MapReduce job for this. Unlike the coprocessor approach, it does not require installing custom libraries on the HBase cluster. The code below sets up such a job.

Job setup

    Job job = Job.getInstance(config);
    job.setJobName("Distinct columns");

    Scan scan = new Scan();
    scan.setBatch(500);
    scan.addFamily(YOU_COLUMN_FAMILY_NAME);
    scan.setFilter(new KeyOnlyFilter()); // return only the key part of each KeyValue (row, column family, qualifier)
    scan.setCacheBlocks(false);          // don't set to true for MR jobs

    TableMapReduceUtil.initTableMapperJob(
            YOU_TABLE_NAME,
            scan,
            OnlyColumnNameMapper.class,   // mapper
            Text.class,                   // mapper output key
            Text.class,                   // mapper output value
            job);

    job.setNumReduceTasks(1);             // a single reducer sees every qualifier
    job.setReducerClass(OnlyColumnNameReducer.class);

Mapper

    public class OnlyColumnNameMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, final Context context)
                throws IOException, InterruptedException {
            CellScanner cellScanner = value.cellScanner();
            while (cellScanner.advance()) {
                Cell cell = cellScanner.current();
                // copy only the qualifier bytes out of the cell's backing array
                byte[] q = Bytes.copy(cell.getQualifierArray(),
                                      cell.getQualifierOffset(),
                                      cell.getQualifierLength());
                // emit the qualifier as the key so the shuffle groups duplicates together
                context.write(new Text(q), new Text());
            }
        }
    }

Reducer

    public class OnlyColumnNameReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(key), new Text());
        }
    }


HBase can be visualised as a distributed NavigableMap<byte[], NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>>>, keyed in turn by row key, column family, column qualifier, and timestamp.

There is no central "metadata" (for example, something stored on the master node) that lists all the qualifiers present across the region servers.

So for a one-time use case, the only way is to scan the entire table and add the qualifier names to a Set<>, as you mentioned.
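To make the nested-map picture concrete, here is a minimal in-memory sketch of why a full scan is needed. It is not HBase client code: the class `QualifierScanSketch` and its methods are hypothetical, `String` stands in for `byte[]` to keep it short, and a plain `TreeMap` stands in for the distributed table. Since qualifiers live only inside each row's map, collecting them means visiting every row.

```java
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory model of an HBase table; not part of any HBase API.
class QualifierScanSketch {

    // row key -> column family -> qualifier -> timestamp -> value
    // (String replaces byte[] purely for readability)
    final NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> table =
            new TreeMap<>();

    // Store one cell, creating the intermediate maps on demand.
    void put(String row, String family, String qualifier, long timestamp, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .computeIfAbsent(qualifier, q -> new TreeMap<>())
             .put(timestamp, value);
    }

    // "Full scan": walk every row and collect the qualifiers of one family into a set.
    NavigableSet<String> distinctQualifiers(String family) {
        NavigableSet<String> qualifiers = new TreeSet<>();
        for (NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families : table.values()) {
            NavigableMap<String, NavigableMap<Long, String>> columns = families.get(family);
            if (columns != null) {
                qualifiers.addAll(columns.keySet());
            }
        }
        return qualifiers;
    }

    public static void main(String[] args) {
        QualifierScanSketch t = new QualifierScanSketch();
        t.put("row1", "cf", "name", 1L, "a");
        t.put("row2", "cf", "age", 1L, "b");
        t.put("row2", "cf", "name", 2L, "c");
        t.put("row3", "other", "zip", 1L, "d");
        System.out.println(t.distinctQualifiers("cf")); // prints [age, name]
    }
}
```

Each row can carry a different set of qualifiers, which is why no single row (and no central catalogue) can answer the question.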

If this is a recurring use case (and you have the discretion to add components to your tech stack), you may want to consider adding Redis. The set of qualifiers can then be maintained in a distributed fashion using a Redis Set.


HBase coprocessors can be used for this scenario. You can write a custom Endpoint implementation, which works like a stored procedure in an RDBMS. It executes your code on the server side and collects the distinct columns for each region. On the client, you then combine the per-region results to get the distinct columns across all regions.

Performance benefit: the columns themselves are not all transferred to the client, only the distinct names per region, which reduces network traffic.
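The data flow can be sketched in plain Java. This is only a model of the idea, not the coprocessor API: `RegionMergeSketch`, `distinctInRegion`, and `mergeOnClient` are hypothetical names, and the lists stand in for the qualifiers stored in each region. A real Endpoint would additionally need a service definition and deployment on the region servers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of "deduplicate per region, merge on the client".
class RegionMergeSketch {

    // Server-side step: collapse one region's qualifiers into a distinct set.
    static Set<String> distinctInRegion(List<String> qualifiersInRegion) {
        return new TreeSet<>(qualifiersInRegion);
    }

    // Client-side step: union the (small) per-region sets.
    static Set<String> mergeOnClient(List<Set<String>> perRegion) {
        Set<String> all = new TreeSet<>();
        for (Set<String> regionResult : perRegion) {
            all.addAll(regionResult);
        }
        return all;
    }

    public static void main(String[] args) {
        // Seven cells in total, but only four names cross the "network".
        List<String> region1 = Arrays.asList("name", "age", "name", "name");
        List<String> region2 = Arrays.asList("age", "zip", "zip");

        List<Set<String>> partials = new ArrayList<>();
        partials.add(distinctInRegion(region1));
        partials.add(distinctInRegion(region2));

        System.out.println(mergeOnClient(partials)); // prints [age, name, zip]
    }
}
```

The saving grows with the number of rows per region, since each region returns at most one entry per distinct qualifier regardless of how many cells it holds.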
