HBase quickly count number of rows

hadoop hbase bigdata

Use RowCounter in HBase. RowCounter is a MapReduce job that counts all the rows of a table. It is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns about metadata inconsistency. It will run the MapReduce job in a single process, but it will run faster if you have a MapReduce cluster in place for it to exploit.

    $ hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>

    Usage: RowCounter [options] <tablename> [
        --starttime=[start]
        --endtime=[end]
        [--range=[startKey],[endKey]]
        [<column1> <column2>...]
    ]
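If you launch RowCounter from a script, the options in the usage text above can be assembled programmatically. The following is only a minimal sketch: `build_rowcounter_cmd` is a hypothetical helper (it is not part of HBase), and it does nothing more than build the argument list from the flags shown in the usage text.

```python
def build_rowcounter_cmd(table, start_key=None, end_key=None,
                         start_time=None, end_time=None, columns=()):
    """Build the argument list for the RowCounter MapReduce job.

    Hypothetical helper: it only mirrors the flags from the usage text
    (--starttime, --endtime, --range, optional column names).
    """
    cmd = ["hbase", "org.apache.hadoop.hbase.mapreduce.RowCounter", table]
    if start_time is not None:
        cmd.append("--starttime={}".format(start_time))
    if end_time is not None:
        cmd.append("--endtime={}".format(end_time))
    if start_key is not None or end_key is not None:
        # Either side of the range may be left empty.
        cmd.append("--range={},{}".format(start_key or "", end_key or ""))
    cmd.extend(columns)
    return cmd


if __name__ == "__main__":
    # Only prints the command; hand the list to subprocess.run() to launch it.
    print(" ".join(build_rowcounter_cmd("my_table", start_key="foo", end_key="foo~")))
```

Building the command as a list rather than a string avoids shell quoting problems when row keys contain unusual characters.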


You can use the count method in HBase to count the number of rows. But yes, counting the rows of a large table can be slow.

    count 'tablename' [interval]

Return value is the number of rows.

This operation may take a LONG time (run '$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount' to run a counting MapReduce job). The current count is shown every 1000 rows by default. The count interval may be optionally specified. Scan caching is enabled on count scans by default. The default cache size is 10 rows. If your rows are small in size, you may want to increase this parameter.

Examples:

    hbase> count 't1'
    hbase> count 't1', INTERVAL => 100000
    hbase> count 't1', CACHE => 1000
    hbase> count 't1', INTERVAL => 10, CACHE => 1000

The same commands can also be run on a table reference. Suppose you had a reference t to table 't1'; the corresponding commands would be:

    hbase> t.count
    hbase> t.count INTERVAL => 100000
    hbase> t.count CACHE => 1000
    hbase> t.count INTERVAL => 10, CACHE => 1000


If you cannot use RowCounter for whatever reason, then a combination of these two filters should be an optimal way to get a count:

FirstKeyOnlyFilter() AND KeyOnlyFilter()

The FirstKeyOnlyFilter makes the scanner return only the first column qualifier it finds in each row, instead of all of the column qualifiers in the table, which minimizes the network bandwidth. What about simply picking one column qualifier to return? That would work if you could guarantee that the column qualifier exists for every row, but if that is not true you would get an inaccurate count.

The KeyOnlyFilter makes the scanner return only the key part of each cell (row key, column family, qualifier, and timestamp) and no value. This further reduces the network bandwidth. In the general case that wouldn't account for much of a reduction, but there can be an edge case where the first column picked by the previous filter just happens to hold an extremely large value.
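The effect of the two filters can be illustrated with a toy in-memory model. This sketch does not use the HBase API at all; `ROWS`, `first_key_only`, `key_only`, and the other names are made up for illustration, and the functions only imitate what the real filters do to the cells of each row.

```python
# Toy model of a table: row key -> list of (family, qualifier, value) cells.
# This is NOT the HBase API, only an illustration of the filter semantics.
ROWS = {
    "foo1": [("cf", "a", b"x" * 1000), ("cf", "b", b"y" * 10)],
    "foo2": [("cf", "b", b"z" * 500)],  # this row has no qualifier "a"
}


def first_key_only(cells):
    """Keep only the first cell of a row (imitates FirstKeyOnlyFilter)."""
    return cells[:1]


def key_only(cells):
    """Keep each cell's key but drop its value (imitates KeyOnlyFilter)."""
    return [(family, qualifier, b"") for family, qualifier, _ in cells]


def apply_filters(rows, *filters):
    """Apply the given filters, in order, to the cells of every row."""
    out = {}
    for key, cells in rows.items():
        for f in filters:
            cells = f(cells)
        out[key] = cells
    return out


def payload_bytes(rows):
    """Total size of the values that would travel back to the client."""
    return sum(len(value) for cells in rows.values() for _, _, value in cells)


filtered = apply_filters(ROWS, first_key_only, key_only)
print(len(filtered), payload_bytes(ROWS), payload_bytes(filtered))  # 2 1510 0

# FirstKeyOnlyFilter alone still ships the large first value of "foo1".
print(payload_bytes(apply_filters(ROWS, first_key_only)))  # 1500

# Selecting only qualifier "a" misses "foo2" and undercounts.
only_a = {k: c for k, c in ROWS.items() if any(q == "a" for _, q, _ in c)}
print(len(only_a))  # 1
```

Both rows are still counted after filtering, yet no value bytes are transferred, while the single-qualifier shortcut returns a count of 1 instead of 2.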

I tried playing around with scan.setCaching but the results were all over the place. Perhaps it could help.

I had 16 million rows between a start and a stop key, on which I did the following pseudo-empirical testing:

With FirstKeyOnlyFilter and KeyOnlyFilter activated:

    With caching not set (i.e., the default value), it took 188 seconds.
    With caching set to 1, it took 188 seconds.
    With caching set to 10, it took 200 seconds.
    With caching set to 100, it took 187 seconds.
    With caching set to 1000, it took 183 seconds.
    With caching set to 10000, it took 199 seconds.
    With caching set to 100000, it took 199 seconds.

With FirstKeyOnlyFilter and KeyOnlyFilter disabled:

    With caching not set (i.e., the default value), it took 309 seconds.
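To put the measurements above in perspective, the short calculation below works out the relative numbers. It uses only the timings listed above; the variable names are my own. It shows the filters saving roughly 39% of the scan time, while the caching setting moves the result by only about 9% between the best and worst run.

```python
# Timings in seconds, copied from the measurements above (16 million rows).
# The key is the caching value; None means caching was left at its default.
with_filters = {None: 188, 1: 188, 10: 200, 100: 187,
                1000: 183, 10000: 199, 100000: 199}
without_filters = 309
total_rows = 16000000

best = min(with_filters.values())
worst = max(with_filters.values())

# Fraction of time saved by the filters at the default caching value.
saving = 1 - with_filters[None] / float(without_filters)
# Relative gap between the slowest and fastest filtered run.
spread = (worst - best) / float(best)

print("time saved by filters: {:.1%}".format(saving))
print("spread across caching values: {:.1%}".format(spread))
print("rows/s with filters: {:.0f}".format(total_rows / float(with_filters[None])))
print("rows/s without filters: {:.0f}".format(total_rows / float(without_filters)))
```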

I didn't bother to do proper testing on this, but it seems clear that the FirstKeyOnlyFilter and KeyOnlyFilter are good.

Moreover, the cells in this particular table are very small - so I think the filters would have been even better on a different table.

Here is a Java code sample:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
    import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCount {
        public static void main(String[] args) throws IOException {
            Configuration config = HBaseConfiguration.create();
            HTable table = new HTable(config, "my_table");

            // Only scan the rows between the start key and the stop key.
            Scan scan = new Scan(
                Bytes.toBytes("foo"), Bytes.toBytes("foo~")
            );

            // Optional first argument: the scanner caching value.
            if (args.length == 1) {
                scan.setCaching(Integer.valueOf(args[0]));
            }
            System.out.println("scan's caching is " + scan.getCaching());

            // A FilterList requires all of its filters to pass by default,
            // so the two filters are combined with AND.
            FilterList allFilters = new FilterList();
            allFilters.addFilter(new FirstKeyOnlyFilter());
            allFilters.addFilter(new KeyOnlyFilter());
            scan.setFilter(allFilters);

            ResultScanner scanner = table.getScanner(scan);
            int count = 0;
            long start = System.currentTimeMillis();
            try {
                for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
                    count += 1;
                    // Print progress every 100000 rows.
                    if (count % 100000 == 0) System.out.println(count);
                }
            } finally {
                scanner.close();
                table.close();
            }
            long end = System.currentTimeMillis();
            long elapsedTime = end - start;

            System.out.println("Total rows: " + count);
            System.out.println("Elapsed time was " + (elapsedTime/1000F));
        }
    }

Here is a pychbase code sample:

    from pychbase import Connection

    c = Connection()
    t = c.table('my_table')
    # Under the hood this applies the FirstKeyOnlyFilter and KeyOnlyFilter
    # similar to the happybase example below
    print(t.count(row_prefix="foo"))

Here is a Happybase code sample:

    from happybase import Connection

    c = Connection(...)
    t = c.table('my_table')
    count = 0
    for _ in t.scan(filter='FirstKeyOnlyFilter() AND KeyOnlyFilter()'):
        count += 1
    print(count)

Thanks to @Tuckr and @KennyCason for the tip.
