Schema description

I have an HBase table called measurements with two column families: measurement and instrument. The measurement family has one column per measured substance (measurement:1, measurement:2, etc., where 1, 2, 3 are substance IDs). The instrument family has matching columns instrument:status_1, instrument:status_2, etc., which indicate whether the instrument was working correctly for that substance at the time of measurement. Row keys are a concatenation of the date, time and ID of the station where the measurement was conducted, for example:

ROW COLUMN+CELL
 2017-01-01 00:00+101 column=measurement:1, timestamp=2025-01-26T12:46:42.021, value=?pbM\xD2\xF1\xA9\xFC
 2017-01-01 00:00+101 column=instrument:status_1, timestamp=2025-01-26T12:46:42.025, value=0
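
To make the layout concrete, here is a minimal plain-Java sketch of how such a row key could be composed and how the sample measurement value could be decoded. It assumes the key is a UTF-8 string of the form "<date> <time>+<stationId>" and that the value was written with Bytes.toBytes(double), i.e. as 8 big-endian IEEE 754 bytes. The class and method names (RowKeySketch, rowKey, decodeDouble) are made up for illustration and are not part of HBase.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RowKeySketch {
    // Builds a row key in the layout shown above: "<date> <time>+<stationId>".
    // Assumption: the key is stored as a UTF-8 encoded string.
    static byte[] rowKey(String date, String time, int stationId) {
        return (date + " " + time + "+" + stationId).getBytes(StandardCharsets.UTF_8);
    }

    // Decodes 8 big-endian bytes into a double, the same layout that
    // Bytes.toBytes(double) produces. ByteBuffer is big-endian by default.
    static double decodeDouble(byte[] raw) {
        if (raw.length != Double.BYTES) {
            throw new IllegalArgumentException("expected 8 bytes, got " + raw.length);
        }
        return ByteBuffer.wrap(raw).getDouble();
    }

    public static void main(String[] args) {
        byte[] key = rowKey("2017-01-01", "00:00", 101);
        System.out.println(new String(key, StandardCharsets.UTF_8));

        // The sample cell value ?pbM\xD2\xF1\xA9\xFC written out as raw bytes
        // ('?' = 0x3F, 'p' = 0x70, 'b' = 0x62, 'M' = 0x4D).
        byte[] cell = {0x3F, 0x70, 0x62, 0x4D,
                (byte) 0xD2, (byte) 0xF1, (byte) 0xA9, (byte) 0xFC};
        System.out.println(decodeDouble(cell));
    }
}
```

Under these assumptions the sample value decodes to about 0.004, which is consistent with the cell holding a double.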

Query

I want to find the average measured value of the substance with ID 1 across all stations for 2018-12-05, taking into account only rows where instrument:status_1 has the value 0, so that erroneous results are excluded from the average. I want the average to be calculated on the server side, i.e. only the resulting value should be delivered to my client instead of thousands of rows.
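
To pin down the semantics I am after, here is a plain-Java sketch of the same computation over in-memory data. It is only a reference for the expected result, not the server-side solution. The Row class, the averageForDay method and the sample values are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

public class AverageSketch {
    // Hypothetical in-memory stand-in for one HBase row (substance 1 only).
    static final class Row {
        final String key;          // e.g. "2018-12-05 00:00+101"
        final double measurement1; // value of measurement:1
        final String status1;      // value of instrument:status_1

        Row(String key, double measurement1, String status1) {
            this.key = key;
            this.measurement1 = measurement1;
            this.status1 = status1;
        }
    }

    // Averages measurement:1 over rows whose key starts with the date
    // and whose instrument:status_1 equals "0".
    static OptionalDouble averageForDay(List<Row> rows, String datePrefix) {
        return rows.stream()
                .filter(r -> r.key.startsWith(datePrefix))
                .filter(r -> "0".equals(r.status1))
                .mapToDouble(r -> r.measurement1)
                .average();
    }

    public static void main(String[] args) {
        // Made-up sample rows, not real measurements.
        List<Row> rows = Arrays.asList(
                new Row("2018-12-05 00:00+101", 2.0, "0"),
                new Row("2018-12-05 00:00+102", 4.0, "0"),
                new Row("2018-12-05 01:00+101", 100.0, "1"), // faulty instrument, skipped
                new Row("2018-12-06 00:00+101", 50.0, "0")); // different day, skipped
        System.out.println(averageForDay(rows, "2018-12-05"));
    }
}
```

With the made-up rows above, only the first two contribute to the average. The result is empty when no row matches, which is a case the server-side version would also have to handle.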

I found that this is likely possible through the AggregationClient class, so I tried the following:

Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("measurement"), Bytes.toBytes("1"));
scan.addColumn(Bytes.toBytes("instrument"), Bytes.toBytes("status_1"));
// restrict the scan to rows whose key starts with the requested date
PrefixFilter dateFilter = new PrefixFilter(Bytes.toBytes("2018-12-05"));
// include only rows where status_1 is "0"
SingleColumnValueFilter statusFilter = new SingleColumnValueFilter(
 Bytes.toBytes("instrument"),
 Bytes.toBytes("status_1"),
 CompareOperator.EQUAL,
 Bytes.toBytes("0")
);
// setFilter replaces any filter set earlier, so both filters go into one FilterList (logical AND)
scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL, dateFilter, statusFilter));
AggregationClient aggregationClient = new AggregationClient(connection.getConfiguration());
// values of the measurement column are stored as bytes that are interpreted as double
ColumnInterpreter<Double, Double, HBaseProtos.EmptyMsg, HBaseProtos.DoubleMsg, HBaseProtos.DoubleMsg> columnInterpreter = new DoubleColumnInterpreter();
double avg = aggregationClient.avg(TableName.valueOf("measurements"), columnInterpreter, scan);
aggregationClient.close();

An IOException is thrown by AggregationClient, however, stating that there can't be more than one column present in the scan when invoking aggregationClient.avg(). I need the average for the column measurement:1, but I also must include instrument in the scan in order to filter rows based on the value of the instrument:status_1 column. The documentation for AggregationClient states this:

Column family can't be null. In case where multiple families are provided, an IOException will be thrown. An optional column qualifier can also be defined.

Is there any workaround or a way to tell AggregationClient what column in the scan it should take the average for?

asked Jan 27, 2025 at 7:45
