Schema description
I have an HBase table called measurements with two column families: measurement and instrument. The measurement family has one column per measured substance (measurement:1, measurement:2, etc., where 1, 2, 3 are substance IDs). The instrument family has matching columns instrument:status_1, instrument:status_2, etc., which record whether the instrument was working correctly for that substance at the time of measurement. Row keys are a concatenation of the date, time and ID of the station where the measurement was taken, for example:
ROW COLUMN+CELL
2017-01-01 00:00+101 column=measurement:1, timestamp=2025-01-26T12:46:42.021, value=?pbM\xD2\xF1\xA9\xFC
2017-01-01 00:00+101 column=instrument:status_1, timestamp=2025-01-26T12:46:42.025, value=0
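To make the layout above concrete, here is a minimal sketch in plain Java (no HBase dependency). It assumes the date part of the key is written as yyyy-MM-dd HH:mm followed by "+" and the station ID, as the sample row suggests, and that measurement values are 8-byte big-endian doubles (which is what Bytes.toBytes(double) would produce). The class and method names are hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class RowKeyExample {
    // Assumed key layout: "<yyyy-MM-dd HH:mm>+<stationId>", inferred from the sample row.
    private static final DateTimeFormatter KEY_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");

    /** Hypothetical helper: builds a row key for a measurement time and station ID. */
    static byte[] rowKey(LocalDateTime time, int stationId) {
        String key = KEY_TIME.format(time) + "+" + stationId;
        return key.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes an 8-byte big-endian IEEE 754 cell value into a double. */
    static double decodeDouble(byte[] cell) {
        return ByteBuffer.wrap(cell).getDouble();
    }

    public static void main(String[] args) {
        byte[] key = rowKey(LocalDateTime.of(2017, 1, 1, 0, 0), 101);
        System.out.println(new String(key, StandardCharsets.UTF_8));

        // Bytes of the sample value shown above: '?' 'p' 'b' 'M' 0xD2 0xF1 0xA9 0xFC
        byte[] cell = {0x3F, 0x70, 0x62, 0x4D,
                (byte) 0xD2, (byte) 0xF1, (byte) 0xA9, (byte) 0xFC};
        System.out.println(decodeDouble(cell));
    }
}
```

Under these assumptions the first line printed is the sample row key, and the sample cell decodes to roughly 0.004. Because the date comes first in the key, all rows for one day share a common key prefix.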
Query
I want to find the average measured value of the substance with ID 1 across all stations for 2018-12-05, counting only rows where instrument:status_1 has the value 0 (so that erroneous readings are excluded from the average). I want the average to be calculated on the server side, i.e. only the resulting value should be delivered to my client instead of thousands of rows.
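To pin down the semantics I am after, here is a client-side reference computation in plain Java over an in-memory list. It is only a sketch of the intended result, not the server-side aggregation I am asking about. The Row class, its field names and the sample values are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

final class FilteredAverage {
    /** Hypothetical in-memory stand-in for one HBase row. */
    static final class Row {
        final String rowKey;       // e.g. "2018-12-05 00:00+101"
        final double measurement1; // decoded value of measurement:1
        final String status1;      // value of instrument:status_1

        Row(String rowKey, double measurement1, String status1) {
            this.rowKey = rowKey;
            this.measurement1 = measurement1;
            this.status1 = status1;
        }
    }

    /** Averages measurement:1 over rows of the given day whose status_1 is "0". */
    static OptionalDouble average(List<Row> rows, String dayPrefix) {
        return rows.stream()
                .filter(r -> r.rowKey.startsWith(dayPrefix)) // same day, any station
                .filter(r -> "0".equals(r.status1))          // instrument was working
                .mapToDouble(r -> r.measurement1)
                .average();                                   // empty if no row matches
    }

    public static void main(String[] args) {
        // Made-up sample rows, not real measurements.
        List<Row> rows = Arrays.asList(
                new Row("2018-12-05 00:00+101", 2.0, "0"),
                new Row("2018-12-05 00:00+102", 4.0, "0"),
                new Row("2018-12-05 01:00+101", 100.0, "1"), // faulty instrument, skipped
                new Row("2018-12-06 00:00+101", 8.0, "0"));  // different day, skipped
        System.out.println(average(rows, "2018-12-05")); // prints OptionalDouble[3.0]
    }
}
```

This is handy for checking a server-side result against a small sample, but it requires pulling every row to the client, which is exactly what I want to avoid on the real table.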
I found that this is likely possible through the AggregationClient class, so I tried the following:
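Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("measurement"), Bytes.toBytes("1"));
scan.addColumn(Bytes.toBytes("instrument"), Bytes.toBytes("status_1"));
// setFilter() replaces any previously set filter, so both conditions go into one FilterList
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
// include only rows whose key starts with the requested date
filters.addFilter(new PrefixFilter(Bytes.toBytes("2018-12-05")));
// include only rows where status_1 is "0"
filters.addFilter(new SingleColumnValueFilter(
    Bytes.toBytes("instrument"),
    Bytes.toBytes("status_1"),
    CompareOperator.EQUAL,
    Bytes.toBytes("0")
));
scan.setFilter(filters);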
AggregationClient aggregationClient = new AggregationClient(connection.getConfiguration());
// values of measurement column are stored as bytes that are interpreted as double
ColumnInterpreter<Double, Double, HBaseProtos.EmptyMsg, HBaseProtos.DoubleMsg, HBaseProtos.DoubleMsg> columnInterpreter = new DoubleColumnInterpreter();
double avg = aggregationClient.avg(TableName.valueOf("measurements"), columnInterpreter, scan);
aggregationClient.close();
However, AggregationClient throws an IOException stating that the scan can't contain more than one column when invoking aggregationClient.avg(). I need the average of the measurement:1 column, but I must also include the instrument family in the scan in order to filter rows by the value of instrument:status_1. The documentation for AggregationClient states:
Column family can't be null. In case where multiple families are provided, an IOException will be thrown. An optional column qualifier can also be defined.
Is there a workaround, or a way to tell AggregationClient which column in the scan it should average?