Data processing with the CERN ROOT framework

Question 1

Problem.

I have a directory of .root files (the proprietary format of CERN's ROOT framework). These files contain so-called TTree's, essentially tables who's cells may contain virtually any c++ object or type. In my case, these files contain tables of integers, which I wish to perform some operations on to produce a single, high precision floating-point value. I'm using Boost's cpp_dec_float_100 type (from the boost::multiprecision library) due to my precision requirements. I then write these to a new .root file.

Step through my code:

Firstly, I read into a vector<string> the lines of a file in_path, which contain the paths to my .root files.

 ifstream stream(in_path);
 string tem;
 vector<string> list;
 while(getline(stream, tem))
 list.push_back(tem);

Then, I construct an RDataFrame on them. The RDataFrame, for my purposes, essentially efficiently combines them into one large table.

ROOT::RDataFrame df("eventTree", list);

Now, a bit about my data. You can imagine the table as looking something like this:

+-----+-----+-----+
| run | A | B |
+-----+-----+-----+
| 001 | 35 | 5 |
| 001 | 40 | 10 |
| 001 | 77 | 60 |
| | | |
| ... | ... | ... |
| | | |
| 002 | 42 | 40 |
| 002 | 30 | 28 |
| 002 | 50 | 1 |
| ... | ... | ... |
+-----+-----+-----+

where the ... dots indicate continuation

For each run, I want to determine the most common difference of A and B, element-wise. This value will factor into my later calculations. So, for example, for run 002 we have 42-40 = 2, 30-28=2, and 50 - 1=49, the most common element/mode of 2,2, and 49 is 2, so the result for run 002 is 2.

A couple of important notes: A-B is guaranteed to be a positive, integer value for all entries, and there is guaranteed to be a single, unique most common difference for each run

What I do is to map each run to another map, which maps difference values to frequency. In the end, I can simply get the key of the largest value (i.e. the most common difference).

I iterate over each row, take the difference of A and B, and increment the value of the corresponding key in the map.

 map<int, map<int, int>> offset;
 df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
 ++offset[run][A- B];
 },{"A","B","run"});

Returning to the previous example, run 002 would be mapped to a map that looks like:

{{2, 2}, {49, 1}}

Next, I create a new column in the RDataFrame using .Define(), which takes the name of the column (eventTime), and a lambda which returns the value for each row in the new column. It is within this lambda that I perform my calculation.

 df.Define("eventTime", [&](UInt_t timeStamp, UInt_t B, Int_t run){
 // Get most common difference for run, add to B
 B+= max_element(offset[run].begin(), offset[run].end())->first;
 return Mult_t(B+ Mult_t(timeStamp) / 1E+8);
 },{"timeStamp","B","run"}).Snapshot("T", out_path, {"eventTime"});

Finally, .Snapshot saves the new column as a new TTree in a new .root file.

Code.

Dependencies: boost::multiprecision library, and the ROOT framework.

#include <boost/multiprecision/cpp_dec_float.hpp>
using Mult_t = boost::multiprecision::cpp_dec_float_100;
void writeTimes(const char* in_path, const char* out_path){
 ifstream stream(in_path);
 string tem;
 
 vector<string> list;
 while(getline(stream, tem))
 list.push_back(tem);
 ROOT::RDataFrame df("eventTree", list);
 map<int, map<int, int>> offset;
 df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
 ++offset[run][A- B];
 },{"A","B","run"});
 df.Define("eventTime", [&](UInt_t timeStamp, UInt_t B, Int_t run){
 B+= max_element(offset[run].begin(), offset[run].end())->first;
 return Mult_t(B+ Mult_t(timeStamp) / 1E+8);
 },{"timeStamp","B","run"}).Snapshot("T", out_path, {"eventTime"});
}

Goals.

Readability: I'm always looking to improve the readability and "flow" of my code, which is often destined for use by others.
Reliability: I find that a second pair of eyes is often useful for pointing out (sometimes obvious) errors and potential bugs in my code. For this code in particular, reliability is vital.
Performance: A secondary goal is performance. I'm not terribly concerned about optimizing what I have now, however if there are some simple/obvious changes that can or should be made to improve performance, I'm certainly interested.

Note: this code is written to be run or compiled with the ROOT interpreter.

Question 2

Let ROOT read the file for you

RdataFrame's constructor can take a filename or a TFile* as an argument, and will read the whole file in for you. This avoids first reading into the temporary vector list and then parsing that list again.

Choice of containers

A std::map is easy to use but not the best choice for performance, due to \$\mathcal{O}(\log N)\$ complexity for insertions and lookups. If the file contains a monotonically increasing number in the run column, you can get the number of runs just by looking at the last row, and then use a std::vector of that size:

auto runs = /* get maximum run number from df */;
std::vector<std::map<int, int>> offset(runs + 1);
df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
 ++offset[run][A - B];
},{"A", "B", "run"});

Similarly, the remaining std::map could changed to a std::vector if you know there is an upper bound to the difference and if the set of differences is dense.

If either of the two maps cannot be converted to a std::vector, then at least replace them with std::unordered_map, as this will have \$\mathcal{O}(1)\$ insertions and lookups.

Too early to convert to `Mult_t`?

I think you can store the event times as an ULong64_t without losing any precision, just replace the last return statement to:

return B * 1'00'000'000ull + timeStamp;

Of course, you would need to do the conversion to seconds (I assume) somewhere else then. But this speeds up the function and saves some space in the output.

Missing error checking

You are not performing any error checking when opening the file and reading lines from it. You should at least check that the whole file was read in by checking that stream.eof() == true after the while-loop. If not, then print an error message and throw an exception or call exit(EXIT_FAILURE), so you don't write out incorrect results without anyone noticing.

If you use ROOT::RDataFrame's constructor to read the file, then this will hopefully be taken care of already.

Is `max_element` good enough?

It almost looks like you want to get the median value, but your approach with max_element() will not give you that. In fact, it might not even be close to the median. Consider this histogram:

1: 2
2: 1
3: 1
4: 1
...
10: 1

Here the number 1 has two occurences, the numbers 2 to 10 only one occurrence. The median is 5, but the maximum element is 1. You need to sort the histogram and find the element such that the sum of occurences before that element is equal to that after (and also handle the case of an even number of events correctly).

Instead of building a histogram, you can just store all the differences A - B in a vector, then sort it, and then just pick the middle element (or average the two elements closest to the middle in case of an even number of elements). See this StackOverflow question for how to do that.

Of course there are more sophisticated methods, like calculating the root mean square, possibly filtering out outliers first.

Reduce the number of responsibilities

Your function is responsible for opening the file, parsing the input, adding a column, and writing the result out. Consider reducing it to a function that just adds the eventTime column:

auto& addEventTimes(ROOT::RDataFrame& df) {
 map<int, map<int, int>> offset;
 df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
 ++offset[run][A - B];
 }, {"A", "B", "run"});
 return df.Define("eventTime", [&](UInt_t timeStamp, UInt_t B, Int_t run){
 B += max_element(offset[run].begin(), offset[run].end())->first;
 return Mult_t(B + Mult_t(timeStamp) / 1E+8);
 },{"timeStamp","B","run"});
}

You can then call this from the original function:

void writeTimes(const char* in_path, const char* out_path) {
 ROOT::RDataFrame df("eventTree", in_path);
 addEventTimes(df).Snapshot("T", out_path, {"eventTime"});
}

About readability

Apart from splitting up the function as mentioned above, I would also choose some better variable names:

tem -> line
df -> eventData
offset -> offsetHistograms

This makes it more clear what they contain. Also add some comments to those parts that are not immediately clear when reading them, like the df.Foreach() and df.Define() calls.

Question 3

Thanks for taking the time! It's always useful to get an outside perspective with respect to readability, and your suggestion to use unordered_map improved performance more than I would've expected. A couple clarifications-- I'm looking for the mode, not the median. Also, I pass a vector of strings to RDataFrame as I wish to merge multiple TTrees from multiple TFiles living in multiple .root files. I could only open a single file by passing a path or TFile*. Thanks again!

Question 4

If it's the mode, great! Although you might still have to consider how to handle multiple values having the same number of occurences. Having a comment in the code saying you want the mode would have removed any doubt :) As for merging TTrees, perhaps TTree::MergeTrees() can be used?

G. Sliepen G. Sliepen 69.1k3 gold badges74 silver badges180 bronze badges · Accepted Answer · 2021-09-27 21:25:35Z

Let ROOT read the file for you

RdataFrame's constructor can take a filename or a TFile* as an argument, and will read the whole file in for you. This avoids first reading into the temporary vector list and then parsing that list again.

Choice of containers

A std::map is easy to use but not the best choice for performance, due to \$\mathcal{O}(\log N)\$ complexity for insertions and lookups. If the file contains a monotonically increasing number in the run column, you can get the number of runs just by looking at the last row, and then use a std::vector of that size:

auto runs = /* get maximum run number from df */;
std::vector<std::map<int, int>> offset(runs + 1);
df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
 ++offset[run][A - B];
},{"A", "B", "run"});

Similarly, the remaining std::map could changed to a std::vector if you know there is an upper bound to the difference and if the set of differences is dense.

If either of the two maps cannot be converted to a std::vector, then at least replace them with std::unordered_map, as this will have \$\mathcal{O}(1)\$ insertions and lookups.

Too early to convert to `Mult_t`?

I think you can store the event times as an ULong64_t without losing any precision, just replace the last return statement to:

return B * 1'00'000'000ull + timeStamp;

Of course, you would need to do the conversion to seconds (I assume) somewhere else then. But this speeds up the function and saves some space in the output.

Missing error checking

You are not performing any error checking when opening the file and reading lines from it. You should at least check that the whole file was read in by checking that stream.eof() == true after the while-loop. If not, then print an error message and throw an exception or call exit(EXIT_FAILURE), so you don't write out incorrect results without anyone noticing.

If you use ROOT::RDataFrame's constructor to read the file, then this will hopefully be taken care of already.

Is `max_element` good enough?

It almost looks like you want to get the median value, but your approach with max_element() will not give you that. In fact, it might not even be close to the median. Consider this histogram:

1: 2
2: 1
3: 1
4: 1
...
10: 1

Here the number 1 has two occurences, the numbers 2 to 10 only one occurrence. The median is 5, but the maximum element is 1. You need to sort the histogram and find the element such that the sum of occurences before that element is equal to that after (and also handle the case of an even number of events correctly).

Instead of building a histogram, you can just store all the differences A - B in a vector, then sort it, and then just pick the middle element (or average the two elements closest to the middle in case of an even number of elements). See this StackOverflow question for how to do that.

Of course there are more sophisticated methods, like calculating the root mean square, possibly filtering out outliers first.

Reduce the number of responsibilities

Your function is responsible for opening the file, parsing the input, adding a column, and writing the result out. Consider reducing it to a function that just adds the eventTime column:

auto& addEventTimes(ROOT::RDataFrame& df) {
 map<int, map<int, int>> offset;
 df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
 ++offset[run][A - B];
 }, {"A", "B", "run"});
 return df.Define("eventTime", [&](UInt_t timeStamp, UInt_t B, Int_t run){
 B += max_element(offset[run].begin(), offset[run].end())->first;
 return Mult_t(B + Mult_t(timeStamp) / 1E+8);
 },{"timeStamp","B","run"});
}

You can then call this from the original function:

void writeTimes(const char* in_path, const char* out_path) {
 ROOT::RDataFrame df("eventTree", in_path);
 addEventTimes(df).Snapshot("T", out_path, {"eventTime"});
}

About readability

Apart from splitting up the function as mentioned above, I would also choose some better variable names:

tem -> line
df -> eventData
offset -> offsetHistograms

This makes it more clear what they contain. Also add some comments to those parts that are not immediately clear when reading them, like the df.Foreach() and df.Define() calls.

Thanks for taking the time! It's always useful to get an outside perspective with respect to readability, and your suggestion to use unordered_map improved performance more than I would've expected. A couple clarifications-- I'm looking for the mode, not the median. Also, I pass a vector of strings to RDataFrame as I wish to merge multiple TTrees from multiple TFiles living in multiple .root files. I could only open a single file by passing a path or TFile*. Thanks again!
If it's the mode, great! Although you might still have to consider how to handle multiple values having the same number of occurences. Having a comment in the code saying you want the mode would have removed any doubt :) As for merging TTrees, perhaps TTree::MergeTrees() can be used?

Stack Exchange Network

Data processing with the CERN ROOT framework

1 Answer 1

Let ROOT read the file for you

Choice of containers

Too early to convert to `Mult_t`?

Missing error checking

Is `max_element` good enough?

Reduce the number of responsibilities

About readability

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Data processing with the CERN ROOT framework

1 Answer 1

Let ROOT read the file for you

Choice of containers

Too early to convert to Mult_t?

Missing error checking

Is max_element good enough?

Reduce the number of responsibilities

About readability

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Too early to convert to `Mult_t`?

Is `max_element` good enough?