Problem.
I have a directory of .root
files (the proprietary format of CERN's ROOT
framework). These files contain so-called TTree
's, essentially tables who's cells may contain virtually any c++
object or type. In my case, these files contain tables of integers, which I wish to perform some operations on to produce a single, high precision floating-point value. I'm using Boost
's cpp_dec_float_100
type (from the boost::multiprecision
library) due to my precision requirements. I then write these to a new .root
file.
Step through my code:
Firstly, I read into a vector<string>
the lines of a file in_path
, which contain the paths to my .root
files.
ifstream stream(in_path);
string tem;
vector<string> list;
while(getline(stream, tem))
list.push_back(tem);
Then, I construct an RDataFrame
on them. The RDataFrame
, for my purposes, essentially efficiently combines them into one large table.
ROOT::RDataFrame df("eventTree", list);
Now, a bit about my data. You can imagine the table as looking something like this:
+-----+-----+-----+
| run | A | B |
+-----+-----+-----+
| 001 | 35 | 5 |
| 001 | 40 | 10 |
| 001 | 77 | 60 |
| | | |
| ... | ... | ... |
| | | |
| 002 | 42 | 40 |
| 002 | 30 | 28 |
| 002 | 50 | 1 |
| ... | ... | ... |
+-----+-----+-----+
where the ... dots indicate continuation
For each run
, I want to determine the most common difference of A
and B
, element-wise. This value will factor into my later calculations. So, for example, for run 002 we have 42-40 = 2
, 30-28=2
, and 50 - 1=49
, the most common element/mode of 2,2, and 49
is 2
, so the result for run 002
is 2
.
A couple of important notes: A-B
is guaranteed to be a positive, integer value for all entries, and there is guaranteed to be a single, unique most common difference for each run
What I do is to map each run
to another map
, which maps difference values to frequency. In the end, I can simply get the key of the largest value (i.e. the most common difference).
I iterate over each row, take the difference of A
and B
, and increment the value of the corresponding key in the map.
map<int, map<int, int>> offset;
df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
++offset[run][A- B];
},{"A","B","run"});
Returning to the previous example, run 002
would be mapped to a map that looks like:
{{2, 2}, {49, 1}}
Next, I create a new column in the RDataFrame
using .Define()
, which takes the name of the column (eventTime
), and a lambda
which returns the value for each row in the new column. It is within this lambda
that I perform my calculation.
df.Define("eventTime", [&](UInt_t timeStamp, UInt_t B, Int_t run){
// Get most common difference for run, add to B
B+= max_element(offset[run].begin(), offset[run].end())->first;
return Mult_t(B+ Mult_t(timeStamp) / 1E+8);
},{"timeStamp","B","run"}).Snapshot("T", out_path, {"eventTime"});
Finally, .Snapshot
saves the new column as a new TTree
in a new .root
file.
Code.
Dependencies: boost::multiprecision
library, and the ROOT
framework.
#include <boost/multiprecision/cpp_dec_float.hpp>
using Mult_t = boost::multiprecision::cpp_dec_float_100;
void writeTimes(const char* in_path, const char* out_path){
ifstream stream(in_path);
string tem;
vector<string> list;
while(getline(stream, tem))
list.push_back(tem);
ROOT::RDataFrame df("eventTree", list);
map<int, map<int, int>> offset;
df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
++offset[run][A- B];
},{"A","B","run"});
df.Define("eventTime", [&](UInt_t timeStamp, UInt_t B, Int_t run){
B+= max_element(offset[run].begin(), offset[run].end())->first;
return Mult_t(B+ Mult_t(timeStamp) / 1E+8);
},{"timeStamp","B","run"}).Snapshot("T", out_path, {"eventTime"});
}
Goals.
- Readability: I'm always looking to improve the readability and "flow" of my code, which is often destined for use by others.
- Reliability: I find that a second pair of eyes is often useful for pointing out (sometimes obvious) errors and potential bugs in my code. For this code in particular, reliability is vital.
- Performance: A secondary goal is performance. I'm not terribly concerned about optimizing what I have now, however if there are some simple/obvious changes that can or should be made to improve performance, I'm certainly interested.
Note: this code is written to be run or compiled with the
ROOT
interpreter.
1 Answer 1
Let ROOT read the file for you
RdataFrame
's constructor can take a filename or a TFile*
as an argument, and will read the whole file in for you. This avoids first reading into the temporary vector list
and then parsing that list again.
Choice of containers
A std::map
is easy to use but not the best choice for performance, due to \$\mathcal{O}(\log N)\$ complexity for insertions and lookups. If the file contains a monotonically increasing number in the run column, you can get the number of runs just by looking at the last row, and then use a std::vector
of that size:
auto runs = /* get maximum run number from df */;
std::vector<std::map<int, int>> offset(runs + 1);
df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
++offset[run][A - B];
},{"A", "B", "run"});
Similarly, the remaining std::map
could changed to a std::vector
if you know there is an upper bound to the difference and if the set of differences is dense.
If either of the two maps cannot be converted to a std::vector
, then at least replace them with std::unordered_map
, as this will have \$\mathcal{O}(1)\$ insertions and lookups.
Too early to convert to Mult_t
?
I think you can store the event times as an ULong64_t
without losing any precision, just replace the last return
statement to:
return B * 1'00'000'000ull + timeStamp;
Of course, you would need to do the conversion to seconds (I assume) somewhere else then. But this speeds up the function and saves some space in the output.
Missing error checking
You are not performing any error checking when opening the file and reading lines from it. You should at least check that the whole file was read in by checking that stream.eof() == true
after the while
-loop. If not, then print an error message and throw
an exception or call exit(EXIT_FAILURE)
, so you don't write out incorrect results without anyone noticing.
If you use ROOT::RDataFrame
's constructor to read the file, then this will hopefully be taken care of already.
Is max_element
good enough?
It almost looks like you want to get the median value, but your approach with max_element()
will not give you that. In fact, it might not even be close to the median. Consider this histogram:
1: 2
2: 1
3: 1
4: 1
...
10: 1
Here the number 1 has two occurences, the numbers 2 to 10 only one occurrence. The median is 5, but the maximum element is 1. You need to sort the histogram and find the element such that the sum of occurences before that element is equal to that after (and also handle the case of an even number of events correctly).
Instead of building a histogram, you can just store all the differences A - B
in a vector, then sort it, and then just pick the middle element (or average the two elements closest to the middle in case of an even number of elements). See this StackOverflow question for how to do that.
Of course there are more sophisticated methods, like calculating the root mean square, possibly filtering out outliers first.
Reduce the number of responsibilities
Your function is responsible for opening the file, parsing the input, adding a column, and writing the result out. Consider reducing it to a function that just adds the eventTime column:
auto& addEventTimes(ROOT::RDataFrame& df) {
map<int, map<int, int>> offset;
df.Foreach([&](ULong64_t A, UInt_t B, Int_t run){
++offset[run][A - B];
}, {"A", "B", "run"});
return df.Define("eventTime", [&](UInt_t timeStamp, UInt_t B, Int_t run){
B += max_element(offset[run].begin(), offset[run].end())->first;
return Mult_t(B + Mult_t(timeStamp) / 1E+8);
},{"timeStamp","B","run"});
}
You can then call this from the original function:
void writeTimes(const char* in_path, const char* out_path) {
ROOT::RDataFrame df("eventTree", in_path);
addEventTimes(df).Snapshot("T", out_path, {"eventTime"});
}
About readability
Apart from splitting up the function as mentioned above, I would also choose some better variable names:
tem
->line
df
->eventData
offset
->offsetHistograms
This makes it more clear what they contain. Also add some comments to those parts that are not immediately clear when reading them, like the df.Foreach()
and df.Define()
calls.
-
\$\begingroup\$ Thanks for taking the time! It's always useful to get an outside perspective with respect to readability, and your suggestion to use
unordered_map
improved performance more than I would've expected. A couple clarifications-- I'm looking for the mode, not the median. Also, I pass a vector of strings toRDataFrame
as I wish to merge multipleTTrees
from multipleTFiles
living in multiple.root
files. I could only open a single file by passing a path orTFile*
. Thanks again! \$\endgroup\$10GeV– 10GeV2021年09月27日 22:59:31 +00:00Commented Sep 27, 2021 at 22:59 -
\$\begingroup\$ If it's the mode, great! Although you might still have to consider how to handle multiple values having the same number of occurences. Having a comment in the code saying you want the mode would have removed any doubt :) As for merging
TTree
s, perhapsTTree::MergeTrees()
can be used? \$\endgroup\$G. Sliepen– G. Sliepen2021年09月28日 06:10:03 +00:00Commented Sep 28, 2021 at 6:10