Performing tasks on for-loop over thousands of items in Rcpp

Question 1

I am trying to adapt an R script that I wrote to perform several tasks in a for-loop. The for-loop run over thousands of items, for about 900 times, therefore I would like to optimise my code at the best. I decided to go for c++ via Rcpp. Here it is my code:

cppFunction('
Rcpp::DataFrame FasterFunction(StringVector path_data, IntegerVector ID_vector, const DataFrame& DF1, 
DataFrame DF2, int time1, int time2, int start_time) {
 Function read_t("fread");
 Function subset("[.data.frame");
 Function mergeDFs("merge");
 
 const int& TimeVal = (time1 + time2 - (2 * start_time));
 
 for(int i = 0; i < ID_vector.size(); ++i) {
 
 int ID = ID_vector[i];
 
 //Define path to file
 StringVector tmp_path_data = path_data;
 tmp_path_data.insert(1, "/prefix_");
 tmp_path_data.insert(2, std::to_string(ID));
 tmp_path_data.insert(3, ".txt");
 Rcpp::String res = collapse(tmp_path_data);
 
 //Read file as DataFrame
 Rcpp::DataFrame data_df = read_t(Named("file")=res);
 
 //Extract column "name1" from DataFrame "data_df"
 Rcpp::NumericVector name1_list = as<NumericVector>(data_df["name1"]);
 
 //Convert int to NumericVector
 NumericVector ID_vec(1);
 ID_vec[0] = static_cast<double>(ID);
 
 //Finding "ID_vec" matching value in "name1_list" vector
 IntegerVector ID_pos = match(ID_vec,name1_list);
 
 //Remove row from "data_df" with "ID" in column "name1"
 data_df = subset(data_df, -ID_pos,R_MissingArg);
 
 //Extract column "name1" from DataFrame "data_df", after removing "ID"
 name1_list = as<NumericVector>(data_df["name1"]);
 
 //Extract column "name1" from DataFrame "DF2"
 Rcpp::NumericVector ID2_list = as<NumericVector>(DF2["name1"]);
 
 //Finding "name1_list" matching value in "ID2_list" vector
 IntegerVector ID2_pos = match(name1_list,ID2_list);
 
 //Remove rows from "DF2" with "ID2_pos" in column "name1"
 Rcpp::DataFrame tmpDF = subset(DF2, ID2_pos,R_MissingArg);
 
 //Merge "data_df" and "DF2" by column "name" "ID2_pos" in column "name1"
 const DataFrame& resDF = mergeDFs(Named("x")=data_df,Named("y")=DF2,Named("by")="name1");
 
 //Add additional columns to the "resDF" dataframe
 resDF["term1"] = abs(as<NumericVector>(resDF["factor1"]) - as<NumericVector>(resDF["factor2"]));
 resDF["term2"] = Rcpp::pow((1 - as<NumericVector>(resDF["term1"])),TimeVal);
 
 //Add values to row "i" and column "term_tot" of the dataframe "DF1"
 Rcpp::NumericVector TermTot_Vector = as<NumericVector>(DF1["term_tot"]);
 TermTot_Vector[i] = sum(as<NumericVector>(resDF["term2"]));
 DF1["term_tot"] = TermTot_Vector;
 
 };
 
 return(DF1);
};
')

As an example to what ID_vector, DF1, DF2 and data_df correspond to, consider the following:

ID_vector = c(1,3,5,9,13,14,22,39)
DF2 = data.frame(name1 = c(1,3,5,9,13,14,22,39), factor2 = rnorm(8,0,2))
DF1 = data.frame(name1 = c(1,3,5,9,13,14,22,39), term_tot = rep(NA,8))
data_df = data.frame(nameA = rep(3,8),name1 = c(1,3,5,9,13,14,22,39), factor1 = rnorm(8,0,1))

Note that I did not provide the path_data argument for reproducibility, since it woudl require the long list of files over which the for-loop goes. However, the structure of the data within the files is given by the data_df dataframe. There, I would be mostly interested in knowing wether I am reading the file in the more efficient way.

Question 2

Please do not edit the question, especially the code, after an answer has been posted. Changing the question may cause answer invalidation. Everyone needs to be able to see what the reviewer was referring to. What to do after the question has been answered.

Question 3

Unlike Stack Overflow we want to see the actual code rather than the minimal reproducible code. Code Review can handle up to 64K characters in the question.

Question 4

Ok, thanks for the heads up. Shall I then edit my question with the full code? (just referring to your other comment).

Question 5

Inconsistent use of `Rcpp::`

You are not using Rcpp:: consistently. Either use it everywhere or don't use it at all.

What does everything mean?

FasterFunction() is a very bad name for a function. The word Function is redundant, and Faster is also not very informative: faster than what? Faster how?

Many variable names are also not helpful at all. What are DF1 and DF2? I already know from the type that they are dataframes, so DF is not adding any more information. And there are two, but what is the difference between them?

Even the column names in the dataframes are very generic. factor1, term1? Reading this code doesn't give me any clue about what is in these dataframes, nor what it is that you are trying to calculate.

Unless you intentionally obfuscated the code when posting it here on Code Review (which you shouldn't anyway), please spend some time giving all the functions, variables and dataframe columns meaningful names. It's not just me that can't understand the code you have written, your colleagues might also have trouble, and even you yourself, after one year, will have forgotten half of what you wrote here, so it's in your own interest to do this.

Some rules of thumb:

Use nouns for variables, verbs for functions. For example, subset is a bad name for a function. make_subset would be better, but how exactly does it decide what subset to make? The comment above its use says it removes rows. Maybe that should be reflected in the name.
Don't put the type of something in its name. Don't append things like _vector, _list, and _df to the names of variables. The type of the variable will already clarify what it is. For variables that hold multiple values, make their name a plural, so for example IDs instead of ID_vec.
Be consistent. I see _vec, _Vector, and _list being used as suffixes for things that are NumericVectors. If you are going to use some suffix, make sure you pick one and use it consistently. Also pick one style of capitalization for variable names, so write term_tot instead of TermTot.
Don't abbreviate unnecessarily. What does pos mean? Position? Positive? Posit? Prefer to write words in full. If you do abbreviate, make sure you use the same abbreviation consistently, and that it is not creating any ambiguities.
Avoid generic and meaningless words. tmp? Sure it is a temporary variable, but what does it hold?

Remove as much as you can out of the loop

Find whatever you can that is done within the loop that you can do outside of it, as that will prevent the same thing from being done multiple times unnecessarily. For example:

tmp_path_data: you can already build and collapse everything up to and including "/prefix_", and only add the ID and .txt inside the loop.
TermTot_Vector can be declared before the loop, and DF1["term_tot"] = TermToT_Vector can be done after the loop.

There might be many more things that can be optimized this way.

Remove dead code

You calculate tmpDF but then you never use it.

Question 6

Ok, thanks for the heads up. I did not post my original script in order to have a somehow reproducible example. I agree with the points you raised. I still wonder whether I could perform some of these tasks more efficiently. I already moved lines that can be run outside of the loop.

Question 7

We can only do a proper review if we see your actual code.

G. Sliepen G. Sliepen 69k3 gold badges74 silver badges180 bronze badges · Accepted Answer · 2024-02-14 11:14:49Z

Inconsistent use of `Rcpp::`

You are not using Rcpp:: consistently. Either use it everywhere or don't use it at all.

What does everything mean?

FasterFunction() is a very bad name for a function. The word Function is redundant, and Faster is also not very informative: faster than what? Faster how?

Many variable names are also not helpful at all. What are DF1 and DF2? I already know from the type that they are dataframes, so DF is not adding any more information. And there are two, but what is the difference between them?

Even the column names in the dataframes are very generic. factor1, term1? Reading this code doesn't give me any clue about what is in these dataframes, nor what it is that you are trying to calculate.

Unless you intentionally obfuscated the code when posting it here on Code Review (which you shouldn't anyway), please spend some time giving all the functions, variables and dataframe columns meaningful names. It's not just me that can't understand the code you have written, your colleagues might also have trouble, and even you yourself, after one year, will have forgotten half of what you wrote here, so it's in your own interest to do this.

Some rules of thumb:

Use nouns for variables, verbs for functions. For example, subset is a bad name for a function. make_subset would be better, but how exactly does it decide what subset to make? The comment above its use says it removes rows. Maybe that should be reflected in the name.
Don't put the type of something in its name. Don't append things like _vector, _list, and _df to the names of variables. The type of the variable will already clarify what it is. For variables that hold multiple values, make their name a plural, so for example IDs instead of ID_vec.
Be consistent. I see _vec, _Vector, and _list being used as suffixes for things that are NumericVectors. If you are going to use some suffix, make sure you pick one and use it consistently. Also pick one style of capitalization for variable names, so write term_tot instead of TermTot.
Don't abbreviate unnecessarily. What does pos mean? Position? Positive? Posit? Prefer to write words in full. If you do abbreviate, make sure you use the same abbreviation consistently, and that it is not creating any ambiguities.
Avoid generic and meaningless words. tmp? Sure it is a temporary variable, but what does it hold?

Remove as much as you can out of the loop

Find whatever you can that is done within the loop that you can do outside of it, as that will prevent the same thing from being done multiple times unnecessarily. For example:

tmp_path_data: you can already build and collapse everything up to and including "/prefix_", and only add the ID and .txt inside the loop.
TermTot_Vector can be declared before the loop, and DF1["term_tot"] = TermToT_Vector can be done after the loop.

There might be many more things that can be optimized this way.

Remove dead code

You calculate tmpDF but then you never use it.

Ok, thanks for the heads up. I did not post my original script in order to have a somehow reproducible example. I agree with the points you raised. I still wonder whether I could perform some of these tasks more efficiently. I already moved lines that can be run outside of the loop.

Stack Exchange Network

Performing tasks on for-loop over thousands of items in Rcpp

1 Answer 1

Inconsistent use of `Rcpp::`

What does everything mean?

Remove as much as you can out of the loop

Remove dead code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Performing tasks on for-loop over thousands of items in Rcpp

1 Answer 1

Inconsistent use of Rcpp::

What does everything mean?

Remove as much as you can out of the loop

Remove dead code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Inconsistent use of `Rcpp::`