I am trying to adapt an R script that I wrote to perform several tasks in a for-loop. The for-loop run over thousands of items, for about 900 times, therefore I would like to optimise my code at the best. I decided to go for c++ via Rcpp. Here it is my code:
cppFunction('
Rcpp::DataFrame FasterFunction(StringVector path_data, IntegerVector ID_vector, const DataFrame& DF1,
DataFrame DF2, int time1, int time2, int start_time) {
Function read_t("fread");
Function subset("[.data.frame");
Function mergeDFs("merge");
const int& TimeVal = (time1 + time2 - (2 * start_time));
for(int i = 0; i < ID_vector.size(); ++i) {
int ID = ID_vector[i];
//Define path to file
StringVector tmp_path_data = path_data;
tmp_path_data.insert(1, "/prefix_");
tmp_path_data.insert(2, std::to_string(ID));
tmp_path_data.insert(3, ".txt");
Rcpp::String res = collapse(tmp_path_data);
//Read file as DataFrame
Rcpp::DataFrame data_df = read_t(Named("file")=res);
//Extract column "name1" from DataFrame "data_df"
Rcpp::NumericVector name1_list = as<NumericVector>(data_df["name1"]);
//Convert int to NumericVector
NumericVector ID_vec(1);
ID_vec[0] = static_cast<double>(ID);
//Finding "ID_vec" matching value in "name1_list" vector
IntegerVector ID_pos = match(ID_vec,name1_list);
//Remove row from "data_df" with "ID" in column "name1"
data_df = subset(data_df, -ID_pos,R_MissingArg);
//Extract column "name1" from DataFrame "data_df", after removing "ID"
name1_list = as<NumericVector>(data_df["name1"]);
//Extract column "name1" from DataFrame "DF2"
Rcpp::NumericVector ID2_list = as<NumericVector>(DF2["name1"]);
//Finding "name1_list" matching value in "ID2_list" vector
IntegerVector ID2_pos = match(name1_list,ID2_list);
//Remove rows from "DF2" with "ID2_pos" in column "name1"
Rcpp::DataFrame tmpDF = subset(DF2, ID2_pos,R_MissingArg);
//Merge "data_df" and "DF2" by column "name" "ID2_pos" in column "name1"
const DataFrame& resDF = mergeDFs(Named("x")=data_df,Named("y")=DF2,Named("by")="name1");
//Add additional columns to the "resDF" dataframe
resDF["term1"] = abs(as<NumericVector>(resDF["factor1"]) - as<NumericVector>(resDF["factor2"]));
resDF["term2"] = Rcpp::pow((1 - as<NumericVector>(resDF["term1"])),TimeVal);
//Add values to row "i" and column "term_tot" of the dataframe "DF1"
Rcpp::NumericVector TermTot_Vector = as<NumericVector>(DF1["term_tot"]);
TermTot_Vector[i] = sum(as<NumericVector>(resDF["term2"]));
DF1["term_tot"] = TermTot_Vector;
};
return(DF1);
};
')
As an example to what ID_vector
, DF1
, DF2
and data_df
correspond to, consider the following:
ID_vector = c(1,3,5,9,13,14,22,39)
DF2 = data.frame(name1 = c(1,3,5,9,13,14,22,39), factor2 = rnorm(8,0,2))
DF1 = data.frame(name1 = c(1,3,5,9,13,14,22,39), term_tot = rep(NA,8))
data_df = data.frame(nameA = rep(3,8),name1 = c(1,3,5,9,13,14,22,39), factor1 = rnorm(8,0,1))
Note that I did not provide the path_data argument for reproducibility, since it woudl require the long list of files over which the for-loop goes. However, the structure of the data within the files is given by the data_df
dataframe. There, I would be mostly interested in knowing wether I am reading the file in the more efficient way.
-
\$\begingroup\$ Please do not edit the question, especially the code, after an answer has been posted. Changing the question may cause answer invalidation. Everyone needs to be able to see what the reviewer was referring to. What to do after the question has been answered. \$\endgroup\$pacmaninbw– pacmaninbw ♦2024年02月15日 13:53:22 +00:00Commented Feb 15, 2024 at 13:53
-
\$\begingroup\$ Unlike Stack Overflow we want to see the actual code rather than the minimal reproducible code. Code Review can handle up to 64K characters in the question. \$\endgroup\$pacmaninbw– pacmaninbw ♦2024年02月15日 13:55:53 +00:00Commented Feb 15, 2024 at 13:55
-
\$\begingroup\$ Ok, thanks for the heads up. Shall I then edit my question with the full code? (just referring to your other comment). \$\endgroup\$CafféSospeso– CafféSospeso2024年02月15日 17:29:47 +00:00Commented Feb 15, 2024 at 17:29
1 Answer 1
Inconsistent use of Rcpp::
You are not using Rcpp::
consistently. Either use it everywhere or don't use it at all.
What does everything mean?
FasterFunction()
is a very bad name for a function. The word Function
is redundant, and Faster
is also not very informative: faster than what? Faster how?
Many variable names are also not helpful at all. What are DF1
and DF2
? I already know from the type that they are dataframes, so DF
is not adding any more information. And there are two, but what is the difference between them?
Even the column names in the dataframes are very generic. factor1
, term1
? Reading this code doesn't give me any clue about what is in these dataframes, nor what it is that you are trying to calculate.
Unless you intentionally obfuscated the code when posting it here on Code Review (which you shouldn't anyway), please spend some time giving all the functions, variables and dataframe columns meaningful names. It's not just me that can't understand the code you have written, your colleagues might also have trouble, and even you yourself, after one year, will have forgotten half of what you wrote here, so it's in your own interest to do this.
Some rules of thumb:
- Use nouns for variables, verbs for functions. For example,
subset
is a bad name for a function.make_subset
would be better, but how exactly does it decide what subset to make? The comment above its use says it removes rows. Maybe that should be reflected in the name. - Don't put the type of something in its name. Don't append things like
_vector
,_list
, and_df
to the names of variables. The type of the variable will already clarify what it is. For variables that hold multiple values, make their name a plural, so for exampleIDs
instead ofID_vec
. - Be consistent. I see
_vec
,_Vector
, and_list
being used as suffixes for things that areNumericVector
s. If you are going to use some suffix, make sure you pick one and use it consistently. Also pick one style of capitalization for variable names, so writeterm_tot
instead ofTermTot
. - Don't abbreviate unnecessarily. What does
pos
mean? Position? Positive? Posit? Prefer to write words in full. If you do abbreviate, make sure you use the same abbreviation consistently, and that it is not creating any ambiguities. - Avoid generic and meaningless words.
tmp
? Sure it is a temporary variable, but what does it hold?
Remove as much as you can out of the loop
Find whatever you can that is done within the loop that you can do outside of it, as that will prevent the same thing from being done multiple times unnecessarily. For example:
tmp_path_data
: you can already build and collapse everything up to and including"/prefix_"
, and only add the ID and.txt
inside the loop.TermTot_Vector
can be declared before the loop, andDF1["term_tot"] = TermToT_Vector
can be done after the loop.
There might be many more things that can be optimized this way.
Remove dead code
You calculate tmpDF
but then you never use it.
-
\$\begingroup\$ Ok, thanks for the heads up. I did not post my original script in order to have a somehow reproducible example. I agree with the points you raised. I still wonder whether I could perform some of these tasks more efficiently. I already moved lines that can be run outside of the loop. \$\endgroup\$CafféSospeso– CafféSospeso2024年02月14日 22:03:24 +00:00Commented Feb 14, 2024 at 22:03
-
1\$\begingroup\$ We can only do a proper review if we see your actual code. \$\endgroup\$G. Sliepen– G. Sliepen2024年02月15日 07:27:04 +00:00Commented Feb 15, 2024 at 7:27