Problem
The main idea is to build some fine-scale weather data using inverse distance weighting (IDW) from the 5 closest stations to the observed station. If the value in the `newdat` data set is greater than 134, less than -80, equal to -9999, or `NA`, then run the function to get an interpolated value. If 2 or more stations have an observation, use IDW; if only 1 station has one, use that station's value; if no close stations have one, use the mean for that month of the year.
I'm trying to pass some information to a function I've created, have it run through some subsetting and calculations, then return the value to the point in the data frame where the conditions were met. It's quite a hack job, but I've got it working using a `for` loop; however, it takes a really long time on the full data set.
The idea is to create a function `nearest()` that accepts the 5 weather stations, stacks them, computes the inverse distance weight, and then returns a value to the original data frame.
Is there a better way to do this? I'd like a more idiomatic, efficient R approach, hopefully with `data.table` or `dplyr` since they are so well optimized. I think this could also be done with one of the `apply` functions, but I don't know how to code it.
About the data
In the original data set there are 7 data sets: 1 center station, called `station` in this example, and the 5 closest stations (`s1`-`s5`). Each of the 5 stations has around 20-30k observations, and the primary station `newdat` will have 18,627 -- one for each day between 1900 and 1950. This isn't so bad, but I have over 600 stations for this computation and each one takes around 15 minutes (15 minutes per station × 600 stations = 6.25 days). The next round will have over 12,000 stations, so this is just a warm-up and I should be able to reuse the code.
In this example, I have pared each of the local stations (`s1`-`s5`) down to just one subset of the data. The center station (`station`) has also been subset down to a single observation. The main data set `newdat` has been reduced to just 10 observations, including an `NA` value and a value over 200. The goal is to replace both of those values using the function `nearest()`.
The flow of the data looks like this:

- Load the individual station (`newdat`, which lists all dates) and the 5 closest stations (`s1`-`s5`).
- For that station, check the `value` variable for each day against the conditions described above (outlier, `NA`, or -9999).
- When a condition is met, grab the element, year, month, and day of that row from `newdat`.
- Subset `s1`-`s5` on those criteria (this will leave 1 or 0 obs for each of `s1`-`s5`).
- Stack `s1`-`s5` and use IDW if 2 or more obs exist, the single station if only 1 obs exists, or the mean of the month and year if no obs exist.
- If using IDW, the lat/long for `station` and `s1`-`s5` are needed.
- Plug this value into `newdat` where the `value` was found to be an outlier, `NA`, or -9999.
- Move on to the next day.
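For reference, the inverse distance weighting used in the stacking step can be sketched in a few lines of base R. This is a simplified Shepard-style IDW with power p = 2 (the actual code uses `phylin::idw`; `idw_sketch` is just an illustrative name):

```r
# Minimal inverse-distance-weighting sketch (Shepard's method, power p = 2).
# vals: station values; sx, sy: station coordinates; px, py: target point.
idw_sketch <- function(vals, sx, sy, px, py, p = 2) {
  d <- sqrt((sx - px)^2 + (sy - py)^2)   # Euclidean distance to each station
  if (any(d == 0)) return(vals[d == 0][1])  # exact hit: use that station
  w <- 1 / d^p                            # inverse-distance weights
  sum(w * vals) / sum(w)                  # weighted average
}

# Two stations at distances 1 and 3 from the target point:
idw_sketch(vals = c(10, 20), sx = c(1, 3), sy = c(0, 0), px = 0, py = 0)
# -> 11: the closer station (value 10) dominates the weighted average
```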
`dput` output (all of these are needed):
newdat <- structure(list(id = c("USC00031632", "USC00031632", "USC00031632",
"USC00031632", "USC00031632", "USC00031632", "USC00031632", "USC00031632",
"USC00031632", "USC00031632"), element = structure(c(1L, 2L,
1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("TMAX", "TMIN"), class = "factor"),
year = c("1900", "1900", "1900", "1900", "1900", "1900",
"1900", "1900", "1900", "1900"), month = c("01", "01", "02",
"02", "03", "04", "04", "05", "05", "01"), day = c("01",
"01", "01", "01", "01", "01", "01", "01", "01", "02"), date = structure(c(-25567,
-25567, -25536, -25536, -25508, -25477, -25477, -25447, -25447,
-25566), class = "Date"), value = c(30.02, 28, 37.94, 10.94,
200, 25, 41, 82.04, 51.08, NA)), .Names = c("id", "element",
"year", "month", "day", "date", "value"), row.names = c(NA, 10L
), class = "data.frame")
s1 <- structure(list(id = "USC00031152", element = "TMAX", year = 1900,
month = 1, day = 2, date = structure(-25566, class = "Date"),
value = 37.04, y = 33.59, x = -92.8236), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id",
"element", "year", "month", "day", "date", "value", "y", "x"))
s2 <- structure(list(id = "USC00034638", element = "TMAX", year = 1900,
month = 1, day = 2, date = structure(-25566, class = "Date"),
value = 62.06, y = 34.7392, x = -90.7664), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id",
"element", "year", "month", "day", "date", "value", "y", "x"))
s3 <- structure(list(id = "USC00036352", element = "TMAX", year = 1900,
month = 1, day = 2, date = structure(-25566, class = "Date"),
value = 30.92, y = 35.2833, x = -93.1), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id",
"element", "year", "month", "day", "date", "value", "y", "x"))
s4 <- structure(list(id = "USC00036248", element = "TMAX", year = 1900,
month = 1, day = 2, date = structure(-25566, class = "Date"),
value = 30.02, y = 36.3667, x = -94.1), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id",
"element", "year", "month", "day", "date", "value", "y", "x"))
s5 <- structure(list(id = "USC00035112", element = "TMAX", year = 1900,
month = 1, day = 2, date = structure(-25566, class = "Date"),
value = -9999, y = 33.9294, x = -93.8583), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id",
"element", "year", "month", "day", "date", "value", "y", "x"))
station <- structure(list(id = "USC00030528", lat = 35.45, long = -92.4), class = "data.frame", row.names = c(NA,
-1L), .Names = c("id", "lat", "long"))
Function:
nearest <- function(id, yr, mnt, dy, ele, s1=s1, s2=s2, s3=s3, s4=s4, s5=s5, out=output, station=station){
  s1_d <- filter(s1, year == yr & month == mnt & day == dy & element == ele)
  s2_d <- filter(s2, year == yr & month == mnt & day == dy & element == ele)
  s3_d <- filter(s3, year == yr & month == mnt & day == dy & element == ele)
  s4_d <- filter(s4, year == yr & month == mnt & day == dy & element == ele)
  s5_d <- filter(s5, year == yr & month == mnt & day == dy & element == ele)
  s1_d <- select(s1_d, id, x, y, value)
  s2_d <- select(s2_d, id, x, y, value)
  s3_d <- select(s3_d, id, x, y, value)
  s4_d <- select(s4_d, id, x, y, value)
  s5_d <- select(s5_d, id, x, y, value)
  stack <- do.call("rbind", list(s1_d, s2_d, s3_d, s4_d, s5_d))
  stack <- select(stack, id, x, y, value)
  stack <- filter(stack, value != -9999)
  if (dim(stack)[1] >= 1){
    ifelse(dim(stack)[1] == 1, v <- stack$value, v <- idw(stack$value, stack[,2:4], station[,2:3]))
  } else {
    ret <- filter(out, id == id & year == yr, month == mnt, element == ele, value != -9999)
    v <- mean(ret$value)
  }
  return(v)
}
Code:
library(dplyr)
library(phylin)
library(lubridate)
for(j in which(is.na(newdat$value) | newdat$value > 134 | newdat$value < -80 | newdat$value == -9999)) {
  newdat[j, 7] <- nearest(id = j, yr = as.numeric(newdat[j, 3]), mnt = as.numeric(newdat[j, 4]), dy = as.numeric(newdat[j, 5]),
                          ele = as.character(newdat[j, 2]), s1 = s1, s2 = s2, s3 = s3, s4 = s4, s5 = s5, out = newdat, station = station)
}
1 Answer
To go further than my comment:
You're filtering and selecting separately for each station, and so repeating yourself a lot in your `nearest` function; this can be simplified into a single subset call, avoiding a lot of repeated work.
This is how I would write your function:
nearest2 <- function(id, yr, mnt, dy, ele, s1=s1, s2=s2, s3=s3, s4=s4, s5=s5, out=output, station=station){
  stations <- rbind(s1, s2, s3, s4, s5)
  stack <- with(stations, stations[year==yr & month==mnt & day==dy & element==ele & value!=-9999, c("id","x","y","value")])
  if (dim(stack)[1] >= 1){
    ifelse(dim(stack)[1] == 1, v <- stack$value, v <- idw(stack$value, stack[,2:4], station[,2:3]))
  } else {
    ret <- filter(out, id == id & year == yr, month == mnt, element == ele, value != -9999)
    v <- mean(ret$value)
  }
  return(v)
}
As you can see, the idea is not really different. I stuck to base R, but feel free to prefer a `filter %>% select` approach.
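For completeness, that same one-call subset might look like this in dplyr (a sketch using the question's column names; `filter_stations` is a hypothetical helper name, not part of the original code):

```r
library(dplyr)

# dplyr version of the single subset inside nearest2(): stack the
# candidate stations, then filter and select in one pipeline.
filter_stations <- function(stations, yr, mnt, dy, ele) {
  stations %>%
    filter(year == yr, month == mnt, day == dy,
           element == ele, value != -9999) %>%
    select(id, x, y, value)
}

# With the question's sample data this would be called as:
# stack <- filter_stations(bind_rows(s1, s2, s3, s4, s5), 1900, 1, 2, "TMAX")
```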
To test the results, I created 2 functions calling your original code and mine:
original <- function() {
  for(j in which(is.na(newdat$value) | newdat$value > 134 | newdat$value < -80 | newdat$value == -9999)) {
    nearest(id = j, yr = as.numeric(newdat[j, 3]), mnt = as.numeric(newdat[j, 4]), dy = as.numeric(newdat[j, 5]),
            ele = as.character(newdat[j, 2]), s1 = s1, s2 = s2, s3 = s3, s4 = s4, s5 = s5, out = newdat, station = station)
  }
}

refactored <- function() {
  for(j in which(is.na(newdat$value) | newdat$value > 134 | newdat$value < -80 | newdat$value == -9999)) {
    nearest2(id = j, yr = as.numeric(newdat[j, 3]), mnt = as.numeric(newdat[j, 4]), dy = as.numeric(newdat[j, 5]),
             ele = as.character(newdat[j, 2]), s1 = s1, s2 = s2, s3 = s3, s4 = s4, s5 = s5, out = newdat, station = station)
  }
}
And then I benchmarked them to see whether it improves things:
> microbenchmark(original(),refactored(),times=10L)
Unit: milliseconds
expr min lq mean median uq max neval
original() 10.324189 10.612698 12.101344 11.366187 12.554720 17.544906 10
refactored() 1.774676 1.831978 2.452274 2.527737 2.769064 3.222069 10
I think an average 6× gain is an interesting improvement, but this has to be confirmed on your real data.
Concerning your main `for` loop, it is not actually that bad: it works on specific elements and does not grow a vector inside the loop (which would imply memory copies). Still, moving to `sapply` can bring a small gain on large vectors.
Example with your sample data:
refactored2 <- function() {
  sapply(
    which(is.na(newdat$value) | newdat$value > 134 | newdat$value < -80 | newdat$value == -9999),
    function(j) {
      nearest2(id = j, yr = as.numeric(newdat[j, 3]), mnt = as.numeric(newdat[j, 4]), dy = as.numeric(newdat[j, 5]),
               ele = as.character(newdat[j, 2]), s1 = s1, s2 = s2, s3 = s3, s4 = s4, s5 = s5, out = newdat, station = station)
    })
}
Benchmark result: the overhead of `sapply` is visible here on a small vector (only 2 values, actually); this should shift in favor of `sapply` with larger input:
> microbenchmark(original(),refactored(),refactored2(),times=10L)
Unit: milliseconds
expr min lq mean median uq max neval
original() 9.587806 11.258994 11.862497 11.979840 12.065223 13.659437 10
refactored() 1.840816 2.071738 2.892308 2.554820 3.360622 5.995409 10
refactored2() 1.853931 2.664436 2.967314 3.144811 3.234328 3.925097 10
I assumed you don't wish to change the column types of your original `newdat`; if that's acceptable, convert the year, month, and day columns to numeric once before processing, to avoid the repeated `as.numeric` calls.
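For example (a small sketch, using a stand-in data frame for the relevant character columns of `newdat`):

```r
# Stand-in for the relevant character columns of the question's newdat:
newdat <- data.frame(year = c("1900", "1900"), month = c("01", "02"),
                     day  = c("01", "01"), stringsAsFactors = FALSE)

# Convert the date-part columns to numeric once, up front,
# instead of calling as.numeric() on every loop iteration:
newdat <- transform(newdat,
                    year  = as.numeric(year),
                    month = as.numeric(month),
                    day   = as.numeric(day))
```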
Lastly, you're testing year, month, and day separately for equality; getting rid of those and comparing the date field directly could be another improvement (or use the POSIXct class for the date field and work with date functions if you need time ranges).
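A quick sketch of that single-comparison idea, on toy data (the sample stations already carry a `Date`-class `date` column):

```r
# Compare the Date column directly instead of testing year, month,
# and day separately for equality:
stations <- data.frame(date  = as.Date(c("1900-01-01", "1900-01-02")),
                       value = c(30.02, 37.04))
target <- as.Date("1900-01-02")

# One vectorized comparison replaces three:
hit <- stations[stations$date == target & stations$value != -9999, ]
```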
Hope this helps
---

- (+1) Thank you for this! The benchmark greatly improves the performance of this. I've improved the code further outside of this and have now included `sapply` as you suggest. However, the `bind` then `sort` does make a lot more sense, so I'll try that on the dataset. Ironically, I've been running this code for 7 days now on the original data and should finish today. I'll report back with improvements on the original. Thanks! – Amstell, Dec 9, 2015 at 15:49
- […] the `nearest` function. From what I can understand, they are not sorted data. If you can sort the data before calling the `nearest` function, you could apply a binary search or something similar.