Problem
The main idea is to build some fine-scale weather data using inverse distance weighting (IDW) from the 5 closest stations to the observed station. If the value in the `newdat` data set is greater than 134, less than -80, equal to -9999, or `NA`, then run the function to get an interpolated value. If 2 or more stations have an observation, use IDW; if only 1 station has one, use that station's value; if no close stations have one, use the mean for that month of the year.
I'm trying to pass some information to a function I've created, have it run through some subsetting and calculations, then return the value to the point in the data frame where the conditions were met. It's quite a hack job, but I've got it working using a `for` loop; however, it takes a really long time on the full data set.
The idea is to create a function `nearest()` that accepts the 5 weather stations, stacks them, computes the inverse distance weight, and then returns a value to the original data frame.
Is there a better way to do this? I'd like a more idiomatic, efficient R approach, hopefully with `data.table` or `dplyr` since they are so well optimized. I think this could also be done with one of the `apply` functions, but I don't know how to code it.
About the data
In the original data set there are 7 data sets: 1 center station, called `station` in this example, and the 5 closest stations (`s1`-`s5`). Each of the 5 stations has around 20-30k observations, and the primary station `newdat` will have 18,627 -- one for each day between 1900 and 1950. This isn't so bad, but I have over 600 stations for this computation and each one takes around 15 minutes (15 minutes per station × 600 stations = 6.25 days). The next round will have over 12,000 stations, so this is just a warm-up and I should be able to reuse the code.
In this example, I have pared each of the local stations (`s1`-`s5`) down to just one subset of the data. The center station (`station`) has also been subset down to a single observation. The main data set `newdat` has been reduced to just 10 observations, including an `NA` value and a value over 200. The goal is to replace both of those values using the function `nearest()`.
The flow of the data looks like this:

- Load the individual station (`newdat`, which lists all dates) and the 5 closest stations (`s1`-`s5`).
- For that station, check the `value` variable for each day against the conditions described above (outlier, `NA`, or -9999).
- When a condition is met, grab the element, year, month, and day of that row from `newdat`.
- Subset `s1`-`s5` on those criteria (this will leave 1 or 0 obs for each of `s1`-`s5`).
- Stack `s1`-`s5` and use IDW if 2 or more obs exist, the single station if only 1 obs exists, or the mean of the month and year if no obs exist.
- If using IDW, the lat/long for `station` and `s1`-`s5` are needed.
- Plug this value into `newdat` where the `value` was found to be an outlier, `NA`, or -9999.
- Move on to the next day.
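For reference, the inverse distance weighting used in the stacking step can be sketched in a few lines of base R. This is a simplified Shepard-style IDW with power p = 2 (the actual code uses `phylin::idw`; `idw_sketch` is just an illustrative name):

```r
# Minimal inverse-distance-weighting sketch (Shepard's method, power p = 2).
# vals: station values; sx, sy: station coordinates; px, py: target point.
idw_sketch <- function(vals, sx, sy, px, py, p = 2) {
  d <- sqrt((sx - px)^2 + (sy - py)^2)   # Euclidean distance to each station
  if (any(d == 0)) return(vals[d == 0][1])  # exact hit: use that station
  w <- 1 / d^p                            # inverse-distance weights
  sum(w * vals) / sum(w)                  # weighted average
}

# Two stations at distances 1 and 3 from the target point:
idw_sketch(vals = c(10, 20), sx = c(1, 3), sy = c(0, 0), px = 0, py = 0)
# -> 11: the closer station (value 10) dominates the weighted average
```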
`dput` output (all of these are needed):
newdat <- structure(list(id = c("USC00031632", "USC00031632", "USC00031632",
"USC00031632", "USC00031632", "USC00031632", "USC00031632", "USC00031632",
"USC00031632", "USC00031632"), element = structure(c(1L, 2L,
1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("TMAX", "TMIN"), class = "factor"),
year = c("1900", "1900", "1900", "1900", "1900", "1900",
"1900", "1900", "1900", "1900"), month = c("01", "01", "02",
"02", "03", "04", "04", "05", "05", "01"), day = c("01",
"01", "01", "01", "01", "01", "01", "01", "01", "02"), date = structure(c(-25567,
-25567, -25536, -25536, -25508, -25477, -25477, -25447, -25447,
-25566), class = "Date"), value = c(30.02, 28, 37.94, 10.94,
200, 25, 41, 82.04, 51.08, NA)), .Names = c("id", "element",
"year", "month", "day", "date", "value"), row.names = c(NA, 10L
), class = "data.frame")
s1 <- structure(list(id = "USC00031152", element = "TMAX", year = 1900,
month = 1, day = 2, date = structure(-25566, class = "Date"),
value = 37.04, y = 33.59, x = -92.8236), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id",
"element", "year", "month", "day", "date", "value", "y", "x"))
s2 <- structure(list(id = "USC00034638", element = "TMAX", year = 1900,
month = 1, day = 2, date = structure(-25566, class = "Date"),
value = 62.06, y = 34.7392, x = -90.7664), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id",
"element", "year", "month", "day", "date", "value", "y", "x"))
s3 <- structure(list(id = "USC00036352", element = "TMAX", year = 1900,
month = 1, day = 2, date = structure(-25566, class = "Date"),
value = 30.92, y = 35.2833, x = -93.1), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id",
"element", "year", "month", "day", "date", "value", "y", "x"))
s4 <- structure(list(id = "USC00036248", element = "TMAX", year = 1900,
month = 1, day = 2, date = structure(-25566, class = "Date"),
value = 30.02, y = 36.3667, x = -94.1), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id",
"element", "year", "month", "day", "date", "value", "y", "x"))
s5 <- structure(list(id = "USC00035112", element = "TMAX", year = 1900,
month = 1, day = 2, date = structure(-25566, class = "Date"),
value = -9999, y = 33.9294, x = -93.8583), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id",
"element", "year", "month", "day", "date", "value", "y", "x"))
station <- structure(list(id = "USC00030528", lat = 35.45, long = -92.4), class = "data.frame", row.names = c(NA,
-1L), .Names = c("id", "lat", "long"))
Function:
nearest <- function(id, yr, mnt, dy, ele, s1=s1, s2=s2, s3=s3, s4=s4, s5=s5, out=output, station=station){
  s1_d <- filter(s1, year == yr & month == mnt & day == dy & element == ele)
  s2_d <- filter(s2, year == yr & month == mnt & day == dy & element == ele)
  s3_d <- filter(s3, year == yr & month == mnt & day == dy & element == ele)
  s4_d <- filter(s4, year == yr & month == mnt & day == dy & element == ele)
  s5_d <- filter(s5, year == yr & month == mnt & day == dy & element == ele)
  s1_d <- select(s1_d, id, x, y, value)
  s2_d <- select(s2_d, id, x, y, value)
  s3_d <- select(s3_d, id, x, y, value)
  s4_d <- select(s4_d, id, x, y, value)
  s5_d <- select(s5_d, id, x, y, value)
  stack <- do.call("rbind", list(s1_d, s2_d, s3_d, s4_d, s5_d))
  stack <- select(stack, id, x, y, value)
  stack <- filter(stack, value != -9999)
  if (dim(stack)[1] >= 1){
    ifelse(dim(stack)[1] == 1, v <- stack$value, v <- idw(stack$value, stack[,2:4], station[,2:3]))
  } else {
    ret <- filter(out, id == id & year == yr, month == mnt, element == ele, value != -9999)
    v <- mean(ret$value)
  }
  return(v)
}
Code:
library(dplyr)
library(phylin)
library(lubridate)
for(j in which(is.na(newdat$value) | newdat$value > 134 | newdat$value < -80 | newdat$value == -9999)) {
  newdat[j, 7] <- nearest(id = j, yr = as.numeric(newdat[j, 3]), mnt = as.numeric(newdat[j, 4]), dy = as.numeric(newdat[j, 5]),
                          ele = as.character(newdat[j, 2]), s1 = s1, s2 = s2, s3 = s3, s4 = s4, s5 = s5, out = newdat, station = station)
}
1 Answer
To go further than my comment:
You're filtering and selecting separately for each station, and so repeating yourself a lot in your `nearest` function; this can be simplified into a single subset call, avoiding a lot of repeated work.
This is how I would write your function:
nearest2 <- function(id, yr, mnt, dy, ele, s1=s1, s2=s2, s3=s3, s4=s4, s5=s5, out=output, station=station){
  stations <- rbind(s1, s2, s3, s4, s5)
  stack <- with(stations, stations[year==yr & month==mnt & day==dy & element==ele & value!=-9999, c("id","x","y","value")])
  if (dim(stack)[1] >= 1){
    ifelse(dim(stack)[1] == 1, v <- stack$value, v <- idw(stack$value, stack[,2:4], station[,2:3]))
  } else {
    ret <- filter(out, id == id & year == yr, month == mnt, element == ele, value != -9999)
    v <- mean(ret$value)
  }
  return(v)
}
As you can see, the idea is not really different. I stuck to base R, but feel free to prefer a `filter %>% select` approach.
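For completeness, that same one-call subset might look like this in dplyr (a sketch using the question's column names; `filter_stations` is a hypothetical helper name, not part of the original code):

```r
library(dplyr)

# dplyr version of the single subset inside nearest2(): stack the
# candidate stations, then filter and select in one pipeline.
filter_stations <- function(stations, yr, mnt, dy, ele) {
  stations %>%
    filter(year == yr, month == mnt, day == dy,
           element == ele, value != -9999) %>%
    select(id, x, y, value)
}

# With the question's sample data this would be called as:
# stack <- filter_stations(bind_rows(s1, s2, s3, s4, s5), 1900, 1, 2, "TMAX")
```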
To test the results, I created 2 functions calling your original code and mine:
original <- function() {
  for(j in which(is.na(newdat$value) | newdat$value > 134 | newdat$value < -80 | newdat$value == -9999)) {
    nearest(id = j, yr = as.numeric(newdat[j, 3]), mnt = as.numeric(newdat[j, 4]), dy = as.numeric(newdat[j, 5]),
            ele = as.character(newdat[j, 2]), s1 = s1, s2 = s2, s3 = s3, s4 = s4, s5 = s5, out = newdat, station = station)
  }
}

refactored <- function() {
  for(j in which(is.na(newdat$value) | newdat$value > 134 | newdat$value < -80 | newdat$value == -9999)) {
    nearest2(id = j, yr = as.numeric(newdat[j, 3]), mnt = as.numeric(newdat[j, 4]), dy = as.numeric(newdat[j, 5]),
             ele = as.character(newdat[j, 2]), s1 = s1, s2 = s2, s3 = s3, s4 = s4, s5 = s5, out = newdat, station = station)
  }
}
And then I benchmarked them to see whether it improves things:
> microbenchmark(original(),refactored(),times=10L)
Unit: milliseconds
expr min lq mean median uq max neval
original() 10.324189 10.612698 12.101344 11.366187 12.554720 17.544906 10
refactored() 1.774676 1.831978 2.452274 2.527737 2.769064 3.222069 10
I think an average 6× gain is an interesting improvement, but this has to be confirmed on your real data.
Concerning your main `for` loop, it is not actually that bad: it works on specific elements and does not grow a vector inside the loop (which would imply memory copies). Still, moving to `sapply` can bring a small gain on large vectors.
Example with your sample data:
refactored2 <- function() {
  sapply(
    which(is.na(newdat$value) | newdat$value > 134 | newdat$value < -80 | newdat$value == -9999),
    function(j) {
      nearest2(id = j, yr = as.numeric(newdat[j, 3]), mnt = as.numeric(newdat[j, 4]), dy = as.numeric(newdat[j, 5]),
               ele = as.character(newdat[j, 2]), s1 = s1, s2 = s2, s3 = s3, s4 = s4, s5 = s5, out = newdat, station = station)
    })
}
Benchmark result: the overhead of `sapply` is visible here on a small vector (only 2 values, actually); this should shift in favor of `sapply` with larger input:
> microbenchmark(original(),refactored(),refactored2(),times=10L)
Unit: milliseconds
expr min lq mean median uq max neval
original() 9.587806 11.258994 11.862497 11.979840 12.065223 13.659437 10
refactored() 1.840816 2.071738 2.892308 2.554820 3.360622 5.995409 10
refactored2() 1.853931 2.664436 2.967314 3.144811 3.234328 3.925097 10
I assumed you don't wish to change the column types of your original `newdat`; if that's acceptable, convert the year, month, and day columns to numeric once before processing, to avoid the repeated `as.numeric` calls.
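For example (a small sketch, using a stand-in data frame for the relevant character columns of `newdat`):

```r
# Stand-in for the relevant character columns of the question's newdat:
newdat <- data.frame(year = c("1900", "1900"), month = c("01", "02"),
                     day  = c("01", "01"), stringsAsFactors = FALSE)

# Convert the date-part columns to numeric once, up front,
# instead of calling as.numeric() on every loop iteration:
newdat <- transform(newdat,
                    year  = as.numeric(year),
                    month = as.numeric(month),
                    day   = as.numeric(day))
```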
Lastly, you're testing year, month, and day separately for equality; getting rid of those and comparing the date field directly could be another improvement (or use the POSIXct class for the date field and work with date functions if you need time ranges).
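A quick sketch of that single-comparison idea, on toy data (the sample stations already carry a `Date`-class `date` column):

```r
# Compare the Date column directly instead of testing year, month,
# and day separately for equality:
stations <- data.frame(date  = as.Date(c("1900-01-01", "1900-01-02")),
                       value = c(30.02, 37.04))
target <- as.Date("1900-01-02")

# One vectorized comparison replaces three:
hit <- stations[stations$date == target & stations$value != -9999, ]
```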
Hope this helps
---

- (+1) Thank you for this! The benchmark greatly improves the performance of this. I've improved the code further outside of this and have now included `sapply` as you suggest. However, the `bind` then `sort` does make a lot more sense, so I'll try that on the dataset. Ironically, I've been running this code for 7 days now on the original data and should finish today. I'll report back with improvements on the original. Thanks! – Amstell, Dec 9, 2015 at 15:49
- […] the `nearest` function. From what I can understand, they are not sorted data. If you can sort the data before calling the `nearest` function, you could apply a binary search or something similar.