New Approach
My previous question may have had a little too much going on, and I realized I could simplify the problem by constructing the data differently, thanks to @Gentian Kasa. Previously, the code was constantly filtering the data, which caused a big bottleneck in processing time. I have now constructed the data so that both the main station and the local stations have the same number of days; instead of filtering, the code now simply steps through the data.frames.
Problem
There is 1 main station (df) and 3 local stations (s) stacked in a single data.frame, with values for three days. The idea is to take each day from the main station, find the relative anomaly of the three local stations, and smooth it using inverse distance weighting (IDW) from the phylin package. The result is then applied to the value in the main station by multiplication.
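For reference, the inverse distance weighting estimate being computed here (Shepard's method; p = 2 is phylin's default, as far as I can tell) is

$$\hat{r} = \frac{\sum_i w_i \, r_i}{\sum_i w_i}, \qquad w_i = \frac{1}{d_i^{\,p}},$$

where r_i is the relative anomaly at local station i and d_i is its distance to the main station.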
This code is working fine, and it is certainly an improvement over the previous version, but I would like to see if there is a better/faster way using an optimized package/method (e.g. data.table, dplyr, apply). I still don't know how to approach this problem without the cumbersome for loop.
The original data set has around 19,000 days, with 3 different variables, for 20,000 stations, totaling roughly 1.14 billion observations. You can imagine how long this might take -- prior estimates were around 14 days, although I have not benchmarked this updated code.
Data
Main Station: df
id lat long year month day value
1 12345 100 50 1900 1 1 54.87800
2 12345 100 50 1900 1 2 106.96603
3 12345 100 50 1900 1 3 98.31988
Local Stations: s
id lat long year month day value
1 USC00031152 33.5900 -92.8236 1900 1 1 63.31576
2 USC00034638 34.7392 -90.7664 1900 1 1 86.04906
3 USC00036352 35.2833 -93.1000 1900 1 1 76.50639
4 USC00031152 33.5900 -92.8236 1900 1 2 71.37608
5 USC00034638 34.7392 -90.7664 1900 1 2 89.91196
6 USC00036352 35.2833 -93.1000 1900 1 2 76.35352
7 USC00031152 33.5900 -92.8236 1900 1 3 53.72596
8 USC00034638 34.7392 -90.7664 1900 1 3 61.79896
9 USC00036352 35.2833 -93.1000 1900 1 3 85.89112
dput
s <- structure(list(id = c("USC00031152", "USC00034638", "USC00036352",
"USC00031152", "USC00034638", "USC00036352", "USC00031152", "USC00034638",
"USC00036352"), lat = c(33.59, 34.7392, 35.2833, 33.59, 34.7392,
35.2833, 33.59, 34.7392, 35.2833), long = c(-92.8236, -90.7664,
-93.1, -92.8236, -90.7664, -93.1, -92.8236, -90.7664, -93.1),
year = c(1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900,
1900), month = c(1, 1, 1, 1, 1, 1, 1, 1, 1), day = c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), value = c(63.3157576809045,
86.0490598902219, 76.506386949066, 71.3760752788486, 89.9119576975542,
76.3535163951321, 53.7259645981243, 61.7989638892985, 85.8911224149051
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-9L), .Names = c("id", "lat", "long", "year", "month", "day",
"value"))
df <- structure(list(id = c(12345, 12345, 12345), lat = c(100, 100,
100), long = c(50, 50, 50), year = c(1900, 1900, 1900), month = c(1,
1, 1), day = 1:3, value = c(54.8780020601509, 106.966029162171,
98.3198828955801)), row.names = c(NA, -3L), class = "data.frame", .Names = c("id",
"lat", "long", "year", "month", "day", "value"))
Code
library(phylin)

nearest <- function(i, loc){
  # Stack 3 local stations
  stack <- s[loc:(loc + 2), ]
  # Get 1 main station
  station <- df[i, ]
  # Check for NA and build relative anomaly (r)
  stack <- stack[!is.na(stack$value), ]
  stack$r <- stack$value / station$value
  # Use IDW and return v; if only one station remains, return its anomaly directly
  v <- as.numeric(ifelse(nrow(stack) == 1,
                         stack$r,
                         idw(stack$r, stack[, c(2, 3, 8)], station[, 2:3])))
  return(v)
}
ncdc <- 1
for (i in 1:nrow(df)){
  # Get relative anomaly from function
  r <- nearest(i, ncdc)
  # Get value from main station and apply anomaly
  p <- df[i, 7]
  df[i, 7] <- p * r
  # Iterate to next 3 local stations
  ncdc <- ncdc + 3
}
Output
id lat long year month day value
1 12345 100 50 1900 1 1 75.40086
2 12345 100 50 1900 1 2 79.31592
3 12345 100 50 1900 1 3 67.12082
1 Answer
The code is quite inefficient for two reasons:
a) the idw function is poorly written. I see at least three for loops (one inside real.dist and two apply calls inside idw) that could have been replaced with faster alternatives.
b) but mostly, your algorithm duplicates many operations. For instance, in your example, the distance between your main station and each of the three local stations is recomputed nrow(df) times. This also shows in the fact that your data, in its current form, contains many id/lat/long/year/month/day duplicates.
I would suggest you re-arrange your data as follows:
A matrix of coordinates for the main stations:
m.coord <- structure(c(100, 105, 50, 55),
.Dim = c(2L, 2L),
.Dimnames = list(c("12345", "12346"),
c("lat", "long")))
A matrix of coordinates for the local stations:
s.coord <- structure(c(33.59, 34.7392, 35.2833,
-92.8236, -90.7664, -93.1),
.Dim = c(3L, 2L),
.Dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
c("lat", "long")))
A matrix for the values at the local stations, where one dimension corresponds to the stations, the other to the time:
s.values <- structure(
c(63.3157576809045, 86.0490598902219, 76.506386949066,
71.3760752788486, 89.9119576975542, 76.3535163951321,
53.7259645981243, 61.7989638892985, 85.8911224149051),
.Dim = c(3L, 3L),
.Dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
c("1900-01-01", "1900-01-02", "1900-01-03")))
The first step is to compute the distances between the main stations and the local stations. I use fields::rdist because it is fast (compiled Fortran), at least a lot faster than the for loop inside idw.
library(fields)
d.mat <- rdist(m.coord, s.coord)
The output is a matrix where each row contains the distances between one main station and all the local stations.
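For the example matrices above this is a 2-by-3 matrix (2 main stations in rows, 3 local stations in columns), which you can sanity-check with:

dim(d.mat)
# [1] 2 3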
Next, we go from a distance matrix to a weight matrix:
w.mat <- 1 / d.mat ^ 2
In the event that one or more of the distances is zero, the corresponding weight becomes infinite, which is undesirable. To handle that situation, we set the weights to 0 or 1 for all rows that contain infinite values, so that a main station located exactly on a local station simply takes that station's value:
is.inf <- is.infinite(w.mat)
has.infinite <- rowSums(is.inf) > 0
w.mat[has.infinite, ] <- as.numeric(is.inf[has.infinite, ])
If you only want to work with the 3 closest stations, you can write a function that will only keep the three highest weights on each row and turn all other weights to zero:
keep_n <- function(x, n = 3) ifelse(rank(x) > length(x) - n, x, 0)
and run it on each row of the weight matrix:
w.mat <- t(apply(w.mat, 1, keep_n))
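As a quick illustration on made-up numbers, keep_n zeroes everything except the three largest entries:

keep_n(c(0.1, 5, 2, 3))
# [1] 0 5 2 3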
Then, you want to scale your weights so that they sum to 1 on each row. Easy:
w.mat <- w.mat / rowSums(w.mat)
Now that you have your weights, computing the weighted averages for all main stations and all days is done in one simple matrix multiplication:
new_val <- w.mat %*% s.values
If s.values contains NAs, then it is a little more difficult: missing values must be skipped in the weighted sum, and the weights must be renormalized over the stations that actually have data on each day:
z <- is.na(s.values)
new_val <- (w.mat %*% ifelse(z, 0, s.values)) / (w.mat %*% !z)
I think you will know how to take it from here. Note that if your data is so large that you cannot handle all stations at the same time (i.e. can't even compute d.mat), you could loop over the main stations in batches, e.g. 1000 main stations at a time.
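A minimal sketch of that batching idea, assuming the m.coord, s.coord and s.values matrices from above and no zero distances or NAs (the same fixes as above would otherwise apply within each batch):

library(fields)

batch.size <- 1000
n.main <- nrow(m.coord)
new_val <- matrix(NA_real_, n.main, ncol(s.values),
                  dimnames = list(rownames(m.coord), colnames(s.values)))

for (start in seq(1, n.main, by = batch.size)) {
  idx <- start:min(start + batch.size - 1, n.main)
  # Distances and weights for this batch of main stations only
  d.batch <- rdist(m.coord[idx, , drop = FALSE], s.coord)
  w.batch <- 1 / d.batch ^ 2
  w.batch <- w.batch / rowSums(w.batch)
  new_val[idx, ] <- w.batch %*% s.values
}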
One remark: your earlier solution and the one I suggest here both use basic Euclidean distance on lat/long coordinates. This will probably be fine if your data is restricted to a small region far from where the longitude wraps around. Otherwise, you might want to look into a more appropriate function for computing the distance matrix.
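For instance, fields also provides rdist.earth for great-circle distances. Note that it expects (longitude, latitude) columns, i.e. the reverse of the matrices above:

# Great-circle distances in km instead of Euclidean distances on lat/long
d.mat <- rdist.earth(m.coord[, c("long", "lat")],
                     s.coord[, c("long", "lat")],
                     miles = FALSE)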
- Thanks for your very thorough answer. I like the computation using rdist better than idw. I have not had a chance to actually run the code and check the results. I'm a bit confused about the output, though. Would you mind showing the finished code using the weights and applying the relative anomaly so I can see how that works? Thanks for a great answer. – Amstell, Nov 27, 2015
- I had a chance to run the code and see that the output is providing the adjusted value. You've sparked an idea to find the Euclidean distance first and create a lookup table to run through. The data is currently structured so that I load each individual main station and the closest local stations, so I'll adjust. It might take a bit longer, but I need the stations split up. Thanks for your answer. – Amstell, Nov 27, 2015
- This is a great answer and exactly what I am trying to do with idw, but I did not look at the function. I appreciate your time and thanks for your help. – Amstell, Nov 28, 2015
- Hit up my other question with your answer and I'll award the bounty (100 pts) to you since no one has responded. – Amstell, Nov 29, 2015
- […] of local stations? Are the two sets M and N distinct, identical, or do they overlap?
- […] N stations. You pick one of them as the main station and consider all the others as local (or maybe only a subset: the 3 closest, or those within a specified distance of the main station). But the important point is that you will end up repeating the process for each of the N stations. So my point is that it would be useful to compute an N-by-N matrix of distances to start with. Do I understand correctly?
- […] M main stations and N local stations. Each M station has 3 N local stations associated with it that are the closest by proximity. Then IDW is computed to smooth out the relative anomalies for each day of the year. After this M station has all of its relative anomalies applied to the value, the process moves to the next M station. Is this more clear?