New Approach
My previous question may have had a little too much going on, and I realized I could simplify the problem by constructing the data differently, thanks to @Gentian Kasa. Previously, the code was constantly filtering the data, which caused a big bottleneck in processing time. I have now constructed the data so that both the main station and the local stations have the same number of days; instead of filtering, the code now simply steps through the data.frames.
Problem
There is 1 main station (df) and 3 local stations (s) stacked in a single data.frame, with values for three days. The idea is to take each day from the main station, find the relative anomaly of the three local stations, and smooth it using inverse distance weighting (IDW) from the phylin package. The result is then applied to the value in the main station by multiplication.
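For reference, the inverse distance weighting estimate being computed here (Shepard's method; p = 2 is phylin's default, as far as I can tell) is

$$\hat{r} = \frac{\sum_i w_i \, r_i}{\sum_i w_i}, \qquad w_i = \frac{1}{d_i^{\,p}},$$

where r_i is the relative anomaly at local station i and d_i is its distance to the main station.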
This code is working fine, and it is certainly an improvement over the previous version, but I would like to see if there is a better/faster way using an optimized package/method (e.g. data.table, dplyr, apply). I still don't know how to approach this problem without the cumbersome for loop.
The original data set has around 19,000 days, with 3 different variables, for 20,000 stations, totaling roughly 1.14 billion observations. You can imagine how long this might take -- prior estimates were around 14 days, although I have not benchmarked this updated code.
Data
Main Station: df
id lat long year month day value
1 12345 100 50 1900 1 1 54.87800
2 12345 100 50 1900 1 2 106.96603
3 12345 100 50 1900 1 3 98.31988
Local Stations: s
id lat long year month day value
1 USC00031152 33.5900 -92.8236 1900 1 1 63.31576
2 USC00034638 34.7392 -90.7664 1900 1 1 86.04906
3 USC00036352 35.2833 -93.1000 1900 1 1 76.50639
4 USC00031152 33.5900 -92.8236 1900 1 2 71.37608
5 USC00034638 34.7392 -90.7664 1900 1 2 89.91196
6 USC00036352 35.2833 -93.1000 1900 1 2 76.35352
7 USC00031152 33.5900 -92.8236 1900 1 3 53.72596
8 USC00034638 34.7392 -90.7664 1900 1 3 61.79896
9 USC00036352 35.2833 -93.1000 1900 1 3 85.89112
dput
s <- structure(list(id = c("USC00031152", "USC00034638", "USC00036352",
"USC00031152", "USC00034638", "USC00036352", "USC00031152", "USC00034638",
"USC00036352"), lat = c(33.59, 34.7392, 35.2833, 33.59, 34.7392,
35.2833, 33.59, 34.7392, 35.2833), long = c(-92.8236, -90.7664,
-93.1, -92.8236, -90.7664, -93.1, -92.8236, -90.7664, -93.1),
year = c(1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900,
1900), month = c(1, 1, 1, 1, 1, 1, 1, 1, 1), day = c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), value = c(63.3157576809045,
86.0490598902219, 76.506386949066, 71.3760752788486, 89.9119576975542,
76.3535163951321, 53.7259645981243, 61.7989638892985, 85.8911224149051
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-9L), .Names = c("id", "lat", "long", "year", "month", "day",
"value"))
df <- structure(list(id = c(12345, 12345, 12345), lat = c(100, 100,
100), long = c(50, 50, 50), year = c(1900, 1900, 1900), month = c(1,
1, 1), day = 1:3, value = c(54.8780020601509, 106.966029162171,
98.3198828955801)), row.names = c(NA, -3L), class = "data.frame", .Names = c("id",
"lat", "long", "year", "month", "day", "value"))
Code
library(phylin)

nearest <- function(i, loc){
  # Stack 3 local stations
  stack <- s[loc:(loc + 2), ]
  # Get 1 main station
  station <- df[i, ]
  # Check for NA and build relative anomaly (r)
  stack <- stack[!is.na(stack$value), ]
  stack$r <- stack$value / station$value
  # Use IDW and return v; if only one station remains, return its anomaly directly
  v <- as.numeric(ifelse(nrow(stack) == 1,
                         stack$r,
                         idw(stack$r, stack[, c(2, 3, 8)], station[, 2:3])))
  return(v)
}
ncdc <- 1
for (i in 1:nrow(df)){
  # Get relative anomaly from function
  r <- nearest(i, ncdc)
  # Get value from main station and apply anomaly
  p <- df[i, 7]
  df[i, 7] <- p * r
  # Iterate to next 3 local stations
  ncdc <- ncdc + 3
}
Output
id lat long year month day value
1 12345 100 50 1900 1 1 75.40086
2 12345 100 50 1900 1 2 79.31592
3 12345 100 50 1900 1 3 67.12082
1 Answer
The code is quite inefficient for two reasons:
a) the idw function is poorly written. I see at least three for loops (one inside real.dist and two apply calls inside idw) that could have been replaced with faster alternatives.
b) but mostly, your algorithm duplicates many operations. For instance, in your example, the distance between your main station and each of the three local stations is recomputed nrow(df) times. This also shows in the fact that your data, in its current form, contains many id/lat/long/year/month/day duplicates.
I would suggest you re-arrange your data as follows:
A matrix of coordinates for the main stations:
m.coord <- structure(c(100, 105, 50, 55),
.Dim = c(2L, 2L),
.Dimnames = list(c("12345", "12346"),
c("lat", "long")))
A matrix of coordinates for the local stations:
s.coord <- structure(c(33.59, 34.7392, 35.2833,
-92.8236, -90.7664, -93.1),
.Dim = c(3L, 2L),
.Dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
c("lat", "long")))
A matrix for the values at the local stations, where one dimension corresponds to the stations, the other to the time:
s.values <- structure(
c(63.3157576809045, 86.0490598902219, 76.506386949066,
71.3760752788486, 89.9119576975542, 76.3535163951321,
53.7259645981243, 61.7989638892985, 85.8911224149051),
.Dim = c(3L, 3L),
.Dimnames = list(c("USC00031152", "USC00034638", "USC00036352"),
c("1900-01-01", "1900-01-02", "1900-01-03")))
The first step is to compute the distances between the main stations and the local stations. I use fields::rdist because it is fast (compiled Fortran), at least a lot faster than the for loop inside idw.
library(fields)
d.mat <- rdist(m.coord, s.coord)
The output is a matrix where each row contains the distances between one main station and all the local stations.
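For the example matrices above this is a 2-by-3 matrix (2 main stations in rows, 3 local stations in columns), which you can sanity-check with:

dim(d.mat)
# [1] 2 3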
Next, we go from a distance matrix to a weight matrix:
w.mat <- 1 / d.mat ^ 2
In the event that one or more of the distances is zero, the corresponding weight becomes infinite, which is undesirable. To handle that situation, we set the weights to 0 or 1 for all rows that contain infinite values, so that a main station located exactly on a local station simply takes that station's value:
is.inf <- is.infinite(w.mat)
has.infinite <- rowSums(is.inf) > 0
w.mat[has.infinite, ] <- as.numeric(is.inf[has.infinite, ])
If you only want to work with the 3 closest stations, you can write a function that will only keep the three highest weights on each row and turn all other weights to zero:
keep_n <- function(x, n = 3) ifelse(rank(x) > length(x) - n, x, 0)
and run it on each row of the weight matrix:
w.mat <- t(apply(w.mat, 1, keep_n))
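As a quick illustration on made-up numbers, keep_n zeroes everything except the three largest entries:

keep_n(c(0.1, 5, 2, 3))
# [1] 0 5 2 3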
Then, you want to scale your weights so that they sum to 1 on each row. Easy:
w.mat <- w.mat / rowSums(w.mat)
Now that you have your weights, computing the weighted averages for all main stations and all days is done in one simple matrix multiplication:
new_val <- w.mat %*% s.values
If s.values contains NAs, then it is a little more difficult: missing values must be skipped in the weighted sum, and the weights must be renormalized over the stations that actually have data on each day:
z <- is.na(s.values)
new_val <- (w.mat %*% ifelse(z, 0, s.values)) / (w.mat %*% !z)
I think you will know how to take it from here. Note that if your data is so large that you cannot handle all stations at the same time (i.e. can't even compute d.mat), you could loop over the main stations in batches, e.g. 1000 main stations at a time.
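A minimal sketch of that batching idea, assuming the m.coord, s.coord and s.values matrices from above and no zero distances or NAs (the same fixes as above would otherwise apply within each batch):

library(fields)

batch.size <- 1000
n.main <- nrow(m.coord)
new_val <- matrix(NA_real_, n.main, ncol(s.values),
                  dimnames = list(rownames(m.coord), colnames(s.values)))

for (start in seq(1, n.main, by = batch.size)) {
  idx <- start:min(start + batch.size - 1, n.main)
  # Distances and weights for this batch of main stations only
  d.batch <- rdist(m.coord[idx, , drop = FALSE], s.coord)
  w.batch <- 1 / d.batch ^ 2
  w.batch <- w.batch / rowSums(w.batch)
  new_val[idx, ] <- w.batch %*% s.values
}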
One remark: your earlier solution and the one I suggest here both use basic Euclidean distance on lat/long coordinates. This will probably be fine if your data is restricted to a small region far from where the longitude wraps around. Otherwise, you might want to look into a more appropriate function for computing the distance matrix.
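For instance, fields also provides rdist.earth for great-circle distances. Note that it expects (longitude, latitude) columns, i.e. the reverse of the matrices above:

# Great-circle distances in km instead of Euclidean distances on lat/long
d.mat <- rdist.earth(m.coord[, c("long", "lat")],
                     s.coord[, c("long", "lat")],
                     miles = FALSE)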
- Thanks for your very thorough answer. I like the computation using rdist better than idw. I have not had a chance to actually run the code and check the results. I'm a bit confused about the output, though. Would you mind showing the finished code using the weights and applying the relative anomaly so I can see how that works? Thanks for a great answer. – Amstell, Nov 27, 2015
- I had a chance to run the code and see that the output is providing the adjusted value. You've sparked an idea to find the Euclidean distance first and create a lookup table to run through. The data is currently structured so that I load each individual main station and the closest local stations, so I'll adjust. It might take a bit longer, but I need the stations split up. Thanks for your answer. – Amstell, Nov 27, 2015
- This is a great answer and exactly what I am trying to do with idw, but I did not look at the function. I appreciate your time and thanks for your help. – Amstell, Nov 28, 2015
- Hit up my other question with your answer and I'll award the bounty (100 pts) to you since no one has responded. – Amstell, Nov 29, 2015
- […] of local stations? Are the two sets M and N distinct, identical, or do they overlap?
- […] N stations. You pick one of them as the main station and consider all the others as local (or maybe only a subset: the 3 closest, or those within a specified distance of the main station). But the important point is that you will end up repeating the process for each of the N stations. So my point is that it would be useful to compute an N-by-N matrix of distances to start with. Do I understand correctly?
- […] M main stations and N local stations. Each M station has 3 N local stations associated with it that are the closest by proximity. Then IDW is computed to smooth out the relative anomalies for each day of the year. After this M station has all of its relative anomalies applied to the value, the process moves to the next M station. Is this more clear?