We have been investigating the background refresh logic and have realized that the refresh of version lists and version queries is not optimally paced. From the very beginning, the general approach has been (1) to refresh cached items ahead of their expiration, so that requests to proxy.golang.org can be answered immediately instead of refreshing during the request, and (2) not to refresh cached items that have not been requested recently, so that we don't waste bandwidth on items no one is using. The overall background refresh traffic depends on the constants used in those constraints. The current constants are:
- Unlicensed module versions (mod/@v/v1.0.0.zip): expires after 30 days; refresh after 25 days, only if requested in the past 25 days.
- Version queries (mod/@v/main.info): expires after 1 hour; refresh after 25 minutes (aiming to never serve data >30 minutes out-of-date), only if requested in the past 1 day.
- Version lists (mod/@v/list): expires after 3 hours; refresh after 25 minutes (aiming to never serve data >30 minutes out-of-date), only if requested in the past 3 days.
These were all added at different times years ago, by engineers who have since moved on to other jobs, so we don't have the full rationales for every constant. I do remember that, back then, I and others were pressuring the module mirror team to refresh the version lists and version queries as aggressively as possible, because a common complaint was that fetches like go get module@main or go get module@latest would not immediately serve newly published commits. So I am happy to take responsibility for the unfortunate implications below.
Let's look at the worst case read amplification for each of these cached item types, compared to what would happen if proxy.golang.org did not cache anything at all:
- Unlicensed module versions: the least frequent request rate that sustains continuous refreshes is one request made every 25 days. Each will be served from cache but then also immediately trigger a refresh upstream. We can view the refresh as a "delayed" read resulting from the request, so the refresh traffic is never more than uncached traffic would be. The worst case read amplification is 1X, which is optimal.
- Version queries: the least frequent request rate that sustains continuous refreshes is one request made every day. Each will be served from cache but then will justify refreshing the query result every 25 minutes for the next 24 hours. The refresh traffic would be 24*60/25 = 57.6 upstream refreshes for each proxy request. The worst case read amplification is 57.6X.
- Version lists: the least frequent request rate that sustains continuous refreshes is one request made every 3 days. Each will be served from cache but then will justify refreshing the list every 25 minutes for the next 3 days. The refresh traffic would be 3*24*60/25 = 172.8 upstream refreshes for each proxy request. The worst case read amplification is 172.8X.
It is very important to note that these are worst case, not expected case. Most modules are accessed frequently or not at all. The frequently accessed modules have read amplifications far below 1X: the Go module mirror is handling far more traffic than it sends upstream. Modules that are never accessed generate no reads at all. It is also important to note that the version query and version list refreshes are capped at an absolute rate of once per 25 minutes. The vast majority of modules are small, making a download every 25 minutes not too onerous for upstream. And the vast majority of modules are Git repositories, which only need to handle a lightweight Git handshake every 25 minutes, not a full download.
All that said, not all modules are small, not all modules are Git, and perhaps some modules are requested only about once per day. These (relatively uncommon) cases would see read amplifications above 1X, meaning the Go module mirror sends more traffic upstream than it would if it did no background refreshing at all. The fix for this is to pace the refreshes by using the same time interval for the refresh rate and the definition of "recent request". That would mean:
- Unlicensed module versions (mod/@v/v1.0.0.zip): refresh after 25 days, only if requested in the past 25 days.
- Version queries (mod/@v/main.info): refresh after 25 minutes, only if requested in the past 25 minutes.
- Version lists (mod/@v/list): refresh after 25 minutes, only if requested in the past 25 minutes.
Doing this would have no impact on the frequently accessed modules (they'll still see upstream refreshes every 25 minutes) nor on the never-accessed modules (they'll still see no requests), but it would cut the worst case read amplification to 1X for all cached items, so that the module mirror never sends more traffic upstream due to background refreshes than it would if it only refreshed during active HTTP requests.
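The effect of pacing can be checked with a toy simulation. The sketch below is a simplified model, not the mirror's implementation: time advances in one-minute ticks, `Policy` and `Simulate` are hypothetical names, a request that finds an expired or missing item fetches upstream synchronously, and a request exactly `RecentWindow` minutes old is assumed to no longer count as recent. It compares the version list constants before and after the change over 30 days.

```go
package main

import "fmt"

// Policy is a hypothetical model of one cached item class; times are in minutes.
type Policy struct {
	ExpireAfter  int // cached copy cannot be served past this age
	RefreshAfter int // background refresh becomes due at this age
	RecentWindow int // item counts as recently requested within this window
}

// Simulate issues one client request every requestEvery minutes for horizon
// minutes and returns the number of client requests and upstream fetches.
func Simulate(p Policy, requestEvery, horizon int) (requests, upstream int) {
	cached := false
	fetched, lastRequested := 0, 0
	for t := 0; t < horizon; t++ {
		if t%requestEvery == 0 {
			requests++
			lastRequested = t
			// Cache miss or expired entry: fetch upstream during the request.
			if !cached || t-fetched > p.ExpireAfter {
				upstream++
				fetched, cached = t, true
			}
		}
		// Background refresh: old enough and requested recently enough.
		if cached && t-fetched >= p.RefreshAfter && t-lastRequested < p.RecentWindow {
			upstream++
			fetched = t
		}
	}
	return requests, upstream
}

func main() {
	const day = 24 * 60
	current := Policy{ExpireAfter: 180, RefreshAfter: 25, RecentWindow: 3 * day}
	paced := Policy{ExpireAfter: 180, RefreshAfter: 25, RecentWindow: 25}

	for _, every := range []int{3 * day, 10} {
		for _, c := range []struct {
			name string
			p    Policy
		}{{"current", current}, {"paced", paced}} {
			r, u := Simulate(c.p, every, 30*day)
			fmt.Printf("request every %4d min, %-7s: %4d requests, %4d upstream, %.1fX\n",
				every, c.name, r, u, float64(u)/float64(r))
		}
	}
}
```

In this model, one request every 3 days produces 1728 upstream fetches for 10 requests under the current constants and 10 upstream fetches under the paced constants, while one request every 10 minutes produces 1728 upstream fetches either way.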
This issue tracks making that change.