I'm a newbie with nginx and proxy servers.
We have a problem with 429 errors from googlesource, caused by too many requests, and because of limited bandwidth it takes us a long time to fetch from googlesource.
We considered setting up an AOSP Git (android.googlesource.com) mirror in our server farm, but we don't know exactly which repositories are needed (we maintain the servers; we are not developers), and we have too little storage for a full AOSP Git mirror.
To reduce requests to googlesource and serve the AOSP Git source faster, we want to set up a "proxy cache" server.
Here's what we want:
- Developers' machines should go through the proxy server whenever they fetch Google AOSP Git code (git clone or repo sync against android.googlesource.com -> our proxy server); an example client-side setting follows this list.
- When the proxy server detects a request for a Google AOSP Git URL, it stores the response in a cache directory and serves that cached data when the same request arrives again.
- Developers' machines must not connect to android.googlesource.com directly (to keep the 429 errors under control); all traffic for android.googlesource.com must go through the proxy server.
- To limit storage usage, cached data should be deleted after 7 days if nobody has requested it.
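One way to point developer machines at such a proxy, assuming it listens on plain HTTP port 8443 as in the config below ("gitproxy" is a placeholder for the proxy hostname), is git's URL rewriting, which repo sync should also pick up since repo drives git underneath:

# Run once per developer machine; "gitproxy" is a placeholder hostname.
git config --global url."http://gitproxy:8443/".insteadOf "https://android.googlesource.com/"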
And here's what I did in nginx (with help from ChatGPT 5, Gemini, ...):
user www-data;
worker_processes auto;
pid /run/nginx.pid;
error_log /var/log/nginx/error.log;
include /etc/nginx/modules-enabled/*.conf;

events {
    worker_connections 768;
    # multi_accept on;
}

http {
    ##
    # Basic Settings
    ##
    sendfile on;
    tcp_nopush on;
    types_hash_max_size 2048;
    # server_tokens off;
    # server_names_hash_bucket_size 64;
    # server_name_in_redirect off;
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    ##
    # SSL Settings
    ##
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3; # Dropping SSLv3, ref: POODLE
    ssl_prefer_server_ciphers on;

    ##
    # Logging Settings
    ##
    log_format cache_log '$time_local $remote_addr - $upstream_cache_status "$request" $status $body_bytes_sent';
    access_log /var/log/nginx/access.log cache_log;

    ##
    # Gzip Settings
    ##
    gzip on;

    ##
    # Proxy Settings
    ##
    proxy_temp_path /DATA1/nginx/proxy_temp 1 2;
    # Entries unused for 7 days are evicted (inactive=7d); total size is capped at 500 GB.
    proxy_cache_path /DATA1/nginx/cache levels=1:2 keys_zone=https_git_cache:10m max_size=500g inactive=7d use_temp_path=off;
    proxy_cache_lock on;
    proxy_cache_lock_timeout 30s;
    proxy_cache_revalidate on;
    proxy_cache_background_update on;
    proxy_cache_min_uses 1;
    proxy_max_temp_file_size 131072m;

    # Route /googleapi_storage/* to storage.googleapis.com, everything else to android.googlesource.com.
    map $uri $upstream_host {
        ~^/googleapi_storage/ storage.googleapis.com;
        default android.googlesource.com;
    }
    # Strip the /googleapi_storage prefix before passing the URI upstream.
    map $uri $upstream_uri {
        ~^/googleapi_storage(?<path>.*) $path;
        default $uri;
    }

    server {
        listen 8443;
        resolver 8.8.8.8;
        proxy_ssl_server_name on;
        proxy_ssl_name $upstream_host;
        proxy_http_version 1.1;
        client_max_body_size 0;
        proxy_buffering on;
        proxy_buffer_size 128k;
        proxy_buffers 64 128k;
        proxy_busy_buffers_size 256k;
        proxy_request_buffering on;
        proxy_ignore_headers Set-Cookie Cache-Control Expires;
        add_header X-Cache-Status $upstream_cache_status always;
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;

        location / {
            proxy_pass https://$upstream_host$upstream_uri$is_args$args;
            proxy_cache https_git_cache;
            proxy_cache_key "$scheme://$upstream_host$request_uri";
            proxy_cache_methods GET HEAD;
            proxy_cache_valid 200 302 301 307 7d;
            #proxy_cache_valid 301 307 10m;
            #proxy_cache_valid any 5m;
            proxy_set_header Host $upstream_host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            #proxy_buffering on;
            #proxy_buffer_size 16k;
            #proxy_buffers 8 32k;
            proxy_cache_lock on;
        }
    }

    ##
    # Virtual Host Configs
    ##
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
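Per-request cache behavior can be checked through the X-Cache-Status header the config adds, for example ("gitproxy" is again a placeholder for our proxy hostname):

# Dump response headers for a GET without saving the body:
curl -s -D - -o /dev/null "http://gitproxy:8443/platform/manifest/info/refs?service=git-upload-pack" | grep -i x-cache-status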
We expected at least 55 GB of cached data, but in reality only 3.2 MB was stored.
According to ChatGPT, large amounts of Git data over HTTPS are transferred as POST requests, but the body value changes every time, so even if the POSTs were cached they could not be reused.
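A quick way to confirm this from the cache_log format above is to look at the POST /git-upload-pack requests, which carry the actual pack data and which proxy_cache_methods GET HEAD never stores:

# Every heavy transfer shows up as a POST, so it bypasses the cache entirely.
grep '"POST .*git-upload-pack' /var/log/nginx/access.log | tail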
Is there any suggestion for how we can cache this data correctly?
Or should we look at other tools?
3 Replies
The Git 'smart HTTP' protocol is not cacheable in general, as it performs interactive transactions instead of downloading static data. The client negotiates which commits it wants, and the server generates a large 'packfile' on the fly.
A related issue is that the 'smart HTTP' protocol always negotiates a single packfile for everything. So even if those POSTs were cacheable in themselves, both the request and the resulting packfile would become stale literally the next day (even the next hour) as new commits flow into the repository, making your cache fill rapidly with data that becomes useless almost immediately.
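You can watch this two-step exchange on the wire with git's curl trace (the repository here is just a small example):

# Step 1 is a GET of .../info/refs (the ref advertisement); step 2 is a POST
# to .../git-upload-pack carrying the client's "want"/"have" negotiation.
GIT_TRACE_CURL=1 git clone https://android.googlesource.com/platform/manifest 2>&1 | grep -E '(GET|POST) /'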
The Git 'dumb HTTP' protocol would be cacheable, as it downloads static pre-made packfiles (plus some individual objects that haven't yet been packed on the server). Unfortunately googlesource does not host the static files necessary for this protocol to work.
I suggest identifying the several largest repositories that the developers need, and setting up an ordinary git clone --mirror cache for those, while still cloning the smaller ones from googlesource directly.
(If they turn out to be related repositories – e.g. 5 forks of the same Linux kernel – Git has a way to share common objects using --reference to reduce the space usage.)
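A minimal sketch of such a mirror, with example repository paths (pick whichever ones your developers actually pull the most):

# One-time mirror clone of a large repository:
git clone --mirror https://android.googlesource.com/kernel/common /srv/git/kernel-common.git
# Refresh it periodically (e.g. from cron) so the mirror tracks upstream:
git --git-dir=/srv/git/kernel-common.git fetch --prune
# A related fork can borrow objects from the mirror instead of storing its own copies:
git clone --reference /srv/git/kernel-common.git https://android.googlesource.com/kernel/msm /srv/git/kernel-msm.git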