ClientSideWebCache

Is there an existing Python module that takes care of retrieving and caching web page contents?

Something like:

    import cachedweb  # hypothetical module; Cache is a made-up class name

    # Maintain cache data in .web_cache
    cache = cachedweb.Cache("/home/user/.web_cache")
    print cache.get("http://example.net")

Perhaps there are different options for where and how to store cache data.

I have written at least three programs that do this (an nLSD interpreter and two Local Names servers) and am about to embark on a fourth.

Has anyone created a standard module or interface for this sort of thing?

Some things that would be nice:

Optional attention to HTTP cache directives.
Specify directory to store cache entries in.
Optional compression and decompression of cached data.
Optional connection with a client-side Squid cache. (Pooling a web cache with other programs.)
Conceivably, a caching module could be a drop-in replacement for urllib.
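
Two of the items above (a configurable cache directory and optional compression) can be sketched in a few lines. This is only an illustration in Python 3, not an existing module: the names DirectoryCache and fetch_url are made up for this page, and the sketch ignores HTTP cache directives and never expires anything.

```python
import hashlib
import os
import zlib
from urllib.request import urlopen


def fetch_url(url):
    """Default fetcher (hypothetical helper): download `url` as bytes."""
    with urlopen(url) as response:
        return response.read()


class DirectoryCache:
    """Hypothetical one-file-per-URL cache; a sketch, not a real module."""

    def __init__(self, directory, compress=False, fetch=fetch_url):
        self.directory = directory
        self.compress = compress
        self.fetch = fetch  # injectable so the cache can be tried offline
        os.makedirs(directory, exist_ok=True)

    def _path(self, url):
        # Hash the URL so that any URL maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest)

    def get(self, url):
        path = self._path(url)
        if os.path.exists(path):
            # Cache hit: read the stored bytes, decompressing if needed.
            with open(path, "rb") as f:
                data = f.read()
            return zlib.decompress(data) if self.compress else data
        # Cache miss: fetch, store (optionally compressed), and return.
        contents = self.fetch(url)
        with open(path, "wb") as f:
            f.write(zlib.compress(contents) if self.compress else contents)
        return contents
```

Passing the fetch function in as a parameter is a design choice of this sketch: it keeps the storage logic separate from the network code, which is roughly what a drop-in replacement for urllib would need to do as well.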

Some info for would-be cachers:

Caching Tutorial for Web Authors and Webmasters - talks about HTTP headers having to do with caching
urllib.urlretrieve - performs some of what we want, though you have to do a lot of maintenance yourself
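
To give a feel for what paying attention to the cache headers involves, here is a rough Python 3 sketch that reads a freshness lifetime out of a Cache-Control header. The function name freshness_lifetime and its default_ttl parameter are hypothetical. The sketch is deliberately simplified: it ignores the Expires header, and it treats no-cache as "do not reuse", although strictly that directive means "revalidate before reuse".

```python
import re


def freshness_lifetime(headers, default_ttl=0):
    """Return how many seconds a response may be reused (simplified).

    `headers` is any mapping with a .get() method, for example the
    headers object of a urllib response or a plain dict.
    """
    cache_control = headers.get("Cache-Control", "")
    directives = [d.strip().lower() for d in cache_control.split(",")]

    # Simplification: treat both directives as "never serve from cache".
    if "no-store" in directives or "no-cache" in directives:
        return 0

    for directive in directives:
        match = re.fullmatch(r"max-age=(\d+)", directive)
        if match:
            return int(match.group(1))

    # No usable directive: fall back to the caller's own time to live.
    return default_ttl
```

A cache could call this when it stores a page and use the result in place of a fixed time to live.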

-- LionKimbro 2005-03-29 06:45:28

Primitive Example

Here's a very simple example:

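    #!/usr/bin/python
    """Retrieve and cache web pages.

    webcache retrieves and caches web pages. If the webpage has been
    retrieved before, the cached version is used.

    The module is primitive; it DOES NOT respect HTTP cache headers. Cached
    pages are stored in a BSD database.

    WebCache -- cache for web pages
    """

    # Python 2 only: bsddb and the print statement do not exist in Python 3.
    import time
    import urllib
    import optparse
    import bsddb


    class WebCache:

        """BSD DB cache for web pages.

        get_page -- retrieve a page from cache or web
        dump_page -- dump a cache entry
        clean -- remove expired cache entries
        """

        def __init__(self, page_db_filename, time_db_filename, cache_ttl):
            """Initialize web cache.

            Berkeley databases are created if they don't already exist. The
            page database stores the contents of web pages. The time
            database stores the times that the pages were loaded.

            Times are stored in seconds since the epoch.

            page_db_filename -- filename of page database
            time_db_filename -- filename of load timestamp database
            cache_ttl -- cache time to live in seconds
            """
            self._page_db = bsddb.hashopen(page_db_filename)
            self._time_db = bsddb.hashopen(time_db_filename)
            self.cache_ttl = cache_ttl

        def get_page(self, url):
            """Retrieve a page from the web or the cache.

            get_page returns the page contents retrieved by urllib.urlopen.

            url -- URL of web page to retrieve
            """

            # Serve from the cache while the entry is younger than the TTL.
            now = time.time()
            if url in self._time_db:
                last_read = float(self._time_db[url])
                if now < last_read + self.cache_ttl:
                    return self._page_db[url]

            contents = urllib.urlopen(url).read()

            # bsddb stores strings only, so the timestamp is saved as text.
            self._page_db[url] = contents
            self._time_db[url] = str(now)
            self._page_db.sync()
            self._time_db.sync()

            return contents

        def dump_page(self, url):
            """Force a cache entry to expire."""

            del self._time_db[url]
            del self._page_db[url]
            self._time_db.sync()
            self._page_db.sync()

        def clean(self):
            """Remove expired entries from the cache.

            Note that the BSD database file may not actually get smaller.
            (Rather, older data will be overwritten by new data.)
            """

            now = time.time()
            for (url, last_read) in self._time_db.items():
                last_read = float(last_read)
                if now >= last_read + self.cache_ttl:
                    del self._time_db[url]
                    del self._page_db[url]
            self._time_db.sync()
            self._page_db.sync()


    if __name__ == "__main__":
        parser = optparse.OptionParser("usage: %prog [options]\n"
                                       "cleans the cache if no URL is"
                                       " supplied")
        parser.add_option("-p", "--pages", dest="page_db_filename",
                          default="pages.db", type="string",
                          help="pages BSD database filename")
        parser.add_option("-t", "--times", dest="time_db_filename",
                          default="times.db", type="string",
                          help="timestamps BSD database filename")
        parser.add_option("-T", "--ttl", dest="ttl", default=60*60,
                          type="int", help="time to live (in seconds)")
        parser.add_option("-u", "--url", dest="url", type="string",
                          help="url of page to retrieve and display")

        (options, args) = parser.parse_args()
        if len(args) > 0:
            parser.error("incorrect number of arguments")

        cache = WebCache(options.page_db_filename, options.time_db_filename,
                         options.ttl)
        if options.url is None:
            cache.clean()
        else:
            # The URL comes from --url; args is always empty at this point.
            print cache.get_page(options.url)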
 82 
 83  now = time.time()
 84  for (url, last_read) in self._time_db.items():
 85  last_read = float(last_read)
 86  if now >= last_read + self.cache_ttl:
 87  del self._time_db[url]
 88  del self._page_db[url]
 89  self._time_db.sync()
 90  self._page_db.sync()
 91 
 92 
 93 if __name__ == "__main__":
 94  parser = optparse.OptionParser("usage: %prog [options]\n"
 95  "cleans the cache if no URL is"
 96  " supplied")
 97  parser.add_option("-p", "--pages", dest="page_db_filename",
 98  default="pages.db", type="string",
 99  help="pages BSD database filename")
 100  parser.add_option("-t", "--times", dest="time_db_filename",
 101  default="times.db", type="string",
 102  help="timestamps BSD database filename")
 103  parser.add_option("-T", "--ttl", dest="ttl", default=60*60,
 104  type="int", help="time to live (in seconds)")
 105  parser.add_option("-u", "--url", dest="url", type="string",
 106  help="url of page to retrieve and display")
 107 
 108  (options, args) = parser.parse_args()
 109  if len(args) > 0:
 110  parser.error("incorrect number of arguments")
 111 
 112  cache = WebCache(options.page_db_filename, options.time_db_filename,
 113  options.ttl)
 114  if options.url is None:
 115  cache.clean()
 116  else:
 117  print cache.get_page(args[0])