# **Python Cache: How to Speed Up Your Code with Effective Caching**

This article will show you how to use caching in Python with your web
scraping tasks. You can read the [<u>full
article</u>](https://oxylabs.io/blog/python-cache-how-to-use-effectively)
on our blog, where we delve deeper into the different caching
strategies.

## **How to implement a cache in Python**

There are different ways to implement caching in Python for different
caching strategies. Here we’ll see two methods of Python caching for a
simple web scraping example. If you’re new to web scraping, take a look
at our [<u>step-by-step Python web scraping
guide</u>](https://oxylabs.io/blog/python-web-scraping).

### **Install the required libraries**

We’ll use the [<u>requests
library</u>](https://pypi.org/project/requests/) to make HTTP requests
to a website. Install it with
[<u>pip</u>](https://pypi.org/project/pip/) by entering the following
command in your terminal:

```bash
python -m pip install requests
```

Other libraries we’ll use in this project, specifically time and
functools, come natively with Python 3.11.2, so you don’t have to
install them.

### **Method 1: Python caching using a manual decorator**

A [<u>decorator</u>](https://peps.python.org/pep-0318/) in Python is a
function that accepts another function as an argument and outputs a new
function. We can alter the behavior of the original function using a
decorator without changing its source code.

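For instance, here’s a minimal toy decorator that logs each call before
running the wrapped function (an illustrative sketch, not part of the
scraper itself):

```python
import functools

def log_calls(func):
    # functools.wraps preserves the wrapped function's name and docstring
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f'Calling {func.__name__} with arguments {args}')
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

add(2, 3)  # Prints: Calling add with arguments (2, 3)
```
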
One common use case for decorators is to implement caching. This
involves creating a dictionary that stores the function’s results so
they can be served from the cache on future calls.

Let’s start by creating a simple function that takes a URL as an
argument, requests that URL, and returns the response text:

```python
import requests

def get_html_data(url):
    response = requests.get(url)
    return response.text
```

Now, let’s create a memoized version of this function:

```python
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text
```

In this example, the memoize decorator creates a cache dictionary that
holds the results of previous function calls. The wrapper function
checks whether the current arguments have been cached before and, if
so, returns the previously cached result. If not, it calls the original
function and caches the result before returning it.

By adding @memoize above the function definition, we can use the
memoize decorator to enhance the get_html_data function. This generates
a new memoized function that we’ve called get_html_data_cached. It
makes only a single network request for a URL and then stores the
response in the cache for subsequent requests.

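To confirm this behavior, we can call the memoized function twice and
time each call (assuming the definitions above); the second call should
return almost instantly because it’s served straight from the cache
dictionary:

```python
import time

for attempt in (1, 2):
    start_time = time.time()
    get_html_data_cached('https://books.toscrape.com/')
    print(f'Call {attempt} took {time.time() - start_time:.4f} seconds')
# The first call pays the network cost; the second is a dictionary lookup.
```
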
Let’s use the time module to compare the execution speeds of the
get_html_data function and the memoized get_html_data_cached function:

```python
import time

start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):',
      time.time() - start_time)
```

Here’s what the complete code looks like:

```python
# Import the required modules
import time

import requests

# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoize decorator to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function that stores results in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

# Memoized function to get the HTML content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text

# Get the time it took for the normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for the memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):',
      time.time() - start_time)
```

Here’s the output:

Notice the time difference between the two functions. Both take almost
the same time here, since the real benefit of caching only shows up
when the data is accessed again.

Since we’re making only one request, the memoized function still has to
fetch the data over the network the first time it’s called. Therefore,
with our example, a significant difference in execution times isn’t
expected. However, if you increase the number of calls to these
functions, the time difference will grow significantly (see
[<u>Performance Comparison</u>](#performance-comparison)).

### **Method 2: Python caching using LRU cache decorator**

Another method to implement caching in Python is to use the built-in
@lru_cache decorator from functools. This decorator caches results
using the least recently used (LRU) strategy. The LRU cache is a
fixed-size cache, which means that once it’s full, it discards the
entries that haven’t been used recently to make room for new ones.

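To see the eviction policy in action, here’s a small self-contained
sketch (a toy function, not the scraper) with the cache size limited to
two entries:

```python
from functools import lru_cache

@lru_cache(maxsize=2)
def square(n):
    print(f'Computing square({n})')
    return n * n

square(1)  # Computed and cached
square(2)  # Computed and cached
square(1)  # Cache hit; nothing is printed
square(3)  # Computed; evicts square(2), the least recently used entry
square(2)  # Computed again because it was evicted
```
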
To use the @lru_cache decorator for our scraper, we can create a new
function for extracting HTML content and place the decorator above it.
Make sure to import the functools module before using the decorator:

```python
from functools import lru_cache

import requests

@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text
```

In the above example, the get_html_data_lru function is memoized using
the @lru_cache decorator. With the maxsize option set to None, the
cache can grow without bound.

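Functions wrapped with @lru_cache also expose the cache_info() and
cache_clear() helpers, which are handy for checking whether your calls
are actually hitting the cache:

```python
get_html_data_lru('https://books.toscrape.com/')  # Miss: fetched over the network
get_html_data_lru('https://books.toscrape.com/')  # Hit: served from the cache

# Prints something like: CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)
print(get_html_data_lru.cache_info())

# Empty the cache entirely, e.g., if the page content may have changed
get_html_data_lru.cache_clear()
```
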
Here’s the complete code sample:

```python
# Import the required modules
from functools import lru_cache
import time

import requests

# Function for getting HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoized using the LRU cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text

# Get the time it took for the normal function to extract HTML content
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for the memoized function (LRU cache) to extract
# HTML content
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):',
      time.time() - start_time)
```

This produced the following output:

### **Performance comparison**

The following table shows the execution times of all three functions
for different numbers of requests:

| **No. of requests** | **Time taken by normal function** | **Time taken by memoized function (manual decorator)** | **Time taken by memoized function (lru_cache decorator)** |
|---------------------|-----------------------------------|--------------------------------------------------------|-----------------------------------------------------------|
| 1                   | 2.1 seconds                       | 2.0 seconds                                            | 1.7 seconds                                               |
| 10                  | 17.3 seconds                      | 2.1 seconds                                            | 1.8 seconds                                               |
| 20                  | 32.2 seconds                      | 2.2 seconds                                            | 2.1 seconds                                               |
| 30                  | 57.3 seconds                      | 2.22 seconds                                           | 2.12 seconds                                              |
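
These figures come from calling each function repeatedly in a loop. A
minimal harness along the following lines can reproduce the
measurements (assuming the functions defined earlier, shown here for
the normal and lru_cache variants; exact timings will vary with your
network):

```python
import time

def benchmark(func, url, n_requests):
    # Time n_requests consecutive calls to func
    start_time = time.time()
    for _ in range(n_requests):
        func(url)
    return time.time() - start_time

url = 'https://books.toscrape.com/'
for n in (1, 10, 20, 30):
    get_html_data_lru.cache_clear()  # Start each run with a cold cache
    normal = benchmark(get_html_data, url, n)
    cached = benchmark(get_html_data_lru, url, n)
    print(f'{n} requests: {normal:.2f}s (normal) vs {cached:.2f}s (lru_cache)')
```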

As the number of requests to the functions increases, you can see a
significant reduction in execution times with the caching strategy. The
following comparison chart depicts these results:

The comparison results clearly show that using a caching strategy in
your code can significantly improve overall performance and speed.

Feel free to visit our [<u>blog</u>](https://oxylabs.io/blog) for an
array of intriguing web scraping topics that will keep you hooked!