I've been doing some testing on various PostgreSQL queries.
Testing often involves changing the syntax of a query, altering table joins, or occasionally rewriting the query completely.
I've noticed that I sometimes get an 'apparent' big performance increase. I'll run a query and it'll take (say) 60 seconds; I'll make a minor change and it'll then take (say) 5 seconds.
At first I thought this was because my minor tweak had improved the performance. I've since realised that there must be some caching going on (to see this, run a 60-second query and then run it again a few seconds later - it will always run quicker the second time). I assume this is because the data has been cached somewhere, so when it needs reading a second time it's already to hand.
I'm sure this is a useful performance feature, but it makes it very hard to spot genuine performance improvements when tweaking a query. Is it possible to flush the cache before each execution so that every test starts from the same position?
Thanks
1 Answer
PostgreSQL relies heavily on the OS cache as well as its own cache (shared_buffers). So you would have to clear both: restart the PostgreSQL service to clear its own cache, and clear the OS cache (either restart the machine, or, on Linux, drop the page cache without restarting - a sketch of both steps is below).
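For example, on a Linux machine running systemd, the two steps might look like the sketch below (the service name and exact commands vary by distribution, so treat this as an illustration rather than a recipe):

    # Clear PostgreSQL's own cache (shared_buffers) by restarting the service.
    # The unit name may differ, e.g. postgresql-15 on some distributions.
    sudo systemctl restart postgresql

    # Flush the Linux page cache: write out dirty pages first, then drop the
    # clean page cache, dentries and inodes.
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches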
But, why do you want to do it this way? If the query is parameterized, you could just change the parameters each time so they refer to a different part of the data. If the query is not parameterized, then why is the data it needs getting driven out of the cache between runs on the production machine?
When the different parameterizations have different performance due to different result sizes, the most definitive solution is to write a driver program that repeats the query several times with random (but realistic) parameterizations, and hope that the parameter-induced variation averages out between your tuning settings. This is a pain, but when I have had to resort to it I generally came to wish I had done it sooner. This is better than running each query from a cold cache, because in the real production situation you are unlikely to have a fully cold cache. Certain parts of the data will be common to all parameterizations, and that part will be "hot" enough to always be in the cache.
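A minimal sketch of such a driver, assuming GNU time is installed and using a made-up mydb database with an orders table and a customer_id column (all placeholder names), is a shell loop that picks a random parameter for each run and prints the elapsed time:

    #!/bin/bash
    # Hypothetical benchmark driver: run the query 50 times, each time with a
    # random (but realistic) parameter value, and print each execution time.
    for i in $(seq 1 50); do
        cust_id=$(( (RANDOM % 10000) + 1 ))   # random parameter per run
        /usr/bin/time -f "customer_id=${cust_id}: %e seconds" \
            psql -d mydb -q -c "SELECT count(*) FROM orders WHERE customer_id = ${cust_id};" \
            > /dev/null
    done

Averaging (or taking the median or 95th percentile of) the printed times is what lets the parameter-induced variation wash out between your tuning settings.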
the query IS parameterised, but testing against different parameters will give different performance results (some queries will return 1 row, some 10,000) so only by doing the test(s) on the same set of data returning the same set of rows can you really confirm if any performance tweaks have really made a difference. - ConanTheGerbil, May 17, 2018 at 10:03

@Hemel "doing the test(s) on the same set of data" will show you improvements for that specific set of parameters, but if your query has a varied range of possible parameters then you are much better off averaging (or something more statistically significant like median or 95th percentile) a range of different parameter sets. You're effectively benchmarking and benchmarks are constructed exactly that way. jjanes has made some good suggestions, including generating a lot of random-realistic scenarios. If you run the same ones, you can't avoid the caching effect. - Davos, Nov 29, 2018 at 0:32

Production systems are usually serving a lot more traffic on a lot more tables than a test system. I don't think it's reasonable to assume that cache benefits will be similar at all. - Aryeh Leib Taurog, Mar 10, 2025 at 14:40
See EXPLAIN and EXPLAIN ANALYZE. And: stackoverflow.com/questions/1216660/… and stackoverflow.com/questions/24252455/…

Use explain (analyze, buffers) and run the query several times rather than just once, then compare the average execution time. If you try to minimize the "buffers" that are needed in the plan you will get an improvement in case the data is not cached, and you'll get an improvement if the data is read from cache because you avoid the synchronization on the memory structures (= less CPU).
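As a concrete illustration of that suggestion (the database, table and parameter value are again placeholders), you could run the same statement a few times under EXPLAIN (ANALYZE, BUFFERS) and watch how the buffer counts change between runs:

    # In the output, "Buffers: shared read=..." counts blocks that had to be
    # fetched from outside PostgreSQL's buffer cache (OS cache or disk), while
    # "shared hit=..." counts blocks that were already in shared_buffers.
    for i in 1 2 3; do
        psql -d mydb -c "EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM orders WHERE customer_id = 42;"
    done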