I've got a database table which will grow by about 5,000 rows an hour. For any given key that I query by, the result set grows by about 1 row every hour. I would like a web page to show the latest rows for a key, 50 at a time (this is configurable). I would like to try implementing memcache to keep database read activity low.
If I run a query and create a cache entry for each page of 50 results, that works until a new entry is added. At that point the page of latest results gains a new row and its oldest row drops off. This cascades down the list of cached pages, forcing me to update every cached page. It seems like a poor design.
I could build the cache pages backwards, then for each page requested I should get the latest 2 pages and truncate to the proper length of 50. I'm not sure if this is good or bad?
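One way to make the "build the cache pages backwards" idea concrete (a sketch in Python; the bucket scheme and function names are illustrative, not part of the question): number the pages from the *oldest* row, so that bucket boundaries never move and only the newest, possibly-partial bucket ever changes. Every older bucket's cache entry then stays valid forever.

```python
def fixed_pages(total_rows, page_size=50):
    """Split rows into fixed buckets counted from the OLDEST row.

    Returns a list of (start_offset, length) buckets, oldest first.
    Because boundaries never move, inserting a new row only affects
    the last (possibly partial) bucket.
    """
    full, remainder = divmod(total_rows, page_size)
    buckets = [(i * page_size, page_size) for i in range(full)]
    if remainder:
        buckets.append((full * page_size, remainder))
    return buckets


def newest_page(total_rows, page_size=50):
    """Buckets needed to serve the 'latest 50' page: the newest
    partial bucket plus the previous full one, truncated by the caller."""
    buckets = fixed_pages(total_rows, page_size)
    newest = buckets[-1]
    if newest[1] == page_size or len(buckets) == 1:
        return [newest]
    return [buckets[-2], newest]  # fetch two cached buckets, truncate to page_size
```

With this layout, serving the latest page means fetching at most two cached buckets and trimming, which matches the "get the latest 2 pages and truncate" idea above.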
Ideally, the mechanism I use to insert a new row would also know how to invalidate the proper cache results.
Has someone already solved this problem in a widely accepted way? What's the best method of doing this?
EDIT:
If my understanding of the MySQL query cache is correct, it has table-level granularity for invalidation. Given that roughly 5,000 rows are inserted into the table before a query on any given key actually needs new results, it seems the database query cache would rarely help here. MS SQL caches execution plans and frequently accessed data pages, so it may do better in this scenario.
My query is not against a single table with TOP N. One version has joins to several tables and another has sub-selects.
Also, since I want to cache the generated HTML table, I'm wondering if a cache at the web server level would be appropriate? Is there really no benefit to any type of caching? Is the best advice really to let every page request go through all the layers and hit the database?
4 Answers
Unless I'm misunderstanding the question, I don't think that this is an appropriate scenario for caching.
Cached data normally has at least one of the following attributes (usually all of them):
- Expensive to retrieve or compute;
- Highly static - may change occasionally but very rarely;
- Non-critical - OK if the requester sees stale data.
It doesn't sound like any of these apply to your situation.
- The query is a simple `SELECT`, probably with a `TOP N`, just an index seek;
- It changes very frequently;
- Your requirements indicate that immediate updates are required.
So why are you caching? Caching isn't a panacea; oftentimes it can actually make performance worse, if the cache memory could be better used for some other purpose.
Databases do their own caching. As long as the DB server has plenty of memory then it may cache the entire table in memory if it's frequently queried; the performance of that will be just as good as your cache if not better.
Some further thoughts/suggestions:
If stale data is OK, then the simplest solution would be to use a fixed interval (i.e. expiration). This method is used very effectively in hundreds of thousands of sites and systems. You can either force an update on expiration or just wait until it's requested again.
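A fixed-interval expiration is simple to sketch. The class below is a self-contained stand-in in Python; with a real memcached client you would normally just pass an expiry time to `set`, but the behaviour is the same:

```python
import time


class TTLCache:
    """Minimal fixed-interval (expiration-based) cache sketch."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # stale: drop it and force a reload
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
```

For example, caching the rendered "latest 50" HTML with a 60-second TTL means readers see data at most one minute stale, and the database is hit at most once a minute per page regardless of traffic.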
If you're concerned about conflicts between reads and writes, then (a) don't be, until you've profiled it, and (b) if it really is an issue, then instead of trying to cache it, just use a redundant table or a `NOLOCK` hint.
If you need to invalidate the cache every single time a row is added/changed then you have completely defeated the purpose of an application cache, and are now trying to implement an in-memory database. Please don't do this unless you have an extremely good reason for it.
The volume of rows you are dealing with is very low - less than 10,000 per year.
Implementing a caching mechanism for this would overly complicate something the database can do very quickly and easily, especially with the right indexes in place.
Is there a specific reason you are trying to implement memcache for this scenario?
MySQL's built-in query cache will actually work well here. Since your table doesn't change very often, it will cache your results nicely and reduce the database read activity anyway.
Note: I have assumed MySQL since that's where I usually see memcache implemented :)
Edit: Based on the updated details, I would still suggest going with a straight database solution. The load on the database for reading only really becomes an issue in high volume environments. Good indexing and query optimisation will usually provide good performance in many environments.
If you do need to take the memcache path, I would suggest that you don't try to micro-manage the pages in the cache.
Each insert can check memcache: if the insert introduces a new entry for the key (which you have said happens about once an hour), it should invalidate the entire cache relating to that table.
When someone requests any page of the results, check memcache. If the results for that page are already there, use them. If not, run a query to fetch just that page, cache the page in memcache, and return the results.
With this approach, the cache management is simple and you only regenerate the cache page the first time it is requested after an invalidation. Following requests will use the cache until the next invalidation. This approach will also mean only caching data for pages which are actually requested.
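The flow above can be sketched as cache-aside code (Python, with a plain dict standing in for memcache; `fetch_page`, the key format, and the `_db_hits` counter are illustrative):

```python
class PageCache:
    """Cache-aside sketch: reads fill the cache one page at a time;
    an insert that adds a row for a key invalidates every cached
    page for that key (roughly once an hour per key)."""

    def __init__(self, fetch_page):
        self._fetch_page = fetch_page  # fetch_page(key, page) -> rows from the DB
        self._cache = {}               # stands in for memcache
        self._db_hits = 0              # just for demonstration

    def get_page(self, key, page):
        cache_key = f"rows:{key}:page:{page}"
        rows = self._cache.get(cache_key)
        if rows is None:                       # miss: query just this page
            rows = self._fetch_page(key, page)
            self._db_hits += 1
            self._cache[cache_key] = rows
        return rows

    def on_insert(self, key):
        """Called from the insert path: drop all cached pages for this key."""
        prefix = f"rows:{key}:page:"
        for k in [k for k in self._cache if k.startswith(prefix)]:
            del self._cache[k]
```

One caveat if you port this to real memcached: it has no prefix or wildcard delete, so a common workaround is to embed a per-key generation number in the cache key and bump that number on insert, which makes all old pages unreachable at once.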
- I incorrectly specified the volume and edited the question to reflect that. 1 per hour is for a specific key; the table grows much faster. – McLeopold, May 23, 2011
Use the database cache
It'll handle this low volume with ease. Just use the following query:
select top 50 * from your_view
- Just use a paged query as normal. You don't need caching at all for your scenario. – Gary, May 24, 2011
Another trick that works in many databases is a composite index. Composite indexes usually perform well when the query filters in the same order as the columns in the index. In your case, the index would be on the key column followed by the row's ordering column.
This way, the index is built like (key1, row1) (key1, row2) (key2, row3) (key2, row5) (key3, row6) and so on. Lookups by the key prefix are then very cheap, and fetching the top/latest rows is lightning quick. If you go down this route, verify that your DB actually picks this composite index for the top-N query.
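A quick way to check that the database really uses such an index (a sketch using SQLite in place of MySQL/MS SQL; the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (key_id INTEGER, created_at INTEGER, payload TEXT)"
)
# composite index: the key column first, then the ordering column
conn.execute(
    "CREATE INDEX ix_events_key_created ON events (key_id, created_at)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(k, t, f"row-{k}-{t}") for k in range(3) for t in range(100)],
)

# latest 50 rows for one key: walks the index backwards, no sort step
rows = conn.execute(
    "SELECT payload FROM events WHERE key_id = ? "
    "ORDER BY created_at DESC LIMIT 50",
    (1,),
).fetchall()

# confirm the plan actually seeks on the composite index
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT payload FROM events "
    "WHERE key_id = ? ORDER BY created_at DESC LIMIT 50",
    (1,),
).fetchall()
```

The same check exists in other engines (`EXPLAIN` in MySQL, the execution plan in SQL Server); what you want to see is an index seek on the composite index with no separate sort.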
- I wouldn't exactly call this a "trick"; it's more like a fundamental concept of indexing. And most of the time, the order of predicates in the query doesn't matter at all, since the optimizer will rearrange them as needed. – Aaronaught, May 24, 2011
- By order of querying I meant that your first and second filtering parameters appear in the same order as the columns in the composite index. – Sub S, May 24, 2011