
I read an article on the Trello tech stack; under the section "MongoDB", it states the following:

[...] One of the coolest and most performance-obsessed teams we know is our next-door neighbor and sister company StackExchange. Talking to their dev lead David at lunch one day, I learned that even though they use SQL Server for data storage, they actually primarily store a lot of their data in a denormalized format for performance, and normalize only when they need to.

I've read: Which tools and technologies are used to build the Stack Exchange Network? and https://blog.stackoverflow.com/2008/09/what-was-stack-overflow-built-with/

But neither covers the above statement, and the blog post was written four years ago (January 19th, 2012, to be exact).

  1. Does SE store its data in a denormalised format, then normalise it in SQL Server when needed? (Has it ever?)
  2. If so, how does this work? At what point is the data normalised? After a timeframe, an event, or something else?

1 Answer


primarily store a lot of their data in a denormalized format for performance

I am not sure what exactly this is supposed to mean. I feel it overstates what we actually do, and it may mean something entirely different from what I (and you) took from it. Do note that I joined Stack Overflow about a year after that article was written.

Our single source of truth is SQL Server, and the relational design there is mostly normalized (you can see a bit of that in SEDE, though that's a simplified version of our actual data model) - normalized is our default and we only denormalize when there is a good performance benefit.

We do store some denormalized data for performance reasons, and usually it is updated at the same time as the normalized tables. An example of that is the Tags column of the Posts table - it is a denormalized view of the tags set on a question; it and the PostTags table contain the same information and are updated at the same time.
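As a rough sketch of that pattern (Python dicts standing in for the Posts and PostTags tables - the real system is SQL Server, and the `set_tags` helper and tag-string format here are illustrative, not the actual schema), both representations are written in the same logical update:

```python
# Sketch: keep a denormalized Tags string in step with the normalized tag rows.
# Dicts stand in for the Posts and PostTags tables; the real store is SQL Server.

posts = {1: {"Tags": ""}}   # denormalized: the Posts.Tags column
post_tags = {1: set()}      # normalized: PostTags rows (post_id -> set of tags)

def set_tags(post_id, tags):
    """Update the normalized rows and the denormalized column together,
    as the real system would inside one transaction."""
    post_tags[post_id] = set(tags)
    posts[post_id]["Tags"] = "".join(f"<{t}>" for t in sorted(tags))

set_tags(1, ["sql-server", "denormalization"])
print(posts[1]["Tags"])  # <denormalization><sql-server>
```

Because both writes happen in the same update, a reader of either representation normally sees the same tags.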

Now, due to race conditions, the normalized and denormalized versions could get out of sync, so we have daily jobs (IIRC) that look at the normalized data and fix any inconsistencies in the denormalized data.
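The reconciliation job can be sketched like this (again in Python with dicts as stand-ins; `fix_inconsistencies` is a hypothetical name, not the real job): rebuild the denormalized value from the normalized rows, which are the ground truth, and overwrite any copy that drifted.

```python
# Sketch of a daily reconciliation job: the normalized rows win;
# any denormalized copy that disagrees is rewritten from them.

def fix_inconsistencies(posts, post_tags):
    fixed = 0
    for post_id, tags in post_tags.items():
        expected = "".join(f"<{t}>" for t in sorted(tags))
        if posts[post_id]["Tags"] != expected:
            posts[post_id]["Tags"] = expected  # normalized data is ground truth
            fixed += 1
    return fixed

# A race left one denormalized copy stale:
posts = {1: {"Tags": "<sql>"}, 2: {"Tags": "<redis>"}}
post_tags = {1: {"sql", "performance"}, 2: {"redis"}}
print(fix_inconsistencies(posts, post_tags))  # 1
print(posts[1]["Tags"])                       # <performance><sql>
```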

We do cache materialized objects that came from the database (both on the web servers and in Redis) - so this might be thought of as denormalized data, but this is not persistent storage.

answered Feb 19, 2016 at 10:53
  • That makes sense. In the context of the article I assumed that SE used a NoSQL store and normalised it intermittently. Thanks for clearing that up, Oded! Commented Feb 19, 2016 at 10:58
  • @Jezzabeanz - no, we use SQL Server as our persistent data store and Redis (mostly) for caching. You can look at this recent blog post by Nick Craver about our architecture if you want more details :) Commented Feb 19, 2016 at 11:00
  • Sorry if my comment was unclear; I was simply stating what I thought (before your answer) was being implied. Thanks for the blog post; I'll give that a read on my lunch. Commented Feb 19, 2016 at 11:01
  • +1 The logic isn't "store denormalized, normalize if you need to", but the reverse: normalization is the default, but if necessary, denormalized data is also stored in addition. Another example is post score (= upvotes minus downvotes). Upvotes and downvotes are stored, but the resulting score is also stored, so it doesn't have to be recalculated every time a post is displayed. Still, the normalized data is the ground truth here, and mismatches (e.g. because of potential race conditions) will be fixed eventually, within 24 hours I think. /cc @Jezzabeanz Commented Feb 19, 2016 at 11:01
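The post-score example from that comment follows the same shape. A minimal sketch (Python, with a list standing in for the vote rows and `true_score` as an illustrative helper, not the real implementation): the individual votes are the normalized ground truth, and the stored score is a denormalized copy that can always be recomputed and corrected from them.

```python
# Sketch: votes are the normalized ground truth; the stored score is a
# denormalized convenience that a periodic job can recompute and fix.

votes = [("up",), ("up",), ("down",)]   # individual vote rows
stored_score = 5                        # denormalized copy that drifted

def true_score(votes):
    """Recompute the score from the vote rows: upvotes minus downvotes."""
    return sum(1 if v[0] == "up" else -1 for v in votes)

if stored_score != true_score(votes):
    stored_score = true_score(votes)    # the periodic job's fix-up

print(stored_score)  # 1
```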
