Oren Eini

aka Ayende Rahien

Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,616
|
Comments: 51,246

Copyright ©️ Ayende Rahien 2004 — 2025

Privacy Policy · Terms
filter by tags archive
Comparison page for RavenDB and MongoDB

Reviewing Lightning memory-mapped database libraryTransactions & commits

time to read 3 min | 567 words

Okay, so I have a pretty good idea about how things works now, we have transactions, which contains the dirty pages (and a transaction can store up to 128K of pages, so there is a max about 512MB of changes in a single transaction). While inside the transaction, you are using the local dirty pages to get consistent view of the data, and keep track of the freed pages. But how do we actually get it committed, and how does it works with ensuring the DB is ACID?

A transaction would go to disk in one of two cases, either it has some dirty pages that it needs to flush, or it has to update the db flags (which aren’t really interesting for us right now).

The first thing that happen in the transaction commit is that we save the freed pages using mdb_freelist_save. Now, the interesting about this is that we save the freed pages in the file... in the file. This leads to some really interesting code, in which you are trying to write to the B-Tree about free pages, and during the write, you are freeing pages, so you need to record that too.

The data about free pages is stored in the FREE_DBI, and it is stored there with the transaction id as the key, and the list of freed pages as the value. There is also a bunch of code there that refers to overflow pages, but I am going to skip that for now.

And now, this is probably the most important part:

image

mdb_page_flush() will write all the data to disk. If using writable mmap, by just updating the memory and clearing the dirty flag, or by doing file I/O. The next part, mdb_env_sync basically just call fsync() on the newly written data.

But that just make sure that the data is on disk, it doesn’t commit it yet. This is done via mdb_env_write. Since this is the most essential part of the commit, it is interesting to see how LMDB ensure that this is safe. If you remember, when we created the file we saved the first two pages as copies of the environment metadata. At the time, I wasn’t sure why that was the case. It brought to mind the CouchDB’s method of writing the start of the B-Tree in the start of the file twice, to ensure safety. But I think that the LMDB method is better, what it does, the first time, it create a duplicate entry.

After that, however, it works by alternating between the two. One transaction will flush the data to the first page and the next to the second one. When starting up, LMDB will read the two entries and select the most recent of them. It is a really nice way of handling this. But I think that I would be happy with a better way to handle corruptions. For example, a CRC32 or something like that, to make sure that this isn’t actually a failed write that got midway through.

Next up, I need to figure out how this applies with regards to selecting a free page with respect to the oldest running transaction... But that is a topic for the next post.

Tweet 15 comments

Related posts that you may find interesting:

07 Aug 2013 Reviewing Lightning memory-mapped database library: What about free pages?
11 Sep 2013 Degenerate performance scenario for LMDB
05 Aug 2016 The Guts n’ Glory of Database Internals: Merging transactions

More posts in "Reviewing Lightning memory-mapped database library" series:

  1. (08 Aug 2013) MVCC
  2. (07 Aug 2013) What about free pages?
  3. (06 Aug 2013) Transactions & commits
  4. (05 Aug 2013) A thoughtful hiatus
  5. (02 Aug 2013) On page splits and other painful things
  6. (30 Jul 2013) On disk data
  7. (25 Jul 2013) Stepping through make everything easier
  8. (24 Jul 2013) going deeper
  9. (15 Jul 2013) Because, damn it!
  10. (12 Jul 2013) tries++
  11. (09 Jul 2013) Partial

Comments

Howard Chu

All of LMDB's transaction/commit architecture was explained in the published papers and presentations. You've done yourself a disservice by not reading them first. http://symas.com/mdb/

CouchDB is a pure append-only design, with all of its inherent flaws.

LMDB is immune to torn writes. The relevant portion of a meta page is less than 128 bytes, while the minimum unit a drive can write is 512 bytes. Other systems (like BDB, SQLite, etc.) require an entire page to be written atomically for them to maintain consistency.

Ayende Rahien

Howard, It is a lot more interesting (and frustrating, I admit) from my point of view to go into a code base cold. In particular, that forces me to go over what you have done and figure it out.

Howard Chu

More interesting perhaps, but you're drawing wrong conclusions about both what was done and why. When the point is to learn, you should learn. The correct lessons.

LMDB's commit technique is one of the most basic high throughput concurrency techniques in computing - double-buffering. There's certainly some elegance to it but there's nothing mysterious or exotic here, it's just a very tried-and-true practice. It should already be in every programmer's base knowledge.

Ayende Rahien

Howard, I think that there is some difference in the world view that you have and I with regards to basic knowledge. Usually when I think of committing data, I think about something like either CouchDB's model of the double signed prefix or log file with signed records. I never thought about doing it this way, and I think it is a nice way of doing that.

Ayende Rahien

Howard, Also, please note that I am doing that for my own learning, and figuring out things on my own is a joy. It also mean that the learning is done at a much higher deeper level than if I was just able to repeat by rote some stuff I read.

Howard Chu

What I meant is that the technique of double-buffering itself is well known. It's a staple of graphics libraries, as well as most algorithms where more than one actor can operator on an item concurrently. Whether or not it's commonly used for txn commit is irrelevant. What is commonly done in code today is mostly garbage. Clearly double-buffering is good for the purpose, as LMDB demonstrates.

Beyers

I've been following this series purely to read the comments between Ayende and Howard. Both of you are clearly highly talented and experts in your fields, but it's interesting to see the different mindsets to approach things.

It's also somewhat comical to see someone else have a similar direct and to the point with a hint of condescending approach in responses that Ayende is used to dishing out, but I'm sure seldom receive :)

Duke

Haha Yeah me too, Keep it up lads :)

peter

I must agree with Oren that just looking at things from a more or less blank slate is a Good Thing for someone who like to own his knowledge. I commented on this a year or two ago, referring to a short scifi story about a child prodigy who was kept away from all other music so that he wouldn't end up copying, or worse, not compose something due to fears he was copying. (BTW the tone of "Ayende" has changed from long ago, when he used to post alot more about the scifi he read.) Howard, I think you should lighten up a bit and let Oren do his thing. Also the future-post nature of the blog means that we are always sort of commenting on old news.

Duke

Disagree, Howard keep saying it as you see it

Rafal

Yeah, Howard's remarks make this blog interesting again, even if he's somewhat lacking in the courtesy department. And I really don't mind if few egos get hurt in the process, successful learning requires a bit of humility.

Howard Chu

peter: obviously Oren can do whatever he wants and nothing I say will change it. But there are more efficient ways to learn about a subject area, than by completely ignoring all of the information already written about it, whose sole purpose is to explain the subject. And by now it should be obvious, I abhor anything that is not at peak efficiency.

Brian

This really has been an interesting series of comment threads. My take is that Howard is just annoyed that he has to keep coming back here to defend against "ignorant statements" (my interpretation of his feelings) that "would never have been made in the first place if Oren would just read the documentation I took the time to write for reasons not the least of which was to avoid having to do just this" and that these "lemmings" who read Oren's blog will likely just believe the half-assed statements without question. Oren really is just enjoying the challenge of going through an obscure codebase and "learning his way around". Howard would seem to prefer he limit the critical commentary to areas where he hasn't already covered it in his prior writings or in the accumulated knowledge of those old skool low level programmers who "know what the eff they are talking about."

Whether I'm off base or not, I'm enjoying the sparring. I doubt Howard is as much of a douche nozzle as he can come across as (though sometimes it does seem poetic justice that it's levied against the king of all apple polishers). But either way, I respect both of their minds and wouldn't be here if I didn't feel I learned something useful.

Howard Chu

Brian - I'm annoyed that he hasn't yet gotten to the parts of the code that I found difficult. ;)

Kelly Sommers

Ayende,

I have to agree with Howard that this technique isn't anything new and uncommon. This is essentially what's called Shadow Paging. Pat Helland approached Jim Gray about Shadow Paging back in the 1980's in discussions about database architectures.

Many databases since then have been doing Shadow Paging in the 1990's. Unfortunately a lot of modern databases ignore the research of the past which is still for the most part relevant today.

Comment preview

Comments have been closed on this topic.

Markdown formatting

ESC to close

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic* **bold**
_italic_ __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
 [id]: http://example.com/ "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1. Foo
2. Bar

Unordered, with paragraphs:

* A list item.
 With multiple paragraphs.
* Bar

You can nest them:

* Abacus
 * answer
* Bubbles
 1. bunk
 2. bupkis
 * BELITTLER
 3. burper
* Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - - 

Manual Line Breaks

End a line with two or more spaces:

Roses are red, 
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit |Color
---------|----------
Apples |Red
Pears	 |Green
Bananas |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

FUTURE POSTS

  1. Using multi-staged actions with AI Agents to reduce costs & time - 10 hours from now

There are posts all the way to Nov 17, 2025

RECENT SERIES

  1. Recording (18):
    29 Sep 2025 - How To Run AI Agents Natively In Your Database
  2. Webinar (8):
    16 Sep 2025 - Building AI Agents in RavenDB
  3. RavenDB 7.1 (7):
    11 Jul 2025 - The Gen AI release
  4. Production postmorterm (2):
    11 Jun 2025 - The rookie server's untimely promotion
  5. RavenDB News (2):
    02 May 2025 - May 2025
View all series

Syndication

Main feed ... ...
Comments feed ... ...
}

AltStyle によって変換されたページ (->オリジナル) /