In a Python Bottle server using SQLite, I noticed that doing a DB commit after each INSERT is not efficient: it can use 100ms after each client request. I tried to improve on this in "How to commit DB changes in a Flask or Bottle app efficiently?".
I finally came to this solution, which is more or less a "debounce"-like method: if multiple SQL INSERTs happen during a 10-second timeframe, group all of them in a single DB commit.
What do you think of the following code? Is it safe to do like this?
(Note: I know that the use of a global variable should be avoided, and replaced by a class / object with attributes, etc. I'll do this, but this part is not really on topic here).
    import bottle, sqlite3, random, threading, time

    @bottle.route('/')
    def index():
        global committhread
        c = db.cursor()
        c.execute('INSERT INTO test VALUES (?)', (random.randint(0, 10000),))
        c.close()
        if not committhread:
            print('Calling commit()...')
            committhread = threading.Thread(target=commit)
            committhread.start()
        else:
            print('A commit is already planned.')
        return 'hello'

    def commit():
        global committhread
        print("We'll commit in 10 seconds.")
        time.sleep(10)  # I hope this doesn't block/waste CPU here?
        db.commit()
        print('Committed.')
        committhread = None

    db = sqlite3.connect('test.db', check_same_thread=False)
    db.execute("CREATE TABLE IF NOT EXISTS test (a int)")
    committhread = None
    bottle.run(port=80)
If you run this code, opening http://localhost/ once will plan a commit 10 seconds later. If you reopen http://localhost/ multiple times, less than 10 seconds later, you will see that it will be grouped in the same commit, as desired.
Note: this method is not "Do it every 10 seconds" (this would be a classic timer), but rather: "Do it 10 seconds later; if another INSERT comes in the meantime, do all of them together". If there is no INSERT during 60 minutes, with my method it won't do anything at all. With a timer it would still periodically call a function (and notice there's nothing to do).
Worth reading too: How to improve SQLite insert performance in Python 3.6?
1 Answer
What do you think of the following code? Is it safe to do like this?
In my opinion, it is not. So many things could go wrong.
One example: in this code there is no exception handling. If your program crashes for any reason, the delayed commit may never run. Fix: add an exception handler that does some cleanup, commits and closes the DB. Or better yet, just do that commit in the finally block.
By the way, the documentation says this about the close function:
This closes the database connection. Note that this does not automatically call commit(). If you just close your database connection without calling commit() first, your changes will be lost!
So it is a good idea to commit systematically.
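As a rough illustration of the finally idea, here is a minimal sketch. The helper name run_with_final_commit and its serve argument are hypothetical (serve stands in for bottle.run(...) or any long-running loop that writes through the connection); only the sqlite3 calls are real. Note that a finally block runs on exceptions and Ctrl+C, but not if the process is killed outright or the machine loses power.

```python
import sqlite3

def run_with_final_commit(db, serve):
    """Run serve() and always commit and close db afterwards.

    `serve` is a hypothetical stand-in for bottle.run(...) or any
    long-running callable that writes through `db`.
    """
    try:
        serve()
    finally:
        # Runs on normal return and when an exception propagates.
        db.commit()   # close() alone would discard uncommitted changes
        db.close()
```

In the question's script this would roughly amount to calling run_with_final_commit(db, lambda: bottle.run(port=80)) instead of calling bottle.run directly.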
time.sleep(10) # I hope this doesn't block/waste CPU here?
time.sleep blocks the calling thread. It is useless anyway because what you want here is a timer, not a thread. You can have a timer routine that runs every n seconds to perform a given task. But you should still have a commit in the finally block, so that all pending changes are written to the DB when the program ends, even after an exception.
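One possible shape for this, sketched with threading.Timer from the standard library (which is itself a thread under the hood, so this is a tidier version of the same idea rather than something fundamentally different). The class name DebouncedCommitter and its methods are made up for illustration, and the connection is assumed to have been opened with check_same_thread=False, as in the question. The timer is one-shot: it is armed by the first pending write and nothing runs while the server is idle.

```python
import sqlite3
import threading

class DebouncedCommitter:
    """Commit once, `delay` seconds after the first uncommitted write."""

    def __init__(self, db, delay=10.0):
        self.db = db                      # opened with check_same_thread=False
        self.delay = delay
        self._lock = threading.Lock()     # guards the connection and _timer
        self._timer = None

    def execute(self, sql, params=()):
        with self._lock:
            self.db.execute(sql, params)
            if self._timer is None:       # no commit planned yet: plan one
                self._timer = threading.Timer(self.delay, self._commit)
                self._timer.daemon = True
                self._timer.start()

    def _commit(self):
        with self._lock:
            self.db.commit()
            self._timer = None

    def flush(self):
        """Cancel any planned commit and commit now (e.g. from a finally block)."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
            self.db.commit()
```

The lock is there because Bottle may serve requests from several threads while the timer thread commits; without it, the check-then-start sequence on the timer is a race.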
Now to discuss the functionality more in depth:
You say:
In a Python Bottle server using SQLite, I noticed that doing a DB commit after each INSERT is not efficient: it can use 100ms after each client request.
That may not be 'efficient', but if I have to choose between slow and safe there is no hesitation. Have you actually measured how long it takes in your own environment?
On Stack Overflow you wrote:
I optimized my server to serve pages very fast (10ms), and it would be a shame to lose 100ms because of DB commit.
While I applaud your obsession with performance, does 100 ms really make a difference to your users? It normally takes more than 100 ms to load a page or even refresh a portion of it using Ajax or a websocket. The latency resides in the network transport. I don't know how your application is structured but my priority would be to deliver as little traffic as possible to the users. Websocket + client-side JS should do.
Perhaps using a different storage medium could improve IO performance. If you are not using an SSD, maybe you could consider it or at least test it.
Before writing code like this I would really try to exhaust all possibilities, but it is better (more reliable) to let SQLite handle things using the options that already exist. What have you tried so far?
Would this be acceptable to you?
PRAGMA schema.synchronous = OFF
With synchronous OFF (0), SQLite continues without syncing as soon as it has handed data off to the operating system. If the application running SQLite crashes, the data will be safe, but the database might become corrupted if the operating system crashes or the computer loses power before that data has been written to the disk surface. On the other hand, commits can be orders of magnitude faster with synchronous OFF.
Source: PRAGMA Statements
There is a risk of corruption in case of power loss, but this is no worse than what you are doing. If, on the other hand, data integrity is more important, you should stick to a full commit after every operation.
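From Python, the pragma is just another statement executed on the connection. A small sketch (the function name connect_fast is hypothetical): it has to be issued on each new connection, since the setting is not stored in the database file, and reading the pragma back returns 0 for OFF.

```python
import sqlite3

def connect_fast(path):
    """Open a connection with synchronous = OFF (faster commits, less durable)."""
    db = sqlite3.connect(path)
    # Must run outside a transaction; a fresh connection has none open.
    db.execute("PRAGMA synchronous = OFF")
    return db
```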
You should also have a look at Write-Ahead Logging. This may interest you if there are concurrent writes to your database. Otherwise opening the DB in EXCLUSIVE mode may bring some benefits (see the PRAGMA page for details).
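Switching to WAL is a one-line pragma as well. The sketch below (connect_wal is a hypothetical name) also sets synchronous = NORMAL, which is my own addition rather than something stated above: the SQLite documentation describes it as a common pairing with WAL, trading some durability on power loss for fewer syncs. Unlike synchronous, the WAL journal mode is persistent and stays set on the database file.

```python
import sqlite3

def connect_wal(path):
    """Open a file-backed database in WAL mode; return (connection, mode)."""
    db = sqlite3.connect(path)
    # The pragma returns the journal mode actually in effect.
    mode = db.execute("PRAGMA journal_mode = WAL").fetchone()[0]
    db.execute("PRAGMA synchronous = NORMAL")  # assumption: acceptable durability trade-off
    return db, mode
```

Checking the returned mode is worthwhile because SQLite silently keeps the old mode when it cannot switch (for example, in-memory databases do not support WAL).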
More detailed discussions:
Last but not least: transactions. SQLite starts an implicit transaction automatically every time you run a SQL statement and commits it after execution. You could initiate the BEGIN & COMMIT TRANSACTION statements yourself. So if you have a number of related writes, regroup them under one single transaction. Thus you do one commit for the whole transaction instead of one transaction per statement (there is more consistency too: in case an error occurs in the middle of the process you won't be left with orphaned records).
There are quite a few things you can try until you find the mix that is right for you.
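In Python's sqlite3 module, one convenient way to group writes is to use the connection as a context manager: the block is committed once on success and rolled back if an exception escapes. A minimal sketch, assuming the single-column test table from the question (insert_many is a hypothetical helper name):

```python
import sqlite3

def insert_many(db, values):
    """Insert all values in one transaction: one commit, or none on error."""
    with db:  # commits on success, rolls back if an exception is raised
        db.executemany("INSERT INTO test VALUES (?)", [(v,) for v in values])
```

If one row in the batch fails, none of the batch is kept, which is the "no orphaned records" property mentioned above.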
Thank you very much for your detailed answer. You're right: there are many other things to try before doing my "do it in 10 seconds, and group all further INSERT in the same commit" method. I just tried your PRAGMA suggestion, and this c.execute('PRAGMA journal_mode = OFF') gives a massive improvement. c.execute('PRAGMA synchronous = OFF'); makes it asynchronous so when we measure it's 0 ;) So it's probably done "when the current process has some spare time" and it's perfect for my needs. – Basj, May 5, 2020 at 18:24
A little remark about "time.sleep blocks the calling thread. It is useless anyway because what you want here is a timer, not a thread. You can have a timer routine that runs every n seconds to perform a given task." My method is slightly different though: it's not "do it every 10 seconds" (this would be a classic timer), but rather: "do it 10 sec later; if another INSERT comes in the meantime, do all of them together". If there is no INSERT during 60 minutes, with my method it won't do anything at all. With a timer it would still periodically call a function (and notice there's nothing to do). – Basj, May 5, 2020 at 18:31
Last thing: with synchronous = OFF, will the OS (Linux) write to disk as soon as it has time to do it, or are there cases in which the OS will wait several minutes before doing it? – Basj, May 5, 2020 at 18:55
Looks like PRAGMA synchronous = OFF works, but for those wondering about PRAGMA <schema>.synchronous = OFF, the default schema is main, as in PRAGMA main.synchronous = OFF – Terry Brown, Jan 14, 2022 at 20:39
db.commit() part (low-end server) so the order of magnitude is 100ms.