I have a TSV file containing the mapping between two kinds of identifiers:
accession accession.version taxid gi
V00184 V00184.1 44689 7184
V00185 V00185.1 44689 7186
V00187 V00187.1 44689 7190
X07806 X07806.1 7227 8179
Basically, I want to be able to get the taxid from an accession number. I thought I could put this in a database with a PRIMARY KEY on the accession field. This is what I did, and it works, but I have around 1.5 billion lines and it takes a long time.
import sqlite3
import time

start = time.time()

connection = sqlite3.connect("test.sqlite")
cursor = connection.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS map (
    accession TEXT PRIMARY KEY,
    accession_version TEXT,
    taxid TEXT, gi TEXT
)""")

def read_large_file(f):
    """Generator for the file"""
    for l in f:
        yield l.strip().split()

with open("test_file.map", "r") as f:
    next(f)  # ignore header
    cursor.executemany("INSERT INTO map VALUES (?, ?, ?, ?)", read_large_file(f))

cursor.close()
connection.commit()
connection.close()

print(time.time() - start)
Do you have any tricks or ideas I could use to speed up the insert? (I will only do very basic SELECTs on the data, using the accession primary key.)
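For example, a lookup along these lines (the accession value is just taken from the sample rows above):

import sqlite3

connection = sqlite3.connect("test.sqlite")
cursor = connection.cursor()
cursor.execute("SELECT taxid FROM map WHERE accession = ?", ("V00184",))
print(cursor.fetchone())  # e.g. ('44689',) given the sample data above
connection.close()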
1 Answer
First, some comments on your code:
An sqlite3 connection can be used as a context manager. This ensures that the statement is committed if it succeeds and rolled back in case of an exception. Unfortunately it does not also close the connection afterwards:

with sqlite3.connect("test.sqlite") as connection, open("test_file.map") as f:
    connection.execute("""
    CREATE TABLE IF NOT EXISTS map (
        accession TEXT PRIMARY KEY,
        accession_version TEXT,
        taxid TEXT, gi TEXT
    )""")
    next(f)  # ignore header
    connection.executemany("INSERT INTO map VALUES (?, ?, ?, ?)", read_large_file(f))
connection.close()
You should separate your functions from the code calling them. The general layout for Python code is to first define your classes, then your functions, and finally have a main block protected by an if __name__ == "__main__": guard, which allows importing from the script without executing all the code.

open automatically opens a file in read mode if not specified otherwise, so passing "r" is not needed.
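A minimal sketch of that layout, folding in the points above (the function name fill_database is just an illustration, not something from the original code):

import sqlite3

def read_large_file(f):
    """Generator yielding the columns of each line."""
    for line in f:
        yield line.strip().split()

def fill_database(db_name, map_file):
    with sqlite3.connect(db_name) as connection, open(map_file) as f:
        connection.execute("""
        CREATE TABLE IF NOT EXISTS map (
            accession TEXT PRIMARY KEY,
            accession_version TEXT,
            taxid TEXT, gi TEXT
        )""")
        next(f)  # ignore header
        connection.executemany("INSERT INTO map VALUES (?, ?, ?, ?)",
                               read_large_file(f))
    connection.close()

if __name__ == "__main__":
    fill_database("test.sqlite", "test_file.map")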
That being said, if you have a billion lines, basically any approach is probably going to be slow. Here is an alternative approach using dask. It may or may not be faster; you will have to test it. The usage is very similar to pandas, except that the computations are only performed once you commit to them with a call to compute().
First, to install dask:
pip install dask[dataframe] --upgrade
Then, for a lookup like the one you mention, finding a specific gi in the table:
from dask import dataframe
df = dataframe.read_csv("test_file.map", sep="\t")
df[df.gi == 7184].compute()
# accession accession.version taxid gi
# 0 V00184 V00184.1 44689 7184
In the call to dataframe.read_csv you can set it to read the file in blocks if needed, e.g. in 25 MB chunks:
df = dataframe.read_csv("test_file.map", sep="\t", blocksize=25e6)
Thank you! I've tried dask and it wasn't faster for this particular case, but it seems like I could use it somewhere else. Something I noticed is the PRIMARY KEY statement: removing it seemed to speed up the insertion. However, SQLite3 does not provide ALTER TABLE ADD CONSTRAINT functionality. I'm trying with PostgreSQL to see if adding the constraint afterwards leads to any significant improvement. – Plopp, Apr 24, 2019 at 15:13
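A minimal sketch of that idea in SQLite itself, assuming that building a unique index after the bulk load is acceptable (the index name idx_accession is made up for the illustration):

import sqlite3

with sqlite3.connect("test.sqlite") as connection, open("test_file.map") as f:
    # no PRIMARY KEY here, so the insert does not have to maintain an index
    connection.execute("""
    CREATE TABLE IF NOT EXISTS map (
        accession TEXT,
        accession_version TEXT,
        taxid TEXT, gi TEXT
    )""")
    next(f)  # ignore header
    connection.executemany("INSERT INTO map VALUES (?, ?, ?, ?)",
                           (line.strip().split() for line in f))
    # enforce uniqueness and speed up lookups only once everything is loaded
    connection.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_accession ON map (accession)")
connection.close()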