I have a TSV file containing the mapping between two kinds of identifiers:
accession accession.version taxid gi
V00184 V00184.1 44689 7184
V00185 V00185.1 44689 7186
V00187 V00187.1 44689 7190
X07806 X07806.1 7227 8179
Basically, I want to be able to get the taxid from an accession number. I thought I could put this in a database with a PRIMARY KEY on the accession field. This is what I did, and it works, but I have around 1.5 billion lines and it takes a long time.
import sqlite3
import time

start = time.time()

connection = sqlite3.connect("test.sqlite")
cursor = connection.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS map (
    accession TEXT PRIMARY KEY,
    accession_version TEXT,
    taxid TEXT, gi TEXT
)""")

def read_large_file(f):
    """Generator for the file"""
    for l in f:
        yield l.strip().split()

with open("test_file.map", "r") as f:
    next(f)  # ignore header
    cursor.executemany("INSERT INTO map VALUES (?, ?, ?, ?)", read_large_file(f))

cursor.close()
connection.commit()
connection.close()

print(time.time() - start)
Do you have any tricks or ideas I could use to speed up the insert? (I will only do very basic SELECTs on the data, using the accession primary key.)
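For example, a lookup along these lines (the accession value is just taken from the sample rows above):

import sqlite3

connection = sqlite3.connect("test.sqlite")
cursor = connection.cursor()
cursor.execute("SELECT taxid FROM map WHERE accession = ?", ("V00184",))
print(cursor.fetchone())  # e.g. ('44689',) given the sample data above
connection.close()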
1 Answer
First, some comments on your code:
An sqlite3 connection can be used as a context manager. This ensures that the statement is committed if it succeeds and rolled back in case of an exception. Unfortunately it does not also close the connection afterwards:

with sqlite3.connect("test.sqlite") as connection, open("test_file.map") as f:
    connection.execute("""
    CREATE TABLE IF NOT EXISTS map (
        accession TEXT PRIMARY KEY,
        accession_version TEXT,
        taxid TEXT, gi TEXT
    )""")
    next(f)  # ignore header
    connection.executemany("INSERT INTO map VALUES (?, ?, ?, ?)", read_large_file(f))
connection.close()
You should separate your functions from the code calling them. The general layout for Python code is to first define your classes, then your functions, and finally have a main block protected by an if __name__ == "__main__": guard, which allows importing from the script without executing all the code.

open automatically opens a file in read mode if not specified otherwise, so passing "r" is not needed.
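A minimal sketch of that layout, folding in the points above (the function name fill_database is just an illustration, not something from the original code):

import sqlite3

def read_large_file(f):
    """Generator yielding the columns of each line."""
    for line in f:
        yield line.strip().split()

def fill_database(db_name, map_file):
    with sqlite3.connect(db_name) as connection, open(map_file) as f:
        connection.execute("""
        CREATE TABLE IF NOT EXISTS map (
            accession TEXT PRIMARY KEY,
            accession_version TEXT,
            taxid TEXT, gi TEXT
        )""")
        next(f)  # ignore header
        connection.executemany("INSERT INTO map VALUES (?, ?, ?, ?)",
                               read_large_file(f))
    connection.close()

if __name__ == "__main__":
    fill_database("test.sqlite", "test_file.map")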
That being said, if you have a billion lines, basically any approach is probably going to be slow. Here is an alternative approach using dask. It may or may not be faster; you will have to test it. The usage is very similar to pandas, except that the computations are only performed once you commit to them with a call to compute().
First, to install dask:
pip install dask[dataframe] --upgrade
Then, for a lookup like the one you mention, finding a specific gi in the table:
from dask import dataframe
df = dataframe.read_csv("test_file.map", sep="\t")
df[df.gi == 7184].compute()
# accession accession.version taxid gi
# 0 V00184 V00184.1 44689 7184
In the call to dataframe.read_csv you can set it to read the file in blocks if needed, e.g. in 25 MB chunks:
df = dataframe.read_csv("test_file.map", sep="\t", blocksize=25e6)
Thank you! I've tried dask and it wasn't faster for this particular case, but it seems like I could use it somewhere else. Something I noticed is the PRIMARY KEY statement: removing it seemed to speed up the insertion. However, SQLite3 does not provide ALTER TABLE ADD CONSTRAINT functionality. I'm trying with PostgreSQL to see if adding the constraint afterwards leads to any significant improvement. – Plopp, Apr 24, 2019 at 15:13
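A minimal sketch of that idea in SQLite itself, assuming that building a unique index after the bulk load is acceptable (the index name idx_accession is made up for the illustration):

import sqlite3

with sqlite3.connect("test.sqlite") as connection, open("test_file.map") as f:
    # no PRIMARY KEY here, so the insert does not have to maintain an index
    connection.execute("""
    CREATE TABLE IF NOT EXISTS map (
        accession TEXT,
        accession_version TEXT,
        taxid TEXT, gi TEXT
    )""")
    next(f)  # ignore header
    connection.executemany("INSERT INTO map VALUES (?, ?, ?, ?)",
                           (line.strip().split() for line in f))
    # enforce uniqueness and speed up lookups only once everything is loaded
    connection.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_accession ON map (accession)")
connection.close()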