2
\$\begingroup\$

I am parsing a huge XML file. It contains some million article entries like this one:

<article key="journals/cgf/HaeglerWAGM10" mdate="2010-11-12">
 <author>Simon Haegler</author>
 <author>Peter Wonka</author>
 <author>Stefan Müller Arisona</author>
 <author>Luc J. Van Gool</author>
 <author>Pascal Müller</author>
 <title>Grammar-based Encoding of Facades.</title>
 <pages>1479-1487</pages>
 <year>2010</year>
 <volume>29</volume>
 <journal>Comput. Graph. Forum</journal>
 <number>4</number>
 <ee>http://dx.doi.org/10.1111/j.1467-8659.2010.01745.x</ee>
 <url>db/journals/cgf/cgf29.html#HaeglerWAGM10</url>
</article>

I step through the file and parse those articles by LXML. If I let the code without storing the items into my database it makes some 1000 entries in ~3 seconds. But if I activate the storage which is done by the function below it makes some 10 entries per second. Is this normal? I remember parsing the file once upon a time and the database was not such a bottleneck. But I had a different approach... (looking through my files to find it)

def add_paper(paper, cursor):
 questionmarks = str(('?',)*len(paper)).replace("'", "") # produces (?, ?, ?, ... ,?) for oursql query
 keys, values = paper.keys(), paper.values()
 keys = str(tuple(keys)).replace("'", "") # produces (mdate, title, ... date, some_key)
 query_paper = '''INSERT INTO dblp2.papers {0} VALUES {1};'''.\
 format(keys, questionmarks)
 values = tuple(v.encode('utf8') for v in values)
 cursor.execute(query_paper, values)
 paper_id = cursor.lastrowid
 return paper_id
def populate_database(paper, authors, cursor):
 paper_id = add_paper(paper, cursor)
 query_author ="""INSERT INTO dblp2.authors (name) VALUES (?) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)"""
 query_link_table = "INSERT INTO dblp2.author_paper (author_id, paper_id) VALUES (?, ?)"
 for author in authors:
 cursor.execute(query_author, (author.encode('utf8'),))
 author_id = cursor.lastrowid
 cursor.execute(query_link_table, (author_id, paper_id))

Edit:

I was able to narrow the problem to those three cursor.executes. Perhaps it is a database problem. I will ask over at Stack Overflow, if someone has an idea, why it is that slow. Meanwhile I would be interested if the code can be refactored to be more pythonic. Any ideas?

Edit 2:

If use a similar approach like me storing row per row into the database, don't use the InnoDB engine. It's slower in orders of magnitude. The code is speeding up again, after I changed the engine.

Malachi
29k11 gold badges86 silver badges188 bronze badges
asked Jul 7, 2011 at 16:52
\$\endgroup\$

2 Answers 2

3
\$\begingroup\$

You might want to try doing everything inside a transaction. It may be faster that way. I think with InnoDB you might be creating a committing a transaction for every statement which will be slow.

keys = str(tuple(keys)).replace("'", "")

That's an odd way of doing that, use

keys = '(%s)' % ','.join(keys)
answered Jul 8, 2011 at 8:38
\$\endgroup\$
2
  • \$\begingroup\$ Would you please explain what you mean by doing everything inside a transactio, @Winston? \$\endgroup\$ Commented Jul 8, 2011 at 16:28
  • 1
    \$\begingroup\$ @Aufwind, lookup database transactions. They allow you to group together statements so that either they all happen or none of them happen. InnoDB supports them whereas the other MySQL engine's don't. I think that might the reason for the speed diference. \$\endgroup\$ Commented Jul 8, 2011 at 23:30
3
\$\begingroup\$

Have you considered cursor.executemany? You could potentially build up a list of several articles, and then process all at once. The only limitation to doing this would be how much memory you have.

If you've got relatively few authors and lots of papers, you could potentially hold a dictionary of author names to database id. This would make it easier to get the database id without querying the database.

answered Jun 23, 2014 at 14:27
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.