Monday, February 22, 2010

Asynchronous programming and MySQLdb

It was asked in the previous post whether or not there would be better async support in MySQLdb-2.0. The answer is a qualified yes.

libmysqlclient (the MySQL C client library) has blocking calls, and doesn't have a true async interface. In a nutshell, the three blocking calls that are most important are mysql_query(), mysql_store_result() or mysql_use_result(), and mysql_fetch_row() (sometimes).

The original design of MySQLdb uses mysql_store_result(), which stores the entire result set in a MYSQL_RESULT structure; in this case, mysql_fetch_row() does not block. To save memory, the result is immediately converted to Python types and then freed.
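For example, this is roughly what the default, buffered path looks like from Python today; the connection parameters and the query are placeholders, and the comments note where the C calls described above happen underneath.

    # Sketch of the default, buffered path; connection parameters and the
    # query are placeholders.
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="test")
    cur = conn.cursor()

    # mysql_query() sends the statement; mysql_store_result() then pulls the
    # entire result set into the client before execute() returns.
    cur.execute("SELECT id, name FROM widgets")

    # mysql_fetch_row() does not block here: the rows are already in memory,
    # already converted to Python types.
    for row in cur.fetchall():
        print(row)

    cur.close()
    conn.close()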

An alternative cursor implementation (SSCursor) uses mysql_use_result(), which does not buffer the result set internally; this does cause mysql_fetch_row() to block, however. A further complication is that no new queries can be issued on the connection until the entire result set has been fetched. This is the primary reason mysql_store_result() is used by default: overall, it causes fewer problems.

The main issue with mysql_store_result() is that it can consume a lot of memory when the result set is large. There's still the option of using a LIMIT clause, but it's inconvenient. mysql_use_result() has its own problems, though: you have to be careful to cycle through the entire result set before issuing another query, or you will get the "commands out of sequence" error.
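By contrast, here is a minimal sketch of the unbuffered SSCursor path, again with placeholder connection details and table names, showing why the whole result set has to be drained before the connection can be reused.

    # Sketch of the unbuffered path; connection details and the query are placeholders.
    import MySQLdb
    import MySQLdb.cursors

    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="test",
                           cursorclass=MySQLdb.cursors.SSCursor)
    cur = conn.cursor()

    # mysql_use_result() is used here: rows stream from the server as they are
    # fetched, so mysql_fetch_row() can block waiting on the network, but the
    # client never holds the whole result set in memory.
    cur.execute("SELECT id, payload FROM big_table")
    for row in cur:
        print(row)

    # All remaining rows must have been read (or the cursor closed, which
    # discards them) before another query is issued on this connection;
    # otherwise you get the "commands out of sequence" error.
    cur.close()
    conn.close()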

I am pretty sure that inside the C client library, the wire protocol is exactly the same, and mysql_store_result() is just pre-buffering everything.

So in the end, anything using the C API is going to have to deal with some blocking calls. However, there are some design changes that will make some of these limitations a bit easier to deal with.

First, all the various Cursor classes are going away, and there will be only one True Cursor.

By default, cursors will still use mysql_store_result() on most queries, because it works very well for a lot of them. This includes INSERT, UPDATE, and DELETE, which do not return any rows, but also meta-queries such as SHOW WARNINGS, SHOW TABLES, etc., which do return rows, but always a relatively small number, and they don't take long to execute.

SELECT statements, on the other hand, will be detected, and those queries will use mysql_use_result() instead, so that rows are not buffered in the C client; they will only be fetched on demand.
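As a rough illustration only (this is not actual MySQLdb 2.0 code), the detection could be as simple as looking at the leading keyword of the statement:

    import re

    # Rough illustration only (not actual MySQLdb 2.0 code): only statements
    # that start with SELECT take the unbuffered mysql_use_result() path;
    # everything else, including small meta-queries, stays fully buffered.
    _SELECT_RE = re.compile(r"^\s*SELECT\b", re.IGNORECASE)

    def should_buffer(statement):
        """Return True if the whole result set should be stored up front."""
        return not _SELECT_RE.match(statement)

    assert should_buffer("UPDATE t SET x = 1")        # no rows to return
    assert should_buffer("SHOW WARNINGS")             # small, fast result set
    assert not should_buffer("select * from big_t")   # streamed on demand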

The irony of using mysql_use_result() on SELECTs is that you can't scroll (mysql_data_seek()) on the result set, and SELECTs are the primary use case for scrolling. However, I still expect to make scroll work, possibly in a limited way, because these cursors will still be somewhat buffered: there will be a user-configurable maximum number of rows to buffer, and once that buffer is filled, the oldest rows will be discarded. This limit will probably be 1 row by default.
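To make the idea concrete, here is a hypothetical sketch of such a bounded row window (my assumption about the design, not actual code), using a deque that silently drops the oldest rows:

    from collections import deque

    # Hypothetical sketch of a bounded scroll window (an assumption about the
    # design, not actual code): only the most recent `maxrows` rows are kept,
    # so scrolling backwards works only within that window.
    class RowWindow(object):
        def __init__(self, maxrows=1):
            self.buffer = deque(maxlen=maxrows)  # oldest rows fall off the front
            self.fetched = 0                     # total rows fetched so far

        def append(self, row):
            self.buffer.append(row)
            self.fetched += 1

        def row_at(self, position):
            """Return the row at an absolute position, if it is still buffered."""
            start = self.fetched - len(self.buffer)
            if not (start <= position < self.fetched):
                raise IndexError("row %d is outside the buffered window" % position)
            return self.buffer[position - start]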

How will the "commands out of sequence" error be avoided? If another cursor is created, the first cursor will be flushed, i.e. any remaining rows in the result set, and any additional pending result sets, will be read so that the new query can be issued. There will also be a clear method on the cursor, which also reads all the remaining rows, but discards them instead of buffering them; it also avoids doing any Python type conversion.
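As a rough illustration, draining at the user level looks like the sketch below; the real clear() would live inside the cursor and, as described above, skip Python type conversion entirely, which this standalone version cannot do.

    def clear(cursor):
        """Discard any remaining rows so a new query can be issued.

        Standalone sketch only: it drains rows through the normal fetch path,
        whereas the proposed clear() would live inside the cursor and skip
        the Python type conversion entirely.
        """
        while cursor.fetchone() is not None:
            pass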

So if you've gotten this far, you're probably wondering: How does any of this affect asynchronous use? Well... it doesn't. The C client library just isn't designed for that sort of thing. But there is still hope.

I've wanted a ctypes implementation that would still use the libmysqlclient library but wouldn't require compiler tools to build (and hopefully shut up Windows users). At the time, ctypes was still very early and had not made its way into the Python standard library yet (it has been included since 2.5). Fortunately, Jason Coombs has a patch against MySQLdb (jaraco.mysql) which implements this. The reason I haven't integrated this patch is that it would replace the original driver instead of being an option; MySQLdb was never designed for multiple drivers. This has made pluggable drivers pretty close to the top of my TODO list for MySQLdb, because there's another case where this would be useful: there is currently no way to have _mysql (the current C driver module) built against both libmysqlclient and libmysqld (the embedded server library, which uses the same API) at the same time without using virtualenv or something like that.
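For the curious, the ctypes approach looks roughly like this minimal sketch; this is not jaraco.mysql's actual code, and the library name and connection parameters are placeholders.

    # Minimal sketch of the ctypes approach (not jaraco.mysql's actual code);
    # the library name, user, and password below are placeholders.
    import ctypes
    import ctypes.util

    libname = ctypes.util.find_library("mysqlclient") or "libmysqlclient.so"
    libmysql = ctypes.CDLL(libname)

    # Declare prototypes so pointers aren't truncated to ints on 64-bit systems.
    libmysql.mysql_init.restype = ctypes.c_void_p
    libmysql.mysql_init.argtypes = [ctypes.c_void_p]
    libmysql.mysql_real_connect.restype = ctypes.c_void_p
    libmysql.mysql_real_connect.argtypes = (
        [ctypes.c_void_p] + [ctypes.c_char_p] * 4 +
        [ctypes.c_uint, ctypes.c_char_p, ctypes.c_ulong])
    libmysql.mysql_close.restype = None
    libmysql.mysql_close.argtypes = [ctypes.c_void_p]

    conn = libmysql.mysql_init(None)
    if not libmysql.mysql_real_connect(conn, b"localhost", b"user", b"secret",
                                       b"test", 0, None, 0):
        raise RuntimeError("mysql_real_connect failed")
    libmysql.mysql_close(conn)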

OK, so now the async people are asking: So how does this help async programming again? Well... it still doesn't. A truly non-blocking async driver probably means a new implementation designed with that in mind. A pure Python implementation could do this. I have a start on one which I got from Monty Taylor of MySQL some time ago. It's not asynchronous either, but it could be made so; at least it looks like a good starting point.
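To make the difference concrete: a pure Python driver owns the raw socket, so an event loop can do the waiting itself rather than blocking inside a C call. A purely hypothetical sketch (the conn.socket attribute is assumed, not an existing API):

    import select

    # Hypothetical sketch only: `conn.socket` is an assumed attribute, not an
    # existing API. The point is that a pure Python driver can expose its file
    # descriptor and let select() (or any event loop) do the waiting, instead
    # of blocking deep inside libmysqlclient.
    def wait_for_data(connections, timeout=None):
        """Return the connections whose sockets have data ready to be read."""
        by_socket = dict((conn.socket, conn) for conn in connections)
        readable, _, _ = select.select(list(by_socket), [], [], timeout)
        return [by_socket[sock] for sock in readable]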

But here's the thing: as far as I know, there is no standard database API for Python which supports asynchronous operation, and it seems like there should be one. Maybe it's time for PEP-249 to be extended with an asynchronous API. Otherwise, every database implementor is going to end up doing their own thing, and there clearly is demand for this sort of thing.

6 comments:

Unknown said...

Hi!

The libdrizzle C library, which speaks the MySQL protocol, was designed with non-blocking and async use in mind. Monty Taylor and Max Goodman have been working on the Python API around this and it's mostly functional. Not all of the async components have been exposed yet in Python, but it wouldn't be too much work to do so.

Monty Taylor said...

Funny - I was just going to say the same thing as Eric. I should also mention that libdrizzle is BSD licensed, FWIW.

Also, Geert has written a pure-python implementation of the protocol:

https://launchpad.net/myconnpy

Unknown said...

Drizzle is one of the other reasons why I want a pluggable back-end driver. How stable is that API now?

Noah Gift said...

Looking forward to hearing more about an asynch driver. Glad it is on your radar.

Anonymous said...

I'd also love some tasty async support, because then Eventlet could make it look blocking to the developer again, while retaining the performance benefits of asynchronous I/O.

I'd love it if MySQLdb had an interface similar to the one that Daniele Varrazzo is adding to psycopg2. Check it out: https://lists.secondlife.com/pipermail/eventletdev/2010-April/000800.html

Dhruv Matani said...

Hey, thanks for the detailed update on async I/O support for MySQLdb.
Could we not do an async model where the underlying calls to MySQL are still blocking, but they can all be multiplexed on a single select?