Monday, February 22, 2010

Asynchronous programming and MySQLdb

It was asked in the previous post whether or not there would be better async support in MySQLdb-2.0. The answer is a qualified yes.

libmysqlclient (the MySQL C client library) has blocking calls, and doesn't have a true async interface. In a nutshell, the three blocking calls that are most important are mysql_query(), mysql_store_result() or mysql_use_result(), and mysql_fetch_row() (sometimes).

The original design of MySQLdb uses mysql_store_result(), which stores the entire result set in a MYSQL_RESULT structure; in this case, mysql_fetch_row() does not block. To save memory, the result is immediately converted to Python types and then freed.

An alternative cursor implementation (SSCursor) uses mysql_use_result(), which does not internally buffer the result set; this does cause mysql_fetch_row() to block, however. A further complication is that no new queries can be issued on the connection until the entire result set has been fetched. This is the primary reason why mysql_store_result() is used by default, because overall it causes less problems.

The main issue with using mysql_store_result() it can consume a lot of memory the result set is large. There's still the option of using a LIMIT clause, but it's inconvenient. But mysql_use_result() has it's own problems, in that you have to be careful to cycle through the entire result set before issuing another query. Otherwise you will get the "commands out of sequence" error.

I am pretty sure that inside the C client library, the wire protocol is exactly the same, and mysql_store_result() is just pre-buffering everything.

So in the end, anything using the C API is going to have to deal with some blocking calls. However, there are some design changes that will make some of these limitations a bit easier to deal with.

First, all the various Cursor classes are going away, and there will be only one True Cursor.

By default, cursors will still use mysql_store_result() on most queries, because there is a lot that it works very well for. This includes INSERT, UPDATE, and DELETE, which do not return any rows, but also some meta-queries such as SHOW WARNINGS, SHOW TABLES, etc. which do return rows but always a relatively small number, and don't take a long time to execute.

SELECT statements, on the other hand, will be detected, and those queries will use mysql_use_result() instead so that rows are not buffered in the C client. They will only be fetched upon demand.

The irony of using mysql_use_result() on SELECTs is, you can't scroll (mysql_data_seek()) on the result set, and this is the primary use case for scrolling. However, I still expect to make scroll work, possibly in a limited way, because these cursors will still be somewhat buffered. There will be a user-configurable maximum row limit that will be buffered, and once that buffer is filled, the oldest rows will be discarded. This limit will probably be 1 row by default.

How will the "commands out of sequence" error be avoided? If another cursor is created, the first cursor will be flushed, i.e. any remaining rows in the result set, and any additional pending result sets, will be read so that the query can be issued. There is also a clear method for the cursor, which also reads all the rows, but instead of buffering them, it discards them. It also avoids doing any Python type conversion.

So if you've gotten this far, you're probably wondering: How does any of this affect asynchronous use? Well... it doesn't. The C client library just isn't designed for that sort of thing. But there is still hope.

I've wanted a ctypes implementation that would still use the libmysqlclient library but wouldn't require compiler tools to build (and hopefully shut up Windows users). At the time, ctypes was still very early, and had not made it's way into the Python standard library yet (included as of 2.5). Fortunately, Jason Coombs has a patch against MySQLdb (jaraco.mysql) which implements this. The reason I haven't integrate this patch it would replace the original driver, instead of being an option. MySQLdb was never designed for multiple drivers. This has made it pretty close to the top of my TODO list for MySQLdb, because there's another case where this would be useful: There currently is no way to have _mysql (the current C driver module) built against libmysqlclient and libmysqld (the embedded server library, which uses the same API) at the same time without using virtualenv or something like that.

OK, so now the async people are asking: So how does this help async programming again? Well... it still doesn't. To truly have a non-blocking async driver probably means there will need to be a new implementation that is designed with that in mind. A pure Python implementation could do this. I have a start on one which I got from Monty Taylor of MySQL some time ago. It's not asynchronous either, but could be made so. At least it looks like a good starting point.

But here's the thing: As far as I know, there is no standard database API for Python which supports asynchronous operation. And it seems like there should be one. Maybe it's time for PEP-249 to be extended for an asynchronous API. Otherwise every database implementor is going to end up doing their own thing, and it sounds like there is a need for this sort of thing.

New releases coming soon

Kyle Vanderbeek is going to take over as release manager for MySQLdb-1.2. We should have one more release candidate of 1.2.3 first, followed quickly by the final release.

Development on MySQLdb-2.0 has been progressing, and has recently moved to a Mercurial repository on SourceForge. This was imported from the SVN trunk. If you pull from the SVN trunk in the future, you may be disappointed.

2.0 is turning into a very major rewrite. I should have an alpha release soon. For now, the hg repository builds and passes all tests, but there are probably a few things that aren't thoroughly tested yet, particularly scrolling on cursors. I'll post more detail along with the alpha release.

Python3 support is not immediately in the works, and I probably won't work on it until I am close to a beta. At this point, I would target Python-3.1, maybe 3.2. 3.0 would probably work too.