Thursday, January 03, 2008

Django on Jython: Minding the Gap

Summary

The most important thing to know about Django on Jython is that we are almost there, and with clean code. End-to-end functionality is demonstrated by the admin tool running in full CRUD, along with a substantial number of unit tests and syncdb. So far this has required only 6 lines of changes to Django trunk. (There will be more, however; see below.)

Running on Jython

To run Django on Jython, with a PostgreSQL backend, the following steps are necessary:

  • Use the Modern branch of Jython. This consolidates the bug fixes, workarounds, and patches of numerous people - plus a bunch more - in a stable, almost-ready-to-be-merged-into-trunk version of Jython. The most important aspect is that we have tried to make Jython conform more to CPython, using Django as our guide, although there are some gaps - especially where Django had already incorporated fixes. Our driving goal is to close these gaps over time. Please note that this is intended to be stable, performant code.

  • Use the Django trunk (tested with rev 6992, later should be OK too).
  • Apply these two patches, django.dispatch.robustapply (diff) and django.views.debug (diff), both due to Leo Soto. I would imagine these will be in Django trunk soon.

  • Copy these three files from CPythonLib to Lib: gettext.py, locale.py, optparse.py. Please note that these files are only partially working on Jython, which is why they haven't been promoted yet (gettext.py actually works, as verified by test_gettext.py, but depends on the still-failing locale.py). But they are very close, and they appear to be fine for Django. Certainly fine for this round of development!
  • Use the database backend zxjdbc_postgresql, which was contributed by Leo Soto. Frank Wierzbicki has an experimental backend for MySQL; this should be incorporated soon.

Status

Here's what works:

syncdb and the very cool Django admin run; many unit tests pass. You can run with internationalization enabled. You do need to run the dev server with --noreload for now. We need to document here how to run with modjy, which is Alan Kennedy's servlet container for WSGI apps.

In running the model unit tests, here are the things we seem to be missing, accounting for most of the approximately 75 failures:

  • Many doctests are fragile, because they depend on the dict traversal ordering; in Jython, this is different from CPython, and if we adopt ConcurrentHashMap, it's not even repeatable. This would seem to be a pervasive bug in Django.

  • We still have some encoding problems, again seen in doctests. One example: output is expected to be lower-case hex, not upper case. I fixed the problem in PyUnicode, but there are more places.

  • Problem with the ManagerDescriptor handling, in django.db.models.manager.

  • No decorators yet! (But they are coming soon, and are now available experimentally for Jython in the newcompiler work I have been leading.)

There may be some other rough categories; we need to look at the failures more systematically. All that doctest noise is certainly annoying!

Next Steps

On the Django front, get more of the unit tests running!

Before we can push modern into trunk, the following needs to be done:

  • The test_extcall unit test currently fails. This appears to be a dependency on dict traversal being repeatable, a bad assumption. However, it's a mind bending test. The 2.3 version is particularly problematic because it's not modular at all. Google's GHOP has just produced an improved version for Python 2.6 - we will look at this as a starting point.

  • Tristan King provided a near complete subset of the functionality for time.strptime, as implemented in org.python.modules.time.Time. This needs to be enhanced. I just tested this, and all unit tests in the CPythonLib version of test_time now pass except for strptime -- specifically the conversion specifier '%c' -- so we can also move to that, and discard our Jython version, when this is completed. That should be soon!

  • Decide whether we should use ConcurrentHashMap or not as the backing hash map for dict and __dict__. CHM introduces creation overhead, but it should prove to be far more scalable on multicore systems. The programming model is also far nicer with respect to Jython.

19 Comments:

At 1:07 AM , Blogger Charles Oliver Nutter said...

Excellent news...keep it up!

 
At 5:44 AM , Blogger AkitaOnRails said...

Awesome! I always wished Jython to speed up - and unfortunately I am not at the level to do that. I am very pleased that you took the challenge. As a Ruby on Rails developer I only have great things to say about Django and it will be great to have it running over the JVM. That can only be described as a win-win situation.

 
At 7:38 AM , Blogger Stéfane said...

Awesome x2!

Thank you Jim for leading this project. And thanks to all the contributors.

 
At 8:19 AM , Blogger pbx said...

Two things re the dict-related doctest failures: First, have you filed tickets in the Django bug tracker on these so they can be cleaned up (as it sounds like they should)? Second, my understanding is that dict iteration in Python is expected to be "arbitrary but stable": repeatable as long as the dict isn't modified. If using ConcurrentHashMap will violate this principle then I think Jython will have problems that extend beyond Django.

In any case, keep us posted. More Django deployment options are always good, so this is cool news!

 
At 12:31 PM , Blogger A.M. Kuchling said...

I'm not sure dict iteration is intended to be 'arbitrary but stable' -- I could imagine a dictionary implementation that did an iteration, and then checked the hash table's load factor, possibly resizing the table and hence changing the order. This is a good question for python-dev.

 
At 1:02 PM , Blogger Tim said...

Python dicts are indeed required to be "arbitrary but stable". Look at section 3.8 of the reference manual, specifically footnote 3:

Mapping Types.

Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary's history of insertions and deletions. If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond. This allows the creation of (value, key) pairs using zip(): "pairs = zip(a.values(), a.keys())". The same relationship holds for the iterkeys() and itervalues() methods: "pairs = zip(a.itervalues(), a.iterkeys())" provides the same value for pairs. Another way to create the same list is "pairs = [(v, k) for (k, v) in a.iteritems()]".
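
As a small illustration of the guarantee quoted above, the following sketch builds the (value, key) pairs both ways and checks that they line up. The dict contents are made up for the example, and it uses items()/keys()/values() rather than the iter* variants so that it also runs on newer Pythons.

```python
# Hypothetical sample data; any dict will do.
a = {"headline": 50, "pub_date": 10, "slug": 25}

# With no intervening modification, values() and keys() must correspond
# position by position, whatever the (arbitrary) traversal order is.
pairs_zip = list(zip(a.values(), a.keys()))

# The same pairs, built from a single traversal of the items.
pairs_items = [(v, k) for (k, v) in a.items()]

print(pairs_zip == pairs_items)
```

The check says nothing about which order comes out, only that repeated traversals of an unmodified dict agree with each other.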

 
At 2:26 PM , Blogger Jim Baker said...

@stéfane:
I'm just one of many people who've worked on this! We wouldn't be here if not for the amazing advances recently that have been done on Jython trunk, the modern branch is just taking advantage of all that.

@pbx:
Very early in the process here! Leo Soto is the author of the two Django patches we need, so he's submitting them now as I understand it. But we will work with them to fix the bugs - assuming we are not mistaken ourselves!

But I'm pretty confident here. Take a look at tests/modeltests/model_forms. You will see in there the literal text it's expecting after print f:

input type="text" name="headline" maxlength="50" (removing HTML here...)

What does the Jython version spit out?
input type="text" maxlength="50" name="headline"

That looks like the wrong sort of dependency to me. I haven't looked through the Django code here implementing this part, but one would think it's using a dict. So yeah, I'll report that bug soon, just as soon as I verify those details!
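
To make the point concrete, here is a minimal sketch of a comparison that ignores attribute order, using the standard library HTML parser. The angle brackets are restored by assumption (they were stripped from the lines above), and parse_tags is a hypothetical helper, not something in Django's test suite.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collect (tag, attribute dict) pairs for every start tag seen."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; a dict drops the order.
        self.tags.append((tag, dict(attrs)))


def parse_tags(html):
    """Hypothetical helper: return the start tags of an HTML fragment."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Reconstructed from the two outputs quoted above.
expected = '<input type="text" name="headline" maxlength="50">'
actual = '<input type="text" maxlength="50" name="headline">'

print(expected == actual)                          # plain text comparison
print(parse_tags(expected) == parse_tags(actual))  # order-insensitive comparison
```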

@tim:
We can certainly use a synchronizedMap() instead; the idea here is to avoid unnecessary synchronization overhead. I think those of us interested in the Java platform are coming out from two directions (possibly both). 1. Integration with the Java ecosystem. 2. Take advantage of a well-defined memory and threading model, along with great libraries for concurrency. So this is with respect to #2. CHM is just cleaner, and the example you cite is superseded I think for most of us by iteritems (which maps nicely to EntrySet, which Python 3 of course is modeling against). But given the weight of that footnote, this sounds like a scenario where CHM should be made an optional implementation of dict. I think that settles our question.

 
At 4:23 PM , Blogger Brett said...

time.strptime since Python 2.3 has been implemented in pure Python (its hidden name is _strptime). So if you can have your implementation of 'time' just use that code you should be able to pass the %c test without issue.
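
A quick way to exercise the '%c' specifier is a round trip through strftime and strptime. This is only a sketch of such a check, not the actual test in test_time, and it assumes the current locale's '%c' format is one that strptime can parse back (true for the default C locale).

```python
import time

# Build a known struct_time from an unambiguous numeric format.
original = time.strptime("2008-01-03 10:30:00", "%Y-%m-%d %H:%M:%S")

# Format it with the locale's date-and-time representation, then parse it back.
text = time.strftime("%c", original)
parsed = time.strptime(text, "%c")

# Compare year, month, day, hour, minute, second.
print(parsed[:6] == original[:6])
```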

 
At 4:58 PM , Blogger Malcolm Tredinnick said...

Jim,

The dictionary order issue you're seeing is a kind of well-known PITA. It's pretty much accidental (but fortunate) that it hasn't bitten us across the various CPython implementations. It's one of those areas where *only* the tests care about the behaviour, so adding code to core to work around it would be fundamentally wrong (adding unnecessary overhead), but working around it at the test output level is impossible.

I've put some thought into this before without coming up with any great ideas. Checking the internal data structures (rather than the print output) would be possible, but would defeat the purpose of self-documenting doctests. Suggestions welcome.

 
At 7:39 PM , Blogger Lyndon said...

Malcolm, the literal html string attribute order is 'sorted' so isn't it logical to sort the hash keys in the test?

Sort the hash keys and reorder the string to have the attributes in sort order.

I don't know the code, is this possible?

 
At 9:09 PM , Blogger Jim Baker said...

@brett:
You certainly would know about time.strptime! Thanks, I'll take a look at that.

@malcolm:
I wonder if it's possible to make doctest smarter about working with dict, through some sort of magic that dynamic languages enable. Then we can preserve its goodness. But you're right, this is just an issue in these unit tests, problematic as this is for TDD.

 
At 10:24 PM , Blogger Tom Palmer said...

I'm still looking for Trac and Mercurial on Java. Sounds like they could happen someday. Sounds great.

 
At 7:35 AM , Blogger Fredrik said...

To fix a dict doctest, you can simply change:

>>> foo
{'key': 'value'}

to

>>> foo == {'key': 'value'}
True

A small drawback is that you won't see what the difference is if the test breaks; to work around that, change it to e.g.:

>>> def items(dict):
...     return sorted(dict.items())
...

>>> items(foo)
[('key', 'value')]

(the items helper can of course be reused)

Likewise, when serializing data from a dictionary to XML/HTML etc, make sure to *always* do a lexical sort of the output. Looping over sorted(d.items()) is no harder than looping over d.items()...
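
A minimal sketch of that last suggestion, applied to HTML attributes: render_attrs is a hypothetical helper (not Django's actual widget code) that always emits attributes in sorted key order, so two dicts with the same contents serialize identically however they were built or traversed.

```python
from xml.sax.saxutils import quoteattr


def render_attrs(attrs):
    """Hypothetical helper: render a dict as HTML attributes, sorted by name."""
    # quoteattr escapes the value and wraps it in quotes.
    return " ".join(
        "%s=%s" % (name, quoteattr(value))
        for name, value in sorted(attrs.items())
    )


# Same contents, built in a different order.
first = {"type": "text", "name": "headline", "maxlength": "50"}
second = {"maxlength": "50", "name": "headline", "type": "text"}

print(render_attrs(first))
print(render_attrs(first) == render_attrs(second))
```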

 
At 9:10 AM , Blogger linnuxxy said...

So will it work in a servlet container like Tomcat?

 
At 10:26 AM , Blogger Jim Baker said...

@fredrik:
First, hi! I really enjoyed my visit to Linköping.

We can certainly make doctests robust in this fashion. However, there may be a couple of other options.

1. Maybe we could optionally inject the sorting when serializing in the context of doctest. Then this would fix these tests, without imposing the (admittedly) small overhead for normal usage.

2. Taking advantage of the extensibility in doctest.OutputChecker, when comparing against XML/HTML or dict literals, do the right thing. But I've not thought through the implications here...
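
Option 2 might look roughly like the sketch below, covering only the dict-literal case. LiteralAwareChecker is a hypothetical name, and the sketch assumes a Python that provides ast.literal_eval, which is newer than what Jython offers at the moment.

```python
import ast
import doctest


class LiteralAwareChecker(doctest.OutputChecker):
    """Treat expected and actual output as equal if they are equal literals."""

    def check_output(self, want, got, optionflags):
        try:
            # Two dict literals compare equal regardless of key order.
            if ast.literal_eval(want.strip()) == ast.literal_eval(got.strip()):
                return True
        except (ValueError, SyntaxError, TypeError):
            pass  # Not a Python literal; use the normal text comparison.
        return doctest.OutputChecker.check_output(self, want, got, optionflags)


checker = LiteralAwareChecker()
print(checker.check_output("{'a': 1, 'b': 2}\n", "{'b': 2, 'a': 1}\n", 0))
```

Such a checker would be passed in as doctest.DocTestRunner(checker=LiteralAwareChecker()). It does nothing for HTML output, which would need its own comparison.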

@linnuxxy:
Tomcat should work, via modjy. Expect a post soon on the specifics.

 
At 2:13 AM , Blogger Raphaël Valyi said...

Congrats to you folks!

I would like to see TinyERP server (Python based) and http wrapper (eTiny) on the JVM once they provide some alternatives for the few native libs they use. TinyERP seems to be far better architected than all the Java-based ERPs I could find indeed...

Then we could do effective ERP programming in JRuby and JRor! Well, not that soon, but I would be happy when this happens.

Keep it up!

 
At 3:51 PM , Blogger akaihola said...

Doesn't pprint.pprint(mydict) solve the dictionary order problem in doctests?

HTML attributes generated by iterating dictionaries is another story of course.

 
At 12:07 PM , Blogger falbriard said...

Thanks for posting this great blog.

I'm a developer working in Brazil and author of a "middleware" which was fully written in Jython. My personal background is Java. After one year of hands-on with Jython, all I can say is that working with Python syntax is an altogether pleasant experience. The Django framework looks like an interesting road for new web applications. Running over Jython and the JVM, it will integrate nicely into the environment. Yep, we run Jython on a big blue server and it's like a red Ferrari.

My question is about the database drivers: Do you think Django will become available with a DB2 back-end? Do you have any insights about such plans?

 
At 11:15 AM , Blogger Jim Baker said...

@falbriard:
First, let me say that I have worked with DB2 (and Informix for that matter), but it's been a while. But even though I don’t work with DB2 at this time, it would be great to respond to the challenge issued last year for Django and DB2 support, http://antoniocangiano.com/2007/03/15/python-django-and-db2-we-need-your-input/.

Jython comes standard with a Python DB-API 2.0 compliant driver, zxJDBC, which works with any JDBC driver. So the only challenges in implementing Django support for a given database like DB2 are to map specifics of DDL, DML, and some data type handling that is not covered by the DB-API. Let's look at how this was done with one specific example. Take a look at the postgresql_zxjdbc driver that Leo Soto authored, and you will see that it just uses the postgresql_psycopg2 driver for client, creation, and introspection (currently introspection is duplicated because of backend lookup issues, but that's apparently now fixed in Django trunk). Only base needs support, and that's actually more at the level of JDBC - we should be able to extract most/all of that out for other JDBC drivers.
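
To show what the common driver interface covers, here is a sketch of the connect / cursor / execute / fetchall shape that a backend's base layer builds on. It uses sqlite3 purely as a stand-in so that it runs anywhere; it is not zxJDBC, and the article table and fetch_headlines helper are made up for the illustration.

```python
import sqlite3  # stand-in for the real driver; an assumption for this sketch


def fetch_headlines(connection):
    """Hypothetical helper that relies only on the generic cursor interface."""
    cursor = connection.cursor()
    try:
        cursor.execute("SELECT headline FROM article ORDER BY headline")
        return [row[0] for row in cursor.fetchall()]
    finally:
        cursor.close()


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# DDL and data types are the database-specific part a backend has to map.
cur.execute("CREATE TABLE article (headline VARCHAR(50))")
# The "?" placeholder style is driver-specific; other drivers may differ.
cur.executemany("INSERT INTO article (headline) VALUES (?)",
                [("Jython",), ("Django",)])
conn.commit()

headlines = fetch_headlines(conn)
conn.close()
print(headlines)
```

The helper itself never mentions the database; the CREATE TABLE statement and the parameter style are where per-database mapping comes in.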

So what's involved in DB2 specifics then (client, creation, introspection)? I had the good fortune to co-coach a sprint with Jacob Kaplan-Moss over a year ago when we added Oracle support to Django. Some of this has changed somewhat, but our sprint notes are actually useful in capturing the mapping: http://wiki.python.org/moin/BoulderSprint. In terms of actually getting this done, I could imagine it’s doable in a one day sprint (!), but you would want two domain experts: 1. Django domain expert; 2. DB2 domain expert, with extensive experience on all aspects of DB2 data types and somewhat advanced database queries. And then it would be done.

So anybody interested in volunteering as a DB2 domain expert?

 
