Tuesday, June 24, 2008

Flipping the 2.5 Bit for Jython

Something worth pointing out: as of 8 AM this morning (MDT), in rev 4748, Frank Wierzbicki flipped the bits and pronounced this about the ASM branch:

jbaker:~/jythondev/asm jbaker$ dist/bin/jython
Jython 2.5a0+ (asm:4750, Jun 24 2008, 10:56:16)
[Java HotSpot(TM) Client VM ("Apple Computer, Inc.")] on java1.5.0_13
Type "help", "copyright", "credits" or "license" for more information.
>>>

Yesterday we saw easily the most commits in a single day that the Jython project has ever had. The real threshold was reached when we incorporated the UTF-16 and new-style exception branches into this branch, fixed the grammar to support most incremental parses, and repointed the standard library to CPythonLib 2.5, along with a flurry of other fixes!

There's a lot more to go, but this should be an encouraging sign for everyone interested in Jython!


Adopting UTF-16

Jython 2.5 standardizes on Java 5 as the base version for its implementation. Jython has always mapped both unicode and str types to java.lang.String, but the semantics of String changed as of Java 5. Instead of encoding characters as UCS-2 (that is, just the basic multilingual plane of 65536 code points), Java - like .NET - adopted the UTF-16 encoding. UTF-16 can represent all 1114112 Unicode code points (U+0 to U+10FFFF), except for isolated surrogates (U+D800 to U+DFFF). These surrogates act as escape characters in the UTF-16 encoding.
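To make the code unit versus code point distinction concrete, here is a minimal sketch using only java.lang methods available since Java 5. The class name SurrogateDemo and its two helper methods are made up for illustration; they are not part of Jython.

```java
// Hypothetical demo class (not Jython code); uses only java.lang APIs from Java 5.
final class SurrogateDemo {
    // Number of UTF-16 code units, which is what String#length reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; each surrogate pair is counted once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String basic = "A"; // U+0041, inside the basic plane
        String astral = new String(Character.toChars(0x10400)); // U+10400, outside it

        System.out.println(codeUnits(basic) + " code unit(s), " + codePoints(basic) + " code point(s)");
        System.out.println(codeUnits(astral) + " code unit(s), " + codePoints(astral) + " code point(s)");

        // The astral character is stored as a high surrogate followed by a low surrogate.
        System.out.println(Character.isHighSurrogate(astral.charAt(0)));
        System.out.println(Character.isLowSurrogate(astral.charAt(1)));
    }
}
```

For the astral string, length() reports 2 while the code point count is 1, which is exactly the mismatch that makes UTF-16 so easy to confuse with UCS-2.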

This makes things somewhat more complicated, to put it mildly. And this is without even considering combining characters!

Instead of a simple uniform encoding that we see in the narrow (UCS-2) or wide (UCS-4) builds of CPython, we get a variable-length encoding. And unlike UTF-8, it's usually not too efficient. In addition, we lose the ability to represent the isolated surrogates. Finally, because UTF-16 is so very close to UCS-2, it's prone to bugs.

Here's the implementation strategy we adopted. In supporting the unicode type with PyUnicode, we first determine if it's in the basic plane or not:

    private enum Plane {
        UNKNOWN, BASIC, ASTRAL
    }

    private volatile Plane plane = Plane.UNKNOWN;

    public boolean isBasicPlane() {
        if (plane == Plane.BASIC) {
            return true;
        } else if (plane == Plane.UNKNOWN) {
            plane = (string.length() == getCodePointCount()) ?
                    Plane.BASIC : Plane.ASTRAL;
        }
        return plane == Plane.BASIC;
    }

getCodePointCount is in turn implemented using String#codePointCount. Like other code point methods, it decodes any surrogate pairs.

String immutability means we can cache the result in the volatile field plane; idempotence of this operation ensures consistency, even if several threads race to compute it. For a string in the basic plane, this allows us to equate code units (char) with code points (int), and use the implementations provided by PyString. As it turns out, this was always done before; the only difference between str and unicode was in the encoding rules.
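For anyone who wants to play with the idea outside of Jython, here is a self-contained sketch of the same lazy check. The class PlaneCachingString and its length() method are hypothetical stand-ins, not the PyUnicode source; they only assume the caching pattern described above.

```java
// Hypothetical stand-in for the relevant part of PyUnicode; not the Jython source.
final class PlaneCachingString {
    private enum Plane { UNKNOWN, BASIC, ASTRAL }

    private final String string; // immutable, so the cached answer never goes stale
    private volatile Plane plane = Plane.UNKNOWN;

    PlaneCachingString(String string) {
        this.string = string;
    }

    int getCodePointCount() {
        return string.codePointCount(0, string.length());
    }

    boolean isBasicPlane() {
        Plane p = plane;
        if (p == Plane.UNKNOWN) {
            // Racing threads may each compute this, but they all compute the same
            // value, so no lock is needed.
            p = (string.length() == getCodePointCount()) ? Plane.BASIC : Plane.ASTRAL;
            plane = p;
        }
        return p == Plane.BASIC;
    }

    // Length in code points: in the basic plane, code units and code points
    // coincide, so the cheap String#length answer is already correct.
    int length() {
        return isBasicPlane() ? string.length() : getCodePointCount();
    }
}
```

Nothing is computed in the constructor; the scan over the string happens only the first time a caller asks a question that depends on the plane.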

In the rather rare case that a string isn't in the basic plane, we read with our SubsequenceIteratorImpl (which does a decode and then moves forward in the string, rather useful) or String#codePointAt, and write with StringBuilder#appendCodePoint using iterators. A seemingly good alternative would be to use String#offsetByCodePoints. Too bad it doesn't reliably work. So instead we have our iterator implementations, lots and lots of them. And sometimes crazy stuff like this, seen in the implementation of PyUnicode#unicode_strip:

        return new PyUnicode(new ReversedIterator(
                new StripIterator(sep,
                        new ReversedIterator(
                                new StripIterator(sep,
                                        newSubsequenceIterator())))));

If the strip method were used extensively on strings that weren't in the basic plane, it might make sense to rewrite this to decode to an int[] buffer. But that's not likely to be the case.
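Just to show what that alternative might look like, here is a rough sketch of a strip that decodes to an int[] buffer first. This is not what Jython does (it uses the iterators above); the class CodePointStrip and its method names are invented for this example, and it relies only on String#codePointAt, Character#charCount, and StringBuilder#appendCodePoint.

```java
// Hypothetical sketch of the int[] alternative; Jython itself uses iterators.
final class CodePointStrip {
    // Decode a UTF-16 string into one int per code point.
    static int[] decode(String s) {
        int[] buffer = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            buffer[n++] = cp;
            i += Character.charCount(cp); // 1 for the basic plane, 2 for a surrogate pair
        }
        return buffer;
    }

    // Linear membership test; fine for the short separator sets strip usually gets.
    static boolean contains(int[] set, int cp) {
        for (int candidate : set) {
            if (candidate == cp) {
                return true;
            }
        }
        return false;
    }

    // Remove leading and trailing code points that appear in sep.
    static String strip(String s, String sep) {
        int[] text = decode(s);
        int[] drop = decode(sep);
        int start = 0;
        int end = text.length;
        while (start < end && contains(drop, text[start])) {
            start++;
        }
        while (end > start && contains(drop, text[end - 1])) {
            end--;
        }
        StringBuilder out = new StringBuilder();
        for (int i = start; i < end; i++) {
            out.appendCodePoint(text[i]); // re-encodes astral code points as surrogate pairs
        }
        return out.toString();
    }
}
```

The trade-off is an extra array allocation per call in exchange for plain indexed access from both ends, instead of stacking reversed iterators.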

That's also the reason we avoid making the basic plane test unless we have to. There are many situations where Unicode can pass in and out of Jython - specifically to/from Java - without us caring about what planes its characters are drawn from. We assume some overhead from boxing with PyUnicode (although HotSpot mitigates the indirection cost), but we don't have to overdo it by computing this test on construction.

Compared with CPython, we do lose the ability to include isolated surrogate code points in Unicode strings. There are even some unit tests for this case. But ultimately this seemed like an implementation detail, like testing ref counting, and certainly not one worth the time spent supporting it.

It's worth mentioning that one alternative is to create our own representation, much like JRuby. Ruby's strings are mutable, unlike Python's. This forced the issue for the JRuby developers, because Ruby, like Python, needs good string performance. So JRuby uses byte arrays for strings, although they do use UTF-16 encoded, interned java.lang.String instances to uniquely represent symbols (:xyz). Given that symbols are not strings, this works well. Ruby doesn't say anything about the encoding of such strings (ouch!), but JRuby does assume they're UTF-8 encoded when crossing the boundary with Java.

Supporting widened Unicode means having support for this in regular expressions. The first step was to just widen the SRE engine used by Jython to represent characters with int instead of short. So we always unpack to int in this case; see strip above. This engine is a direct translation of the CPython equivalent: it's a mini-VM, much like the pickle VM, and regexes are compiled to SRE bytecode. In the future, we may consider using JRuby's implementation (Joni, a port of Oniguruma to Java), but the devil is in supporting some Python-specific details. As was seen in the CPython case, it was quite straightforward to just do the widening.

At this point, the biggest outstanding issues are backporting the changes to SRE to support wide character classes (aka big character sets), a pickle problem, and various bug fixes. A total of four test cases are currently failing in test_re in the asm branch.

And then that's it, at least until we start doing performance profiling.


Friday, June 20, 2008

Realizing Jython 2.5

Jython 2.5 is really, finally, unbelievably coming together. This is the next release of Jython, after last summer's 2.2. In a nutshell, we have completed all new language features using an Antlr parser, except for absolute imports. All bytecode generation work, now using an ASM backend, is done. Of course, there are many outstanding bugs. And Python is not just a language; we need to fully support the fact that "batteries are included". But let's look at where we are. Through the prism of what's new in 2.3, 2.4, and 2.5, here's what's working:
  • 2.3: sets (PEP 218), generators (255), source code encoding (263), universal newline (278), enumerate (279), logging (282), Boolean (285), distutils (301), new import hooks (302), pickle enhancements (307), extended slices, datetimes, optparse. Still to go: csv, removing a dictionary in builtin that ensures that interned strings don't get GC'ed (pre-2.3 behavior!, it helps to read what's new). Also various string, Unicode, and regex changes are mostly done in a separate utf16 branch that I'm currently in the midst of merging against trunk.

  • 2.4: unifying long integers (237), generator expressions (289), string.Template (292, but also needs new utf16 work), decorators (318), reverse iteration (322), subprocess module (324), multi-line imports (328), removal of OverflowWarning, min & max with keyword support, sorted. But we still need partial import with sys.modules, and I'm sure some more stuff I forgot. Decimal and -m support are working in student branches; we just need to incorporate them.

  • 2.5: conditional expressions (308), partial function application (309, but we're cheating with a pure-Python version), distutils metadata (314), unified try/except/finally (341), coroutines and other generator functionality (342), with-statement, including contextlib (343), any, all. But we haven't done the exceptions remapping to new-style classes, absolute and relative imports, or all of the context manager support, such as in file. ctypes was a proposed Google Summer of Code project, but apparently PyPy has some work that's 95% of the way there; we will talk with them at EuroPython. We need to look into what is necessary to make ElementTree work. sqlite3 depends on ctypes. As I was writing this, I tried out wsgiref; it works and I just committed it to the asm branch. (At some point, we will repoint everything like this to CPythonLib, but for now we are mixing it up as we go. Bear with us!)

Even quit() and exit() now work; I don't know when these oh-so-major features were added. We even now support large string constants. And of course, who can forget our support for the GIL (global interpreter lock) in Jython, something that Tobias Ivarsson, my Google Summer of Code student who is now working on an advanced compiler, added to __future__ as an Easter egg:

>>> from __future__ import GIL
Traceback (most recent call last):
(no code object) at line 0
File "", line 0
SyntaxError: Never going to happen!
I would imagine that's definitive: we target Java's native threads and compile to Java bytecode. It would be hard to have a GIL, even if we wanted one.

However, we are just turning the corner. The Antlr parser in the asm branch currently does not support partial parses, and this breaks not only interactive sessions but doctests. Until this is solved - and Frank Wierzbicki is working like mad on this - we can't merge this branch onto trunk. But that should happen very soon.

With few exceptions, we simply test against the standard Python unit tests. Straightforward, cunning, or devious, we have labored against these unit tests. And in other cases, we have used Python as our foil: we support the same 2.5 AST parse tree, and we know this by comparing our parses with CPython's for all of the standard library - including those unit tests.

There's a lot more going on. I can't say enough about the work done by Charlie Groves, Philip Jenvey, Alan Kennedy, Nicholas Riley, and others to make this happen. Leo Soto, my other GSoC student, is making amazing progress on supporting Django on Jython, while finding and fixing bugs in Jython itself. Supporting Django forces us to find those gaps in compatibility. Similar efforts are going on with Pylons, TurboGears 2 (Ariane Paola, GSoC), and Zope (Georgy Berdyshev, GSoC). I'm also working on greenlet/Stackless support and involved in a collaboration with Jeremy Siek and Joe Angell at the University of Colorado to add gradual typing (yes types! but only when you want to) to Jython. We have a T2000 contributed by Sun to let us see how much concurrency - in this case 32 hardware threads, 64 GB of memory - Jython can take advantage of. And so on.

Back to work!

Updates - 2008-06-24: we have support for new-style exceptions, the parser is now usable (but there are a couple of bugs left there), and Unicode support has been updated to UTF-16. See this posting, Flipping the 2.5 Bit for Jython.


Saturday, December 16, 2006

Pythoneers Monthly Meeting: This Wednesday, December 20, in Boulder, Colorado

Important Change: we will be meeting at bivio Software instead of Jill's to better accommodate this month's demos.

This coming Wednesday (December 20) we are having our monthly meeting for the Front Range Pythoneers. Come join a lively discussion of Python demos, features, tips & techniques, and directions, both for fun and professional development.

Here are the meeting specifics:

  • Date/time: Wednesday, December 20, 6-8 PM
  • Location: bivio Software, Inc., 28th and Iris. Above Hair Elite in Suite S. Google Maps link
  • Tom Churchill and Vinny will demo Churchill Navigation's earth-rendering engine (which looks like Google Earth, only apparently even better and faster ;) ). Vinny (their main Python guy) will explain how they built the glue logic (and why they decided against SWIG), and perhaps some of the implications of using Python as a scripting language in a real-time (60 fps) environment, along with the techniques they employed to keep the graphics pipeline from stalling when making an expensive call into their engine from Python.
  • Brian Granger from Tech-X will help us think more deeply about concurrent Python programming, especially as seen in a new version of IPython he has been working on.
  • BoulderSprint. Eric Dobbs proposed we adopt Jython, and it looks like we have enough momentum to actually get some useful work done. We will talk about the upcoming sprint to be held on Saturday, January 6.
We will have food & drink available. Did I mention the free beer? Hope to see you there.

- Jim
