Tuesday, June 24, 2008

Adopting UTF-16

Jython 2.5 standardizes on Java 5 as the base version for its implementation. Jython has always mapped both unicode and str types to java.lang.String, but the semantics of String changed as of Java 5. Instead of encoding characters as UCS-2, that is just the basic multlingual plane of 65536 code points, Java - like .Net - adopted the UTF-16 encoding. UTF-16 can represent all 1114112 Unicode code points (U+0 to U+10FFFF), except for isolated surrogates (U+D800 to U+DFFF). These surrogates act as escape characters in the UTF-16 encoding.

This makes things somewhat more complicated, to put it mildly. And this is without even considering combining characters!

Instead of a simple uniform encoding that we see in the narrow (UCS-2) or wide (UCS-4) builds of CPython, we get a variable-length encoding. And unlike UTF-8, it's usually not too efficient. In addition, we lose the ability to represent the isolated surrogates. Finally, because UTF-16 is so very close to UCS-2, it's prone to bugs.

Here's the implementation strategy we adopted. In supporting the unicode type with PyUnicode, we first determine if it's in the basic plane or not:

private enum Plane {
UNKNOWN, BASIC, ASTRAL
}

private volatile Plane plane = Plane.UNKNOWN;

public boolean isBasicPlane() {
if (plane == Plane.BASIC) {
return true;
} else if (plane == Plane.UNKNOWN) {
plane = (string.length() == getCodePointCount()) ?
Plane.BASIC : Plane.ASTRAL;
}
return plane == Plane.BASIC;
}

getCodePointCount is in turn implemented using String#codePointCount. Like other code point methods, it decodes any surrogate pairs.

String immutability means we can cache the result in the volatile field plane; idempotence of this operation ensures consistency. This allows us to equate code units (char) to code points (int), and use the implementations provided by PyString. As it turns out, this was always done before, the only difference between str and unicode was in the encoding rules.

In the rather rare case it isn't, we read with our SubsequenceIteratorImpl (which does a decode and then moves forward in the string, rather useful) or String#codePointAt and write with StringBuilder#appendCodePoint using iterators. A seemingly good alternative would be to use String#offsetByCodePoints. Too bad it doesn't reliably work. So instead we have our iterator implementations, lots and lots of them. And sometimes crazy stuff like this, seen in the implementation of PyUnicode#unicode_strip:

        return new PyUnicode(new ReversedIterator(
new StripIterator(sep,
new ReversedIterator(
new StripIterator(sep,
newSubsequenceIterator())))));
If strip method was used extensively on strings that weren't in the basic plane, it might make sense to rewrite this to decode to an int[] buffer. But that's not likely to be case.

That's also the reason we avoid making the basic plane test unless we have to. There are many situations where Unicode can pass in and out of Jython - specifically to/from Java - without us caring about what planes its characters are drawn from. We assume some overhead from boxing with PyUnicode (although HotSpot mitigates the indirection cost), but we don't have to overdo it by computing this test on construction.When comparing this with CPython, we do lose the ability to include isolated surrogate code points in Unicode strings. There are even some unit tests for this case. But ultimately this seemed like an implementation detail like testing ref counting, one certainly not worth time spent supporting.

It's worth mentioning that one alternative is to create our own representation, much like JRuby. Ruby's strings are mutable, unlike Python's. This forced the issue for the JRuby developers, because Ruby, like Python, needs good string performance. So JRuby uses byte arrays for strings, although they do use UTF-16 encoded, interned java.lang.String's to uniquely represent symbols (:xyz). Given that symbols are not strings, this works well. Ruby doesn't say anything about the encoding of such strings (ouch!), but JRuby does assume they're UTF-8 encoded when crossing the boundary with Java.

Supporting widened Unicode means having support for this in regular expressions. The first step was to just widen the SRE engine used by Jython to represent characters with int instead of short. So we always unpack to int in this case; see strip above. This engine is a direct translation of the CPython equivalent: it's a mini-VM, much like the pickle VM, and regexes are compiled to SRE bytecode. In the future, we may consider using JRuby's implementation (Joni, a port of Oniguruma to Java), but the devil is in supporting some specifics to Python. As was seen in the CPython case, it was quite straightforward to just doing the widening.

At this point, the biggest outstanding issue is backporting the changes to SRE to support wide character classes (aka big character sets), a pickle problem, as well as various bug fixes. A total of four test cases are currently failing in test_re in the asm branch.

And then that's it, at least until we start doing performance profiling.

Labels: , , , ,

1 Comments:

At 12:52 PM , Blogger dissertation said...

it's good to see this information in your post, i was looking the same but there was not any proper resource, thanx now i have the link which i was looking for my research.

Writing A Dissertation

 

Post a Comment

<< Home