Jython 2.5 standardizes on Java 5 as the base version for its implementation. Jython has always mapped both
unicode and
str types to
java.lang.String, but the semantics of
String changed as of Java 5. Instead of encoding characters as UCS-2, that is just the basic multlingual plane of 65536 code points, Java - like .Net - adopted the UTF-16 encoding. UTF-16 can represent all 1114112 Unicode code points (
U+0 to
U+10FFFF), except for isolated surrogates (
U+D800 to
U+DFFF). These surrogates act as escape characters in the UTF-16 encoding.
This makes things somewhat more complicated, to put it mildly. And this is without even considering combining characters!
Instead of a simple uniform encoding that we see in the narrow (UCS-2) or wide (UCS-4) builds of CPython, we get a variable-length encoding. And unlike UTF-8, it's usually not too efficient. In addition, we lose the ability to represent the isolated surrogates. Finally, because UTF-16 is so very close to UCS-2, it's prone to bugs.
Here's the implementation strategy we adopted. In supporting the
unicode type with
PyUnicode, we first determine if it's in the basic plane or not:
private enum Plane {
UNKNOWN, BASIC, ASTRAL
}
private volatile Plane plane = Plane.UNKNOWN;
public boolean isBasicPlane() {
if (plane == Plane.BASIC) {
return true;
} else if (plane == Plane.UNKNOWN) {
plane = (string.length() == getCodePointCount()) ?
Plane.BASIC : Plane.ASTRAL;
}
return plane == Plane.BASIC;
}
getCodePointCount is in turn implemented using String#codePointCount. Like other code point methods, it decodes any surrogate pairs.
String immutability means we can cache the result in the volatile field plane; idempotence of this operation ensures consistency. This allows us to equate code units (char) to code points (int), and use the implementations provided by PyString. As it turns out, this was always done before, the only difference between str and unicode was in the encoding rules.
In the rather rare case it isn't, we read with our SubsequenceIteratorImpl (which does a decode and then moves forward in the string, rather useful) or String#codePointAt and write with StringBuilder#appendCodePoint using iterators. A seemingly good alternative would be to use String#offsetByCodePoints. Too bad it doesn't reliably work. So instead we have our iterator implementations, lots and lots of them. And sometimes crazy stuff like this, seen in the implementation of PyUnicode#unicode_strip:
return new PyUnicode(new ReversedIterator(
new StripIterator(sep,
new ReversedIterator(
new StripIterator(sep,
newSubsequenceIterator())))));
If
strip method was used extensively on strings that weren't in the basic plane, it might make sense to rewrite this to decode to an
int[] buffer. But that's not likely to be case.
That's also the reason we avoid making the basic plane test unless we have to. There are many situations where Unicode can pass in and out of Jython - specifically to/from Java - without us caring about what planes its characters are drawn from. We assume some overhead from boxing with PyUnicode (although HotSpot mitigates the indirection cost), but we don't have to overdo it by computing this test on construction.When comparing this with CPython, we do lose the ability to include isolated surrogate code points in Unicode strings. There are even some unit tests for this case. But ultimately this seemed like an implementation detail like testing ref counting, one certainly not worth time spent supporting.
It's worth mentioning that one alternative is to create our own representation, much like JRuby. Ruby's strings are mutable, unlike Python's. This forced the issue for the JRuby developers, because Ruby, like Python, needs good string performance. So JRuby uses byte arrays for strings, although they do use UTF-16 encoded, interned
java.lang.String's to uniquely represent symbols (
:xyz). Given that symbols are not strings, this works well. Ruby doesn't say anything about the encoding of such strings (ouch!), but JRuby does assume they're UTF-8 encoded when
crossing the boundary with Java.
Supporting widened Unicode means having support for this in regular expressions. The first step was to just widen the SRE engine used by Jython to represent characters with
int instead of
short. So we always unpack to
int in this case; see
strip above. This engine is a direct translation of the CPython equivalent: it's a mini-VM, much like the pickle VM, and regexes are compiled to SRE bytecode. In the future, we may consider using JRuby's implementation (Joni, a port of Oniguruma to Java), but the devil is in supporting some specifics to Python. As was seen in the CPython case, it was quite straightforward to just doing the widening.
At this point, the biggest outstanding issue is backporting the changes to SRE to support wide character classes (aka big character sets), a pickle problem, as well as various bug fixes. A total of four test cases are currently failing in test_re in the asm branch.
And then that's it, at least until we start doing performance profiling.
Labels: 2.5, jython, re, unicode, utf-16