XMLSocket.onData and UTF-8 zero bytes

I’m currently playing around a bit with Java and Flash’s XMLSocket to get them to talk to each other. The Flash documentation doesn’t mention this anywhere, so I assumed that Flash’s XMLSocket sends out its data encoded in UTF-8, just like XML.load expects its XML files to be UTF-8 encoded (and won’t take anything else even if you tell it to!). This seems to be correct, because I’ve had Java send the string back as UTF-8 and Flash displayed it fine.
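For reference, the sending side on the Java end boils down to writing the UTF-8 bytes followed by a single zero byte. Roughly something like this sketch (not my exact code; the class and method names are made up):

import java.io.OutputStream;
import java.net.Socket;

public class XmlSocketSend {
    // Send a string the way XMLSocket expects it:
    // the UTF-8 encoded payload followed by a single zero byte.
    static void sendToFlash(Socket socket, String message) throws Exception {
        OutputStream out = socket.getOutputStream();
        out.write(message.getBytes("UTF-8")); // UTF-8 payload
        out.write(0);                         // zero-byte terminator
        out.flush();
    }
}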

However, there is a slight complication. XMLSocket’s EOF marker for both input and output is a single zero byte, but the byte sequences I end up with for strings containing certain high-codepoint Unicode characters have several zero bytes in them, not just that terminator. When I have Java send back a byte sequence containing one of these multi-byte characters, XMLSocket’s onData seems to fire for each of those extra zero bytes as well as for the zero-byte EOF marker.

Short example: say I have Flash send “abc” over XMLSocket; what it’ll actually send is “abc\0”. When encoded as UTF-8, the byte sequence (in hex) for this string is:


61 62 63 00
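If you want to double-check that dump, Java’s own UTF-8 encoder gives the same bytes. A quick throwaway check (hypothetical code, not part of my actual setup):

public class HexCheck {
    public static void main(String[] args) throws Exception {
        // "\0" is the zero byte XMLSocket appends as its terminator
        byte[] bytes = "abc\0".getBytes("UTF-8");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex.toString().trim()); // prints: 61 62 63 00
    }
}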

My Java server receives this sequence successfully and echoes it back to Flash identically. So Flash then receives:


61 62 63 00

XMLSocket reads this, fires onData when it hits the last zero byte, and all is fine.
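For context, an echo server that behaves like mine could be as simple as a read-and-echo loop keyed on the zero byte. A hypothetical, heavily simplified sketch (not my actual server; the port number is made up, and it handles a single client):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class EchoSketch {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(9000); // port is made up
        Socket client = server.accept();
        InputStream in = client.getInputStream();
        OutputStream out = client.getOutputStream();

        ByteArrayOutputStream message = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            message.write(b);
            if (b == 0) {                         // end of one XMLSocket message
                out.write(message.toByteArray()); // echo it back unchanged,
                out.flush();                      // terminator included
                message.reset();
            }
        }
    }
}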

But when I have Flash send a string that contains one or more high-codepoint Unicode characters, say “abc嘹嘻”, it ends up sending “abc嘹嘻\0”. The byte sequence (in hex) I end up with for this string is:


61 62 63 E5 98 B9 E5 98 BB 00 00 00 00 00

You can see that the first 3 bytes (61 62 63) are still the same: that’s the “abc” part. I don’t know the exact specifics of UTF-8 encoding, but I do know that it uses 1 byte for ASCII and 2, 3 or 4 bytes for higher code points, marking lead and continuation bytes by setting their high bits so that decoding is unambiguous. So I guess E5 98 B9 is 嘹 and E5 98 BB is 嘻.

There are 5 zero bytes at the end, the last of which is the zero byte Flash appended, so I assume each remaining pair of zero bytes is associated with one of these Chinese(?) characters. I’ve noticed the same behaviour with UTF-8 encoded strings containing many more high-codepoint characters; they end up with tons of zero bytes at the end. At first I thought Flash might be misencoding this, but my Java test client produces the exact same sequence, so I’m pretty sure it’s correct.
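If anyone wants to reproduce that check, dumping the raw bytes as they come off the socket only takes something like this (hypothetical helper, not my exact code):

import java.io.InputStream;

public class WireDump {
    // Print every byte read from the stream as hex until the peer closes.
    static void dump(InputStream in) throws Exception {
        int b;
        while ((b = in.read()) != -1) {   // read() returns 0..255, or -1 at EOF
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}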

So it gets sent to my server, which then echoes back the exact same sequence to Flash:

61 62 63 E5 98 B9 E5 98 BB 00 00 00 00 00

And now, when Flash receives this sequence, it seems to fire XMLSocket.onData once for each of the 5 zero bytes. I can see this happening with a trivial handler:

xmlSock.onData = function( dat ){
    // dat is whatever arrived before the terminating zero byte
    trace( "received: " + dat );
};

When Flash receives this byte sequence, here’s the output I get:

received: abc嘹嘻
received:
received:
received:
received:

It seems to fire with an empty argument for every additional zero byte. I’ve tested this, and the number of empty calls always exactly matches the number of extra zero bytes.

So this forces me to conclude that XMLSocket’s onData method is, well, stupid. It should first try to parse the received data as a UTF-8 string, capping it off at the last zero byte, rather than bluntly walking the received bytes front to back and firing onData on every zero byte it encounters. But then again, the first output it gives me is “received: abc嘹嘻”, which is in fact the correct UTF-8-decoded string. So why does it go over the extra zero bytes (again?) that were part of the UTF-8 string?

Now the “fix” for this is easy: add

if( dat == null || dat == "" ) return;

in my onData method. This works, but I don’t consider it a fix.

Any information on this? Is XMLSocket.onData really as stupid as I think it is, or is it me being stupid? If it’s set to interpret received data as UTF-8, how can it possibly do that correctly if onData bluntly fires for every zero byte that crosses its path, without considering the possibility that it’s part of a UTF-8 string?