Random-access into UTF8

I've heard many people state that you can't random-access (that is, read a random codepoint in O(1)) into UTF-8, or other variable-length encodings. This, however, is simply not true. It's possible to generalize deque to provide this functionality.

For the purpose of this article, we consider deque to be an array of arrays- where each subarray has a constant maximum size. For simplicity, we'll consider it as vector<unique_ptr<array<T, N>>>. Thus, for some index i, we simply use i / N to find the subarray, and i % N to find the subarray index. This gives us the final element location in our array of arrays.

The key insight here is that since each subarray has a constant maximum size, it actually doesn't matter whether we use an algorithm with a linear complexity to locate the final element inside this subarray, since it's linear in a constant factor.

So imagine a specialization of deque<codepoint>, which is a vector<unique_ptr<vector<codeunit>>>. Each subarray can hold N codepoints, just like before. So to find the subarray holding the codepoint corresponding to index i, we perform the same i / N step. Now we perform a linear scan decoding each UTF-8 codepoint in the array, but since it's linear in our constant factor of N, it still has constant complexity.

For somewhat less simplicity, we could simply go with unique_ptr<codepoint[]>, and then use the fact that all but the last subarray must be full of N codepoints to find the end. This gives us the exact same core structure as before and as deque<int> it's just that the raw size of the subarrays can shrink to accommodate smaller sizes.

Arguably, it's questionable as to whether this is superior to just using a deque of 32bit codepoints directly, since in theory it offers memory savings but I'm not sure how it plays out in reality, and it's definitely questionable as to how useful it would be to random-access codepoints anyway.

C++ - the new old thing

Sunday 28 September 2014

Random-access into UTF8

No comments:

Post a Comment