Sunday, 28 September 2014

Random-access into UTF8

I've heard many people state that you can't random-access (that is, read a random codepoint in O(1)) into UTF-8, or other variable-length encodings. This, however, is simply not true. It's possible to generalize deque to provide this functionality.

For the purpose of this article, we consider deque to be an array of arrays- where each subarray has a constant maximum size. For simplicity, we'll consider it as vector<unique_ptr<array<T, N>>>. Thus, for some index i, we simply use i / N to find the subarray, and i % N to find the subarray index. This gives us the final element location in our array of arrays.

The key insight here is that since each subarray has a constant maximum size, it actually doesn't matter whether we use an algorithm with a linear complexity to locate the final element inside this subarray, since it's linear in a constant factor.

So imagine a specialization of deque<codepoint>, which is a vector<unique_ptr<vector<codeunit>>>. Each subarray can hold N codepoints, just like before. So to find the subarray holding the codepoint corresponding to index i, we perform the same i / N step. Now we perform a linear scan decoding each UTF-8 codepoint in the array, but since it's linear in our constant factor of N, it still has constant complexity.

For somewhat less simplicity, we could simply go with unique_ptr<codepoint[]>, and then use the fact that all but the last subarray must be full of N codepoints to find the end. This gives us the exact same core structure as before and as deque<int> it's just that the raw size of the subarrays can shrink to accommodate smaller sizes.

Arguably, it's questionable as to whether this is superior to just using a deque of 32bit codepoints directly, since in theory it offers memory savings but I'm not sure how it plays out in reality, and it's definitely questionable as to how useful it would be to random-access codepoints anyway.

Monday, 22 September 2014

Employment

Welp, I found myself work, so I won't be spending all day dicking around on Wide anymore. Life's tough. And hopefully financially independent. For once.


Saturday, 13 September 2014

Constants, variables, and laziness.

Today I finally got rid of that dumb "Every string literal has a unique type" thing. It's a holdover from pre-constant-expression days. I nuked the code paths that used to handle it. And I discovered that this code path also handled my local variable as reference solution.

So I decided to just whack it and change that.

Now I will use "var : type = value" for variables whose type needs to be explicitly specified. And it will also be useful for members (contrast "var : type;" with "var := value;" with NSDMIs or function arguments). Speaking of which, I should look into function defaulted arguments and such again.

I've been thinking about using type & type as a tuple syntax- so you could do f() := int64 & int64 to denote a function returning a tuple. My current module export code REQUIRES that all types have a notation, and I'm not sure that's a bad thing, it's certainly motivational to fix such issues. As a pairing, I've been thinking about using | to denote a kind of language-implemented variant. One of the keys here is that the compiler can translate it into different run-time semantics. For example, if you said
 
    f(arg : int32 | int64)

then the compiler has no obligation to differentiate them at runtime. It could also simply generate two different overloads and branch on call if necessary.

I'm also thinking about implementing something like tuple[i], as long as i is a constant expression.

Long story short, I'm a smidge burned out on modules. I've been faffing around with dependencies and implements too long.

Wednesday, 10 September 2014

Dead Zone

I've been suffering a double whammy of 3RFTS and Planetary Annihilation. My brain is concrete. Using the same machine for both fun and work is problematic.

Friday, 5 September 2014

Module dependency data

So today I successfully exported a user-defined type with opaque members (i.e, the data members were not exposed to the caller). There are three more key features I want to finish up for modules.

First is "virtual headers". How this will work is (somewhat) simple. All you do is nominate a directory as a "header" directory. I copy all the headers in that directory (recursively) and stick them in the module archive in a certain folder. When the consumer uses the interface, the headers in the archive can be accessed as their filepath relative to the original directory they came from- so effectively, import/headers is added as an include directory. So if you create, say, a "Boost" module, then you can add Boost headers right into the module and ship it, so the user doesn't need to crap around with getting the headers.

Second, I want to separate interface and implementation modules. Right now they are all the same thing, but I need to split them off so that you can create a module against an interface of another module, then link in various implementations later. As part of this feature, I also need to decide what information modules need to hold about their direct and indirect dependencies, and how to match implementations with interfaces.

I've been thinking about marking each interface and implementation with a UUID, and then referring to each of them that way. Then you could get implementations from a central database by just asking for an implementation of a given UUID. For direct and indirect dependencies, I could simply list them as a UUID. If I kept around the full interface then if you want to use a dependency directly, it would be simpler.

Thirdly, I need to implement one of the primary features - exporting an implementation against an existing interface. My existing hack for imports and exports (create valid Wide source with some hidden attributes for binary interfacing and then just include it directly) is probably not going to function here. I think I can still re-use the basic subcomponents of the lexer and parser but the analyzer will likely require special support.

I think that #1 will be easy to finish up, the first part of #2 shouldn't be too hard but the second part may be harder, and #3 will be moderately difficult. Fortunately the Wide compiler is already mostly extremely abstract in how it represents things, so implementing them in a funky-tastic way shouldn't be too bad. I think that it will make a clear case that Wide can do things better when I can add data members and change the size/alignment without breaking binary compatibility.

Thursday, 4 September 2014

ABIs, complex types, and move elision

I've run into a slight problem with the ABI for certain types in a specific situation. Bear with me.

Imagine that we have a type T which has as a member a "complex" type. This is basically any type with a non-trivial copy ctor/dtor, so std::string, say. Due to the identity requirements of being able to see the address in the ctor/dtor, this type always has to be held in memory and never in a register. This means that when you have a "value" of that type, under the hood it's really a pointer.

Let's say that you have a value of this type T, and you access it as a member. Currently, Wide elides the move by simply addressing to the subobject's memory.

But when exposing the access as a function, it's a lot harder because the compiler can't see "into" the function and avoid the double destruction of the return value. In addition, the existing ABI for returning complex types does not permit returning a reference into or to one of the arguments, even though this could be useful, I feel. You absolutely must return a new value (in this case, a move).

I think that for now I will simply implement the function to move. But I am uncertain as to whether it would be worth changing the ABI to support this case, and if it was, how I would do so. It seems to be roughly in line with a similar plan that I had where a function can request an arbitrary amount of scratch memory from the caller, then place multiple potential return values in there. When the function returns, it destroys the non-returned values, and simply returns a pointer to something in that space. This scheme would have to be extended slightly to include a value for whether or not to destroy the thing pointed to by the return value.

Not sure if just moving the damn object is slower, after that.

I've also been thinking of a feature where the destructor can take the object by value. The current constructor/destructor system only allows you to express that an object's identity must be destructed; not that it's value must be destructed. To wit, this would state that a trivial move/copy is acceptable, as long as either src or dest is trivially destructed- i.e., every value of a user-defined type must only be destructed once, but the identity is unimportant. This is, again, useful for ABI-related details and various implementations. For example, you could vectorize copies or moves of something like std::string, or return/pass them in registers. As with the above I'm honestly not sure if this theoretical optimization will actually turn out to be worth actually implementing. Not to mention that I have other ideas for passing parameters to destructors.

Monday, 1 September 2014

New parser errors

As a result of the previous rebuild of the parser to a good part table-driven, then I now have automated parser error generation. This means that when the user alters the parser's data tables, the errors are updated automagically. I also cut 200 lines from the parser. Let's have a brief look at the Wide breakdown.

Premake Lua script: 394
Website source: 138 Python, 1133 HTML
VisualWide: 2172 of C#
Lexer: 590
Parser: 1900
ParserTest: 261
Semantic: 11835
SemanticTest: 2971
Util: 1073
CLI: 368
CAPI: 279
WideLibrary: 432
Total: 23549.

Perhaps the thing that needs the most love right now is VisualWide. I haven't worked on it in a long time and the parser/analyzer have changed a lot since then. I doubt the thing really works.