Along the design process of Wide, many people have suggested preprocessing to C++. I declined to go that route because I wanted to have the power to offer different semantics. A simple example is strict left-to-right evaluation order. Another is defined overflow/underflow on integers. But here's a more complex example: virtual functions. In C++, you can have
class base {
virtual base* clone();
}
class derived : base {
virtual derived* clone();
}
This is great, but not safe- we really need to use smart pointers. Unfortunately, this leads to an unfortunate surprise.
class base {
virtual std::shared_ptr<base> clone();
}
class derived : base {
virtual std::shared_ptr<derived> clone(); // error
}
This is obviously bad. So in Wide, I've decided for virtual function arguments and returns, we will use is-a- it's like implicitly convertible, but I've decided for it to have a deeper meaning. This means that the above sample, when converted to Wide, is legal.
The problem is when you want to express this Wide hierarchy as C++- because the converted code is illegal. Strictly from a vtable point of view, if I export the constructor then I can set whatever vtable I want in it. But, for example, if the base class function is pure virtual, I need Clang to recognize that derived is not abstract. There is a method in Clang that suggests this but who knows if it can handle the idea of one method overriding another even when the Standard says it should not.
So, yes. Now you can inherit from a C++ interface, but what's going to happen if you try to convert that type to a C++ type, I don't know.
Saturday, 31 May 2014
Thursday, 29 May 2014
The early-move kraken, incremental analysis and error handling.
I've gotten into the habit of compulsively moving things. It doesn't help that my current design involves a bunch of unique_ptr. So I've just fixed a bunch of bugs in code that looks like this:
First, here we're depending on compiler initialization order to evaluate the left-hand-side first, because in the right-hand-side we destroy decls's contents. Even more hilariously, after that, we then use decls AGAIN to get the object we just inserted. Unsurprisingly, this piece of code did not work fantastically well. What's more surprising is how few tests failed.
Similar problem here. We've read args and moved from it in the same function call. Ignoring potential UB due to evaluation order, which I believe is safe because all operations are in terms of user-defined functions, the compiler can (and GCC did) move args first. There are likely many more such places in the Wide codebase (yay!).
Incremental analysis and error handling. So far I don't have any tests about incremental analysis or error recovery, we still use the exception error handling model. But incremental analysis is beginning to concern me because I don't have a solid model for which nodes are responsible for handling what. The core problem is that some nodes but certainly not all would have to be written to listen for changes to their input and adjust appropriately. I just don't know which ones and why.
OverloadSet* Analyzer::GetOverloadSet(std::unordered_set<clang::NamedDecl*> decls, ClangTU* from, Type* context) {
if (clang_overload_sets.find(decls) == clang_overload_sets.end()
|| clang_overload_sets[decls].find(context) == clang_overload_sets[decls].end()) {
clang_overload_sets[decls][context] = Wide::Memory::MakeUnique<OverloadSet>(std::move(decls), from, context, *this);
}
return clang_overload_sets[decls][context].get();
}
First, here we're depending on compiler initialization order to evaluate the left-hand-side first, because in the right-hand-side we destroy decls's contents. Even more hilariously, after that, we then use decls AGAIN to get the object we just inserted. Unsurprisingly, this piece of code did not work fantastically well. What's more surprising is how few tests failed.
return GetSignature()->BuildCall(Wide::Memory::MakeUnique<Self>(this, !args.empty() ? args[0].get() : nullptr, std::move(val)), std::move(args), c);
Similar problem here. We've read args and moved from it in the same function call. Ignoring potential UB due to evaluation order, which I believe is safe because all operations are in terms of user-defined functions, the compiler can (and GCC did) move args first. There are likely many more such places in the Wide codebase (yay!).
Incremental analysis and error handling. So far I don't have any tests about incremental analysis or error recovery, we still use the exception error handling model. But incremental analysis is beginning to concern me because I don't have a solid model for which nodes are responsible for handling what. The core problem is that some nodes but certainly not all would have to be written to listen for changes to their input and adjust appropriately. I just don't know which ones and why.
Tuesday, 27 May 2014
Totally not overkill
On today's issue of Totally Not Overkill: JITting a function in your test to tell you what failure to expect. Oh, stop looking at me like that. It was the easiest thing to do since I already have machinery to JIT Wide functions.
Still, I turned up a BUNCH of bugs in those tests and a lot of missing coverage. For example, I have 30 possible semantic errors right now, and only 15 rejection tests (and several of those are missing member or overload resolution failures). I found some tests that hadn't been updated for the new function argument syntax. I found some tests that hadn't been updated for new integer literal rules. I found some tests that were never valid Wide in totally irrelevant ways. I found an error that gave the using context's location instead of the true location.
Next up I will begin classifying the compilation failure tests by which exception site they test, so I can identify less-tested call sites and exception types. There should be like, 40 compilation failure tests at least.
This is totally orthogonal to the fact that I want to completely overhaul error handling from analysis-terminating exceptions to per-node properties, and add tests for warnings to ensure that they also behave correctly.
ExpectedFailure() { return { "OverloadResolutionFailure", 179, 183 }; }
using movable := cpp("CompileFail/CPPInterop/NonCopyable.h").test;
Main() {
var := movable();
copy := var;
return false;
}
Still, I turned up a BUNCH of bugs in those tests and a lot of missing coverage. For example, I have 30 possible semantic errors right now, and only 15 rejection tests (and several of those are missing member or overload resolution failures). I found some tests that hadn't been updated for the new function argument syntax. I found some tests that hadn't been updated for new integer literal rules. I found some tests that were never valid Wide in totally irrelevant ways. I found an error that gave the using context's location instead of the true location.
Next up I will begin classifying the compilation failure tests by which exception site they test, so I can identify less-tested call sites and exception types. There should be like, 40 compilation failure tests at least.
This is totally orthogonal to the fact that I want to completely overhaul error handling from analysis-terminating exceptions to per-node properties, and add tests for warnings to ensure that they also behave correctly.
Sunday, 25 May 2014
Syntactical flexibility
Previously, I kept C++'s member initializer syntax- that is,
But this isn't really flexible enough for my needs. For example, var can only be an identifier, but bases need expressions to identify them. In addition, consider the needs of a function that is, say, an exported constructor. This function needs initializers even though it is not, syntactically, a constructor. This causes me to feel the need to unify syntaxes for exporting, explicit return type, and member initializers into a familiar syntax.
In addition, consider that now, it could be possible to express the initializer of more than one member. For example,
I also feel that you should be able to define out-of-class functions as members, including a "this" member, and that this should function like a normal "this" in terms of implicit lookup. If the function is not exported as a member then failure. I feel that having export and return as keywords will save the user from not being able to use members that way. You can specify multiple values for export- string, function, or true. True is coming later.
For function I am in the position where you have to use an overload set. This is problematic when it's a member overload set. I feel like I need to re-introduce something like type->member, where -> will denote a static access. This access should produce an overload set including non-static members, where there are non-static member functions.
Finally, I should perform some parameter validation. Code-generation will fail if the exported function is not LLVM-compatible with the one Clang declared, but there's no other sanity checks involved. The same is true of virtual thunks. The compiler should not fail at code generation time.
type t {
var := bool;
type() : var(true) {}
}
But this isn't really flexible enough for my needs. For example, var can only be an identifier, but bases need expressions to identify them. In addition, consider the needs of a function that is, say, an exported constructor. This function needs initializers even though it is not, syntactically, a constructor. This causes me to feel the need to unify syntaxes for exporting, explicit return type, and member initializers into a familiar syntax.
type t {
var := bool;
type() var := true; {}
}
f()
export := function;
member1 := init;
{}
In addition, consider that now, it could be possible to express the initializer of more than one member. For example,
type t {
var := bool;
var2 := bool;
type() var, var2 := { true, true }; {}
}
I also feel that you should be able to define out-of-class functions as members, including a "this" member, and that this should function like a normal "this" in terms of implicit lookup. If the function is not exported as a member then failure. I feel that having export and return as keywords will save the user from not being able to use members that way. You can specify multiple values for export- string, function, or true. True is coming later.
For function I am in the position where you have to use an overload set. This is problematic when it's a member overload set. I feel like I need to re-introduce something like type->member, where -> will denote a static access. This access should produce an overload set including non-static members, where there are non-static member functions.
Finally, I should perform some parameter validation. Code-generation will fail if the exported function is not LLVM-compatible with the one Clang declared, but there's no other sanity checks involved. The same is true of virtual thunks. The compiler should not fail at code generation time.
Friday, 2 May 2014
A complete overhaul
This will be the most complete overhaul I've ever done of my analyzer. My lexer and parser I iterated on several times before their current, mostly final, designs. But my analyzer has always operated on the same core design.
This design I'm thinking of as the "Glorified Interpreter" approach. Essentially, we start with something akin to
Here we can generate the code for, say, adding, relatively simply.
This is basically an interpreter, except instead of an actual value, we have an llvm::Value*. I identified a few improvements on the design, like expressing most operations in terms of overload resolution (including primitive operators). However, I've identified a few flaws here.
I can delay code generation until it's actually needed. Each node can individually error or not. I can re-compute the same nodes with different inputs if the inputs are changed, say, because the user was live-typing his source code. I can dynamic_cast the lhs or rhs to see what they are if I want to implement anything funky. I can implement DestroyLocals with references to the LLVM values if I need to.
The main problem I've got so far is that many nodes have the same kind of logic- propagate null-type if our source has it, propagate error if our source has it, etc. So far I've duplicated it, but this is starting to piss me off. I believe it's some kind of monadic bind, I'll have to take a look.
I've also just noticed that I've got a problem with temporaries- namely, that the return of BuildAdd has no facility for determining what should be done about them. If the *result* of Add is a temporary, we can destruct it easily enough, but if, say, passing the rhs causes the creation of a temporary argument, then we've got no facility for destroying it. Can't just return a std::function<void()> in the return because how would the Type know what the llvm::Value*s are? Gonna have to say that the Type has to ask for any conversions it wants. There are no core language types that need destructing, and any that do will be forwards to user-defined function calls. This would be, I guess, implemented-in-terms-of-Expression node.
This design I'm thinking of as the "Glorified Interpreter" approach. Essentially, we start with something akin to
struct Expression {
Type* t;
llvm::Value* v;
Expression Add(Expression other);
};
struct Type {
// Default impl is unsupported- throws an exception.
virtual Expression BuildAdd(Expression lhs, Expression rhs);
};
Expression Expression::Add(Expression other) {
return t->Add(*this, other);
}
Here we can generate the code for, say, adding, relatively simply.
struct IntegralType : Type {
Expression BuildAdd(Expression lhs, Expression rhs) {
if (!dynamic_cast<IntegralType*>(lhs.t)) throw ...;
if (!dynamic_cast<IntegralType*>(rhs.t)) throw ...;
return builder.CreateAdd(lhs.v, rhs.v);
}
};
This is basically an interpreter, except instead of an actual value, we have an llvm::Value*. I identified a few improvements on the design, like expressing most operations in terms of overload resolution (including primitive operators). However, I've identified a few flaws here.
- Like C++, all information must be available up-front and cannot change. This prevents incremental re-analysis, type inference for recursive functions, and such things.
- No complex analysis of traits- the compiler has no information available except what type an expression is. We don't know where it came from or what it's doing here or what process created it. We also can't identify other traits like whether it's dependent on a duck-typed argument.
- Error-handling is fundamentally broken. As soon as any error occurs the entire system stops. This is fine for an essentially linear process like lexing or parsing, but for an analyzer, there's no reason why you shouldn't analyze the true and false branches of an if just because the condition is bad.
- Code generation is performed eagerly, even if the user doesn't need it because it's say, performing analysis for support features like IDE integration.
- Supporting destructors is a bit of a kludge. And by a bit I mean a lot.
struct Add : Expression {
std::unique_ptr<Expression> lhs, rhs;
Type* ty = nullptr;
std::function<llvm::Value*(llvm::Value*, llvm::Value*)> add;
std::unique_ptr<Error> err;
// Called when lhs or rhs changes
void OnNodeChanged(Node* n) {
if (!lhs->GetType() || !rhs->GetType()) {
if (ty != nullptr) {
ty = nullptr;
OnChange();
}
return;
}
auto unique_expr = lhs->GetType()->BuildAdd(rhs->GetType());
if (unique_expr.error) {
err = std::move(unique_expr.error);
OnChange();
return;
}
if (ty != unique_expr.t) {
ty = unique_expr.t;
add = unique_ptr.func;
OnChange();
}
}
void DestroyLocals() {
}
Error* GetError() {
return err.get();
}
Type* GetType() {
return ty;
}
llvm::Value* GetValue() {
return add(lhs->GetValue(), rhs->GetValue());
}
};
I can delay code generation until it's actually needed. Each node can individually error or not. I can re-compute the same nodes with different inputs if the inputs are changed, say, because the user was live-typing his source code. I can dynamic_cast the lhs or rhs to see what they are if I want to implement anything funky. I can implement DestroyLocals with references to the LLVM values if I need to.
The main problem I've got so far is that many nodes have the same kind of logic- propagate null-type if our source has it, propagate error if our source has it, etc. So far I've duplicated it, but this is starting to piss me off. I believe it's some kind of monadic bind, I'll have to take a look.
I've also just noticed that I've got a problem with temporaries- namely, that the return of BuildAdd has no facility for determining what should be done about them. If the *result* of Add is a temporary, we can destruct it easily enough, but if, say, passing the rhs causes the creation of a temporary argument, then we've got no facility for destroying it. Can't just return a std::function<void()> in the return because how would the Type know what the llvm::Value*s are? Gonna have to say that the Type has to ask for any conversions it wants. There are no core language types that need destructing, and any that do will be forwards to user-defined function calls. This would be, I guess, implemented-in-terms-of-Expression node.
struct Add : Expression {
std::unique_ptr<Expression> lhs, rhs;
std::unique_ptr<Expression> impl;
// Called when lhs or rhs changes
void OnNodeChanged(Node* n) {
if (!lhs->GetType() || !rhs->GetType()) {
if (impl != nullptr) {
impl = nullptr;
OnChange();
}
return;
}
impl = lhs->GetType()->BuildAdd(rhs->GetType());
OnChange();
}
void DestroyLocals() {
impl->DestroyLocals();
}
Error* GetError() {
return impl->Error();
}
Type* GetType() {
return impl->Type();
}
llvm::Value* GetValue() {
return impl->GetValue();
}
};
Subscribe to:
Posts (Atom)