Friday 2 May 2014

A complete overhaul

This will be the most complete overhaul I've ever done of my analyzer. I iterated on my lexer and parser several times before arriving at their current, mostly final, designs, but my analyzer has always operated on the same core design.

I think of this design as the "Glorified Interpreter" approach. Essentially, we start with something akin to

    struct Expression {
        Type* t;
        llvm::Value* v;
        Expression Add(Expression other);
    };
    struct Type {
        // Default impl is unsupported- throws an exception.
        virtual Expression BuildAdd(Expression lhs, Expression rhs);
    }; 
    Expression Expression::Add(Expression other) {
        return t->BuildAdd(*this, other);
    }

Here we can generate the code for, say, addition relatively simply.

    struct IntegralType : Type {
        Expression BuildAdd(Expression lhs, Expression rhs) {
            if (!dynamic_cast<IntegralType*>(lhs.t)) throw std::runtime_error("lhs of + is not integral");
            if (!dynamic_cast<IntegralType*>(rhs.t)) throw std::runtime_error("rhs of + is not integral");
            return { lhs.t, builder.CreateAdd(lhs.v, rhs.v) };
        }
    };

This is basically an interpreter, except instead of an actual value, we have an llvm::Value*. I identified a few improvements to the design, like expressing most operations in terms of overload resolution (including primitive operators). However, the design has a few deeper flaws.
  1. Like C++, all information must be available up-front and cannot change. This prevents incremental re-analysis, type inference for recursive functions, and such things.
  2. No complex analysis of traits- the compiler has no information available except what type an expression is. We don't know where it came from, what it's doing there, or what process created it, and we can't identify other traits, like whether it depends on a duck-typed argument.
  3. Error-handling is fundamentally broken. As soon as any error occurs the entire system stops. This is fine for an essentially linear process like lexing or parsing, but for an analyzer, there's no reason why you shouldn't analyze the true and false branches of an if just because the condition is bad.
  4. Code generation is performed eagerly, even if the user doesn't need it because the compiler is, say, performing analysis for support features like IDE integration.
  5. Supporting destructors is a bit of a kludge. And by a bit I mean a lot.
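
The overload-resolution improvement mentioned earlier amounts to treating even primitive + as just another candidate in a candidate set. A minimal standalone sketch of that idea, where `Candidate` and `Resolve` are hypothetical names invented for this example, not the analyzer's actual API:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the compiler's Type; pointer identity is all we compare.
struct Type { std::string name; };

// One overload candidate: its parameter types plus a label so we can see which won.
struct Candidate {
    std::vector<Type*> params;
    std::string label;
};

// Pick the unique candidate whose parameter types exactly match the arguments.
// Real resolution would also rank conversions; this only shows that a primitive
// operator+ is "just another candidate" alongside user-defined ones.
const Candidate* Resolve(const std::vector<Candidate>& set,
                         const std::vector<Type*>& args) {
    const Candidate* match = nullptr;
    for (auto& c : set) {
        if (c.params == args) {
            if (match) throw std::runtime_error("ambiguous call");
            match = &c;
        }
    }
    return match; // nullptr means no viable candidate
}
```

A user-defined operator+ then lands in the same candidate set as the primitive one, and resolution doesn't need to special-case either.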
Instead, I now express each expression as a node in a graph. The node structure is permanent rather than transient and represents a function more than a value. Here's an example:

    struct Add : Expression {
        std::unique_ptr<Expression> lhs, rhs;
        Type* ty = nullptr;
        std::function<llvm::Value*(llvm::Value*, llvm::Value*)> add;
        std::unique_ptr<Error> err;
        // Called when lhs or rhs changes
        void OnNodeChanged(Node* n) {
            if (!lhs->GetType() || !rhs->GetType()) {
                if (ty != nullptr) {
                    ty = nullptr;
                    OnChange();
                }
                return;
            }
            auto unique_expr = lhs->GetType()->BuildAdd(rhs->GetType());
            if (unique_expr.error) {
                err = std::move(unique_expr.error);
                ty = nullptr;
                OnChange();
                return;
            }
            if (ty != unique_expr.t) {
                ty = unique_expr.t;
                add = unique_expr.func;
                err = nullptr;
                OnChange();
            }
        }
        void DestroyLocals() {
        }
        Error* GetError() {
            return err.get();
        }
        Type* GetType() {
            return ty;
        }
        llvm::Value* GetValue() {
            return add(lhs->GetValue(), rhs->GetValue());
        }
    };

I can delay code generation until it's actually needed. Each node can individually error or not. I can re-compute the same nodes with different inputs if the inputs are changed, say, because the user was live-typing his source code. I can dynamic_cast the lhs or rhs to see what they are if I want to implement anything funky. I can implement DestroyLocals with references to the LLVM values if I need to.
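
To make the re-computation idea concrete, here is a stripped-down sketch of the propagation machinery, with llvm::Value* replaced by plain int and Type* by a nullable tag so it runs without LLVM; `Node`, `Literal`, and `Sum` are simplified stand-ins invented for this example, not the real analyzer classes:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// The same propagation idea with llvm::Value* stubbed as int and Type* as a
// nullable const char* tag, so it runs standalone.
struct Node {
    std::vector<Node*> listeners;
    const char* ty = nullptr;            // nullptr means "type unknown"
    std::function<int()> value;
    void OnChange() {                    // tell everything that depends on us
        for (auto l : listeners) l->OnNodeChanged(this);
    }
    virtual void OnNodeChanged(Node*) {}
    virtual ~Node() {}
};

struct Literal : Node {
    void Set(const char* t, int v) {     // e.g. the user finished typing "42"
        ty = t;
        value = [v] { return v; };
        OnChange();
    }
};

struct Sum : Node {
    Node *lhs, *rhs;
    Sum(Node* l, Node* r) : lhs(l), rhs(r) {
        l->listeners.push_back(this);
        r->listeners.push_back(this);
    }
    void OnNodeChanged(Node*) override {
        // Same-type addition only; anything else means "no type yet".
        const char* newty =
            (lhs->ty && rhs->ty && lhs->ty == rhs->ty) ? lhs->ty : nullptr;
        if (ty != newty) {
            ty = newty;
            if (ty) value = [this] { return lhs->value() + rhs->value(); };
            else value = nullptr;
            OnChange();                  // propagate only on an actual change
        }
    }
};
```

The key property is that a node only fires OnChange when its own answer actually changed, so a live edit deep in the graph doesn't cascade re-analysis through nodes whose types stayed the same.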

The main problem I've got so far is that many nodes have the same kind of logic- propagate null-type if our source has it, propagate error if our source has it, and so on. So far I've duplicated it, but this is starting to piss me off. I believe it's some kind of monadic bind; I'll have to take a look.
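
The duplicated propagate-null-then-propagate-error pattern does look like a bind over a three-state sum type. A sketch of what factoring it out might look like, where `NodeState` and `Bind` are names invented for this example:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// A tiny sum type standing in for what a node knows about itself: either
// not-yet-typed, an error, or a successfully computed type (a string here).
struct NodeState {
    enum Tag { Untyped, Errored, Typed };
    Tag tag = Untyped;
    std::string payload;   // the error message or the type name
    static NodeState None() { return {}; }
    static NodeState Error(std::string e) { return {Errored, std::move(e)}; }
    static NodeState Ok(std::string t) { return {Typed, std::move(t)}; }
};

// The bind: run f only when every input is Typed; otherwise propagate the
// state of the first input that is errored or still untyped.
NodeState Bind(const std::vector<NodeState>& inputs,
               const std::function<NodeState(const std::vector<std::string>&)>& f) {
    std::vector<std::string> types;
    for (auto& in : inputs) {
        if (in.tag == NodeState::Errored) return in;
        if (in.tag == NodeState::Untyped) return NodeState::None();
        types.push_back(in.payload);
    }
    return f(types);
}
```

Each node's OnNodeChanged then shrinks to one Bind call plus the node-specific computation, instead of repeating the null-type and error checks by hand.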

I've also just noticed that I've got a problem with temporaries- namely, that the return of BuildAdd has no facility for determining what should be done about them. If the *result* of Add is a temporary, we can destruct it easily enough, but if, say, passing the rhs causes the creation of a temporary argument, then we've got no facility for destroying it. We can't just return a std::function<void()>, because how would the Type know what the llvm::Value*s are? So the Type is going to have to ask for any conversions it wants. There are no core language types that need destructing, and any that do will forward to user-defined function calls. This would be, I guess, an implemented-in-terms-of-Expression node.

    struct Add : Expression {
        std::unique_ptr<Expression> lhs, rhs;
        std::unique_ptr<Expression> impl;
        // Called when lhs or rhs changes
        void OnNodeChanged(Node* n) {
            if (!lhs->GetType() || !rhs->GetType()) {
                if (impl != nullptr) {
                    impl = nullptr;
                    OnChange();
                }
                return;
            }
            impl = lhs->GetType()->BuildAdd(rhs->GetType());
            OnChange();
        }
        void DestroyLocals() {
            if (impl) impl->DestroyLocals();
        }
        Error* GetError() {
            return impl ? impl->GetError() : nullptr;
        }
        Type* GetType() {
            return impl ? impl->GetType() : nullptr;
        }
        llvm::Value* GetValue() {
            return impl->GetValue();
        }
    };
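
Under this scheme, the node a Type hands back from BuildAdd can itself track the temporaries its conversions created and clean them up in DestroyLocals. A standalone sketch with llvm::Value* again stubbed as int; `AddImpl`, `Convert`, and the logging vector are all hypothetical names invented for this example:

```cpp
#include <cassert>
#include <vector>

// llvm::Value* stubbed as int so this runs without LLVM.
struct Expression {
    virtual int GetValue() = 0;
    virtual void DestroyLocals() {}
    virtual ~Expression() {}
};

// The node a Type might hand back from BuildAdd: it performs the conversions
// it asked for itself, records the temporaries they created, and tears them
// down in DestroyLocals, so no caller has to guess what needs destroying.
struct AddImpl : Expression {
    Expression *lhs, *rhs;
    std::vector<int>* destroyed;     // stand-in for emitting destructor calls
    std::vector<int> temporaries;    // values our conversions materialized
    AddImpl(Expression* l, Expression* r, std::vector<int>* log)
        : lhs(l), rhs(r), destroyed(log) {}
    int Convert(int v) {             // pretend converting rhs makes a temporary
        temporaries.push_back(v);
        return v;
    }
    int GetValue() override {
        return lhs->GetValue() + Convert(rhs->GetValue());
    }
    void DestroyLocals() override {  // the Type knows its own temporaries
        for (int t : temporaries) destroyed->push_back(t);
        temporaries.clear();
    }
};
```

The wrapper node from the previous listing never learns what the temporaries were; it just forwards DestroyLocals and the impl node does the right thing.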
