Apparently, Fermi (and likely AMD's new architecture) offer equal 32-bit int and 32-bit floating-point performance. Still very branchy code... but that's what mass parallelism is for, amirite?
Now, first you need to know the scope you're in - type, namespace, function. Consider the ANTLR grammar for a namespace:
namespace_definition: 'namespace' identifier '{'
(variable_definition | function_definition | namespace_definition)+
'}'
The "Parallelize Me" sign here is that big plus sign saying "One or more". So we start off in serial, and let's say that we find the namespace token, and we know we're in a namespace. What we do is we find the terminating '}', so we know everything up to that point is a sequence of
(variable_definition | function_definition | namespace_definition)
. The trick to the parallelising here is that each of them is clearly delineated. Let's have a further look at the definition of these rules.variable_definition : [stuff] ';'
function_definition : 'function' [stuff] '{' [stuff] '}'
namespace_definition : 'namespace' [stuff] '{' [stuff] '}'
Once we have the list of contents of the original namespace, we just search through it and partition off all the contents into their separate rule invocations. Another sweet trick here is that we don't need to know the difference between a function_definition and a namespace_definition to parallelise at this point- we only need to know where they lie. Once we've separated out the invocations, we just stick them in a list, and parse them in parallel.
Now, an important point is that this is only partially independent. For example, inside the curly brackets, you need to know whether you're a namespace or a function before you can parse the contents. Once you know, however, the above rule can be trivially re-applied, until you reach the "atom" of the grammar- expression-statements. Haven't come up with any way to parallelize the parsing of individual expressions- most. Some, like indexing expressions, can be parallelized and importantly, type_literals indeed can also be parallelised, should I choose to include them. However, I'm going to go out on a limb and say that in a non-trivial program, there could be millions of expressions.
Now, ideally, I'd split this workload up into two segments, CPU and GPU parsing. CPU parsing will do the initial separation work- basically, determine all the expressions. The GPU will then parse all the expressions. Of course, since C++ AMP doesn't exist yet, then I have no desire to actually write a single line of GPGPU code and will probably just stick to CPU parsing, but it's an interesting idea.
You haven't explored the idea of GPU parsing at all...
ReplyDeleteYes, I noticed that. And GPU parsing is completely not feasible right now and the way I outlined in this post would be a total waste of time.
ReplyDelete