One thing that's really useful to have, when experimenting with new things, is an existing thing that does something like what you're trying to build. I don't mean something to copy, but rather something that you can judge your performance against as you work on an alternate implementation. There are many different ways to do a task, but it's silly to spend much time working on a way that's slow and clunky.
As a for example, as I dive back into working on Tornado, I dusted off the implementation of Conway's Game of Life. It's a nice piece to work with -- it's small and easy to get your head around, but it exercises much of the core of a VM, including array work, basic math, conditional expressions, resource allocation, and looping. It's also very easy to throw together a reference implementation of, as a way to get a handle on how the VM version runs relative to a compiled version.
Not, mind, that the VM version is likely to run as fast as the compiled version. There's overhead involved in most VM designs (without JITs, at least, and even having a JIT involves the overhead of on-the-fly compilation) so we're not looking for parity, but that doesn't mean we should settle for bad performance. So... I tossed together a quick GOL implementation and gave it a run. It's a standard version, one that calculates each cell completely sequentially. This is different from the Tornado version, which does nine offset passes, but that's fine, the results are the same.
The Tornado version does 50,000 generations of a 10x80 board in 1.45 seconds. That's with a perl wrapper, but there's only 0.045s of overhead there, which I think we can reasonably ignore.
Surprisingly, the C version does 50,000 generations of a 10x80 board in 1.58 seconds with unoptimized C. That I didn't expect, but judging against this isn't appropriate, since Tornado is compiled optimized. (Still... woo! :)
The C version, when built optimized, runs 50,000 generations of a 10x80 board in 1.28 seconds.
The C version is faster than the Tornado version, but only 12%. That... is quite nice, actually. While I expected it to be reasonably performant (the main loop to calculate a generation is only 34 VM instructions, regardless of the board size) I was expecting something more like a 40% penalty at this point in the implementation, and this is where having a benchmark implementation is important.
If the VM was running 70% or more slower than the base version, then it was time to toss out a lot of the core work. It wouldn't matter how much I liked the design, if it doesn't perform it's got to go, and the sooner that non-performant design gets thrown out the better -- the longer it lasts the more things depend on it and the harder it is to get rid of it.
So... yay for benchmark implementations. Make them and test your new implementations against them, and don't be afraid to throw out code no matter how much you like it.