So, this morning when I got off the train it hit me that making message passing really really lightweight is a Good Thing. Yeah, I know, withhold your "Duh!"s for a moment.
For the most part, up until now, I'd been looking at the communication between threads as involving a relatively large quantity of data and, therefore, as relatively heavyweight. In that case a few extra cycles isn't that big a deal, since you're making few inter-thread calls and slinging great masses of data (well, 4K or more) between them to be processed. And that's a fine way to view things, since it matches a number of use cases well. Large quantities of data matched with large amounts of processing on that data.
But...
It doesn't work at all well when your threads are communicating a lot but not exchanging all that much data. For example, if you're running a simulation where each 'object' in your simulation is represented by a separate thread, running mostly independently, and interactions between the objects is done via passed messages. You know, like when you've got a few hundred objects moving at different rates, with different collision avoidance algorithms, and different preferred distances with other objects.
Looking at threads like this, as producers and consumers of messages, rather than as potentially parallelized functions, has some other ramifications as well. As a for example it means that we don't want to look at messages as if they were remote function calls, since they're really not. A function may well get a message somewhere other than at the beginning of the code, and may well send a message out somewhere other than at the end.
Now, both ways are perfectly valid, and each of them has some benefits and drawbacks.
An RPC-style way of looking at threads makes for more easily auditable code. This is a good thing, since it means it's easier to do machine inspection, and it's easier for programmers to write correct code. (Not actually easy, of course. Nor likely, I suppose, but one can always hope) It also makes the implementation relatively simple, since you can make function calls and message passing share code, and that code doesn't have to be hugely efficient. The downside there is that it makes message passing less handy as a tool for communication, both because of its awkwardness and its mild inefficiency. (Though whether anyone but me cares about the inefficiency's always an open question) Coding this way tends to reduce the number of deadlocks you run into as well, which is a good thing.
A communicating agents way of looking at threads makes for more interesting code, which is good. Your code can be more flexible (or at least it's easier to be flexible) and it conforms to a relatively poorly served programming paradigm. The downside is that auditing and writing the code's potentially a lot more difficult.
Given how I want Tornado to behave and one of the places I want it well-suited to, we're heading to the lightweight message passing scheme. More of a pain for me, but I think the wins are worth it.
Spam's a weird thing. No matter what you do there's a whole great wad of it that gets flung around the internet. Yeah, I run SpamAssassin on my mailserver, have amavis scanning for virus mail with clamav, and have mt-blacklist installed on the blog, but that just slows things down.
Because of that, the recent (like last week) increase in the amount of e-mal spam and blog spam didn't really register as caused by anything special, even though the amount increased a lot. (Like a factor of five or six stuff getting through as e-mail and a factor of 10 or 12 coming through on the blog) That's kinda sad, that an increase like that goes by as noticed but unremarked.
It turns out, though, that there was a reason, and no it has nothing to do with nefarious doings on the dark and crusty edges of the IntarWeb.
I rebooted the server over the weekend.
Now, this wouldn't normally be remarkable in any way (though it is infrequent) except that one of the few things that I don't have preserved across reboots is my routing tables. And that wouldn't be all that special, except for a few weeks a while back I'd gone on a campaign of vicious black-hole routing machines that were making bogus trackback pings on Jane's website. The mt-tb.cgi program wasn't even there, so any hits on it that had a referrer showing it'd come from there were clearly bogus, and I threw in a reject route for 'em.
That was, for the record, a lot of reject routes. A few hundred, and they really could have been cleaned up (there were a number of netblocks clearly spamming) but I never bothered. Anyway, rebooting cleared 'em out and the spam just poured in. Ick.
Luckily I'd exited the shell I'd done most of 'em in at one point completely by accident, so a lot of the commands were in my bash history file. Chopped 'em out, stuck 'em in a file, and re-installed the routes, and the spam's definitely cut way back, both blog and e-mail.
I really hadn't realized how much of a difference it'd made, since the routes went in over the course of a couple of weeks, but when they went away... yow! Big difference, and bringing at least some of them back has made a big difference too.
I hadn't figured it'd make this big a difference, stopping that few machines. I'd mostly figured all this crap was coming from swarms of zombie PCs, and while I'm sure a lot of it is, there's apparently a good sized chunk that's coming from a relatively small number of big machines. Not a panacea for the spam problem, but heck, I'll take it.
It seems like a truism, but no matter what you do, if you're working at a low level you're going to find parameter passing to be a major pain in the neck. (by which I mean it's easy to find a common scenario where your parameter passing scheme is going to be annoying and/or slower than it ought to be) Tornado, of course, is no different.
If you're going to make function calls, you need to pass in parameters. That's pretty much a given. And parameters, well... they're a problem.
Basically, how do you pass in parameters? And how do you get them in when you're a function? Heck, what exactly counts as a function anyway? (That is, where's your boundary for a 'real' function, as opposed to a spot you just jump to?)
There are a few ways to pass them in. If you're running on a stack-based system you can push your parameters onto the stack and jump to your function, but there are problems there if you're putting other things onto the stack like the return address. It's also potentially kind of slow, since you've got the cost of putting something onto the stack, which involves checking for stack depth as well as actually putting the data onto the stack (which can be pricey if you need to extend the stack in a segmented stack system, or runs the risk of throwing an exception on stack exhaustion in a fixed-sized stack system)
If you're in a register-based you can stuff the parameters into registers, but then there's the issue of what happens if you have too many parameters to fit into registers? Where do they overflow?
And there's always the possibility of packaging up the parameters somewhere else, in some out-of-bad location regardless of the existence of a stack. That's got the mild benefit of a register machine, since it's out of band and presumably not going to give you stack sizing issues, but you need yet another structure to store your parameters in, and getting to them requires special ops. (Presumably register access in a register machine, or stack access in either a register or stack machine already works)
Tornado's got an extra problem in that it's heavily threaded and built for speed and safety. Now, most of Tornado's data structures are write-once (which makes life ever so much easier in a threaded environment, though GC is always rearing its ugly head) which means that if we hand-wave on the GC issues we don't have to make copies of things to send to remote threads -- we can send the things themselves. And since Tornado doesn't have any global writable data (remember, we can store data into global 'slots', but it's explicitly an obliteration of the existing entry, and an atomic operation) there's precious little shared state to deal with.
I'm also just fine putting a hard limit on parameter counts for function calls. Want to call with thirty parameters? Well... too bad, you can't. Or I'm at least fine saying you can't.
The interesting thing here (well, one of them at least) is that the speed of your function calls can have a significant effect on the style of programming people use on your system, and the style of programming people use can, in turn, alter how optimizations are made for call speed.
For example, in OO programming you want function (well, method, but...) calls to be fast, because people seem to insist on having a huge number of tiny little methods. Personally I blame the lack of automatic accessors as part of the language design, but that's neither here nor there. Anyway, if you look at OO code you'll see a half-gazillion methods that are one or two lines long. Method calls have to be fast, since if they're not your code just runs horribly slowly. And yeah, it means that you will go to some lengths for this, and sacrifice speed in other places to make your methods go faster. (In this case, it means probably going for a stack-based system, since stack systems are almost universally faster for small-count calls, even if being stack based costs you performance elsewhere)
On the other hand, if you're writing code in a language that doesn't do much in the way of function/procedure calling, it doesn't make nearly as much sense to optimize for calls. Does it really matter if your call takes 10 msec instead of 5 if the language or coding paradigm you're working towards will make maybe a dozen function calls over the lifetime of a block of code? Probably not, and that's fine.
For Tornado, if you think about it, subroutines are pretty heavyweight. Yeah, sure, your sub might just be:
a = b *c
a = a % 4
a = a ^ d
but if a, b, c, and d are 1024 element arrays, well... that could take a while. And no, I've no clue if that does anything useful or not, but it's not like the guts of a gaussian blur or RSA encoding function are all that fancy if you've got automatic vector operations.
Then there's the question "what are all these subs going to do?" I mean, we've got a relatively specialized processing engine, one with a programming style that's likely going to annoy most people, so is it too out of line to figure that most of the subroutines are going to be reasonably large computationally, if not in actual program text?
This also brings message pools into the equation. (No, really, it does) Should fetching a message be the same as getting called with parameters? If so, you could arguably have a bunch of threads waiting on a thread pool's message queue and another thread posting a message is the same as making a 'remote' procedure call into a thread in the pool. If your global data's read-only, who cares where you live outside of the obvious speed and latency issues?
It's very tempting to treat all function calls as if they were messages being sent to other threads. More to the point, exposing it so that functions create messages to make a call, wait on their return, extract the return values from the return message, and get messages in when they're being called.
Tempting, but... no, at least not user-visibly. Might do it under the hood but that's just an implementation detail, and for right now we're not letting those get exposed.
Anyway, what will we do? Again, looking over the list 'o features:
Given all that, it seems clear we want to restrict messages and parameters to the same count of things. I'm thinking 16 is a good number -- certainly no more, as that's 64 bytes of data to copy for each function call. Each function activation gets its own activation record, which can store the function's pad, its registers, and the return address, and with that we won't need a stack. We're also going to strictly enforce declared in and out parameters, and leave the rest untouched, so if a call takes two parameters and returns one, only R0 and R1 go in, and only R0 is touched in the caller on return.
I think a small piece of Clever is in order here too -- when making a call the parameters go in the low registers (starting with 0) but when you're activated your parameters go in your high registers. That is, if function X takes one parameter, I put it in R0 when I make a call to X. When I'm in X, the parameter is in R16. Why? Well, assuming functions that don't have all that much code in them it means there's a good chance that most of your operations will be on the data you're passed in, and you'd rather not have to shuffle that around to keep it from being stomped on by a function call.
Messages will be limited to 16 elements as well, when they're constructed. This may end up being a bigger hassle, but for now I think it's OK. That's for another time, though.
This isn't, I'm sure, anyone's ideal scenario, but it seems like it'll handle the common case the way it ought to -- quickly, and with as little fuss as reasonable. Now to get it all implemented so it can be tested...
It's always useful to frown at your assumptions when you're doing design work, since they so often go unexamined, and often force you to build in limitations you hadn't expected in your software. (And sometimes they force you to build in features you hadn't expected, which is nice) Those assumptions also force tradeoffs, which are inevitable -- nothing's free, alas.
One of the base assumptions that Parrot had was extreme dynamism, in part because the languages it was targeted for are extremely dynamic. Code could change at runtime, variable types could change at runtime, and array sizes could change at runtime. (Indeed, the structure of your program could change while the code was inside a single opcode) That brought in all sorts of nifty things, but it also made life somewhat slow and annoying. Or relatively slow, at least, since if you embrace your assumptions rather than fighting against them you generally do pretty well.
Tornado, on the other hand, doesn't have the same sorts of assumptions that Parrot has, because it's not targeted at the same sorts of problems Parrot was. Parrot needed flexibility, while Tornado needs speed, security, and concurrency. Tornado's really more of a co-processor than a CPU -- something you call from your main program to perform some relatively specialized tasks, while you leave the general purpose code to your mainline code.
Because of that we can make some assumptions. They shoot the sort of full dynamism that perl has way down, but that's OK -- you'd call Tornado from perl, rather than running perl on Tornado. (And yeah, there's a perl interface module for it, though the engine itself doesn't yet function. Go figure)
So here are the ground rules for Tornado.
#1 means that we can check vector lengths on variable fetch and be happy. Vectors may be of indeterminate length until their first assignment or use (so we can mark a parameter as being a vector but not knowing how long the vector might be at compiletime) but once it has a size the size is fixed.
#2 means that attempts to turn an integer variable to a bigint will fail
#3 means we don't deal with mixed-type vectors. All the elements are ints, or floats, or whatever, and we can store them as efficiently as possible
#4 means that we don't carry around metadata in the bytecode. No variable names, no attributes beyond type, no annotations, nothing. Maybe we'll lift that later, but probably not. The only thing we keep around is the minimum amount of information to allow potential bytecode verification.
#5 is one of those 'duh' things -- if it's a vector it's only got one dimension. (Otherwise it'd be a tensor) This limitation'll probably be lifted later, though it affects #6.
#6 is a flat-out dodge of a lot of hassle. If you add two vectors together and one's longer than the other, what do you do? Repeat? Zero fill? One fill? Well, Tornado pitches an exception. When (well, if) vectors become tensors, then we'll probably have to do something fancier, what with the way multidimensional things operate, but that's for later.
#7 is because, well, we're not a text processing engine. That's really not the point, though I can certainly see doing Clever Things with parallel algorithms and text searching, but... maybe another time. As far as Tornado's concerned there is no text at all, and if you want to do text-y things you can do them elsewhere. (And yeah, I know, Tornado's well-suited for some specialized text processing applications, but let's face it, crypto doesn't deal with text, it deals with streams of numbers that happen to be also text. I don't think anyone's got a crypto algorithm that does Special Things with multi-codepoint characters and if they do I really don't want to know)
#8 means that if we want to turn a float to an Int4 you have to do it explicitly. None of Perl's mighty morphing power scalars. They're good for some things, but not what we do. We might allow you to do operations with two types where one type is a subset of the other (Int8 and BigInt operations, or Int4 and Float operations), but the destination assignment had darned well better be lossless. Try and store a BigInt into an Int4, or a BigNum into a Float and you lose unconditionally.
#9 is there to allow code to get around the rules for #8 -- sometimes you can know that your results will fit into an Int4, even though there are floats and bigints involved, but if you do (or don't care) then you have to force an explicit conversion.
Finally #10 means that promotion rules depend on the type of the variables involved, not on their values. If you multiply an Int4 and a BigInt you'll get a BigInt, even if the Int4 has a value of 3 and the BigInt a value of 2.
The one potential caveat here is that we may allow for temporary casting of byte vectors to other types. That is, while a vector may hold a wad of bytes, you may want to treat it as a vector of Int2, Int4, or Int8s. (No Int3s -- If you're doing video data transforms, go for a 4-element format like CMYK or RGB with an alpha channel)
Yeah, not having dynamism, and code that changes based on circumstance, and dynamically loadable code, and a handy language compiler built in are all limitations, but they're limitations I'm fine with, since they serve a purpose, and Tornado's very much not a general-purpose VM.
So, I've been busy with other things, and Tornado's been ignored of late. That's done with, and it's time to get back to the machine. The delay has let some interesting ideas kind of percolate, though, which I'm happy about. Specifically dealing with distributed computing, which actually dovetails nicely with some of the other things that Tornado does.
Tornado's basic architecture -- a strongly sandboxed VM with isolated threads communicating via passed messages -- extends nicely to a distributed system. If the system's sandboxed well it's safe to fling code around to remote machines (and safe to receive it). If individual threads are mostly isolated there's not that much time lost synchronizing threads and their data. If you have thread pools that share a message port then it's easy to fling a lot of data at a remotely hosted pool so it's ready when the threads need some.
Tornado, handily, is strongly sandboxed, its threads are pretty damn isolated, and it's got thread pools. Woohoo! :)
Silly exuberance aside, it's not really that easy. While multithreaded programming and distributed systems have some things in common, you can't really treat remote and local consumers and producers identically. Bringing in remote threads adds in a whole new set of potential failure modes (most of which are actually recoverable, which is nifty if you're building reliable systems) and of course there's the issue of latency to take into account. Sending a message to another thread in the same process may take a microsecond, sending it to a thread in another process on the same machine may take a dozen or more microseconds, and sending it to a remote thread could take milliseconds (or seconds, or possibly minutes if you're really unlucky and have lots of data). Sure you could treat it all the same, but... I dunno, I think it's a bad idea to ignore potential sources of latency.
There are also quite a number of applications where you really do want to know. If you're doing crypto work (some of the algorithms parallelize and vectorize really nicely) you'd certainly not want to be slinging plaintext or keys across the network, or if you do it's because you explicitly chose to do so. If you're doing game programming you probably don't want the latency in communicating with a remote process. On the other hand, if you're doing numerical simulation work you may well want to take advantage of remote systems -- the extra millisecond or three to send data to a remote system's trivial if you've a computation that takes minutes, hours, or days, and being able to easily bring online extra computing resources is a very useful thing.
So, anyway, that argues for designing in the capability but making it more explicit -- programs can specify that a spawned thread can be remote, or a spawned threadpool can be remote -- and the VM can manage distributing them around as need be. (Which brings up the interesting idea of welding ZeroConf into the VM to allow it to track what other execution engines are around, but that's for another time) Possibly also allowing for message ports to be specified as external resources, though there's always the issue there of malice -- you can't necessarily tell what the remote system's running, and allowing it opens up a potential network port with the possibilities for malicious bytecode to attack remote systems, which is something that Tornado's trying to prevent.
Dunno if it'll make it in. It's an interesting thing to think about, though.