I should know by now -- whenever I'm facing large masses of mostly-identical code it's a sign that it's time to do some templating.
I did this in tornado with the opcodes, in large part because I fully planned on having a simple oploop to start with, and more complex oploops later. I also knew I was going to need to yank out opcode info for the assembler and whatnot, so pre-processing the source made a lot of sense. (This was something that we did quite successfully with Parrot)
I didn't do it with the various MMD math ops. This is something I'm coming to regret.
Firstly, there's just a whole wad of cut'n'paste code in because of it, and that's always really error prone. Yech for errors. Fixing bugs in the boilerplate is always a pain when it's boiled over, and you know that you always find the bug in the code that's been copied everywhere but miss one or two spots in the cut'n'paste fix.
Second it really sucks when you realize that you want to roll out a functionality change, but you've got a half-zillion places you have to wedge it in because it's in the boilerplate.
Feh. Should'a done it in the first place. Now I have to go back and rip out the guts of math.c and replace it with code that handles shorter vectors on the RHS of an op for looping. And possibly the LHS too, I'm not sure. (There is the question of what you should do -- loop back to the beginning, or pad with zeroes, or just run the length of the shorter vector)
Ah, well, it's only code. No reason to not rip it out and make it better if there's an actual win to doing so, and it'll make the source code smaller too. Smaller is usually more comprehensible, and I like comprehensible.
(I really ought to cut an 0.04 release -- fixed a bug on the train and added in vector/scalar versions of the math ops. Maybe tomorrow)
Continuing on with the multithreaded vector related VM stuff, I've been thinking about vector operations. That (thanks to a reminder from someone) has also got me thinking about APL. Well, OK, really J, since thinking about APL tends to make me think of Unicode, and when I do that I need a good lie-down until it passes.
Anyway, vector operations. Useful things in a small subset of problems, and at the moment I'm interested in that particular subset. The question, then, is what vector operations should be permitted?
As a start, there's the question of how many dimensions are allowed? And yeah, I know, vectors are one-dimensional, but once you move past scalars the issue of dimensionality raises its head. Should two or three (or four, or N) dimensional matrices be allowable? At the moment, I'm saying no, since that complicates things, but I'll freely admit it's a tentative no, so it may be allowed in the future.
Next there's the issue of machine operations. Should the VM treat scalar values and vector values identically? That is, should:
dest = source1 + source2
work just fine regardless of whether source1 and/or source2 is a vector or scalar value? (We'll ignore the underlying machine representation, whether we're register or stack based, and all that, for right now. We're talking abstract here) There are certainly arguments to be made for both requiring explicit declaration of intent and allowing for implicitly figuring things out at runtime.
Being implicit is certainly possible -- there'll have to be sufficient information at runtime to determine that a variable is a scalar or a vector, as well as how large a vector it is. (Remember we're dealing with an embeddable VM, and we want some measure of safety, so there can be some runtime checks to make sure things are what they seem so as not to allow malicious code to do Evil Things(tm). There's been more than enough crap done that way in various systems that there's no way I'd design a system that didn't allow for proper checking) And as a number of languages have shown, just Doing The Right Thing is very useful.
On the other hand, explicitness does have its virtues. When you're writing code of the sort we're specializing in here you're reasonably sure (or at least had better be sure) what types of data you've got. Yes, you might not know the dimension of a vector, but you certainly know whether you have a vector or a scalar. And if you're explicit there's the potential to do more error checking, as well as to be a bit more efficient in the generated code, if you can do some static analysis of the bytecode. (Whether you want to do this or not depends on the use that you'll put the code, of course, and in many cases can be inferred from the data anyway) But basically if the information's in the source language, it ought to be in the bytecode if at all possible, since the more information you have the better job you can do making efficient choices. Information loss is a Bad Thing.
So, we're going to do only scalars and one-dimensional vectors, and we're going to have the operations explicit. What, then, would the actual operations be?
Well, that's a good question. For my purposes I need basic math (addition, subtraction, multiplication, division, and modulus) as well as the basic transcendental operations (log, sin, cosine, tangent, and exponentiation), so that's a good start. Ten operations, and two or four permutations (scalar or vector parameter for the unary operators, scalar or vector on each side for the binary) leaves thirty basic operations, which isn't that bad, all things considered.
There's also the issue of logical operations, but right now I'm inclined to leave those scalar-only.
Or maybe it should be titled "Pondering transactional programming environments".
I've been thinking about this the past day or so. I'm not thinking so much the standard two-phase multi-connection RDMBS style transaction, but rather the sort of transactional things folks are talking about these days with programming languages. Y'know, where you basically snapshot everything and if something happens you just put your old values back. Folks talk about this with the whole 'black box programming' model as well, where you take and record snapshots of state so you can put things back in case something goes Horribly Wrong.
Now, I don't think this is something I'm going to design into the engine (though it could be really cool for doing massive simulations -- restore a snapshot, tweak some values, and see how things progress) but that doesn't mean I haven't been thinking about it.
The big reason for thinking about it is that, so far as I can tell, the things that you need to make transactional stuff work fast and well are some of the same things you need to do heavy multithreading well. If you have the sort of isolated system I was talking about earlier, the sort that makes pausing and nuking threads easy, then you've got what you need to do transactions, and you've got them easy. (Especially if you go as far as Erlang does, with single-write variables)
So, you want to snapshot. What do you do? Well, if you've got immutable variables (or a reasonable facsimile) then taking a snapshot's just a matter of copying the variable heads (as well as any spot that may refer to them, such as the stack or any registers if you use them) off to a spot that the garbage collector, if you have one, can find them so garbage doesn't get collected. The total time taken is proportional to the number of variables you've got, rather than the amount of data you have. Noting that a snapshot's been taken is useful for creating true transactions, since there are things that should be disallowed while a valid transaction is in progress. (For example, any external call that can't be guaranteed to be repeatable would be disallowed -- possibly IO, interprocess or inter-thread communication, calls to external libraries under many circumstances, and suchlike things)
Yeah, it's definitely limiting, but... it's also very useful. Straight non-transaction snapshotting gives you the potential to investigate how a program is running after the fact without having to grovel through a core dump (or even having to take a core dump) if you can also write the contents of memory to disk along with the snapshot (essentially freezing and thawing the entire variable store) providing the black box effect that many folks want, and if you've got reasonably safe memory access you don't even have to save them to disk -- you can just take repeated snapshots and have them accumulate up in memory for later investigation if the program does die.
It's definitely nifty, that's for sure. 'Specially if you snapshot groups of threads -- in that case you can allow inter-thread communication, since you can then roll back state without any of the threads knowing, which is cool too. And definitely handy for sharing simulations, but that's just what's caught my eye at the moment.
As I said earlier, I'm working on the design of an embeddable VM geared towards heavy multithreading and vector math operations. Not exactly commonly trod ground, which is part of what makes it interesting. What I'm poking around with right now is what the system needs to provide to make the sorts of problems I'm interested in here easy to solve. (Or as easy as it gets when you're dealing with potentially massive multithreaded systems)
The requirements we have for threading are:
So basically we may spawn off a dozen or more threads all running the same code, pass in a chunk of data to each thread, and each thread will return a processed version of that chunk to the controlling thread. Those threads may pass the data forward or backward, to the thread processing the next or previous chunk of data. All threads in a group may need to synchronize to one spot. Threads have to be reasonably safely killable or pausable. And we need to be able to send or receive messages between arbitrary threads. This all adds up to an interesting architecture.
From that list it's pretty clear we need a few things, and we're going to get some restrictions that we're going to chafe at. Such is life -- nothing's free. Anyway the things we're going to get are:
Ordered Thread Pools
In this case I'm referring to a set of threads, all of which are running the same code, differing only in the parameters they're passed. Each thread in the pool knows where it was spawned in relationship to the rest of the threads in the pool so we can find the next and previous thread in the pool to send data to if we must.
The thread pool may or may not be dynamically mapped to OS threads -- if thirty or forty threads should be spawned it doesn't mean we'll actually fire off thirty or forty OS threads. We may only fire off ten or twenty (or two or three) and just make it look like there are a huge number, and we may dynamically fire off threads as we need them, for example waiting to fire off a thread to receive data until the sending thread has actually sent the data. Not necessarily on a one by one basis, but if we have forty chunks of data to process, we may only start threads to process chunks one to four, and hold off on starting the thread for chunk five until either chunk 1 is done being processed or something sends data to the thread processing chunk 5.
There are interesting arguments to be made to having the pool be N-dimensionally ordered, rather than 1-dimensionally ordered. That is, rather than just having a next and previous thread (or Nth forward/Nth backward thread) you have a 2 dimensional grid of threads, or a three dimensional cube of threads, which would allow for some interesting simulation capabilities. For now, though, I think we'll pass on that. Maybe later.
Messages
Threads are going to have to sling data between them. For this to work we need messages of some sort. The fastest and safest, though most annoying, way to do this is with read-only messages slung between the threads. That is, rather than some sort of shared data that gets accessed by multiple threads, the data sent in a message is read-only, unalterable by either the sender or receiver. Keeping everything readonly (to the point where we may make a copy on send, though there are time issues there if the data being sent is large) makes life simpler, requiring much less synchronization.
Message Ports
If we're sending data between threads we have to have some place to send the data to, and for that we need message ports. Or sockets, or whatever you want to call them. I like message ports, but I started doing IPC on the Amiga and I'm the one at the keyboard, so you get to bear with me. A message port is a place messages get sent to. Each thread has one, each thread pool has one, and threads can potentially create more if they want. Ports have names so they can be looked up by name if need be, since the defaults a thread can find (its pool port, its own port, and the port of threads forward and backward in the pool list) may not suffice for all activities.
Threads can wait on multiple message ports at once, and all inbound messages are tagged with the thread they came from and the port they were sent to, as well as information for replying if need be.
Optional message replies and acknowledgement
When you're slinging messages around you need to decide how its handled. Are they always asynchronous? Do you require that the sender is notified? Do you allow the receiver to reply to the message? And do you leave this sort of thing up to the application code or provide the primitives?
Well, for something like this it makes sense to wedge it in at the lowest possible level, so it can be done efficiently. That means there's a send, send-noted, and send-and-wait, as well as a receive, receive-acknowledge, message-acknowledge, and message-reply.
No shared data
Or almost none. Shared data requires synchronized access, which is expensive, has potential deadlock problems, and gets in the way of being able to kill a thread. Every single shared resource is a potential place a thread can leave things in an inconsistent state if killed, and every lock on a shared object is a place that you can deadlock if a thread is paused. Every shared resource is a spot where programmers can potentially screw up and deadlock things as well, and that's no good either.
What little shared data there is going to be available is unlockable and read-only. Accessing the data gets you either a copy or a read-only version -- altering global data will either be impossible or replace the shared data, leaving any thread that already fetched the data out with the old version. (Shared data being done by reference rather than by value, with shared updates altering the reference)
Handling shared data this way makes it easier to kill threads -- since a thread never actually has a lock on any shared data there are fewer things to get messed up if we kill off a thread, and fewer potential deadlock spots if we pause a thread.
Restricted external access
This is in part because of the embeddable nature of the VM and in part to help with the pause/kill feature. Killing a thread's only a problem when they've got a lock or have a global resource in an inconsistent state -- if a thread's not doing that they can be safely disposed of. Where most thread kill problems come in when killing a thread (besides the fact that people have too damn much global data in their application) is in the runtime libraries -- a lot of calls into them fiddle with global data. You can protect against inopportune nuking by wrapping those calls with a lock that'll block thread death, but that gets expensive pretty fast with a lot of calls.
If, on the other hand, your VM makes very few calls into the runtime library and associated support libraries, you can reasonably protect against inopportune death and kill threads off OK. Yes, this means no C runtime library calls or OS calls from within the virtual machine except in a very few specific spots, and it means that each thread will have to manage its own memory pool (so memory allocation's thread-local unless a thread runs out of memory), but that's fine -- we are a relatively specialized VM after all. Those few calls to the outside world we need to make (like for IO) can be delegated to the embedding application and wrapped with locks so a thread in the middle of a callout to the application won't be killable.
Pause and nuke threads
A lot of the above things make pausing and killing threads easier. We won't necessarily guarantee immediate death or pausing, unless we've got OS support for that sort of thing, but if the above restrictions (no shared data, callouts to places where we'd be vulnerable protected by locks preventing death, external bookkeeping to manage communication bits) are all in place it means we can pretty much have at the threads with an axe and not have to worry, which is a good thing.
The wrapup bit
Yeah, I know, before anyone says anything, building your own threading system's one of those things you almost never want to do, and 99% of the time that you think you do you're wrong. In this case, though, I've some specialized needs that make it a reasonable thing to do, and while the restrictions needed to make it work out are annoying for general-purpose programming, for special-purpose stuff, well...
It'll be interesting to see how this plays out once it's done and fiddled with. While I certainly wouldn't want to live within these restrictions when writing, say, a database or word processing app, for sub-applications (some graphical work, simulations, game AI, and crypto stuff) where part of the application would benefit from massive multithreading with mostly isolated threads it ought to be interesting to work with.
Or something like that. Gotta love the big words, y'know?
Anyway, as everyone's undoubtedly noticed, there's not been a whole lot here. Odds are, at least someone cares, too. :) The time's not been spent idly, though.
Besides the standard "do computer stuff for money" thing, one of the projects I've been involved with has an interesting question it's spawned -- what does a massively threaded vectorized virtual machine look like? That is, assuming you've got a lot of data, which you've split into pieces, and you need to do operations on all the bytes/words/longwords in those pieces, and the code processing each piece occasionally needs to talk to the same code processing a different piece (or potentially different code even)... what does it look like? What sorts of primitives do you build into the VM, how skewed should things get for speed, and what the heck do languages designed to work like this?
I know there has been some work in this area, mostly in automatic extraction of parallelizable code from serial code, but that's not exactly what I'm interested in. Erlang has a certain appeal in that it's designed to be massively multithreaded, but there's no real vector stuff built into it. (I think. I'm still poking around at it) There are things you can do in general with some of the languages, but pretty much all the languages I've come across at best make it so that the compiler can tease out any sorts of parallelizability rather than make it easy to write parallelizable code. (And no, the lazy functional languages aren't much use here. There's still ordering and dependency in the code, the languages just make it easier for the compiler and runtime to decide to not do things. Which is good, certainly, but not the same thing) Occam has some appeal here, but as I remember it was targeted at really small systems the last time I came across it (granted, that was twenty years ago or something like that) so I'm not sure it's useful. Poking around the WoTUG website's on the big list 'o things to do, though.
Even with the parallelizable stuff taken out of the loop, there's the question of vector or matrix operations, though this is much more commonly trodden ground, being the sorts of things people've been doing in Fortran for decades. Yay for numerical simulations, and more recently high-performance graphics operations. (There's a lot of this sort of thing involved when you're spinning a few million shaded triangles around at sixty frames a second)
Still, the two worlds don't seem to have met, or if they have I haven't found out where. Bizarrely, this is one area I think I can't steal thirty years of past practice from the Lisp guys. And yes, I'd be thrilled if someone'd fill me in. I don't mind doing my research, but I'm just as happy to have people point me to some giants whose shoulders I can stand on.
Anyway, that's from the language end of things, a place I'm not that strong in. For better or worse I'm a procedural language kind of guy, and I don't have the background to properly (or improperly) design a language for this. Standing joke or not, I really don't do syntax.
I do, however, do semantics, and that's the interesting bit. Most of what you want in an underlying engine is pretty standard -- flow control's pretty much the same no matter what you do, unless you want to get really funky. In this case funky is doing control flow checking against a vector rather than a scalar and spawning multiple threads, one for each value in the vector, but we won't go there because things get profoundly twisted very fast, and twisted isn't the point. Well, at least not the point of this, it can be someone else's point if they want it.
That leaves an interesting question, though. What sorts of primitives should be provided? In this case I don't necessarily mean low-level VM operations, though since we're talking thread and vector operations they pretty much have to be.
This is the sort of stuff I've been poking at lately. It's been fun thinking about, and I expect I shall go on at some length about it over the next month or so. Lucky you!