A few years back I was part of a team on a large project creating a real-time control system for land-based seismic data acquisition and QC. There was lots of custom hardware and custom software, and it all had to communicate continuously and flawlessly. Every minute the system was down would cost the company thousands of dollars, so fault tolerance was our primary concern. When asynchronous messaging between the components was suggested, we went for it. It made perfect sense: if one component goes down, it shouldn’t bring the rest of the system to its knees. We ended up using JMS to communicate between the Java components.
I was responsible for the part that handled communication with and control of the seismic sources, mostly vibrator trucks. (They needed a woman to control the vibrators, obviously.) Vibrator trucks are large trucks with big, hydraulically operated metal pads that are pushed into the ground and then vibrated at specific frequencies.
The trucks needed information about where to go, what kind of signal to put in the ground, and what formation to be in. They would in turn send signals to indicate that they had their pads down and were ready at a certain point, upon which we needed to check whether the rest of the system was ready to record: were there other sources too close? Were there enough receiving sensors available in their location? Lots of things had to be checked before a time to start the source signal was decided on. After the source signal was completed, measurements from the vibrator trucks had to be recorded. I’ve left out so much, but you get the idea. There was a lot going on, and lots to keep track of.

One of the most important parts of the system, for my component, was handling when the vibes were ready to go. “Ready,” they would say. My code would receive this message, do some validation and recording, then pass the message on to another part of the system. Then it would sit tight waiting for a response: either an Abort or a Trigger. Or nothing. What did an absence of a response mean? Was it an error? Could be. Could be that some other component was unavailable. Or it could just be that things were going slowly. No way of knowing.

In field tests, this would often lead to the truck drivers lifting their pads up and down again just to see if things were working properly at their end. This would of course trigger an Abort from their system, followed by a new Ready message, and these would then dutifully be sent onwards to the rest of the system. If the absence of a response was due to some component being down, there would now be three messages waiting for it upon restart: first a Ready message (which should of course be disregarded, but how could the component know that?), which would set off all sorts of preparation activity throughout the system. Then the Abort message would be read, causing an equal amount of activity to stop the system and reschedule things. Then another Ready message, for the same point!

It was the same the other way around, of course: I had to handle the fact that incoming Trigger messages might be based on Ready messages I had already aborted. It was a mess. We had to take extra care over what order to start everything in, to avoid flooding the system with messages that should be disregarded. After a while we started asking ourselves why we were putting ourselves through all of this. The whole point of asynchronous messaging was to save us trouble, to give us added fault tolerance; instead it was adding points of failure and making things really hard to debug when things stopped working.
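To make the stale-message problem concrete, here is a minimal sketch of what a consumer would have to do to drop a Ready that has already been superseded by a later Abort for the same source point. All the names here (message types, point IDs, sequence numbers) are invented for illustration, and a plain in-memory queue stands in for JMS. Note that even this only fixes half the problem: the Abort whose Ready was dropped still gets processed, which is exactly the kind of mess the article describes.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class StaleMessageDemo {

    // Hypothetical message shape; a real system would read these
    // fields from a javax.jms.Message.
    record SourceMessage(String type, String pointId, long seq) {}

    // A READY is stale if a later ABORT for the same point is already
    // waiting in the backlog behind it.
    static boolean isStale(SourceMessage msg, Queue<SourceMessage> pending) {
        if (!msg.type().equals("READY")) return false;
        for (SourceMessage later : pending) {
            if (later.pointId().equals(msg.pointId())
                    && later.type().equals("ABORT")
                    && later.seq() > msg.seq()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The backlog that piles up while a downstream component is down:
        // Ready, then Abort, then a fresh Ready for the same point.
        Queue<SourceMessage> backlog = new ArrayDeque<>();
        backlog.add(new SourceMessage("READY", "point-42", 1)); // stale
        backlog.add(new SourceMessage("ABORT", "point-42", 2));
        backlog.add(new SourceMessage("READY", "point-42", 3)); // current

        while (!backlog.isEmpty()) {
            SourceMessage msg = backlog.poll();
            if (isStale(msg, backlog)) {
                System.out.println("dropping stale " + msg.type() + " seq=" + msg.seq());
            } else {
                System.out.println("processing " + msg.type() + " seq=" + msg.seq());
            }
        }
        // prints:
        // dropping stale READY seq=1
        // processing ABORT seq=2
        // processing READY seq=3
    }
}
```

The look-ahead only works because the whole backlog is visible at once; with messages trickling in over a live queue, the consumer has no way to know whether an Abort is about to arrive.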
We ended up going back to synchronous RPC. Our code didn’t change much, but it worked much better at runtime. It turned out that we didn’t actually have a problem with components crashing and causing downtime. In the new version, when I got a Ready message and contacted another service, I would know immediately if it was unavailable. If I got a socket exception of some kind, I could immediately tell the vibes to stand down. We had regained knowledge of when our system was operational and when it wasn’t; asynchronous messaging had taken that away from us. In aiming for fault tolerance we had built “The Black Knight” of real-time systems: a system that, upon losing a limb, would keep going as if everything were perfectly OK. The result was ridiculous. If your arm is off, you should stop fighting, or at the very least be aware of that fact.
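The synchronous version is simple to sketch. The interface and names below are invented for illustration, but the shape is the point: a blocking call either returns an answer or throws, so the caller finds out right away that the other side is down and can tell the vibe to stand down.

```java
import java.io.IOException;

public class SyncCheck {

    // Hypothetical downstream service; a real one would sit behind a socket.
    interface RecorderService {
        boolean readyToRecord(String pointId) throws IOException;
    }

    // Returns true if the vibe may proceed; false means "stand down now".
    static boolean handleReady(String pointId, RecorderService recorder) {
        try {
            return recorder.readyToRecord(pointId);
        } catch (IOException e) {
            // A socket/IO failure is an immediate, unambiguous answer:
            // the other component is unreachable, so stand the vibe down.
            return false;
        }
    }

    public static void main(String[] args) {
        RecorderService up = pointId -> true;
        RecorderService down = pointId -> {
            throw new IOException("connection refused");
        };

        System.out.println(handleReady("point-42", up));   // true
        System.out.println(handleReady("point-42", down)); // false
    }
}
```

Contrast this with the asynchronous version, where the same failure shows up as silence and the caller can only guess.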
I realize that most people aren’t writing real-time control systems for seismic data acquisition, but I think it’s important to communicate that asynchronous messaging is not appropriate in every case. I understand the case for asynchronous and event-based systems, but we need to realize what we are giving up. First off: simplicity. Synchronous method calls are a lot easier to read at compile time, and to debug at runtime. Second: control. I dislike asynchronous event-based systems for the same reason I like functional programming. Calling a pure function gives me control: I know what is happening, and there are no side effects I can’t see. Posting a message or event on a queue gives me no control whatsoever; anything could happen. Debugging systems like that is a lot harder. Maintaining them can be tricky too, as the effects of events are less obvious and you’re more likely to introduce bugs.
There are good reasons for going with asynchronous messages, but it depends on the type of problem you are trying to solve. If you can get away with synchronous calls, I’d say you definitely should: it’s much simpler, and you stay in control. I saw a quote on Twitter the other day saying that “Synchronous RPC is the crack cocaine of distributed programming” (from @mjpt777). Well, that sounds good to me :-D
(OK, OK, ok, I don’t endorse drugs. Just say NO, kids. I have in fact never tried crack cocaine, but I hear it is better for you than alcohol…)