I've often been faced with two devices and a piece of wire,
over which the two devices must exchange critical data.
Whether the wire is two feet long or two miles, 300 baud
RS/232 or Ethernet, you can reduce the incidence of those
panicky 3:00 AM phone calls with a little planning toward
the day it gets cut.
Whenever you contemplate sending a message over a communication
channel, you must consider two scenarios which are unlikely
to happen on any individual message, but certain to happen
eventually:
- The message arrives garbled, with missing or randomized data
- The message doesn't arrive at all
The situation is complicated a bit by the fact that much industrial
equipment (like PLC's, motor controllers, and scales) is "programmed"
in pidgin languages which may not be Turing complete or which may
present unusual performance bottlenecks.
The obvious solution to garbled-message detection is to run a
checksum on the data. While there are some elaborate checksum
schemes which can allow partial data reconstruction, I've found
in practice that the simplest way to deal with a bad checksum is
to ignore the message completely, as if it never arrived. This
assumes, of course, that you've dealt adequately with the
case of the message that never arrives at all.
Checksums should not interfere with clean parsing of the incoming
message. One control protocol I know of uses this format: an STX
byte (chr$(2)), the message data, a CR (chr$(13)), and then a
one-byte checksum.
This looks good at first glance; when parsing, you can ignore
data until you read chr$(2), fill buffer until you read chr$(13),
and take the next byte as the checksum. The problem? The checksum
can take any 8-bit value 0-255. What happens if it works out to 2?
You can deal with it if you think of it when you write the driver,
but it's not pretty.
A related problem came up when I tried to emulate this protocol on
another device which, because its firmware used the standard C string
library to handle communications, could not transmit the ASCII value
zero. I had to detect this situation and change a relatively unimportant
bit of the datastream to keep the checksum from assuming this unsendable
value.
I have had good results reserving the high ASCII values 128-255 for
a simple additive checksum, with the high bit marking the byte as
both a checksum and the End Of Message. If a failure rate of
1 in 128 is not good enough, I prefer to use multiple bytes and
render the checksum as a delimited value itself, perhaps 4 hexadecimal
digits. It doesn't pay to get too complicated with the scheme, since
I've never had a situation where it made sense to try to reconstruct
the message from the garbled data.
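The scheme can be sketched in a few lines of Python (frame and parse are hypothetical helper names, and the payload is assumed to stay in the 7-bit ASCII range, as the reserved-high-values rule requires):

```python
def frame(payload: bytes) -> bytes:
    # Sum the payload bytes, keep the low 7 bits, and set the high
    # bit.  The result (128-255) doubles as both checksum and End Of
    # Message, since legitimate payload bytes stay below 128.
    checksum = (sum(payload) & 0x7F) | 0x80
    return payload + bytes([checksum])

def parse(message: bytes):
    # Verify the trailing checksum byte; on a mismatch, treat the
    # message as if it never arrived.
    payload, received = message[:-1], message[-1]
    if received != ((sum(payload) & 0x7F) | 0x80):
        return None
    return payload
```

Because the checksum carries only 7 bits of information, an undetected corruption slips through roughly 1 time in 128, which is the failure rate mentioned above.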
Many control devices which can handle strings cannot perform a checksum,
or cannot do it in a reasonable time because they are implemented so
inefficiently. I've found that most industrial controls don't use a
fraction of the bandwidth available to them, so the simplest way to
deal with this is to simply send the message twice, with a single
delimiter. This works especially well with devices that are capable
of receiving N characters into a fixed buffer and comparing
buffers, but not of calculating checksums. You just receive both
iterations, compare them, and if they match, parse whichever copy
is in the more convenient register.
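The receive-and-compare step can be sketched in a few lines of Python (the "|" delimiter is an assumption for illustration):

```python
DELIM = b"|"  # hypothetical single delimiter between the two copies

def parse_doubled(raw: bytes):
    # The sender transmits the message twice as "copy1|copy2".
    # If the halves disagree, or one is missing, discard everything.
    first, sep, second = raw.partition(DELIM)
    if sep and first and first == second:
        return first
    return None
```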
Another technique which works well on devices which can't calculate a
checksum is the range check. Say your temperature probe is sending
the temperature every second, and you receive a data stream like:
72.4, 72.5, 72.3, 381.9, 72.4, 72.5.
Discarding the odd value is a no-brainer. Most physical processes
cannot change faster than certain outside limits, and it makes sense
to check input values (especially when there are no other error checks)
to make sure the input variance is within natural bounds.
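A sketch of such a range check in Python; MAX_DELTA is an assumed physical limit for the process, not a value from any real instrument:

```python
MAX_DELTA = 5.0  # assumed limit on how far the process can move per sample

def plausible(previous: float, reading: float) -> bool:
    # Reject any reading that implies a physically impossible jump.
    return abs(reading - previous) <= MAX_DELTA

def filter_readings(readings):
    # Compare each reading against the last known-good value and
    # discard outliers.
    good = []
    for r in readings:
        if not good or plausible(good[-1], r):
            good.append(r)
    return good
```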
It should go without saying, but never put a character into a buffer
unless you have checked that you are really putting it into a buffer.
Oddly enough this is not much of a problem in pidgin control languages, but
it's epidemic in mainstream C applications. All it takes is
one lost delimiter to concatenate two messages and overrun
a storage buffer. Never build in an assumption about incoming
data that must hold true to keep the application from crashing.
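A minimal sketch of that discipline, assuming a 256-byte worst-case legitimate message:

```python
MAX_MSG = 256  # assumed worst-case legitimate message length

def receive_byte(buf: bytearray, byte: int) -> None:
    # Never append without checking there is room.  If a delimiter
    # was lost and two messages ran together, dump the runaway buffer
    # and resynchronize rather than overrun storage.
    if len(buf) >= MAX_MSG:
        buf.clear()
    buf.append(byte)
```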
Messages can get lost for many reasons. You might discard one because it
failed the checksum or duplication check. The EOM delimiter might be
lost, so that you read it as the first half of the following message.
The cable might be unplugged, then hastily reconnected. If you use
the Microsoft MSCOMM control, and your application is in a tight loop
when the delimiter arrives, the OnComm event will be *cough*lost*cough*.
Your application should recover from all these situations.
While it's not necessary, for simplest debugging one end of the link
should have the responsibility for re-establishing comms after a timeout.
If you do this at both ends, you can get into a real mess. Note that
the end that reconnects is not necessarily the "master" in terms of
data sourcing or functionality; I've written several systems in which
the less important "slave" device initiates connection, so that the
comm link goes idle when it isn't in use. I'll call the end that
initiates contact the "active" partner, and the end that only answers
queries the "passive" side.
I've gotten into the habit of having the active end poll the
line, even when there's no data to communicate. This keeps both
ends apprised of the fact that the line is open and functional; the
active end knows it isn't getting responses, and the passive end can
use a timeout to put up an error message for an operator, which may make
the difference between a hair-pulling nightmare and a quick fix for
some technician in the field.
It's possible for an answer from the passive device to pass a second
query from the active device on its way back. It's important for the
system to be able to figure out that a particular answer is the answer
to the last query that was sent, and not one of these "ghost replies."
(BTW, these "ghosts" are the #1 reason you don't want both ends to
retransmit after a timeout; you can get into some very snarly loops
that way.) You need to give each message an identifier. This needn't
be complicated; I've used a single byte "ARI = Anti Repeat Index" which
cycles 0-9 and repeats. This is almost always adequate to maintain
synchronization over a serial link, bidirectional Centronics, or
local Ethernet subnet. (The situation gets a bit more complicated
over wide area networks, and the increased complexity of the TCP/IP
protocol reflects this.)
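A sketch of the ARI scheme in Python (the class and method names are mine, invented for illustration):

```python
class ActiveEnd:
    # Each query carries a single digit 0-9 that cycles and repeats.
    # Only a reply echoing the digit of the *latest* query is
    # accepted; anything else is a ghost from an earlier timeout.
    def __init__(self):
        self.ari = 0

    def make_query(self, payload: str) -> str:
        self.ari = (self.ari + 1) % 10
        return f"{self.ari}{payload}"

    def is_current_reply(self, reply: str) -> bool:
        return reply[:1] == str(self.ari)
```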
If the active device receives a message whose ID doesn't match the ID
of the last query transmitted, it should not automatically retransmit
the last real query; let the timeout take care of that. If your incoming
streams are buffered, the extra responses will keep generating extra
queries that generate new extra responses forever, sucking up all the
bandwidth and locking up the whole comm link.
Sometimes Polling is Best
Messages coming in over the serial line aren't the only ones that
get lost. Some Intel CPU's (including both the 805x series and some
of the 80x86 line with built-in serial ports) are prone to losing
interrupts. I've already mentioned lost OnComm events. I've seen
some very hairy and elaborate schemes to kick CPU's which lose
interrupts, and I've found the safest way to use MSCOMM in Visual
Basic is to ignore the OnComm event entirely and poll the port every
100ms or so to see if any data have arrived. If an event signifying
incoming data is lost, you are buggered; but if you're using a timer
there will always be another clock pulse to kickstart the situation.
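The timer-driven alternative can be sketched like this in Python, where read_pending() is a hypothetical non-blocking read of whatever bytes the port has queued:

```python
def poll_tick(read_pending, buf: bytearray, on_message, eom: int = 13):
    # One timer tick (every 100 ms or so): drain whatever bytes have
    # arrived and emit any complete, EOM-terminated messages.  No
    # arrival event is needed -- the next tick always comes.
    for b in read_pending():
        if b == eom:
            on_message(bytes(buf))
            buf.clear()
        else:
            buf.append(b)
```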
A whole range of headaches arise when an extra message ends up in a
buffer which is never cleared. Your protocol should include intervals
when you know there is no legitimate data waiting, and you should
clear the buffer at that time. While the situations that can get
buffers out of sync are unusual, the condition can be amazingly
persistent when it occurs. Generally, when an active device has
sent a message and received a legitimate response, it should clear
its receive buffer before sending another request. Passive devices
don't need to bother, since the extra messages they send should be
ignored if the active device is properly serializing and discarding
redundant or nonsensical replies.
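A sketch of that active-end discipline in Python (ActiveLink and FakeWire are illustrative names; FakeWire stands in for a real port object):

```python
class ActiveLink:
    # After a successful transaction, the active end clears its
    # receive buffer so a stale straggler can't be mistaken for the
    # answer to the NEXT query.  `wire` only needs to provide
    # write(), read_reply(), and clear_input().
    def __init__(self, wire):
        self.wire = wire

    def transact(self, query: bytes):
        self.wire.write(query)
        reply = self.wire.read_reply()
        if reply is not None:
            self.wire.clear_input()
        return reply

class FakeWire:
    # Minimal stand-in used for illustration: one real reply queued,
    # plus a leftover "ghost" message from an earlier timeout.
    def __init__(self):
        self.pending = [b"OK", b"GHOST"]
    def write(self, query):
        pass
    def read_reply(self):
        return self.pending.pop(0) if self.pending else None
    def clear_input(self):
        self.pending.clear()
```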
TCP/IP as a Carrier
I know of several devices now shipping with Ethernet ports and using
TCP/IP for the kind of data once shipped over star-topology serial
cable networks. Since a TCP/IP socket is equivalent to a serial link
which is transparently error-corrected, this can be a cheap and
painless solution, even in some cases where the TCP/IP must be farmed
out over an RS/232 cable.
However, the timing of TCP/IP connections can be tricky.
Two applications stand out for their timing criticality in my experience.
One is obvious; I have programmed a machine which must relay size data
to a slave controller within about 1/2 second, and if the slave hasn't
been updated it must discard the piece because it has gone past the
first actuator. There isn't a lot of data involved; individual packets
are in the 20 byte range. But the timeout structure is critical, because
any detection and correction must happen quickly. In my implementation
the data are sent 60 times a second (at 9600 baud) until the slave
acknowledges that particular serialized item. This gives it about 30
chances to receive the data before disaster occurs.
Another less obvious timing-critical application is the remote user
terminal. When an operator hits [F1] and there is a noticeable delay
before the display updates, it's a major nuisance which will poison
a customer's attitude toward your equipment.
While the bandwidth of Ethernet makes the overhead of TCP/IP look like
a non-problem, the implementation of TCP/IP can cause surprising
headaches because it is not possible to force the transmission of the
data you have "sent" with the Berkeley Sockets SEND command. The
pending data are generally flushed to the wire when a RECV is issued,
but this isn't guaranteed. What is guaranteed is that if you use multiple
SENDs to build an output before going to RECV, you will almost always
be hit with the 0.2 second "Nagle delay" which waits to make sure no
more data is incoming before assembling the datagram. If you are
depending on a rapid back-and-forth conversation between devices Nagle
can murder your performance.
You can turn the Nagle delay off, but then you may pay a 40:1
bandwidth penalty wrapping individual characters in datagram headers if you're
not careful. Multiple devices doing this can choke even an Ethernet subnet
very quickly. You can usually avoid Nagle by building your entire output
into a single SEND, but this can introduce interesting performance
bottlenecks into your application. (Hint in VB: Use Join rather than
successive string concatenation.)
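The single-SEND approach translates directly to any sockets API; a Python sketch (the message parts are made up for illustration):

```python
import socket

def send_whole(sock, parts):
    # Build the entire outgoing message first, then hand it to the
    # stack in a single sendall(); a series of small send() calls is
    # what invites the Nagle delay.
    sock.sendall(b"".join(parts))

# To trade bandwidth for latency instead, Nagle can be disabled
# per-socket on a real TCP connection:
#   sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```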
I know of at least two companies which have implemented timing-critical
comms via TCP/IP on Ethernet subnets, both of which place massive
restrictions on the hubs, routers, and devices which may "legally" be
placed on those subnets so as not to bog down the comms. I consider
this really bad design; it takes a standard interface which, by definition,
invites any compliant device to hook up, and breaks it by severely
limiting what may actually be hooked up to that "standard" hookup.
One day someone will plug their Rio into the hub to download some
MP3's and the guys in the plant will have no idea why chicken quarters start
flying off the end of the sizer.
The most important thing when implementing a protocol to connect two
applications -- even in some cases when they are running in different
threads on the same computer -- is to always provide for the possibility
that any given message might be lost, garbled, or arrive as a blast of random
noise. The system should designate one end to be responsible for
re-establishing comms when they are broken, so that things are
seamlessly re-synchronized when an error occurs.
I test all my applications by deliberately unplugging cables while test
data are being run. A well written application will provide coherent
error messages and will not crash no matter when you do this, and it will
cleanly and consistently resume operation when the cables are plugged back
in. It will also be forgiving if you "accidentally" plug it into, say,
the instrument that blasts out continuous data in an incompatible format.
Use these hints to get your apps to pass these tests, and they might even
hold up when your customers get medieval on them.