Communication Protocol Design Basics

By localroger in Technology
Thu Jul 04, 2002 at 03:28:43 AM EST
Tags: Software

In over a decade of writing industrial control software, I've often had to coordinate operation between programs running on different devices, often over wires that are prone to RFI, bad connections, water intrusion, and being cut accidentally. This is a summary of my experience in how to write a protocol which will fail gracefully when such things happen.

I will also make some notes on the increasingly common practice of using TCP/IP for transparent error correction, and some situations in which this might not be appropriate.


I've often been faced with two devices and a piece of wire, over which the two devices must exchange critical data. Whether the wire is two feet long or two miles, 300 baud RS/232 or Ethernet, you can reduce the incidence of those panicky 3:00 AM phone calls with a little planning toward the day it gets cut.

Whenever you contemplate sending a message over a communication channel, you must consider two scenarios which are unlikely to happen on any individual message, but certain to happen eventually:

  • The message arrives garbled, with missing or randomized data
  • The message doesn't arrive at all
The situation is complicated a bit by the fact that much industrial equipment (like PLCs, motor controllers, and scales) is "programmed" in pidgin languages which may not be Turing complete or which may present unusual performance bottlenecks.

Error Checking

The obvious solution to garbled-message detection is to run a checksum on the data. While there are some elaborate checksum schemes which can allow partial data reconstruction, I've found in practice that the simplest way to deal with a bad checksum is to ignore the message completely as if it never arrived. This assumes, of course, that you've dealt adequately with the missing-message problem.

Checksums should not interfere with clean parsing of the incoming message. One control protocol I know of uses this format:

[STX]-D-A-T-A-[CR][CKSUM]

This looks good at first glance; when parsing, you can ignore data until you read chr$(2), fill buffer until you read chr$(13), and take the next byte as the checksum. The problem? The checksum can take any 8-bit value 0-255. What happens if it works out to 2? You can deal with it if you think of it when you write the driver, but it's not pretty.

A related problem came up when I tried to emulate this protocol on another device which, because its firmware used the standard C string library to handle communications, could not transmit the ASCII value zero. I had to detect this situation and change a relatively unimportant bit of the data stream to keep the checksum from assuming this unsendable value.

I have had good results reserving the high ASCII values 128-255 for a simple additive checksum, with the high bit marking the byte as both the checksum and the End Of Message delimiter. If a 1-in-128 chance of a garbled message slipping through is not good enough, I prefer to use multiple bytes and render the checksum as a delimited value itself, perhaps 4 hexadecimal digits. It doesn't pay to get too complicated with the scheme, since I've never had a situation where it made sense to try to reconstruct the message from the garbled data.
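As a rough sketch of that scheme (in Python, with made-up names): the body bytes stay in the range 0-127, and the final byte both terminates the message and carries the checksum with its high bit set.

def frame(payload):
    # Body bytes must stay below 128 so only the terminator has the high bit set.
    assert all(b < 0x80 for b in payload)
    cksum = sum(payload) & 0x7F                   # simple additive checksum, 7 bits
    return bytes(payload) + bytes([0x80 | cksum]) # high bit = checksum + End Of Message

def parse(msg):
    # Returns the payload, or None to treat the message as if it never arrived.
    payload, last = msg[:-1], msg[-1]
    if not (last & 0x80):
        return None                               # no terminator byte: not a complete message
    if (sum(payload) & 0x7F) != (last & 0x7F):
        return None                               # bad checksum: discard silently
    return payload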

Many control devices which can handle strings cannot perform a checksum, or cannot do it in a reasonable time because they are implemented so inefficiently. I've found that most industrial controls don't use a fraction of the bandwidth available to them, so the simplest way to deal with this is to simply send the message twice, separated by a single delimiter. This works especially well with devices that are capable of receiving N characters into a fixed buffer and comparing buffers, but not of calculating checksums. You just receive both copies, compare them, and if they match, parse whichever copy is in the more convenient register.
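A minimal sketch of the send-twice check, assuming a delimiter byte that can never appear inside the data itself:

DELIM = b"\r"    # assumed delimiter that cannot occur in the data

def check_doubled(raw):
    # raw is one complete transmission: copy, delimiter, copy
    first, sep, second = raw.partition(DELIM)
    if not sep or first != second:
        return None          # delimiter missing or copies disagree: treat as never received
    return first             # parse whichever copy is in the more convenient register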

Another technique which works well on devices which can't calculate a checksum is the range check. Say your temperature probe is sending the temperature every second, and you receive this data stream:

532.3[CR]
532.4[CR]
53432.3[CR]
532.3[CR]

Discarding the odd value is a no-brainer. Most physical processes cannot change faster than certain outside limits, and it makes sense to check input values (especially when there are no other error checks) to make sure the input variance is within natural bounds.
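A sketch of such a range check; the limits here are obviously made-up numbers for a hypothetical probe:

MAX_DELTA = 5.0            # largest physically plausible change between readings (made up)
LOW, HIGH = 0.0, 999.9     # absolute limits the probe can legitimately report (made up)

def plausible(reading, last_good):
    if not (LOW <= reading <= HIGH):
        return False
    if last_good is not None and abs(reading - last_good) > MAX_DELTA:
        return False           # changed faster than the physical process can move
    return True

A reading that fails the check is simply thrown away and the last good value carried forward until a believable one arrives.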

Buffer Integrity

It should go without saying, but never put a character into a buffer unless you have checked that there is actually room for it. Oddly enough this is not much of a problem in pidgin control languages, but it's epidemic in mainstream C applications. All it takes is one lost delimiter to concatenate two messages and overrun a storage buffer. It is never safe to make any assumption about incoming data which must be true to keep an application from crashing.
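A minimal sketch of that discipline (Python, hypothetical names): the receive routine checks for room before storing each byte, and resets rather than overrunning if a delimiter goes missing.

BUF_SIZE = 256             # fixed storage; made-up size
EOM = 0x0D                 # carriage return as the end-of-message delimiter

class Receiver:
    def __init__(self):
        self.buf = bytearray()

    def feed(self, byte):
        # Returns a complete message, or None while still accumulating.
        if byte == EOM:
            msg, self.buf = bytes(self.buf), bytearray()
            return msg
        if len(self.buf) >= BUF_SIZE:
            self.buf = bytearray()     # lost delimiter: start over rather than overrun
            return None
        self.buf.append(byte)
        return None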

Lost Messages

Messages can get lost for many reasons. You might discard one because it failed the checksum or duplication check. The EOM delimiter might be lost, so that the message gets read as the first half of the following one. The cable might be unplugged, then hastily reconnected. If you use the Microsoft MSCOMM control, and your application is in a tight loop when the delimiter arrives, the OnComm event will be *cough*lost*cough*. Your application should recover from all these situations.

While it's not necessary, for simplest debugging one end of the link should have the responsibility for re-establishing comms after a timeout. If you do this at both ends, you can get into a real mess. Note that the end that reconnects is not necessarily the "master" in terms of data sourcing or functionality; I've written several systems in which the less important "slave" device initiates connection, so that the comm link goes idle when it isn't in use. I'll call the end that initiates contact the "active" partner, and the end that only answers queries the "passive" side.

I've gotten into the habit of having the active end poll the line, even when there's no data to communicate. This keeps both ends apprised of the fact that the line is open and functional; the active end knows it isn't getting responses, and the passive end can use a timeout to put up an error message for an operator, which may make the difference between a hair-pulling nightmare and a quick fix for some maintenance technician.

It's possible for an answer from the passive device to pass a second query from the active device on its way back. It's important for the system to be able to figure out that a particular answer is the answer to the last query that was sent, and not one of these "ghost replies." (BTW, these "ghosts" are the #1 reason you don't want both ends to retransmit after a timeout; you can get into some very snarly loops that way.) You need to give each message an identifier. This needn't be complicated; I've used a single byte "ARI = Anti Repeat Index" which cycles 0-9 and repeats. This is almost always adequate to maintain synchronization over a serial link, bidirectional Centronics, or local ethernet subnet. (The situation gets a bit more complicated over wide area networks, and the increased complexity of the TCP/IP protocol reflects this.)

If the active device receives a message whose ID doesn't match the ID of the last query transmitted, it should not automatically retransmit the last real query; let the timeout take care of that. If your incoming streams are buffered, the extra responses will keep generating extra queries that generate new extra responses forever, sucking up all the bandwidth and locking up the whole comm link.
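Pulled together, the active end's logic might look something like this rough sketch; send_query and receive_reply are hypothetical I/O helpers.

ARI_MODULUS = 10

def exchange(send_query, receive_reply, query, last_ari, timeout=1.0):
    # receive_reply returns (ari, data), or None if nothing arrives before the timeout.
    ari = (last_ari + 1) % ARI_MODULUS
    send_query(ari, query)
    reply = receive_reply(timeout)
    if reply is None:
        return ari, None          # timeout: the caller's normal retry cycle handles it
    reply_ari, data = reply
    if reply_ari != ari:
        return ari, None          # ghost reply to an old query: ignore it, don't resend
    return ari, data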

Sometimes Polling is Best

Messages coming in over the serial line aren't the only ones that get lost. Some Intel CPU's (including both the 805x series and some of the 80x86 line with built-in serial ports) are prone to losing interrupts. I've already mentioned lost OnComm events. I've seen some very hairy and elaborate schemes to kick CPU's which lose interrupts, and I've found the safest way to use MSCOMM in Visual Basic is to ignore the OnComm event entirely and poll the port every 100ms or so to see if any data have arrived. If an event signifying incoming data is lost, you are buggered; but if you're using a timer there will always be another clock pulse to kickstart the situation.
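A sketch of the polling approach, assuming the pyserial package (not part of the original system): the port is opened non-blocking and drained on a fixed schedule rather than relying on an event per delimiter.

import time
import serial                # assumes the pyserial package

port = serial.Serial("COM1", 9600, timeout=0)   # timeout=0 means reads never block
rx = Receiver()                                  # the bounded receiver sketched earlier

while True:
    for b in port.read(256):     # grab whatever has arrived, possibly nothing
        msg = rx.feed(b)
        if msg is not None:
            handle(msg)          # handle() is a hypothetical message processor
    time.sleep(0.1)              # poll every 100 ms; a lost event can't wedge the loop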

Buffer Management

A whole range of headaches arise when an extra message ends up in a buffer which is never cleared. Your protocol should include intervals when you know there is no legitimate data waiting, and you should clear the buffer at that time. While the situations that can get buffers out of sync are unusual, the condition can be amazingly persistent when it occurs. Generally, when an active device has sent a message and received a legitimate response, it should clear its receive buffer before sending another request. Passive devices don't need to bother, since the extra messages they send should be ignored if the active device is properly serializing and discarding redundant or nonsensical replies.
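With pyserial, for instance, flushing at the known-idle point right before a new query is a one-liner (sketch only; encode_query is a hypothetical framing helper):

def send_query_clean(port, ari, query):
    port.reset_input_buffer()               # discard stale or ghost replies still queued
    port.write(encode_query(ari, query))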

TCP/IP as a Carrier

I know of several devices now shipping with Ethernet ports and using TCP/IP for the kind of data once shipped over star-topology serial cable networks. Since a TCP/IP socket is equivalent to a serial link which is transparently error-corrected, this can be a cheap and painless solution, even in some cases where the TCP/IP must be farmed out over an RS/232 cable.

However, the timing of TCP/IP connections can be tricky.

Two applications stand out for their timing criticality in my experience. One is obvious; I have programmed a machine which must relay size data to a slave controller within about 1/2 second, and if the slave hasn't been updated it must discard the piece because it has gone past the first actuator. There isn't a lot of data involved; individual packets are in the 20 byte range. But the timeout structure is critical, because any detection and correction must happen quickly. In my implementation the data are sent 60 times a second (at 9600 baud) until the slave acknowledges that particular serialized item. This gives it about 30 chances to receive the data before disaster occurs.

Another less obvious timing-critical application is the remote user terminal. When an operator hits [F1] and there is a noticeable delay before the display updates, it's a major nuisance which will poison a customer's attitude toward your equipment.

While the bandwidth of Ethernet makes the overhead of TCP/IP look like a non-problem, the implementation of TCP/IP can cause surprising headaches because it is not possible to force the transmission of the data you have "sent" with the Berkeley Sockets SEND command. Transmission generally happens immediately when a RECV is issued, but it isn't guaranteed. What is guaranteed is that if you use multiple SENDs to build an output before going to RECV, you will almost always be hit with the 0.2 second "Nagle delay", which waits to make sure no more data is coming before assembling the datagram. If you are depending on a rapid back-and-forth conversation between devices, Nagle can murder your performance.

You can turn the Nagle delay off, but then you may take a 40:1 bandwidth hit wrapping individual characters in datagram headers if you're not careful. Multiple devices doing this can choke even an Ethernet subnet very quickly. You can usually avoid Nagle by building your entire output into a single SEND, but this can introduce interesting performance bottlenecks into your application. (Hint in VB: use Join rather than successive string concatenation.)
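A sketch in Python of both halves of that advice: disable Nagle if you must, but above all build the whole message before handing it to the stack. The address is made up.

import socket

sock = socket.create_connection(("192.0.2.10", 5000))       # address is made up
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # turn Nagle off, if you must

# Build the entire message first, then hand it to the stack in one call,
# rather than drip-feeding it with many small sends.
parts = [b"\x02", b"DATA", b"\r"]
sock.sendall(b"".join(parts))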

I know of at least two companies which have implemented timing-critical comms via TCP/IP on Ethernet subnets, both of which place massive restrictions on the hubs, routers, and devices which may "legally" be placed on those subnets so as not to bog down the comms. I consider this really bad design; it takes a standard interface which, by definition, invites any compliant device to hook up, and breaks it by severely limiting what may actually be hooked up to that "standard" hookup. One day someone will plug their Rio into the hub to download some MP3's and the guys in the plant will have no idea why chicken quarters start flying off the end of the sizer.

Summary

The most important thing when implementing a protocol to connect two applications -- even in some cases when they are running in different threads on the same computer -- is to always provide for the possibility that any given message might be lost, garbled, or arrive as a blast of random noise. The system should have a device which is responsible for re-establishing comms when they are broken, so that things are seamlessly re-synchronized when an error occurs.

I test all my applications by deliberately unplugging cables while test data are being run. A well written application will provide coherent error messages and will not crash no matter when you do this, and it will cleanly and consistently resume operation when the cables are plugged back in. It will also be forgiving if you "accidentally" plug it into, say, the instrument that blasts out continuous data in an incompatible format. Use these hints to get your apps to pass these tests, and they might even hold up when your customers get medieval on them.

Communication Protocol Design Basics | 24 comments (17 topical, 7 editorial, 0 hidden)
Just one question (none / 0) (#8)
by KWillets on Thu Jul 04, 2002 at 12:43:43 AM EST

Did you ever wire a battery to the DTR(?) line to keep it high?  I had some guy on the phone once who did that; I told him to use the damned command line utility because my stuff wasn't touching the port settings anyway.

That was a long time ago, but that was probably the funniest thing about doing serial port stuff.  That and doing it in FORTRAN.

battery? (none / 0) (#12)
by tzanger on Thu Jul 04, 2002 at 09:39:11 AM EST

Unless you have really weird serial ports you can just wire it to DSR (I'm assuming you're DCE here).



[ Parent ]
I still don't know why (none / 0) (#15)
by KWillets on Thu Jul 04, 2002 at 02:00:00 PM EST

I just remember talking on the phone with a customer  and he mentioned that he soldered a battery into one of the lines to keep it high.  I was puzzled, but quickly pointed out that he could type a command on his VAX that would do the same thing.  Strange.  

I assume it was just hardware-fixation.

[ Parent ]

When do you ever use sockets between threads? (none / 0) (#9)
by Kalani on Thu Jul 04, 2002 at 02:18:45 AM EST

If you're just testing a "talk session" locally then why wouldn't both ends be in different processes? If you just need a way to communicate between threads, why not use shared memory and event objects and critical sections? If you mean to exchange data between two processes that'll always be on the same machine, why not use pipes, mailslots, copied memory blocks, etc?

The only transmission protocols I've worked on have been for small wireless devices. It's a lot of fun. Great article.

-----
"Nothing says 'final boss' like a giant brain in a tube."
-- Udderdude on flipcode.com
Oops, terminology (5.00 / 1) (#11)
by localroger on Thu Jul 04, 2002 at 07:12:01 AM EST

I meant processes. I have so far managed to avoid having to ever resort to threads within a single process, and hopefully the stuff I do will remain simple enough for this to be the case for a long, long time. Or at least until there are non-miserable debugging tools for threaded processes.

I'm doing some stuff now where I use multiple processes in the way you sometimes see threads used within a single process, and I decided to use the TCP stack because (1) it's the only thing that seems to work the same on any platform, and (2) if I get it right I can scale the app over a network without rewriting anything. This is not, obviously, a universal solution, but it works in this case because the frequency and size of datagrams are both generally modest.

I can haz blog!
[ Parent ]

Some of my favorite things (5.00 / 5) (#10)
by tjb on Thu Jul 04, 2002 at 02:37:18 AM EST

As a communication-DSP programmer, there are a few more things I'd like to add:

First of all, if you're going to make a custom protocol and it doesn't have to be super low-cost, use forward error correction - real error correction. This isn't 1985, a Reed-Solomon codec chip can be bought real cheap now, there is no need to rely on TCP/IP. Viterbi (or Tornado) codecs are good too, and while they may be a better fit for QAM-style physical layers, nothing says that you can't slap a 3/4 coding-ratio Viterbi coder on a high-speed serial link. Some ADSL standards do Viterbi on the line-data as soon as it is received, and then pass the (possibly) corrected data to an RS decoder. This works well because the Viterbi doesn't correct much and doesn't spread errors around, while a Reed-Solomon decoder can correct a ton of data but an RS miscorrect will annihilate the damaged data, which leads me to my second point...

Make errors hurt and hurt badly. This is actually the thing that I believe is one of the most important points that localroger didn't mention. By the time data gets to any intelligent parser, it should either be 100% correct or 99.9% garbage. Even if only one bit is wrong, destroy multiple following bits. If the data is only slightly errored, it is much more likely that the parser will get confused and then come back into synch in a bad state. By implementing some sort of scrambler/descrambler (kind of like a CRC generator, except your polynomial generally works in the other direction, with your data in the last-in position rather than the first-in position), a 1-bit error will be multiplied across several bits and the parser should be able to say 'Hey, I'm *way* out of synch here, let's start again' with much less trouble. The chances of getting two bit errors that cancel each other out in a checksum or CRC calculation are relatively high (read: low, but possible), but with a scrambler in there, those two bit errors become maybe 20 bit errors and the chances of them resulting in a valid checksum or CRC are pretty low (read: exceedingly low in the lifetime of the universe). And scramblers are cool because they are very simple (just a few XORs) and self synchronizing - just keep feeding the data into the descrambler and when you flush the last errored bit through the last tap, it will begin to synchronize again.
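For what it's worth, a self-synchronizing scrambler along those lines is only a few lines of code. This sketch (Python, with an arbitrarily chosen 7-bit register and taps) shows why a single bit error blossoms into several, and why the descrambler falls back into sync once the bad bit flushes out of the register:

def scramble(bits, taps=(3, 6), reglen=7):
    # state holds the last reglen *output* bits, most recent first (all zeros to start)
    s, out = [0] * reglen, []
    for b in bits:
        y = b ^ s[taps[0]] ^ s[taps[1]]
        out.append(y)
        s = [y] + s[:-1]              # feed the scrambled bit back into the register
    return out

def descramble(bits, taps=(3, 6), reglen=7):
    # state holds the last reglen *received* bits; an errored bit corrupts the output
    # each time it passes a tap, then synchronization returns by itself once it flushes out
    s, out = [0] * reglen, []
    for b in bits:
        out.append(b ^ s[taps[0]] ^ s[taps[1]])
        s = [b] + s[:-1]
    return out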

And a few more things that are mostly just preferences, feel free to disagree:

1) Please, for the love of everything holy, pick an Endian-order and stick with it!  If I ever run into the designers of these protocols that make you bit-cross every freaking step of the way, I will most certainly *not* buy them a beer :)  I should not have to bit-cross before descrambling, then bit-cross again (!) to feed my CRC generator, and then byte-cross to check the physical layer framing, and then word-cross to parse out the data.  That is just evil and wrong and undeserving of the mortal version of ambrosia.

2) Separate your physical layer framing from your data. It's easier to implement this way, but if there are no physical-layer events to trigger this (like a symbol boundary), I suppose it would be hard to do.

3)  Use a fixed length format for your data-link layer.  Because I said so :)

And localroger, great article, I'll quit rambling now and go vote :)

Tim

serial drivers (4.00 / 2) (#13)
by tzanger on Thu Jul 04, 2002 at 09:53:24 AM EST

This looks good at first glance; when parsing, you can ignore data until you read chr$(2), fill buffer until you read chr$(13), and take the next byte as the checksum. The problem? The checksum can take any 8-bit value 0-255. What happens if it works out to 2? You can deal with it if you think of it when you write the driver, but it's not pretty.

What I have always done (and I'm sure this is standard practise) is to set a timer upon receiving STX. If you don't get your CR+checksum within xms, flush the buffer. It takes care of the problem quite nicely. You need a good missing message uh.. protocol.., as you said, if you'll forgive the circular definition. :-)

More realistically (I hate protocols which waste bandwidth/time with hardcoded headers) I set the timer upon receiving any data. After enough has been received that I can determine what the hell it is (partially parsing the received header to see if it's for me and seeing how big the data is, if that's available), I just keep dumping received characters into the buffer until either the datalen+footer is received or I run out of buffer space (say 100 bytes or so). If everything is received I clear the framing timer and check CRC. If it checks out I pass the clean packet up the chain. If not, I dump it. If the framing timer expires before all the data was received, I dump the packet.
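In rough Python, that receive loop ends up looking something like this sketch; frame_complete() and crc_ok() are stand-ins for the header/length parse and the CRC check, and the numbers are made up:

FRAME_TIMEOUT = 0.05     # made-up inter-character timeout, in seconds
MAX_FRAME = 100          # dump the packet if it won't fit

def receive_frame(read_byte, now):
    # read_byte() returns the next byte or None; now() returns the current time.
    buf, deadline = bytearray(), None
    while True:
        b = read_byte()
        if b is None:
            if deadline is not None and now() > deadline:
                return None                       # framing timer expired: dump the packet
            continue
        if deadline is None:
            deadline = now() + FRAME_TIMEOUT      # start the timer on the first byte
        buf.append(b)
        if len(buf) > MAX_FRAME:
            return None                           # out of buffer space: dump it
        if frame_complete(buf):
            return bytes(buf) if crc_ok(buf) else None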

I learned long ago to always have counters for tx/rx data, tx/rx packets, bad packet counts (runts, overruns, bad CRC, bad header data, etc.) and even when I pass things up the chain the packet has a unique ID of some kind (usually just a hash of the time and maybe the rx count or something). Debugging even simple protocols is tedious but those counters are a godsend.

I've been doing communications for a while now, too, and my biggest personal bugaboo is trying to design too much. You know, what to do when your queues get full and so on. They're the edge conditions and I hate the idea of just dropping things since it's not really nice, but memory and program space are limited and somewhere someone has to handle it. :-)



Timeouts (4.50 / 2) (#18)
by localroger on Thu Jul 04, 2002 at 08:19:16 PM EST

I really hate timeouts. I use them as infrequently as possible. If data are coming in, even if they are garbage, I use 'em to drive the reception process when I can.

In the case of the [CKSUM]=chr$(02), I stuck a flag in the driver which reset the process if two chr$(02)'s followed one another. The nature of the packet ensured that no [STX] would ever legitimately be followed by another. Unfortunately I've worked with some controllers that couldn't implement this check; once you told them "receive 37 characters" they were stuck until 37 anythings came in. It's not that hard when you have something PC-ish to work from, but when you're using these pidgin-language controllers it really becomes an art form.

I can haz blog!
[ Parent ]

For A Start (5.00 / 1) (#14)
by Baldrson on Thu Jul 04, 2002 at 12:33:09 PM EST

Read End to End Arguments In System Design.

It is considered a foundational paper for the Internet.

Basically, the idea is to keep the network stupid and simple while concentrating functions at the end-points of communication.

-------- Empty the Cities --------


Use layers (none / 0) (#16)
by KWillets on Thu Jul 04, 2002 at 02:13:15 PM EST

One thing I've found helpful is to use the layered protocol model so that I know what to do when.  That way if I have error correction, for example, on one layer, I don't have to worry about it on a higher level.  It's easy to fall into a trap of doing everything on one level, and having too much complexity.

The layers are also very modular, as we know with IP, etc.  The same building blocks can serve different applications.

Restrictions of layering (none / 0) (#17)
by mystic on Thu Jul 04, 2002 at 08:10:13 PM EST

OSI/DARPA layering concepts may make the life of the programmer easier, but they are messing up the lives of many people trying to improve TCP/IP.

The basic problem is that even though it is easier and beneficial to discover a problem with a packet (CRC error etc.) at the lower layers, the lower layer may not be the best place to decide what to do when such a situation occurs.

For example, even though a packet at the IP layer may be corrupted and fail the CRC check, maybe the error is only in the TCP data section of the packet and the TCP header is still intact. So much information can be gained from this uncorrupted TCP header, but what do we do? We just discard it!

The idea presented above is borrowed from the paper "TCP HACK: TCP Header Checksum Option to Improve Performance Over Lossy Links", In Proceedings of 20th IEEE Conference on Computer Communications (INFOCOM), Anchorage, Alaska, USA, April 2001.

[ Parent ]
Layers and more layers (none / 0) (#21)
by porkchop_d_clown on Mon Jul 08, 2002 at 11:40:36 AM EST

Layering is a design pattern that I dearly love - whether for communications or anything else; deal with something in the correct layer and don't worry about it anywhere else, and reuse the layers as needed.

The only problem is figuring out in advance how many layers you need and which layer should do what - only experience seems to be able to resolve that issue.


--
ACK.


[ Parent ]
Protocol design (4.00 / 1) (#19)
by sigwinch on Fri Jul 05, 2002 at 01:16:37 AM EST

Many control devices which can handle strings cannot perform a checksum, or cannot do it in a reasonable time because they are implemented so inefficiently. I've found that most industrial controls don't use a fraction of the bandwidth available to them, so the simplest way to deal with this is to simply send the message twice, with a single delimiter...
But you're screwed if you have data pattern-dependent errors. (E.g., cable reflections.)
I know of at least two companies which have implemented timing-critical comms via TCP/IP on Ethernet subnets, both of which place massive restrictions on the hubs, routers, and devices which may "legally" be placed on those subnets so as not to bog down the comms. I consider this really bad design; it takes a standard interface which, by definition, invites any compliant device to hook up, and breaks it by severely limiting what may actually be hooked up to that "standard" hookup. One day someone will plug their Rio into the hub to download some MP3's and the guys in the plant will have no idea why chicken quarters start flying off the end of the sizer.
This is one of the dumbest things I've ever read.

First, designing a custom physical link is a complete waste of time and money. Standard commercial links (like Ethernet) are cheaper, more reliable, and better performing. They take ~0 man-hours of design effort. And when something breaks you don't have to wait for the custom router (that went out of production five years ago) to be replaced: just buy an off-the-shelf router that meets the provided specs.

Second, the standard does not "invite" a goddamn thing. Just because the protocol allows John Q. Public to randomly wire things together does not prevent engineers from carefully designing a more optimal system.

Third, it is not inevitable that somebody will disrupt critical data networks. It is an utterly trivial matter to tag and mark the cables and routers as industrial data-only. (Telecom companies have done this forever for high-rel comm circuits; and power engineers and electricians have done it for AC mains that supply medical equipment.) If some fool decides the warnings don't apply to them, they get to have many exciting discussions with the lawyers. If the downtime entails pay cuts or furloughs, they get to have discussions of a more personal nature...

Fourth, industrial control isn't somehow special or exotic. Plenty of ordinary LANs have tough timing requirements: video-conferencing equipment, IP telephones (from Cisco and others), interactive data terminals, etc. If you can't segregate network traffic properly, you have no business touching any enterprise LAN.

--
I don't want the world, I just want your half.

Reply (4.66 / 3) (#20)
by localroger on Fri Jul 05, 2002 at 09:24:38 AM EST

This is one of the dumbest things I've ever read.

You must not read much then. It's a straightforward account of proven design principles; even if you disagree with them, I'm sure you have read something "dumber" at some point.

First, designing a custom physical link is a complete waste of time and money ... And when something breaks you don't have to wait for the custom router (that went out of production five years ago) to be replaced: just buy an off-the-shelf router that meets the provided specs.

That is exactly not what the company that did this did. They specifically require the custom router (that will go out of production one day) in order to meet the warranty on their system. I suppose there is a reason for that, maybe it's one of those even dumber things you never read.

Second, the standard does not "invite" a goddamn thing. Just because the protocol allows John Q. Public to randomly wire things together, does not prevent engineers from carefully designing more optimal system.

Here is another principle of design which you'd be well advised to heed in real life: In a plant, if you have two connectors that can mate, at some point someone will plug them into one another. It does not matter how many signs you hang on each saying they're not compatible. I have replaced boards in instruments half a dozen times because operators violated the clearly labelled instructions not to plug them into incompatible power or comm lines which happened to have similar connectors.

Third, it is not inevitable that somebody will disrupt critical data networks. It is an utterly trivial matter to tag and mark the cables and routers as industrial data-only.

And it's an utterly trivial matter for people to ignore those tags. BTW, in a plant, everything is industrial data.

If some fool decides the warnings don't apply to them, they get to have many exciting discussions with the lawyers. If the downtime entails pay cuts or furloughs, they get to have discussions of a more personal nature...

No, they will play CYA and blame the vendors. That's the way it plays in real life.

Fourth, industrial control isn't somehow special or exotic. Plenty of ordinary LANs have tough timing requirements: video-conferencing equipment, IP telephones (fom Cisco and others), interactive data terminals, etc. If you can't segregate network traffic properly, you have no business touching any enterprise LAN.

What's this "enterprise LAN" shit? We're talking about data lines run throughout a plant, on networks which the vendor has specified cannot ever be connected to the vast majority of equipment most people would think compatible with their system. Once those lines leave the phone closet it doesn't matter how they are tagged, some $10.00/hour technician who gets in a bind will try swapping them around one day.

Good designers deal with the shit that happens in the real world. Bad designers write arbitrary rules and cry "no warranty" when people who aren't aware of them break those rules and break their cheap, zero-design-effort "standard" systems.

I can haz blog!
[ Parent ]

A subject for my comment (none / 0) (#22)
by sigwinch on Tue Jul 09, 2002 at 04:10:28 AM EST

They specifically require the custom router (that will go out of production one day) in order to meet the warranty on their system.
Because there was no point in qualifying all of the dozens of routers that would have worked. And qualifying a new model is **CHEAP** compared to reverse-engineering an existing one, then designing and manufacturing a replacement. For the former, you can throw something online and pray that it works *now*, and qualify it with less than a week of engineering effort. For the latter, you don't get even a crappy replacement for weeks, and you'll spend at least a man-month on it.
In a plant, if you have two connectors that can mate, at some point someone will plug them into one another. It does not matter how many signs you hang on each saying they're not compatible.
So? If you plan for foolishness and low quality, you will find an abundance of it.
And it's an utterly trivial matter for people to ignore those tags.
In which case they, and the managers who are answerable for their actions, richly deserve what they receive. If no one is held responsible, the organization cannot help but have frequent disaster.
BTW, in a plant, everything is industrial data.
Except for things like telephones. And the LAN used for business data (bar code scanners, bills of lading).
What's this "enterprise LAN" shit?
A local area network that will cause the enterprise to be nonviable if it fails often.
We're talking about data lines run throughout a plant, on networks which the vendor has specified cannot ever be connected to the vast majority of equipment most people would think compatible with their system.
Replace "industrial control" with "voice" and you've just described an off-the-shelf TCP/IP telephone system: segregated cables, no direct link to outside data networks, hard real time requirements, and ultra-high-value data. Put it in a call center and the system becomes as critical as anything in an industrial plant. (In fact, some of the IP telephony systems go ever farther and stick power on a pair of wires that is unused by common Ethernet.)

And guess what: it works well, and is in fact preferred in many situations. Sure, an employee could try to play Quake across that LAN, but doesn't because he knows what the bill would be per second of downtime.

Once those lines leave the phone closet it doesn't matter how they are tagged, some $10.00/hour technician who gets in a bind will try swapping them around one day.
At which point the company will get what it deserves for keeping fools.
Good designers deal with the shit that happens in the real world.
Good companies hire good people, train them well, and inspire them to care and attention to detail.
Bad designers write arbitrary rules and cry "no warranty" when people who aren't aware of them break those rules and break their cheap, zero-design-effort "standard" systems.
You cannot stop fools from wreaking havoc. If you piss away vast amounts of money on custom comms just so they can't plug something in wrong, they'll just discover that putting a mild steel bolt on the bearing cap in an engine causes it to throw a rod, or that brake fluid isn't the same as hydraulic fluid, or that if you don't bother to stay around or use a written procedure while pressure testing a KC-135 Stratotanker the fuselage can explode.

--
I don't want the world, I just want your half.
[ Parent ]

Whee, buzztechs (none / 0) (#23)
by Miniluv on Tue Jul 09, 2002 at 10:14:59 AM EST

Ok, what's the obsession with pathetic VoIP implementations? Especially Cisco, whom you hint at but never seem to want to name? You consider it a "good" thing that Cisco decided to power their phones via wires marked reserved in the Ethernet standard? You consider it a safe vendor decision to ship said phones without standard AC adapters? You realize that this means you must buy an upstream device (i.e. switch or hub) which supports the proprietary Cisco power over ethernet's sloppy seconds protocol?

You seem to think that vendors can just arbitrarily dictate the entirety of the environment in which their products can be used. In theory, it sounds great, and IBM made it work for a long, long time before that model collapsed under its own weight. These days vendors need to actually serve their customers, and that means designing systems that fail gracefully, even when the customer does something truly boneheaded.

Good companies hire good people, train them well, and inspire them to care and attention to detail.
Sadly, most customers won't be good companies. Furthermore, people make mistakes, even at good companies. I'd hate to be around a system that breaks just because I plug Ethernet Cable 1 into Ethernet Receptacle 3, just because you never thought I would. Even if the manual clearly states that Heaven will be ripped asunder when I do, you should've thought I might.


"Too much wasabi and you'll be crying like you did at the last ten minutes of The Terminator" - Alton Brown
[ Parent ]

Cool Communication Protocol Book (none / 0) (#24)
by GGardner on Thu Jul 11, 2002 at 11:55:34 PM EST

Gerard Holzmann wrote a really cool book about pre-computer communication protocols. Ancient Greeks and their signal fires, semaphore codes at sea, and railroad signaling. All the same issues as with modern computer protocols were faced by our ancestors!

amazon link. That's ISBN: 0818667826 for the amazon-intolerant.
