Reliability and hot standby

Fault Tolerance in SCADA and DCS

Fri, 11 Jul 1997 16:56:00 +1000

One area where the industry does not seem to have made much progress is "fault tolerance".

For many years now we have seen "redundancy" of major components as the answer to reliability and availability. For some more critical applications such as safety and emergency shutdown applications, we have seen triplication and voting techniques used to increase reliability. Some individual controllers also use this technique.
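As a rough illustration of the triplication and voting idea mentioned above, here is a minimal Python sketch of a two-out-of-three voter. The function name `vote_2oo3` and the "fail safe on no majority" behaviour are assumptions made for the example, not a description of any particular vendor's product.

```python
from collections import Counter


def vote_2oo3(a, b, c):
    """Vote on the outputs of three redundant channels.

    Returns (value, healthy):
      value   - the majority output, or None if all three disagree
      healthy - True only when all three channels agree
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # No majority at all: report no valid value (assumed fail-safe choice).
        return None, False
    return value, count == 3


# One channel disagrees: the majority still wins, but the fault is flagged.
print(vote_2oo3(True, True, False))  # (True, False)
```

The second return value matters as much as the first: a voter that silently masks a failed channel leaves the system running on two channels without anyone knowing.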

An interesting article in the April 1997 issue of Computer magazine by Algirdas Avizienis of UCLA has piqued my interest in what the SCADA and DCS industries are doing to build more fault tolerance into their products.

Would vendors, consultants and users care to comment?

 Thomas H Fox				Sinclair Knight Merz
 Manager SCADA  Telemetry Projects	PO Box H615
 tfox@skm.com.au				PERTH WA 6001 Australia
 Phone: +61 9 268 4440			Fax: +61 9 268 4444

Re: Fault Tolerance in SCADA and DCS

Tue, 22 Jul 1997 18:46:28 +0800

This question probably deserved more comment. I don't entirely agree with the view that not much progress has been made.

Firstly, I think that systems are much more reliable, largely because the components are better proven. For example, which is more reliable: a 1970s minicomputer such as the PDP-11/34, or today's Pentium PC? Which character generators are better: the ones that took a cupboard to house them, or today's graphics card in a PC?

I also believe the technology is better proven. Hot standby systems at one time could actually make the system more unreliable due to the increased complexity they introduced. Offer me a beer and I will tell you some horror stories. Today that is not the case.

Better computing power means devolving the computing further down into the field, and this means more reliability.

Reliability can be built in at more fundamental levels. I recently looked at a system that required a control loop over 20 kilometres. As I was in a position to influence the hydraulic design, we found that we could introduce a small tank (required to control water hammer anyway), and reduce this to two control systems each operating locally (ie tank and pump at one site and second tank and valve at the second site). Much better. And it didn't cost any more. And the increased reliability will not be a SCADA function!

So although we still use redundancy, there are other considerations. And we have made redundancy more reliable as we understand it better.

I suppose that we have got better at the things we always used to do, but there is no 'silver bullet' which has solved the problems in one go. A gradual improvement has occurred.

Does anyone else have any experiences or views to add to this debate?

Regards

 Ian Wiese                  Ph 6189 420 2610
 Tek Soft Consulting        Hm 6189 448 7487
 http://www.iinet.net.au/~ianw
 ianw@iinet.net.au          Fax 6189 420 3179

Re: Fault Tolerance in SCADA and DCS

Sun, 27 Jul 1997 23:27:44 +1000 (EST)

I agree with Ian that a better appreciation of the complexity of hot standby systems has been gained, but I'm not so sure the hardware of today is that much better than that of the recent past (e.g. the floating point problems in the Pentium). I'm also quite sure that the challenge of providing redundant hot standby systems in a classic SCADA architecture has increased significantly with the explosion in memory, CPU speed and software complexity.

Hot standby has always been taken for granted in any serious classic SCADA system that must monitor an important resource or process. In the not too distant past, SCADA systems were simpler: they provided less functionality, needed less software to achieve it, could store less data, and were generally written from the ground up. That meant the chances of achieving a workable hot standby system were pretty good.

To give an example: the LN 2055 SCADA system was originally built with a 2 Mbyte helium-filled hard disk, up to 128 KByte of core memory (that's the ferro stuff, mind) and a CPU built from discrete ICs on a PC board the size of a picnic table, operating at about 1 MHz. To the credit of the engineers who put these systems together, I have seen them quite happily handling around 8000 points spread across 300 RTUs, with poll times as fast as the 1200 baud comms would allow, 4 operators simultaneously plugging away and not a murmur of discontent about performance. Moreover, these systems are still in use, and have provided excellent reliability and availability, in part due to the hot failover capability.

[Image: LN 2055 SCADA system]

Contrast this with a typical SCADA implementation these days: gigabytes of disk space, hundreds of megabytes of main memory, 300 MHz plus CPU speed, megabyte-per-second LAN links, third-party relational databases and a much richer set of behaviour (i.e. much more software, much more data).

I think the logical conclusion is that the job of providing hot standby capability must be considerably harder with systems built these days. I don't think it's simply a case of "same strategy, more data, faster comms" to provide the redundant hot standby solution.

My question some weeks back about the future of "Classic" SCADA was prompted, in part, by my perception that vendors of systems that look after networks or processes where safety and continuation of supply and/or production are critical are continually faced with the task of striking a balance between delivering more complex behaviour and providing acceptable reliability and availability. While the two are not mutually exclusive, I think delivering on both is not easy to achieve with a classic SCADA architecture.

The evolution of the SCADA architecture such that intelligence is devolved further into the field, as suggested by Ian, is one way that the overall reliability of a system could possibly be improved, but I agree with Thom that not much progress has been made overall in the area of "fault tolerance".

Bill Tarlinton.

Re: Fault Tolerance in SCADA and DCS

Mon, 28 Jul 1997 13:13:55 +0800

At end Automatic Teller Machine networks, etc. Now if the SCADA system software is written to accepted Open Systems standards, it can normally run in this hardware redundancy environment with little or no change.

PCs are catching up too. Compaq now offer server systems with different levels of redundancy (mainly Unix, but Windows NT stuff using new - but non standard - clustering software). Some PC suppliers offer servers with hot swappable dual power supplies and disk arrays as standard.

In summary, as far as master stations go, traditional software-driven redundancy may well soon become obsolete as the hardware guys "build it in".

 |-------------------------------------------|
 | Gregory J Smith                           |
 | Vector Systems Integration         ,-_|\  |
 | gregs@vecint.com.au             /     \ |
 | Perth, Western Australia ------- *_,-._/ |
 | P:618 9242 3396  F:618 9242 3397       v  |
 | WWW: http://www.vecint.com.au             |
 |    * Software Solutions for Industry *    |
 |-------------------------------------------|
 

Re: Fault Tolerance in SCADA and DCS

Mon, 28 Jul 1997 17:26:52 +1000

A bit off the topic, but nonetheless:

I suppose (from my burnt fingers) that when someone talks about "fault tolerance", you need them to be more specific:

Hardware generally carries a very high MTBF rating, and is usually not a limiting factor in system availability. Hardware often has a very simple mode of failure, i.e. it either works or it doesn't. This simple mode of failure allows for easy rules in determining when to switch over.

Software on the other hand (and it appears to be getting worse - with companies rushing to get product out the door) is generally less reliable than hardware and often has very complex modes of failure.

We have recently put together a system with many levels of redundancy at the hardware level, but the limiting factor on system availability is ALWAYS the software. Whilst we do have redundancy in the software, determining when and what to switch is never easy, and in some cases needs to be done manually.
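To put some arithmetic behind the point above, here is a small Python sketch using the standard steady-state formula availability = MTBF / (MTBF + MTTR). All the figures are hypothetical, and the model assumes independent hardware failures and a software fault that affects both servers alike (both run the same code), which is a simplification.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single repairable unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def parallel(unit_availability, n=2):
    """n redundant units; up if at least one is up (independent failures assumed)."""
    return 1 - (1 - unit_availability) ** n


def series(*parts):
    """All parts must be up for the system to be up."""
    result = 1.0
    for part in parts:
        result *= part
    return result


# Hypothetical figures, chosen only to illustrate the shape of the result.
hardware = availability(mtbf_hours=50_000, mttr_hours=8)
software = availability(mtbf_hours=2_000, mttr_hours=1)

# Duplicated hardware, but the same software running on both servers.
system = series(parallel(hardware), software)

print(f"single server hardware: {hardware:.6f}")
print(f"redundant hardware:     {parallel(hardware):.8f}")
print(f"software:               {software:.6f}")
print(f"whole system:           {system:.6f}")
```

Under these assumptions the duplicated hardware contributes almost nothing to the downtime, and the system figure is essentially the software figure.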

Software failures can often be subtle, and this makes 100% automated switchover difficult if not impossible. A hypothetical example: if one user interface on a multi-user system somehow crashes (and this often happens on Windows) and starts affecting response times for other users, do you switch over to a standby server? What would the rules be?
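One possible shape for such a rule, sketched in Python purely for discussion: switch only when the primary has been slow or silent for several consecutive checks and the standby looks healthy, and otherwise hand the decision to an operator. The function name, the 2 second limit and the three-check window are all hypothetical values picked for the example.

```python
def decide(primary_samples, standby_ok, limit_s=2.0, needed=3):
    """Return 'stay', 'switch' or 'alert'.

    primary_samples - recent response times of the primary in seconds,
                      newest last; None means the primary did not reply
    standby_ok      - True if the standby currently passes its own checks
    limit_s         - response time above which a sample counts as bad
    needed          - number of consecutive bad samples required to act
    """
    recent = primary_samples[-needed:]
    bad = [s is None or s > limit_s for s in recent]
    if len(recent) < needed or not all(bad):
        return "stay"  # a single slow reply is not enough to act on
    # Sustained degradation: switch if we safely can, else call a human.
    return "switch" if standby_ok else "alert"


print(decide([0.5, 3.0, 0.4], standby_ok=True))    # stay
print(decide([3.0, None, 2.5], standby_ok=True))   # switch
print(decide([3.0, None, 2.5], standby_ok=False))  # alert
```

Even this simple rule shows the difficulty: it cannot tell whether the slow responses come from the server or from the one crashed user interface, so a switchover may not cure the problem.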

My personal opinion (a bit simplistic) is that software vendors need to improve somewhat and start being responsible for the product they provide. For example, they could provide some (typical) MTBF figures for their software, as the hardware vendors do, and start providing some built-in mechanism for detecting and recovering from failures.

While we are on this thread, can we define the concept of 'system availability'?

Some customers are demanding "99.X% system availability" without specifying what 'available' means. For example, if you have a system comprising 150 RTUs and one RTU should fail, does that compromise overall system availability? What if there was a bug that only affected trending; is the system available?

Kevin Webster
TUSC Computer Systems
Melbourne, Australia

Re: Fault Tolerance in SCADA and DCS

Mon, 28 Jul 1997 20:59:38 +0200

IMHO, that's too simple; there is at least a three-state logic for the hardware:

- it works without any visible errors a) :-))

- it works, but sometimes shows erroneous behavior b) :-( :-)

- it doesn't work at all c) :-((

State b) causes the biggest problems for every software developer, and these problems are often blamed on the software ...
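The three states above can be sketched as a small Python classifier over a window of recent operation results. The names `Health` and `classify` are hypothetical, and treating "some but not all operations failed" as state b) is an assumption made for the example.

```python
from enum import Enum


class Health(Enum):
    OK = "a"            # works without any visible errors
    INTERMITTENT = "b"  # works, but sometimes misbehaves
    FAILED = "c"        # doesn't work at all


def classify(results):
    """Classify a device from recent results (True = operation succeeded)."""
    if not results:
        raise ValueError("need at least one result to classify")
    failures = results.count(False)
    if failures == 0:
        return Health.OK
    if failures == len(results):
        return Health.FAILED
    return Health.INTERMITTENT


print(classify([True, True, False, True]))  # Health.INTERMITTENT
```

States a) and c) are easy to act on; it is the middle state that needs a policy, because a device in state b) can look healthy at any single check.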

I know these solutions ... reset or switch off/on the hardware :-)

Yes, this is why adding software to the hardware increases the complexity of the system. The reasons for the 'very complex failure modes' are often unknown or unspecified hardware behavior: chip failures, undetected memory failure, timing problems ... and so on.

This depends on the software quality. We have customers running QNX in pellet production or power plants without reboots for several YEARS!

The same with the hardware ... it is sometimes very difficult to decide whether a signal from the hardware level is a valid signal or not ...

Use the right software basis for your solution!

The solution could be: let the hardware and software developers work together smoothly ... if possible :-)

Best regards
 
       ___/___           Armin Steinhoff      Armin@steinhoff.de
      /      /          STEINHOFF Automations-  Feldbus-Systeme
  ---/      /---------------------------------------------------
    /______/          +49-6431-529366   FAX  +49-6431-57454
      /              URL:       http://www.steinhoff.de
 
