The following discussion is part of an occasional series showcasing the ISA Mentor Program, authored by Greg McMillan, industry consultant, author of numerous process control books, 2010 ISA Life Achievement Award recipient, and retired Senior Fellow from Solutia, Inc. (now Eastman Chemical). Greg will be posting questions and responses from the ISA Mentor Program, with contributions from program participants.
Mike Laspisa has spent 37-plus years working in the instrumentation and control (I&C) discipline, including 32 years as a lead I&C engineer and manufacturing plant staff I&C engineer. Mike’s primary motivation is to advance the automation profession by sharing knowledge gained from plant experience as seen in the Control Talk columns, “Instrument specification: Where we are and where we should be,” “I&C Construction scope,” and “Instrument Index Insights.”
Mike Laspisa’s Question:
I have been thinking about the various definitions of “safe” as it applies to devices, I/O, control systems, and, most importantly, critical parameter measurements and processes in general.
For devices, we think primarily of control and on/off valve failsafe positions. What is best for both safety and process operations? Open, Closed, Forward/Recycle/Last position?
For control system I/O, we think of what should be done if an analog input is considered “bad”: hold the last good value, use a default value, switch the controller to manual, stop the process, etc. What is best for the process at this point in time? Boiler burner management system (BMS) controllers usually monitor I/O functionality with critical input check code and may also use redundant I/O. Redundant transmitters can be used for critical measurements. However, the control system must have a selection method to determine which transmitter to use. Selection methods include high/low selectors, average calculations (as long as the signals are within x% of each other), or 2 out of 3 (2oo3) voting for shutdown interlocks. Operators should be able to select, or remove, a transmitter for service as necessary. Transmitter signal failure should remove it from service automatically.
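As a rough illustration of the averaging method mentioned above (the function and variable names are hypothetical, and real selection logic lives in DCS or SIS function blocks rather than general-purpose code), the sketch below averages a redundant pair only while the two signals agree within a tolerance:

```python
# Minimal sketch of averaging a redundant transmitter pair (hypothetical names).
# Real selection logic is configured in DCS/SIS function blocks, not Python.

def select_redundant_pair(pv_a, pv_b, tolerance_pct, span):
    """Return the average of two transmitters while they agree within
    tolerance_pct of the span; otherwise flag the deviation so the
    operator (or a voting scheme) decides which signal to trust."""
    if abs(pv_a - pv_b) <= (tolerance_pct / 100.0) * span:
        return (pv_a + pv_b) / 2.0, "good"
    return None, "deviation alarm: selection required"

# Example: two level transmitters on a 0-100% span with a 2% agreement tolerance
pv, status = select_redundant_pair(51.2, 52.0, tolerance_pct=2.0, span=100.0)
```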
For control systems, we think of protecting the process from control system failures by using controller, I/O, or power supply redundancy and/or by segregating redundant field devices on different I/O cards. What about a PLC processor fault: should you turn off all outputs or just selected outputs?
In the last 15 years, shared data has become more and more prevalent. In several cases, passing data via Ethernet using OPC data managers has replaced hard-wired handshakes between controller platforms (DCS to PLC or between different CPUs). Heartbeat code needs to be added to verify that registers have live data. If the data pass is broken, what action needs to be taken to protect process operations?
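A minimal sketch of the heartbeat idea follows (the names are hypothetical and no particular OPC toolkit is assumed): the sending controller increments a heartbeat register each scan, and the receiver declares the data stale if the register stops changing within a timeout.

```python
# Rough sketch of a register heartbeat (hypothetical names, no real OPC API).
# The sender increments a heartbeat register each scan; the receiver declares
# the data stale if the register stops changing within a timeout.
import time

class HeartbeatMonitor:
    def __init__(self, timeout_s=10.0):
        self.timeout_s = timeout_s
        self.last_value = None
        self.last_change = time.monotonic()

    def update(self, heartbeat_register):
        """Call each scan with the latest heartbeat register value."""
        if heartbeat_register != self.last_value:
            self.last_value = heartbeat_register
            self.last_change = time.monotonic()

    def link_healthy(self):
        """False once the register has not changed within the timeout,
        at which point the receiving controller takes its fail action
        (hold last value, alarm, or drive outputs to a safe state)."""
        return (time.monotonic() - self.last_change) <= self.timeout_s
```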
For processes, we think about the safe states that will protect process quality and, hopefully, allow operations to correct a device failure or process upset and restart easily. Each process may require specific safe device states to maintain heating, cooling, pressure, vacuum, etc., or recycle to keep product in suspension and/or prevent pipeline blockages.
Greg McMillan’s Answer
I look forward to the Mentor resources in safety instrumented systems (SIS) chiming in here. My limited experience dates back about 40 years, to when I helped Monsanto develop internal guidance for interlock systems. My main input was providing guidance and perspective on how to find and classify the root causes of unsafe conditions. I was instrumental throughout my career in getting middle signal selection used, primarily in pH systems. I was part of an incredibly successful effort that dramatically reduced unplanned shutdowns of a challenging large intermediate chemicals plant from five to nearly zero per year through middle signal selection and through smoother, safer, and faster compressor startups and surge prevention by procedure automation.
In the ISA 5.9 Technical Report on PID Algorithms and Performance that has officially been issued for review, I conveyed my experience and insights on increasing signal reliability as follows.
The goals of signal selection for PID control are primarily to protect against failures and, in the process, to improve the accuracy and the 5Rs (reliability, resolution, repeatability, rangeability, and response time) of signals. With the advent of increasingly smarter digital transmitters, most of the remaining concern is often the sensor type and installation. Thus, the first step is redundancy of sensors and independence of installation to eliminate common mode failures. The next step is to decide whether to use Lo, Hi, or Middle Signal Selection (commonly referred to as Median Select), possibly supplemented by signal averaging, to enable the PID controller to continue to do its job of keeping the process in a good operating region.
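As a minimal sketch of Middle Signal Selection feeding a PID PV (the names are hypothetical; in practice this is a standard selector block in the control system):

```python
# Minimal sketch of middle signal (median) selection of three redundant transmitters.
def middle_signal_select(pv_1, pv_2, pv_3):
    """Return the median of the three signals, so a single failure of any type
    (upscale, downscale, or stuck at last value) cannot become the controlled PV."""
    return sorted((pv_1, pv_2, pv_3))[1]

pid_pv = middle_signal_select(7.02, 7.05, 9.80)  # the failed-high sensor is ignored
```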
Where profiles (e.g., concentration, phase, temperature, or velocity) are not uniform, as is characteristic of plug flow with little to no back mixing and of piping discontinuities that lead to unpredictable nonuniformity and noise, locating sensors at different points and averaging their signals may help. The most common example is an averaging pitot tube. For fluidized bed reactors, temperature sensors are installed in a pipe carrying the lead wires that traverses the reactor. Several of these pipes may be installed, the average computed for each pipe, and Hi or Middle Signal Selection of the pipe averages used to determine the PV for temperature control. Hi Signal Selection, with each sensor signal compared to the average, may also be used to rule out single sensor failures.
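A rough sketch of that multipoint scheme, with hypothetical structures and limits, might look like the following: average the sensors in each pipe, select among the pipe averages for the PV, and flag any sensor that deviates too far from its pipe average.

```python
# Rough sketch (hypothetical structure and limits) of the multipoint temperature
# scheme: average each pipe's sensors, select among the pipe averages for the PV,
# and flag any sensor deviating far enough from its pipe average to be suspect.

def pipe_average(sensor_temps):
    return sum(sensor_temps) / len(sensor_temps)

def reactor_temperature_pv(pipes, use_middle=True):
    """pipes is a list of lists of sensor temperatures, one inner list per pipe."""
    averages = sorted(pipe_average(p) for p in pipes)
    return averages[len(averages) // 2] if use_middle else averages[-1]

def suspect_sensors(sensor_temps, max_deviation):
    """Indices of sensors whose deviation from the pipe average suggests a failure."""
    avg = pipe_average(sensor_temps)
    return [i for i, t in enumerate(sensor_temps) if abs(t - avg) > max_deviation]
```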
To promote independence, multiple sensors should be installed with separate process connections. Differential pressure and pressure transmitters should not share the same impulse lines, and separate nozzle connections are used to help maximize independence. Temperature sensors should not be installed in the same thermowell, which could be coated, have a loose fit, suffer excessive vibration, or have other significant common mode problems. pH sensors should be separate and should not share the same reference electrode, due to its many possible sources of error and failure.
The most recognized failure is downscale, possibly associated with transmitter failure or loss of signal, which leads to the frequent use of Hi Signal Selectors. With digital transmitters and signals, this may be less of a concern. For wireless systems, there may be a loss of updates, which may be addressed by Hi Signal Selection. Downscale failure may pose greater safety and environmental concerns stemming from excessive concentrations, levels, pressures, and temperatures.
Middle Signal Selection inherently protects against a single failure of any type, including a stuck last value, which is extremely difficult to detect and deal with automatically. Middle Signal Selection is particularly important for pH measurement because it reduces the common effects of noise, drift, coatings, premature aging of the glass electrode (e.g., caused by high temperatures or strong acid concentrations), dehydration, abrasion, and reference electrode contamination. Middle Signal Selection offers the distinctive advantage of ignoring a slow sensor, which is particularly advantageous for pH measurement since significant aging, dehydration, and coating of the glass electrode can increase the 86% response time from 2 seconds to 2000 seconds.
At one large chemical company, Middle Signal Selection was used on all pH loops. Middle Signal Selection can also offer simple, effective diagnostics that enable quicker maintenance of a defective sensor to retain the full inherent protection. Middle Signal Selection was also used on all measurements in several large complex continuous plants, reducing trips due to signal errors from four to less than one per year (each trip costing 10 million dollars or more). In general, the most hazardous operation occurs during shutdown and startup, posing both safety and monetary concerns.
While maintenance and operations need to see each sensor signal, manual signal selection by individuals is often based on favoritism rather than attentive, fact-based judgment. Manual signal selection should be limited to situations where a sensor has a confirmed problem or is being serviced (e.g., calibrated or replaced).
For an extensive integrated perspective on diagnostics, see the Control article, “A structured approach to control system diagnostics,” by Mentor Program resource Luis Navas Guzman. For Len Laskowski’s insights into important aspects of SIS that are not sufficiently discussed, see the Control Talk columns, “The ins and outs of safety instrumented systems – part 1” and “The ins and outs of safety instrumented systems – part 2.”
Hunter Vegas’s Answer
I struggled with Mike’s question a bit since it is rather open-ended. The “safe” solution for a device, controller, PLC, or IO point very much depends on the application. The control, production, and safety engineers need to carefully evaluate the failure and its ramifications for the process and make the proper choice. There is no magic answer.
That being said, I will mention a few things that tend to trip folks up:
- Be very careful when obtaining analog data from external sources. Analog data that comes through the IO cards is tagged with a wealth of diagnostic data that can be used to flag issues and trigger selection and/or failure algorithms. However, data that comes via external data connections (OPC, wireless, external gateways, HART, etc.) may or may not have diagnostics available. The diagnostic data may be available, but it often takes additional configuration to bring it into the system. If this is not done, the diagnostics may be sparse or absent entirely (see the sketch after this list).
- Be leery of remote IO. When an IO card is located away from the controller, it will require some kind of remote scanner or communication module to transfer that data back into the system. What happens to the remote IO when those communications are lost? Note that things need to happen on both sides—the controller needs to detect the situation and handle things on its side, and the remote IO also needs to detect the failure and respond appropriately. Again, note that the “appropriate” response could be different for every IO point.
- As Greg mentioned, beware of common cause failures. For instance:
- Having redundant transmitters utilizing the same orifice and orifice taps.
- Having redundant orifice transmitters utilizing the same heat trace protection.
- Having redundant transmitters using the same technology, which are susceptible to the same failures. For instance, low process conductivity would affect the readings on redundant magmeters equally.
- Running redundant network communication cables together, where a single event could take out both.
- Having redundant power supplies powered by the same UPS. UPS systems fail, and often it may be better to put one power supply on UPS and the other on regular power.
- And so on.
- When considering failure modes, look beyond the obvious. For instance, don’t just consider the loss of air to a single control valve. What happens if you lose air to the whole unit? Similarly, consider not only the loss of cooling water to a single condenser, but the impact of a loss of cooling water on all the condensers simultaneously. A control valve can fail in a lot of ways: you could lose the incoming signal, lose the air supply, have a failed positioner, have an internal plug/seat failure, have a failed diaphragm, etc. Most of those failures would drive the valve toward its fail position, but some might drive the valve the other way.
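Returning to the first bullet about externally sourced data, here is a minimal sketch (the names and the staleness limit are hypothetical, and no particular OPC or gateway API is assumed) of explicitly checking status and timestamp before trusting a value that did not arrive through an IO card:

```python
# Minimal sketch (hypothetical names, no particular OPC/gateway API) of treating
# externally sourced data as bad unless its status and timestamp prove otherwise.
import time

STALE_AFTER_S = 15.0  # assumed staleness limit for this illustration

def validated_external_value(value, status, timestamp_s, fallback):
    """Return (pv, quality). Unlike IO-card inputs, data arriving over OPC,
    wireless, or gateways must be explicitly checked before use."""
    if status != "GOOD":
        return fallback, "bad-status"
    if (time.time() - timestamp_s) > STALE_AFTER_S:
        return fallback, "stale"
    return value, "good"
```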
Zach Brook’s Answer
My experience is more burner-focused, so I’ll be responding with that application in mind. I’ll defer to Hunter Vegas and Len Laskowski for the more traditional SIS commentary.
In the burner management system (BMS) world, we’re all about bringing valves to their fail-state positions if an unsafe condition is detected within the purge or light-off sequence of events. However, there is not one fail-state position that is going to work for all applications. The block valves in a fuel train, for example, will be fail-closed, whereas the vent valve will be fail-open. Each valve instance would have to be evaluated for what makes sense for the application.
“Bad” status is also usually treated as a vote to trip in the BMS logic. There are instances where we would have MooN voting on multiple transmitters (which NFPA 86 and NFPA 87 call out as being required to be SIL 2 capable), but a transmitter failure is still regarded as a cause for tripping the sequence and requiring the operator to reset the unit and re-purge before attempting to light off again. The same is true for a processor or logic system failure. The BMS would monitor for this failure and bring the unit, be it a boiler, oven, furnace, or fired heater, to the safe state when detected.
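As a sketch of treating a bad status as a vote to trip in a 2oo3 group (the names are hypothetical, and real BMS logic runs in SIL-rated hardware and software, not general-purpose code):

```python
# Sketch of 2oo3 voting where a bad transmitter status counts as a vote to trip
# (hypothetical names; real BMS logic runs in SIL-rated hardware and software).
def bms_2oo3_trip(readings, trip_limit):
    """readings is a list of (value, status_is_good) tuples for three transmitters.
    A bad status counts the same as a reading beyond the trip limit."""
    votes = sum((not good) or (value > trip_limit) for value, good in readings)
    return votes >= 2  # trip; the operator must reset and re-purge before relight
```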
Steven Kormann’s Answer
These questions raise many interesting thoughts with regard to burner management applications. I think a perspective that gets lost in burner applications is that putting together a “safe” system requires a holistic approach that addresses all the layers and failure modes of the system.
For example, if a fuel shutoff valve is only rated for the normal operating pressure of its associated burner, an upstream regulator failure could easily expose the valve to excess line pressure and lead to premature failure of the shutoff valve. Without considering the whole system, safety equipment that is designed correctly for its specific use case may still fail to perform due to indirect failures in other parts of the system.
On the input side of things, I’m seeing increased discussion around diagnostics for discrete inputs. Current NFPA burner codes clearly call out minimum SIL capability and diagnostic requirements when analog input devices are used in BMS service, but the diagnostic capability of most discrete input device applications is all but non-existent. Critical input checks can help determine if an input card or channel is faulted, but they can rarely detect a loop, sensing line, or sensor fault. Surely, we can add redundancy to help, but that does not overcome common-cause faults.
Since burner management is typically implemented as a sequence, an idea that has come up more often recently is implementing discrete input checks based on sequence state. For instance, if the minimum combustion airflow switch detects airflow, but the sequence has not started fans to induce sufficient airflow, an alarm or trip would occur. Another example would be if a low gas pressure switch downstream of a safety shutoff valve detects OK pressure before the sequence commands the valve open.
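A sketch of such sequence-state checks follows (the tags and states are hypothetical): a discrete input that is proving a condition the sequence has not yet commanded is treated as suspect and raises an alarm or trip.

```python
# Sketch (hypothetical tags and states) of sequence-state plausibility checks:
# an input proving a condition the sequence has not yet commanded is suspect.

def airflow_switch_plausible(airflow_proven, fans_commanded_on):
    """Alarm or trip if the minimum combustion airflow switch proves flow
    before the sequence has started the fans."""
    return not (airflow_proven and not fans_commanded_on)

def gas_pressure_switch_plausible(pressure_ok, ssov_commanded_open):
    """Alarm or trip if the low gas pressure switch downstream of the safety
    shutoff valve reads OK pressure before the valve is commanded open."""
    return not (pressure_ok and not ssov_commanded_open)
```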
Network based data communications can certainly add extra layers of challenges. There’s a wide range of standards and implementations with varying levels of integration with different vendors. Well integrated communication systems will have built-in diagnostics that require no additional design considerations. However, more typically, systems will be somewhere else on the spectrum.
Again, this is where consideration of a system as a whole comes into play. Just a few things to think about if passing critical data over a data connection:
- Beyond basic communication of device status, what is the network infrastructure?
- Is there sufficient bandwidth available to support a worst case data burst?
- What is the network redundancy or other resiliency implementation?
- What’s the worst-case communication delay?
- Do the sender/receiver devices prioritize local processing over remote communication during heavy loading?
- Is there a QoS method in place to ensure that critical communication is prioritized across repeaters/switches?
- Are we introducing cybersecurity vulnerabilities into the data, whether through poorly protected data tunneling or easy physical access to network equipment? Has appropriate network administration of that system been considered?
- Have rules, barriers, and procedures been set between IT/OT to ensure evolving security policies don’t cause unexpected production outages?
Len Laskowski’s Answer
I agree with Hunter, Zach, Steve, and Greg. As always, the mentors are spot on and I cannot articulate it better than they did. I will just offer some additional information for consideration.
I too am struggling a bit with the questions, but maybe Mike’s intent was to get us to think a bit outside the box. So, I have outlined a few key words from each of his paragraphs.
Important critical measurements: If you can measure something, you should be able to control it. The trick here is to really pay attention to the measurement and make sure it is as good as possible. Also, do not be afraid to validate that it works. The more critical the measurement, the more important this is.
For example, radar level measurement is a popular technology these days. As a SIS engineer, you need to budget time to take a serious look at the installation and make sure it works. The inexperienced engineer may not do this. I have seen installations where they do not have a nozzle in the top of the vessel or proper side nozzles to build a correct bridle. One must force the issue to make the installation proper, or not use the technology. I have seen projects that just put the radar in the manhole cover so that the device is looking at the man ladder, which creates all sorts of noise and false echoes as the tank fills. Here, the engineer did not do his homework, and this quick, dirty, and cheap installation, which has limited if any success, is a disservice to the plant. The real loss is the plant’s willingness to try the technology again (if properly installed) in other applications.
Analyzers are another all-time favorite. Be sure you really understand what it takes to make the devices work. I once did an installation that we would get into calibration, and then it would drift all over the place. We worked on this for months with no real lasting success. So, I decided to go to the manufacturer’s facility for factory training on the device. I took the training and, as far as I could tell, I was doing things right, so what was wrong? One of the experts there asked, “Are you sure your lab is right?” It had never occurred to me that the lab could be wrong in analyzing these samples. Armed with this new plan, I took a sample, mixed it thoroughly, split it into three bottles labeled A, B, and C, and put three different dates on them. I sent them to the lab, and they came back with three different results. So, we discovered that there was an error in the lab procedure.
Lesson learned: If it is important, make sure you validate it. That may mean you put in redundancy, extra nozzles, modeling, or other special equipment to check that your measurements are good. Begin to look for these tough applications as early in the project as possible and formulate a plan so that you can get money and schedule in the project.
Best for both safety and plant operations: I totally agree, and that is why most SIS systems perform what are called secondary actions, like putting a control loop in manual and pulsing the control valve closed, to assist in a smooth startup. In today’s typical SIS safety life cycle per S84 or IEC 61511, a LOPA (layer of protection analysis) is usually done. A multi-discipline team should be providing guidance on what to do (open or close valves, etc.).
As a SIS engineer, you need to take that guidance and implement it. However, do not do it blindly. On occasion you may find issues that have not been addressed or secondary hazards that are created. You need to go back to the LOPA team and explain the issues to get clarification. This is not uncommon. I currently have a project that is on Rev Y of the LOPA. While a pain during design, it will start up and run smoothly (as we found and addressed the issues).
Transmitter signal failure should remove it from service automatically: I don’t agree with that, because it has no consequence to the operators and it could instill complacency. I prefer to have the bad status announced as a vote to trip unless the operator takes action. That way, a work order to repair the device can get started, and the operator can manually select (with supervisor permission) to bypass the device and degrade the voting to a predetermined architecture.
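As a sketch of that degradation (the names are hypothetical, and 1oo2 is assumed here as just one example of a predetermined degraded architecture): when the operator bypasses one transmitter of a 2oo3 group, the remaining pair votes 1oo2.

```python
# Sketch (hypothetical names) of degrading 2oo3 voting when a transmitter with a
# bad status is bypassed by the operator with supervisor permission; 1oo2 is
# assumed here as the predetermined degraded architecture.
def degraded_vote(readings, bypassed, trip_limit):
    """readings: three transmitter values; bypassed: three booleans.
    With all three in service, vote 2oo3; with one bypassed, vote 1oo2."""
    active = [v for v, b in zip(readings, bypassed) if not b]
    votes_needed = 2 if len(active) == 3 else 1
    return sum(v > trip_limit for v in active) >= votes_needed
```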
PLC processor fault: While it is a noble idea to shut down only the cards or IO that failed, the reality is that it is a lot simpler and safer to shut down the whole PLC and then figure out what went wrong once the unit is in a safe state. Human nature is to fight it, try to compensate, and keep production up. Depending on what failed, you may be doing more damage to the unit than helping it. I remember one case like this: if they had tripped immediately, they could have come back up in hours. Instead, they fought it and made such a chemical mess inside the unit vessels that it took weeks to clean up, replace the catalyst, and restart.
If the data pass is broken, what action needs to be taken: This is a tough question and should not be left to the control engineer to decide alone. Typically, projects big enough to be putting in new BPCS or SIS systems should have a meeting called a CHAZOP (controls HAZOP). This is an opportunity to bring in senior resources and/or vendors and third parties to make a conscious decision about such events. What actions need to happen to put the unit in a safe state if this should occur? For example, what does an operator do if he loses all HMIs? Could this happen? Has it happened?
While not very probable, it is possible. This has happened to very modern installations before, so it is very real. The more we use digital communication protocols, the more susceptible we are to this and the more carefully this needs to be addressed. This plays directly to cybersecurity. In the old days, no hackers could externally corrupt or compromise your relay-based hardwired SIS or panel board controls.