In mission-critical control rooms, a visual collaboration system failure can be disastrous. Component resiliency is crucial to success. “Resilience” refers to a system’s designed-in ability to maintain an acceptable level of service while a fault is present. System failure or poor performance occurs for a variety of reasons: cybersecurity attacks, insufficient network resources, outdated equipment, and understaffed operations or IT departments, to name a few. Often, the underlying issue impacting resiliency is an inability to scale systems and facilities to meet operational objectives.
MTBF vs MTTR
Mean time between failures (MTBF) measures system reliability: the average operating time between one failure and the next. An organization may review MTBF metrics, but this can lead to false confidence, because an average cannot predict when a particular failure will occur. A more relevant indicator may be mean time to repair/recover/resolve (MTTR). These metrics capture the organization’s ability to respond to a failure. They include the availability of any parts required to repair the failure, the time to make the system operational again (recover), and the time to make sure it doesn’t happen again (resolve). MTTR is often measured in hours or days.
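To see why MTTR can matter more than MTBF, it helps to combine the two into a single availability figure. The sketch below uses the common steady-state approximation, availability = MTBF / (MTBF + MTTR). The function names and the sample figures are illustrative assumptions, not measurements from any real system.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year implied by the availability."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical figures: the same MTBF with a fast and a slow repair process.
fast_repair = annual_downtime_hours(mtbf_hours=50_000, mttr_hours=4)
slow_repair = annual_downtime_hours(mtbf_hours=50_000, mttr_hours=48)
print(f"4-hour MTTR:  {fast_repair:.2f} hours of downtime per year")
print(f"48-hour MTTR: {slow_repair:.2f} hours of downtime per year")
```

With the MTBF held constant, the expected downtime grows almost in proportion to the MTTR, which is why the repair and recovery process deserves as much attention as the reliability rating of the hardware.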
If the MTTR risk is not acceptable to the mission, it is mitigated with redundancy or resilience. “Redundancy” is the duplication of system components to increase dependability. This is accomplished by switching to any backup components (or system) during a failure event to restore full operation.
In modern IT systems, the likelihood of a component failure is much higher than that of a total system failure. A resilient system is designed to continue operations at an acceptable level if individual pieces of that system fail. This is accomplished by providing limited redundancy, or by accepting a defined lower service level and using automation to ensure resources are assigned to critical tasks.
For example, a visualization system is made resilient by adding a single redundant processor as backup. Usually, we think of ‘backup’ equipment as an extra piece of hardware to be pulled out of the closet when the primary device crashes. Control rooms with large video wall systems can’t afford to lose time swapping out and configuring processors. True redundancy enables the primary and backup processors to create mirrored displays.
On mirrored displays, identical content is automatically streamed to separate displays. Only one is active and the other is maintained as a hot backup. By mirroring content between two displays, a user can alternate between them in the event of a failure. This allows for full operational capacity under a single failure condition at a 75% cost reduction over full redundancy. Resilience is also achieved by replacing non-critical tasks on a working processor with the critical content from a failed component. That type of setup also provides an acceptable service level with minimal redundancy expense.
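The second approach above, replacing non-critical tasks on a working processor with critical content from a failed one, can be sketched as a simple reassignment routine. This is a minimal illustration under stated assumptions: the `Source` and `Processor` classes, the `capacity` limit, and the `fail_over` function are all hypothetical names invented for this example and do not represent any vendor's actual software.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Source:
    name: str
    critical: bool  # hypothetical flag: must stay on the wall during a fault


@dataclass
class Processor:
    name: str
    capacity: int  # hypothetical limit on simultaneous sources
    healthy: bool = True
    sources: list[Source] = field(default_factory=list)


def fail_over(failed: Processor, survivors: list[Processor]) -> list[Source]:
    """Move critical sources off a failed processor.

    Returns the sources that are no longer displayed (the reduced service level).
    """
    failed.healthy = False
    dropped = [s for s in failed.sources if not s.critical]
    pending = [s for s in failed.sources if s.critical]
    failed.sources = []

    for source in pending:
        placed = False
        for proc in survivors:
            if not proc.healthy:
                continue
            if len(proc.sources) >= proc.capacity:
                # Processor is full: evict one non-critical source to make room.
                victim = next((s for s in proc.sources if not s.critical), None)
                if victim is None:
                    continue  # only critical content here, try the next one
                proc.sources.remove(victim)
                dropped.append(victim)
            proc.sources.append(source)
            placed = True
            break
        if not placed:
            dropped.append(source)  # no room anywhere: critical content is lost
    return dropped


# Usage with made-up content names.
wall_a = Processor("A", capacity=2,
                   sources=[Source("SCADA", True), Source("News feed", False)])
wall_b = Processor("B", capacity=2,
                   sources=[Source("Weather", False), Source("CCTV", True)])
dropped = fail_over(wall_a, [wall_b])
print([s.name for s in wall_b.sources])  # critical content now on processor B
print([s.name for s in dropped])         # non-critical content given up
```

In this sketch the surviving processor keeps both critical sources and gives up the two non-critical ones, which is the "acceptable service level with minimal redundancy expense" trade-off described above. A real system would also need to decide which non-critical content to sacrifice first.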
Resilience must be designed at a system level to meet the requirements of any organization. The product attributes and techniques that allow you to design resilient systems are often the same attributes required for scalability. Scalability is simply the ability of a system or its components to handle increased use or incorporate additional resources as an organization’s operation grows. Resilience is primarily the management of these additional resources. The scalability features provided by a network-distributed video wall system also make possible the resiliency features in the examples above.
Visual collaboration systems provide mission-critical situational awareness to decision makers. You can read more about scalability and how to meet your current and future resiliency goals in our recently published white paper: Evaluating Scalability for Control Rooms.
Contact us today for more insight on how CineMassive provides resilient, scalable systems for long-term operational success.