System Health Management: It’s all about Monitoring, Control and Communication!

My blog post last week hopefully taught you a lot about the importance of computing health management. In that article I already mentioned that health management supports three main functions: monitoring, control and communication. But what does this mean in detail? To give you a practical overview, I will provide some insight into these functions, using as an example the Kontron modular TRACe™ transportation computer system portfolio, which features integrated health management capabilities.

TRACe provides online access to comprehensive status information for each unit, giving operators a way to verify the actual reliability and effectiveness of the whole installed base. Health management is handled by a dedicated Health Management Unit (HMU) microcontroller in each TRACe system, which is independent of the main processor. The HMU firmware therefore runs separately from the operational application on the main processor, so health management stays autonomous even if the main processor stalls or stops. Access is possible from a private intranet or the web, from virtually any connected IoT device. Health management gives users a broad range of vital real-time information for comprehensive monitoring and control of reliability, availability and serviceability, which helps reduce operational costs over the lifetime of the system. But now, as promised, let’s have a look at the functions in detail:

 

System Health Management: Monitoring

TRACe is monitored at two levels. First, at each power-on, a comprehensive self-test verifies that the system is functioning at its required level. This so-called “power-on self-test” checks the basic functions of TRACe: processor, memory subsystem, main interfaces and I/Os. The self-test result provides a generic system status: “GREEN – everything OK”, “ORANGE – OK with partial restrictions described in the detailed status” or “RED – failed, with failure status”. Second, TRACe continuously monitors vital signals while the system is running: voltages, currents, temperatures (processor, PSU, thermal sensors, etc.) and the “heartbeat” signals (PCIe buses, clocks, etc.). If the value of one of these signals falls outside a defined range, the online monitoring records the value and generates an event. In addition, a so-called “watchdog” timer can be programmed with a duration (for example, 10 seconds); if the application does not reset the countdown before it elapses, the watchdog triggers. This serves as the last line of detection for any frozen or stalled state, so that the system can be reset.
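To make the continuous monitoring and the watchdog a bit more tangible, here is a minimal Python sketch of both ideas. It is not the HMU firmware or any Kontron API: the signal names, the limits, and the `check_signals` and `Watchdog` helpers are all hypothetical and only illustrate the principle of "value out of range generates an event" and "countdown expires unless the application resets it".

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Event:
    """An out-of-range reading (hypothetical structure, for illustration)."""
    signal: str
    value: float
    low: float
    high: float


def check_signals(readings: Dict[str, float],
                  limits: Dict[str, Tuple[float, float]]) -> List[Event]:
    """Return one event for every reading outside its allowed range."""
    events = []
    for name, value in readings.items():
        low, high = limits[name]
        if not (low <= value <= high):
            events.append(Event(name, value, low, high))
    return events


class Watchdog:
    """Countdown that expires unless the application calls kick() in time."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_kick = clock()

    def kick(self) -> None:
        """Reset the countdown; a healthy application calls this regularly."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """True once timeout_s has passed since the last kick."""
        return self._clock() - self._last_kick >= self.timeout_s


# Hypothetical limits and readings, not taken from any TRACe data sheet.
LIMITS = {"cpu_temp_c": (0.0, 95.0), "rail_12v": (11.4, 12.6)}
readings = {"cpu_temp_c": 101.5, "rail_12v": 12.1}

events = check_signals(readings, LIMITS)
for event in events:
    print(f"{event.signal} out of range: {event.value}")
```

In a real system the monitoring side would poll `expired()` and reset the unit when it returns true, while the application calls `kick()` as long as it is alive. The clock is injectable here simply so the behaviour can be checked without waiting ten seconds.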

 

System Health Management: Control

TRACe’s control functionality uses the monitoring results to let users take relevant actions according to predefined strategies for the use case or mission profile. Boot management depends on the power-on self-test status: “GREEN – boot OS and application”, “ORANGE – OK with partial restrictions: depending on the failure status and the application profile, boot OS and application with disabled functions or reduced performance” or “RED – stop, or restart the self-test to double-check (a defined number of self-test restarts can be allowed)”. Permanent monitoring management depends on the values of the vital/critical signals. It either records the out-of-limit value, with date and time and a red flag accessible to the application, or triggers a reset or a power-off/power-on cycle followed by a reboot. All boot/reboot events and statuses are logged with historical records in the system’s permanent memory.
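The boot strategy described above can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: the `Status` enum, the `boot_decision` function, the action names and the default of two self-test restarts are hypothetical, not the actual TRACe boot logic.

```python
from enum import Enum
from typing import Callable


class Status(Enum):
    GREEN = "green"
    ORANGE = "orange"
    RED = "red"


def boot_decision(run_self_test: Callable[[], Status],
                  max_restarts: int = 2) -> str:
    """Map the self-test status to a boot action.

    On RED the self-test is restarted up to max_restarts times as a
    double-check before the system is stopped.
    """
    restarts = 0
    while True:
        status = run_self_test()
        if status is Status.GREEN:
            return "boot_full"        # boot OS and application
        if status is Status.ORANGE:
            return "boot_degraded"    # boot with restrictions
        if restarts >= max_restarts:
            return "stop"             # RED confirmed, give up
        restarts += 1


# Example: the first self-test fails, the double-check passes.
results = iter([Status.RED, Status.GREEN])
action = boot_decision(lambda: next(results))
print(action)
```

Which restrictions apply in the ORANGE case would depend on the detailed failure status and the application profile, so a real policy would take more inputs than this sketch does.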

 

System Health Management: Communication

Data, statuses and events logged in the permanent memory of TRACe are available for analysis, either locally or remotely over TCP/IP and SNMP, via the OS and/or directly from the HMU. All external network communications use secured protocols in order to protect the health management system from intrusion. In addition, MQTT (a lightweight messaging protocol) and TR-069 will be made available to extend interoperability. The user can easily employ these communication layers to access important information and interface it to the application or any web service middleware. Depending on the application, the emergence of IoT makes it possible to connect virtually any mobile device to the cloud. Therefore, any transportation system or web service can take advantage of TRACe health management to track, monitor, control and trigger actions from devices such as palmtops, tablets, laptops or any other type of connected device. Combining health management capabilities with IoT provides the building blocks for the intelligent future of fleet management.
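As a rough idea of what a logged event could look like once it leaves the unit, the following Python sketch packs an event into a topic and a JSON payload of the kind typically sent over MQTT. The topic layout, the field names and the `build_event_message` helper are my own assumptions for illustration; the actual TRACe message format is not described here, and the sketch deliberately stops short of any network or broker code.

```python
import json
from datetime import datetime, timezone
from typing import Tuple


def build_event_message(unit_id: str, signal: str, value: float,
                        when: datetime) -> Tuple[str, str]:
    """Return a (topic, payload) pair for one health event.

    Topic layout and field names are hypothetical.
    """
    topic = f"fleet/{unit_id}/health/events"
    payload = json.dumps(
        {
            "unit": unit_id,
            "signal": signal,
            "value": value,
            "timestamp": when.isoformat(),
        },
        sort_keys=True,
    )
    return topic, payload


topic, payload = build_event_message(
    "unit-042", "cpu_temp_c", 101.5,
    datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(topic)
print(payload)
```

A fleet management service subscribed to such a topic could then collect the events from every unit and feed them into its own dashboards or maintenance planning.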

TRACe makes the health management monitoring, control and communication layers accessible at various levels (low, mid and high) defined by the user’s application. This flexible structure of interaction is designed to serve as many application profiles as possible.

I hope this article gives you a good introduction to the functions of “system health management”. Please feel free to add your own thoughts!
