When we talk about monitoring network infrastructures, routers, network interfaces and traffic flows come to mind. But monitoring can go much further: we can obtain the number of free ports on a switch, the usage of Quality of Service (QoS) classes, the processor, memory and disk load of a server, or the number of connections established by a firewall, proxy, DHCP server or any other network element, to name just a few examples.
The metrics we can obtain are not limited to software; we also collect hardware health metrics. For example, we can obtain the status of fans and power supplies, as well as processor temperatures and speeds.
In the WOCU monitoring tool we support routers and switches, but we also monitor firewalls, proxies, load balancers, bandwidth managers, SIEMs, IP PBXs, Wi-Fi access points, virtualization platforms and storage arrays, among others.
We decide which metrics to monitor on each device by applying packs. To make it easier to identify which pack to apply in each case, we classify them by category.
Thus, we have Network, Security, Hardware, Database, Network Protocols, Virtualization, Operating System, Voice over IP and Storage packs.
Given the different nature of the metrics and devices we monitor, some packs apply to equipment of different brands and models, such as the packs that measure network traffic, the number of routes, or interface errors and states. Other packs target specific brands and models, such as Check Point, FortiGate, Palo Alto, F5, Allot, Oracle or Cisco Call Manager.
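To make the idea of categorized packs more concrete, here is a minimal sketch of how generic and vendor-specific packs might be selected for a device. It is not WOCU's actual data model: the `Pack` class, the `applicable_packs` function and the pack names in the catalogue are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Pack:
    """Hypothetical description of a monitoring pack."""
    name: str
    category: str
    vendor: Optional[str] = None  # None means the pack is vendor-agnostic


# Invented catalogue entries, for illustration only.
CATALOGUE = [
    Pack("interface-traffic", "Network"),
    Pack("interface-errors", "Network"),
    Pack("f5-pools", "Network", vendor="F5"),
    Pack("oracle-tablespaces", "Database", vendor="Oracle"),
]


def applicable_packs(
    catalogue: Iterable[Pack], device_vendor: str, categories: Set[str]
) -> List[Pack]:
    """Return packs in the requested categories that are generic or match the vendor."""
    return [
        pack
        for pack in catalogue
        if pack.category in categories and pack.vendor in (None, device_vendor)
    ]


selected = applicable_packs(CATALOGUE, "F5", {"Network"})
print([pack.name for pack in selected])
# ['interface-traffic', 'interface-errors', 'f5-pools']
```

In this sketch an F5 device receives both the generic Network packs and the F5-specific one, while the Oracle pack is left out because neither its category nor its vendor matches.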
WOCU obtains the metrics it needs through different methods. SNMP queries are the most widely used, but they are not the only one: WOCU also uses agents installed on the monitored hosts (Windows or Linux) and, in other cases, queries the API offered by the manufacturer.
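One way to picture several collection methods living side by side is a small dispatch table that maps each method to a collector function. The sketch below is an assumption about how such a layer could be organized, not a description of WOCU's internals. The three collectors are stubs that return canned values; real ones would send an SNMP query, contact an installed agent, or call a vendor API. All function and metric names are hypothetical.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]


# Stub collectors: each returns canned data instead of touching the network.
def collect_snmp(host: str) -> Metrics:
    return {"if_in_octets": 0.0}


def collect_agent(host: str) -> Metrics:
    return {"cpu_load": 0.0}


def collect_api(host: str) -> Metrics:
    return {"active_sessions": 0.0}


# Dispatch table: collection method name -> collector function.
COLLECTORS: Dict[str, Callable[[str], Metrics]] = {
    "snmp": collect_snmp,
    "agent": collect_agent,
    "api": collect_api,
}


def collect(host: str, method: str) -> Metrics:
    """Run the collector registered for `method` against `host`."""
    try:
        collector = COLLECTORS[method]
    except KeyError:
        raise ValueError(f"unsupported collection method: {method!r}") from None
    return collector(host)


print(collect("switch-01", "snmp"))
```

Keeping the method behind a single `collect` call means a check does not need to know how its data is fetched, which is what lets one tool mix the three approaches.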
Monitoring large infrastructures means querying thousands of hosts and, depending on the number of services configured per host, the total number of checks per minute can be quite high. For this reason, the ability to scale horizontally is essential: simply adding pollers increases the performance of the entire system.
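A quick back-of-the-envelope calculation shows how fast the check load grows. The sketch below assumes every service is checked at the same interval and that capacity grows linearly with the number of pollers; both are simplifications, and the host count, service count, interval and per-poller capacity are invented figures, not WOCU measurements.

```python
import math


def checks_per_minute(hosts: int, services_per_host: int, interval_minutes: float) -> float:
    """Average number of checks launched per minute across the whole installation."""
    return hosts * services_per_host / interval_minutes


def pollers_needed(load_per_minute: float, capacity_per_poller: float) -> int:
    """Smallest number of pollers whose combined capacity covers the load."""
    return math.ceil(load_per_minute / capacity_per_poller)


# Hypothetical installation: 5000 hosts, 20 services each, checked every 5 minutes.
load = checks_per_minute(hosts=5000, services_per_host=20, interval_minutes=5)
print(load)  # 20000.0

# Hypothetical poller able to run 6000 checks per minute.
print(pollers_needed(load, capacity_per_poller=6000))  # 4
```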
However, scalability is of little use if the query mechanisms are not efficient. We have worked hard on improving SNMP queries to minimize execution times and thus obtain maximum performance even in high-latency environments.
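To see why latency dominates SNMP execution time, consider walking a table one row per request versus fetching many rows per request, as the GetBulk operation of SNMPv2c and SNMPv3 allows. The text does not say which optimizations WOCU applies, so this is only a general illustration. The model assumes one request in flight at a time and ignores processing time on both ends; the row count, round-trip time and rows-per-request values are hypothetical.

```python
import math


def walk_time_ms(rows: int, rtt_ms: int, rows_per_request: int = 1) -> int:
    """Estimated time to retrieve `rows` values when each request costs one round trip."""
    round_trips = math.ceil(rows / rows_per_request)
    return round_trips * rtt_ms


# Hypothetical case: 480 table rows over a link with a 100 ms round-trip time.
print(walk_time_ms(480, rtt_ms=100))                       # 48000 (one row per request)
print(walk_time_ms(480, rtt_ms=100, rows_per_request=25))  # 2000  (25 rows per request)
```

Under these assumptions the same table takes 48 seconds row by row and 2 seconds in batches of 25, which is why reducing round trips matters most on high-latency links.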
We continue to work on improving the data collection processes, as well as the data processing that produces the appropriate output and status for each service. Along these lines, we are developing new versions of the monitoring packs that, for example, no longer use temporary files when calculating traffic statistics for network interfaces or quality of service classes.
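Traffic rates are typically derived from two successive readings of a cumulative octet counter, so the previous reading has to be kept somewhere between checks. The sketch below keeps it in memory rather than in a temporary file. It is an assumption about how such a calculation could look, not the implementation used in the WOCU packs: the `RateTracker` class is hypothetical, it requires a long-running process to hold the state, and it treats any decrease in the counter as a single wrap, so a device reboot would be misread.

```python
from typing import Dict, Optional, Tuple

COUNTER64_MODULUS = 2 ** 64  # size of a 64-bit counter such as ifHCInOctets


class RateTracker:
    """Keeps the last counter sample per key in memory and derives bit rates."""

    def __init__(self, modulus: int = COUNTER64_MODULUS) -> None:
        self._modulus = modulus
        self._last: Dict[str, Tuple[float, int]] = {}

    def update(self, key: str, timestamp: float, octets: int) -> Optional[float]:
        """Store the new sample and return bits per second since the previous one.

        Returns None for the first sample of a key or when no time has elapsed.
        """
        previous = self._last.get(key)
        self._last[key] = (timestamp, octets)
        if previous is None:
            return None
        prev_time, prev_octets = previous
        elapsed = timestamp - prev_time
        if elapsed <= 0:
            return None
        # The modulo handles a single counter wrap between the two samples.
        delta = (octets - prev_octets) % self._modulus
        return delta * 8 / elapsed


tracker = RateTracker()
print(tracker.update("router1/eth0", 0.0, 1_000))     # None (first sample)
print(tracker.update("router1/eth0", 60.0, 751_000))  # 100000.0 bits per second
```

The same approach applies to per-class QoS counters: only the key changes, for example an interface name combined with a class name.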