Monitoring & Alert Management
Infrastructure and application monitoring are important aspects of maintaining the performance and reliability of the BDB Platform.
The BDB team collaborates with the customer IT team to finalize the monitoring and alerting framework based on the ownership of infrastructure. A responsibility matrix defines what each team monitors, how alerts flow, and how reporting is done.
Monitoring can be configured and enabled with open-source tools such as Prometheus, Fluentd, Grafana, and Zabbix, and with licensed tools such as Datadog.
Prometheus is an open-source monitoring system that is widely used for collecting and storing metrics data. It is designed to be scalable and reliable, and it provides a simple yet powerful query language for retrieving and processing metric data. BDB enables alerts in Prometheus to notify of any issues or anomalies in the BDB Platform application's performance when certain thresholds are met or exceeded.
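As a rough illustration of threshold-based checks on Prometheus data, the sketch below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and picks out the series whose value exceeds a threshold. The host name, label names, and threshold are assumptions made for the example, not BDB's actual configuration. In a real deployment the thresholds would normally live in Prometheus alerting rules rather than in client code.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def series_over_threshold(response_text, threshold):
    """Return (labels, value) pairs from an instant-vector response above threshold."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector result")
    breaches = []
    for series in data["result"]:
        # Each sample is [unix_timestamp, "value-as-string"].
        _timestamp, raw_value = series["value"]
        value = float(raw_value)
        if value > threshold:
            breaches.append((series["metric"], value))
    return breaches


# Hypothetical response body, shaped like a Prometheus instant-vector reply.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "node-a"}, "value": [1700000000, "91.5"]},
            {"metric": {"instance": "node-b"}, "value": [1700000000, "42.0"]},
        ],
    },
})
```

Calling `series_over_threshold(SAMPLE_RESPONSE, 80.0)` on the sample above returns only the `node-a` series.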
Fluentd is an open-source data collector that gathers data from various sources and forwards it to various destinations. It is designed to be lightweight and flexible, and it is often used to collect log and metric data. BDB configures Fluentd to collect metric and log data from the Platform. This is done with Fluentd's input and output plugins, which specify the sources and destinations for data collection.
Grafana is an open-source visualization and analytics platform that allows you to create interactive dashboards and charts from your metric data. It is highly customizable and can be integrated with various data sources, including Prometheus, Fluentd, Zabbix, and Datadog. The BDB Platform uses Grafana to retrieve metric data from these data sources and display it in interactive dashboards and charts. For example, Prometheus can be added as a data source in Grafana and visualizations can then be built on its metric data.
Zabbix is an open-source monitoring software tool for diverse IT components, including networks, servers, virtual machines (VMs), and cloud services. Zabbix can be deployed for agent-based and agentless monitoring. Agents are installed on IT components to check performance and collect data. The agent then reports back to a centralized management server. That information is included in reports or presented visually in the graphical user interface (GUI). Agentless monitoring accomplishes the same type of monitoring by using existing resources in a system or device to emulate an agent. BDB uses Zabbix to collect monitoring metrics, such as network utilization, CPU load, and disk space consumption.
Datadog is an observability service for cloud-scale applications that provides monitoring of servers, databases, tools, and services through a SaaS-based data analytics platform. It supports multiple cloud service providers, including Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and Red Hat OpenShift.
BDB uses different monitoring methods to ensure that the performance and reliability of the Platform meet the agreed expectations and SLAs. These methods fall into the following categories:
Synthetic/Endpoint monitoring
Monitoring each module inside the platform (version 7.6 and below)
Resource utilization monitoring
Synthetic (endpoint) monitoring checks internal and external endpoints (HTTP/S, DNS, TCP, and ICMP) for parameters such as HTTP latency, DNS lookup latency, SSL certificate expiry, and TLS version. These metrics are then displayed in interactive dashboards and charts.
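The certificate-expiry and TLS-version checks can be sketched with Python's standard `ssl` module. This is only an illustration of the idea, not the probe BDB deploys. The function names and the 30-day warning window are assumptions.

```python
import socket
import ssl
import time


def fetch_certificate_info(host, port=443, timeout=5.0):
    """Open a TLS connection and return (notAfter string, negotiated TLS version)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"], tls.version()


def days_until_expiry(not_after, now=None):
    """Days between `now` (epoch seconds, default: current time) and notAfter."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def expires_soon(not_after, warn_days=30, now=None):
    """True when the certificate expires in fewer than warn_days days (assumed window)."""
    return days_until_expiry(not_after, now) < warn_days
```

`fetch_certificate_info` needs network access to the endpoint being probed. The other two functions work on the `notAfter` string alone, so they can be checked offline.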
A few sample dashboards/charts are outlined below:
To monitor individual modules, the BDB Platform provides the Server Monitoring option under the Administration module. Server Monitor reviews and analyzes a server for availability, performance, security, and other operations-related processes. Along with visualizations of the key parameters, a component log is also available.
In addition to pipeline monitoring, you can configure failover events for every component in a data pipeline. Admins can create dedicated Kafka topics to handle failures, drag and drop them onto the canvas, and map them to components. Any failure or error within a component during data processing is registered in the corresponding failover event. A data audit can be implemented by combining the monitoring facility with the failover events: the admin can verify the count of records processed successfully against the number of records available for processing.
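The record-count reconciliation behind such a data audit can be sketched as follows. The three counts are assumed inputs. How they are read from the platform or from the failover Kafka topics is not covered here, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    available: int   # records available for processing
    processed: int   # records processed successfully
    failed: int      # records captured by failover events

    @property
    def unaccounted(self):
        """Records that were neither processed nor routed to a failover event."""
        return self.available - self.processed - self.failed

    @property
    def is_balanced(self):
        """True when every available record is accounted for."""
        return self.unaccounted == 0
```

For example, `AuditResult(available=1000, processed=980, failed=10)` leaves 10 records unaccounted for, which would warrant a closer look at the component logs.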
Click the Server Monitoring option from the list of Admin options.
The Server Monitor page appears with various nodes presenting various modules of the BDB Platform.
Select a node to display the node-specific server details. For example, the details below show the Data Science Servers node selected:
Open-source tools (Prometheus, Fluentd, Grafana, Zabbix, etc.) or customer-provided paid tools (Datadog, etc.) are used to monitor Kubernetes cluster nodes, containers, and pods. Monitoring the Kubernetes cluster gives a broad view of overall platform capacity and health. Cluster resource and infrastructure usage shows whether the cluster is underutilized or over capacity. Node health and availability reveal whether sufficient resources and nodes are available to replicate applications.
Monitoring node/container/pod metrics provides a high-level overview of a node’s health and whether the scheduler can place pods on that node. It runs checks on the following node conditions:
Out Of Disk
Ready status (node is ready to accept pods)
Memory Pressure (node memory is too low)
PID pressure (too many running processes)
Disk Pressure (remaining disk capacity is too low)
Network Unavailable
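The node-condition checks listed above can be sketched against the `status.conditions` list that the Kubernetes API returns for a node (for instance from `kubectl get node <name> -o json`). In this minimal sketch, `Ready` is healthy when its status is `"True"` and every other condition is healthy when its status is `"False"`. The function name is hypothetical.

```python
def unhealthy_conditions(node):
    """Return the condition types of a node object that are not in a healthy state."""
    problems = []
    for condition in node.get("status", {}).get("conditions", []):
        # Ready should be "True"; pressure/unavailable conditions should be "False".
        expected = "True" if condition["type"] == "Ready" else "False"
        if condition["status"] != expected:
            # "Unknown" is also treated as unhealthy.
            problems.append(condition["type"])
    return problems
```

An empty list means the node reports no problem conditions. Any entry, such as `DiskPressure`, names a condition worth alerting on.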
A few sample dashboards/charts are outlined below:
Based on the alert routing mechanism agreed with the customer IT team, the alert manager sends notifications to the configured notification channels, such as Outlook, Slack, MS Teams, or PagerDuty, when the Kubernetes cluster matches pre-configured event and metric rules on the monitoring server. These rules can cover critical messages related to:
Node observability
CPU Utilization
Memory Utilization
Storage Use
Kubernetes cluster and container health check
Pod is crash looping, is unscheduled, or has been in a non-ready state
Node is out of capacity or has been unready for a long time
Node has Memory/Disk Pressure condition
Node has a Network Unavailable condition
The container has been OOM Killed
Volume is almost full (Volume out of disk space)
Kubernetes API server/client is experiencing a high error rate
Kubernetes API server has high latency
Synthetic Monitoring (Endpoint Monitoring)
HTTP status code is not 200
Probe failed or took more than 1s to complete
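The synthetic monitoring rules at the end of this list can be sketched as a small evaluation function. The `ProbeResult` structure and its field names are hypothetical. Only the conditions themselves (status code other than 200, probe failure, duration over 1s) come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    target: str
    succeeded: bool
    status_code: Optional[int]   # None when no HTTP response was received
    duration_seconds: float


def probe_alerts(result, max_duration=1.0):
    """Return the alert messages triggered by a single probe result."""
    alerts = []
    if not result.succeeded:
        alerts.append(f"{result.target}: probe failed")
    if result.status_code is not None and result.status_code != 200:
        alerts.append(f"{result.target}: HTTP status code is {result.status_code}, expected 200")
    if result.duration_seconds > max_duration:
        alerts.append(f"{result.target}: probe took {result.duration_seconds:.2f}s to complete")
    return alerts
```

The returned messages would then be handed to whatever notification channel the routing agreement specifies.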