In the world of microservices, observability is an important concept. Yes, containers and automation are very important concepts too, but with microservices observability, you can see what your microservices are doing, and you have the assurance that they’re performing and behaving correctly.
In the traditional telco world, this concept is known as service assurance. There are vendors who specialize in service assurance solutions that observe network traffic by pulling data straight from the fiber connections through physical taps. But how do you put a physical tap on a virtual machine? And how do you monitor cloud-native microservices when there may be thousands of them deployed on single VM at a given moment in time?
The answer, of course, is you can’t. What works in the physical world is, in the cloud-native world, virtually impossible. Instead, the broader cloud community has developed a robust ecosystem of microservices observability tools to provide service assurance in a cloud-native network. Some of these tools are relatively new, while others have been used by enterprises and cloud providers for years. When Affirmed built its 5G Core solution, we made an important decision to leverage the best microservices observability tools through our Platform as a Service (PaaS) layer, giving mobile network operators (MNOs) a simple, effective way to deliver service assurance.
Four Aspects of Microservices Observability in Cloud-Native Networks
The concept of microservices observability in the cloud-native networking world is similar to FCAPS in the traditional telco world. Observability can be broken into four categories: application performance management; logging and events; faults; and tracing.
- Application performance management
- Logging and events
- Faults
- Tracing
1. Application performance management
(APM) measures key performance indicators (KPIs) such as latency. The Cloud Native Computing Foundation (CNCF) recommends Prometheus as its KPI collection/storage tool. Prometheus is tightly integrated with Kubernetes and Helm; when you deploy a cloud-native Network Function (CNF), Prometheus is deployed with it, allowing it to scrape data from microservices in a non-intrusive and efficient manner. This data is then visualized through a tool called Grafana, which creates dashboards using widgets; Affirmed also integrates Grafana and includes pre-built dashboards and widgets into its 5GC solution.
2. Logging and events
This records what is actually happening to apps and microservices in the cloud-native environment. Cloud providers generally use what is known as the EFK stack—Elasticsearch, Fluentd, and Kibana—to record logs and generate alerts. At a high level, Fluentd collects the logs and messages, Elasticsearch stores them and Kibana provides the data visualization, again using dashboards.
3. Faults
To detect faults, open-source offers two tools: ElastAlert (so named because it is designed to work with Elasticsearch) and Prometheus’ Alert Manager, which comes bundled with the Prometheus tool. ElastAlert lets you set specific rules and policies to manage alerts (e.g., when X happens, send Y alert via SMS), while Alert Manager generates alerts when specific numerical thresholds are passed.
4. Tracing
Tracing is something unique to the cloud-native world, allowing network operators to track the exchanges between different microservices. For example, microservice A might invoke microservice B to complete a process, which in turn invokes microservice C and ultimately returns control back to microservice A. As you might imagine, tracking the exchanges between every microservice instance would generate an almost-unmanageable amount of data, so tracking tools like Jaeger (which is supported by CNCF) only collect a very small sample (the default sample rate is a tenth of one percent) of these exchanges for analysis and visualization. But even this small sample is useful, as it provides visibility into how microservices are functioning. And through policy, this global sampling percentage can be overridden for transactions containing certain data (e.g., for a particular 5GC ‘slice’ or from a particular user).
Beyond those four aspects of observability, there’s still more to see in a cloud-native architecture. For example, tools like Istio, Envoy, and Kiali are used to illustrate the microservices topology (referred to in the industry as a service mesh) and show where errors are happening. (Service meshes are worthy of their own blog, so stay tuned for that.)
In addition to using open-source, third-party tools, a fully virtualized probing solution can record data logging and event data within the microservice itself. Think of a fully virtualized Probe as logging and events on steroids, as it captures detailed microservices data from every CNF and user endpoint: TCP roundtrip times and retransmission rates, deep packet inspection details, and the list goes on. Unlike traditional physical probes, vProbe doesn’t negatively impact network performance and generates a wealth of data to support big data analyses, artificial intelligence, and machine-learning systems.
If your 5G vendor isn’t as transparent in how their solutions support observability, maybe you should be looking for a new vendor.