Monitoring and tracing for Akka applications under Kubernetes (k8s)

innFactory GmbH
Mar 26, 2020

Introduction

To keep applications highly available, they must be monitored and traced. We therefore evaluated several monitoring and tracing products for Akka applications. To test the products, each one was integrated into our Akka demo application, which was introduced in a previous article ( https://innfactory.de/softwareentwicklung/scala-akka-play-co/akka-service-deployment-on-kubernetes/ ). At the end, one monitoring product and one tracing product are selected for the concept developed later in this article.

Monitoring

Introduction

There is no blanket solution for monitoring applications, as each application is structured differently. However, there are some rules that should be followed:

Avoid Checkbox Monitoring

Checkbox monitoring means collecting as many metrics of an application as possible, out of fear of missing a metric that will be needed later. Most of the time, however, this leads to too many metrics being collected, and the important information gets lost in the oversupply. It is therefore better to start with a small number of metrics.

Monitoring is a continuous process

Monitoring is never finished. It must be adapted continuously, both because the application itself evolves and because operating the monitoring yields new insights about what is worth watching.

Avoid Tool Obsession

Tool obsession is the attempt to monitor the application comprehensively with as many different tools as possible. These attempts often fail because the large number of tools is not maintained properly and nobody can find their way around the tool “jungle”.

Correct configuration of threshold values

Thresholds for monitored metrics must be configured correctly. An alert whose threshold is too sensitive fires too often without requiring any action, and is ignored after a short time. Configuring the thresholds is a continuous process and takes time, especially in the beginning.
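As an illustration of this rule, the following language-neutral sketch in Python raises an alert only when a metric stays above its threshold for several consecutive samples, so that a single short spike does not trigger it. The function name `should_alert`, the parameter `min_consecutive` and the sample values are assumptions made for this example; they are not taken from any of the products discussed in this article.

```python
from typing import Sequence


def should_alert(samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """Return True only if the newest `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        # Not enough history yet to call the breach "sustained".
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


# Hypothetical heap usage ratios, oldest first.
spike = [0.60, 0.95]
sustained = [0.60, 0.92, 0.95, 0.97]

alert_on_spike = should_alert(spike, threshold=0.9, min_consecutive=3)
alert_on_sustained = should_alert(sustained, threshold=0.9, min_consecutive=3)
```

With `min_consecutive=1` the same function behaves like a naive threshold check, which is a convenient starting point before the value is tuned over time.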

Monitoring can also focus on different aspects of a system. In this article the focus is on application performance monitoring. The monitoring software must therefore provide a way to collect metrics from Akka, Scala and the JVM (Java Virtual Machine). Several products were considered for this purpose; they are presented in the following sections.

Lightbend Monitoring

Lightbend Monitoring is part of Lightbend Enterprise Tools and consists of two products: Lightbend Monitoring and Lightbend Console. It is developed by Lightbend Inc., which also co-develops Scala and Akka, specifically for monitoring Akka, Play and Lagom applications.

Structure of Lightbend Monitoring.

Figure 1 gives an overview of the structure of Lightbend Monitoring when running under Kubernetes. It shows the individual pods of the different building blocks of Lightbend Monitoring. To get telemetry data out of the Akka application, the Cinnamon extension is integrated into it. Cinnamon consists of a Java agent that collects telemetry data about Akka and the JVM and opens a web server from which this data can be retrieved. Prometheus collects and stores the data provided there. Grafana and Lightbend Console are used to visualize the collected telemetry data.
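To make this pull model more concrete, here is a minimal sketch in Python of what such an endpoint does: it renders the current metric values in the Prometheus text exposition format and serves them over HTTP so that Prometheus can scrape them. This is not how Cinnamon is implemented. The metric names, the `/metrics` path and the helper names `render_prometheus` and `make_handler` are assumptions chosen for illustration.

```python
import re
from http.server import BaseHTTPRequestHandler
from typing import Callable, Mapping

# Metric names accepted by the Prometheus text format.
_METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def render_prometheus(gauges: Mapping[str, float]) -> str:
    """Render unlabelled gauges in the Prometheus text exposition format."""
    lines = []
    for name in sorted(gauges):
        if not _METRIC_NAME.fullmatch(name):
            raise ValueError(f"invalid metric name: {name!r}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(gauges[name])!r}")
    # The format is line based; every line ends with a newline.
    return "".join(line + "\n" for line in lines)


def make_handler(read_gauges: Callable[[], Mapping[str, float]]):
    """Build a request handler that serves the current gauges on /metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_prometheus(read_gauges()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler
```

To try it locally, the handler could be started with `HTTPServer(("127.0.0.1", 9095), make_handler(lambda: {"jvm_heap_used_bytes": 1048576})).serve_forever()` from `http.server`, with a Prometheus scrape job pointed at that address. The port is arbitrary.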

Figure 2 shows the Lightbend Console. The upper section shows the different deployments; each point represents a pod. The colors indicate whether the deployment or the pod is running properly. The lower section shows more information about the deployments, such as the number of nodes, pods and containers, and the health status of the deployments over a longer period of time.

Kamon

Another product on the market is Kamon. Kamon is open-source software licensed under the Apache License 2.0. At first sight, Kamon looks like a lower-cost alternative to Lightbend Monitoring.

Structure of Kamon monitoring.

Figure 3 gives an overview of the structure of Kamon, which is similar to that of Lightbend Monitoring. The Akka application contains the Kanela extension, which includes a Java agent. With the Kamon Prometheus extension included, the collected telemetry data can be scraped by a Prometheus server. The collected data is visualized in various Grafana dashboards. Kamon is compatible with Akka and the Play Framework.
Figure 3 also shows Kamon APM (Application Performance Management) in orange. Kamon APM is a paid solution in which the collected metrics are sent to Kamon, which takes over storage and visualization via pre-built dashboards. To use Kamon APM, the Kamon APM Reporter extension must be included.

Istio

Because the Istio website also advertises monitoring capabilities, Istio was considered as well. Istio is a free service mesh licensed under the Apache License 2.0. However, collecting metrics for Akka, Scala and the JVM is not supported, so it was ruled out for monitoring.

Tracing

Introduction

Tracing is a useful tool for troubleshooting and optimizing applications. With the rise of distributed systems, tracing has become more and more complex, because a single request may pass through several services.

innFactory Akka cluster Kubernetes application

Figure 4 shows the demo application. The aim of tracing is to record a request from the moment it arrives at service two and to follow it through the entire system. The smallest unit created when recording a request to the application is a span. Figure 5 shows a trace of the demo application. On the left side, you can see the different spans that make up the trace. The topmost, and therefore initial, span has been expanded to show its details. It marks the point at which the client request reaches the web service. The spans visible here are all created in service two of the demo application, but in general a trace can contain spans from one or more microservices.
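The relationship between spans and a trace can be sketched as a small data model. The following Python example is a simplified illustration, not the data model of Jaeger or Zipkin: every span carries the ID of its trace and of its parent span, and the trace is rebuilt as a tree from those references. The field names, the operation names and the service name `service-one` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by all spans of one request
    span_id: str
    parent_id: Optional[str]  # None marks the initial (root) span
    service: str
    operation: str
    start_us: int             # start time in microseconds
    duration_us: int


def render_trace(spans: List[Span], trace_id: str) -> List[str]:
    """Return the spans of one trace as an indented tree, one line per span."""
    by_parent: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        if span.trace_id == trace_id:
            by_parent.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def visit(span: Span, depth: int) -> None:
        lines.append(f"{'  ' * depth}{span.service}: {span.operation} ({span.duration_us} us)")
        for child in sorted(by_parent.get(span.span_id, []), key=lambda s: s.start_us):
            visit(child, depth + 1)

    for root in sorted(by_parent.get(None, []), key=lambda s: s.start_us):
        visit(root, 0)
    return lines


# Hypothetical spans of a single request entering at service two.
example = [
    Span("t1", "a", None, "service-two", "HTTP GET /", 0, 900),
    Span("t1", "b", "a", "service-two", "ask worker actor", 100, 600),
    Span("t1", "c", "b", "service-one", "process message", 200, 300),
    Span("t2", "x", None, "service-two", "HTTP GET /health", 50, 10),
]
tree = render_trace(example, "t1")
```

Here `tree` holds one line per span of trace `t1`, indented by nesting depth, which roughly corresponds to the span list on the left side of a tracing UI.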

Trace of the demo application displayed in Jaeger Tracing.

The products Zipkin and Jaeger were considered for tracing.

Jaeger

Jaeger is used for tracing distributed systems and is licensed under the Apache License 2.0. It was originally developed by Uber Technologies and is now a project of the Cloud Native Computing Foundation.

Jaeger Architecture (Source: Mastering Distributed Tracing, ISBN: 978-1-78862-846-4)

Figure 6 shows the architecture of Jaeger. The Jaeger Client is the tracing library that runs inside the application to be monitored. This library collects spans and sends them to the Jaeger Agent via the User Datagram Protocol (UDP). The Jaeger Client can also skip the Jaeger Agent and send the data directly to the Jaeger Collector. Which of the two paths is preferable depends on the complexity and size of the application to be monitored. The control flow shown in the graphic makes it possible to notify Jaeger Client and Jaeger Agent of configuration changes.
The Jaeger Agent runs as a sidecar in the pods of the application to be monitored and sends the data to the Jaeger Collector. The Jaeger Collector receives the spans from the Jaeger Agents in Zipkin or Jaeger format via the HTTP, gRPC or TChannel protocols. The received data can be encoded in JavaScript Object Notation (JSON), Thrift or Protocol Buffers (Protobuf). After being received, the data is converted into an internal data model and stored. Cassandra or Elasticsearch can be used as the database, and Kafka can be used as an intermediate buffer in front of it. If Jaeger is installed for testing purposes only, it is also possible to use the local hard disk space of the Jaeger instance to store the data. The Jaeger Query component is used by the user interface to query the stored data.
The data mining jobs are optional and can be used for further analysis of the collected tracing data. They can be used, for example, to build graphs that show which other services a service depends on.
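How such a dependency graph can be derived from stored spans is sketched below in Python. This is not Jaeger's actual job, only the underlying idea: whenever a span's parent belongs to a different service, the parent's service called the child's service. The type `SpanRef`, the function `service_dependencies` and the service name `service-one` are hypothetical names used for illustration.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class SpanRef(NamedTuple):
    span_id: str
    parent_id: Optional[str]
    service: str


def service_dependencies(spans: Iterable[SpanRef]) -> Dict[Tuple[str, str], int]:
    """Count calls between services as (caller, callee) -> number of spans."""
    spans = list(spans)
    service_of = {span.span_id: span.service for span in spans}
    edges: Dict[Tuple[str, str], int] = {}
    for span in spans:
        if span.parent_id is None:
            continue  # root spans have no caller
        caller = service_of.get(span.parent_id)
        if caller is None or caller == span.service:
            continue  # unknown parent or a call inside the same service
        key = (caller, span.service)
        edges[key] = edges.get(key, 0) + 1
    return edges


# Hypothetical spans: service two calls service one twice.
spans = [
    SpanRef("a", None, "service-two"),
    SpanRef("b", "a", "service-two"),
    SpanRef("c", "b", "service-one"),
    SpanRef("d", "a", "service-one"),
]
edges = service_dependencies(spans)
```

Each key of `edges` is a directed edge of the dependency graph, and the value can be used as the edge weight when drawing it.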

Zipkin

Like Jaeger, Zipkin is suitable for tracing distributed systems and is licensed under the Apache License 2.0. Zipkin was originally developed by Twitter and has existed longer than Jaeger.

Zipkin Architecture (Source: https://zipkin.io/pages/architect)

As can be seen in Figure 7, the Zipkin architecture is similar to the Jaeger architecture. A so-called reporter must run in the application; it collects the tracing data and sends it to the Collector. HTTP, gRPC or the Advanced Message Queuing Protocol (AMQP) can be used as the protocol for data transmission. The Collector validates and indexes the data and passes it on for permanent storage. Cassandra, Elasticsearch or MySQL can be used to store the tracing data. Additional systems can be connected via third-party extensions. To make the data easy to retrieve, Zipkin provides an HTTP interface that returns the data in JSON format. Zipkin also provides its own web interface for displaying the data.
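As a hedged illustration of working with this JSON output, the following Python sketch parses a list of spans in the shape of Zipkin's v2 JSON span format and reports the slowest span for each service. It only uses the fields `name`, `duration` (in microseconds) and `localEndpoint.serviceName`. The sample payload, the service name `service-one` and the function name `slowest_span_per_service` are assumptions made for this example, and the payload is hard-coded rather than fetched from a running Zipkin server.

```python
import json
from typing import Dict, Tuple


def slowest_span_per_service(payload: str) -> Dict[str, Tuple[str, int]]:
    """Map each service to (span name, duration in microseconds) of its slowest span."""
    slowest: Dict[str, Tuple[str, int]] = {}
    for span in json.loads(payload):
        service = span.get("localEndpoint", {}).get("serviceName", "unknown")
        name = span.get("name", "unknown")
        # Spans that are still in flight may not carry a duration yet.
        duration_us = int(span.get("duration", 0))
        if service not in slowest or duration_us > slowest[service][1]:
            slowest[service] = (name, duration_us)
    return slowest


# Hypothetical response body for a single trace.
sample = json.dumps([
    {"traceId": "5af7183fb1d4cf5f", "id": "5af7183fb1d4cf5f", "name": "get /",
     "timestamp": 1585000000000000, "duration": 900,
     "localEndpoint": {"serviceName": "service-two"}},
    {"traceId": "5af7183fb1d4cf5f", "id": "6b221d5bc9e6496c",
     "parentId": "5af7183fb1d4cf5f", "name": "ask worker actor",
     "timestamp": 1585000000000100, "duration": 600,
     "localEndpoint": {"serviceName": "service-two"}},
    {"traceId": "5af7183fb1d4cf5f", "id": "0562809467078eab",
     "parentId": "6b221d5bc9e6496c", "name": "process message",
     "timestamp": 1585000000000200, "duration": 300,
     "localEndpoint": {"serviceName": "service-one"}},
])
result = slowest_span_per_service(sample)
```

Against a real server, the payload would come from the HTTP interface instead of the `sample` string; the parsing step stays the same.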

Concept for monitoring and tracing

Kamon was selected for the monitoring concept, because the license costs for Lightbend Monitoring are very high and the functionality of Kamon is almost identical to that of Lightbend Monitoring. Istio will not be considered further, as it does not have the required functionality.
Jaeger was selected for tracing. Zipkin and Jaeger are almost identical in terms of functionality. However, Jaeger offers better Akka support and supports the OpenTracing standard. OpenTracing is a vendor-neutral, standardized framework for tracing.
Figure 8 shows a concept that was created for Laura AI.

Two services of innFactory's Laura AI app with Kamon and Jaeger.

Monitoring

In the upper right corner of Figure 8 you can find the two pods Prometheus and Grafana, which make up the monitoring. Prometheus takes over the collection and storage of the data. The services lauracrawler and lauradialogflow contain the Kamon Prometheus extension, which provides Prometheus with an interface for collecting the data. In Grafana, dashboards can be created to visualize the data. For this purpose, Grafana fetches the data from Prometheus and InfluxDB. Grafana also takes care of alerting via different channels. The pod with InfluxDB in the upper left corner is optional, which is indicated by the dotted line. If it is available, Prometheus additionally writes the collected metrics to InfluxDB. This is only needed if the monitoring data is to be kept for a longer period of time and the hard disk space available to Prometheus is not sufficient for this.

Tracing

In Figure 8 the two lower pods are responsible for tracing. The Kamon Reporter for Jaeger is integrated in lauracrawler and lauradialogflow and is configured to forward the data to the Jaeger Agent. The Jaeger Agent then forwards this data to the Jaeger Collector. It is also possible to skip the Jaeger Agent and send the data directly to the Jaeger Collector. Whether the Agent can be skipped depends on the amount of data that is transferred. To achieve better scalability, the use of the Jaeger Agent is recommended. The Jaeger Collector sends the data to Elasticsearch for storage. Elasticsearch is recommended here, as it can be extended to a complete Elasticsearch, Logstash, Kibana (ELK) stack at a later date if required. If the environment is very large, several Jaeger Collectors can be used. Jaeger UI is only used to visualize the data. The arrows between Jaeger Collector, Jaeger Agent and lauracrawler or lauradialogflow point in both directions, because the tracing data is sent from the Akka application towards the Jaeger Collector and configuration data is sent from the Jaeger Collector towards the Akka application.

Practical test

In order to test the different software solutions with little effort, there are different branches in the GitHub repository of the demo application:

The branch kamon_jaeger tests the concept described in Figure 8. In each branch you can find a Markdown file in the directory monitoring-and-tracing, lightbend_monitoring or istio that describes the setup.

#Scala #akka #k8s

This blog post was automatically translated with DeepL. It was originally published in German at: https://innfactory.de/softwareentwicklung/scala-akka-play-co/monitoring-und-tracing-fuer-akka-anwendungen-unter-kubernetes/

Originally published at https://innfactory.de on March 26, 2020.
