Intro
Many businesses have embraced evolutionary architectures and microservices, which split complex business applications into many smaller components so that many teams can develop them in parallel.
The components may be as small as functions exposed as services (FaaS) in serverless architectures.
This is great for flexibility, which businesses badly need: it lets them speed up development and try out different business scenarios.
But as an inevitable side effect, or cost of flexibility if you will, it’s getting more and more difficult to debug these modern dynamic systems. I’m sorry to disappoint, but even though the complexity has moved elsewhere, it never disappears completely. Most of the members of development and DevOps teams would agree.
The importance of tracing
Applications are very dynamic: microservices interact with each other and with external APIs.
None of these components is bug-free. The way their communication is configured and organized (orchestrated or choreographed) is never perfect either.
Plus, all of them change independently, and with more complex Continuous Delivery scenarios we can end up calling different versions of the same API.
Even the best testing methodology cannot predict all the possible outcomes. It can and will greatly improve the quality of the software solutions, but it’s not the end of the road.
On the other hand, we need to react as quickly as possible to bugs discovered in test and production environments. As all experienced software engineers know, finding the cause of a bug and reproducing the flow and data that led to it usually account for 80%-90% of the effort of fixing it.
Tracing greatly improves the efficiency of fixing bugs by shortening the time needed to find their cause.
A bug or a feature?
Another important benefit of tracing is that it explains what happened in the system: who did what, and in what order, to produce a specific business outcome.
It can also be related to hard cases such as complaints reported by the end users. For example, a bank account has less money than was expected by the user; why is that? How to explain it? Was it a bug in the configuration of the workflow, a bad implementation of the business logic, or the wrong understanding of the business requirements?
Remember, it has just happened. Thousands of users are performing millions of business transactions using your system and you’re looking for this one particular case that went wrong. Without proper tracing, and with complex business logic and lots of business rules in play at the same time, it’s close to impossible to figure out what led to the unexpected outcome.
Challenges of tracing
It’s very easy to enable logging of everything that happens in every microservice and in every component. Your logs will quickly be flooded with tons of technical information, which can help detect technical problems but is much less useful for the business issues at hand.
For instance, the connection to a database is lost, access is denied, the disk is full, or an external API call times out. Such problems may be related to business functions and processes, but not 1:1.
To detect errors in business logic, however, the application code has to be prepared to trace business actions.
Context, context, context
This is the most important thing in proper tracing: logging the context information and ensuring the context is never lost. In the case of old monolithic applications, it was relatively easy to capture this information: the time, the ID of the user performing the action, the version of the software, the data source in use, the ID of the application server node, the client (browser or other API), etc.
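As a minimal sketch of attaching such context to every log record, the example below uses Python's standard `logging.LoggerAdapter`. The field names (`user_id`, `app_version`, `node_id`, `client`), their values, and the logger name are assumptions made for illustration only.

```python
import io
import logging

# Hypothetical context fields; names and values are illustrative only.
context = {
    "user_id": "u-1001",
    "app_version": "2.4.1",
    "node_id": "app-node-3",
    "client": "browser",
}

stream = io.StringIO()  # in-memory target so the sketch has no side effects
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s user=%(user_id)s version=%(app_version)s "
    "node=%(node_id)s client=%(client)s %(message)s"
))

base_logger = logging.getLogger("business-trace")
base_logger.setLevel(logging.INFO)
base_logger.addHandler(handler)
base_logger.propagate = False  # keep records out of the root logger

# LoggerAdapter adds the same context fields to every record it emits.
logger = logging.LoggerAdapter(base_logger, context)

logger.info("transfer requested amount=%s", 250)
logger.info("transfer approved")

print(stream.getvalue())
```

Every line written through `logger` carries the same context, so a business action can later be filtered by user, version, or node without changing each individual log call.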
In the case of dynamic service meshes, or microservices in general, it’s getting much more complicated.
However, in the case of orchestration, the orchestration manager usually knows the flow, step by step, who initiated it, what steps were taken in which contexts, and what happened in each node.
In the case of choreography, services communicate with each other depending upon their context and the business data within the transaction. It’s even more dynamic and it’s much harder to figure out the context. One technique is to pass some kind of context ID between the calls, starting with the first call. This ID is then passed on to every subsequent call so that the context is maintained across different APIs. The problem is that the other APIs have to accept it and must not destroy the context information along the way.
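The context ID technique can be sketched in a few lines. This is only an illustration under stated assumptions: the header name `X-Context-ID` and the two services are hypothetical, and plain function calls stand in for real network calls.

```python
import uuid

CONTEXT_HEADER = "X-Context-ID"  # hypothetical header name, not a standard


def with_context(headers=None):
    """Return a copy of the headers, reusing an incoming context ID
    or creating a new one if this is the first call in the flow."""
    headers = dict(headers or {})
    headers.setdefault(CONTEXT_HEADER, str(uuid.uuid4()))
    return headers


def notification_service(headers, trace_log):
    headers = with_context(headers)
    trace_log.append(("notification", headers[CONTEXT_HEADER]))
    return headers[CONTEXT_HEADER]


def payment_service(headers, trace_log):
    headers = with_context(headers)
    trace_log.append(("payment", headers[CONTEXT_HEADER]))
    # Pass the same headers on, so the downstream call keeps the context.
    return notification_service(headers, trace_log)


trace_log = []
context_id = payment_service({}, trace_log)

for service, cid in trace_log:
    print(service, cid)
```

Both entries in `trace_log` share one ID, which is what lets the two actions be correlated later. If any service in the chain dropped the header, the downstream calls would start a new, unrelated context.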
OpenTracing – a need for standardization
Different systems, written in different technologies, plus the technology choices of individual teams, all lead to the fragmentation of tracing tools and techniques. Even if someone is able to achieve standardization in their own digital product, it would be hard to expect the same from the multiple vendors of external APIs.
Different cloud providers also offer different sets of tracing technologies, which makes things even more diverse and colorful. In this context, it’s not a desirable thing.
The OpenTracing initiative attempts to standardize tracing across different technologies, languages and frameworks in order to address these issues.
OpenTracing goals
The primary goals are:
- Vendor-neutral – to make tracing independent from the vendors, languages, different runtimes and cloud environments.
- Distributed – designed for modern microservices architectures, including service meshes, while still working for modular and monolithic applications.
- Tracing-unaware services – many legacy services are built without tracing support, which is a problem. The best solution is to modify them to work nicely with tracing systems. Unfortunately, that is not always a viable option, hence the need to support tracing-unaware services.
Spans
A span is the primary building block of OpenTracing. It’s a “named timed operation representing a piece of workflow”.
Spans can hold references to other spans, for example a child span pointing to its parent, and related spans together make up a larger trace.
The state of the span includes:
- An operation name
- A start timestamp and a finish timestamp
- A set of key:value span Tags
- A set of key:value span Logs
- A SpanContext
Typical examples of the simplest spans are a single query sent to the database or a single API call to an external service.
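The span state listed above can be modeled roughly as follows. This is a standard-library sketch and not the OpenTracing library itself: the class layout, the `start_span` helper, and the `"child_of"` reference label are assumptions made for illustration.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SpanContext:
    """Identifiers that tie a span to its trace (illustrative fields)."""
    trace_id: str
    span_id: str


@dataclass
class Span:
    operation_name: str
    context: SpanContext
    references: List[Tuple[str, SpanContext]] = field(default_factory=list)
    tags: Dict[str, Any] = field(default_factory=dict)
    logs: List[Tuple[float, Dict[str, Any]]] = field(default_factory=list)
    start_time: float = field(default_factory=time.time)
    finish_time: Optional[float] = None

    def set_tag(self, key: str, value: Any) -> "Span":
        self.tags[key] = value
        return self

    def log_kv(self, key_values: Dict[str, Any]) -> "Span":
        # Each log entry is timestamped at the moment it is recorded.
        self.logs.append((time.time(), dict(key_values)))
        return self

    def finish(self) -> None:
        self.finish_time = time.time()


def start_span(operation_name: str, parent: Optional[Span] = None) -> Span:
    """Hypothetical helper: a child reuses its parent's trace ID."""
    trace_id = parent.context.trace_id if parent is not None else uuid.uuid4().hex
    references = [("child_of", parent.context)] if parent is not None else []
    return Span(operation_name, SpanContext(trace_id, uuid.uuid4().hex), references)


parent = start_span("transfer-money")
child = start_span("db-query", parent=parent)
child.set_tag("db.type", "sql")
child.log_kv({"event": "query sent"})
child.finish()
parent.finish()

print(child.operation_name, child.tags, child.references[0][0])
```

Here the database query is a child span of the business operation, and both share a trace ID, so the two can be shown together as one trace.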
In order to continue the trace across process boundaries and RPC calls, we need a way to propagate the span context over the wire. The OpenTracing API provides two functions in the Tracer interface to do that: Inject(spanContext, format, carrier) and Extract(format, carrier).
Carrier is an interface or data structure that’s used for inter-process communication (IPC); its role is to transmit the tracing state from one process to another.
Format defines how the trace payload is structured. Two formats are obligatory: binary and text map (a string-to-string map).
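To illustrate how Inject and Extract cooperate with a text map carrier, here is a toy tracer built only on the standard library. It mirrors the argument order described above but is not the OpenTracing library: the `ToyTracer` class, the `TEXT_MAP` constant, and the `toy-` key prefix are assumptions for this sketch.

```python
from typing import Dict, NamedTuple


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str


TEXT_MAP = "text_map"  # illustrative format identifier
KEY_PREFIX = "toy-"    # hypothetical prefix for carrier keys


class ToyTracer:
    """Toy tracer that only knows how to propagate a span context."""

    def inject(self, span_context: SpanContext, format: str,
               carrier: Dict[str, str]) -> None:
        if format != TEXT_MAP:
            raise ValueError(f"unsupported format: {format}")
        # Write the tracing state into the string-to-string carrier.
        carrier[KEY_PREFIX + "trace-id"] = span_context.trace_id
        carrier[KEY_PREFIX + "span-id"] = span_context.span_id

    def extract(self, format: str, carrier: Dict[str, str]) -> SpanContext:
        if format != TEXT_MAP:
            raise ValueError(f"unsupported format: {format}")
        # Rebuild the span context on the receiving side.
        return SpanContext(
            trace_id=carrier[KEY_PREFIX + "trace-id"],
            span_id=carrier[KEY_PREFIX + "span-id"],
        )


tracer = ToyTracer()

# Sending process: put the context into outgoing request headers.
outgoing = SpanContext(trace_id="abc123", span_id="span-1")
headers: Dict[str, str] = {"Content-Type": "application/json"}
tracer.inject(outgoing, TEXT_MAP, headers)

# Receiving process: recover the context from the incoming headers.
incoming = tracer.extract(TEXT_MAP, headers)
print(incoming)
```

The carrier here is an ordinary dictionary standing in for HTTP headers. The receiving side gets back the same trace and span IDs, which is what allows the trace to continue in the next process.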
OpenTracing-contrib
As the name suggests, there are additional libraries created to broaden the adoption of OpenTracing. If something is missing in the main OpenTracing set of components, it is likely that it can be found here.
OpenTracing adoption
OpenTracing libraries are currently available for Go, JavaScript, Java, Python, Ruby, PHP, Objective-C, C++ and C#. That should cover more than 98% of the enterprise languages in active use.
But of course, writing applications or microservices in plain languages is quite rare. So to make it easier, OpenTracing supports popular frameworks such as gRPC, Flask, Go kit, Django, Dropwizard, Motan, Hprose, and Sharding-JDBC.
The tracers supporting OpenTracing are Elastic, Jaeger, Instana, Apache SkyWalking, inspectIT Ocelot, Datadog, inspectIT, Zipkin, and Wavefront.
Does it work with Kubernetes? Yes, it does. How could it not?
Future of OpenTracing
There is more to the future of OpenTracing than releasing new improved versions with more features and addressing the inconveniences reported by the community.
The idea is more ambitious and is called OpenTelemetry, which is a combination of tracing and telemetry features in one consistent standard.
It is currently in beta, but moving steadily towards 1.0.
The OpenTracing team openly admits that they cannot set the standard, as they are not a standardization body; however, they have a lot of influence as part of the Cloud Native Computing Foundation (CNCF).
Tracing is a very important part of every digital solution we create with our partners here at Avenga. It enables us to build them faster and with much higher quality, while preserving the benefits of the flexibility of microservices and serverless.
How to build traceability into your digital products?
How to make traceability a useful feature of your API ecosystem?
Here, at Avenga, we are always looking for an opportunity to talk about it. Feel free to contact us.