Inside the Black Box: Observability for Microservices
Discover how observability built on OpenTelemetry and Grafana, grounded in real-world lessons, can make diagnosing microservices fast, reliable, and scalable, before cracks appear.
Editor’s Note: This blog is adapted from a talk by Mithun K, Platform Engineer at GeekyAnts. In this session, he walks through the crucial role observability plays in diagnosing microservice-based systems. Rather than focusing on tools, Mithun shares real-world lessons on how metrics, logs, and traces—when used together—bring clarity to complex architectures. With OpenTelemetry as the backbone and Grafana’s open tools for visibility, the talk offers a grounded blueprint for building reliable systems from the inside out.
Hi, I am Mithun K. I work at GeekyAnts as a Platform Engineer. Over the last two years, one thing has become clear: microservices need observability—not something you add later, but something you build in from the start.
This is not a talk about tooling. It is about clarity. About knowing what your system is doing, and being able to answer why it behaves the way it does.
From Simplicity to Complexity
I remember when applications used to be monoliths. You had one deployment bundle. Backend, frontend, CMS—everything was shipped together. Scaling meant scaling everything, even if only one component needed it.
Now we build microservices. Checkout, catalogue, authentication, order tracking—each is a service of its own. They run on different clusters, on different environments, and speak to each other over APIs. This architecture gives flexibility. But it also means that when something breaks, finding the cause is no longer straightforward.
That is the context for observability. Not as a replacement for monitoring, but as an evolution of it. Monitoring shows symptoms. Observability shows causes.
What Observability Offers
Logs and metrics are familiar. You check CPU usage, memory thresholds, and maybe request latency. These are helpful. But they do not tell the full story.
Observability adds traces to the mix. A trace is a view of a single request as it moves through multiple services. It includes timing, dependencies, and even slow database queries.
Here is how I break it down:
- Metrics tell you how the system is behaving.
- Logs tell you what happened.
- Traces tell you how it all fits together.
When you can correlate all three, you stop guessing and start understanding.
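What ties the three signals together is a shared trace ID. Here is a toy sketch in Python of that correlation, with all data hard-coded and invented; in a real setup, Prometheus, Loki, and Tempo hold these records, and Grafana does the joining for you:

```python
import json

# Toy telemetry records carrying a shared trace ID. In production these
# would live in Prometheus (metrics), Loki (logs), and Tempo (traces).
metrics = [{"name": "http_request_duration_ms", "value": 4200, "trace_id": "abc123"}]
logs = [
    {"msg": "checkout started", "trace_id": "abc123"},
    {"msg": "payment ok", "trace_id": "def456"},
]
traces = {"abc123": ["gateway", "checkout", "postgres"]}

def correlate(trace_id):
    """Join metrics, logs, and trace spans on a shared trace ID."""
    return {
        "metrics": [m for m in metrics if m["trace_id"] == trace_id],
        "logs": [log for log in logs if log["trace_id"] == trace_id],
        "services": traces.get(trace_id, []),
    }

# One slow request, seen from all three angles at once.
print(json.dumps(correlate("abc123"), indent=2))
```

The point of the sketch is the join key: once every signal carries the same trace ID, "how is the system behaving", "what happened", and "how does it fit together" become one query instead of three.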
A Lesson in Delay
We had a client who initially skipped observability. Their system launched fine. As traffic grew, the application slowed down. But all the dashboards were green. CPU usage was normal. Memory was fine. Still, the user experience was poor.
We spent two days chasing ghosts.
Then we added OpenTelemetry and connected it to Grafana Tempo. Within hours, we saw the issue. A poorly indexed SQL query buried in one of the services was holding everything up.
That is when I realised observability is not just about debugging errors. It is about diagnosing slowness, regressions, and invisible bottlenecks.
Why We Chose OpenTelemetry
OpenTelemetry gives us a standardised, vendor-neutral way to collect traces, metrics, and logs. It supports languages such as Node.js, Go, Java, and Python, and lets us instrument applications once rather than repeatedly for every new backend.
In our setup, we use Prometheus for metrics, Loki for logs, and Tempo for traces. All the data flows through a central OpenTelemetry Collector. The best part is that we can swap any of these tools without touching application code. That flexibility saves time and keeps us adaptable.
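A pipeline like ours can be sketched as a Collector configuration along these lines. Treat this as illustrative, not our exact file: component names and endpoints (`loki`, `otlp/tempo`, the ports) depend on your Collector distribution and deployment, so check them against the Collector docs for your version:

```yaml
receivers:
  otlp:                 # apps send all three signals here via OTLP
    protocols:
      grpc:
      http:

processors:
  batch:                # batch telemetry before export

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # scraped by Prometheus
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlp/tempo:
    endpoint: "tempo:4317"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```

Because applications only ever talk to the Collector, swapping a backend means editing an exporter block here, not redeploying services.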
We rely on auto-instrumentation for most services. It requires almost no code changes. Once configured, everything just flows—clean, consistent, and ready to visualise.
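For a Node.js service, auto-instrumentation boils down to an environment fragment like the following. The service name and Collector endpoint are placeholders for your environment:

```shell
# One-time dependency for the service
npm install @opentelemetry/auto-instrumentations-node

# Point the SDK at the Collector and name the service
export OTEL_SERVICE_NAME="checkout"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"

# Preload the auto-instrumentation; no application code changes
node --require @opentelemetry/auto-instrumentations-node/register app.js
```

HTTP calls, database clients, and popular frameworks get instrumented at load time, which is why the spans and timings show up without touching the application itself.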
Debugging in Practice
One example stood out to me. We had an endpoint that responded slowly. The logs were fine. Metrics were stable. But when we opened the trace, we saw it clearly.
A call from Service A to Service B triggered a slow database query. The trace showed the timing, the SQL statement, and even the exact span where the delay occurred. Fixing it took less time than finding it.
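Tempo's trace view does this analysis for you, but the idea behind spotting the culprit span can be sketched in a few lines of Python using a span's "self time": its own duration minus its children's. The span names and timings below are invented for illustration:

```python
# Hypothetical spans from one trace: Service A calls Service B,
# which runs a slow SQL query. Times are milliseconds from trace start.
spans = [
    {"id": 1, "parent": None, "service": "service-a",
     "name": "GET /orders", "start_ms": 0, "end_ms": 1840},
    {"id": 2, "parent": 1, "service": "service-b",
     "name": "lookup_order", "start_ms": 40, "end_ms": 1810},
    {"id": 3, "parent": 2, "service": "service-b",
     "name": "SELECT on orders (unindexed)", "start_ms": 60, "end_ms": 1790},
]

def self_times(spans):
    """Exclusive time per span: its duration minus its children's."""
    durations = {s["id"]: s["end_ms"] - s["start_ms"] for s in spans}
    exclusive = dict(durations)
    for s in spans:
        if s["parent"] is not None:
            exclusive[s["parent"]] -= durations[s["id"]]
    return exclusive

def bottleneck(spans):
    """Return the span that spends the most time doing its own work."""
    exclusive = self_times(spans)
    worst = max(exclusive, key=exclusive.get)
    return next(s for s in spans if s["id"] == worst)
```

Inclusive durations alone would blame the root span, since it wraps everything; self time points straight at the SQL query, which is exactly what the trace view showed us.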
Another time, we used a log entry in Loki to jump directly into the corresponding trace in Tempo. No guesswork. Just click and see the path the request took. That is how observability changes workflows.
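That log-to-trace jump is wired up with a "derived field" on Grafana's Loki data source, which extracts the trace ID from a log line and links it to Tempo. A provisioning sketch follows; the regex, URL, and data source UID are assumptions you would adapt to your own log format and setup:

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Pull the trace ID out of each log line and make it a
        # clickable link into the Tempo data source.
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          datasourceUid: tempo        # UID of the Tempo data source
          url: '$${__value.raw}'      # $$ escapes $ in provisioning files
```

With that in place, every log line carrying a trace ID becomes a one-click path to the full request trace.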
Why It Needs to Start Early
The earlier you introduce observability into a project, the more value you extract from it. You understand performance trends before issues arise. You notice inefficiencies before they become incidents.
And when something does break, you already have the context.
In microservices, problems do not announce themselves. They scatter. You need observability to trace the thread back to the cause.
The Case for OpenTelemetry
OpenTelemetry makes this easy. It supports multiple backends, works across environments, and separates instrumentation from tooling.
With it, we get:
- One-time instrumentation across services
- Backend flexibility without code rewrites
- Centralised visibility through standard collectors
- Cost-effective scaling with open-source components
This is not just about tools. It is about engineering confidence.
Final Thoughts
Observability is essential for modern systems. Metrics, logs, and traces each answer a different kind of question. Together, they reveal how systems behave, why they fail, and what to fix next.
OpenTelemetry helps us do this without vendor lock-in or duplication of effort.
If you are building microservices, observability is not an extra. It is your foundation. And it is best to build that foundation before the cracks start showing.