Jun 13, 2025
Inside the Black Box: Observability for Microservices
Discover how observability built on OpenTelemetry, Grafana, and real lessons can make diagnosing microservices fast, reliable, and scalable—before cracks appear.
Editor’s Note: This blog is adapted from a talk by Mithun K, Platform Engineer at GeekyAnts. In this session, he walks through the crucial role observability plays in diagnosing microservice-based systems. Rather than focusing on tools, Mithun shares real-world lessons on how metrics, logs, and traces, when used together, bring clarity to complex architectures. With OpenTelemetry as the backbone and Grafana’s open tools for visibility, the talk offers a grounded blueprint for building reliable systems from the inside out.
Hi, I am Mithun K. I work at GeekyAnts as a Platform Engineer. Over the last two years, one thing has become clear: microservices need observability, and it is not something you add later but something you build in from the start.
This is not a talk about tooling. It is about clarity. About knowing what your system is doing, and being able to answer why it behaves the way it does.
From Simplicity to Complexity
I remember when applications used to be monoliths. You had one deployment bundle. Backend, frontend, CMS—everything was shipped together. Scaling meant scaling everything, even if only one component needed it.
Now we build microservices. Checkout, catalogue, authentication, order tracking: each is a service of its own. They run on different clusters, in different environments, and speak to each other over APIs. This architecture gives flexibility. But it also means that when something breaks, finding the cause is no longer straightforward.
That is the context for observability. Not as a replacement for monitoring, but as an evolution of it. Monitoring shows symptoms. Observability shows causes.
What Observability Offers
Logs and metrics are familiar. You check CPU usage, memory thresholds, and maybe request latency. These are helpful. But they do not tell the full story.
Observability adds traces to the mix. A trace is a view of a single request as it moves through multiple services. It records how long each step took, which services called which, and where time was lost, for example in a slow database query.
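To make the idea of a trace concrete, here is a minimal sketch using only the Python standard library. It is a toy model, not the OpenTelemetry API: `ToyTracer`, `Span`, and `handle_checkout` are hypothetical names, and the sleeps stand in for real work. It shows the two things a trace carries: a shared trace ID, and parent/child links with timing for each step.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One timed step of a request (toy model, not the OpenTelemetry type)."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start: float = 0.0
    end: float = 0.0
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


class ToyTracer:
    """Records the spans of a single request; a stand-in for a tracing SDK."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex   # shared by every span of the request
        self.spans: list[Span] = []        # finished spans
        self._stack: list[Span] = []       # currently open spans

    @contextmanager
    def span(self, name: str, **attributes):
        # The innermost open span, if any, becomes the parent.
        parent = self._stack[-1].span_id if self._stack else None
        s = Span(name, self.trace_id, uuid.uuid4().hex[:16], parent,
                 attributes=attributes)
        s.start = time.perf_counter()
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(s)


def handle_checkout(tracer: ToyTracer) -> None:
    # Hypothetical request path; sleeps simulate work in each service.
    with tracer.span("checkout-service: POST /checkout"):
        with tracer.span("auth-service: verify token"):
            time.sleep(0.005)
        with tracer.span("order-service: create order"):
            with tracer.span("db: INSERT order", db_system="postgresql"):
                time.sleep(0.02)


tracer = ToyTracer()
handle_checkout(tracer)
for s in sorted(tracer.spans, key=lambda s: s.start):
    print(f"{s.name:40s} parent={s.parent_id} {s.duration_ms:.1f} ms")
```

In a real system the trace ID and parent span ID travel between services in request headers, so spans recorded by different processes can be stitched into one view.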
Here is how I break it down:
- Metrics tell you how the system is behaving.
- Logs tell you what happened.
- Traces tell you how it all fits together.
When you can correlate all three, you stop guessing and start understanding.
A Lesson in Delay
We had a client who initially skipped observability. Their system launched fine. As traffic grew, the application slowed down. But all the dashboards were green. CPU usage was normal. Memory was fine. Still, the user experience was poor.
We spent two days chasing ghosts.
Then we added OpenTelemetry and connected it to Grafana Tempo. Within hours, we saw the issue. A poorly indexed SQL query buried in one of the services was holding everything up.
That is when I realised observability is not just about debugging errors. It is about diagnosing slowness, regressions, and invisible bottlenecks.
Why We Chose OpenTelemetry
OpenTelemetry gives us a standardised, vendor-neutral way to collect traces, metrics, and logs. It supports languages and runtimes such as Node.js, Go, Java, and Python, and lets us instrument applications once rather than again for every new backend.
In our setup, we use Prometheus for metrics, Loki for logs, and Tempo for traces. All the data flows through a central OpenTelemetry Collector. The best part is that we can swap any of these tools without touching application code. That flexibility saves time and keeps us adaptable.
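The decoupling described here can be sketched as a toy model. This is not the real Collector, which is configured rather than coded, and every name below (`ToyCollector`, `make_backend`, `app_emit`, `other-trace-backend`) is hypothetical. The point it illustrates is that the application only knows about one destination, while the routing table decides which backend receives each signal type.

```python
from typing import Callable, Dict, List

Signal = dict                       # e.g. {"type": "metric", ...}
Exporter = Callable[[Signal], None]


class ToyCollector:
    """Routes each signal to the exporters registered for its type."""

    def __init__(self, pipelines: Dict[str, List[Exporter]]) -> None:
        self.pipelines = pipelines

    def receive(self, signal: Signal) -> None:
        for export in self.pipelines.get(signal["type"], []):
            export(signal)


def make_backend(name: str, store: Dict[str, list]) -> Exporter:
    """Fake backend that just remembers what it was sent."""
    def export(signal: Signal) -> None:
        store.setdefault(name, []).append(signal)
    return export


received: Dict[str, list] = {}
collector = ToyCollector({
    "metric": [make_backend("prometheus", received)],
    "log": [make_backend("loki", received)],
    "trace": [make_backend("tempo", received)],
})


def app_emit(target: ToyCollector) -> None:
    # "Application code": it never names a backend.
    target.receive({"type": "metric", "name": "http_requests_total", "value": 1})
    target.receive({"type": "log", "body": "order created"})
    target.receive({"type": "trace", "span": "POST /checkout"})


app_emit(collector)

# Swap the trace backend: only the routing table changes, app_emit does not.
collector.pipelines["trace"] = [make_backend("other-trace-backend", received)]
app_emit(collector)

print({name: len(signals) for name, signals in received.items()})
```

After the swap, new traces go to the replacement backend while metrics and logs keep flowing to the same places, and `app_emit` is untouched.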
We rely on auto-instrumentation for most services. It requires almost no code changes. Once configured, telemetry flows from every service in a clean, consistent form, ready to visualise.
Debugging in Practice
One example stood out to me. We had an endpoint that responded slowly. The logs were fine. Metrics were stable. But when we opened the trace, we saw it clearly.
A call from Service A to Service B triggered a slow database query. The trace showed the timing, the SQL statement, and even the exact span where the delay occurred. Fixing it took less time than finding it.
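One way to reason about "the exact span where the delay occurred" is self time: a span's duration minus the time spent in its direct children. The sketch below is an illustration, not the method used in the talk. The span names and durations are invented, `RecordedSpan` and `self_times` are hypothetical helpers, and the calculation assumes children run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RecordedSpan:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans: List[RecordedSpan]) -> Dict[str, float]:
    """Self time = duration minus time spent in direct children.

    Assumes children run sequentially; with parallel children the result
    is clamped at zero instead of going negative.
    """
    child_total: Dict[str, float] = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {
        s.span_id: max(0.0, s.duration_ms - child_total.get(s.span_id, 0.0))
        for s in spans
    }


# Invented numbers shaped like the incident described above.
trace = [
    RecordedSpan("a", None, "service-a: handle request", 1240.0),
    RecordedSpan("b", "a", "service-a: call service-b", 1215.0),
    RecordedSpan("c", "b", "service-b: handle request", 1200.0),
    RecordedSpan("d", "c", "service-b: db query", 1150.0),
]

times = self_times(trace)
bottleneck = max(trace, key=lambda s: times[s.span_id])
print(f"slowest step: {bottleneck.name} ({times[bottleneck.span_id]:.0f} ms self time)")
```

Every span in this trace looks slow by total duration, but only the database query has a large self time, which is why the trace points at it rather than at Service A.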
Another time, we used a log entry in Loki to jump directly into the corresponding trace in Tempo. No guesswork. Just click and see the path the request took. That is how observability changes workflows.
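Jumping from a log line to its trace generally depends on the trace ID being written into the log entry. The sketch below shows one way to do that with Python's standard `logging` and `contextvars` modules. It is an assumption about the mechanism, not a description of the setup in the talk: `TraceIdFilter`, `handle_request`, and the JSON field names are hypothetical, and a random ID stands in for the one a tracing SDK would supply.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "trace_id": "%(trace_id)s", "msg": "%(message)s"}'
))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("order-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request() -> str:
    # Stand-in for the trace ID a tracing SDK would provide (32 hex chars).
    trace_id = uuid.uuid4().hex
    token = current_trace_id.set(trace_id)
    try:
        logger.info("order created")
        return trace_id
    finally:
        current_trace_id.reset(token)


trace_id = handle_request()
entry = json.loads(stream.getvalue().splitlines()[0])
print(entry)
```

With the ID in every log line, a log viewer can extract it and link to the matching trace, which is what turns "click and see the path" into a one-step action.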
Why It Needs to Start Early
The earlier you introduce observability into a project, the more value you extract from it. You understand performance trends before issues arise. You notice inefficiencies before they become incidents.
And when something does break, you already have the context.
In microservices, problems do not announce themselves. They scatter. You need observability to trace the thread back to the cause.
The Case for OpenTelemetry
OpenTelemetry makes this easy. It supports multiple backends, works across environments, and separates instrumentation from tooling.
With it, we get:
- One-time instrumentation across services
- Backend flexibility without code rewrites
- Centralised visibility through standard collectors
- Cost-effective scaling with open-source components
This is not just about tools. It is about engineering confidence.
Final Thoughts
Observability is essential for modern systems. Metrics, logs, and traces each answer a different kind of question. Together, they reveal how systems behave, why they fail, and what to fix next.
OpenTelemetry helps us do this without vendor lock-in or duplication of effort.
If you are building microservices, observability is not an extra. It is your foundation. And it is best to build that foundation before the cracks start showing.