Query Anything, Anywhere: Meet Presto
Discover how Presto enables real-time, federated querying across all your data sources—used by Facebook, Uber & Airbnb. Fast, scalable, and fully open-source.
Editor’s Note: This blog is adapted from Saurabh Mahawar's talk on leveraging Presto, an open-source SQL query engine designed for petabyte-scale analytics across distributed data sources. In this session, he explained how Presto powers real-time querying without moving or duplicating data, along with its architecture, use cases, and open-source ecosystem.
From Bookshelves to Big Data
I am Saurabh, a Developer Relations Engineer at IBM, working on Data Lakehouse solutions. Let me break this down with a simple analogy: imagine managing a massive bookstore and needing to find a specific title, along with its author, price, and publication details. Now scale that challenge to millions of books. That’s what querying petabytes of data feels like without the right system.
This is where Presto makes a difference. Developed by Meta in 2012, Presto is an open-source, distributed SQL query engine that enables real-time analytics across multiple data sources, without moving or duplicating data. Whether your data is in MySQL, PostgreSQL, Hive, MongoDB, or S3, Presto connects directly and queries it where it lives. And the best part? If you know SQL, you already know how to use Presto.
From Facebook’s Challenge to Everyone’s Solution
Back in 2012, Facebook was hitting the limits of what Apache Hive could handle. As data volumes grew, Hive could not deliver the speed or flexibility needed for large-scale, real-time analytics, especially when querying across different data sources.
To solve this, Facebook built Presto: a distributed SQL query engine designed for fast, federated analytics without moving data. What started as an internal solution quickly proved its value. By 2013, Presto was open-sourced, and it didn’t take long for the broader tech community to take notice.
Today, companies like Uber, Airbnb, Adobe, and many others use Presto to run complex queries at scale across systems, formats, and infrastructure.
How Presto Works: A Smarter Way to Query
At the heart of Presto’s architecture is the Coordinator Node, the brain of the system. When a query is initiated, whether through tools like Tableau, Superset, or any other frontend, the coordinator handles parsing, planning, and distributing tasks across the system.
The actual computation is performed by Worker Nodes, which execute these tasks in parallel across distributed data sources. You can scale worker nodes as needed, but there’s always a single coordinator orchestrating the entire process.
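The coordinator/worker split described above is a classic scatter-gather pattern. Here is a toy sketch of that flow in Python; it is purely illustrative (real Presto workers exchange data pages over HTTP and run pipelined operators, not Python functions):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model of Presto's scatter-gather flow: a single "coordinator"
# splits a query into tasks, "workers" run them in parallel, and the
# coordinator merges the partial results.

def worker_task(partition):
    # Each worker scans its own data partition and returns a partial aggregate.
    return sum(partition)

def coordinator(partitions):
    # Fan tasks out to a pool of workers, then combine partial results.
    with ThreadPoolExecutor(max_workers=4) as workers:
        partials = workers.map(worker_task, partitions)
    return sum(partials)

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(coordinator(partitions))  # prints 45, the total across all partitions
```

Adding capacity means adding more workers (more partitions processed concurrently), while the single coordinator remains the planning and merging point, which mirrors how a Presto cluster is scaled.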
Presto also uses Connectors to interface with different databases—MySQL, MongoDB, PostgreSQL, Hive, and more. These connectors convert Presto’s execution plan into native queries for each backend system. There’s no data duplication or movement—just fast, federated querying across systems, regardless of where your data lives.
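Concretely, each connector is registered through a small catalog properties file on the Presto nodes, and the file name becomes the catalog name you reference in SQL. A minimal sketch, with placeholder hosts and credentials:

```properties
# etc/catalog/mysql.properties -- queried as catalog "mysql"
connector.name=mysql
connection-url=jdbc:mysql://mysql.example.com:3306
connection-user=presto
connection-password=secret

# etc/catalog/mongodb.properties -- queried as catalog "mongodb"
connector.name=mongodb
mongodb.seeds=mongo.example.com:27017
```

With both catalogs defined, a single query can reference tables as `mysql.schema.table` and `mongodb.schema.table` side by side.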
A Real-World Example: Uber’s Pricing Engine
If you have ever booked an Uber and noticed the price change within seconds, that’s Presto working behind the scenes. Uber uses Presto to run real-time analytics based on driver availability, rider demand, and location heat maps. These pricing decisions happen in milliseconds, and they rely on Presto’s ability to query multiple sources with minimal latency.
Beyond pricing, Uber also uses Presto for fraud detection, ride analytics, and customer support workflows. It’s a core component of their data infrastructure.
Use Cases That Go Beyond BI Dashboards
Presto is not a database—it does not store data or perform CRUD operations. Instead, it’s a query engine optimized for analytical workloads (OLAP). You can use it for ETL validation, demand forecasting, user behavior analytics, and even reverse-engineering fraud patterns. Whether it’s a small team querying a few GBs or a billion-dollar enterprise analyzing petabytes, Presto scales flexibly—and it’s entirely open source.
A Demo: Running Multi-Source Analytics with Presto
For this meetup, I set up a demo with over 3 million records stored across MySQL and MongoDB. MySQL stored subscription and ad-related data, while MongoDB held real-time match viewing logs. Using Apache Zeppelin as the client, I queried users who watched at least one IPL match, along with their subscription type, match ID, and total watch time.
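The shape of that federated query can be sketched as a join across two catalogs. Since the demo cluster isn’t reproducible here, the snippet below uses two separate SQLite databases as stand-ins for the two sources; the SQL mirrors what you would send to Presto (with `mysql.*` and `mongodb.*` catalog prefixes instead), and all table and column names are invented for illustration:

```python
import sqlite3

# Source 1: subscriptions (playing the role of MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscriptions (user_id INT, plan TEXT)")
conn.executemany("INSERT INTO subscriptions VALUES (?, ?)",
                 [(1, "premium"), (2, "free")])

# Source 2: watch logs, attached as a second database
# (playing the role of MongoDB in the demo).
conn.execute("ATTACH DATABASE ':memory:' AS logs")
conn.execute("CREATE TABLE logs.watch_logs (user_id INT, match_id TEXT, minutes INT)")
conn.executemany("INSERT INTO logs.watch_logs VALUES (?, ?, ?)",
                 [(1, "IPL-01", 40), (1, "IPL-01", 20), (2, "IPL-02", 15)])

# One query joining both sources: users who watched at least one match,
# with their plan, the match ID, and total watch time.
rows = conn.execute("""
    SELECT s.user_id, s.plan, w.match_id, SUM(w.minutes) AS total_watch_time
    FROM subscriptions s
    JOIN logs.watch_logs w ON s.user_id = w.user_id
    GROUP BY s.user_id, s.plan, w.match_id
    ORDER BY s.user_id
""").fetchall()
print(rows)  # [(1, 'premium', 'IPL-01', 60), (2, 'free', 'IPL-02', 15)]
```

In Presto the join crosses machine and engine boundaries rather than two local files, but the query itself stays this simple: the connectors make each source look like just another catalog.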
Presto handled it seamlessly—running federated queries across both sources and returning structured results in seconds. This is what makes Presto powerful: the ability to unify diverse datasets without re-engineering your data pipeline.
The Ecosystem and Where It’s Headed
Presto is written in Java, but the community has been reengineering the worker node in C++ (the Prestissimo effort, built on the Velox library) for improved performance. Presto also supports Kubernetes deployment for cloud-native scalability. Major contributors include Meta, Uber, IBM, and Airbnb, but it’s an open ecosystem, and new contributors are always welcome.
If you are interested in analytics, distributed systems, or open-source infrastructure, I highly recommend diving into Presto. It’s flexible enough to run on a single machine and powerful enough to support global-scale businesses.
Final Thoughts: Query at Source. Operate at Scale.
In modern data environments, fragmentation is the norm. Presto eliminates the need for costly data movement by enabling fast, federated querying directly on source systems, regardless of scale or complexity.
It’s not a convenience layer; it’s infrastructure-critical. For teams dealing with diverse data ecosystems, Presto provides a consistent, SQL-first interface that delivers performance without architectural compromise.
If you are architecting systems that demand flexibility, scale, and speed, Presto belongs in your toolkit. The ecosystem is mature, actively maintained, and designed for real-world workloads.