Engineering a Unified GenAI Framework for Real-World Tasks
Editor’s Note: This blog is adapted from a talk by Sowmya Ranjan, delivered during the Build with AI meetup hosted by GeekyAnts at NIMHANS. In his session, Sowmya shared key insights from his journey building a versatile GenAI framework capable of supporting multiple tasks, models, and domains within a unified architecture. With deep experience integrating open-source LLMs, custom pipelines, and cross-domain agents, he offered a behind-the-scenes view of building a real, extensible GenAI platform from scratch.
A Framework for Real-World GenAI
Hi, I am Sowmya Ranjan. I lead GenAI efforts at my organisation and have spent the past year working on a highly customisable GenAI platform—one that can run across domains, support different use cases, and scale with new models and tasks without starting over each time.
In this talk, I want to walk through how we approached building such a framework, what constraints we had, and what architectural choices helped us stay both modular and extensible. I will also share how we connected the platform with multiple open-source LLMs and turned task-specific chains into fully orchestrated agents.
Starting with Problems, Not Tools
When we began, the question was not which model to use. The question was: what are the business problems we want to solve, and how can we build a system that adapts to different solutions without being rewritten every time?
Most GenAI tools today work well in isolation. You can build a chatbot, or a summariser, or a document parser. But stitching those together into a coherent, multi-capability platform is harder. We wanted a framework where a single entry point could route requests to different modules, use the right model for the task, and return a structured response in a standardised format.
This is how our journey started: with the intent to abstract complexity, not amplify it.
System Architecture: Orchestration First
The heart of our platform is an orchestrator. Every user input goes through the orchestrator, which decides what type of task it is—whether it is a summary request, a question answering query, a chatbot prompt, or a document processing job.
Once the task is identified, the orchestrator routes it to the correct engine. Each engine is designed for a specific task, such as summarisation, Q&A, or context extraction. These engines are modular and pluggable. We can swap out the model or pipeline within an engine without breaking the rest of the system.
To support this, we defined a custom format called Unistruct. All responses from any engine follow the same structure. Whether the user is asking for a summary or a document classification, the output format remains consistent. This allows frontend teams to build against a single contract, while backend teams can evolve independently.
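To make that contract concrete, here is a minimal sketch of what a Unistruct-style response and task-based routing could look like. The field names, the summarise_engine stub, and the TASK_ENGINES mapping are illustrative assumptions for this post, not our production schema.

```python
# Minimal sketch of a Unistruct-style response and task-based routing.
# Field names, the summarise_engine stub, and TASK_ENGINES are illustrative
# assumptions, not the production schema.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Unistruct:
    task: str                                   # e.g. "summarise", "qa", "classify"
    status: str                                 # "ok" or "error"
    payload: Dict[str, Any]                     # task-specific result
    meta: Dict[str, Any] = field(default_factory=dict)  # model, latency, tokens


def summarise_engine(request: Dict[str, Any]) -> Unistruct:
    summary = request["text"][:200]             # placeholder for the real pipeline
    return Unistruct(task="summarise", status="ok", payload={"summary": summary})


TASK_ENGINES: Dict[str, Callable[[Dict[str, Any]], Unistruct]] = {
    "summarise": summarise_engine,
    # "qa": qa_engine, "classify": classify_engine, ...
}


def orchestrate(request: Dict[str, Any]) -> Unistruct:
    """Route a request to the engine registered for its task type."""
    engine = TASK_ENGINES.get(request["task"])
    if engine is None:
        return Unistruct(task=request["task"], status="error",
                         payload={"reason": "unknown task"})
    return engine(request)


print(orchestrate({"task": "summarise", "text": "A long report about Q3 revenue..."}))
```

Because every engine returns the same shape, adding a new task type is a matter of registering another entry in the dispatch table rather than touching the routing logic.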
From LLM Chains to Autonomous Agents
Initially, each engine was a static LLM chain—prompt in, response out. But that quickly hit limits. If the user gave vague input, the chain broke. If the model misunderstood the intent, the result was useless. We needed the system to have reasoning ability.
That is when we added agent capabilities. We built a meta-engine that could trigger multiple sub-engines, run intermediate steps, and make decisions. The agent decides what information is missing, invokes the right tools, and builds up the context before returning the final result.
For example, in the case of document summarisation, the agent first runs a format validator. Then it checks if OCR is needed. If yes, it invokes the OCR engine. Then it extracts context, runs a chunking process, and finally applies the summariser. All of this is invisible to the user. They simply upload a document and get a structured summary in return.
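That flow can be condensed into a short sketch of the agent's decisions. Every helper here, from the format check to the inline chunking and summarising, is a trivial placeholder for the corresponding sub-engine, so read it as an outline rather than the real pipeline.

```python
# Condensed sketch of the summarisation agent's decision flow: validate the
# format, run OCR only when no text layer is found, then chunk and summarise.
# Every helper here is a trivial placeholder for the corresponding sub-engine.
def run_ocr(doc: bytes) -> str:
    return "<text recovered by the OCR engine>"      # placeholder


def summarise_document(doc: bytes, filename: str, text_layer: str = "") -> dict:
    if not filename.lower().endswith((".pdf", ".docx", ".txt")):   # format validator
        return {"status": "error", "reason": "unsupported format"}

    # OCR is invoked only when the document has no usable text layer.
    text = text_layer if text_layer.strip() else run_ocr(doc)

    context = text                                   # stand-in for context extraction
    chunks = [context[i:i + 1500] for i in range(0, len(context), 1300)]  # ~200-char overlap
    summary = " ".join(chunk[:80] for chunk in chunks)    # stand-in for the summariser
    return {"status": "ok", "summary": summary}
```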
Bringing in Open Source Models
We worked with several models, including Llama 2, Mistral, and Mixtral. Some were hosted internally; others were accessed through APIs. One of the key challenges was context size: many open-source models cannot fit large documents in their context window. We solved this by adding a chunking layer with overlap and caching.
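A sliding window with overlap, plus a cache keyed on a content hash so repeated documents are not reprocessed, is enough to illustrate the idea. The sizes and the in-memory cache below are illustrative defaults, not our production values.

```python
# Sliding-window chunking with overlap, plus a content-hash cache so a document
# that was already processed is not chunked (or re-embedded) again. Chunk size,
# overlap, and the in-memory cache are illustrative choices.
import hashlib
from typing import Dict, List

_chunk_cache: Dict[str, List[str]] = {}


def chunk_with_overlap(text: str, size: int = 1500, overlap: int = 200) -> List[str]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key in _chunk_cache:
        return _chunk_cache[key]

    step = size - overlap
    chunks = [text[i:i + size] for i in range(0, len(text), step)] or [""]
    _chunk_cache[key] = chunks
    return chunks
```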
We also tested different embedding models for similarity search and context retrieval. BAAI, Instructor, and MiniLM all performed well for different use cases. Based on the domain—legal, financial, educational—we picked embeddings that retained semantic relevance without adding cost.
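In practice this boils down to a configuration mapping from domain to embedding model. The checkpoint identifiers below are public Hugging Face models used purely as examples of how such a mapping might be expressed, not a record of what ran in production.

```python
# Illustrative domain-to-embedding configuration. The identifiers are public
# Hugging Face checkpoints used as examples, not a record of what ran in
# production.
EMBEDDING_BY_DOMAIN = {
    "legal":       "BAAI/bge-base-en-v1.5",
    "financial":   "hkunlp/instructor-large",
    "educational": "sentence-transformers/all-MiniLM-L6-v2",
}


def embedding_model_for(domain: str) -> str:
    # Unknown domains fall back to a small, cheap general-purpose model.
    return EMBEDDING_BY_DOMAIN.get(domain, "sentence-transformers/all-MiniLM-L6-v2")
```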
For hosting, we used vLLM with continuous batching enabled. This allowed multiple concurrent users to share a single model instance, improving throughput. We also added streaming support using Server-Sent Events through FastAPI, so users could see the response arrive in real time.
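A minimal version of that streaming path looks like the sketch below, where generate_tokens is a hypothetical stand-in for whatever client pulls tokens from the model server (for example, a vLLM deployment with continuous batching).

```python
# Minimal SSE streaming endpoint with FastAPI. generate_tokens is a hypothetical
# stand-in for whatever client streams tokens from the model server (for
# example, a vLLM deployment with continuous batching).
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def generate_tokens(prompt: str):
    for token in ["This ", "is ", "a ", "streamed ", "reply."]:   # placeholder tokens
        await asyncio.sleep(0.05)
        yield token


@app.get("/chat/stream")
async def chat_stream(prompt: str):
    async def event_stream():
        async for token in generate_tokens(prompt):
            yield f"data: {token}\n\n"            # one SSE frame per token
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```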
Multi-Modal and Cross-Domain Readiness
While our primary focus was on text, we made the architecture ready for image and audio processing. For image-to-text conversion, we integrated models like BLIP-2 and LLaVA. For audio, we tested Whisper for transcription.
But more importantly, we ensured that our orchestrator could treat these tasks as peers. The same orchestration layer could route an audio file to the transcription engine, a document to the parser, and a prompt to the chatbot—all without any manual intervention.
This allowed us to build cross-domain flows. For example, a user could upload an image of a document, have it processed via OCR, extract the key context, and then ask follow-up questions on that extracted content—all in one flow, orchestrated through a unified agent system.
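At its simplest, that modality-aware routing is a small dispatch function. The extension-based detection and the engine names below are simplified assumptions made for the sake of the sketch.

```python
# Sketch of modality-aware routing: the orchestrator inspects the input and
# hands it to the matching engine. The extension-based detection and engine
# names are simplified assumptions.
def route_by_modality(payload: dict) -> str:
    filename = payload.get("filename", "").lower()
    if filename.endswith((".wav", ".mp3", ".m4a")):
        return "transcription"        # e.g. Whisper
    if filename.endswith((".pdf", ".docx", ".png", ".jpg")):
        return "document_parser"      # OCR and context extraction
    return "chatbot"                  # plain text prompt
```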
Task Management and Version Control
As the number of tasks grew, we needed a registry. We created a task registry with metadata, validation rules, and configuration parameters for each task type. This allowed us to version tasks, roll back changes, and onboard new capabilities without affecting existing users.
Each task is registered with its input type, output schema, and execution steps. This metadata is what the orchestrator uses to decide which engine to call and how to handle the response. If a new summarisation model is added, we simply update the registry. No need to change the routing logic.
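A registry entry can be as small as a versioned record. The field names and values in this sketch are invented for illustration; the real registry also carries validation rules and fuller configuration.

```python
# Sketch of a versioned task registry entry. Field names and values are
# invented for illustration; the real registry also carries validation rules
# and fuller configuration.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskSpec:
    name: str
    version: str
    input_type: str               # "text" | "document" | "audio" | "image"
    output_schema: str            # name of the Unistruct payload schema
    steps: List[str]              # ordered execution steps
    fallback_engines: List[str]   # tried in priority order on failure


TASK_REGISTRY = {
    ("summarise", "v2"): TaskSpec(
        name="summarise", version="v2", input_type="document",
        output_schema="SummaryPayload",
        steps=["validate", "ocr_if_needed", "extract_context", "chunk", "summarise"],
        fallback_engines=["mixtral-summary", "mistral-summary"],
    ),
}
```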
This approach also made it easy to add fallback logic. If a task fails, the orchestrator can retry with a different model or engine, based on priority. This made the system more resilient and better suited for production usage.
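The fallback itself is little more than a loop over the engines listed for the task in priority order. In the sketch below, run_engine is a hypothetical call into the engine layer.

```python
# Fallback sketch: try the engines listed for the task in priority order until
# one succeeds. run_engine is a hypothetical call into the engine layer.
def run_with_fallback(task_spec, request, run_engine) -> dict:
    last_error = None
    for engine_name in task_spec.fallback_engines:
        try:
            return run_engine(engine_name, request)
        except Exception as exc:        # demo-level handling; log in production
            last_error = exc
    return {"status": "error", "reason": str(last_error)}
```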
Frontend Integration and Reusability
On the frontend, we exposed a simple API layer with clear response types. Since all engines returned data in the Unistruct format, the frontend did not need to know which model was used or what pipeline was executed.
This allowed us to reuse the same UI components across different flows. The summary card, the document viewer, and the Q&A interface all used the same contract. As we added new features, the UI required minimal changes.
We also introduced session IDs and context tracking. A user could start a conversation, upload a document mid-way, and continue the discussion using extracted context. All of this was maintained within the session, giving a fluid, conversational experience without context loss.
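A minimal in-memory version of that session store might look like the following. A production deployment would persist it, but the shape of the data is the point: chat history plus any extracted document context, keyed by session ID.

```python
# Minimal in-memory session store: each session keeps its chat history and any
# context extracted from uploaded documents so follow-up questions can reuse it.
# A production deployment would persist this (e.g. in Redis or a database).
from collections import defaultdict
from typing import Dict

_sessions: Dict[str, dict] = defaultdict(lambda: {"history": [], "doc_context": ""})


def add_message(session_id: str, role: str, text: str) -> None:
    _sessions[session_id]["history"].append({"role": role, "text": text})


def attach_document_context(session_id: str, extracted_text: str) -> None:
    _sessions[session_id]["doc_context"] = extracted_text


def build_prompt(session_id: str, question: str) -> str:
    session = _sessions[session_id]
    history = "\n".join(f"{m['role']}: {m['text']}" for m in session["history"])
    return f"{session['doc_context']}\n\n{history}\nuser: {question}"
```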
Lessons from Building the Framework
There were several technical and architectural lessons that emerged from this journey.
First, abstraction is key. If each engine behaves differently, the system becomes unmanageable. Having a shared schema, shared logging, and shared lifecycle logic helped us maintain consistency.
Second, reasoning matters more than raw output. LLMs can generate anything, but agents need to decide. That decision layer—what to run, what to skip, how to handle ambiguity—is what turned the system from a toolkit into a platform.
Third, metrics matter. We built detailed logs, token trackers, time-to-response metrics, and cost monitoring. Without these, it becomes difficult to scale or debug production issues.
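As a sketch, the per-request record we log captures roughly this shape: token counts, latency, and a rough cost estimate. The field names, sample values, and pricing figure below are illustrative.

```python
# Sketch of a per-request metrics record: token counts, latency, and a rough
# cost estimate. Field names, sample values, and the pricing figure are
# illustrative.
import time
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    task: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float

    def estimated_cost(self, usd_per_1k_tokens: float = 0.0005) -> float:
        return (self.prompt_tokens + self.completion_tokens) / 1000 * usd_per_1k_tokens


start = time.perf_counter()
# ... run the engine ...
metrics = RequestMetrics(task="summarise", model="mixtral", prompt_tokens=1200,
                         completion_tokens=300, latency_s=time.perf_counter() - start)
print(metrics.estimated_cost())
```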
Finally, keeping humans in the loop was essential. For high-risk tasks, we included approval workflows. For complex documents, we allowed preview and correction steps. This balance between autonomy and oversight made the system trustworthy.
Where We Go from Here
The platform is now live across multiple domains. We have built specialised workflows for legal, healthcare, education, and enterprise knowledge systems. Each domain reuses the same core engine set, but adapts the prompts, models, and output format as needed.
In the coming months, we plan to add multilingual support, extend our multi-modal capabilities, and optimise model selection further using reinforcement learning signals.
But the core principle will remain the same. Build once, adapt infinitely. The goal is not to chase every new model—it is to build a system that evolves with the ecosystem without starting over.