Tracing LLM Pipelines with Opik: A Practical Guide

ThinkTools Team

AI Research Lead

Introduction

In the rapidly evolving world of generative AI, the ability to build, monitor, and evaluate large language model (LLM) pipelines in a reproducible manner is becoming a critical differentiator for both research teams and commercial developers. Traditional experimentation workflows often rely on ad‑hoc logging, manual checkpointing, and a fragmented view of the data‑to‑decision pipeline. This lack of end‑to‑end visibility can lead to hidden biases, inconsistent results, and a steep learning curve for new contributors.

Opik, a platform designed to bring observability to AI workflows, offers a unified solution that captures every function call, records input and output artifacts, and automatically aggregates evaluation metrics. By integrating Opik into a local LLM pipeline, developers can trace the flow of data through each stage, quantify performance with custom metrics, and share reproducible experiments with stakeholders. The tutorial that follows walks through a complete implementation: starting with a lightweight model, adding a prompt‑based planning layer, constructing a domain‑specific dataset, and finally running automated evaluations—all while leveraging Opik’s tracing and evaluation capabilities.

The goal of this guide is not only to demonstrate how to set up the pipeline but also to illustrate the broader value of transparent, measurable, and reproducible AI workflows. By the end, you will understand how Opik transforms a simple script into a fully observable system that can be audited, iterated upon, and scaled with confidence.

Building a Lightweight LLM

The first step in any LLM pipeline is selecting a model that balances performance with resource constraints. In this tutorial we choose a distilled version of a transformer architecture that can run comfortably on a single GPU or even a CPU. The model is loaded using a popular open‑source library, and a minimal wrapper is defined to expose a single generate method. This wrapper is intentionally simple: it accepts a prompt string, forwards it to the underlying model, and returns the raw text output.

By keeping the wrapper thin, we preserve the ability to inject additional logic—such as prompt engineering or post‑processing—without modifying the core inference code. The wrapper also becomes the primary point of integration with Opik: each call to generate is wrapped in an Opik trace span, ensuring that the prompt, token counts, and raw output are captured automatically. The code snippet below illustrates the wrapper and the Opik integration:

import opik
from transformers import AutoModelForCausalLM, AutoTokenizer

# distilgpt2 is a distilled causal language model small enough to run on a CPU
model_name = "distilgpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

@opik.track
def generate(prompt: str) -> str:
    # Tokenize the prompt and generate a short continuation
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

The @opik.track decorator automatically creates a span that records the function name, its inputs, and its output. When the pipeline later expands to include additional stages, each stage can be similarly decorated, building a comprehensive trace graph that can be visualized in the Opik dashboard.
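
As a quick sanity check, you can call the traced wrapper directly; the prompt below is arbitrary and only illustrates that the call shows up as a span, assuming Opik has already been configured to point at your workspace or a local instance:

# Illustrative smoke test: the prompt text is arbitrary
response = generate("Explain what observability means for an LLM pipeline in one sentence.")
print(response)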

Prompt‑Based Planning for Contextual Reasoning

A common challenge when working with LLMs is ensuring that the model’s responses remain grounded in the task at hand. Prompt‑based planning addresses this by first generating a structured plan—often a list of sub‑tasks or questions—that guides the subsequent generation steps. In our pipeline, we introduce a lightweight planner that takes the original user query, produces a plan, and then iteratively feeds each plan item back into the model.

The planner itself is a small function that calls the same generate wrapper but with a specially crafted instruction prompt. Because the planner is also decorated with @opik.track, the planner’s output is captured as a separate span. This separation allows us to analyze the planner’s accuracy independently from the final answer generation.

@opik.track
def plan(query: str) -> list[str]:
    instruction = (
        "You are a helpful assistant. First, break down the following query into a list of actionable steps. "
        f"Query: {query}\nSteps:\n"
    )
    plan_text = generate(instruction)
    # Simple split on newlines; in practice you might parse JSON or a more robust format
    return [step.strip() for step in plan_text.split("\n") if step.strip()]
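
The plan then drives the answer stage described above. A minimal sketch of that loop, reusing the generate and plan functions already defined (the answer function name and prompt wording are illustrative assumptions, not part of the Opik API), might look like this:

@opik.track
def answer(query: str) -> str:
    steps = plan(query)
    partial_answers = []
    for step in steps:
        # Answer each plan item in the context of the original query
        step_prompt = f"Query: {query}\nStep: {step}\nAnswer this step concisely:\n"
        partial_answers.append(generate(step_prompt))
    # Combine the per-step answers into a single response
    return "\n".join(partial_answers)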

By capturing both the planner and the final answer in separate spans, we can later compute metrics such as plan coverage, step relevance, and overall answer quality. This granularity is essential for debugging and for demonstrating the pipeline’s transparency to auditors or clients.
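
As one illustration, a rough plan-coverage score can check how many plan steps leave a lexical trace in the final answer; the heuristic below is a hypothetical sketch rather than an Opik built-in:

@opik.track
def plan_coverage(steps: list[str], final_answer: str) -> float:
    # Fraction of plan steps whose longer words appear in the final answer (crude lexical heuristic)
    if not steps:
        return 0.0
    answer_words = set(final_answer.lower().split())
    covered = sum(
        1 for step in steps
        if any(word in answer_words for word in step.lower().split() if len(word) > 3)
    )
    return covered / len(steps)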

Dataset Creation and Management

A robust pipeline requires a dataset that reflects the real‑world scenarios the model will encounter. In this tutorial we construct a small synthetic dataset that pairs user queries with expected answers and optional evaluation criteria. The dataset is stored in a simple CSV format, but the same approach scales to larger JSONL or Parquet files.

Each row in the dataset contains the following fields: query, expected_answer, evaluation_criteria. The evaluation_criteria field can be a JSON string that defines custom scoring functions—for example, a semantic similarity threshold or a keyword match requirement. By embedding evaluation logic directly in the dataset, we enable automated, data‑driven assessment of the pipeline.
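
To make the format concrete, the snippet below writes a tiny example file; the file name, row contents, and the similarity_threshold criterion are invented for illustration. The loader that follows reads the same file back:

import csv
import json

rows = [
    {
        "query": "What does tracing add to an LLM pipeline?",
        "expected_answer": "It records every call with its inputs and outputs so runs can be inspected and reproduced.",
        "evaluation_criteria": json.dumps({"similarity_threshold": 0.7}),
    },
]

with open("pipeline_eval.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "expected_answer", "evaluation_criteria"])
    writer.writeheader()
    writer.writerows(rows)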

import csv

def load_dataset(path: str):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return list(reader)

The dataset loader is trivial, but the real power comes from how we use it in the evaluation loop. Each example is processed through the planner and the generator, and the resulting spans are automatically recorded by Opik. The evaluation step then pulls the spans, compares the generated answer against the expected answer, and records the metric.

Tracing with Opik: Capturing Function Spans

Tracing is the backbone of observability. Opik’s tracing API is intentionally lightweight: a decorator or context manager can be applied to any Python function, and the resulting span will include all relevant metadata. In our pipeline, we apply the decorator to every stage—model inference, planning, evaluation—creating a nested trace hierarchy.

The Opik dashboard visualizes these spans as a tree, where each node represents a function call. Developers can drill down into a node to see the exact prompt, token usage, and output text. Because spans carry searchable metadata and tags, it is possible to filter by attributes such as model_version or dataset_name, enabling quick identification of regressions or performance bottlenecks.

Moreover, Opik supports custom tags and annotations. For example, when the planner generates a step that is later found to be irrelevant, we can annotate the span with a relevance: low tag. These annotations become searchable, allowing teams to surface patterns that might otherwise be buried in logs.
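
In code, such an annotation can be added from inside a traced function. The sketch below assumes the opik_context helper exposed by recent versions of the Opik SDK and uses a deliberately naive relevance heuristic:

from opik import opik_context

@opik.track
def check_step(step: str, query: str) -> bool:
    # Naive heuristic: a step is "relevant" if it shares a longer word with the query
    relevant = any(word in step.lower() for word in query.lower().split() if len(word) > 3)
    if not relevant:
        # Assumes opik_context.update_current_span accepts a tags argument
        opik_context.update_current_span(tags=["relevance:low"])
    return relevant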

Automated Evaluation and Metrics

Evaluation is where the pipeline moves from raw generation to actionable insight. In this tutorial we define a simple evaluation function that computes the cosine similarity between the embedding of the generated answer and the embedding of the expected answer. The function is also decorated with @opik.track, so the similarity score is captured as a child span of the evaluation step.

from sentence_transformers import SentenceTransformer, util

# A separate variable name avoids shadowing the causal LM defined earlier
embedder = SentenceTransformer("all-MiniLM-L6-v2")

@opik.track
def evaluate(generated: str, expected: str) -> float:
    # Embed both answers and score them with cosine similarity
    gen_emb = embedder.encode(generated, convert_to_tensor=True)
    exp_emb = embedder.encode(expected, convert_to_tensor=True)
    return util.cos_sim(gen_emb, exp_emb).item()
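
Putting the pieces together, a minimal loop over the dataset might look like the following; run_evaluation and the file name are illustrative, and the loop reuses load_dataset, evaluate, and the answer helper sketched earlier:

def run_evaluation(dataset_path: str) -> list[float]:
    scores = []
    for example in load_dataset(dataset_path):
        # answer() and evaluate() are traced, so each example produces its own trace in Opik
        generated = answer(example["query"])
        scores.append(evaluate(generated, example["expected_answer"]))
    return scores

scores = run_evaluation("pipeline_eval.csv")
print(f"Mean similarity: {sum(scores) / len(scores):.3f}")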

After running the entire dataset through the pipeline, Opik aggregates the similarity scores and presents them as a histogram in the dashboard. This visual representation makes it easy to spot outliers, assess overall performance, and track improvements over time. Because every evaluation metric is tied back to the original spans, any change in the model or the planner can be traced to its impact on the final score.

Reproducibility and Transparency

One of the most compelling benefits of integrating Opik into an LLM pipeline is a strong foundation for reproducibility. Every trace span records the exact prompts and outputs, and can be enriched with metadata such as the model checkpoint, random seed, and system environment. When a new team member wants to replicate a result, they can load the corresponding trace from Opik, re‑run the same function calls, and confirm that the outputs match.

Transparency also extends to compliance. For regulated industries, auditors often require evidence that a model’s outputs were generated under controlled conditions. Opik’s trace history provides a detailed, timestamped record that can be exported in formats such as JSON or CSV, helping satisfy many regulatory requirements.

Additionally, the ability to annotate spans with custom tags means that teams can embed domain knowledge—such as compliance flags or risk scores—directly into the trace. This feature turns the pipeline into a living document that evolves alongside the model, rather than a static set of scripts.

Conclusion

Building a fully traced and evaluated local LLM pipeline is no longer a niche exercise; it is becoming a standard practice for teams that demand reliability, accountability, and continuous improvement. By leveraging Opik’s lightweight tracing, automated evaluation, and robust data management, developers can transform a simple inference script into a transparent, measurable, and reproducible workflow.

The tutorial demonstrates that the integration effort is modest: a few decorators, a dataset loader, and a custom evaluation function. Yet the payoff is substantial. Teams gain granular visibility into every step of the pipeline, can quickly identify regressions, and can provide verifiable evidence of model behavior to stakeholders. In an era where AI systems are increasingly scrutinized for bias, fairness, and safety, such observability is not just a convenience—it is a necessity.

Call to Action

If you are ready to elevate your LLM projects from ad‑hoc experiments to production‑grade pipelines, start by integrating Opik into your next prototype. Begin with a lightweight model, wrap your inference functions with @opik.track, and iterate on your prompt‑based planner. As you collect traces, experiment with custom evaluation metrics that reflect your domain’s unique requirements. Share your findings on the Opik dashboard, and invite collaborators to review the audit trail.

Whether you are a researcher, a data scientist, or a product engineer, the principles outlined here will help you build AI systems that are not only powerful but also trustworthy and compliant. Dive into the Opik documentation, experiment with the sample code, and join the community of developers who are redefining how we build, monitor, and evaluate language models.
