9 min read

Evaluating Models with Amazon Nova in SageMaker AI

AI

ThinkTools Team

AI Research Lead

Introduction

Amazon SageMaker AI has long been a cornerstone for building, training, and deploying machine learning models at scale. As the field of large language models (LLMs) expands, the need for robust, flexible evaluation pipelines becomes increasingly critical. The latest release of the Amazon Nova model evaluation container that ships with SageMaker AI addresses this need by introducing a suite of new capabilities that empower data scientists and ML engineers to assess model performance with greater depth and precision. In this post, we unpack the key enhancements—custom metrics support, LLM‑based preference testing, log‑probability capture, metadata analysis, and multi‑node scaling—and illustrate how they can be used to streamline evaluation workflows, surface actionable insights, and ultimately accelerate the deployment of high‑quality models.

The evolution of Amazon Nova reflects a broader industry trend toward end‑to‑end observability in ML pipelines. Traditional evaluation often relied on a handful of static metrics such as accuracy or perplexity, which can obscure nuanced behaviors in complex models. Nova’s new features provide a richer, more granular view of model outputs, enabling teams to detect subtle biases, measure user‑centric preferences, and quantify uncertainty in a systematic way. By integrating these capabilities directly into the SageMaker ecosystem, Amazon eliminates the friction of moving data between disparate tools, allowing practitioners to iterate faster with less operational overhead.

In the sections that follow, we dive into each of Nova’s new functionalities, discuss practical use cases, and outline best practices for incorporating them into your evaluation pipeline. Whether you are fine‑tuning a conversational agent, building a recommendation engine, or deploying a generative text model, the enhanced Nova container equips you with the tools to evaluate your models comprehensively and confidently.

Main Content

Custom Metrics Support

One of the most powerful additions to Amazon Nova is the ability to define and compute custom metrics directly within the evaluation container. Prior to this release, users were limited to a predefined set of metrics that often did not align with domain‑specific objectives. Nova now exposes a flexible API that allows developers to write lightweight Python functions or integrate third‑party libraries to calculate metrics such as BLEU, ROUGE, F1‑score, or even bespoke business KPIs.

The process begins by specifying a metrics configuration file in JSON or YAML format. Nova parses this configuration and injects the custom functions into the evaluation runtime. During evaluation, each generated response is passed through the metric functions, and the results are aggregated across the test set. Because the evaluation runs in a containerized environment, the custom code can import any package available in the Python ecosystem, including TensorFlow, PyTorch, or specialized NLP libraries. This flexibility means that teams can tailor the evaluation to the exact nuances of their application—whether that means penalizing hallucinations in a medical chatbot or rewarding diversity in a creative writing model.
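To make this concrete, here is a minimal sketch of what a custom metric function and its configuration entry might look like. The function, module, and configuration keys (metrics, module, function) are illustrative assumptions rather than Nova's documented schema; consult the SageMaker AI documentation for the exact format.

```python
# custom_metrics.py -- hypothetical module referenced from the Nova metrics config.
# The function signature and config keys below are illustrative assumptions,
# not the documented Nova schema.
import re
from typing import Dict

def no_hallucinated_dosage(prediction: str, reference: str) -> float:
    """Toy domain metric: penalize responses that mention dosages absent
    from the reference answer (e.g., for a medical chatbot)."""
    pred_doses = set(re.findall(r"\d+\s?(?:mg|ml|mcg)", prediction.lower()))
    ref_doses = set(re.findall(r"\d+\s?(?:mg|ml|mcg)", reference.lower()))
    hallucinated = pred_doses - ref_doses
    return 1.0 if not hallucinated else max(0.0, 1.0 - 0.5 * len(hallucinated))

# Illustrative configuration entry, mirroring what the JSON/YAML metrics
# configuration file that Nova parses might contain:
METRICS_CONFIG: Dict = {
    "metrics": [
        {"name": "rougeL", "builtin": True},
        {"name": "no_hallucinated_dosage",
         "module": "custom_metrics",
         "function": "no_hallucinated_dosage"},
    ]
}
```

Because the metric is plain Python running inside the container, the same pattern extends to metrics backed by heavier libraries such as ROUGE or BLEU implementations.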

Beyond simply computing metrics, Nova’s custom metric framework supports dynamic weighting and thresholding. For instance, a team might assign higher weight to metrics that capture factual correctness, while setting a minimum threshold for user satisfaction scores. The resulting composite score can be used to rank models, trigger alerts, or feed into automated deployment pipelines. By embedding custom metrics into the evaluation loop, organizations can enforce domain‑specific quality gates and reduce the risk of deploying models that perform well on generic benchmarks but fail in production.
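The sketch below shows how such a weighted composite with a hard quality gate might be computed from per‑metric averages; the specific weights, threshold, and metric names are invented for illustration.

```python
# Hypothetical composite scoring: weights, thresholds, and metric names are
# illustrative, not part of Nova's documented configuration.
WEIGHTS = {"factual_correctness": 0.6, "rougeL": 0.2, "user_satisfaction": 0.2}
MIN_THRESHOLDS = {"user_satisfaction": 0.7}  # hard quality gate

def composite_score(metric_averages: dict) -> tuple[float, bool]:
    """Return (weighted score, passes_quality_gate) for one model."""
    score = sum(WEIGHTS[name] * metric_averages[name] for name in WEIGHTS)
    passes = all(metric_averages[m] >= t for m, t in MIN_THRESHOLDS.items())
    return score, passes

# Example: rank two candidate models and gate deployment on the threshold.
model_a = {"factual_correctness": 0.91, "rougeL": 0.48, "user_satisfaction": 0.82}
model_b = {"factual_correctness": 0.88, "rougeL": 0.55, "user_satisfaction": 0.64}
print(composite_score(model_a))  # (0.806, True)
print(composite_score(model_b))  # (0.766, False) -- fails the satisfaction gate
```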

LLM‑Based Preference Testing

Evaluating user preference is notoriously difficult, especially when dealing with generative models that produce a wide range of plausible outputs. Nova addresses this challenge by integrating LLM‑based preference testing, a method that leverages large language models to automatically assess which of two or more responses better aligns with a given prompt or context.

In practice, the preference testing workflow involves generating multiple candidate responses for each prompt, then feeding pairs of responses into a pre‑trained LLM that has been fine‑tuned on preference data. The LLM outputs a probability distribution indicating which response is preferred. By aggregating these probabilities across the dataset, Nova produces a preference score that reflects how often a model’s output is favored over alternatives.
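The aggregation step can be summarized in a few lines. In the sketch below, the judge callable stands in for the configured preference model, and its signature is an assumption made for illustration; Nova handles the actual generation and judging internally.

```python
from statistics import mean
from typing import Callable

# `judge` stands in for the LLM preference model: given a prompt and two
# candidate responses, it returns P(candidate_a preferred over candidate_b).
# The signature is an assumption for illustration.
Judge = Callable[[str, str, str], float]

def preference_score(judge: Judge, prompt: str, output: str, baselines: list[str]) -> float:
    """Average probability that `output` is preferred over each baseline
    response to the same prompt."""
    return mean(judge(prompt, output, b) for b in baselines)

def dataset_preference_score(judge: Judge, examples: list[dict]) -> float:
    """Aggregate pairwise win probabilities across an evaluation set where
    each example carries 'prompt', 'output', and 'baselines'."""
    return mean(
        preference_score(judge, ex["prompt"], ex["output"], ex["baselines"])
        for ex in examples
    )

if __name__ == "__main__":
    # Toy usage with a dummy judge that always slightly prefers the first response.
    dummy_judge: Judge = lambda prompt, a, b: 0.6
    examples = [{"prompt": "p", "output": "a", "baselines": ["b1", "b2"]}]
    print(dataset_preference_score(dummy_judge, examples))  # 0.6
```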

This approach offers several advantages. First, it scales gracefully: the LLM can evaluate thousands of response pairs in parallel, making it suitable for large‑scale benchmarks. Second, it captures nuanced aspects of language that are difficult to encode in handcrafted metrics, such as coherence, style, or emotional resonance. Third, because the LLM is itself a generative model, it can adapt to evolving user expectations by fine‑tuning on new preference data.

Implementing LLM‑based preference testing in Nova is straightforward. Users specify the preference model in the configuration file, along with any necessary hyperparameters such as temperature or top‑k sampling. Nova then orchestrates the generation, pairing, and evaluation steps, returning a preference matrix that can be visualized or exported for further analysis. Teams can use this matrix to identify systematic weaknesses—such as a tendency to produce overly formal language in casual contexts—and iterate on model architecture or training data accordingly.
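As a rough illustration, the configuration might carry settings along the following lines, shown here as the Python dictionary a JSON or YAML file would deserialize into. Every key and value is an assumption for the sake of the example, not Nova's documented schema.

```python
# Illustrative preference-testing configuration; all keys, values, and paths
# are hypothetical.
preference_eval_config = {
    "preference_model": "my-preference-judge-endpoint",  # hypothetical judge model/endpoint
    "generation": {
        "candidates_per_prompt": 4,
        "temperature": 0.7,
        "top_k": 50,
    },
    "pairing": "all_pairs",  # compare every candidate against every other
    "output": {
        "preference_matrix": "s3://my-bucket/eval/preference_matrix.parquet",  # hypothetical path
    },
}
```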

Log Probability Capture

Understanding the confidence a model places in its predictions is essential for diagnosing issues such as over‑confidence or poorly calibrated uncertainty. Nova’s log probability capture feature records the log‑probabilities of each token in the generated sequence, providing a fine‑grained view of the model’s internal belief distribution.

During evaluation, Nova hooks into the model’s inference engine to intercept the logits before they are transformed into probabilities. These logits are then converted to log‑probabilities and stored alongside the generated text. By aggregating log‑probabilities across the dataset, practitioners can compute metrics such as perplexity, entropy, or confidence intervals.
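The arithmetic behind these quantities is compact. The sketch below shows how per‑step logits become log‑probabilities via a log‑softmax, and how perplexity and entropy follow from them; it illustrates the quantities Nova records rather than its internal implementation.

```python
import math

def log_softmax(logits: list[float]) -> list[float]:
    """Convert raw logits for one decoding step into log-probabilities."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_stats(token_logprobs: list[float]) -> dict:
    """Perplexity from the log-probabilities of the tokens actually generated."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return {"avg_neg_log_likelihood": avg_nll, "perplexity": math.exp(avg_nll)}

def step_entropy(step_logprobs: list[float]) -> float:
    """Entropy (in nats) of the full next-token distribution at one step."""
    return -sum(math.exp(lp) * lp for lp in step_logprobs)
```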

The practical benefits of log probability capture are manifold. For example, high entropy across a response’s tokens may indicate that the model is uncertain and could benefit from additional training data in that domain. Conversely, low entropy coupled with incorrect predictions might signal a systematic bias that needs to be addressed. Log probabilities also enable downstream tasks such as uncertainty‑aware inference, where the system can defer to a human or a fallback model when confidence is low.
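As a simple example of that uncertainty‑aware pattern, a serving layer could gate on the average token log‑probability captured during evaluation. The function and threshold below are hypothetical and would need tuning on held‑out data.

```python
def answer_or_defer(response: str, token_logprobs: list[float],
                    min_avg_logprob: float = -1.5) -> str:
    """Serve the model's answer only when its average token log-probability
    clears a confidence threshold; otherwise route to a fallback.
    The threshold value is arbitrary and should be tuned on held-out data."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    if avg_logprob >= min_avg_logprob:
        return response
    return "Escalating to a human reviewer or fallback model."
```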

Nova’s implementation is designed to be non‑intrusive. The log‑probability data is stored in a structured format (e.g., Parquet or JSON) that can be easily queried with tools like Amazon Athena or Pandas. This compatibility ensures that teams can integrate the data into existing analytics pipelines without rewriting code.
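For instance, if the captured data were laid out as one row per generated token, a quick Pandas query could surface the least‑confident responses. The column names and S3 path here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical layout: one row per generated token, with columns such as
# 'prompt_id' and 'token_logprob'. Column names and the S3 path are assumptions.
df = pd.read_parquet("s3://my-bucket/nova-eval/logprobs/")  # requires s3fs, or point at a local copy

per_prompt = (
    df.groupby("prompt_id")["token_logprob"]
      .agg(avg_logprob="mean", min_logprob="min", n_tokens="count")
      .sort_values("avg_logprob")
)
print(per_prompt.head(10))  # the ten least-confident responses
```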

Metadata Analysis

Modern evaluation datasets often come with rich metadata—data source, user demographics, timestamps, or domain tags—that can provide context for interpreting model performance. Nova’s metadata analysis feature automatically extracts and aggregates this information, allowing users to perform stratified evaluations.

When configuring Nova, users can specify which metadata fields to capture. During evaluation, each generated response is tagged with its corresponding metadata, and Nova aggregates metrics by these tags. For instance, a team can compare the accuracy of a translation model across different language pairs or assess bias across demographic groups.
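A stratified view of this kind boils down to a group‑by over the metadata tags. The toy example below uses invented data and column names purely to illustrate the shape of the analysis.

```python
import pandas as pd

# Toy per-example results with metadata tags attached; values are invented
# purely for illustration.
results = pd.DataFrame(
    {
        "language_pair": ["en-de", "en-de", "en-ja", "en-ja"],
        "domain":        ["legal", "news",  "legal", "news"],
        "bleu":          [34.1,    41.7,    22.9,    30.4],
    }
)

# Stratified view: the same metric broken down by metadata field.
by_pair = results.groupby("language_pair")["bleu"].mean()
by_pair_and_domain = results.groupby(["language_pair", "domain"])["bleu"].mean()
print(by_pair)
print(by_pair_and_domain)
```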

This capability is particularly valuable for compliance and fairness audits. By visualizing performance disparities across protected attributes, organizations can identify and mitigate bias before deployment. Moreover, metadata analysis can inform data collection strategies: if a model performs poorly on a specific domain, the team can prioritize gathering more data from that domain.

Nova’s metadata engine supports both structured and semi‑structured data, and it can ingest metadata from external sources such as Amazon S3 or DynamoDB. The resulting reports can be exported to CSV, JSON, or integrated into business dashboards, ensuring that stakeholders can access insights in the format that best suits their workflow.

Multi‑Node Scaling for Large Evaluations

Evaluating large language models on sizeable test sets can be computationally intensive. Nova addresses this bottleneck by enabling multi‑node scaling, allowing evaluation workloads to be distributed across multiple EC2 instances or SageMaker endpoints.

The scaling mechanism is built on top of Amazon SageMaker’s distributed training infrastructure. Users specify the number of nodes and the instance type in the Nova configuration. Nova then orchestrates the distribution of prompts, collects partial results, and merges them into a final evaluation report.
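Conceptually, this follows a shard‑and‑merge pattern: split the prompt set across workers, evaluate each shard independently, and combine the partial results. Nova performs this orchestration itself; the configuration keys, helper functions, and local process pool below are a simplified stand‑in for illustration.

```python
# Conceptual sketch of the shard-and-merge pattern behind multi-node evaluation.
# Config keys and helpers are illustrative assumptions, not Nova's schema.
from concurrent.futures import ProcessPoolExecutor

SCALING_CONFIG = {"instance_type": "ml.g5.12xlarge", "instance_count": 4}

def shard(prompts: list[str], n_shards: int) -> list[list[str]]:
    """Split the prompt set into roughly equal shards, one per node."""
    return [prompts[i::n_shards] for i in range(n_shards)]

def evaluate_shard(shard_prompts: list[str]) -> dict:
    """Placeholder for per-node inference and metric computation."""
    return {"n": len(shard_prompts), "metric_sum": 0.0}  # stub values

def merge(partials: list[dict]) -> dict:
    """Combine partial results from all shards into one report."""
    total_n = sum(p["n"] for p in partials)
    total_metric = sum(p["metric_sum"] for p in partials)
    return {"n": total_n, "metric_avg": total_metric / max(total_n, 1)}

if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(1000)]
    n_workers = SCALING_CONFIG["instance_count"]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(evaluate_shard, shard(prompts, n_workers)))
    print(merge(partials))
```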

This distributed approach yields significant speedups, especially for models that require GPU acceleration. By parallelizing inference, teams can reduce evaluation time from hours to minutes, enabling rapid iteration cycles. Additionally, multi‑node scaling supports fault tolerance: if a node fails, Nova can automatically retry the workload on another node, ensuring that the evaluation completes reliably.

Implementing multi‑node scaling is as simple as adjusting a few parameters in the configuration file. Nova handles the underlying orchestration, so users can focus on interpreting results rather than managing infrastructure. The scalability also opens the door to evaluating models on massive datasets—such as millions of user queries—without compromising on turnaround time.

Conclusion

Amazon Nova’s latest enhancements transform the way organizations evaluate large language models. By combining custom metrics, LLM‑based preference testing, log probability capture, metadata analysis, and multi‑node scaling, Nova delivers a comprehensive, end‑to‑end evaluation platform that aligns closely with real‑world deployment needs. These features not only provide deeper insights into model behavior but also streamline the feedback loop between data scientists and product teams. As the AI landscape continues to evolve, having a robust, flexible evaluation pipeline will be a key differentiator for companies that aim to deliver reliable, high‑quality models at scale.

The integration of Nova into SageMaker AI means that teams can now perform sophisticated evaluations without leaving the AWS ecosystem. This seamless experience reduces operational friction, accelerates experimentation, and ultimately leads to better models that meet both technical and business objectives. Whether you’re building a chatbot, a recommendation engine, or a generative content platform, the new Nova features give you the tools to evaluate, iterate, and deploy with confidence.

Call to Action

If you’re ready to elevate your model evaluation process, start by exploring Amazon Nova’s new capabilities today. Sign up for a free SageMaker trial, experiment with custom metrics, and leverage LLM‑based preference testing to uncover hidden strengths and weaknesses in your models. Share your findings with the community, contribute to open‑source evaluation tools, and help shape the future of AI evaluation. For detailed documentation, tutorials, and best‑practice guides, visit the SageMaker AI documentation portal and join the discussion on the AWS Developer Forums. Your next breakthrough model is just a few evaluation steps away—let Nova guide the way.
