7 min read

OceanBase’s seekdb: A Unified AI‑Native Hybrid Search Database

AI

ThinkTools Team

AI Research Lead

Introduction

AI applications have evolved from simple classification tasks to complex, multimodal systems that weave together structured tables, unstructured text, embeddings, and even spatial coordinates. In practice, developers rarely find a single clean database that can accommodate all of these data types efficiently. The typical solution is a patchwork of an OLTP database for transactional data, a dedicated vector store for embeddings, and a search engine for full‑text queries. This fragmented architecture introduces latency, increases operational overhead, and makes consistency hard to guarantee.

OceanBase, a leading cloud‑native database provider, has responded to this pain point with the release of seekdb, an open‑source AI‑native hybrid search database. Licensed under Apache 2.0, seekdb promises to bring together the best of relational, vector, and search capabilities into a single, cohesive engine. The goal is to simplify data pipelines for Retrieval‑Augmented Generation (RAG) systems and AI agents that require rapid access to heterogeneous data sources.

In this post we dive deep into the motivations behind seekdb, its architectural innovations, real‑world use cases, and what it means for the future of AI‑centric data management.

Main Content

The Data Challenge in Modern AI

Modern AI workloads routinely consume a blend of data: user profiles stored in relational tables, chat logs in semi‑structured JSON, embeddings generated by transformer models, and occasionally geospatial coordinates for location‑based services. Each data type traditionally finds its home in a specialized system. Relational databases excel at ACID guarantees and complex joins; vector stores like Pinecone or Milvus are optimized for nearest‑neighbor search; search engines such as Elasticsearch provide powerful full‑text indexing.

When an AI system needs to answer a user query, it might first retrieve relevant documents from a vector store, then enrich those results with metadata from a relational table, and finally rank them using a search engine’s scoring algorithm. Coordinating these steps across multiple services introduces network hops, serialization costs, and the risk of data staleness. Moreover, developers must write custom orchestration logic, maintain multiple connection pools, and monitor disparate systems for uptime.

seekdb addresses this fragmentation by offering a unified query interface that can seamlessly combine relational joins, vector similarity searches, and full‑text relevance scoring in a single statement. This integration reduces latency, simplifies code, and ensures that all data remains consistent within a single transaction context.
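As a sketch, such a single statement might look like the following. The `<=>` distance operator appears in seekdb's own example later in this post; the `MATCH ... AGAINST` full-text syntax is assumed here from MySQL convention and is not confirmed seekdb syntax:

```sql
-- Hypothetical hybrid query: relational filter + full-text relevance + vector similarity.
-- MATCH ... AGAINST is assumed MySQL-style full-text syntax; <=> is the vector distance operator.
SELECT d.id, d.title,
       MATCH(d.body) AGAINST('battery life') AS text_score
FROM documents d
WHERE d.category = 'reviews'
  AND MATCH(d.body) AGAINST('battery life')
ORDER BY d.embedding <=> :query_embedding
LIMIT 20;
```

Because all three predicates run in one engine, the result set is consistent as of a single snapshot, with no cross-service reconciliation.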

seekdb Architecture and Design

At its core, seekdb builds upon OceanBase’s proven distributed SQL engine. The key innovation is the addition of a hybrid storage layer that supports three data models concurrently:

  1. Relational tables – Stored in the traditional row‑oriented format, with support for ACID transactions, foreign keys, and complex analytical queries.
  2. Vector columns – Each row can contain one or more high‑dimensional vectors. Internally, seekdb uses an approximate nearest neighbor (ANN) index built on a variant of HNSW (Hierarchical Navigable Small World) that is tightly coupled to the underlying storage engine, enabling index updates to happen in near real‑time.
  3. Full‑text fields – Text columns are automatically tokenized and inverted‑indexed using a lightweight Lucene‑style engine, allowing keyword search and relevance scoring without external dependencies.
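A table spanning all three models might be declared along these lines. The `VECTOR` column type, `FULLTEXT INDEX` clause, and `CREATE VECTOR INDEX` statement are assumptions based on common database conventions, not confirmed seekdb DDL:

```sql
-- Hypothetical DDL: one table holding relational, vector, and full-text data.
CREATE TABLE documents (
    id         BIGINT PRIMARY KEY,
    owner_id   BIGINT NOT NULL,
    title      VARCHAR(255),
    body       TEXT,
    embedding  VECTOR(768),          -- assumed vector column type and dimensionality
    FULLTEXT INDEX ft_body (body),   -- assumed full-text index syntax
    FOREIGN KEY (owner_id) REFERENCES users(id)
);

-- Assumed syntax for building the ANN index (an HNSW variant, per the text):
CREATE VECTOR INDEX idx_embedding ON documents (embedding);
```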

The query planner in seekdb is aware of all three data models. When a user issues a statement that mixes a vector similarity predicate with a relational filter, the planner can decide whether to perform the vector search first and then apply the filter, or vice versa, based on cost estimates. This flexibility is crucial for RAG pipelines where the vector search is often the most expensive operation.

Another noteworthy feature is schema‑on‑write for JSON metadata. Instead of storing JSON as a blob, seekdb parses the structure at insertion time and creates virtual columns that can be queried directly. This approach preserves the flexibility of semi‑structured data while enabling efficient indexing and filtering.
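To illustrate, a query over parsed JSON metadata might look like this; the `->>` path-extraction operator is assumed from MySQL/PostgreSQL convention rather than documented seekdb syntax:

```sql
-- Hypothetical query against schema-on-write JSON metadata.
-- Because seekdb parses the JSON at insert time, a path filter like this
-- can hit a materialized virtual column instead of reparsing every row.
SELECT id, metadata->>'$.author' AS author
FROM documents
WHERE metadata->>'$.language' = 'en';
```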

Practical Use Cases

Retrieval‑Augmented Generation

In a typical RAG system, a user’s question is first encoded into an embedding. The system then retrieves the top‑k most similar documents from a vector store, fetches any associated metadata from relational tables, and finally feeds the combined context into a large language model. With seekdb, the entire pipeline collapses into a single SQL query:

SELECT d.content, u.profile_name
FROM documents d
JOIN users u ON d.owner_id = u.id
ORDER BY d.embedding <=> :query_embedding
LIMIT 10;

The <=> operator computes vector distance, so ordering by it and taking the top 10 rows performs the ANN search, while the join pulls in user metadata. Because the query runs inside the database, there is no need for an external vector service or a separate API call.

AI Agents with Contextual Awareness

AI agents that interact with users often need to maintain context across multiple turns. seekdb’s ability to store conversation logs as relational rows, embed each turn, and perform similarity search against the entire conversation history allows agents to retrieve relevant past interactions on the fly. This capability is essential for building agents that can remember user preferences or past instructions.
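A sketch of such a retrieval follows, assuming a conversation_turns table with a per-turn embedding; the schema is illustrative, not confirmed seekdb usage:

```sql
-- Hypothetical: fetch the past turns most similar to the current user message,
-- scoped to one session, to inject as context for the agent's next response.
SELECT turn_id, role, content
FROM conversation_turns
WHERE session_id = :session_id
ORDER BY embedding <=> :current_message_embedding
LIMIT 5;
```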

Geospatial‑Aware Recommendations

Some AI applications, such as personalized travel assistants, require both semantic similarity and spatial proximity. seekdb’s hybrid index can handle a query that simultaneously filters by distance and vector similarity, producing recommendations that are both contextually relevant and geographically feasible.
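Such a combined query might look like the sketch below; `ST_Distance_Sphere` is assumed from MySQL-style spatial functions and is not confirmed seekdb syntax:

```sql
-- Hypothetical: restrict candidates to within ~5 km of the user,
-- then rank the survivors by semantic similarity.
SELECT p.name, p.description
FROM places p
WHERE ST_Distance_Sphere(p.location, :user_location) < 5000
ORDER BY p.embedding <=> :query_embedding
LIMIT 10;
```

Doing the spatial filter first shrinks the candidate set before the more expensive vector ranking, which is the kind of ordering decision the cost-based planner described earlier can make automatically.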

Performance and Scalability

Benchmarks released by OceanBase demonstrate that seekdb can process a mixed query involving a vector search, a join, and a full‑text filter in under 50 ms on a single node with 64 GB of RAM. When scaled across a cluster, the latency remains sub‑100 ms even for 10 million rows of embeddings, thanks to the distributed nature of the ANN index and the efficient partitioning of data.

One of the challenges with hybrid systems is maintaining index freshness. seekdb addresses this by performing incremental updates to the ANN index in the background, ensuring that newly inserted vectors become searchable within seconds. This near real‑time indexing is a significant advantage over traditional batch‑indexed vector stores.
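In practice, the near real-time indexing described above implies a workflow like this illustrative sketch, with no explicit reindex step between write and search:

```sql
-- Insert a new document; per the text, its vector becomes searchable
-- within seconds via background incremental updates to the ANN index.
INSERT INTO documents (id, owner_id, title, body, embedding)
VALUES (42, 7, 'New doc', 'Fresh content about battery life.', :new_embedding);

-- Shortly afterwards, the same row can already surface in an ANN query:
SELECT id FROM documents ORDER BY embedding <=> :probe_embedding LIMIT 10;
```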

Open Source Ecosystem and Community

By releasing seekdb under the Apache 2.0 license, OceanBase invites the community to contribute to its development, propose new features, and integrate it into existing AI pipelines. The project includes comprehensive documentation, example notebooks for RAG and AI agent workflows, and a set of SDKs for Python, Java, and Go.

The open‑source nature also means that organizations can run seekdb on their own infrastructure, avoiding vendor lock‑in and ensuring compliance with data‑privacy regulations. For teams that already use OceanBase for transactional workloads, adding seekdb is as simple as enabling a new feature flag.

Conclusion

OceanBase’s seekdb represents a significant step forward in AI‑native data management. By unifying relational, vector, and search capabilities into a single, distributed engine, it eliminates the operational complexity that has long plagued AI pipelines. The result is faster development cycles, lower latency, and a more maintainable architecture.

For data scientists and engineers building RAG systems or AI agents, seekdb offers a compelling alternative to the traditional tri‑service stack. Its open‑source license, strong performance, and tight integration with OceanBase’s existing ecosystem make it a practical choice for both startups and large enterprises.

Call to Action

If you’re building or maintaining an AI application that relies on heterogeneous data, consider evaluating seekdb for your next project. Start by cloning the repository, running the provided Docker compose setup, and experimenting with the sample RAG notebook. Reach out to the community on GitHub or the OceanBase forums to share your experiences and contribute improvements. By embracing seekdb, you can streamline your data pipeline, reduce operational overhead, and accelerate the delivery of smarter AI services.
