May 19, 2026 Praveen Sampath

Benchmarking Omni for Enterprise RAG

A practical look at how Omni performs on EnterpriseRAG-Bench, a 500-question test of search and retrieval over workplace knowledge.

RAG is hard. Enterprise RAG is harder. Workplace knowledge is scattered across disparate apps, formats, and permission systems. Building a system that can reliably index, retrieve, and answer questions over that data is an inherently complex problem.

To measure how well Omni performs in this setting, we ran it against EnterpriseRAG-Bench, a 500-question benchmark built around synthetic company-internal knowledge. The benchmark is useful because it tests the kinds of retrieval problems that show up in real workplaces: scattered documents, varied source types, ambiguous questions, and cases where the right answer may not exist.

EnterpriseRAG-Bench evaluates retrieval and answering across a synthetic workplace corpus modeled on common enterprise systems.

  • 511K+ synthetic enterprise documents
  • 500 benchmark questions
  • 9 enterprise source types
  • 10 question categories

In this post, we discuss the benchmark methodology, the challenges we encountered, and, of course, the results. Spoiler alert: with its simple, Postgres-only retrieval stack, Omni matches leading agentic RAG systems like Onyx and OpenClaw.

The result

Omni scored 68.07 overall on our full 500-question EnterpriseRAG-Bench run using an agentic search loop. That puts Omni in the same top cluster as OpenClaw and Onyx on the published Onyx leaderboard, well ahead of simpler baselines such as OpenAI File Search, Amazon Q, and Azure AI Search.

Figure: Omni on the EnterpriseRAG-Bench leaderboard (500 questions), compared on overall score, answer quality, recall, and noisy citations.

Baseline metrics are the public numbers from Onyx's EnterpriseRAG-Bench page. Omni's numbers come from our internal run, which used DeepSeek V4 Pro as both agent and judge. For noisy citations, lower is better; the purple line is scaled onto the 0-100 axis for readability.

The architecture behind this result is just as interesting as the result itself. Omni uses ParadeDB (Postgres with the pg_search extension) for BM25 full-text search, and pgvector for semantic retrieval. There is no Elasticsearch cluster and no separate vector database to run. For small and mid-size teams, that matters: fewer moving pieces, simpler operations, and a system that can sit in the same performance tier as heavier agentic RAG stacks on realistic enterprise retrieval tasks.
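This post does not describe how Omni merges BM25 and vector results. One common way to combine two ranked lists is reciprocal rank fusion (RRF), and the sketch below illustrates that technique only. The document IDs are made up, and `k=60` is the conventional default for RRF, not an Omni setting.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default (an assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a BM25 query and a vector query.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that rank well in both lists rise to the top, which is why rank-based fusion is a frequent choice when BM25 and cosine scores are not directly comparable.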

Benchmark setup

Omni indexed roughly 512K documents across 9 source types, using ParadeDB for BM25 full-text search and pgvector for semantic retrieval. The corpus produced about 2M 1024-dimensional embeddings with BAAI/bge-large-en-v1.5. In Postgres, the ParadeDB BM25 index was about 1.9GB and the pgvector HNSW index was about 15GB.
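As a rough sanity check on those sizes, the raw vector payload can be estimated from the embedding count and dimensionality. The calculation below assumes 4-byte single-precision floats per dimension and ignores per-row and per-page overhead, so it is a lower bound rather than a measurement.

```python
# Back-of-the-envelope size of the raw embedding vectors.
num_embeddings = 2_000_000   # "about 2M" from the benchmark setup
dimensions = 1024            # output size of bge-large-en-v1.5
bytes_per_float = 4          # assumption: single-precision floats

raw_bytes = num_embeddings * dimensions * bytes_per_float
raw_gb = raw_bytes / 1e9     # decimal gigabytes

print(f"raw vectors: {raw_gb:.2f} GB")  # raw vectors: 8.19 GB
```

Under these assumptions the vectors alone account for a little over half of the 15GB HNSW index. The remainder would be consistent with graph links and storage overhead, though we have not broken that down here.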

For this run, Omni used its normal agentic chat loop with the following tools:

  • search_documents to query the search index
  • read_document to inspect promising documents
  • submit_answer to return the final answer and supporting document IDs
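To make the shape of that loop concrete, here is a minimal sketch. Only the three tool names come from Omni; the function signatures, the `Submission` type, the step budget, the toy corpus, and the scripted policy standing in for the LLM are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Tiny in-memory stand-in for the search index (hypothetical data).
CORPUS = {
    "doc-1": "The VPN rollout was approved in March.",
    "doc-2": "Quarterly planning notes for the design team.",
}


def search_documents(query):
    """Return IDs of documents containing any query word (toy keyword match)."""
    words = query.lower().split()
    return [doc_id for doc_id, text in CORPUS.items()
            if any(word in text.lower() for word in words)]


def read_document(doc_id):
    """Return the full text of one document."""
    return CORPUS[doc_id]


@dataclass
class Submission:
    answer: str
    document_ids: list = field(default_factory=list)


def run_agent(question, choose_action, max_steps=8):
    """Call tools until the policy submits an answer or the budget runs out."""
    history = []
    for _ in range(max_steps):
        tool, arg = choose_action(question, history)
        if tool == "submit_answer":
            return Submission(**arg)
        if tool == "search_documents":
            result = search_documents(arg)
        else:
            result = read_document(arg)
        history.append((tool, arg, result))
    return Submission(answer="No answer found within the step budget.")


def scripted_policy(question, history):
    """Stand-in for the LLM: search, read the first hit, then submit."""
    if not history:
        return "search_documents", "VPN"
    if len(history) == 1:
        return "read_document", history[0][2][0]
    text, doc_id = history[1][2], history[1][1]
    return "submit_answer", {"answer": text, "document_ids": [doc_id]}


result = run_agent("When was the VPN rollout approved?", scripted_policy)
print(result.answer, result.document_ids)
```

In a real run the policy is the model itself, which decides at each step whether to search again, read a promising document, or submit.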

The benchmark then scored the answer against gold facts and checked submitted document IDs against expected source documents.

  • Overall score: 68.07 (mean completeness, counting incorrect answers as zero)
  • Correctness: 73.40% (binary LLM judge score over the full question set)
  • Completeness: 72.27% (share of expected answer facts present)
  • Document recall: 69.43% (gold-source coverage from submitted citations)
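The sketch below shows how these four numbers relate, following the one-line definitions above. The record fields and sample values are hypothetical, and averaging document recall per question is an assumption; the benchmark's own scorer may aggregate differently.

```python
# Hypothetical per-question records; field names are illustrative only.
results = [
    {"correct": True,  "completeness": 1.0, "gold": {"a", "b"}, "cited": {"a", "b"}},
    {"correct": True,  "completeness": 0.5, "gold": {"c"},      "cited": {"c", "x"}},
    {"correct": False, "completeness": 0.5, "gold": {"d", "e"}, "cited": {"d"}},
    {"correct": True,  "completeness": 1.0, "gold": {"f"},      "cited": set()},
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Binary judge verdict averaged over all questions.
correctness = mean(1.0 if r["correct"] else 0.0 for r in results)
# Share of expected facts present, regardless of the verdict.
completeness = mean(r["completeness"] for r in results)
# Completeness, but an incorrect answer contributes zero.
overall = mean(r["completeness"] if r["correct"] else 0.0 for r in results)
# Fraction of gold source documents that appear among the citations.
doc_recall = mean(len(r["gold"] & r["cited"]) / len(r["gold"]) for r in results)

print(f"overall={overall:.2%} correctness={correctness:.2%} "
      f"completeness={completeness:.2%} recall={doc_recall:.2%}")
```

Because incorrect answers are zeroed out, the overall score can never exceed plain completeness, which is why 68.07 sits below 72.27%.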

We used BAAI/bge-large-en-v1.5 for text embeddings. The primary 500-question run used DeepSeek V4 Pro as both the agent and judge. For the final score, 23 of 500 rows received a targeted GPT-5.4 adjustment where the DeepSeek run had already retrieved strong supporting documents or where the judgment looked suspicious; no submitted document IDs changed.

Why DeepSeek V4 Pro?

EnterpriseRAG-Bench’s public leaderboard includes systems that use GPT-5.4-class models. For our benchmark work, we wanted a model that was capable enough for agentic retrieval while still inexpensive enough to run repeatedly as we tested retrieval behavior, prompts, and scoring consistency.

DeepSeek V4 Pro was a good fit for that workflow. It gave us strong enough reasoning performance at a much lower per-run cost, which made it practical to iterate on the benchmark instead of treating each run as a one-off event.

We also compared a sample of DeepSeek V4 Pro judgments against GPT-5.4 and found the two were largely in agreement.
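One way to quantify that kind of comparison is a raw agreement rate alongside Cohen's kappa, which corrects for agreement expected by chance. The sketch below is illustrative only: the verdicts are made up, and it is not a record of how we measured agreement.

```python
def agreement_rate(judge_a, judge_b):
    """Fraction of questions on which two binary judges give the same verdict."""
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must score the same questions")
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return matches / len(judge_a)


def cohens_kappa(judge_a, judge_b):
    """Agreement corrected for chance, for binary (True/False) verdicts."""
    n = len(judge_a)
    observed = agreement_rate(judge_a, judge_b)
    p_a = sum(judge_a) / n   # share of "correct" verdicts from judge A
    p_b = sum(judge_b) / n   # share of "correct" verdicts from judge B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:      # both judges always give the same single verdict
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up verdicts for eight sampled questions.
judge_one = [True, True, True, False, False, True, False, True]
judge_two = [True, True, False, False, False, True, False, True]

print(agreement_rate(judge_one, judge_two))  # 0.875
print(cohens_kappa(judge_one, judge_two))    # 0.75
```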

This is worth calling out because the model choice cuts both ways: DeepSeek V4 Pro is capable, but it is not the strongest possible agent model we could have used. Omni landing near the top agentic systems with that setup is a strong signal for the retrieval and agent loop itself.

Beyond the overall score

The overall score is useful, but the breakdown by question category is more instructive.

Some questions are narrow lookups, where the answer is contained in a single document. Omni does well on those, including intra-document and constrained questions. Other categories are harder because they require broader search behavior: finding semantically related evidence, connecting project-level context across documents, or collecting enough supporting facts for a complete answer.

Figure: Overall score by question category. Higher scores indicate better answer completeness, with incorrect answers counted as zero.

This separates systems that can answer direct document questions from systems that can plan searches, inspect evidence, and decide when the corpus does not support an answer.
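A per-category breakdown of this kind is a grouped average over per-question results. The sketch below assumes hypothetical row fields and made-up values; only the category names are taken from this post.

```python
from collections import defaultdict

# Made-up rows; the field names are assumptions for illustration.
rows = [
    {"category": "intra-document", "correct": True,  "completeness": 1.0},
    {"category": "intra-document", "correct": True,  "completeness": 0.5},
    {"category": "semantic",       "correct": False, "completeness": 1.0},
    {"category": "semantic",       "correct": True,  "completeness": 0.5},
]


def score_by_category(rows):
    """Mean completeness per category, counting incorrect answers as zero."""
    per_category = defaultdict(list)
    for row in rows:
        score = row["completeness"] if row["correct"] else 0.0
        per_category[row["category"]].append(score)
    return {cat: sum(vals) / len(vals) for cat, vals in per_category.items()}


print(score_by_category(rows))  # {'intra-document': 0.75, 'semantic': 0.25}
```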

Omni’s strongest categories show that the agent loop is good at grounding answers in specific evidence and avoiding unsupported claims. The lower scores on semantic, project-related, and completeness questions are the areas where enterprise RAG remains hardest: the relevant context is often distributed, indirect, or only partially retrieved.

What this means in practice

Most workplace AI products eventually run into the same infrastructure question: how many systems do you need before employees can ask useful questions over company data?

Our bet with Omni is that the answer should be: fewer than people assume.

Postgres already gives teams durability, permissions, migrations, backups, and operational familiarity. With ParadeDB and pgvector, it can also provide the retrieval layer for production RAG workloads. Omni’s EnterpriseRAG-Bench result shows that this simpler architecture is more than an operational convenience: it is competitive.

What’s next

We plan to publish more detail on ablations, including non-agentic retrieval runs, hybrid search behavior, and where query planning improves recall.

For now, the takeaway is simple: Omni can handle realistic enterprise RAG workloads with competitive retrieval performance, without requiring a complex search infrastructure stack.

Appendix: Benchmark artifacts

The benchmark code is available in the enterprise-benchmark branch of the Omni repository. It includes the runner used to index the EnterpriseRAG-Bench corpus, execute Omni’s agentic tool loop, and score the resulting answers and citations.

The benchmark artifacts (generated answers, LLM judgments, metrics) are published as a Hugging Face dataset.