Beyond Text: Unlocking True Multimodal, End-to-end RAG with Tomoro ColQwen3
For the last two years, "RAG" (Retrieval-Augmented Generation) has been almost exclusively synonymous with text. The standard playbook is simple: take documents, strip out the text, chunk it, and index it.
But the enterprise world doesn't run on plain text files. It runs on complex PDFs, slide decks with dense charts, CAD drawings, scanned invoices, and increasingly, massive archives of video footage.
The industry-standard workaround, using OCR or Vision LLMs to "caption" an image into text before embedding it, is a massive bottleneck. It is slow, expensive, and inevitably loses the rich spatial and visual context of the original data. If your AI describes a complex visual simply as "a blueprint," you have lost the ability to search for the specific valve or wiring schematic inside it.
Today, we are releasing Tomoro ColQwen3, a family of state-of-the-art multimodal embedding models built on the ColPali architecture and designed to solve this problem by enabling end-to-end visual retrieval.
The Engineering Behind the Magic: Advanced Fine-Tuning Strategies
Building a truly effective multimodal retriever isn't as simple as taking an off-the-shelf Vision-Language Model (VLM) and asking it to search. To solve the challenges that have historically held multimodal RAG back, we created Tomoro ColQwen3 using a combination of model merging and targeted training strategies.
1. Transferring Embedding Capability via Tuned Merged Weighting
Instead of fine-tuning a base model from scratch, we engineered a method to transfer retrieval capability from an existing textual embedding model. A standard VLM knows how to describe images, but it doesn't necessarily know how to rank them semantically against a query.
To bridge this gap, we utilized tuned merged weighting strategies. In our experiments, we found that the retrieval capability of Qwen3-Embedding in the textual latent space could be transferred directly to the multimodal space, generalizing to image and video retrieval even before any further training. Using this merge as a strong initialization, we then fine-tuned the model to further optimize performance and ensure robustness. This approach improves final retrieval performance compared to fine-tuning from the Qwen3-VL base model.
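To make the idea concrete, here is a minimal sketch of linear weight interpolation between two checkpoints, the basic operation behind model merging. The function and the merge ratio are illustrative rather than our exact recipe, and it assumes the text-embedding model and the VLM's language tower share parameter names and shapes:

```python
import torch

def merge_state_dicts(vlm_sd: dict, text_embed_sd: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate parameters that exist in both models.

    alpha = 1.0 keeps the VLM weights unchanged; alpha = 0.0 copies the
    text-embedding weights. Tensors missing from either model, or with
    mismatched shapes (e.g. vision-tower layers), are left untouched.
    """
    merged = dict(vlm_sd)
    for name, w_text in text_embed_sd.items():
        w_vlm = vlm_sd.get(name)
        if isinstance(w_vlm, torch.Tensor) and w_vlm.shape == w_text.shape:
            merged[name] = alpha * w_vlm + (1.0 - alpha) * w_text
    return merged
```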
2. Careful Hyper-parameter Tuning
Once the embedding capabilities were transferred, we focused on alignment. We conducted careful hyper-parameter searches and ablation studies, covering the optimal number of epochs, batch size, learning rate, LoRA rank, and dataset mixture, to maximize retrieval performance.
For fine-tuning data, we chose VDR, the ColPali training set, visrag_syn, and visrag_ind, sampled with UniMax (sketched below). Because the merged base model already inherits partial multimodal retrieval capability, we found that adding further datasets slightly degraded performance, so we did not scale the mixture beyond these four.
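UniMax allocates a fixed example budget as uniformly as possible across datasets while capping how many epochs any single dataset can repeat (Chung et al., 2023). A minimal sketch, with made-up dataset sizes rather than the real statistics:

```python
def unimax_budgets(sizes: dict, total_budget: float, max_epochs: float = 1.0) -> dict:
    """Spread `total_budget` examples as uniformly as possible across
    datasets, capping each dataset at `max_epochs` passes over its data."""
    order = sorted(sizes, key=sizes.get)  # smallest dataset first
    remaining, left = total_budget, len(sizes)
    budgets = {}
    for name in order:
        share = remaining / left           # uniform share of what's left
        budgets[name] = min(share, sizes[name] * max_epochs)
        remaining -= budgets[name]
        left -= 1
    return budgets

# Illustrative sizes only, not the real dataset statistics:
sizes = {"VDR": 500_000, "colpali_train": 120_000,
         "visrag_syn": 200_000, "visrag_ind": 30_000}
print(unimax_budgets(sizes, total_budget=400_000))
```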
Here are the hyper-parameters we selected for the fine-tuning:
| Hyper-parameter | Value |
| --- | --- |
| LoRA Rank | 32 |
| LoRA Alpha | 32 |
| LoRA target modules | all image and text layers except embeddings |
| Learning Rate | 5e-5 |
| Max Visual Tokens | 1280 |
| Training Epochs | 1 |
| Global Batch Size | 512 |
| Warmup Ratio | 5% |
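For readers replicating a similar setup with the PEFT library, the table translates roughly into a config like the one below. The module names are placeholders (the real names depend on the Qwen3-VL implementation), and the dropout value is an assumption since the table doesn't list one:

```python
from peft import LoraConfig

# Rough translation of the table above into a PEFT config; embedding
# layers are deliberately not targeted.
lora_config = LoraConfig(
    r=32,                      # LoRA rank
    lora_alpha=32,             # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention, vision + text
        "gate_proj", "up_proj", "down_proj",      # MLP, vision + text
    ],
    lora_dropout=0.0,          # assumption: not listed in the table
)
```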
3. The "ColBERT" Dilemma: High Performance vs. Storage Cost
Historically, models using "ColBERT-style" token-level embeddings (like ColPali) offered incredible performance but at a prohibitive storage cost. Indexing every visual token created massive vector databases that were too expensive for enterprise scale.
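For context, here is the standard MaxSim late-interaction scoring that ColPali-family models use, sketched in plain PyTorch (assuming L2-normalized token embeddings):

```python
import torch

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction between one query and one document.

    query_embs: (num_query_tokens, dim), L2-normalized
    doc_embs:   (num_doc_tokens, dim),   L2-normalized
    Each query token is matched to its most similar document token, and
    the per-token maxima are summed into a single relevance score.
    """
    sim = query_embs @ doc_embs.T          # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()
```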
- The Optimization Strategy: We addressed the storage challenge by tuning the size of our embeddings to retain maximum performance while keeping storage costs in check. We selected an embedding dimension of 320 as the best balance between retrieval accuracy and storage footprint.
- The Result: Tomoro ColQwen3 achieves a 13x reduction in storage costs compared to previous generations of similar models.
- Baseline: A standard approach (like NVIDIA Nemo-3B) requires approximately 10.3 TB to store embeddings for 1 million images (1,802 tokens @ 3,072 dims).
- Tomoro ColQwen3: Stores the same 1 million images in just 0.82 TB (max 1,280 tokens @ 320 dims).
Tomoro ColQwen3 achieves SOTA retrieval accuracy without blowing up your cloud infrastructure budget.
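A quick back-of-the-envelope check of these figures, assuming fp16 storage (2 bytes per value) and decimal terabytes; the exact baseline number depends on unit conventions and index overhead:

```python
def index_size_tb(num_images: int, tokens_per_image: int, dim: int,
                  bytes_per_value: int = 2) -> float:
    """Embedding index size in decimal terabytes, assuming fp16."""
    return num_images * tokens_per_image * dim * bytes_per_value / 1e12

print(index_size_tb(1_000_000, 1802, 3072))  # ~11.1 TB for the 3,072-dim baseline
print(index_size_tb(1_000_000, 1280, 320))   # ~0.82 TB for Tomoro ColQwen3
```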
Evaluation Results
We validated our approach on the ViDoRe benchmark suite. The results confirm that Tomoro ColQwen3 sets new standards on the latest V3 and V2 benchmarks (both English and Multilingual) while maintaining top-tier performance on legacy V1 tasks.
ViDoRe V3 (Latest)
Tomoro ColQwen3-8B leads in English and Multilingual retrieval.
English nDCG@5
| Model | CompSci | Energy | Finance | HR | Ind. | Pharma | Physics | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tomoro-colqwen3-8b | 0.7443 | 0.6491 | 0.6823 | 0.6421 | 0.5766 | 0.6665 | 0.4747 | 0.6113 |
| tomoro-colqwen3-4b | 0.7419 | 0.6023 | 0.6753 | 0.6037 | 0.5787 | 0.6612 | 0.4640 | 0.5934 |
| nemo-colembed-3b | 0.7514 | 0.5838 | 0.6712 | 0.6256 | 0.5447 | 0.6524 | 0.4128 | 0.5769 |
| jina-embeddings-v4 | 0.7175 | 0.5842 | 0.6417 | 0.6206 | 0.5443 | 0.6303 | 0.4191 | 0.5680 |
Multilingual nDCG@5
| Model | CompSci | Energy | Finance | HR | Ind. | Pharma | Physics | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tomoro-colqwen3-8b | 0.7194 | 0.6619 | 0.6172 | 0.6097 | 0.5164 | 0.6403 | 0.4706 | 0.5866 |
| tomoro-colqwen3-4b | 0.7213 | 0.6374 | 0.6019 | 0.5637 | 0.5131 | 0.6351 | 0.4636 | 0.5708 |
| nemo-colembed-3b | 0.7216 | 0.5901 | 0.5646 | 0.5504 | 0.4335 | 0.6170 | 0.4192 | 0.5383 |
ViDoRe V2 & V1 Highlights
Consistent high performance across ESG, Economics, and DocVQA.
| Benchmark | Model | Score (Avg) | Notable Win |
| --- | --- | --- | --- |
| ViDoRe V2 (English) | Tomoro ColQwen3-8B | 0.6772 | #1 in ESG & BioMed |
| ViDoRe V2 (Multi) | Tomoro ColQwen3-8B | 0.6085 | #1 in Economics |
| ViDoRe V1 (English) | Nemo ColEmbed 3B | 0.9100 | Slightly higher average |
| ViDoRe V1 (English) | Tomoro ColQwen3-8B | 0.9076 | #1 in ArxivQA & Syn-Gov |
Video Retrieval: CareBench Evaluation
To demonstrate that Tomoro ColQwen3 strongly generalizes to video retrieval, we evaluated the models on the CareBench benchmark for text-to-video (General Retrieval) tasks.
For this evaluation, we utilised a raw video encoding approach: our models encoded the video files directly without any additional textual annotations or metadata inputs. This highlights the model's ability to perform retrieval based purely on visual semantics.
| Model | Recall@1 | Recall@5 | Recall@10 |
| --- | --- | --- | --- |
| tomoro-colqwen3-8b | 0.8670 | 0.9590 | 0.9850 |
| tomoro-colqwen3-4b | 0.8620 | 0.9570 | 0.9800 |
| Care7B | 0.7700 | 0.9560 | 0.9870 |
Unlocking "Dark Data": The Video Frontier
One of the most profound shifts enabled by Tomoro ColQwen3 is its ability to treat video just like documents. In many enterprises, video is "dark data." It sits in cold storage, completely unsearchable unless a human has manually tagged every file.
Tomoro ColQwen3 changes this by enabling frame-wise retrieval on segmented clips.
Use Case Spotlight: The Corporate Memory Bank
Every day, enterprises record thousands of hours of internal video meetings, and usually these recordings are lost to the void.
- The Workflow: By automatically segmenting long meeting recordings into shorter, logical clips and indexing them with ColQwen3 (see the sketch after this list), you create a searchable "corporate memory."
- The Query: A project manager can ask: "Show me the clip where the engineering lead explained the latency issue with the API."
- The Result: Instead of returning a 60-minute video file, the system retrieves the exact 45-second segment where that specific diagram was shared on screen and discussed, even if the speakers didn't explicitly say the word "latency" at that exact second.
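A minimal sketch of the segmentation step using ffmpeg's fixed-length segment muxer; `embed_clip` below is a placeholder for the model's actual video inference code (see the model card), not a real API:

```python
import subprocess
from pathlib import Path

def segment_video(src: str, out_dir: str, clip_seconds: int = 45) -> list[Path]:
    """Split a long recording into fixed-length clips with ffmpeg.

    Fixed-length segmentation is the simplest option; in practice you
    might segment on shot boundaries or slide changes instead.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", src, "-c", "copy", "-f", "segment",
         "-segment_time", str(clip_seconds), "-reset_timestamps", "1",
         str(out / "clip_%04d.mp4")],
        check=True,
    )
    return sorted(out.glob("clip_*.mp4"))

# Indexing loop; `embed_clip` stands in for the model's video inference:
# index = [(clip, embed_clip(clip)) for clip in segment_video("meeting.mp4", "clips/")]
```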
Enterprise Use Cases: Solving High-Value Problems
The move to native multimodal embedding unlocks entirely new workflows across industries. We have extensively benchmarked this model against some of the most complex enterprise data environments.
1. Benchmark Highlight: Urban Consulting & Architecture
We tested Tomoro ColQwen3 against the data challenges faced by urban consulting firms, which manage massive libraries of master plans, zoning maps, and complex architectural elevations.
- The Challenge: In these environments, traditional RAG pipelines struggle. OCR and captioning tools typically reduce a complex site plan to a label like "schematic," missing critical details about traffic flow, green zones, or building setbacks.
- The Performance: Our internal benchmarks demonstrate that ColQwen3 effectively bypasses the need for expensive OCR/captioning pre-processing. In tests with high-density architectural drawings, the model successfully retrieved documents based on specific visual markers, such as pedestrian access points or distinct zoning boundaries, significantly outperforming text-only baselines while promising a drastic reduction in knowledge-ingestion costs.
2. Complex Financial & Market Intelligence
Investment bankers and analysts spend thousands of hours sifting through annual reports and investor decks.
- The Workflow: An analyst can ask, "Find the breakdown of APAC revenue vs. EMEA revenue from the 2024 slide decks."
- The Shift: Instead of relying on keyword matches, the model retrieves the exact slide based on the visual presence of the specific data visualisation, allowing the RAG system to answer questions with high precision.
3. Industrial Manufacturing & Field Service
Manufacturing firms possess vast libraries of technical manuals and piping diagrams (P&IDs).
- The Workflow: A field engineer fixing a turbine can take a photo of a part or ask, "Show me the wiring diagram for the hydraulic pump assembly."
- The Shift: ColQwen3 identifies the visual representation of the wiring diagram and retrieves the correct page, potentially reducing downtime by hours.
4. Legal & Compliance
Legal discovery involves reviewing millions of scanned pages where the "needle" is often a visual marker.
- The Workflow: A compliance officer can search for "Documents signed by the CFO" or "Contracts containing the official 'Confidential' stamp."
- The Shift: The model distinguishes between a draft and a finalized document based on the visual presence of signatures or stamps, something pure OCR often misses.
Getting Started
Tomoro ColQwen3 is open-weights (Apache 2.0) and available now on Hugging Face. We have designed it to be a drop-in upgrade for modern RAG pipelines, providing the inference code needed to process text, images, and video immediately.
For enterprises ready to move beyond simple text search and unlock the intelligence trapped in their visual assets, this is the new standard.
Next Step: Explore the model card and try the live demo on Hugging Face: TomoroAI/tomoro-colqwen3-embed-8b and TomoroAI/tomoro-colqwen3-embed-4b
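As a starting point, here is a minimal loading sketch. It assumes the repositories expose their inference code via transformers' `trust_remote_code` path, as is common for ColPali-family releases; check the model card for the exact encoding and scoring calls:

```python
from transformers import AutoModel

# Assumption: the repo ships its own modelling code; consult the model
# card for the query/document encoding and scoring interface.
model = AutoModel.from_pretrained(
    "TomoroAI/tomoro-colqwen3-embed-8b",
    trust_remote_code=True,
    device_map="auto",
)
```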


