Field Notes: Multimodal RAG for Technical PDFs

By Volodymyr Khrystynych · March 30, 2026

The Problem Standard RAG Quietly Has

A standard RAG pipeline embeds the text of your documents and matches a query against those embeddings. That works fine until the answer the user is looking for lives inside an image — a torque rating in a diagram, a wiring layout in a schematic, a value in a hand-drawn callout.

In a technical-documentation corpus, that is not an edge case; it is most of the questions that matter. A text embedding cannot retrieve a pixel.

This is the problem we set out to solve at D&V Electronics: a RAG system over internal technical PDFs where a meaningful share of the useful information is visual, and the engineers using it expect to ask in plain English.

Closing the Modality Gap

The trick is to make the images addressable by text. We ran every image through a vision-language model — Qwen3 8B, hosted locally — and produced a natural-language description of what the image actually shows. Those descriptions then live alongside the surrounding text in the same chunk and get embedded together.
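
For concreteness, here is a minimal sketch of that captioning pass, assuming the VLM sits behind an OpenAI-compatible endpoint (the kind vLLM or Ollama exposes). The URL, model tag, and prompt are placeholders, not the production values.

    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    def describe_image(png_bytes: bytes) -> str:
        """Produce a text surrogate for one extracted image."""
        b64 = base64.b64encode(png_bytes).decode()
        resp = client.chat.completions.create(
            model="qwen3-vl-8b",  # placeholder tag for the local 8B VLM
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this technical image: components, "
                             "labels, and any numeric values with units."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content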

Once an image has a textual surrogate, normal vector search can find it. The query "what is the torque spec for the rear bracket" no longer has to compete with an empty embedding for a diagram; it competes with a sentence that says "Diagram showing torque values for the rear bracket assembly."

The chunking ended up at roughly 800 characters per chunk, with 150 characters of overlap. We weighted each image at about 300 characters of that budget, which kept image-heavy pages from being underrepresented.
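
A minimal sketch of how that budget can work. The element parsing and function names here are hypothetical; only the 800/150/300 numbers come from the setup above.

    CHUNK_BUDGET = 800   # characters per chunk
    OVERLAP = 150        # characters carried into the next chunk
    IMAGE_COST = 300     # an image's description "costs" a flat 300

    def chunk_elements(elements):
        """elements: ordered (kind, text) pairs for one document; images
        arrive with their VLM description already attached as text."""
        chunks, current, used = [], [], 0
        for kind, text in elements:
            cost = IMAGE_COST if kind == "image" else len(text)
            if current and used + cost > CHUNK_BUDGET:
                chunks.append("\n".join(current))
                tail = chunks[-1][-OVERLAP:]  # overlap with previous chunk
                current, used = [tail], len(tail)
            current.append(text)
            used += cost
        if current:
            chunks.append("\n".join(current))
        return chunks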

Hybrid Search Still Wins

Pure vector retrieval misses things keyword retrieval catches, and vice versa. We ran both — vector embeddings for semantic similarity, BM25 for exact terms — and merged the candidates through a reranker that scored relevance against the actual query. The top five results made it into the final LLM call, with a 4,000-character context budget.

The pattern is unglamorous and consistent: the retrieval layer wins on recall through redundancy, and the reranker wins on precision by being mean.
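
As a sketch, the merge-and-rerank step looks roughly like this. The vector_store and bm25 objects are stand-ins for whatever retrievers you run, and the cross-encoder named here is a common public checkpoint, used as a placeholder.

    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def hybrid_retrieve(query, vector_store, bm25, k=20, top_n=5, budget=4000):
        # Union of semantic and keyword candidates: recall through redundancy.
        candidates = {c.id: c for c in vector_store.search(query, k=k)}
        candidates.update({c.id: c for c in bm25.search(query, k=k)})
        # Score every (query, chunk) pair against the actual query.
        scores = reranker.predict([(query, c.text) for c in candidates.values()])
        ranked = sorted(zip(candidates.values(), scores),
                        key=lambda pair: pair[1], reverse=True)
        # Top five survivors, capped by the 4,000-character context budget.
        context, used = [], 0
        for chunk, _ in ranked[:top_n]:
            if used + len(chunk.text) > budget:
                break
            context.append(chunk.text)
            used += len(chunk.text)
        return context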

Evaluation Came Late, Then Saved Us

We bolted RAGAS on after the system was already running. Within an afternoon it told us what the next two weeks of work should be: context retrieval was over 90%, faithfulness was 80–85%, and answer relevancy was sitting at 60%. The retrieval was fine. The model was producing coherent answers. They just weren't the answers the user was asking for.
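
The harness itself is small. This is a minimal sketch against the classic RAGAS interface: the one-row dataset is made up for shape, context_recall stands in for the retrieval score above, and evaluate() needs a judge LLM configured behind the scenes.

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy, context_recall

    # One made-up row for shape; in practice these are logged
    # question / answer / retrieved-context triples plus a reference.
    ds = Dataset.from_dict({
        "question": ["What is the torque spec for the rear bracket?"],
        "answer": ["The rear bracket bolts are torqued to 25 Nm."],
        "contexts": [["Diagram showing torque values for the rear "
                      "bracket assembly: M8 bolts, 25 Nm."]],
        "ground_truth": ["25 Nm for the rear bracket bolts."],
    })
    print(evaluate(ds, metrics=[faithfulness, answer_relevancy, context_recall]))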

If we had set up evaluation at the start, we would have known where to push from week one. Evaluation infrastructure is not a deliverable — it is a steering wheel. Build it before you start optimizing or you will optimize the wrong thing.

What Actually Moved the Needle

A short, honest list:

  • HyDE (hypothetical document embeddings) — generate a fake answer to the query first, embed that, and search with it. Worth two or three percent on relevancy (see the sketch after this list).
  • Mixed embeddings — 70% summary, 30% original chunk. Better than either alone.
  • Context budget tuning — bigger gains than swapping the model.
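
Here is a compact sketch of the first two items, assuming a sentence-transformers embedder and some callable LLM. The model names are placeholders; only the 70/30 ratio comes from our setup.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in model

    def hyde_query_vector(query, llm):
        # HyDE: embed a hypothetical answer instead of the raw question.
        fake_answer = llm(f"Write a short passage answering: {query}")
        return embedder.encode(fake_answer, normalize_embeddings=True)

    def mixed_chunk_vector(summary, original):
        # Index-time mix: 70% summary embedding, 30% original chunk.
        mixed = 0.7 * embedder.encode(summary, normalize_embeddings=True) \
              + 0.3 * embedder.encode(original, normalize_embeddings=True)
        return mixed / np.linalg.norm(mixed)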

Upgrading the VLM by an order of magnitude bought us about 2%. Tightening how we managed context bought us a lot more. If you are choosing between "use a bigger model" and "spend a week on your context strategy," the second is almost always the answer.

The Last Mile Will Eat Your Schedule

Everything ran on-prem on a 128GB Dell workstation. The 8B model was the right call — the 32B did not produce noticeably better answers and made every RAGAS run a multi-hour event.

The unexpected time sink was the frontend. We were embedded in a WPF application with DevExpress's AI chat controls — a stack chosen for reasons that had nothing to do with this project. Images did not render in the chat surface. Saved conversations needed HTML, not Markdown. Conversation management was rigid in ways that took real workarounds to soften.

This is the part of any RAG project that gets estimated wrong. The retrieval is interesting and well-trodden. The pipeline is mostly solved. The thing that consumes the budget is making the answer presentable inside the system the user actually opens.

Takeaways

  • Retrieval works in one modality. If your data lives in another, translate it first.
  • Set up evaluation before you start optimizing.
  • Context strategy beats model size.
  • Budget the last-mile UI work explicitly. It is never small.

Volodymyr Khrystynych

Written by Volodymyr Khrystynych, partner at Khrystynych Innovations Inc., an AI and Web3 consultancy specializing in multimodal RAG, AI automation, AI training, and smart contract engineering on Ethereum and Solana.

Have a project in mind? Let's talk.