OCR & Document Parsing for RAG: OpenCV vs Docling vs Mistral OCR

If you are building a RAG pipeline for your Norwegian SMB, you will eventually face a critical question: how do you get structured text out of PDFs, scanned invoices, and printed contracts? The answer is not as simple as "just use OCR." The tooling landscape in 2026 spans from pixel-level computer vision to full document intelligence engines, and choosing the wrong tool wastes both time and server resources.
This guide breaks down the key differences, compares the best open-source and API-based options, and recommends a practical stack that runs on a budget 2GB VPS — the same setup we covered in our Qdrant deployment guide.
The Fundamental Difference: Computer Vision vs Document Intelligence
Before comparing tools, it is essential to understand that OpenCV and OCR/parsing engines solve fundamentally different problems.
OpenCV is a computer vision library. It operates on pixels. It can crop, rotate, sharpen, threshold, and filter images. It has no concept of "text" or "tables" — it sees matrices of numbers. OpenCV is powerful for image preprocessing, but it cannot read a single word on its own.
OCR and document parsing engines are document intelligence tools. They recognize characters, understand layout structure, identify tables and headings, and output structured text in formats like Markdown or JSON. They are purpose-built for extracting meaning from documents.
The key insight: These are not competitors. They are complementary. The best RAG pipelines use both — OpenCV for preparation, OCR for extraction, and parsing for structuring.
How Each Tool Category Fits into a RAG Pipeline
Computer Vision (OpenCV)
OpenCV handles the preprocessing stage. Real-world documents are messy: scanned at odd angles, photographed with shadows, printed on colored paper. OpenCV cleans them up before OCR touches them.
Typical preprocessing steps:
```python
import cv2
import numpy as np

def preprocess_for_ocr(image_path: str) -> np.ndarray:
    """Prepare a scanned document for OCR extraction."""
    img = cv2.imread(image_path)

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Deskew (straighten rotated scans): estimate the angle from the
    # text pixels, so invert and threshold first -- otherwise nearly
    # every pixel of a white page would count as foreground.
    thresh = cv2.threshold(cv2.bitwise_not(gray), 0, 255,
                           cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.int32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = gray.shape
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(gray, M, (w, h),
                             flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)

    # Adaptive thresholding (handles uneven lighting)
    binary = cv2.adaptiveThreshold(
        rotated, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 31, 11
    )

    # Remove residual speckle noise
    denoised = cv2.fastNlMeansDenoising(binary, h=10)
    return denoised
```
What OpenCV does well for RAG:
- Straightens skewed scans (deskewing)
- Removes shadows and uneven lighting
- Increases contrast for faded text
- Crops regions of interest (e.g., isolating a table from a full page)
- Detects page boundaries in photographed documents
What OpenCV cannot do:
- Recognize text characters
- Understand document structure (headings, paragraphs, tables)
- Output text in any format
OCR Engines (Text Recognition)
OCR engines take a cleaned image and recognize the characters in it. They output raw text, often with bounding box coordinates.
Common OCR engines:
- Tesseract: The classic open-source OCR. Supports 100+ languages including Norwegian. Accuracy varies widely with image quality.
- PaddleOCR: Lightweight, fast, and more accurate than Tesseract on most benchmarks. Runs well on CPU.
- EasyOCR: Python-friendly, GPU-accelerated, good multilingual support.
Document Parsing Engines (Structural Intelligence)
Parsing engines go beyond character recognition. They understand that a block of bold text is a heading, that a grid of cells is a table, and that a numbered sequence is a list. They output structured formats — Markdown, JSON, HTML — that are directly useful for RAG chunking.
This is where the real value lives for RAG. Raw OCR text is a wall of characters. Parsed output preserves the document's semantic structure, which means better chunks, better embeddings, and better retrieval.
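To see the difference concretely, here is a tiny pure-Python illustration (the invoice snippet is made up): fixed-size chunking happily cuts a Markdown table mid-row, while splitting on the blank lines a parser preserves keeps each heading and table intact.

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Split every `size` characters, ignoring structure entirely."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def block_chunks(text: str) -> list[str]:
    """Split on blank lines so headings, paragraphs and tables stay whole."""
    return [b.strip() for b in text.split("\n\n") if b.strip()]

parsed = "## Invoice 1042\n\n| Item | Price |\n|---|---|\n| Hosting | 99 |"
print(naive_chunks(parsed, 20))  # table rows end up split across chunks
print(block_chunks(parsed))      # heading and table survive as units
```

Real chunkers add overlap and size caps on top, but the principle is the same: structure in, better chunks out.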
The Best Practice Workflow
For production RAG systems, the recommended pipeline has three stages:
```
┌──────────────┐    ┌──────────────┐    ┌──────────────────┐
│   OpenCV     │───▶│     OCR      │───▶│   Structural     │
│  (Prepare)   │    │  (Extract)   │    │    Parsing       │
│              │    │              │    │  (Structure)     │
│  Deskew,     │    │  Tesseract,  │    │ Unstructured.io, │
│  denoise,    │    │  PaddleOCR   │    │  LlamaParse,     │
│  threshold   │    │              │    │  Docling         │
└──────────────┘    └──────────────┘    └──────────────────┘
```
Some modern tools like Docling combine all three stages into a single pipeline. Others, like Mistral OCR, handle OCR and parsing together via API. The choice depends on your infrastructure constraints and accuracy requirements.
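Whichever tools you pick, the three stages compose as plain functions. Here is a minimal sketch with stand-in lambdas; in a real pipeline these would wrap OpenCV preprocessing, an OCR engine, and a structural parser.

```python
from typing import Callable

def make_pipeline(prepare: Callable, extract: Callable,
                  structure: Callable) -> Callable:
    """Chain prepare -> extract -> structure into one document processor."""
    def process(document):
        return structure(extract(prepare(document)))
    return process

# Stand-in stages for illustration only.
pipeline = make_pipeline(
    prepare=lambda img: img.strip(),           # e.g. cv2 deskew/denoise
    extract=lambda img: img.upper(),           # e.g. PaddleOCR
    structure=lambda text: f"# Document\n\n{text}",  # e.g. Docling/Marker
)
print(pipeline("  scanned page  "))
```

Keeping the stages behind one interface like this makes it easy to swap an engine later without touching the rest of the ingestion code.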
Best Self-Hosted Tools (Open Source)
Docling (IBM Research) — The Gold Standard
Docling is IBM Research's open-source document processing engine. It handles the full stack: OCR, layout analysis, table detection, and structural parsing. It outputs clean Markdown or JSON with accurate table preservation.
Key stats:
- 97.9% accuracy on standard benchmarks
- Handles PDFs, images, DOCX, PPTX, HTML
- Built-in table structure recognition
- Outputs Markdown ready for RAG chunking
```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("invoice.pdf")

# Get structured Markdown output
markdown = result.document.export_to_markdown()
print(markdown)
```
Pros:
- Highest accuracy among open-source tools
- No API costs — fully self-hosted
- Excellent table and layout recognition
- Active development by IBM Research
Cons:
- Resource-heavy: needs 1.5-2GB RAM minimum
- Slower than lightweight alternatives
- Initial model download is large (~500MB)
Marker — Fast PDF to Markdown
Marker is optimized for converting digital PDFs to clean Markdown. It excels at removing headers, footers, and watermarks while preserving content structure.
```bash
# Install and convert a PDF
pip install marker-pdf
marker_single input.pdf output/ --parallel_factor 1
```
Best for: Digital PDFs (not scanned documents). If your documents were born digital — exported from Word, generated by software — Marker is fast and accurate. For scanned documents, pair it with an OCR engine.
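To route documents automatically, you first need to know whether a PDF has a text layer at all. The byte-level check below is a rough heuristic of our own (not a Marker feature): born-digital PDFs declare `/Font` resources for their text, while pure image scans usually do not. It will misjudge PDFs with compressed object tables, so prefer a PDF library when you can.

```python
def looks_digital(pdf_bytes: bytes) -> bool:
    """Heuristic: a born-digital PDF declares /Font resources for its
    text layer; a scan is typically just /Image XObjects.
    Rough filter only -- fall back to real text extraction when unsure."""
    return b"/Font" in pdf_bytes

# Usage: route digital PDFs to Marker, scans to an OCR engine.
with_text = b"%PDF-1.7 /Resources << /Font << /F1 7 0 R >> >>"
image_only = b"%PDF-1.4 /XObject << /Im0 5 0 R >> /Subtype /Image"
print(looks_digital(with_text))   # True
print(looks_digital(image_only))  # False
```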
PaddleOCR — Lightweight and Fast
PaddleOCR from Baidu is the best option when you need OCR without the overhead of a full parsing engine. It consistently outperforms Tesseract in accuracy benchmarks while using fewer resources.
```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en')
result = ocr.ocr('document.png', cls=True)

for line in result[0]:
    bbox, (text, confidence) = line
    if confidence > 0.8:
        print(f"{text} ({confidence:.2f})")
```
Key advantages:
- Runs efficiently on CPU (no GPU required)
- <500MB RAM usage
- Supports 80+ languages including Norwegian
- Fast inference: processes a page in 1-3 seconds on CPU
Best for: Lightweight OCR-only needs. When you need text extraction without structural parsing, PaddleOCR delivers the best performance-to-resource ratio.
Best API-Based Tools
When using API-based OCR, your documents are sent to external servers — an important consideration for businesses handling sensitive data. Review the risks of third-party AI APIs before routing confidential documents through external services.
Mistral OCR — Top-Tier Multimodal
Mistral's OCR API is built on their multimodal model and delivers exceptional results for complex documents. It extracts tables, mathematical equations, and mixed-language content into clean Markdown.
```python
import requests

def extract_with_mistral(image_path: str, api_key: str) -> str:
    """Extract structured text using Mistral OCR API."""
    with open(image_path, "rb") as f:
        response = requests.post(
            "https://api.mistral.ai/v1/ocr",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={"model": "mistral-ocr-latest"}
        )
    return response.json()["text"]
```
Strengths:
- Excellent accuracy on complex layouts
- Handles tables, equations, and multi-column text
- Outputs structured Markdown
- Competitive pricing for API usage
Best for: Complex documents that stump self-hosted tools. Invoices with nested tables, technical papers with equations, or documents mixing Norwegian and English text.
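Whenever an external API sits in your ingestion path, transient failures and rate limits are a fact of life. A small generic retry helper with exponential backoff (our own sketch, not Mistral-specific) keeps a batch job from dying on a single failed request:

```python
import time

def with_retries(call, attempts: int = 4, base_delay: float = 1.0):
    """Run `call()`, retrying on exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the real error
            time.sleep(base_delay * (2 ** attempt))

# Usage with an OCR call like extract_with_mistral() from above:
# markdown = with_retries(lambda: extract_with_mistral("scan.png", api_key))
```

For production you would narrow the `except` clause to network and rate-limit errors so genuine bugs still fail fast.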
LlamaParse — Fastest Commercial Option
LlamaParse by LlamaIndex is the fastest commercial parsing API, processing most documents in approximately 6 seconds. It integrates directly with LlamaIndex for seamless RAG pipeline construction.
```python
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="your-api-key",
    result_type="markdown",
    language="en"
)

documents = parser.load_data("annual_report.pdf")
for doc in documents:
    print(doc.text[:500])
```
Strengths:
- Fast processing (~6 seconds per document)
- Excellent table structure preservation
- Native LlamaIndex integration
- Free tier available (1,000 pages/day)
Unstructured.io — Enterprise Standard
Unstructured.io is the industry standard for enterprise RAG document processing. It offers two modes: Hi-Res (maximum accuracy, slower) and Fast (quick processing, good accuracy).
```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="contract.pdf",
    strategy="hi_res",
    languages=["eng", "nor"]
)

for element in elements:
    print(f"[{element.category}] {element.text[:100]}")
```
Best for: Enterprise deployments where you need consistent, reliable parsing across thousands of document types. The self-hosted option requires significant resources, but the API is straightforward.
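The element categories that partition_pdf emits (Title, NarrativeText, Table, and so on) map naturally onto RAG chunks: start a new section at each Title and attach the following body text to it. A minimal sketch over plain (category, text) pairs, independent of the library itself:

```python
def elements_to_sections(elements: list[tuple[str, str]]) -> list[str]:
    """Group parsed elements into sections, one per Title element."""
    sections: list[str] = []
    current = ""
    for category, text in elements:
        if category == "Title":
            if current.strip():
                sections.append(current.strip())  # close previous section
            current = f"## {text}\n"
        else:
            current += text + "\n"
    if current.strip():
        sections.append(current.strip())
    return sections

print(elements_to_sections([
    ("Title", "Payment Terms"),
    ("NarrativeText", "Invoices are due within 30 days."),
    ("Title", "Liability"),
    ("NarrativeText", "Liability is capped at the contract value."),
]))
```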
Comparison Table: Tools on a 2GB RAM VPS
For Norwegian SMBs running RAG on budget infrastructure, resource usage is a primary concern. Here is how the main options compare on a 2GB RAM VPS:
| Tool | RAM Usage | Self-Hosted? | Best For | Notes |
|---|---|---|---|---|
| Docling | 1.5-2GB | Yes | Full-stack parsing with tables | Needs swap file; tight fit on 2GB |
| PaddleOCR | 300-500MB | Yes | Lightweight OCR extraction | Comfortable on 2GB; fast on CPU |
| Marker | 500MB-1GB | Yes | Digital PDF to Markdown | Good fit; not for scanned docs |
| Mistral OCR | N/A (API) | No | Complex docs, tables, equations | Pay per use; no local resources |
| LlamaParse | N/A (API) | No | Fast batch processing | Free tier available |
| Unstructured | 2GB+ (Hi-Res) | Partial | Enterprise-grade parsing | Too heavy for 2GB self-hosted |
Pro Tips for Norwegian SMBs on a Budget VPS
If you are running the budget RAG setup from our Qdrant guide, here is how to add document parsing without exceeding your 2GB RAM budget.
Strategy 1: Docling with Swap File (Primary Engine)
Docling fits on a 2GB VPS if you configure a swap file. This is the highest-accuracy self-hosted option.
```bash
# Create 2GB swap file (if not already configured)
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Set swappiness low to prefer RAM
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Install Docling
pip install docling

# Run with memory-conscious settings
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
```
Trade-off: Processing will be slower due to swap usage, but accuracy remains high. Expect 30-60 seconds per page on a single-core VPS.
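To keep the kernel's OOM killer away from Qdrant while Docling spikes, you can also refuse new parse jobs when memory runs low. A small Linux-only helper that reads MemAvailable from /proc/meminfo; the 600 MB floor is our assumption, tune it for your VPS:

```python
def mem_available_mb(meminfo_text: str) -> int:
    """Parse MemAvailable (reported in kB) out of /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found")

def can_start_parse_job(min_free_mb: int = 600) -> bool:
    """Gate heavy Docling jobs on currently available memory (Linux)."""
    with open("/proc/meminfo") as f:
        return mem_available_mb(f.read()) >= min_free_mb
```

Call `can_start_parse_job()` before each document and queue the work for later when it returns False.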
Strategy 2: Mistral OCR API as Fallback
For documents that are too complex or too large for local processing, offload to Mistral OCR. This hybrid approach keeps costs low while handling edge cases.
```python
import os

from docling.document_converter import DocumentConverter

def smart_parse(file_path: str, max_local_pages: int = 10) -> str:
    """Use local Docling for small docs, Mistral API for large ones."""
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
    # get_page_count() and extract_with_mistral_api() are helpers
    # you define elsewhere in your pipeline.
    if file_size_mb < 5 and get_page_count(file_path) <= max_local_pages:
        # Local processing with Docling
        converter = DocumentConverter()
        result = converter.convert(file_path)
        return result.document.export_to_markdown()
    else:
        # Offload to Mistral OCR API
        return extract_with_mistral_api(file_path)
```
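smart_parse assumes a get_page_count() helper. One dependency-free sketch is to count /Type /Page objects in the raw bytes; this heuristic misses PDFs that use compressed object streams, so prefer a library like pypdf when you can afford the dependency:

```python
import re

# Word boundary after "Page" excludes the /Type /Pages tree nodes.
_PAGE_RE = re.compile(rb"/Type\s*/Page\b")

def count_pages_in_bytes(data: bytes) -> int:
    """Count /Type /Page objects in raw PDF data (heuristic)."""
    return len(_PAGE_RE.findall(data))

def get_page_count(file_path: str) -> int:
    """Rough page count straight from the PDF bytes on disk."""
    with open(file_path, "rb") as f:
        return count_pages_in_bytes(f.read())
```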
Strategy 3: PaddleOCR for Lightweight Needs
If your documents are straightforward — single-column text, no complex tables — PaddleOCR uses a fraction of the resources and processes pages in seconds.
```python
from paddleocr import PaddleOCR

# Initialize once, reuse for all documents
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='en',
    use_gpu=False,
    cpu_threads=1,        # Single thread for 2GB VPS
    enable_mkldnn=False   # Disable MKL-DNN to save memory
)

def extract_text(image_path: str) -> str:
    """Extract text with minimal resource usage."""
    result = ocr.ocr(image_path, cls=True)
    lines = []
    for line in result[0]:
        text, confidence = line[1]
        if confidence > 0.75:
            lines.append(text)
    return "\n".join(lines)
```
Integrating with Your RAG Pipeline
Once you have structured text from your parsing engine, the next step is chunking and embedding for your vector database. Here is how the document parsing stage connects to the Qdrant setup from our budget RAG guide:
```python
from docling.document_converter import DocumentConverter

def document_to_rag_chunks(file_path: str, chunk_size: int = 500) -> list:
    """Parse a document and prepare chunks for vector storage."""
    # Step 1: Parse document to Markdown
    converter = DocumentConverter()
    result = converter.convert(file_path)
    markdown = result.document.export_to_markdown()

    # Step 2: Split by semantic boundaries (headings, paragraphs)
    sections = markdown.split("\n## ")
    chunks = []
    for section in sections:
        if len(section) > chunk_size:
            # Sub-chunk long sections by paragraph
            paragraphs = section.split("\n\n")
            current_chunk = ""
            for para in paragraphs:
                if len(current_chunk) + len(para) <= chunk_size:
                    current_chunk += para + "\n\n"
                else:
                    if current_chunk.strip():
                        chunks.append(current_chunk.strip())
                    current_chunk = para + "\n\n"
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
        else:
            if section.strip():
                chunks.append(section.strip())
    return chunks
```
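Before these chunks go into Qdrant, it is worth giving each one a deterministic ID derived from its content, so re-ingesting the same document overwrites points instead of duplicating them. UUIDv5 is one simple scheme; the namespace string below is arbitrary, pick any fixed value:

```python
import uuid

# Fixed namespace so IDs are stable across runs and machines.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "rag-chunks.example")

def chunk_id(source: str, chunk_text: str) -> str:
    """Deterministic point ID: same document + chunk -> same UUID."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{source}:{chunk_text}"))

# Usage: these double as Qdrant point IDs, making ingestion idempotent.
# points = [{"id": chunk_id("invoice.pdf", c), "payload": {"text": c}}
#           for c in document_to_rag_chunks("invoice.pdf")]
```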
Norwegian Language Considerations
When processing Norwegian documents, keep these points in mind:
- PaddleOCR supports Norwegian through its multilingual models. Set `lang='no'` or use the Latin script model.
- Docling handles Norwegian text natively since it processes at the document structure level.
- Tesseract requires the `nor` language pack: `sudo apt install tesseract-ocr-nor`.
- Mistral OCR handles Norwegian well given Mistral's multilingual training data.
- Norwegian characters (æ, ø, å) can cause issues with older OCR engines. PaddleOCR and Docling handle them reliably.
Echo's Recommendation
For Norwegian SMBs building RAG systems on budget infrastructure, we recommend a two-tier approach:
Self-hosted primary engine: PaddleOCR
- Fits comfortably on a 2GB VPS alongside Qdrant
- Fast enough for real-time document ingestion
- Handles Norwegian text accurately
- Low maintenance overhead
API fallback for complex documents: Mistral OCR
- Best-in-class accuracy for tables, equations, and complex layouts
- Pay only for what you use — but review the data privacy implications of sending documents to external APIs
- No local resource consumption
- Excellent multilingual support
When to upgrade to Docling:
- When you move to a 4GB+ VPS
- When table structure preservation is critical (invoices, financial reports)
- When you need fully offline processing with no API dependencies
This combination gives you reliable document parsing at minimal cost, feeding clean structured text into the Qdrant vector database from our budget RAG setup. For a deeper understanding of why RAG matters for your business, see our RAG explained guide.
Frequently Asked Questions
What is the difference between OCR and document parsing for RAG?
OCR (Optical Character Recognition) converts image pixels into raw text characters. Document parsing goes further by understanding the structure of the document -- identifying headings, tables, lists, and paragraphs -- and outputting organized formats like Markdown or JSON. For RAG systems, parsing is far more valuable because structured text produces better chunks and more accurate embeddings.
Can I run Docling alongside Qdrant on a 2GB VPS?
It is technically possible but tight. Docling uses 1.5-2GB RAM, and Qdrant uses 1-1.5GB. You need a swap file and must process documents sequentially, not in parallel. For a more comfortable setup, either upgrade to a 4GB VPS or use PaddleOCR (300-500MB RAM) as your primary engine alongside Qdrant, reserving Docling for a separate processing step.
Which OCR tool handles Norwegian text best?
PaddleOCR and Docling both handle Norwegian text reliably, including special characters. PaddleOCR supports Norwegian through its Latin script model and achieves 95-98% accuracy. Tesseract also supports Norwegian (install the nor language pack) but scores lower on accuracy benchmarks. For API-based options, Mistral OCR handles Norwegian well given its multilingual training data.
Is it worth paying for Mistral OCR or LlamaParse instead of self-hosting?
For most Norwegian SMBs, a hybrid approach works best. Use PaddleOCR locally for everyday documents (free, fast, low resource usage) and call Mistral OCR or LlamaParse only for complex documents with nested tables, multi-column layouts, or mixed-language content. This keeps costs under €10/month for API calls while maintaining high accuracy where it matters most.
Do I need OpenCV preprocessing for digital PDFs?
No. OpenCV preprocessing is only necessary for scanned documents or photographs of paper documents. If your PDFs were generated digitally (exported from Word, created by software), tools like Docling and Marker can parse them directly without any image preprocessing step. Skipping OpenCV for digital PDFs saves processing time and simplifies your pipeline.
Related Reading
- Budget RAG Setup: Qdrant on 2GB VPS for Norwegian SMBs — The companion guide to this post, covering vector database deployment.
- RAG Explained: A Business Guide — Understanding the business value of retrieval-augmented generation.
- Vector Database Comparison 2026 — Comparing Qdrant, Weaviate, Milvus, and Pinecone for Norwegian SMBs.
Need help building a document parsing pipeline for your Norwegian business? Contact Echo AlgoriData for hands-on implementation support.