OCR & Document Parsing for RAG: OpenCV vs Docling vs Mistral OCR

If you are building a RAG pipeline for your Norwegian SMB, you will eventually face a critical question: how do you get structured text out of PDFs, scanned invoices, and printed contracts? The answer is not as simple as "just use OCR." The tooling landscape in 2026 spans from pixel-level computer vision to full document intelligence engines, and choosing the wrong tool wastes both time and server resources.
This guide breaks down the key differences, compares the best open-source and API-based options, and recommends a practical stack that runs on a budget 2GB VPS — the same setup we covered in our Qdrant deployment guide.
The Fundamental Difference: Computer Vision vs Document Intelligence
Before comparing tools, it is essential to understand that OpenCV and OCR/parsing engines solve fundamentally different problems.
OpenCV is a computer vision library. It operates on pixels. It can crop, rotate, sharpen, threshold, and filter images. It has no concept of "text" or "tables" — it sees matrices of numbers. OpenCV is powerful for image preprocessing, but it cannot read a single word on its own.
OCR and document parsing engines are document intelligence tools. They recognize characters, understand layout structure, identify tables and headings, and output structured text in formats like Markdown or JSON. They are purpose-built for extracting meaning from documents.
The key insight: These are not competitors. They are complementary. The best RAG pipelines use both — OpenCV for preparation, OCR for extraction, and parsing for structuring.
How Each Tool Category Fits into a RAG Pipeline
Computer Vision (OpenCV)
OpenCV handles the preprocessing stage. Real-world documents are messy: scanned at odd angles, photographed with shadows, printed on colored paper. OpenCV cleans them up before OCR touches them.
Typical preprocessing steps:
```python
import cv2
import numpy as np

def preprocess_for_ocr(image_path: str) -> np.ndarray:
    """Prepare a scanned document for OCR extraction."""
    img = cv2.imread(image_path)

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Deskew (straighten rotated scans): estimate the angle from the
    # text pixels, so invert and threshold first -- otherwise nearly
    # every pixel of a white page would count as foreground.
    thresh = cv2.threshold(cv2.bitwise_not(gray), 0, 255,
                           cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.int32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = gray.shape
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(gray, M, (w, h),
                             flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)

    # Adaptive thresholding (handles uneven lighting)
    binary = cv2.adaptiveThreshold(
        rotated, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 31, 11
    )

    # Remove residual speckle noise
    denoised = cv2.fastNlMeansDenoising(binary, h=10)
    return denoised
```
What OpenCV does well for RAG:
- Straightens skewed scans (deskewing)
- Removes shadows and uneven lighting
- Increases contrast for faded text
- Crops regions of interest (e.g., isolating a table from a full page)
- Detects page boundaries in photographed documents
What OpenCV cannot do:
- Recognize text characters
- Understand document structure (headings, paragraphs, tables)
- Output text in any format
OCR Engines (Text Recognition)
OCR engines take a cleaned image and recognize the characters in it. They output raw text, often with bounding box coordinates.
Common OCR engines:
- Tesseract: The classic open-source OCR. Supports 100+ languages including Norwegian. Accuracy varies widely with image quality.
- PaddleOCR: Lightweight, fast, and more accurate than Tesseract on most benchmarks. Runs well on CPU.
- EasyOCR: Python-friendly, GPU-accelerated, good multilingual support.
Document Parsing Engines (Structural Intelligence)
Parsing engines go beyond character recognition. They understand that a block of bold text is a heading, that a grid of cells is a table, and that a numbered sequence is a list. They output structured formats — Markdown, JSON, HTML — that are directly useful for RAG chunking.
This is where the real value lives for RAG. Raw OCR text is a wall of characters. Parsed output preserves the document's semantic structure, which means better chunks, better embeddings, and better retrieval.
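To see the difference concretely, here is a tiny pure-Python illustration (the invoice snippet is made up): fixed-size chunking happily cuts a Markdown table mid-row, while splitting on the blank lines a parser preserves keeps each heading and table intact.

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Split every `size` characters, ignoring structure entirely."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def block_chunks(text: str) -> list[str]:
    """Split on blank lines so headings, paragraphs and tables stay whole."""
    return [b.strip() for b in text.split("\n\n") if b.strip()]

parsed = "## Invoice 1042\n\n| Item | Price |\n|---|---|\n| Hosting | 99 |"
print(naive_chunks(parsed, 20))  # table rows end up split across chunks
print(block_chunks(parsed))      # heading and table survive as units
```

Real chunkers add overlap and size caps on top, but the principle is the same: structure in, better chunks out.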
The Best Practice Workflow
For production RAG systems, the recommended pipeline has three stages:
```
┌──────────────┐    ┌──────────────┐    ┌──────────────────┐
│   OpenCV     │───▶│     OCR      │───▶│   Structural     │
│  (Prepare)   │    │  (Extract)   │    │    Parsing       │
│              │    │              │    │  (Structure)     │
│  Deskew,     │    │  Tesseract,  │    │ Unstructured.io, │
│  denoise,    │    │  PaddleOCR   │    │  LlamaParse,     │
│  threshold   │    │              │    │  Docling         │
└──────────────┘    └──────────────┘    └──────────────────┘
```
Some modern tools like Docling combine all three stages into a single pipeline. Others, like Mistral OCR, handle OCR and parsing together via API. The choice depends on your infrastructure constraints and accuracy requirements.
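Whichever tools you pick, the three stages compose as plain functions. Here is a minimal sketch with stand-in lambdas; in a real pipeline these would wrap OpenCV preprocessing, an OCR engine, and a structural parser.

```python
from typing import Callable

def make_pipeline(prepare: Callable, extract: Callable,
                  structure: Callable) -> Callable:
    """Chain prepare -> extract -> structure into one document processor."""
    def process(document):
        return structure(extract(prepare(document)))
    return process

# Stand-in stages for illustration only.
pipeline = make_pipeline(
    prepare=lambda img: img.strip(),           # e.g. cv2 deskew/denoise
    extract=lambda img: img.upper(),           # e.g. PaddleOCR
    structure=lambda text: f"# Document\n\n{text}",  # e.g. Docling/Marker
)
print(pipeline("  scanned page  "))
```

Keeping the stages behind one interface like this makes it easy to swap an engine later without touching the rest of the ingestion code.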
Best Self-Hosted Tools (Open Source)
Docling (IBM Research) — The Gold Standard
Docling is IBM Research's open-source document processing engine. It handles the full stack: OCR, layout analysis, table detection, and structural parsing. It outputs clean Markdown or JSON with accurate table preservation.
Key stats:
- 97.9% accuracy on standard benchmarks
- Handles PDFs, images, DOCX, PPTX, HTML
- Built-in table structure recognition
- Outputs Markdown ready for RAG chunking
```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("invoice.pdf")

# Get structured Markdown output
markdown = result.document.export_to_markdown()
print(markdown)
```
Pros:
- Highest accuracy among open-source tools
- No API costs — fully self-hosted
- Excellent table and layout recognition
- Active development by IBM Research
Cons:
- Resource-heavy: needs 1.5-2GB RAM minimum
- Slower than lightweight alternatives
- Initial model download is large (~500MB)
Marker — Fast PDF to Markdown
Marker is optimized for converting digital PDFs to clean Markdown. It excels at removing headers, footers, and watermarks while preserving content structure.
```bash
# Install and convert a PDF
pip install marker-pdf
marker_single input.pdf output/ --parallel_factor 1
```
Best for: Digital PDFs (not scanned documents). If your documents were born digital — exported from Word, generated by software — Marker is fast and accurate. For scanned documents, pair it with an OCR engine.
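To route documents automatically, you first need to know whether a PDF has a text layer at all. The byte-level check below is a rough heuristic of our own (not a Marker feature): born-digital PDFs declare `/Font` resources for their text, while pure image scans usually do not. It will misjudge PDFs with compressed object tables, so prefer a PDF library when you can.

```python
def looks_digital(pdf_bytes: bytes) -> bool:
    """Heuristic: a born-digital PDF declares /Font resources for its
    text layer; a scan is typically just /Image XObjects.
    Rough filter only -- fall back to real text extraction when unsure."""
    return b"/Font" in pdf_bytes

# Usage: route digital PDFs to Marker, scans to an OCR engine.
with_text = b"%PDF-1.7 /Resources << /Font << /F1 7 0 R >> >>"
image_only = b"%PDF-1.4 /XObject << /Im0 5 0 R >> /Subtype /Image"
print(looks_digital(with_text))   # True
print(looks_digital(image_only))  # False
```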
PaddleOCR — Lightweight and Fast
PaddleOCR from Baidu is the best option when you need OCR without the overhead of a full parsing engine. It consistently outperforms Tesseract in accuracy benchmarks while using fewer resources.
```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en')
result = ocr.ocr('document.png', cls=True)

for line in result[0]:
    bbox, (text, confidence) = line
    if confidence > 0.8:
        print(f"{text} ({confidence:.2f})")
```
Key advantages:
- Runs efficiently on CPU (no GPU required)
- <500MB RAM usage
- Supports 80+ languages including Norwegian
- Fast inference: processes a page in 1-3 seconds on CPU
Best for: Lightweight OCR-only needs. When you need text extraction without structural parsing, PaddleOCR delivers the best performance-to-resource ratio.
Best API-Based Tools
When using API-based OCR, your documents are sent to external servers — an important consideration for businesses handling sensitive data. Review the risks of third-party AI APIs before routing confidential documents through external services.
Mistral OCR — Top-Tier Multimodal
Mistral's OCR API is built on their multimodal model and delivers exceptional results for complex documents. It extracts tables, mathematical equations, and mixed-language content into clean Markdown.
```python
import requests

def extract_with_mistral(image_path: str, api_key: str) -> str:
    """Extract structured text using Mistral OCR API."""
    with open(image_path, "rb") as f:
        response = requests.post(
            "https://api.mistral.ai/v1/ocr",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={"model": "mistral-ocr-latest"}
        )
    return response.json()["text"]
```
Strengths:
- Excellent accuracy on complex layouts
- Handles tables, equations, and multi-column text
- Outputs structured Markdown
- Competitive pricing for API usage
Best for: Complex documents that stump self-hosted tools. Invoices with nested tables, technical papers with equations, or documents mixing Norwegian and English text.
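Whenever an external API sits in your ingestion path, transient failures and rate limits are a fact of life. A small generic retry helper with exponential backoff (our own sketch, not Mistral-specific) keeps a batch job from dying on a single failed request:

```python
import time

def with_retries(call, attempts: int = 4, base_delay: float = 1.0):
    """Run `call()`, retrying on exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the real error
            time.sleep(base_delay * (2 ** attempt))

# Usage with an OCR call like extract_with_mistral() from above:
# markdown = with_retries(lambda: extract_with_mistral("scan.png", api_key))
```

For production you would narrow the `except` clause to network and rate-limit errors so genuine bugs still fail fast.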
LlamaParse — Fastest Commercial Option
LlamaParse by LlamaIndex is the fastest commercial parsing API, processing most documents in approximately 6 seconds. It integrates directly with LlamaIndex for seamless RAG pipeline construction.
```python
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="your-api-key",
    result_type="markdown",
    language="en"
)

documents = parser.load_data("annual_report.pdf")
for doc in documents:
    print(doc.text[:500])
```
Strengths:
- Fast processing (~6 seconds per document)
- Excellent table structure preservation
- Native LlamaIndex integration
- Free tier available (1,000 pages/day)
Unstructured.io — Enterprise Standard
Unstructured.io is the industry standard for enterprise RAG document processing. It offers two modes: Hi-Res (maximum accuracy, slower) and Fast (quick processing, good accuracy).
```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="contract.pdf",
    strategy="hi_res",
    languages=["eng", "nor"]
)

for element in elements:
    print(f"[{element.category}] {element.text[:100]}")
```
Best for: Enterprise deployments where you need consistent, reliable parsing across thousands of document types. The self-hosted option requires significant resources, but the API is straightforward.
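The element categories that partition_pdf emits (Title, NarrativeText, Table, and so on) map naturally onto RAG chunks: start a new section at each Title and attach the following body text to it. A minimal sketch over plain (category, text) pairs, independent of the library itself:

```python
def elements_to_sections(elements: list[tuple[str, str]]) -> list[str]:
    """Group parsed elements into sections, one per Title element."""
    sections: list[str] = []
    current = ""
    for category, text in elements:
        if category == "Title":
            if current.strip():
                sections.append(current.strip())  # close previous section
            current = f"## {text}\n"
        else:
            current += text + "\n"
    if current.strip():
        sections.append(current.strip())
    return sections

print(elements_to_sections([
    ("Title", "Payment Terms"),
    ("NarrativeText", "Invoices are due within 30 days."),
    ("Title", "Liability"),
    ("NarrativeText", "Liability is capped at the contract value."),
]))
```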
Comparison Table: Tools on a 2GB RAM VPS
For Norwegian SMBs running RAG on budget infrastructure, resource usage is a primary concern. Here is how the main options compare on a 2GB RAM VPS:
| Tool | RAM Usage | Self-Hosted? | Best For | Notes |
|---|---|---|---|---|
| Docling | 1.5-2GB | Yes | Full-stack parsing with tables | Needs swap file; tight fit on 2GB |
| PaddleOCR | 300-500MB | Yes | Lightweight OCR extraction | Comfortable on 2GB; fast on CPU |
| Marker | 500MB-1GB | Yes | Digital PDF to Markdown | Good fit; not for scanned docs |
| Mistral OCR | N/A (API) | No | Complex docs, tables, equations | Pay per use; no local resources |
| LlamaParse | N/A (API) | No | Fast batch processing | Free tier available |
| Unstructured | 2GB+ (Hi-Res) | Partial | Enterprise-grade parsing | Too heavy for 2GB self-hosted |
Pro Tips for Norwegian SMBs on a Budget VPS
If you are running the budget RAG setup from our Qdrant guide, here is how to add document parsing without exceeding your 2GB RAM budget.
Strategy 1: Docling with Swap File (Primary Engine)
Docling fits on a 2GB VPS if you configure a swap file. This is the highest-accuracy self-hosted option.
```bash
# Create 2GB swap file (if not already configured)
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Set swappiness low to prefer RAM
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Install Docling
pip install docling

# Run with memory-conscious settings
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
```
Trade-off: Processing will be slower due to swap usage, but accuracy remains high. Expect 30-60 seconds per page on a single-core VPS.
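To keep the kernel's OOM killer away from Qdrant while Docling spikes, you can also refuse new parse jobs when memory runs low. A small Linux-only helper that reads MemAvailable from /proc/meminfo; the 600 MB floor is our assumption, tune it for your VPS:

```python
def mem_available_mb(meminfo_text: str) -> int:
    """Parse MemAvailable (reported in kB) out of /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found")

def can_start_parse_job(min_free_mb: int = 600) -> bool:
    """Gate heavy Docling jobs on currently available memory (Linux)."""
    with open("/proc/meminfo") as f:
        return mem_available_mb(f.read()) >= min_free_mb
```

Call `can_start_parse_job()` before each document and queue the work for later when it returns False.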
Strategy 2: Mistral OCR API as Fallback
For documents that are too complex or too large for local processing, offload to Mistral OCR. This hybrid approach keeps costs low while handling edge cases.
```python
import os

from docling.document_converter import DocumentConverter

def smart_parse(file_path: str, max_local_pages: int = 10) -> str:
    """Use local Docling for small docs, Mistral API for large ones."""
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
    # get_page_count() and extract_with_mistral_api() are helpers
    # you define elsewhere in your pipeline.
    if file_size_mb < 5 and get_page_count(file_path) <= max_local_pages:
        # Local processing with Docling
        converter = DocumentConverter()
        result = converter.convert(file_path)
        return result.document.export_to_markdown()
    else:
        # Offload to Mistral OCR API
        return extract_with_mistral_api(file_path)
```
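smart_parse assumes a get_page_count() helper. One dependency-free sketch is to count /Type /Page objects in the raw bytes; this heuristic misses PDFs that use compressed object streams, so prefer a library like pypdf when you can afford the dependency:

```python
import re

# Word boundary after "Page" excludes the /Type /Pages tree nodes.
_PAGE_RE = re.compile(rb"/Type\s*/Page\b")

def count_pages_in_bytes(data: bytes) -> int:
    """Count /Type /Page objects in raw PDF data (heuristic)."""
    return len(_PAGE_RE.findall(data))

def get_page_count(file_path: str) -> int:
    """Rough page count straight from the PDF bytes on disk."""
    with open(file_path, "rb") as f:
        return count_pages_in_bytes(f.read())
```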
Strategy 3: PaddleOCR for Lightweight Needs
If your documents are straightforward — single-column text, no complex tables — PaddleOCR uses a fraction of the resources and processes pages in seconds.
```python
from paddleocr import PaddleOCR

# Initialize once, reuse for all documents
ocr = PaddleOCR(
    use_angle_cls=True,
    lang='en',
    use_gpu=False,
    cpu_threads=1,        # Single thread for 2GB VPS
    enable_mkldnn=False   # Disable MKL-DNN to save memory
)

def extract_text(image_path: str) -> str:
    """Extract text with minimal resource usage."""
    result = ocr.ocr(image_path, cls=True)
    lines = []
    for line in result[0]:
        text, confidence = line[1]
        if confidence > 0.75:
            lines.append(text)
    return "\n".join(lines)
```
Integrating with Your RAG Pipeline
Once you have structured text from your parsing engine, the next step is chunking and embedding for your vector database. Here is how the document parsing stage connects to the Qdrant setup from our budget RAG guide:
```python
from docling.document_converter import DocumentConverter

def document_to_rag_chunks(file_path: str, chunk_size: int = 500) -> list:
    """Parse a document and prepare chunks for vector storage."""
    # Step 1: Parse document to Markdown
    converter = DocumentConverter()
    result = converter.convert(file_path)
    markdown = result.document.export_to_markdown()

    # Step 2: Split by semantic boundaries (headings, paragraphs)
    sections = markdown.split("\n## ")
    chunks = []
    for section in sections:
        if len(section) > chunk_size:
            # Sub-chunk long sections by paragraph
            paragraphs = section.split("\n\n")
            current_chunk = ""
            for para in paragraphs:
                if len(current_chunk) + len(para) <= chunk_size:
                    current_chunk += para + "\n\n"
                else:
                    if current_chunk.strip():
                        chunks.append(current_chunk.strip())
                    current_chunk = para + "\n\n"
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
        else:
            if section.strip():
                chunks.append(section.strip())
    return chunks
```
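Before these chunks go into Qdrant, it is worth giving each one a deterministic ID derived from its content, so re-ingesting the same document overwrites points instead of duplicating them. UUIDv5 is one simple scheme; the namespace string below is arbitrary, pick any fixed value:

```python
import uuid

# Fixed namespace so IDs are stable across runs and machines.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "rag-chunks.example")

def chunk_id(source: str, chunk_text: str) -> str:
    """Deterministic point ID: same document + chunk -> same UUID."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{source}:{chunk_text}"))

# Usage: these double as Qdrant point IDs, making ingestion idempotent.
# points = [{"id": chunk_id("invoice.pdf", c), "payload": {"text": c}}
#           for c in document_to_rag_chunks("invoice.pdf")]
```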
Norwegian Language Considerations
When processing Norwegian documents, keep these points in mind:
- PaddleOCR supports Norwegian through its multilingual models. Set `lang='no'` or use the Latin script model.
- Docling handles Norwegian text natively since it processes at the document structure level.
- Tesseract requires the `nor` language pack: `sudo apt install tesseract-ocr-nor`.
- Mistral OCR handles Norwegian well given Mistral's multilingual training data.
- Norwegian characters (æ, ø, å) can cause issues with older OCR engines. PaddleOCR and Docling handle them reliably.
Echo's Recommendation
For Norwegian SMBs building RAG systems on budget infrastructure, we recommend a two-tier approach:
Self-hosted primary engine: PaddleOCR
- Fits comfortably on a 2GB VPS alongside Qdrant
- Fast enough for real-time document ingestion
- Handles Norwegian text accurately
- Low maintenance overhead
API fallback for complex documents: Mistral OCR
- Best-in-class accuracy for tables, equations, and complex layouts
- Pay only for what you use — but review the data privacy implications of sending documents to external APIs
- No local resource consumption
- Excellent multilingual support
When to upgrade to Docling:
- When you move to a 4GB+ VPS
- When table structure preservation is critical (invoices, financial reports)
- When you need fully offline processing with no API dependencies
This combination gives you reliable document parsing at minimal cost, feeding clean structured text into the Qdrant vector database from our budget RAG setup. For a deeper understanding of why RAG matters for your business, see our RAG explained guide.
Frequently Asked Questions
What is the difference between OCR and document parsing for RAG?
OCR (Optical Character Recognition) converts image pixels into raw text characters. Document parsing goes further by understanding the structure of the document -- identifying headings, tables, lists, and paragraphs -- and outputting organized formats like Markdown or JSON. For RAG systems, parsing is far more valuable because structured text produces better chunks and more accurate embeddings.
Can I run Docling alongside Qdrant on a 2GB VPS?
It is technically possible but tight. Docling uses 1.5-2GB RAM, and Qdrant uses 1-1.5GB. You need a swap file and must process documents sequentially, not in parallel. For a more comfortable setup, either upgrade to a 4GB VPS or use PaddleOCR (300-500MB RAM) as your primary engine alongside Qdrant, reserving Docling for a separate processing step.
Which OCR tool handles Norwegian text best?
PaddleOCR and Docling both handle Norwegian text reliably, including special characters. PaddleOCR supports Norwegian through its Latin script model and achieves 95-98% accuracy. Tesseract also supports Norwegian (install the nor language pack) but scores lower on accuracy benchmarks. For API-based options, Mistral OCR handles Norwegian well given its multilingual training data.
Is it worth paying for Mistral OCR or LlamaParse instead of self-hosting?
For most Norwegian SMBs, a hybrid approach works best. Use PaddleOCR locally for everyday documents (free, fast, low resource usage) and call Mistral OCR or LlamaParse only for complex documents with nested tables, multi-column layouts, or mixed-language content. This keeps costs under €10/month for API calls while maintaining high accuracy where it matters most.
Do I need OpenCV preprocessing for digital PDFs?
No. OpenCV preprocessing is only necessary for scanned documents or photographs of paper documents. If your PDFs were generated digitally (exported from Word, created by software), tools like Docling and Marker can parse them directly without any image preprocessing step. Skipping OpenCV for digital PDFs saves processing time and simplifies your pipeline.
Related Reading
- Budget RAG Setup: Qdrant on 2GB VPS for Norwegian SMBs — The companion guide to this post, covering vector database deployment.
- RAG Explained: A Business Guide — Understanding the business value of retrieval-augmented generation.
- Vector Database Comparison 2026 — Comparing Qdrant, Weaviate, Milvus, and Pinecone for Norwegian SMBs.
Need help building a document parsing pipeline for your Norwegian business? Contact Echo AlgoriData for hands-on implementation support.