Why Not Throw an LLM at Every PDF?
The initial instinct for many teams facing a messy document extraction problem is to reach for the most advanced AI model available. But as this real-world case from a senior AI/Data Engineer shows, the smartest engineering decision is often not to use AI at all.
When faced with 4,700 engineering drawings in mixed formats (text-based and scanned image PDFs), the team designed a two-stage hybrid system. The first stage, using deterministic Python code with PyMuPDF, handled 70-80% of the documents at near-zero cost. Only the remaining ambiguous or image-based files were sent to GPT-4 Vision.
The result? Processing time dropped from an estimated 4 weeks to just 45 minutes, API costs stayed under $70, and accuracy hit 96%. This isn't a story about a magic model — it's a masterclass in systems thinking.
Key principle: Start with the cheapest viable method. The LLM is a scalpel, not a sledgehammer.
Reference: Original article on Towards Data Science
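The routing principle can be sketched in a few lines. This is a minimal illustration, not the article's actual code: the two extractor functions are passed in as callables purely so the fallback logic stands on its own.

```python
from pathlib import Path
from typing import Callable, Optional

def extract_revision(pdf_path: Path,
                     cheap: Callable[[Path], Optional[str]],
                     expensive: Callable[[Path], Optional[str]]) -> Optional[str]:
    """Try the deterministic extractor first; only pay for the LLM
    when the cheap path returns nothing."""
    result = cheap(pdf_path)      # Stage 1: free, deterministic
    if result is not None:
        return result
    return expensive(pdf_path)    # Stage 2: paid, only for the hard cases
```

The LLM never sees a document the rules already handled, which is where the cost and runtime savings come from.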

The Hybrid Architecture in Detail
Stage 1: PyMuPDF (Deterministic, Zero Cost)
The core insight is to constrain the search space. Engineering drawings have a predictable layout — the title block is almost always in the bottom-right quadrant. The extraction logic uses spatial filtering and anchor-based scoring:
```python
from pathlib import Path
from typing import Optional

def extract_native_pymupdf(pdf_path: Path) -> Optional[RevResult]:
    """Try native PyMuPDF text extraction with spatial filtering."""
    try:
        best = process_pdf_native(
            pdf_path,
            brx=DEFAULT_BR_X,  # bottom-right X threshold
            bry=DEFAULT_BR_Y,  # bottom-right Y threshold
            blocklist=DEFAULT_REV_2L_BLOCKLIST,
            edge_margin=DEFAULT_EDGE_MARGIN
        )
        if best and best.value:
            value = _normalize_output_value(best.value)
            return RevResult(
                file=pdf_path.name,
                value=value,
                engine=f"pymupdf_{best.engine}",
                confidence="high" if best.score > 100 else "medium",
                notes=best.context_snippet
            )
        return None
    except Exception:
        return None
```
The blocklist filters out common false positives: section markers (SECTION C-C), grid reference letters (A, B, C along edges), and revision history tables. This simple heuristic cuts false matches to near zero.
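As a hypothetical illustration of the blocklist idea — the patterns and names below are my own, not the article's — candidate strings found in the bottom-right quadrant are simply discarded if they match known false-positive shapes:

```python
import re

# Illustrative false-positive patterns: section markers, single grid
# reference letters, and revision history table headers.
BLOCKLIST_PATTERNS = [
    re.compile(r"^SECTION\s+[A-Z]-[A-Z]$"),        # e.g. "SECTION C-C"
    re.compile(r"^[A-Z]$"),                         # grid letters along edges
    re.compile(r"^REVISION(S| HISTORY)?$", re.I),   # revision table headers
]

def filter_candidates(candidates: list[str]) -> list[str]:
    """Keep only candidates that match no blocklist pattern."""
    return [
        c for c in candidates
        if not any(p.match(c.strip()) for p in BLOCKLIST_PATTERNS)
    ]
```

Given `["SECTION C-C", "A", "2-0", "REVISIONS"]`, only `"2-0"` survives — the kind of filtering that keeps false matches near zero without any model call.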
Stage 2: GPT-4 Vision (For the Hard Cases)
When Stage 1 returns empty — either because the PDF is image-based or the layout is too ambiguous — the system renders the first page as a PNG and sends it to GPT-4 Vision via Azure OpenAI:
```python
import base64
from pathlib import Path
from typing import Tuple

import fitz  # PyMuPDF

def pdf_to_base64_image(self, pdf_path: Path, page_idx: int = 0,
                        dpi: int = 150) -> Tuple[str, int, bool]:
    """Convert PDF page to base64 PNG with smart rotation handling."""
    rotation, should_correct = detect_and_validate_rotation(pdf_path)
    with fitz.open(pdf_path) as doc:
        page = doc[page_idx]
        pix = page.get_pixmap(matrix=fitz.Matrix(dpi / 72, dpi / 72), alpha=False)
        if rotation != 0 and should_correct:
            img_bytes = correct_rotation(pix, rotation)
            return base64.b64encode(img_bytes).decode(), rotation, True
        else:
            return base64.b64encode(pix.tobytes("png")).decode(), rotation, False
```
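For context, the rendered page is typically attached to the chat request as a base64 data URL. A minimal sketch of that payload shape, assuming the standard chat-completions vision message format (the actual Azure OpenAI client call is omitted, and the function name is illustrative):

```python
def build_vision_message(b64_png: str, prompt: str) -> list[dict]:
    """Build the user message for a vision request: the extraction prompt
    plus the rendered page as an inline base64 PNG data URL."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64_png}"}},
        ],
    }]
```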
Key engineering decisions:
- 150 DPI was the sweet spot — higher resolutions bloated payloads without improving accuracy.
- Rotation correction uses a heuristic: if PyMuPDF can extract more than ten text blocks from the uncorrected page, orientation is fine.
- Prompt engineering includes explicit negative examples to prevent hallucination (e.g., avoiding revision history table values).
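The rotation heuristic above boils down to a small decision function. This is a sketch of the logic as I read it, with the metadata check as my own assumption; in real code the block count would come from PyMuPDF's `page.get_text("blocks")` on the uncorrected page:

```python
def should_correct_rotation(metadata_rotation: int, text_block_count: int,
                            min_blocks: int = 10) -> bool:
    """Correct rotation only when metadata reports a rotation AND the
    uncorrected page yields few text blocks (a sign it really is
    misoriented). Plenty of blocks means the orientation is fine."""
    if metadata_rotation == 0:
        return False
    return text_block_count <= min_blocks
```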
Production Results
| Metric | Hybrid (PyMuPDF + GPT-4) | GPT-4 Only |
|---|---|---|
| Accuracy (n=400) | 96% | 98% |
| Processing time (n=4,730) | ~45 minutes | ~100 minutes |
| API cost | ~$10-15 | ~$47 (all files) |
| Human review rate | ~5% | ~1% |
The 2% accuracy gap was an intentional trade-off for a 55-minute runtime reduction and bounded costs. For a data migration where engineers would spot-check anyway, 96% was perfectly acceptable.
Lesson: The 'right' accuracy target balances cost, latency, and downstream workflow requirements.

Limitations and Pitfalls You Must Know
No system is perfect. Here are the two classes of problems that only surfaced at scale:
- Rotation ambiguity: Engineering drawings are often stored in landscape orientation, but PDF metadata encoding varies wildly. Some files set `/Rotate` correctly; others physically rotate the content but leave metadata at zero. The heuristic solution (checking extracted text blocks) works but isn't 100% reliable.
- Prompt hallucination: The model would sometimes latch onto values from its own examples. If every example showed REV `2-0`, the model developed a bias toward outputting `2-0` even when the drawing clearly showed `A` or `3-0`. The fix required diverse examples and explicit anti-memorization warnings.
What this means for your projects:
- Always validate at scale, not on cherry-picked samples. The edge cases (rotation, revision history tables, grid references) only appeared in the full 4,700-document corpus.
- Treat prompt engineering as software engineering. Version your prompts, test them systematically, and include explicit negative cases.
- Newer models (GPT-5+) offered no meaningful lift for this task. When the task is spatially constrained pattern matching, the ceiling is the prompt and preprocessing, not the model's reasoning capability.
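The "treat prompt engineering as software engineering" advice can be made concrete with a small sketch: prompts live in version-controlled code, keyed by an explicit version name, and a unit test asserts the invariants that matter. All names and prompt text here are illustrative, not the article's:

```python
PROMPTS = {
    "rev_extract_v3": (
        "Extract the revision value from the title block only. "
        "Do NOT return values from the revision history table. "
        "Valid outputs look like: A, 3-0, 2-1. "
        "If no revision is visible, answer UNKNOWN."
    ),
}

def get_prompt(name: str) -> str:
    """Look up a pinned prompt version; a KeyError means a caller
    references a retired or renamed prompt."""
    return PROMPTS[name]

def test_prompt_has_negative_case() -> None:
    p = get_prompt("rev_extract_v3")
    assert "Do NOT" in p    # explicit negative instruction survives edits
    assert "UNKNOWN" in p   # fallback behaviour is specified
```

A test like this catches the failure mode described above: someone "simplifying" the prompt and silently deleting the negative examples that prevent hallucination.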

From Script to Production System
What started as a command-line tool evolved into a lightweight internal web application with a file upload interface, adopted by engineering teams at multiple sites. The core extraction logic — 600 lines of Python — remained identical.
Three takeaways for your next AI project:
- Start with the cheapest viable method. Deterministic extraction handled 70-80% of the corpus at zero cost. The LLM only added value because it was focused on the cases where rules fell short.
- Measure what matters to the stakeholder. Engineers didn't care whether the pipeline used PyMuPDF, GPT-4, or carrier pigeons. They cared that 4,700 drawings were processed in 45 minutes instead of four weeks, at $50-70 in API calls instead of £8,000+ in engineering time.
- Know where the model belongs in the system. Sometimes the highest-impact AI work isn't about using the most powerful model. It's about designing the system that keeps the model in its place.
Next Steps for Your Learning
If you're building similar systems, explore:
- Azure AI Document Intelligence for out-of-the-box form extraction
- LayoutLMv3 for layout-aware document understanding
- Prompt versioning with tools like LangChain for systematic prompt management
For more on production AI systems, check out our analyses of React Conf 2025: Key Takeaways on the Compiler, React 19.2, and the Future of Native and Beyond Prototypes: How Vercel's New v0 Brings AI Coding to Production.