Why Not Throw an LLM at Every PDF?
The initial instinct for many teams facing a messy document extraction problem is to reach for the most advanced AI model available. But as this real-world case from a senior AI/Data Engineer shows, the smartest engineering decision is often not to use AI at all.
When faced with 4,700 engineering drawings in mixed formats (text-based and scanned image PDFs), the team designed a two-stage hybrid system. The first stage, using deterministic Python code with PyMuPDF, handled 70-80% of the documents at near-zero cost. Only the remaining ambiguous or image-based files were sent to GPT-4 Vision.
The result? Processing time dropped from an estimated 4 weeks to just 45 minutes, API costs stayed under $70, and accuracy hit 96%. This isn't a story about a magic model — it's a masterclass in systems thinking.
Key principle: Start with the cheapest viable method. The LLM is a scalpel, not a sledgehammer.
Reference: Original article on Towards Data Science
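The routing principle can be sketched in a few lines. This is a minimal illustration, not the article's actual code: the two extractor functions are passed in as callables purely so the fallback logic stands on its own.

```python
from pathlib import Path
from typing import Callable, Optional

def extract_revision(pdf_path: Path,
                     cheap: Callable[[Path], Optional[str]],
                     expensive: Callable[[Path], Optional[str]]) -> Optional[str]:
    """Try the deterministic extractor first; only pay for the LLM
    when the cheap path returns nothing."""
    result = cheap(pdf_path)      # Stage 1: free, deterministic
    if result is not None:
        return result
    return expensive(pdf_path)    # Stage 2: paid, only for the hard cases
```

The LLM never sees a document the rules already handled, which is where the cost and runtime savings come from.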

The Hybrid Architecture in Detail
Stage 1: PyMuPDF (Deterministic, Zero Cost)
The core insight is to constrain the search space. Engineering drawings have a predictable layout — the title block is almost always in the bottom-right quadrant. The extraction logic uses spatial filtering and anchor-based scoring:
```python
from pathlib import Path
from typing import Optional

def extract_native_pymupdf(pdf_path: Path) -> Optional[RevResult]:
    """Try native PyMuPDF text extraction with spatial filtering."""
    try:
        best = process_pdf_native(
            pdf_path,
            brx=DEFAULT_BR_X,  # bottom-right X threshold
            bry=DEFAULT_BR_Y,  # bottom-right Y threshold
            blocklist=DEFAULT_REV_2L_BLOCKLIST,
            edge_margin=DEFAULT_EDGE_MARGIN
        )
        if best and best.value:
            value = _normalize_output_value(best.value)
            return RevResult(
                file=pdf_path.name,
                value=value,
                engine=f"pymupdf_{best.engine}",
                confidence="high" if best.score > 100 else "medium",
                notes=best.context_snippet
            )
        return None
    except Exception:
        return None
```
The blocklist filters out common false positives: section markers (SECTION C-C), grid reference letters (A, B, C along edges), and revision history tables. This simple heuristic cuts false matches to near zero.
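As a hypothetical illustration of the blocklist idea — the patterns and names below are my own, not the article's — candidate strings found in the bottom-right quadrant are simply discarded if they match known false-positive shapes:

```python
import re

# Illustrative false-positive patterns: section markers, single grid
# reference letters, and revision history table headers.
BLOCKLIST_PATTERNS = [
    re.compile(r"^SECTION\s+[A-Z]-[A-Z]$"),        # e.g. "SECTION C-C"
    re.compile(r"^[A-Z]$"),                         # grid letters along edges
    re.compile(r"^REVISION(S| HISTORY)?$", re.I),   # revision table headers
]

def filter_candidates(candidates: list[str]) -> list[str]:
    """Keep only candidates that match no blocklist pattern."""
    return [
        c for c in candidates
        if not any(p.match(c.strip()) for p in BLOCKLIST_PATTERNS)
    ]
```

Given `["SECTION C-C", "A", "2-0", "REVISIONS"]`, only `"2-0"` survives — the kind of filtering that keeps false matches near zero without any model call.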
Stage 2: GPT-4 Vision (For the Hard Cases)
When Stage 1 returns empty — either because the PDF is image-based or the layout is too ambiguous — the system renders the first page as a PNG and sends it to GPT-4 Vision via Azure OpenAI:
```python
import base64
from pathlib import Path
from typing import Tuple

import fitz  # PyMuPDF

def pdf_to_base64_image(self, pdf_path: Path, page_idx: int = 0,
                        dpi: int = 150) -> Tuple[str, int, bool]:
    """Convert PDF page to base64 PNG with smart rotation handling."""
    rotation, should_correct = detect_and_validate_rotation(pdf_path)
    with fitz.open(pdf_path) as doc:
        page = doc[page_idx]
        pix = page.get_pixmap(matrix=fitz.Matrix(dpi / 72, dpi / 72), alpha=False)
        if rotation != 0 and should_correct:
            img_bytes = correct_rotation(pix, rotation)
            return base64.b64encode(img_bytes).decode(), rotation, True
        else:
            return base64.b64encode(pix.tobytes("png")).decode(), rotation, False
```
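For context, the rendered page is typically attached to the chat request as a base64 data URL. A minimal sketch of that payload shape, assuming the standard chat-completions vision message format (the actual Azure OpenAI client call is omitted, and the function name is illustrative):

```python
def build_vision_message(b64_png: str, prompt: str) -> list[dict]:
    """Build the user message for a vision request: the extraction prompt
    plus the rendered page as an inline base64 PNG data URL."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64_png}"}},
        ],
    }]
```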
Key engineering decisions:
- 150 DPI was the sweet spot — higher resolutions bloated payloads without improving accuracy.
- Rotation correction uses a heuristic: if PyMuPDF can extract more than ten text blocks from the uncorrected page, orientation is fine.
- Prompt engineering includes explicit negative examples to prevent hallucination (e.g., avoiding revision history table values).
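The rotation heuristic above boils down to a small decision function. This is a sketch of the logic as I read it, with the metadata check as my own assumption; in real code the block count would come from PyMuPDF's `page.get_text("blocks")` on the uncorrected page:

```python
def should_correct_rotation(metadata_rotation: int, text_block_count: int,
                            min_blocks: int = 10) -> bool:
    """Correct rotation only when metadata reports a rotation AND the
    uncorrected page yields few text blocks (a sign it really is
    misoriented). Plenty of blocks means the orientation is fine."""
    if metadata_rotation == 0:
        return False
    return text_block_count <= min_blocks
```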
Production Results
| Metric | Hybrid (PyMuPDF + GPT-4) | GPT-4 Only |
|---|---|---|
| Accuracy (n=400) | 96% | 98% |
| Processing time (n=4,730) | ~45 minutes | ~100 minutes |
| API cost | ~$10-15 | ~$47 (all files) |
| Human review rate | ~5% | ~1% |
The 2% accuracy gap was an intentional trade-off for a 55-minute runtime reduction and bounded costs. For a data migration where engineers would spot-check anyway, 96% was perfectly acceptable.
Lesson: The 'right' accuracy target balances cost, latency, and downstream workflow requirements.

Limitations and Pitfalls You Must Know
No system is perfect. Here are the two classes of problems that only surfaced at scale:
- Rotation ambiguity: Engineering drawings are often stored in landscape orientation, but PDF metadata encoding varies wildly. Some files set `/Rotate` correctly; others physically rotate the content but leave metadata at zero. The heuristic solution (checking extracted text blocks) works but isn't 100% reliable.
- Prompt hallucination: The model would sometimes latch onto values from its own examples. If every example showed REV `2-0`, the model developed a bias toward outputting `2-0` even when the drawing clearly showed `A` or `3-0`. The fix required diverse examples and explicit anti-memorization warnings.
What this means for your projects:
- Always validate at scale, not on cherry-picked samples. The edge cases (rotation, revision history tables, grid references) only appeared in the full 4,700-document corpus.
- Treat prompt engineering as software engineering. Version your prompts, test them systematically, and include explicit negative cases.
- Newer models (GPT-5+) offered no meaningful lift for this task. When the task is spatially constrained pattern matching, the ceiling is the prompt and preprocessing, not the model's reasoning capability.
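The "treat prompt engineering as software engineering" advice can be made concrete with a small sketch: prompts live in version-controlled code, keyed by an explicit version name, and a unit test asserts the invariants that matter. All names and prompt text here are illustrative, not the article's:

```python
PROMPTS = {
    "rev_extract_v3": (
        "Extract the revision value from the title block only. "
        "Do NOT return values from the revision history table. "
        "Valid outputs look like: A, 3-0, 2-1. "
        "If no revision is visible, answer UNKNOWN."
    ),
}

def get_prompt(name: str) -> str:
    """Look up a pinned prompt version; a KeyError means a caller
    references a retired or renamed prompt."""
    return PROMPTS[name]

def test_prompt_has_negative_case() -> None:
    p = get_prompt("rev_extract_v3")
    assert "Do NOT" in p    # explicit negative instruction survives edits
    assert "UNKNOWN" in p   # fallback behaviour is specified
```

A test like this catches the failure mode described above: someone "simplifying" the prompt and silently deleting the negative examples that prevent hallucination.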

From Script to Production System
What started as a command-line tool evolved into a lightweight internal web application with a file upload interface, adopted by engineering teams at multiple sites. The core extraction logic — 600 lines of Python — remained identical.
Three takeaways for your next AI project:
- Start with the cheapest viable method. Deterministic extraction handled 70-80% of the corpus at zero cost. The LLM only added value because it was focused on the cases where rules fell short.
- Measure what matters to the stakeholder. Engineers didn't care whether the pipeline used PyMuPDF, GPT-4, or carrier pigeons. They cared that 4,700 drawings were processed in 45 minutes instead of four weeks, at $50-70 in API calls instead of £8,000+ in engineering time.
- Know where the model belongs in the system. Sometimes the highest-impact AI work isn't about using the most powerful model. It's about designing the system that keeps the model in its place.
Next Steps for Your Learning
If you're building similar systems, explore:
- Azure AI Document Intelligence for out-of-the-box form extraction
- LayoutLMv3 for layout-aware document understanding
- Prompt versioning with tools like LangChain for systematic prompt management
For more on production AI systems, check out our analyses of React Conf 2025: Key Takeaways on the Compiler, React 19.2, and the Future of Native and Beyond Prototypes: How Vercel's New v0 Brings AI Coding to Production.