OCR PDF

How it works

PDF pages are converted to images using Ghostscript

Tesseract OCR reads text from each page image

Extracted text is compiled into a .txt file for download

Server Requirements

Ghostscript — Required for PDF-to-image conversion. Download from ghostscript.com

Tesseract OCR — Required for text recognition. Download from tesseract-ocr.github.io. Language packs must be installed separately for non-English languages.
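A quick way to confirm both requirements are present is to look for their executables on the server's PATH. The sketch below is only an illustration: the names "gs" and "tesseract" are assumed defaults for Linux and macOS installs (on Windows the Ghostscript console binary is usually called "gswin64c"), and missing_tools is a hypothetical helper, not part of this tool.

```python
import shutil

# Assumed executable names; adjust for your platform
# (for example "gswin64c" for Ghostscript on Windows).
REQUIRED_TOOLS = ("gs", "tesseract")


def missing_tools(required=REQUIRED_TOOLS):
    """Return the names of required executables that are not on PATH."""
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Ghostscript and Tesseract were found on PATH.")
```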

Extract searchable, copyable text from a scanned or image-based PDF using optical character recognition (OCR), and download the result as a plain text file.

How It Works

How OCR PDF (Extract Text from Scanned PDF) Works

The server first uses Ghostscript to rasterize every page of your uploaded PDF into a separate PNG image at your chosen resolution (150, 200, or 300 DPI). Higher DPI produces sharper images for recognition but takes longer to process.
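The rasterization step could look roughly like the command built below. This is a sketch under assumptions, not the tool's actual invocation: the png16m device, the page-%03d.png file naming, and the ghostscript_command helper are illustrative choices, while the three DPI values come from the description above.

```python
from pathlib import Path

ALLOWED_DPI = (150, 200, 300)  # the three resolutions offered by the tool


def ghostscript_command(pdf_path, out_dir, dpi=200):
    """Build a Ghostscript command that writes one PNG per PDF page."""
    if dpi not in ALLOWED_DPI:
        raise ValueError(f"dpi must be one of {ALLOWED_DPI}, got {dpi}")
    # Ghostscript expands %03d to the page number: page-001.png, page-002.png, ...
    output_pattern = str(Path(out_dir) / "page-%03d.png")
    return [
        "gs",
        "-dSAFER", "-dBATCH", "-dNOPAUSE",  # non-interactive, restricted file access
        "-sDEVICE=png16m",                  # 24-bit colour PNG output (assumed device)
        f"-r{dpi}",                         # rasterization resolution in DPI
        f"-sOutputFile={output_pattern}",
        str(pdf_path),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) on a server where Ghostscript is installed.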

Each page image is then run through Tesseract OCR, an open-source text-recognition engine, which analyzes the pixels and outputs the recognized text for that page. Recognition uses the language pack you select (English by default, with other languages available).
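As an illustration of the recognition step, the sketch below calls the Tesseract command-line program on one page image and captures its output. The helper names tesseract_command and ocr_image are hypothetical, and "eng" is Tesseract's code for the English language pack; how this tool actually invokes Tesseract is not stated here.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build a Tesseract command that prints recognized text to stdout."""
    # Using "stdout" as the output base makes Tesseract write the text to
    # standard output instead of a file; -l selects the language pack.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one page image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```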

The recognized text from every page is concatenated with page-number markers and returned as a single downloadable .txt file. This tool extracts text only; it does not produce a new PDF with a text layer overlaid on the images.
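The concatenation step can be pictured as the small function below. It is a minimal sketch: the "=== Page N ===" marker follows the format shown in the worked example, but the exact spacing between pages and the join_pages name are assumptions.

```python
def join_pages(page_texts):
    """Concatenate per-page text, prefixing each page with a numbered marker."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        # Marker line, then the page text with surrounding whitespace trimmed.
        sections.append(f"=== Page {number} ===\n{text.strip()}\n")
    # A blank line between sections keeps pages visually separate.
    return "\n".join(sections)
```

The resulting string would then be written to the .txt file offered for download.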

Worked Example

See It In Action

A 5-page scanned invoice PDF with no selectable text, run through OCR at 200 DPI with the English language pack, returns a plain text file in which each page's recognized text is separated by a marker such as "=== Page 1 ===", ready to copy into a spreadsheet or document.
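If you want to work with the downloaded file page by page, the markers make it easy to split. The sketch below assumes every marker sits on its own line in exactly the "=== Page N ===" form; split_pages is a hypothetical helper for illustration.

```python
import re

# Matches a marker on its own line; \r? tolerates Windows line endings.
PAGE_MARKER = re.compile(r"^=== Page (\d+) ===\r?$", re.MULTILINE)


def split_pages(ocr_text):
    """Return {page_number: text} for text that uses '=== Page N ===' markers."""
    parts = PAGE_MARKER.split(ocr_text)
    # parts looks like: [text before first marker, "1", page 1 text, "2", ...]
    return {
        int(parts[i]): parts[i + 1].strip()
        for i in range(1, len(parts) - 1, 2)
    }
```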

FAQ

Frequently Asked Questions

What DPI should I choose?

200 DPI is a good default balance of accuracy and speed. Use 300 DPI for small or dense text where accuracy matters most, or 150 DPI for large, clear text where you want faster processing.
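To see why higher DPI costs more processing time, it helps to work out the image size: pixels equal inches multiplied by DPI in each direction, so the total pixel count grows with the square of the DPI. The example below assumes a US Letter page (8.5 x 11 inches), which is only an illustration; your scans may use a different page size.

```python
def page_pixels(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page rasterized at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed page size: US Letter, 8.5 x 11 inches.
for dpi in (150, 200, 300):
    width, height = page_pixels(8.5, 11, dpi)
    print(f"{dpi} DPI: {width} x {height} px ({width * height:,} pixels)")
```

On that assumed page size, 300 DPI produces four times as many pixels as 150 DPI, which is why it takes noticeably longer to recognize.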

Does this give me back a PDF, or just text?

Just text — the output is a downloadable .txt file containing everything Tesseract recognized, not a new PDF with a searchable text layer added on top of the scanned images.

Why did I get little or no text back?

This usually means the scan quality is too low, the page contains mostly non-text imagery, or the selected language doesn't match the document's actual language — try a higher DPI or the correct language pack.

Which languages are supported?

The available languages depend on which Tesseract language packs are installed on the server. English is the default, and other common language codes can be selected if the matching pack is installed.
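On a server you control, Tesseract can report which language packs it has through its --list-langs option. The sketch below parses that listing; the header line it skips ("List of available languages ...") can vary between Tesseract versions, so treat the parsing as an assumption, and note that parse_list_langs and installed_languages are hypothetical helper names.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by 'tesseract --list-langs'."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the descriptive header line.
        if not line or line.startswith("List of"):
            continue
        langs.append(line)
    return langs


def installed_languages():
    """Ask the local Tesseract install which language packs are available."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```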
