
OCR Converter

18-language OCR powered by Tesseract.js WebAssembly — extract text from images and PDFs entirely in your browser

Launch OCR Converter →

Table of Contents

  1. Overview
  2. Key Features
  3. How Tesseract.js Works
  4. 6 Output Formats Explained
  5. How to Use
  6. Frequently Asked Questions
  7. Privacy & Security

Overview

The OCR Converter is a browser-based optical character recognition tool that extracts text from images and PDF documents using Tesseract.js 5.1.1, a WebAssembly compilation of the world's most widely-used open-source OCR engine. It supports 18 languages spanning Latin, Cyrillic, CJK (Chinese-Japanese-Korean), Arabic, Devanagari, and Thai scripts, with multi-page PDF processing powered by Mozilla's PDF.js 3.11.174. All processing happens entirely in your browser — no images, PDFs, or extracted text are ever uploaded to any server.

The tool accepts seven file formats: six image types (PNG, JPG/JPEG, BMP, WEBP, GIF, and TIFF) plus PDF, each up to 50 MB in size. It outputs extracted text in six formats: plain TXT, Microsoft Word DOCX (via docx.js 8.5.0), PDF (via jsPDF 2.5.1), styled HTML, structured JSON with statistics, and line-by-line CSV. Each recognition includes a confidence score from 0–100%, giving you a quantitative measure of extraction accuracy.

The Tesseract OCR engine has a remarkable history. It was originally developed at Hewlett-Packard Labs in Bristol, England, beginning in 1985 as a proprietary research project. After over two decades of internal development, HP released Tesseract as open-source software in 2006, and Google adopted it as a sponsored project, contributing significant engineering resources. The pivotal upgrade came in 2018 with the addition of LSTM (Long Short-Term Memory) neural network recognition, dramatically improving accuracy for complex scripts, degraded images, and variable fonts. Tesseract.js brings this entire engine — including the LSTM neural networks — to the browser by compiling the C++ codebase to WebAssembly, delivering desktop-grade OCR accuracy without any server dependency.

The OCR Converter relies on five JavaScript libraries, delivered as six files totaling approximately 2.6 MB before language data: Tesseract.js (66 KB) for the OCR API plus its dedicated worker script (121 KB) for the WebAssembly thread, PDF.js (313 KB core + 1.1 MB worker) for PDF rendering, jsPDF (356 KB) for PDF output generation, docx (726 KB) for Word document creation, and FileSaver (2.7 KB) for cross-browser file downloads. Language data files range from 2–15 MB per language and are downloaded from a CDN on first use, then cached permanently in IndexedDB so subsequent OCR runs start instantly.

Key Features

Tesseract.js 5.1.1

The world's most widely used open-source OCR engine, compiled to WebAssembly. The same LSTM neural networks as desktop Tesseract, running in your browser with full recognition accuracy and no server dependency.

18 Languages

English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Chinese (Simplified + Traditional), Korean, Arabic, Hindi, Thai, Vietnamese, Polish, Dutch, and Turkish — covering Latin, Cyrillic, CJK, Arabic, Devanagari, and Thai scripts.

Multi-Page PDF OCR

PDF.js renders each page at 2x scale to canvas for higher resolution, Tesseract processes pages sequentially, extracted text is joined with page separators, and confidence scores are averaged across all pages.

6 Output Formats

TXT (plain text), DOCX (docx.js with heading and Calibri font), PDF (jsPDF A4 with auto page breaks), HTML (styled with viewport meta), JSON (structured with statistics), and CSV (line-by-line with UTF-8 BOM).

Confidence Scoring

0–100% accuracy estimate from the Tesseract engine for each image or page. Multi-page PDFs show the arithmetic mean across all pages. Color-coded display helps you assess extraction quality at a glance.

Automatic Preprocessing

Otsu thresholding, adaptive binarization, contrast normalization, deskewing, and automatic page segmentation — all handled internally by the WebAssembly engine before character recognition begins.

Real-Time Progress

Status messages ("Loading language data", "Initializing API", "Recognizing text") plus a percentage progress bar that tracks both PDF rendering and OCR recognition stages in real time.

5-Library Architecture

Tesseract.js (OCR engine), PDF.js (PDF rendering), jsPDF (PDF output), docx (DOCX output), and FileSaver (downloads) — all running locally in your browser with zero server communication.

How Tesseract.js Works

Tesseract.js operates through a 3-layer architecture that brings the full power of the desktop Tesseract OCR engine into the browser. Understanding this architecture helps explain both the capabilities and the performance characteristics of browser-based OCR.

Layer 1: JavaScript API

The top layer is the Tesseract.js JavaScript API (66 KB), which provides the interface your browser code interacts with. When the OCR Converter calls Tesseract.createWorker(lang, 1, {logger, workerPath}), this layer creates a new Web Worker and establishes communication between the main thread and the worker thread. The second parameter (1) is the OCR Engine Mode: OEM 1 selects LSTM-only recognition. The logger callback receives progress updates with two properties: m.status (a human-readable stage name like "loading language data", "initializing api", or "recognizing text") and m.progress (a float from 0 to 1 that the OCR Converter converts to a percentage for the progress bar). A fresh worker is created for each OCR run and terminated afterward to ensure clean memory management.
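
A minimal sketch of that worker lifecycle and logger wiring, using the createWorker call quoted above. Note that formatProgress and runOcr are hypothetical helper names for illustration, not the tool's actual code:

```javascript
// Hypothetical helper: turn a Tesseract.js logger message into a display
// string. `m.status` is the stage name; `m.progress` is a float in [0, 1].
function formatProgress(m) {
  const pct = Math.round(m.progress * 100);
  return `${m.status}: ${pct}%`;
}

// Sketch of one OCR run with the signature shown in the text above.
// Requires the Tesseract.js library to be loaded in the page.
async function runOcr(image, lang) {
  const worker = await Tesseract.createWorker(lang, 1, {
    logger: (m) => console.log(formatProgress(m)),
  });
  const { data } = await worker.recognize(image);
  await worker.terminate(); // fresh worker per run, terminated afterward
  return { text: data.text, confidence: data.confidence };
}
```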

Layer 2: Worker Thread

The middle layer is the Tesseract Worker (tesseract-worker-5.1.1.min.js, 121KB), which runs in a dedicated Web Worker thread separate from the main browser thread. This prevents the computationally intensive OCR processing from freezing the user interface. The worker thread loads the WebAssembly binary, initializes the OCR engine, loads the language data file, and executes the actual recognition pipeline. Because it runs in a Web Worker, the main thread remains responsive — the progress bar updates smoothly, and the user can still interact with the page.

Layer 3: WebAssembly Engine

The bottom layer is the compiled C++ Tesseract engine running as WebAssembly (WASM). This is the same codebase that powers the desktop version of Tesseract, cross-compiled using Emscripten. It includes the LSTM (Long Short-Term Memory) neural network recognizer that was added in the 2018 upgrade, which dramatically improved accuracy compared to the older pattern-matching approach. The WASM engine performs the entire recognition pipeline internally: grayscale conversion, contrast normalization, Otsu thresholding for binarization, adaptive local thresholding, connected component analysis, deskewing, page segmentation (PSM 3 — fully automatic), and finally LSTM-based character recognition.

The LSTM Recognition Process

The LSTM neural network is the core of modern Tesseract's accuracy. Unlike the older approach that matched individual characters against templates, LSTM processes sequences of pixels and learns contextual relationships between characters. It analyzes text line by line: each line is segmented, normalized in height, and fed through the neural network as a sequence. The LSTM network maintains internal memory cells that capture dependencies between characters — for example, recognizing that a "q" is almost always followed by "u" in English. This sequence-based approach is why Tesseract 4+ with LSTM achieves significantly higher accuracy than earlier versions, especially on degraded images, unusual fonts, and complex scripts like Chinese, Japanese, Korean, and Arabic.

Language Data Files

Each language requires a trained data file ranging from 2–15 MB that contains the LSTM network weights, character dictionaries, and language-specific rules. These files are downloaded from a public CDN on first use and cached permanently in the browser's IndexedDB storage. After the initial download, all future OCR runs in that language start immediately without any network request. The OCR Converter supports 18 language data files: eng, spa, fra, deu, ita, por, rus, jpn, chi_sim, chi_tra, kor, ara, hin, tha, vie, pol, nld, and tur.

Internal Preprocessing Pipeline

Before character recognition begins, the WASM engine automatically applies a series of image preprocessing steps. Otsu thresholding is a critical first step: it analyzes the pixel intensity histogram of the entire image to find the optimal threshold value that separates text (dark pixels) from background (light pixels). The algorithm assumes a bimodal histogram — one peak for background and one for text — and calculates the threshold that minimizes intra-class variance. For images with uneven lighting (such as photographs of documents), adaptive binarization supplements Otsu by calculating local thresholds for different regions of the image, handling shadows and lighting gradients that a single global threshold cannot address. Connected component analysis then identifies groups of connected dark pixels, which correspond to individual characters or character fragments. Deskewing detects and corrects slight rotations in the image, and page segmentation (PSM 3, fully automatic) identifies text blocks, columns, and reading order.
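
For illustration, Otsu's method itself is compact enough to sketch in plain JavaScript. This is a standalone reimplementation of the idea described above, not the engine's internal code:

```javascript
// Otsu's method on an array of grayscale pixel values (0-255). Returns
// the threshold that maximizes between-class variance, which is
// equivalent to minimizing intra-class variance.
function otsuThreshold(pixels) {
  const hist = new Array(256).fill(0);
  for (const p of pixels) hist[p]++;

  const total = pixels.length;
  let sumAll = 0;
  for (let i = 0; i < 256; i++) sumAll += i * hist[i];

  let sumB = 0;   // weighted intensity sum of the background class
  let wB = 0;     // background pixel count
  let best = 0;
  let maxVar = -1;

  for (let t = 0; t < 256; t++) {
    wB += hist[t];
    if (wB === 0) continue;
    const wF = total - wB;          // foreground pixel count
    if (wF === 0) break;
    sumB += t * hist[t];
    const mB = sumB / wB;           // background mean intensity
    const mF = (sumAll - sumB) / wF; // foreground mean intensity
    const betweenVar = wB * wF * (mB - mF) ** 2;
    if (betweenVar > maxVar) {
      maxVar = betweenVar;
      best = t;
    }
  }
  return best;
}

// Binarize: pixels at or below the threshold become black (text),
// the rest white (background).
function binarize(pixels, t) {
  return pixels.map((p) => (p <= t ? 0 : 255));
}
```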

Multi-Page PDF Pipeline

For PDF documents, the OCR Converter uses Mozilla's PDF.js (pdf-3.11.174.min.js) to render each page before sending it to Tesseract. The pipeline works as follows: (1) PDF.js loads the PDF file as an ArrayBuffer, (2) each page is rendered to an HTML Canvas element at 2x scale (200% of the original size — higher resolution yields better OCR accuracy because the LSTM network has more pixel data to analyze), (3) the rendered canvases are stored in a pdfPageImages array, (4) Tesseract processes the pages sequentially (page 1, then page 2, then page 3 — not in parallel, because only a single worker thread is used), (5) the extracted text from each page is concatenated with '\n\n--- Page N ---\n\n' separators, and (6) the confidence score is calculated as the arithmetic mean across all pages (totalConf / pageCount). The progress tracking accounts for both the PDF rendering stage and the per-page OCR stages.
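
Steps (5) and (6) of that pipeline can be sketched in plain JavaScript. The exact placement of separators relative to each page is an assumption here; the averaging follows the totalConf / pageCount rule described above:

```javascript
// Assemble per-page OCR results into one document, assuming each page
// after the first is preceded by its "--- Page N ---" marker.
function assemblePdfResult(pages) {
  // pages: [{ text, confidence }, ...] in page order
  let combined = '';
  let totalConf = 0;
  pages.forEach((page, i) => {
    if (i > 0) combined += `\n\n--- Page ${i + 1} ---\n\n`;
    combined += page.text;
    totalConf += page.confidence;
  });
  return {
    text: combined,
    confidence: totalConf / pages.length, // arithmetic mean across pages
  };
}
```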

6 Output Formats Explained

The OCR Converter generates output in six distinct formats, each created entirely in the browser using client-side JavaScript libraries. No format conversion happens on any server.

1. TXT — Plain Text

The simplest format: raw extracted text saved as a .txt file with UTF-8 encoding. The text is written directly as a Blob with MIME type text/plain. No formatting, no metadata, no structure — just the recognized characters exactly as Tesseract extracted them, with line breaks preserved. This format is universally compatible and ideal for pasting into other applications or further text processing.

2. DOCX — Microsoft Word Document

Generated using docx.js 8.5.0, the DOCX format creates a proper Microsoft Word document with structured content. The document begins with a Heading 1 element containing "OCR Extracted Text", followed by a metadata line that records the source filename, confidence percentage, and extraction date. The extracted text is split by line breaks, and each line becomes a separate Paragraph element in the document. Typography uses Calibri at 11pt for the body text and 16pt for the heading. The resulting .docx file can be opened in Microsoft Word, Google Docs, LibreOffice Writer, and any other word processor that supports the Open XML format.

3. PDF — Portable Document Format

Generated using jsPDF 2.5.1, the PDF output creates an A4 document (210×297mm) with professional formatting. The page uses 20mm margins on all sides, Helvetica font at 11pt for body text, a bold 16pt heading ("OCR Extracted Text"), and an italic 9pt gray metadata line showing source, confidence, and date. The text is rendered with 5.5mm line height, and jsPDF automatically inserts page breaks when text reaches the bottom margin. This format is ideal for archiving extracted text as a permanent, print-ready document.
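
As a sanity check on those layout numbers, a quick calculation shows how many text lines fit per page. This is back-of-envelope arithmetic only (it ignores the heading and metadata lines on the first page), and the helper names are illustrative:

```javascript
// A4 is 297 mm tall; 20 mm margins top and bottom leave 257 mm of
// usable height, which at 5.5 mm per line fits 46 full lines.
function linesPerA4Page(pageHeightMm = 297, marginMm = 20, lineHeightMm = 5.5) {
  return Math.floor((pageHeightMm - 2 * marginMm) / lineHeightMm);
}

// Rough page count for a given number of text lines.
function pagesNeeded(lineCount) {
  return Math.ceil(lineCount / linesPerA4Page());
}
```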

4. HTML — Styled Web Page

The HTML output creates a complete, self-contained web page with a DOCTYPE declaration, <meta charset="UTF-8">, and <meta name="viewport"> for mobile responsiveness. The body is styled with max-width: 800px, centered with auto margins, and uses the system font stack. Content includes an <h1> title, a <p> metadata block, and a <div> containing the extracted text with newline characters converted to <br> tags. All HTML entities in the extracted text are properly escaped to prevent rendering issues. The file opens directly in any web browser.
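
A minimal sketch of that HTML generation, assuming a standard five-entity escape and the styling described above (the exact markup the tool emits may differ):

```javascript
// Escape the five standard HTML entities. Ampersand must be replaced
// first so already-escaped entities are not double-escaped.
function escapeHtml(s) {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Build a self-contained page: escaped text with newlines turned
// into <br> tags, centered at max-width 800px with the system font.
function buildHtmlPage(text, title) {
  const body = escapeHtml(text).replace(/\n/g, '<br>\n');
  return `<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>${escapeHtml(title)}</title>
<style>body { max-width: 800px; margin: 0 auto; font-family: system-ui, sans-serif; }</style>
</head>
<body>
<h1>${escapeHtml(title)}</h1>
<div>${body}</div>
</body>
</html>`;
}
```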

5. JSON — Structured Data

The JSON output provides the most detailed and machine-readable format. The structure contains: source (original filename), extractedAt (ISO 8601 timestamp), language (language code used for OCR), confidence (percentage rounded to 1 decimal place), statistics (an object with characters, words, and lines counts), text (the full extracted text as a single string), and paragraphs (an array of strings created by splitting the text on double newlines). This format is ideal for programmatic processing, data pipelines, and integration with other tools or scripts.
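
That structure can be sketched as follows. The exact word- and paragraph-counting rules here are assumptions for illustration, not the tool's verbatim code:

```javascript
// Assemble the JSON payload with the fields listed above. Statistics
// are derived from the text; paragraphs come from splitting on blank lines.
function buildJsonResult(text, source, language, confidence) {
  const words = text.split(/\s+/).filter(Boolean);
  return {
    source,
    extractedAt: new Date().toISOString(),        // ISO 8601 timestamp
    language,
    confidence: Math.round(confidence * 10) / 10, // 1 decimal place
    statistics: {
      characters: text.length,
      words: words.length,
      lines: text.split('\n').length,
    },
    text,
    paragraphs: text.split(/\n{2,}/).filter((p) => p.trim().length > 0),
  };
}
```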

6. CSV — Comma-Separated Values

The CSV output structures the extracted text as a spreadsheet-compatible file with one row per line of text. The file begins with a UTF-8 BOM (\uFEFF) to ensure proper character encoding when opened in Excel. The first row contains headers: "Line,Text". Each subsequent row contains a line number and the corresponding text, with the text value enclosed in double quotes and any internal double quotes escaped by doubling them. This format opens directly in Microsoft Excel, Google Sheets, LibreOffice Calc, and any CSV-compatible tool.
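
A minimal sketch of that CSV assembly (illustrative, not the tool's exact code):

```javascript
// Build the CSV: UTF-8 BOM, a "Line,Text" header row, then one quoted
// row per line of text with internal double quotes doubled.
function buildCsv(text) {
  const rows = ['Line,Text'];
  text.split('\n').forEach((line, i) => {
    const escaped = line.replace(/"/g, '""'); // escape quotes by doubling
    rows.push(`${i + 1},"${escaped}"`);
  });
  return '\uFEFF' + rows.join('\n'); // BOM so Excel detects UTF-8
}
```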

How to Use

  1. Open the OCR Converter — Navigate to the tool and drag-and-drop an image or PDF file onto the upload area, or click to browse. Supported formats: PNG, JPG/JPEG, BMP, WEBP, GIF, TIFF, and PDF (up to 50 MB).
  2. Preview your file — A preview of your uploaded file appears immediately. For images, the full image is displayed. For PDFs, the first page is rendered as a preview image using PDF.js.
  3. Select the OCR language — Choose from 18 languages in the dropdown menu. The first time you use a language, the language data file (2–15 MB) is downloaded from a CDN and cached permanently in IndexedDB for all future uses.
  4. Choose your output format — Select your desired format from the dropdown: TXT (plain text), DOCX (Word document), PDF (formatted document), HTML (web page), JSON (structured data), or CSV (spreadsheet).
  5. Click "Extract Text" — The progress bar activates, showing real-time status messages: "Loading language data" (first use only), "Initializing API", and "Recognizing text" with percentage completion. For multi-page PDFs, progress tracks each page sequentially.
  6. Review the results — The extracted text appears in the result area. Check the confidence score (0–100%) and the word, character, and line statistics to assess extraction quality. Scores above 85% indicate good accuracy.
  7. Copy or download — Click the copy button to copy the extracted text to your clipboard, or click the download button to save the output in your selected format. The file is generated entirely in your browser and downloaded directly.

Frequently Asked Questions

What is Tesseract.js?
Tesseract.js is a WebAssembly compilation of the Tesseract OCR engine, which was originally developed at Hewlett-Packard Labs in 1985, released as open source by HP in 2006, adopted by Google as a sponsored project, and upgraded with LSTM neural network recognition in 2018. Tesseract.js brings the same LSTM neural networks and recognition pipeline to the browser by compiling the C++ source code to WebAssembly using Emscripten. Version 5.1.1 uses a 3-layer architecture: a JavaScript API on the main thread, a Web Worker for background processing, and the WASM engine that performs the actual character recognition. The result is desktop-grade OCR accuracy running entirely in your browser with no server required.
How many languages are supported?
The OCR Converter supports 18 languages: English (eng), Spanish (spa), French (fra), German (deu), Italian (ita), Portuguese (por), Russian (rus), Japanese (jpn), Chinese Simplified (chi_sim), Chinese Traditional (chi_tra), Korean (kor), Arabic (ara), Hindi (hin), Thai (tha), Vietnamese (vie), Polish (pol), Dutch (nld), and Turkish (tur). These cover six major script families: Latin (English, Spanish, French, German, Italian, Portuguese, Vietnamese, Polish, Dutch, Turkish), Cyrillic (Russian), CJK (Japanese, Chinese Simplified, Chinese Traditional, Korean), Arabic (Arabic), Devanagari (Hindi), and Thai (Thai). Each language requires a separate trained data file containing LSTM network weights and language-specific dictionaries.
Why is the first conversion slow?
The first time you use a particular language, the OCR Converter must download the language data file from a public CDN. These files range from 2–15 MB depending on the language (CJK languages like Chinese and Japanese have larger files due to their extensive character sets). After the initial download, the language data is cached permanently in your browser's IndexedDB storage. All subsequent OCR runs in that language start immediately without any network request. The progress bar shows "Loading language data" during this one-time download. If you switch to a new language you have not used before, another download occurs for that language's data file.
Can it OCR multi-page PDFs?
Yes. The OCR Converter uses Mozilla's PDF.js (version 3.11.174) to handle multi-page PDFs. The pipeline works in stages: PDF.js loads the entire PDF as an ArrayBuffer, then renders each page to an HTML Canvas element at 2x scale (200% of original resolution) for improved OCR accuracy. Tesseract then processes the pages sequentially — one at a time using a single worker thread, not in parallel. The extracted text from each page is concatenated with page separators formatted as '\n\n--- Page N ---\n\n'. The final confidence score is the arithmetic mean of confidence scores across all pages. The progress bar tracks both the PDF rendering stage and each individual page's OCR stage.
What affects OCR accuracy?
Several factors influence recognition accuracy. Resolution: 300+ DPI is ideal; lower resolutions reduce accuracy because the LSTM network has fewer pixels to analyze. Contrast: black text on white background produces the best results; low-contrast or colored backgrounds degrade accuracy. Alignment: straight, horizontal text is recognized most accurately; Tesseract applies automatic deskewing, but severely rotated or warped text will suffer. Font quality: standard printed fonts are recognized well; decorative, stylized, or very small fonts reduce accuracy. Language selection: choosing the correct language is critical, as each language's LSTM model is trained on that specific script and vocabulary. Handwriting: Tesseract is optimized for printed text, so handwriting accuracy is significantly lower, especially for cursive.
What is Otsu thresholding?
Otsu thresholding is an automatic image binarization technique that converts a grayscale image to pure black-and-white. It works by analyzing the pixel intensity histogram of the entire image, which typically shows two peaks — one for background pixels (usually light) and one for text pixels (usually dark). The Otsu algorithm calculates the threshold value that minimizes the intra-class variance (the variance within the background pixel group and the text pixel group separately), which is mathematically equivalent to maximizing the inter-class variance between the two groups. This produces the optimal dividing point: pixels above the threshold become white (background) and pixels below become black (text). Tesseract applies Otsu thresholding automatically as part of its internal preprocessing pipeline, supplemented by adaptive binarization for images with uneven lighting.
Does it preserve formatting?
No. OCR extracts raw text content only. Bold, italic, underline, font sizes, font families, tables, columns, headers, footers, margins, and page layout are not preserved. The Tesseract engine recognizes characters and their reading order, but does not detect or reproduce visual formatting. If your document has multiple columns, Tesseract's page segmentation (PSM 3, fully automatic) will attempt to identify the columns and read them in the correct order, but the output will be linear plain text. For documents where layout preservation is critical, consider using the PDF Forge converter which works with native (non-scanned) PDFs that already contain embedded text.
What is the JSON output structure?
The JSON output provides the most detailed, machine-readable format. The structure is: {"source": "filename.png", "extractedAt": "2026-03-26T12:00:00.000Z", "language": "eng", "confidence": 94.7, "statistics": {"characters": 1523, "words": 287, "lines": 42}, "text": "full extracted text...", "paragraphs": ["paragraph 1...", "paragraph 2..."]}. The source field contains the original filename, extractedAt is an ISO 8601 timestamp, confidence is rounded to 1 decimal place, statistics provides character, word, and line counts, text contains the complete extracted text as a single string, and paragraphs is an array created by splitting the text on double newline characters. This format is ideal for integration with scripts, APIs, and data processing pipelines.
Can it read handwriting?
In a limited capacity, yes, but with significantly reduced accuracy. Tesseract's LSTM neural networks are primarily trained on printed text: standard computer fonts, book typefaces, and machine-generated characters. Handwriting recognition is possible for neat, printed handwriting (block letters), but accuracy drops substantially for cursive, connected script, and individual handwriting styles. The engine has no specific training for handwriting recognition, and confidence scores for handwritten text typically fall below 70%. For best results with handwritten content, ensure high contrast (dark ink on white paper) and good lighting, and print as neatly as possible. Dedicated handwriting recognition systems use different neural network architectures specifically trained on handwriting datasets.
Is my document data safe?
Yes, completely. All OCR processing runs in your browser via WebAssembly — no images, PDFs, or extracted text are ever uploaded to any server. The Tesseract.js engine executes within a Web Worker thread in your browser, PDF.js renders pages to local Canvas elements, and all output generation (DOCX via docx.js, PDF via jsPDF, HTML, JSON, CSV) happens via client-side JavaScript libraries. The only network request is the one-time language data download from a public CDN (cdn.jsdelivr.net), which contains only the generic language model — no user data is included in this request. Once cached in IndexedDB, even this request is eliminated. Your documents never leave your device.

Privacy & Security

Your Documents Never Leave Your Device

OCR of documents is one of the most privacy-sensitive operations you can perform online. Scanned receipts contain financial data, photographed documents contain personal information like names, addresses, and identification numbers, and PDFs may contain confidential business content, legal documents, or medical records. Uploading these to a cloud-based OCR service means entrusting your most sensitive documents to a third party.

The OCR Converter processes everything locally. Tesseract.js runs as WebAssembly in your browser's Web Worker thread, PDF.js renders PDF pages to local Canvas elements, and all output generation — DOCX (docx.js), PDF (jsPDF), HTML, JSON, and CSV — happens via client-side JavaScript libraries running on your device. The only network request is the one-time language data download from a public CDN, which is a generic language model containing no user data and is cached permanently in IndexedDB after the first download. Your images are never uploaded. Your PDFs are never uploaded. Your extracted text is never uploaded. There is no server, no API endpoint, and no cloud processing. You are in complete control of your documents at all times.

Ready to extract text from your images and PDFs? It's free, private, and runs entirely in your browser.

Launch OCR Converter →


Milan Salvi

Founder, Leena Software Solutions

Milan is the founder of ZeroDataUpload and Leena Software Solutions, building privacy-first browser tools that process everything client-side.

Last Updated: March 26, 2026