OCR to JSON API (PDF/Image Text Extractor) — Flask + Tesseract + PyMuPDF

A lightweight text extraction tool built with Flask, Tesseract OCR, and PyMuPDF.
It extracts text from scanned images and PDF documents and returns it as structured JSON, through both a web interface and a REST API.


Key Features

  • Upload Support — Accepts PDF and image files (JPG, PNG, TIFF, BMP, GIF)
  • OCR + Digital Text Extraction — Handles both scanned and digital text
  • Multilingual Support — English and Hindi (easily extendable to other languages)
  • JSON Output — Extracted text returned as structured JSON via API
  • Modern Web UI — Built with Bootstrap for a clean, responsive interface
  • Cross-Platform — Works seamlessly on Windows, macOS, and Linux
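
The scanned-versus-digital handling and the language selection listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `LANG_CODES`, `tesseract_lang`, `pick_text`, and `extract_pdf_text`, as well as the 300 DPI render setting, are assumptions. It also assumes that PyMuPDF, pytesseract, and Pillow are installed and that the Tesseract `eng` and `hin` language data are available.

```python
import io

# Tesseract language codes for the languages named above (hypothetical table).
LANG_CODES = {"english": "eng", "hindi": "hin"}


def tesseract_lang(languages):
    """Join language names into Tesseract's 'eng+hin' style lang string."""
    return "+".join(LANG_CODES[name.lower()] for name in languages)


def pick_text(embedded_text, ocr):
    """Prefer a page's embedded (digital) text; call ocr() only if it is blank."""
    if embedded_text.strip():
        return embedded_text
    return ocr()


def extract_pdf_text(path, languages=("English", "Hindi")):
    """Return one text string per page of the PDF at `path`."""
    # Third-party imports are local so the helpers above work without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    lang = tesseract_lang(languages)
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            def ocr(page=page):
                # Render the page to an image and run Tesseract on it.
                pix = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                return pytesseract.image_to_string(image, lang=lang)

            pages.append(pick_text(page.get_text(), ocr))
    return pages
```

Trying the embedded text first avoids running OCR on digital PDFs, where it is unnecessary. Supporting another language in this sketch would mean adding its Tesseract code to `LANG_CODES` and installing the matching language data.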

Tech Stack

  • Backend: Flask (Python)
  • OCR Engine: Tesseract OCR
  • PDF Processing: PyMuPDF
  • Frontend: Bootstrap
  • API Format: REST (JSON Response)
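
To show how these pieces might fit together, here is a hedged sketch of a Flask endpoint that accepts an upload and returns JSON. The route `/api/ocr`, the form field name `file`, the response schema, and the helper names are all hypothetical and are not taken from the project. The `extract` callable stands in for whatever extraction routine the application uses.

```python
import os

# Extensions from the upload formats listed under Key Features, plus the
# common "jpeg" and "tif" spellings (an assumption).
ALLOWED_EXTENSIONS = {"pdf", "jpg", "jpeg", "png", "tiff", "tif", "bmp", "gif"}


def allowed_file(filename):
    """Return True if the filename has a supported extension."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    return ext in ALLOWED_EXTENSIONS


def build_payload(filename, pages):
    """Shape page texts into a JSON-serializable dict (hypothetical schema)."""
    return {
        "filename": filename,
        "page_count": len(pages),
        "pages": [
            {"page": number, "text": text}
            for number, text in enumerate(pages, start=1)
        ],
    }


def create_app(extract):
    """Build a Flask app; `extract` maps an uploaded file to a list of page texts."""
    # Imported here so the helpers above can be used without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/api/ocr", methods=["POST"])  # hypothetical route
    def ocr_endpoint():
        upload = request.files.get("file")
        if upload is None or not allowed_file(upload.filename or ""):
            return jsonify({"error": "unsupported or missing file"}), 400
        return jsonify(build_payload(upload.filename, extract(upload)))

    return app
```

Under these assumptions, a client would send a multipart POST with a `file` field and receive either the page texts as JSON or a 400 response with an `error` message.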