Image to Malayalam Text with Tesseract

If you need to convert a PDF made of scanned images, first convert each page to PNG with ImageMagick.
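The page conversion could be scripted roughly as below. This is a minimal sketch, not a tested recipe: it assumes ImageMagick 7 (which provides the magick command; older releases use convert instead), that Ghostscript is installed so ImageMagick can read PDFs, and a resolution of 300 DPI. The function names and the file name scan.pdf are hypothetical.

```python
import subprocess
from pathlib import Path


def pdf_to_png_command(pdf_path, out_dir=".", prefix="pdfPage", dpi=300):
    """Build the ImageMagick command that renders each PDF page to a PNG."""
    # %d is replaced by the page index: pdfPage-0.png, pdfPage-1.png, ...
    pattern = str(Path(out_dir) / f"{prefix}-%d.png")
    # -density is given before the input so it applies when the PDF is read.
    return ["magick", "-density", str(dpi), str(pdf_path), pattern]


def convert_pdf(pdf_path, out_dir="."):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pdf_to_png_command(pdf_path, out_dir), check=True)
```

Calling convert_pdf("scan.pdf") should leave pdfPage-0.png, pdfPage-1.png and so on in the current folder, matching the file names used in the commands on this page.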

The Windows installer is available at https://github.com/UB-Mannheim/tesseract/wiki

After installing, add C:\Program Files\Tesseract-OCR to the PATH environment variable.

Test the installation with
tesseract --list-langs
The languages available by default are eng (English) and osd (orientation and script detection). More languages can be added.
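If you want to check for a language from a script rather than by eye, the output of tesseract --list-langs can be parsed. The sketch below assumes the output is a header line starting with "List of" followed by one language code per line; the function names are hypothetical.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. 'List of available languages in "..." (2):'
    return [line for line in lines if not line.startswith("List of")]


def installed_langs():
    """Ask the tesseract executable on PATH which languages it can use."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Both streams are read in case a build prints the list to stderr.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

With this, "mal" in installed_langs() tells you whether Malayalam data is installed before you start a long OCR run.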

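Extract text with the command below. Tesseract appends .txt to the output name by itself, so give the output name without an extension (otherwise the file ends up as pdfPage-0.txt.txt). The OCR folder should already exist.

tesseract pdfPage-0.png OCR/pdfPage-0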

Command for OCR of Korean text. This works only if the corresponding language data is installed; confirm with tesseract --list-langs
tesseract pdfPage-1.png OCR/pdfPage-1 -l kor

For this, the matching language data file, named tessdata\<lang>.traineddata (for example tessdata\kor.traineddata), must be present in the Tesseract installation folder.
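The file naming rule can be checked directly on disk. The sketch below assumes the default install location C:\Program Files\Tesseract-OCR\tessdata; pass a different folder if Tesseract is installed elsewhere. The function names are hypothetical.

```python
from pathlib import Path

# Assumed default location of the language data on Windows.
DEFAULT_TESSDATA = r"C:\Program Files\Tesseract-OCR\tessdata"


def traineddata_path(lang, tessdata_dir=DEFAULT_TESSDATA):
    """Return the path where the data file for a language code is expected."""
    return Path(tessdata_dir) / f"{lang}.traineddata"


def has_language(lang, tessdata_dir=DEFAULT_TESSDATA):
    """True if <lang>.traineddata exists in the given tessdata folder."""
    return traineddata_path(lang, tessdata_dir).is_file()
```

For example, has_language("kor") should be True before you run a command with -l kor.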

You can create one using tools like jTessBoxEditor, a Java box editor for Tesseract OCR training data (https://www.youtube.com/watch?v=-GBQcgA14PQ).
See also: Training Tesseract 5 for a New Font

Malayalam

Download mal.traineddata

Reference thread: https://groups.google.com/g/tesseract-ocr/c/U1JjX5ZNn1Q/m/BCqy_2Ge3F4J

Download it from https://tesseract-ocr.github.io/tessdoc/Data-Files.html#latest-data-files-september-15-2017

or from https://git.archive.org/archivecd/tessdata_fast

For example: https://git.archive.org/archivecd/tessdata_fast/-/blob/master/mal.traineddata?ref_type=heads

Copy the downloaded file to C:\Program Files\Tesseract-OCR\tessdata

Command for Malayalam OCR. Tesseract appends .txt to the output name, so this writes the recognized text to GSBVPNotice.txt
tesseract GSBVPNotice.jpeg GSBVPNotice -l mal
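For a multi-page document, the same command can be run once per page image. This is a rough sketch that assumes the pages are named pdfPage-0.png, pdfPage-1.png and so on, that the text should go into a folder called OCR, and that tesseract is on PATH; the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def ocr_command(image, out_dir="OCR", lang="mal"):
    """Build the tesseract command for one image."""
    # Tesseract appends .txt to the output base, so the extension is left off.
    out_base = Path(out_dir) / Path(image).stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pages(image_dir=".", out_dir="OCR", lang="mal", pattern="pdfPage-*.png"):
    """Run tesseract on every page image that matches the pattern."""
    # Create the output folder first, since the commands write into it.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(image_dir).glob(pattern)):
        subprocess.run(ocr_command(image, out_dir, lang), check=True)
```

Calling ocr_pages() should produce one text file per page, such as OCR/pdfPage-0.txt. Use lang="kor" for the Korean example.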
