Image to Malayalam Text with Tesseract

First, install ImageMagick.

If you need to OCR a PDF made of scanned images, convert each page to PNG with ImageMagick first.
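The conversion step could be scripted roughly as follows. This is a minimal sketch, assuming ImageMagick 7 (the magick command) and Ghostscript are installed and on PATH. The helper names, the 300 DPI density, and the example file name scan.pdf are illustrative assumptions, not part of these notes.

```python
import subprocess

def pdf_to_png_command(pdf_path, out_pattern="pdfPage-%d.png", dpi=300):
    """Build the ImageMagick command that writes one PNG per PDF page."""
    # -density placed before the input sets the rasterization resolution.
    # %d in the output name is replaced by the page index (0, 1, 2, ...).
    return ["magick", "-density", str(dpi), pdf_path, out_pattern]

def convert_pdf(pdf_path):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(pdf_to_png_command(pdf_path), check=True)

# Example with a hypothetical file name:
# convert_pdf("scan.pdf")
```

With the default pattern the pages come out as pdfPage-0.png, pdfPage-1.png and so on, matching the file names used in the OCR commands.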

The Windows installer for Tesseract is at https://github.com/UB-Mannheim/tesseract/wiki

After installing, add C:\Program Files\Tesseract-OCR to the PATH environment variable.

Test the installation with
tesseract --list-langs
The languages available by default are eng and osd. More languages can be added.
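The same check can be done from a script. The sketch below assumes the output of tesseract --list-langs is a header line starting with "List of" followed by one language code per line. The function names are hypothetical.

```python
import subprocess

def parse_langs(output):
    """Return the language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line, e.g. 'List of available languages (2):'.
    return [line for line in lines if not line.startswith("List of")]

def installed_langs():
    """Ask the installed tesseract which languages it can use."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Some older builds print the list on stderr, so fall back to it.
    return parse_langs(result.stdout or result.stderr)

# Example:
# if "mal" not in installed_langs():
#     print("mal.traineddata is not installed yet")
```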

Extract text with

tesseract pdfPage-0.png OCR/pdfPage-0

Tesseract appends .txt to the output name, so this writes OCR/pdfPage-0.txt.
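For a PDF with many pages, the command can be repeated for every page image. The following is a rough sketch, assuming the page images are named pdfPage-N.png and sit in the current folder. The function names and defaults are illustrative. Note that Tesseract appends .txt to the output name itself, so the sketch passes the name without an extension.

```python
import subprocess
from pathlib import Path

def ocr_command(image, out_dir="OCR", lang=None):
    """Build the tesseract command for one page image."""
    # Output base is the image name without extension, inside out_dir;
    # tesseract adds the .txt extension on its own.
    out_base = (Path(out_dir) / Path(image).stem).as_posix()
    cmd = ["tesseract", str(image), out_base]
    if lang:
        cmd += ["-l", lang]  # e.g. "kor" or "mal"
    return cmd

def ocr_all_pages(folder=".", pattern="pdfPage-*.png", out_dir="OCR", lang=None):
    """Run OCR on every page image matching the pattern."""
    Path(out_dir).mkdir(exist_ok=True)  # make sure the output folder exists
    for image in sorted(Path(folder).glob(pattern)):
        subprocess.run(ocr_command(image, out_dir, lang), check=True)

# Example:
# ocr_all_pages(lang="mal")
```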

Command for OCR of Korean text. This works only if the corresponding language data is installed; confirm with tesseract --list-langs
tesseract pdfPage-1.png OCR/pdfPage-1 -l kor

For that, the matching data file tessdata\lang.traineddata (for example tessdata\kor.traineddata) must be present in the Tesseract installation folder.

You can create your own language data with tools like jTessBoxEditor, a Java box editor for Tesseract OCR data (https://www.youtube.com/watch?v=-GBQcgA14PQ).
Training Tesseract 5 for a New Font

Malayalam

Download mal.traineddata

https://groups.google.com/g/tesseract-ocr/c/U1JjX5ZNn1Q/m/BCqy_2Ge3F4J
Download it from https://tesseract-ocr.github.io/tessdoc/Data-Files.html#latest-data-files-september-15-2017

or from https://git.archive.org/archivecd/tessdata_fast

For example: https://git.archive.org/archivecd/tessdata_fast/-/blob/master/mal.traineddata?ref_type=heads

Copy to C:\Program Files\Tesseract-OCR\tessdata
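To confirm the file landed in the right place, a small check like the one below may help. It is a sketch that assumes the default install folder named in these notes. The function name is hypothetical.

```python
from pathlib import Path

# Default install location from these notes; adjust if Tesseract lives elsewhere.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")

def has_traineddata(lang, tessdata_dir=TESSDATA_DIR):
    """Return True if the traineddata file for lang is in the tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()

# Example:
# print(has_traineddata("mal"))
```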

Command for Malayalam OCR
tesseract GSBVPNotice.jpeg GSBVPNotice -l mal
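Tesseract writes its text output as UTF-8. When reading the result from a script on Windows, it is safer to state that encoding explicitly so the Malayalam characters are not garbled. A minimal sketch, with a hypothetical helper name:

```python
from pathlib import Path

def read_ocr_text(txt_path):
    """Read a Tesseract output file as UTF-8 text."""
    # The default text encoding on Windows is often not UTF-8,
    # so pass the encoding explicitly.
    return Path(txt_path).read_text(encoding="utf-8")

# Example, using the text file written by the command above:
# print(read_ocr_text("GSBVPNotice.txt"))
```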