This is a very light tutorial on doing OCR on Linux, using ImageMagick to optimize the images and the trial version of a commercial OCR package, OCR Shop XTR, which is really powerful and does the job very well.
Note: I have no relation with www.vividata.com and I'm not advertising their product. It's just a product that I've tried and came to respect.
Requirements:
1- ImageMagick: I think you can find an ImageMagick package on any popular Linux distro, whether it is desktop or server oriented. If you can't find one, you can download and install it from its website: http://www.imagemagick.org/
2- OCR Shop XTR for Linux: you can download your trial version here. You'll have to provide your machine's hostname and your network card's MAC address to get the key. Installation is really easy. You can follow the instructions here.
Note: Image processing can take a very long time when you process hundreds of images, so be patient and test your options on a few sample images first before you apply them to many images.
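One possible way to set up such a test run is to copy a handful of images into a separate folder and experiment there. The sketch below is only an illustration: make_sample and the folder names are made up for this example and are not part of ImageMagick or OCR Shop XTR.

```shell
#!/bin/sh
# Hypothetical helper: copy the first N of the given files into a sample directory.
# Usage: make_sample N SAMPLE_DIR FILE...
make_sample() {
    count=$1
    sample_dir=$2
    shift 2
    mkdir -p "$sample_dir"
    for f in "$@"; do
        # Stop once N files have been copied.
        [ "$count" -gt 0 ] || break
        cp -- "$f" "$sample_dir"/
        count=$((count - 1))
    done
}
```

For example, make_sample 5 sample scans/*.jpg would copy the first five JPEG files from a (hypothetical) scans folder into sample, where you can try different options without waiting for the whole batch.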
Steps:
1- We need to optimize the images so that ocrxtr recognizes them well, so we'll use the "convert" command, which is bundled with the ImageMagick package, to do the job. You can skip this step if your images already have good resolution and clarity.
convert sourceimage.ext -resize 200% -fill white -tint 60 -level 0%,80% -sharpen 3 -compress none -monochrome destinationimage.tif
You can fine-tune those options to suit your needs, but these are the ones that worked for me after many attempts.
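If you have a whole folder of images, a small wrapper around the same command might save some typing. This is a minimal sketch, not a definitive script: tif_name, optimize_images and the DRY_RUN switch are names invented for this example, and the convert options are simply the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper: map a source image path to a .tif path in an output directory.
# Usage: tif_name SOURCE OUTDIR
tif_name() {
    base=${1##*/}      # strip the directory part
    base=${base%.*}    # strip the last extension
    printf '%s/%s.tif\n' "$2" "$base"
}

# Hypothetical helper: run the step 1 conversion on every file given.
# Usage: optimize_images OUTDIR FILE...
# Set DRY_RUN=1 to print the commands instead of running them.
optimize_images() {
    outdir=$1
    shift
    ${DRY_RUN:+echo} mkdir -p "$outdir"
    for src in "$@"; do
        dst=$(tif_name "$src" "$outdir")
        ${DRY_RUN:+echo} convert "$src" -resize 200% -fill white -tint 60 \
            -level 0%,80% -sharpen 3 -compress none -monochrome "$dst" \
            || echo "convert failed for $src" >&2
    done
}
```

Running DRY_RUN=1 first and then calling optimize_images tif scans/*.jpg (both folder names are assumptions) lets you check the generated commands before committing to a long run. Writing the results to a separate folder avoids overwriting a source file that is already a .tif.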
2- If you skipped step 1, you still need to do this so that the image can be used successfully with ocrxtr:
convert sourceimage.ext -compress none -monochrome destinationimage.tif
3- Now let's run the OCR, assuming that you want PDF files that contain the recognized text hidden transparently under the image, that you want to keep the image itself as it is, and that you are fine with overwriting the destination file:
/opt/Vividata/bin/ocrxtr -overwrite=y -in_res=150 -out_text_format=pdf -out_text_name="%s.pdf" destinationimage.tif
You can read the OCR Shop XTR documentation if you want to play with other options.
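To run step 3 over many files, a loop like the following could work. It is only a sketch under a few assumptions: ocr_to_pdf, OCRXTR and DRY_RUN are names invented for this example, the flags are copied unchanged from the command in step 3, and ocrxtr is called once per file because that is the only form shown here.

```shell
#!/bin/sh
# Path used in step 3; set OCRXTR beforehand if your install lives elsewhere.
OCRXTR=${OCRXTR:-/opt/Vividata/bin/ocrxtr}

# Hypothetical helper: run the step 3 command once for each .tif file given.
# Usage: ocr_to_pdf FILE...
# Set DRY_RUN=1 to print the commands instead of running them.
ocr_to_pdf() {
    for tif in "$@"; do
        ${DRY_RUN:+echo} "$OCRXTR" -overwrite=y -in_res=150 \
            -out_text_format=pdf -out_text_name="%s.pdf" "$tif" \
            || echo "OCR failed for $tif" >&2
    done
}
```

For example, ocr_to_pdf tif/*.tif would process every TIFF in a (hypothetical) tif folder, reporting any file that fails on standard error and moving on to the next one.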
