Extract Text from Images or PDF Files

Pendem Raju | How To

Sometimes, while surfing the web, you come across images that contain quotes or other text related to your search, or you download PDF files that are read-only. Many people want to extract the text from those images or PDF files.

OCR (Optical Character Recognition) is used to convert images or scanned copies of books into editable text. Many free OCR tools are available online. The following are a few of them.

Part 1: Online OCR software

Online OCR tools are handy and you don’t need to install any software. Just upload the image from which you want to extract the text. The online OCR tool will extract the text for you.

1. Google Docs

If you have a Gmail or other Google account, you might try Google Docs first. It is not an OCR tool, but through its file conversion feature you can convert an image into a PDF or DOC file.

To get text from image or PDF files, you first need to upload the files and convert them to Google Docs format. You can then edit the text online and/or download it as PDF, DOC, TXT, etc.

To upload files in Google Docs, click the Upload button. During the upload you are asked about conversion. If you are uploading an image file, convert it to PDF so that you can extract the text from it.

Google Docs conversion works pretty well, especially with English text. You can select from over 30 languages, but if your language is not in the list, the conversion may give an error and the file will not be processed. If you don’t have a Google account, you can create one at any time.

  • Input image file types: most bitmap formats
  • Input PDF files: yes
  • Output file types: ODT, PDF, TXT, RTF, DOC, HTML
  • Languages: 30+
Google Docs PROS:
  • Unlimited processing capacity
Google Docs CONS:
  • Text in some minor languages may not be recognized

2. Free Online OCR

  • Input image file types: GIF, BMP, JPEG, TIFF, PNG
  • Input PDF files: yes
  • Output file types: DOC, PDF, RTF, TXT
  • Languages: English dictionary only
Free Online OCR PROS:
  • No capacity limits for processing
  • Keeps original formatting and layout
Free Online OCR CONS:
  • Only an English dictionary is supported. Text in other languages may not be recognized

3. i2OCR

  • Input image file types: TIF, JPEG, PNG, BMP, GIF, PBM, PGM, PPM
  • Input PDF files: no
  • Output file types: TXT
  • Languages: 30+
i2OCR PROS:
  • No limits for uploading
  • Has a review option after character recognition: the original image and the resulting text are shown side by side on screen.
i2OCR CONS:
  • Only text output; all the original formatting is lost, though it at least handles multi-column pages correctly.
  • Creates “hard” line breaks at the end of each line.
  • Does not process PDF files.
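
The “hard” line breaks mentioned above can be cleaned up after the fact. The following is a minimal Python sketch, not part of i2OCR itself: the function name unwrap_hard_linebreaks is made up for illustration, and it assumes that blank lines separate paragraphs and that a hyphen at the end of a line marks a word split across two lines (which will wrongly merge genuinely hyphenated words that happen to fall at a line end).

```python
import re


def unwrap_hard_linebreaks(text: str) -> str:
    """Join the lines inside each paragraph; keep blank lines as paragraph breaks."""
    # Normalize Windows and old Mac newlines first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    unwrapped = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-"):
                # Assumption: a trailing hyphen means the word continues on the next line.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        unwrapped.append(joined)
    return "\n\n".join(unwrapped)


if __name__ == "__main__":
    sample = "Optical character recog-\nnition turns images\ninto text.\n\nSecond paragraph\nhere."
    print(unwrap_hard_linebreaks(sample))
```

Paste the text output of the OCR tool into the function and you get one line per paragraph, which is easier to reflow in a word processor.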

4. OCR Online

  • Input image file types: JPG, TIFF, PNG, GIF
  • Input PDF files: yes
  • Output file types: TXT, PDF, RTF, DOC
  • Languages: 150+
OCRonline PROS:
  • Excellent recognition quality
  • Rebuilds original formatting
  • Impressive list of 150 language dictionaries
OCRonline CONS:
  • Limited upload capacity: 5 pages per week, file size up to 10 MB. You need to pay to get extra pages.

5. Online OCR

  • Input image file types: JPG, JPEG, BMP, TIFF, GIF
  • Input PDF files: only for registered users
  • Output file types: DOC, XLS, TXT (+ PDF for registered users)
  • Languages: 30+

Part 2: Desktop OCR software

Desktop OCR software has to be downloaded and installed on your computer. Some programs can also read images directly from a scanner.

6. Cuneiform OpenOCR

OpenOCR is based on the commercial product Cuneiform, which was released as freeware in 2007.

  • License: freeware
  • Input image: most bitmap file formats
  • Input PDF: no
  • Scanner input: yes
  • Output: TXT, RTF, HTML + output to Word/Excel
  • Dictionary languages: 20+
Cuneiform OpenOCR PROS:
  • Includes both single-file and batch processing modes.
Cuneiform OpenOCR CONS:
  • The installation program creates invalid Start Menu shortcuts like NewFolder1

7. Free OCR

This is one of the programs that use the open source Tesseract OCR engine. Tesseract was originally developed by HP and is currently sponsored by Google.

  • License: freeware
  • Requires: Microsoft .NET
  • Input image: TIFF, multi-page TIFF
  • Input PDF: yes
  • Scanner input: yes
  • Output: TXT
  • Dictionary languages: 9
FreeOCR PROS:
  • Tesseract OCR engine has good accuracy.
FreeOCR CONS:
  • Only text output, no formatting recognition
  • No multi-column support (must crop the image manually to one column)
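
If you are comfortable with a little scripting, the manual cropping step can be automated. The sketch below does not use FreeOCR; it talks to the same Tesseract engine through the third-party Pillow and pytesseract packages, which you would need to install yourself along with Tesseract. It assumes the page has equal-width columns with no wide gutters, and the function names column_boxes and ocr_columns are made up for illustration.

```python
def column_boxes(width, height, columns):
    """Return (left, upper, right, lower) crop boxes for equal-width columns."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    # Assumption: the columns split the page width evenly.
    edges = [round(i * width / columns) for i in range(columns + 1)]
    return [(edges[i], 0, edges[i + 1], height) for i in range(columns)]


def ocr_columns(image_path, columns=2, lang="eng"):
    """Crop the page into columns and OCR them left to right."""
    # Third-party imports are kept local so column_boxes works without them.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        boxes = column_boxes(image.width, image.height, columns)
        texts = [
            pytesseract.image_to_string(image.crop(box), lang=lang)
            for box in boxes
        ]
    return "\n".join(texts)


if __name__ == "__main__":
    # Crop boxes for a 1000 x 800 pixel page with two columns.
    print(column_boxes(1000, 800, 2))
```

For a real page you would call something like ocr_columns("scan.png", columns=2), where scan.png is a placeholder file name. Pages with uneven columns still need boxes chosen by hand.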

8. gImage Reader

gImageReader is one of the front-ends to the free Tesseract OCR engine. You need to download and install Tesseract separately. The Tesseract engine uses OpenOffice dictionaries and spellcheckers, which are also downloaded separately.

  • License: freeware (GNU)
  • Requires: Tesseract, need to download separately
  • Input PDF: yes
  • Dictionary languages: many, uses freely downloadable OpenOffice spellcheckers
  • Scanner input: yes
  • Input image: JPEG, GIF, PNG, TIFF
  • Output: TXT
gImageReader PROS:
  • Tesseract OCR engine has good accuracy
  • OCR area(s) can be manually selected
gImageReader CONS:
  • Only text output, no formatting recognition
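
Because Tesseract is installed separately, it can also be run without any front-end. The following Python sketch checks that the tesseract program is on your PATH and then runs it on one image. It assumes a Tesseract version that accepts "stdout" as the output name, and the helper names build_tesseract_command and run_tesseract are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base="stdout", lang="eng"):
    """Build the argument list: tesseract <image> <output base> -l <language>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path, lang="eng"):
    """Run the separately installed Tesseract engine and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract was not found on PATH; install it first")
    command = build_tesseract_command(image_path, "stdout", lang)
    # check=True raises CalledProcessError if Tesseract reports a failure.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # "page.png" is a placeholder file name.
    print(" ".join(build_tesseract_command("page.png")))
```

Like the front-ends above, this gives plain text only, with no formatting.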

9. Puma.NET

Puma.NET is not really an end-user solution but a development kit based on the CuneiForm OCR engine, though it includes a sample program with a front-end.

After installation there is no launch icon in the Start Menu, but you can find the program Puma.Net.Sample.exe deep in the C:\Program Files\Puma.NET\Sample\bin\x86\Debug folder.

  • License: freeware (BSD)
  • Requires: Microsoft .NET
  • Input image: BMP, GIF, EXIF, JPG, PNG and TIFF
  • Input PDF: no
  • Scanner input: no
  • Output: TXT, RTF, HTML
  • Dictionary languages: 27
Puma.NET PROS:
  • Font and formatting detection
Puma.NET CONS:
  • You have to create the shortcut to start the program yourself
  • Leaves “hard” line breaks

10. Simple OCR

SimpleOCR uses its own OCR engine that is capable of learning the fonts in a particular document.

  • License: free for all non-commercial purposes
  • Input image: TIFF, JPG, BMP
  • Input PDF: no
  • Scanner input: yes
  • Output: DOC, TXT
  • Dictionary languages: 3

Note: SimpleOCR seems to give better results from color JPEGs, not grayscale.

SimpleOCR PROS:
  • Word-by-word text revision
  • Ability to train the engine to use specific fonts
  • Includes both single-file and batch processing modes
SimpleOCR CONS:
  • Dictionaries for only 3 languages
  • No font and format detection
 
