How to do OCR in C# with OCR Tesseract

Tesseract for C# is a library that allows us to use the popular OCR Tesseract from a .NET application.

OCR Tesseract is an open-source optical character recognition (OCR) library that extracts the text contained in images such as screenshots or scanned documents.

OCR Tesseract was originally developed at HP Labs and released as open source in 2005. Google sponsored its development from 2006, and it is now maintained by the open-source community.

From C#, we use Tesseract through a wrapper that exposes the functions of the native library to .NET.

OCR accuracy depends heavily on the quality of the source image, and results are often imperfect. Tesseract is not infallible either, but it is one of the most widely used open-source libraries for converting images to text.

How to use OCR Tesseract

We can easily add the library to a .NET project through the corresponding NuGet package.

Install-Package Tesseract

We also need to download the trained data for our language from the repository at https://github.com/tesseract-ocr/tessdata/
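The download step can also be automated. The following is a minimal sketch, not part of the wrapper: the TessData class and its method names are hypothetical, and it assumes the language files are served from the main branch of the tessdata repository under the name <language>.traineddata.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, not part of the Tesseract wrapper.
public static class TessData
{
    // Assumption: language files live on the "main" branch of the tessdata repository.
    private const string BaseUrl = "https://github.com/tesseract-ocr/tessdata/raw/main/";

    // Builds the download URL for a language code such as "eng" or "spa".
    public static string UrlFor(string language) => BaseUrl + language + ".traineddata";

    // Downloads <language>.traineddata into tessdataDir unless it is already there.
    // Returns the full path of the local file.
    public static async Task<string> EnsureAsync(string tessdataDir, string language)
    {
        Directory.CreateDirectory(tessdataDir);
        string target = Path.Combine(tessdataDir, language + ".traineddata");
        if (File.Exists(target))
            return target; // already downloaded, nothing to do

        using var http = new HttpClient();
        byte[] data = await http.GetByteArrayAsync(UrlFor(language));
        await File.WriteAllBytesAsync(target, data);
        return target;
    }
}
```

Calling await TessData.EnsureAsync(@"C:\tessdata", "eng") once at startup leaves the folder ready for the engine. The file is only fetched the first time, so later runs do not need network access.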

Next, we can use Tesseract. To do this, we create an instance of the TesseractEngine class, which will be used to process the image. The constructor used here takes two parameters: the path to the folder containing the OCR Tesseract language files, and the language in which the text in the image is written.

The language must match one of the trained data files we downloaded earlier (for example, "eng" for eng.traineddata), which we also add to the solution as a resource.

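using System.IO;
using Tesseract;

...

// Load the English language data from the tessdata folder
using var engine = new TesseractEngine(@"C:\tessdata", "eng");
// Load the image and run recognition on it
using var image = Pix.LoadFromFile(@"C:\image.jpg");
using var page = engine.Process(image);

// Extract the recognized text and save it to a file
var text = page.GetText();
File.WriteAllText(@"C:\output.txt", text);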

Once the TesseractEngine instance is created, its Process method analyzes the image and returns a Page object, from which GetText retrieves the recognized text.
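Since recognition is not infallible, it can help to check how confident the engine is before trusting the text. The sketch below uses the wrapper's Page.GetMeanConfidence method, which returns a value between 0 and 1. The OcrWithConfidence class, the IsUsable helper and the 0.8 threshold are illustrative assumptions, not recommendations from the library.

```csharp
using System;
using Tesseract;

// Hypothetical helper class for illustration only.
public static class OcrWithConfidence
{
    // Treats a result as usable when the mean confidence reaches the threshold.
    // The default of 0.8 is an arbitrary value chosen for this example.
    public static bool IsUsable(float meanConfidence, float threshold = 0.8f)
        => meanConfidence >= threshold;

    // Runs OCR on an image and returns the text, or an empty string
    // when the engine's mean confidence is below the threshold.
    public static string Read(string tessdataDir, string language, string imagePath)
    {
        using var engine = new TesseractEngine(tessdataDir, language);
        using var image = Pix.LoadFromFile(imagePath);
        using var page = engine.Process(image);

        float confidence = page.GetMeanConfidence(); // value between 0 and 1
        Console.WriteLine($"Mean confidence: {confidence:P0}");

        return IsUsable(confidence) ? page.GetText() : string.Empty;
    }
}
```

A call such as OcrWithConfidence.Read(@"C:\tessdata", "eng", @"C:\image.jpg") would then return an empty string for images the engine could not read reliably, which is a convenient signal to retry with a cleaner scan.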

The Tesseract wrapper for .NET is open source, and all the code and documentation are available in the project's repository at https://github.com/charlesw/tesseract