How to do OCR in C# with Tesseract

Jun 2023

Tesseract for C# is a library that allows us to use the popular OCR Tesseract from a .NET application.

OCR Tesseract is an open-source Optical Character Recognition (OCR) library that extracts text from images such as screenshots or scanned documents.

OCR Tesseract was originally developed at HP Labs and released as open source in 2005. Google developed it from 2006 until 2018, and it is now maintained by the open-source community.

The wrapper exposes Tesseract's native functions to .NET, so we can call them directly from C# code.

OCR rarely yields perfect results, especially with noisy or low-resolution images. Tesseract is not infallible either, but it is one of the most widely used open-source libraries for converting images to text.

How to use OCR Tesseract

We can easily add the library to a .NET project via the corresponding NuGet package.

Install-Package Tesseract

We also need to download the trained data for our language from this repository: https://github.com/tesseract-ocr/tessdata/
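A missing language file is a common cause of errors when creating the engine, so it can help to check for it first. The following is a minimal sketch, assuming the trained data files follow Tesseract's `<language>.traineddata` naming (for example `eng.traineddata`). The `TessData` class and its method names are hypothetical helpers for illustration, not part of the library.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of the Tesseract package).
static class TessData
{
    // Builds the expected path of the trained data file for a language code,
    // e.g. "eng" -> <tessdataDir>/eng.traineddata
    public static string PathFor(string tessdataDir, string language) =>
        Path.Combine(tessdataDir, language + ".traineddata");

    // Returns true when the trained data file for the language is present.
    public static bool IsInstalled(string tessdataDir, string language) =>
        File.Exists(PathFor(tessdataDir, language));
}
```

Calling `TessData.IsInstalled(@"C:\tessdata", "eng")` before creating the engine lets us show a clear message when the file has not been downloaded yet.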

Then we can start using Tesseract. To do this, we create an instance of the TesseractEngine class, which processes the image containing the text. The constructor used here takes two parameters: the path to the folder containing the OCR Tesseract language files, and the language in which the text in the image is written.

The language must match one of the datasets we downloaded earlier, which we will add as a resource to the solution.

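using System.IO;
using Tesseract;

...

// The engine, the image, and the page are all IDisposable,
// so we wrap them in using statements to release native resources.
using (var engine = new TesseractEngine(@"C:\tessdata", "eng"))
using (var image = Pix.LoadFromFile(@"C:\image.jpg"))
using (var page = engine.Process(image))
{
    // Extract the recognized text and save it to a file
    var text = page.GetText();
    File.WriteAllText(@"C:\output.txt", text);
}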


Once the instance of the TesseractEngine class is created, we pass the image to its Process method. This returns a page object, and calling GetText on it gives us the recognized text as a string.
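Since recognition quality varies from image to image, it can be useful to check how confident the engine is in its result. The sketch below is a small console program that prints the mean confidence next to the recognized text. The file paths are placeholders, and the code assumes the Tesseract NuGet package is installed and the English trained data is in the given folder.

```csharp
using System;
using Tesseract;

class Program
{
    static void Main()
    {
        // Placeholder paths: point them at your own tessdata folder and image.
        using (var engine = new TesseractEngine(@"C:\tessdata", "eng", EngineMode.Default))
        using (var image = Pix.LoadFromFile(@"C:\image.jpg"))
        using (var page = engine.Process(image))
        {
            // Recognized text for the whole page
            string text = page.GetText();

            // Mean confidence of the recognition, as a value between 0 and 1
            float confidence = page.GetMeanConfidence();

            Console.WriteLine($"Mean confidence: {confidence:P0}");
            Console.WriteLine(text);
        }
    }
}
```

A low confidence value usually suggests that the image needs preprocessing (for example, higher resolution or better contrast) before running OCR again.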

The Tesseract wrapper for .NET is open source, and all the code and documentation are available in the project repository at https://github.com/charlesw/tesseract