Class TextPostprocessor<T>

Namespace
AiDotNet.Postprocessing.Document
Assembly
AiDotNet.dll

TextPostprocessor - OCR text postprocessing utilities.

public class TextPostprocessor<T> : PostprocessorBase<T, string, string>, IPostprocessor<T, string, string>, IDisposable

Type Parameters

T

The numeric type used for calculations (for example, float).

Inheritance
PostprocessorBase<T, string, string>
TextPostprocessor<T>
Implements
IPostprocessor<T, string, string>
IDisposable

Remarks

TextPostprocessor provides a comprehensive pipeline for cleaning and correcting text output from OCR systems, improving readability and accuracy.

For Beginners: OCR output often contains recognition errors and formatting issues. This tool cleans up the text in four ways:

  • Removes unwanted characters
  • Fixes common OCR errors
  • Normalizes whitespace
  • Corrects formatting

Key features:

  • Character normalization
  • Whitespace handling
  • Common OCR error correction
  • Language-aware processing

Example usage:

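using AiDotNet.Postprocessing.Document;

// Sample input for illustration; in practice this comes from an OCR engine.
string rawOcrText = "Th1s  is   raw OCR   text.";
using var processor = new TextPostprocessor<float>();
string cleanText = processor.Process(rawOcrText);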

Constructors

TextPostprocessor()

Creates a new TextPostprocessor with default options.

public TextPostprocessor()

TextPostprocessor(TextPostprocessorOptions)

Creates a new TextPostprocessor with specified options.

public TextPostprocessor(TextPostprocessorOptions options)

Parameters

options TextPostprocessorOptions

Properties

SupportsInverse

Gets a value indicating whether this postprocessor supports an inverse transformation. The text postprocessor does; its inverse returns the original text.

public override bool SupportsInverse { get; }

Property Value

bool

Methods

Dispose()

Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.

public void Dispose()

Dispose(bool)

Releases resources used by the text postprocessor.

protected virtual void Dispose(bool disposing)

Parameters

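disposing bool

Following the standard .NET dispose pattern: true when called from Dispose(), false when called from a finalizer.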

ExtractParagraphs(string)

Extracts paragraphs from processed text.

public IList<string> ExtractParagraphs(string text)

Parameters

text string

Returns

IList<string>
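
This page does not document how paragraph boundaries are detected. As a rough illustration only, the standalone sketch below assumes that one or more blank lines separate paragraphs. The class ParagraphSketch and its method SplitParagraphs are hypothetical names and are not part of AiDotNet.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParagraphSketch
{
    // Hypothetical helper, not the library's implementation.
    // Assumption: a paragraph break is two or more consecutive line breaks,
    // where the lines in between may contain only spaces or tabs.
    public static IList<string> SplitParagraphs(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return new List<string>();

        return Regex.Split(text, @"(?:\r?\n[ \t]*){2,}")
            .Select(p => p.Trim())
            .Where(p => p.Length > 0)
            .ToList();
    }
}
```

A single line break stays inside its paragraph, so text that still contains hard-wrapped lines keeps those wraps in each returned item.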

ExtractSentences(string)

Extracts sentences from processed text.

public IList<string> ExtractSentences(string text)

Parameters

text string

Returns

IList<string>
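
The sentence-splitting rules used by the library are not stated here. A minimal standalone sketch, assuming a sentence ends at '.', '!' or '?' followed by whitespace, could look like this. SentenceSketch and SplitSentences are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSketch
{
    // Hypothetical helper, not the library's implementation.
    // Splits at whitespace that directly follows '.', '!' or '?'.
    // Known limitation: abbreviations such as "Dr." are split as well.
    public static IList<string> SplitSentences(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return new List<string>();

        return Regex.Split(text.Trim(), @"(?<=[.!?])\s+")
            .Where(s => s.Length > 0)
            .ToList();
    }
}
```

The lookbehind keeps the terminating punctuation attached to each sentence instead of consuming it as a separator.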

FixCommonOcrErrors(string)

Fixes common OCR recognition errors.

public string FixCommonOcrErrors(string text)

Parameters

text string

Returns

string
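
Which recognition errors the method corrects is not listed on this page. To show the general idea, the standalone sketch below handles one well-known class of OCR mistakes: the digits 0 and 1 appearing between letters where 'o' and 'l' were meant. OcrFixSketch and FixDigitLookalikes are hypothetical names, and the two rules are assumptions chosen for illustration.

```csharp
using System.Text.RegularExpressions;

public static class OcrFixSketch
{
    // Hypothetical helper, not the library's implementation.
    // Replaces '0' with 'o' and '1' with 'l' only when the digit sits
    // between two ASCII letters, so standalone numbers are left untouched.
    public static string FixDigitLookalikes(string text)
    {
        string result = Regex.Replace(text, @"(?<=[A-Za-z])0(?=[A-Za-z])", "o");
        result = Regex.Replace(result, @"(?<=[A-Za-z])1(?=[A-Za-z])", "l");
        return result;
    }
}
```

Restricting the replacement to digits surrounded by letters is a conservative choice: it leaves years, quantities, and identifiers at word boundaries alone, at the cost of missing errors at the start or end of a word.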

MergeBrokenLines(string)

Merges lines that were incorrectly broken.

public string MergeBrokenLines(string text)

Parameters

text string

Returns

string
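
The criteria for an "incorrectly broken" line are not specified here. One common heuristic, sketched below as a standalone example, rejoins words hyphenated across a line break and replaces a single line break with a space unless the preceding line ends in sentence punctuation. LineMergeSketch and Merge are hypothetical names, and both rules are assumptions.

```csharp
using System.Text.RegularExpressions;

public static class LineMergeSketch
{
    // Hypothetical helper, not the library's implementation.
    public static string Merge(string text)
    {
        // Work on '\n' only so the patterns below stay simple.
        string result = text.Replace("\r\n", "\n");

        // "docu-\nment" -> "document": a hyphen after a letter, a line break,
        // then a lowercase letter is treated as a split word.
        result = Regex.Replace(result, @"(?<=\p{L})-\n(?=\p{Ll})", "");

        // Replace a lone line break with a space unless the previous character
        // is '.', '!', '?', ':' or another line break, or a blank line follows.
        result = Regex.Replace(result, @"(?<![.!?:\n])\n(?!\n)", " ");

        return result;
    }
}
```

Blank lines are preserved, so paragraph boundaries survive the merge.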

NormalizeCharacters(string)

Normalizes special characters to ASCII equivalents.

public string NormalizeCharacters(string text)

Parameters

text string

Returns

string
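
The exact character table used by the library is not shown on this page. As an illustration, the standalone sketch below maps a few typographic characters that frequently appear in OCR output to ASCII. CharacterNormalizationSketch, ToAscii, and the contents of the mapping are all hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CharacterNormalizationSketch
{
    // Illustrative mapping only; the library's actual table may differ.
    private static readonly Dictionary<char, string> AsciiMap = new Dictionary<char, string>
    {
        ['\u2018'] = "'",   // left single quotation mark
        ['\u2019'] = "'",   // right single quotation mark
        ['\u201C'] = "\"",  // left double quotation mark
        ['\u201D'] = "\"",  // right double quotation mark
        ['\u2013'] = "-",   // en dash
        ['\u2014'] = "-",   // em dash
        ['\u2026'] = "...", // horizontal ellipsis
        ['\uFB01'] = "fi",  // fi ligature
        ['\uFB02'] = "fl",  // fl ligature
        ['\u00A0'] = " ",   // no-break space
    };

    // Hypothetical helper: replaces mapped characters, copies all others unchanged.
    public static string ToAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            sb.Append(AsciiMap.TryGetValue(c, out var replacement) ? replacement : c.ToString());
        }
        return sb.ToString();
    }
}
```

Characters outside the table, including accented letters, pass through untouched in this sketch.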

NormalizeWhitespace(string)

Normalizes whitespace in the text.

public string NormalizeWhitespace(string text)

Parameters

text string

Returns

string
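
What "normalizes" covers is not spelled out here. A plausible reading, sketched below as a standalone example, unifies line endings, collapses runs of spaces, tabs, and no-break spaces into one space, and trims each line. WhitespaceSketch and Normalize are hypothetical names, and the rules are assumptions.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class WhitespaceSketch
{
    // Hypothetical helper, not the library's implementation.
    public static string Normalize(string text)
    {
        // Convert Windows and old Mac line endings to '\n'.
        string unified = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // Within each line, collapse horizontal whitespace and trim the ends.
        var lines = unified
            .Split('\n')
            .Select(line => Regex.Replace(line, @"[ \t\u00A0]+", " ").Trim());

        return string.Join("\n", lines);
    }
}
```

Line breaks themselves are kept, so the number of lines in the output matches the input.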

ProcessCore(string)

Processes OCR text through the full postprocessing pipeline.

protected override string ProcessCore(string input)

Parameters

input string

The raw OCR text.

Returns

string

The cleaned and corrected text.

RemoveControlCharacters(string)

Removes control characters from text.

public string RemoveControlCharacters(string text)

Parameters

text string

Returns

string
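
Whether the method keeps line breaks and tabs is not documented on this page. The standalone sketch below assumes it should, and drops every other character that char.IsControl reports as a control character. ControlCharSketch and Strip are hypothetical names.

```csharp
using System.Text;

public static class ControlCharSketch
{
    // Hypothetical helper, not the library's implementation.
    // Assumption: '\n', '\r' and '\t' carry layout information and are kept.
    public static string Strip(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (!char.IsControl(c) || c == '\n' || c == '\r' || c == '\t')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```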

RemoveDuplicateSpaces(string)

Removes duplicate consecutive spaces.

public string RemoveDuplicateSpaces(string text)

Parameters

text string

Returns

string

RemoveHeadersFooters(string, int, int)

Removes headers and footers from document text.

public string RemoveHeadersFooters(string text, int headerLines = 2, int footerLines = 2)

Parameters

text string
headerLines int
footerLines int

Returns

string
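
The page does not say whether headerLines and footerLines apply per page or to the text as a whole. The standalone sketch below assumes the simplest reading: the input is a single page, and the first headerLines and last footerLines lines are dropped. HeaderFooterSketch and Strip are hypothetical names.

```csharp
using System;
using System.Linq;

public static class HeaderFooterSketch
{
    // Hypothetical helper, not the library's implementation.
    // Assumption: the whole input is treated as one page.
    public static string Strip(string text, int headerLines = 2, int footerLines = 2)
    {
        if (headerLines < 0) throw new ArgumentOutOfRangeException(nameof(headerLines));
        if (footerLines < 0) throw new ArgumentOutOfRangeException(nameof(footerLines));

        string[] lines = text.Replace("\r\n", "\n").Split('\n');
        int keep = lines.Length - headerLines - footerLines;

        // Nothing is left when the page is shorter than header plus footer.
        if (keep <= 0) return string.Empty;

        return string.Join("\n", lines.Skip(headerLines).Take(keep));
    }
}
```

For multi-page input, a caller following this approach would split the text into pages first and apply the helper to each page.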

RemovePageNumbers(string)

Removes page numbers from text.

public string RemovePageNumbers(string text)

Parameters

text string

Returns

string
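
How page numbers are recognized is not described here. As a standalone illustration, the sketch below removes lines that consist solely of a number, of "Page N" or "Page N of M", or of a number between hyphens such as "- 4 -". PageNumberSketch, Remove, and the pattern are hypothetical.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class PageNumberSketch
{
    // Assumed page-number line formats: "12", "Page 3", "Page 3 of 10", "- 4 -".
    private static readonly Regex PageNumberLine = new Regex(
        @"^\s*(?:(?:page\s+)?\d+(?:\s+of\s+\d+)?|-\s*\d+\s*-)\s*$",
        RegexOptions.IgnoreCase);

    // Hypothetical helper, not the library's implementation.
    public static string Remove(string text)
    {
        var kept = text
            .Replace("\r\n", "\n")
            .Split('\n')
            .Where(line => !PageNumberLine.IsMatch(line));

        return string.Join("\n", kept);
    }
}
```

A line that contains only a number for another reason, such as a table cell on its own line, is removed too, so this heuristic suits running prose better than tabular content.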

ValidateInput(string)

Validates the input text.

protected override void ValidateInput(string input)

Parameters

input string