Class TextPostprocessor<T>
- Namespace
- AiDotNet.Postprocessing.Document
- Assembly
- AiDotNet.dll
TextPostprocessor - OCR text postprocessing utilities.
public class TextPostprocessor<T> : PostprocessorBase<T, string, string>, IPostprocessor<T, string, string>, IDisposable
Type Parameters
T The numeric type for calculations.
- Inheritance
- PostprocessorBase<T, string, string>
- TextPostprocessor<T>
- Implements
- IPostprocessor<T, string, string>
- IDisposable
- Inherited Members
Remarks
TextPostprocessor provides a comprehensive pipeline for cleaning and correcting text output from OCR systems, improving readability and accuracy.
For Beginners: OCR output often contains errors and formatting issues. This tool cleans up the text:
- Remove unwanted characters
- Fix common OCR errors
- Normalize whitespace
- Correct formatting
Key features:
- Character normalization
- Whitespace handling
- Common OCR error correction
- Language-aware processing
Example usage:
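// rawOcrText holds the unprocessed string returned by the OCR engine.
using var processor = new TextPostprocessor<float>(); // disposed at end of scope
var cleanText = processor.Process(rawOcrText);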
Constructors
TextPostprocessor()
Creates a new TextPostprocessor with default options.
public TextPostprocessor()
TextPostprocessor(TextPostprocessorOptions)
Creates a new TextPostprocessor with specified options.
public TextPostprocessor(TextPostprocessorOptions options)
Parameters
options TextPostprocessorOptions
Properties
SupportsInverse
Text postprocessor supports inverse transformation (returns original).
public override bool SupportsInverse { get; }
Property Value
- bool
Methods
Dispose()
Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.
public void Dispose()
Dispose(bool)
Releases resources used by the text postprocessor.
protected virtual void Dispose(bool disposing)
Parameters
disposing bool
ExtractParagraphs(string)
Extracts paragraphs from processed text.
public IList<string> ExtractParagraphs(string text)
Parameters
text string
Returns
- IList<string>
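The page does not say how paragraph boundaries are detected. As a rough illustration only, and not the library's implementation, the sketch below assumes that one or more blank lines separate paragraphs. `ParagraphSketch` and `SplitParagraphs` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ParagraphSketch
{
    // Hypothetical helper: treats one or more blank lines as a paragraph break.
    public static IList<string> SplitParagraphs(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return new List<string>();

        // A break is a newline, optional whitespace, then another newline.
        return Regex.Split(text, @"\r?\n\s*\r?\n")
            .Select(p => p.Trim())
            .Where(p => p.Length > 0)
            .ToList();
    }
}
```

Calling `ParagraphSketch.SplitParagraphs("First.\n\nSecond.")` yields two entries under this assumption.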
ExtractSentences(string)
Extracts sentences from processed text.
public IList<string> ExtractSentences(string text)
Parameters
text string
Returns
- IList<string>
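Sentence splitting rules are not documented on this page. The following is a minimal sketch, assuming a sentence ends at `.`, `!` or `?` followed by whitespace. It is not the library's algorithm, and `SentenceSketch` is a hypothetical name. Abbreviations such as "Dr." would be split incorrectly by this simple rule.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSketch
{
    // Hypothetical helper: splits after terminal punctuation followed by whitespace.
    public static IList<string> SplitSentences(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return new List<string>();

        // The lookbehind keeps the punctuation attached to its sentence.
        return Regex.Split(text.Trim(), @"(?<=[.!?])\s+")
            .Where(s => s.Length > 0)
            .ToList();
    }
}
```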
FixCommonOcrErrors(string)
Fixes common OCR recognition errors.
public string FixCommonOcrErrors(string text)
Parameters
text string
Returns
- string
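Which substitutions the library applies is not listed here. To give a feel for the kind of correction involved, the sketch below assumes one classic OCR confusion: the digits `0` and `1` appearing between letters where `o` and `l` were intended. `OcrErrorSketch` is a hypothetical name, and the rule set is an assumption made for illustration.

```csharp
using System.Text.RegularExpressions;

public static class OcrErrorSketch
{
    // Hypothetical helper: fixes digit/letter confusions inside words only.
    public static string FixDigitConfusions(string text)
    {
        // A 0 with a letter on both sides is read as the letter o.
        string s = Regex.Replace(text, @"(?<=[A-Za-z])0(?=[A-Za-z])", "o");
        // A 1 with a letter on both sides is read as the letter l.
        s = Regex.Replace(s, @"(?<=[A-Za-z])1(?=[A-Za-z])", "l");
        return s;
    }
}
```

Requiring letters on both sides leaves real numbers such as `101` untouched.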
MergeBrokenLines(string)
Merges lines that were incorrectly broken.
public string MergeBrokenLines(string text)
Parameters
text string
Returns
- string
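The merge heuristic is not specified on this page. One plausible approach, sketched below under stated assumptions, joins a line to the next when it ends in a hyphen (dropping the hyphen) or when it does not end in sentence punctuation. Blank lines are kept as paragraph breaks. `LineMergeSketch` is a hypothetical name, not part of AiDotNet.

```csharp
using System.Text;

public static class LineMergeSketch
{
    // Hypothetical helper: rejoins lines that a scanner wrapped mid-sentence.
    public static string Merge(string text)
    {
        string[] lines = text.Replace("\r\n", "\n").Split('\n');
        var sb = new StringBuilder();

        for (int i = 0; i < lines.Length; i++)
        {
            string line = lines[i].TrimEnd();
            bool hasNext = i + 1 < lines.Length && lines[i + 1].Trim().Length > 0;
            bool endsWithHyphen = line.Length > 0 && line[line.Length - 1] == '-';

            if (hasNext && endsWithHyphen)
            {
                // Word split across lines: drop the hyphen and add no space.
                sb.Append(line, 0, line.Length - 1);
            }
            else if (hasNext && line.Length > 0 && !EndsSentence(line))
            {
                // Sentence continues on the next line: join with one space.
                sb.Append(line).Append(' ');
            }
            else
            {
                // Genuine line end: keep the break unless this is the last line.
                sb.Append(line);
                if (i + 1 < lines.Length) sb.Append('\n');
            }
        }
        return sb.ToString();
    }

    private static bool EndsSentence(string line) =>
        ".!?:".IndexOf(line[line.Length - 1]) >= 0;
}
```

Note that this simple rule also joins genuinely hyphenated compounds that happen to fall at a line end.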
NormalizeCharacters(string)
Normalizes special characters to ASCII equivalents.
public string NormalizeCharacters(string text)
Parameters
text string
Returns
- string
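The exact character table used by the library is not shown here. As an illustration, the sketch below maps a handful of typographic characters that often appear in OCR output (curly quotes, dashes, the ellipsis, two ligatures, the non-breaking space) to ASCII. The table and the name `CharNormalizeSketch` are assumptions made for this example.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CharNormalizeSketch
{
    // Assumed mapping; a real table would likely be much larger.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        ['\u2018'] = "'",  ['\u2019'] = "'",   // single curly quotes
        ['\u201C'] = "\"", ['\u201D'] = "\"",  // double curly quotes
        ['\u2013'] = "-",  ['\u2014'] = "-",   // en and em dashes
        ['\u2026'] = "...",                    // horizontal ellipsis
        ['\uFB01'] = "fi", ['\uFB02'] = "fl",  // fi and fl ligatures
        ['\u00A0'] = " ",                      // non-breaking space
    };

    public static string ToAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            // Replace mapped characters; copy everything else unchanged.
            if (Map.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```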
NormalizeWhitespace(string)
Normalizes whitespace in the text.
public string NormalizeWhitespace(string text)
Parameters
text string
Returns
- string
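What "normalizes" covers is not spelled out here. A common interpretation, sketched below as an assumption, is to unify line endings, collapse runs of spaces and tabs to a single space, and trim each line. `WhitespaceSketch` is a hypothetical name, and the library may behave differently.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class WhitespaceSketch
{
    // Hypothetical helper: one possible whitespace normalization.
    public static string Normalize(string text)
    {
        // Unify Windows (\r\n) and bare \r line endings to \n.
        string s = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse any run of whitespace other than a newline to a single space.
        s = Regex.Replace(s, @"[^\S\n]+", " ");

        // Trim leading and trailing whitespace on every line, then on the whole text.
        var lines = s.Split('\n').Select(l => l.Trim());
        return string.Join("\n", lines).Trim();
    }
}
```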
ProcessCore(string)
Processes OCR text through the full postprocessing pipeline.
protected override string ProcessCore(string input)
Parameters
input string The raw OCR text.
Returns
- string
The cleaned and corrected text.
RemoveControlCharacters(string)
Removes control characters from text.
public string RemoveControlCharacters(string text)
Parameters
text string
Returns
- string
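Whether the library keeps newlines and tabs is not stated on this page. The sketch below assumes it does, and filters everything else that `char.IsControl` reports as a control character. `ControlCharSketch` is a hypothetical name used only for this illustration.

```csharp
using System.Text;

public static class ControlCharSketch
{
    // Hypothetical helper: drops control characters but keeps line breaks and tabs.
    public static string Strip(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            bool keep = !char.IsControl(c) || c == '\n' || c == '\r' || c == '\t';
            if (keep) sb.Append(c);
        }
        return sb.ToString();
    }
}
```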
RemoveDuplicateSpaces(string)
Removes duplicate consecutive spaces.
public string RemoveDuplicateSpaces(string text)
Parameters
text string
Returns
- string
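For comparison with the broader whitespace normalization, collapsing repeated spaces alone can be sketched in one regular expression. This is an assumed equivalent, not the library's code, and `DuplicateSpaceSketch` is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

public static class DuplicateSpaceSketch
{
    // Hypothetical helper: replaces two or more consecutive spaces with one.
    public static string Collapse(string text)
    {
        return Regex.Replace(text, " {2,}", " ");
    }
}
```

Tabs and newlines are left untouched by this pattern.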
RemoveHeadersFooters(string, int, int)
Removes headers and footers from document text.
public string RemoveHeadersFooters(string text, int headerLines = 2, int footerLines = 2)
Parameters
text string
headerLines int
footerLines int
Returns
- string
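The page does not say whether headers and footers are removed per page or once for the whole text. The sketch below takes the simplest reading, which is an assumption: drop the first `headerLines` and the last `footerLines` lines of the input. `HeaderFooterSketch` is a hypothetical name; the defaults mirror the signature above.

```csharp
using System.Linq;

public static class HeaderFooterSketch
{
    // Hypothetical helper: trims a fixed number of lines from both ends of the text.
    public static string Strip(string text, int headerLines = 2, int footerLines = 2)
    {
        string[] lines = text.Replace("\r\n", "\n").Split('\n');

        // Number of body lines left after removing header and footer.
        int keep = lines.Length - headerLines - footerLines;
        if (keep <= 0)
            return string.Empty;

        return string.Join("\n", lines.Skip(headerLines).Take(keep));
    }
}
```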
RemovePageNumbers(string)
Removes page numbers from text.
public string RemovePageNumbers(string text)
Parameters
text string
Returns
- string
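How page numbers are recognized is not documented here. A minimal sketch, assuming a page number sits alone on its line either as bare digits or in the form "Page N" or "Page N of M", could look like the following. `PageNumberSketch` is a hypothetical name and the pattern is an assumption.

```csharp
using System.Text.RegularExpressions;

public static class PageNumberSketch
{
    // Matches a whole line that is only a number, "Page N", or "Page N of M",
    // including its trailing line break.
    private static readonly Regex PageLine = new Regex(
        @"^[ \t]*(?:page[ \t]+)?\d+(?:[ \t]+of[ \t]+\d+)?[ \t]*\r?$\n?",
        RegexOptions.Multiline | RegexOptions.IgnoreCase);

    // Hypothetical helper: deletes lines that look like page numbers.
    public static string Remove(string text)
    {
        return PageLine.Replace(text, string.Empty);
    }
}
```

A line containing only a year or another standalone number would also be removed by this pattern, so a production version would need extra context checks.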
ValidateInput(string)
Validates the input text.
protected override void ValidateInput(string input)
Parameters
input string