Interface IDocumentModel<T>

Namespace
AiDotNet.Document.Interfaces
Assembly
AiDotNet.dll

Base interface for all document AI models in AiDotNet.

public interface IDocumentModel<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations (e.g., float, double).

Remarks

This interface extends IFullModel<T, TInput, TOutput> with both TInput and TOutput fixed to Tensor<T>. It provides the core contract for document AI models and inherits the standard methods for training, inference, model persistence, and gradient computation.

For Beginners: A document AI model processes document images (scanned pages, PDFs, photos of text) to extract information, understand layout, or answer questions.

Key concepts:

  • Document images have shape [batch, channels, height, width]
  • Models can run in Native mode (pure C#) or ONNX mode (optimized runtime)
  • All models support inference; training uses Native mode (see IsOnnxMode)
  • Many document models combine vision and language understanding

Example usage:

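// 'architecture' and 'documentImage' are assumed to be created elsewhere.
var model = new LayoutLMv3<double>(architecture);
// DetectLayout and ExtractText are not listed among the IDocumentModel<T> members on this page.
var layout = model.DetectLayout(documentImage);
var text = model.ExtractText(documentImage);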
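
To make the [batch, channels, height, width] layout from the key concepts concrete, the sketch below computes where a single pixel value would sit in a flat buffer. It assumes a contiguous row-major layout, which this page does not confirm for Tensor<T>. NchwSketch and FlatIndex are hypothetical names, not AiDotNet APIs.

```csharp
using System;

// Hypothetical helper (not an AiDotNet API). Assumes a contiguous,
// row-major buffer laid out as [batch, channels, height, width].
public static class NchwSketch
{
    public static int FlatIndex(int[] shape, int n, int c, int h, int w)
    {
        if (shape is null || shape.Length != 4)
            throw new ArgumentException("Expected [batch, channels, height, width].", nameof(shape));
        if (n < 0 || n >= shape[0] || c < 0 || c >= shape[1] ||
            h < 0 || h >= shape[2] || w < 0 || w >= shape[3])
            throw new ArgumentOutOfRangeException(nameof(n), "Index outside the given shape.");

        // Width varies fastest, then height, then channel, then batch.
        return ((n * shape[1] + c) * shape[2] + h) * shape[3] + w;
    }
}
```

Under that assumption, moving one step along the channel axis skips height * width values, and moving one step along the batch axis skips channels * height * width values.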

Properties

ExpectedImageSize

Gets the expected input image size (assumes square images).

int ExpectedImageSize { get; }

Property Value

int

Remarks

Common values: 224 (ViT base), 384, 448, 512, 768, 1024. Input images are resized to [ExpectedImageSize x ExpectedImageSize] before processing.
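
As a rough illustration of the resize described above, the following sketch computes the shape an image tensor would have afterwards. It works on plain int[] shapes so that it makes no assumption about the Tensor<T> API. ImageSizeSketch and ResizedShape are hypothetical names, and the actual interpolation is out of scope here.

```csharp
using System;

// Hypothetical helper (not an AiDotNet API): the shape of a document image
// after it is resized to a square of side expectedImageSize.
public static class ImageSizeSketch
{
    // shape is [batch, channels, height, width] or [channels, height, width].
    public static int[] ResizedShape(int[] shape, int expectedImageSize)
    {
        if (shape is null || (shape.Length != 3 && shape.Length != 4))
            throw new ArgumentException("Expected a 3D or 4D shape.", nameof(shape));
        if (expectedImageSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(expectedImageSize));

        int[] resized = (int[])shape.Clone();
        resized[resized.Length - 2] = expectedImageSize; // height
        resized[resized.Length - 1] = expectedImageSize; // width
        return resized;
    }
}
```

Only the last two dimensions change; the batch and channel dimensions are carried over untouched. Because the target is square, a non-square page is stretched or squeezed unless the caller pads it first.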

IsOnnxMode

Gets whether this model is running in ONNX inference mode.

bool IsOnnxMode { get; }

Property Value

bool

Remarks

When true, the model uses pre-trained ONNX weights for fast inference. When false, the model uses native layers and can be trained.
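
The remarks above imply that training is only available when IsOnnxMode is false. A caller-side guard could look like the sketch below. OnnxModeSketch and EnsureTrainable are hypothetical names, and how the library itself reacts to a training call in ONNX mode is not stated on this page.

```csharp
using System;

// Hypothetical guard (not an AiDotNet API). Pass the value of model.IsOnnxMode.
public static class OnnxModeSketch
{
    public static void EnsureTrainable(bool isOnnxMode)
    {
        // Assumption drawn from the remarks: ONNX mode is inference-only.
        if (isOnnxMode)
            throw new InvalidOperationException(
                "Model is in ONNX inference mode; create it in native mode to train.");
    }
}
```

Calling EnsureTrainable(model.IsOnnxMode) before a training loop turns a late, possibly confusing failure into an early and explicit one.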

MaxSequenceLength

Gets the maximum sequence length for text processing.

int MaxSequenceLength { get; }

Property Value

int

Remarks

For layout-aware models, this is the maximum number of text tokens. Common values: 512, 1024, 2048.
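
This page does not say what a model does with input longer than MaxSequenceLength. One simple caller-side option is to truncate the token sequence first, as in this sketch. SequenceLengthSketch and Truncate are hypothetical names, and whether special tokens count toward the limit is left open.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an AiDotNet API): keeps at most
// maxSequenceLength token ids, dropping the tail.
public static class SequenceLengthSketch
{
    public static List<int> Truncate(IReadOnlyList<int> tokenIds, int maxSequenceLength)
    {
        if (tokenIds is null) throw new ArgumentNullException(nameof(tokenIds));
        if (maxSequenceLength < 0) throw new ArgumentOutOfRangeException(nameof(maxSequenceLength));

        int count = Math.Min(tokenIds.Count, maxSequenceLength);
        var kept = new List<int>(count);
        for (int i = 0; i < count; i++)
        {
            kept.Add(tokenIds[i]);
        }
        return kept;
    }
}
```

Truncation discards everything after the limit, so for long documents a caller may prefer to split the tokens into several windows instead.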

RequiresOCR

Gets whether this model requires OCR preprocessing.

bool RequiresOCR { get; }

Property Value

bool

Remarks

OCR-free models (Donut, Pix2Struct) return false because they process raw pixels directly. Layout-aware models (LayoutLM) return true because they need text and bounding boxes from OCR.
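
A caller can use this flag to decide whether the image alone is enough or whether OCR output must be supplied as well. The sketch below shows one way to express that check. OcrWord and OcrRequirementSketch are hypothetical types for illustration; the page does not describe how OCR results are actually passed to a model.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical OCR result: one recognized word and its bounding box in pixels.
public sealed class OcrWord
{
    public OcrWord(string text, int x, int y, int width, int height)
    {
        Text = text;
        X = x;
        Y = y;
        Width = width;
        Height = height;
    }

    public string Text { get; }
    public int X { get; }
    public int Y { get; }
    public int Width { get; }
    public int Height { get; }
}

// Hypothetical helper (not an AiDotNet API). Pass the value of model.RequiresOCR.
public static class OcrRequirementSketch
{
    // Returns true when the inputs at hand are sufficient for the model.
    public static bool HasRequiredInputs(bool requiresOcr, IReadOnlyCollection<OcrWord> ocrWords)
    {
        if (!requiresOcr)
        {
            return true; // OCR-free model: the image alone is enough
        }
        return ocrWords != null && ocrWords.Count > 0;
    }
}
```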

SupportedDocumentTypes

Gets the supported document types for this model.

DocumentType SupportedDocumentTypes { get; }

Property Value

DocumentType
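
The property returns a single DocumentType value under a plural name, which suggests, but does not confirm, a flags-style enum. Under that assumption, a support check could look like the sketch below. DocumentKinds and its members are invented stand-ins and are not the members of the real DocumentType enum.

```csharp
using System;

// Hypothetical stand-in for DocumentType. The member names and the [Flags]
// attribute are assumptions made for illustration only.
[Flags]
public enum DocumentKinds
{
    None = 0,
    Invoice = 1,
    Form = 2,
    Receipt = 4
}

public static class DocumentKindsSketch
{
    // True when every kind in 'wanted' is present in 'supported'.
    public static bool Supports(DocumentKinds supported, DocumentKinds wanted)
    {
        return wanted != DocumentKinds.None && (supported & wanted) == wanted;
    }
}
```

The bitwise test requires all requested kinds to be present, so a combined request such as Invoice | Receipt fails if either one is missing.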

Methods

EncodeDocument(Tensor<T>)

Processes a document image and returns encoded features.

Tensor<T> EncodeDocument(Tensor<T> documentImage)

Parameters

documentImage Tensor<T>

The document image tensor [batch, channels, height, width] or [channels, height, width].

Returns

Tensor<T>

Encoded document features suitable for downstream tasks.

Remarks

For Beginners: This method converts a document image into a numerical representation (feature vector) that captures the document's content and structure. These features can then be used for tasks like classification, QA, or information extraction.
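
EncodeDocument accepts both batched and unbatched images. A single image can be viewed as a batch of one, which the following sketch expresses at the shape level. BatchShapeSketch and WithBatchDimension are hypothetical names, and whether the library performs this promotion internally is not stated here.

```csharp
using System;

// Hypothetical helper (not an AiDotNet API): promotes a
// [channels, height, width] shape to [1, channels, height, width].
public static class BatchShapeSketch
{
    public static int[] WithBatchDimension(int[] shape)
    {
        if (shape is null) throw new ArgumentNullException(nameof(shape));
        if (shape.Length == 4)
        {
            return (int[])shape.Clone(); // already batched
        }
        if (shape.Length == 3)
        {
            return new[] { 1, shape[0], shape[1], shape[2] };
        }
        throw new ArgumentException("Expected a 3D or 4D shape.", nameof(shape));
    }
}
```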

GetModelSummary()

Gets a summary of the model architecture.

string GetModelSummary()

Returns

string

A string describing the model's architecture, parameters, and capabilities.

ValidateInputShape(Tensor<T>)

Validates that an input tensor has the correct shape for this model.

void ValidateInputShape(Tensor<T> documentImage)

Parameters

documentImage Tensor<T>

The tensor to validate.

Exceptions

ArgumentException

Thrown if the tensor shape is invalid.
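
The page does not list the exact rules an implementation checks. Based on the shapes documented for EncodeDocument, a minimal validation might resemble the sketch below. InputShapeSketch is a hypothetical name, and real implementations may add further checks such as a required channel count.

```csharp
using System;

// Hypothetical sketch (not the AiDotNet implementation): accepts
// [batch, channels, height, width] or [channels, height, width]
// with strictly positive dimensions.
public static class InputShapeSketch
{
    public static void Validate(int[] shape)
    {
        // ArgumentNullException derives from ArgumentException.
        if (shape is null) throw new ArgumentNullException(nameof(shape));

        if (shape.Length != 3 && shape.Length != 4)
        {
            throw new ArgumentException(
                $"Expected 3 or 4 dimensions but got {shape.Length}.", nameof(shape));
        }

        foreach (int dim in shape)
        {
            if (dim <= 0)
            {
                throw new ArgumentException("Every dimension must be positive.", nameof(shape));
            }
        }
    }
}
```

Height and width are deliberately not compared with ExpectedImageSize here, since the ExpectedImageSize remarks say images are resized before processing.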