Interface IDocumentModel<T>
Namespace: AiDotNet.Document.Interfaces
Assembly: AiDotNet.dll
Base interface for all document AI models in AiDotNet.
public interface IDocumentModel<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T
The numeric type used for calculations (e.g., float, double).
Remarks
This interface extends IFullModel<T, TInput, TOutput> to provide the core contract for document AI models, inheriting standard methods for training, inference, model persistence, and gradient computation.
For Beginners: A document AI model processes document images (scanned pages, PDFs, photos of text) to extract information, understand layout, or answer questions.
Key concepts:
- Document images have shape [batch, channels, height, width]
- Models can run in Native mode (pure C#) or ONNX mode (optimized runtime)
- All models support both training and inference
- Many document models combine vision and language understanding
Example usage:
var model = new LayoutLMv3<double>(architecture);
var layout = model.DetectLayout(documentImage);
var text = model.ExtractText(documentImage);
Properties
ExpectedImageSize
Gets the expected input image size (assumes square images).
int ExpectedImageSize { get; }
Property Value
- int
Remarks
Common values: 224 (ViT base), 384, 448, 512, 768, 1024. Input images will be resized to [ExpectedImageSize x ExpectedImageSize] before processing.
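For instance, a caller can use ExpectedImageSize to allocate a correctly shaped input tensor before inference. A minimal sketch; it assumes Tensor&lt;T&gt; can be constructed from a shape array, and the variable names are illustrative:

```csharp
// Build an input tensor matching the model's expected square size.
// Assumes a Tensor<double> constructor taking a shape array.
int size = model.ExpectedImageSize;                    // e.g., 224
var input = new Tensor<double>(new[] { 1, 3, size, size });
// ... fill `input` with normalized pixel values, then:
var features = model.EncodeDocument(input);
```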
IsOnnxMode
Gets whether this model is running in ONNX inference mode.
bool IsOnnxMode { get; }
Property Value
- bool
Remarks
When true, the model uses pre-trained ONNX weights for fast inference. When false, the model uses native layers and can be trained.
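A typical guard before calling training APIs. A sketch, assuming a Train method inherited from IFullModel; the exact training signature is an assumption:

```csharp
if (model.IsOnnxMode)
{
    // ONNX mode: inference only; training calls would be invalid here.
    var features = model.EncodeDocument(documentImage);
}
else
{
    // Native mode: the model uses native layers and can be trained.
    model.Train(trainingInputs, trainingTargets);  // hypothetical signature
}
```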
MaxSequenceLength
Gets the maximum sequence length for text processing.
int MaxSequenceLength { get; }
Property Value
- int
Remarks
For layout-aware models, this is the maximum number of text tokens. Common values: 512, 1024, 2048.
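OCR token lists longer than MaxSequenceLength must be truncated (or split into windows) before encoding. A minimal head-truncation sketch, assuming `tokens` is a List&lt;string&gt; produced by an earlier OCR step:

```csharp
int maxLen = model.MaxSequenceLength;  // e.g., 512
if (tokens.Count > maxLen)
{
    // Simple head truncation; sliding-window chunking is an
    // alternative for long documents where tail content matters.
    tokens = tokens.GetRange(0, maxLen);
}
```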
RequiresOCR
Gets whether this model requires OCR preprocessing.
bool RequiresOCR { get; }
Property Value
- bool
Remarks
OCR-free models (Donut, Pix2Struct) return false - they process raw pixels directly. Layout-aware models (LayoutLM) return true - they need text and bounding boxes from OCR.
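This flag lets a generic pipeline decide whether an OCR pass is needed before invoking the model. A sketch, where RunOcr is a hypothetical helper returning tokens and bounding boxes:

```csharp
if (model.RequiresOCR)
{
    // Layout-aware models (e.g., LayoutLM) need text + bounding boxes.
    var (tokens, boxes) = RunOcr(documentImage);   // hypothetical helper
    // ... pass tokens/boxes to the model's task-specific API.
}
else
{
    // OCR-free models (Donut, Pix2Struct) consume raw pixels directly.
    var features = model.EncodeDocument(documentImage);
}
```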
SupportedDocumentTypes
Gets the supported document types for this model.
DocumentType SupportedDocumentTypes { get; }
Property Value
- DocumentType
Methods
EncodeDocument(Tensor<T>)
Processes a document image and returns encoded features.
Tensor<T> EncodeDocument(Tensor<T> documentImage)
Parameters
documentImage Tensor<T>
The document image tensor with shape [batch, channels, height, width] or [channels, height, width].
Returns
- Tensor<T>
Encoded document features suitable for downstream tasks.
Remarks
For Beginners: This method converts a document image into a numerical representation (feature vector) that captures the document's content and structure. These features can then be used for tasks like classification, QA, or information extraction.
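A typical call, validating the input shape first; tensor shapes follow the conventions above:

```csharp
// Encode a single page; a leading batch dimension is also accepted.
model.ValidateInputShape(documentImage);           // throws on bad shape
Tensor<double> features = model.EncodeDocument(documentImage);
// `features` can now feed a classifier, QA head, or extraction head.
```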
GetModelSummary()
Gets a summary of the model architecture.
string GetModelSummary()
Returns
- string
A string describing the model's architecture, parameters, and capabilities.
ValidateInputShape(Tensor<T>)
Validates that an input tensor has the correct shape for this model.
void ValidateInputShape(Tensor<T> documentImage)
Parameters
documentImage Tensor<T>
The tensor to validate.
Exceptions
- ArgumentException
Thrown if the tensor shape is invalid.
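Because validation signals failure by throwing rather than returning a flag, callers that want a soft check can wrap the call. A sketch:

```csharp
try
{
    model.ValidateInputShape(candidate);
}
catch (ArgumentException ex)
{
    // Shape mismatch, e.g. wrong rank or channel count.
    Console.WriteLine($"Rejected input: {ex.Message}");
    return;
}
```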