A document is worth a structured record: Principled inductive bias design for document recognition
Journal:
arXiv
Published Date:
Jul 11, 2025
Abstract
Many document types use intrinsic, convention-driven structures that serve to
encode precise and structured information, such as the conventions governing
engineering drawings. However, state-of-the-art approaches treat document
recognition as a mere computer vision problem and neglect these underlying
document-type-specific structural properties. This makes them dependent on
sub-optimal heuristic post-processing and leaves many less frequent or more
complicated document types inaccessible to modern document recognition. We
suggest a novel perspective that frames document recognition as a transcription
task from a document to a record. This implies a natural grouping of documents
based on the intrinsic structure inherent in their transcription, where related
document types can be treated (and learned) similarly. We propose a method to
design structure-specific inductive biases for the underlying machine-learned
end-to-end document recognition systems, and a corresponding base transformer
architecture that we successfully adapt to different structures. We demonstrate
the effectiveness of the resulting inductive biases in extensive experiments
with progressively more complex record structures, from monophonic sheet music
through shape drawings to simplified engineering drawings. By integrating an inductive bias
for unrestricted graph structures, we train the first successful
end-to-end model that transcribes engineering drawings into their inherently
interlinked information. Our approach can inform the design of
document recognition systems for document types that are less well understood
than standard OCR, OMR, etc., and serves as a guide to unify the design of
future document foundation models.