WordVIS: A Color Worth A Thousand Words
Journal: arXiv
Published Date: Dec 13, 2024
Abstract
Document classification is a critical element of automated
document processing systems. In recent years, multi-modal approaches have become
increasingly popular for this task. Despite their improved accuracy,
these approaches remain underutilized in industry because they require
large volumes of training data and extensive computational power. In this
paper, we attempt to address these issues by embedding textual features
directly into the visual space, allowing lightweight image-based classifiers to
achieve state-of-the-art results using small-scale datasets in document
classification. To evaluate the efficacy of the visual features generated by
our approach on limited data, we tested it on the standard Tobacco-3482 dataset.
Our experiments show a substantial improvement in image-based classifiers,
including an improvement of 4.64% using ResNet50 with no document pre-training.
The approach also sets a new record for accuracy on the Tobacco-3482 dataset,
with a score of 91.14% using the image-based DocXClassifier with no document
pre-training. The simplicity of the approach, its low resource requirements, and
the resulting accuracy make it a promising candidate for industrial use.
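
To make the idea of embedding textual features directly into the visual space more concrete, here is a minimal sketch of one possible encoding. It is not the paper's method, which the abstract does not specify. It assumes that each OCR word arrives with a bounding box, that a word's color can be derived from a hash of its text, and that the region of each word is filled with that color. The names `word_to_color` and `paint_words`, and the sample words and boxes, are hypothetical and exist only for illustration.

```python
import hashlib


def word_to_color(word):
    """Map a word to a deterministic RGB triple.

    Assumption: a hash-based color is only a stand-in for whatever
    textual feature encoding a real system would use.
    """
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    return digest[0], digest[1], digest[2]


def paint_words(width, height, words):
    """Return a height x width grid of RGB pixels with each word box filled.

    `words` is a list of (text, (x0, y0, x1, y1)) pairs, such as an OCR
    engine might produce. Boxes are half-open and clipped to the image.
    """
    white = (255, 255, 255)
    image = [[white] * width for _ in range(height)]
    for text, (x0, y0, x1, y1) in words:
        color = word_to_color(text)
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                image[y][x] = color
    return image


# Hypothetical OCR output: the same word in two places gets the same color.
ocr_words = [
    ("Invoice", (1, 1, 5, 3)),
    ("invoice", (1, 4, 5, 6)),
    ("Total", (6, 1, 9, 3)),
]
image = paint_words(10, 8, ocr_words)
```

The resulting pixel grid could then be passed to any image-based classifier in place of, or blended with, the original page image. Under this assumed encoding the classifier sees which words occur and where they occur without needing a separate text branch.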