OCR Approaches for Humanities: Applications of Artificial Intelligence/Machine Learning on Transcription and Transliteration of Historical Documents

Arsh Khan; Utsav Rai; Shashank Shekhar Singh; Yukinori Yamamoto; Xabier Granja Ibarreche; Harrison Meadows; Sergei Gleyzer

doi:10.1515/dsll-2024-0013

Article Open Access

Published/Copyright: December 2, 2024

From the journal Digital Studies in Language and Literature Volume 1 Issue 1-2

Abstract

Recent advances have made Artificial Intelligence/Machine Learning (AI/ML) processes increasingly relevant to business, healthcare, finance, retail, and telecommunications interests. The objective of the present study is to explore the potential to integrate modern machine learning algorithms directly into humanities research fields. To do so, the renAIssance project was created, aiming to examine various possibilities of using AI/ML algorithms for Optical Character Recognition (OCR) to accelerate and improve the accuracy of automated transcription in digitized historical archival documents. This article considers the state of the field as it pertains to the main processes used in natural language processing. It also explores difficulties arising from salient features of early modern printing practices that diverge from modern typographical conventions. Four AI/ML approaches were employed to achieve context-rich processing of a specifically selected historical archival dataset: Convolutional Neural Networks, Sequence-to-Sequence Contrastive Learning (SeqCLR), Vision Transformers, and Transformer-based OCR (TrOCR). The archival corpus consisted of 931 pages, 2,082 folios, and 61 manual transcriptions as ground truth to train the algorithms. This study reports on the accuracy achieved by each of the four methods when transcribing and transliterating early modern documents. Finally, it offers suggestions for future implementations to apply AI/ML tools to the analysis of archival sources commonly handled by researchers in humanities fields such as literature and history.

Keywords: artificial intelligence; machine learning; archival research; transcription; OCR

1 Introduction

Artificial Intelligence/Machine Learning (AI/ML) is by no means a new phenomenon. It dates back to the first half of the 20th century, when Alan Turing formulated a theory of computation and a corresponding method (the Turing test) in 1950 to determine the ability of computing machinery to simulate intelligence that could not be distinguished from that of a human being. In recent years, exponentially increased funding has led to the commercialization of a wide variety of consumer applications. Most importantly for the Humanities, an array of AI image and text generators newly released by OpenAI (DALL-E, ChatGPT), Anthropic (Claude 3.5 Sonnet), Stability AI (Stable Diffusion), Google (Gemini), Microsoft (Copilot), and Apple (Apple Intelligence) are fueling a new era of computational generation via artificial intelligence. The objective of the present study is to explore the potential to integrate modern machine learning algorithms directly into humanities research fields. To do so, the renAIssance project was created, aiming to examine the potential of using ML algorithms for Optical Character Recognition (OCR) to enable and accelerate the transcription of digitized historical archival documents, and to improve its accuracy, by orders of magnitude compared to the time and skill a human being would need to do so manually.

Currently, ongoing endeavors such as Google Cloud Vision have adopted generalized algorithms to transcribe any type of text. This approach greatly increases the difficulty of achieving higher accuracy, given the extreme variability of sources the model must account for. In contrast, the renAIssance project chose to specialize its code from inception: it focuses on early modern printed texts, especially those with framed marginalia. These two elements (marginalia, frames), common in early modern prose, represent a challenge to generalized OCR tools, making them error-prone. The literature review that follows considers the state of the field as it pertains to Natural Language Processing and Digital Humanities, covering salient features of early modern printing practices that diverge from modern typographical conventions. After a description of the primary and secondary datasets as well as the coding assumptions adopted by the renAIssance team to account for source variability, the methods section outlines the four approaches employed in this study for context-rich ML processing. The article proceeds to report on our findings regarding the efficacy and accuracy of these four machine learning algorithms in achieving the project’s goals. Data analysis suggests that controlled ground truth sourcing, algorithm design choices, and careful pretraining of data are crucial for a successful implementation, as is further elaborated in the discussion. Finally, suggestions and criteria for applying AI/ML tools to archival research in humanities fields such as literature and history are provided.

2 Literature Review

2.1 Artificial Intelligence and Machine Learning for Natural Language Processing

When it comes to Natural Language Processing (NLP) tasks (e.g., in applications such as digital translation tools, chat bots, voice assistants, etc.), Artificial Neural Networks (ANN) have exceeded the performance of prior machine learning methods. Common types of ANNs include Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN). They are heavily inspired by biological nervous systems, in that they are based on “a high number of interconnected computational nodes (referred to as neurons)” that work in “a distributed fashion to collectively learn from the input in order to optimize its final output” (O’Shea and Nash 2015, 1). RNNs are mainly used to detect patterns in sequences of data (Schmidt 2019), whereas CNNs have traditionally been the go-to architecture for pattern recognition within images. CNNs are biased toward preferred solutions during initial training through strong inductive priors (built-in assumptions, design choices, and constraints that leverage the spatial structure of image data to guide a model’s learning process). This avoids weaknesses stemming from a lack of massive sources of ground truth data (such as the scarcity of transcribed early modern archival documents): the training is simplified, and the model learns to better capture the local dependencies in images. Inductive priors include locality (focusing on small image regions using small filters to capture local patterns), translation invariance (recognizing objects regardless of position), hierarchy of features (building complex patterns from simpler ones through multiple convolutional layers), and parameter sharing (applying the same filters across the image to reduce the number of parameters). These inductive priors enable CNNs to be data-efficient, generalize well, and reduce complexity, making them highly effective at executing image processing tasks from limited data.
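To make the locality and parameter-sharing priors concrete, the following is a minimal pure-Python sketch of a single convolutional filter. The function name `conv2d_valid` and the toy images are illustrative assumptions, not part of the renAIssance code: one two-weight kernel is slid across the whole image, so the same parameters detect a stroke edge wherever it appears.

```python
def conv2d_valid(image, kernel):
    """Slide one small kernel over the image (no padding, stride 1).

    The same kernel weights are reused at every position (parameter sharing),
    and each output value depends only on a small neighbourhood (locality).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical toy "images": a vertical edge, and the same edge moved one pixel right.
edge = [[0, 0, 1, 1, 1] for _ in range(3)]
shifted_edge = [[0, 0, 0, 1, 1] for _ in range(3)]

# A two-parameter kernel that responds to a dark-to-light transition.
edge_kernel = [[-1, 1]]

response = conv2d_valid(edge, edge_kernel)
shifted_response = conv2d_valid(shifted_edge, edge_kernel)
```

In this toy case the filter's response moves with the edge, and the number of learnable weights (two) does not grow with the image size, which is one reason such architectures can be trained on comparatively little data.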

The landscape of neural network architectures has undergone significant changes in recent years, particularly with the rise of Transformers. Transformers were proposed by Vaswani et al. (2017) as “a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output” that can achieve “significantly more parallelization and can reach a new state of the art in translation quality” (2). They can perform very well on image classification tasks when directly applied to sequences of image patches. When trained on large volumes of data, they can attain excellent results compared to state-of-the-art CNNs. Although initially attention mechanisms were introduced in Long Short-Term Memory (LSTM) networks, enhancing their NLP capabilities, Transformers have rapidly evolved to supplant LSTMs entirely in NLP tasks and are now extending their influence to the realm of computer vision.
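The attention mechanism quoted above can be sketched in a few lines. The following pure-Python example implements scaled dot-product attention in the form given by Vaswani et al. (2017), softmax(QK^T / sqrt(d)) V, for a single head and without the learned projection matrices. The function names and toy vectors are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention: each query returns a weighted average of the values.

    Weights come from the similarity between the query and every key,
    so any position can draw on any other position in one step.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
averaged = scaled_dot_product_attention([[1.0, 0.0]],
                                        [[1.0, 0.0], [1.0, 0.0]],
                                        [[0.0, 2.0], [2.0, 0.0]])

# A query that matches the first key far better than the second retrieves the first value.
focused = scaled_dot_product_attention([[10.0, 0.0]],
                                       [[10.0, 0.0], [0.0, 10.0]],
                                       [[1.0, 0.0], [0.0, 1.0]])
```

Because every query is compared with every key independently, the loop over queries can be computed in parallel, which is the property the quoted passage contrasts with recurrence.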

As illustrated by Dosovitskiy et al. (2021), Transformers represent a paradigm shift: unlike prior specialized architectures like LSTMs and CNNs, Transformers are designed as highly general computation frameworks. They offer an increased level of flexibility thanks to their dynamic ability to compute connections between inputs on-the-fly. This enhanced general-purpose processing allows Transformers to learn inductive biases from data, potentially discovering more optimal patterns. Given the growing digitization of libraries worldwide, it is now starting to become possible to train Transformer models to decipher the content of scanned historical documents through machine learning-based OCR and perform transcriptions that outperform other approaches based on architectures with built-in or predefined inductive biases.

Most existing OCR tools designed to handle early modern historical documents suffer from common downsides limiting their usability. First and foremost, they require significant end-user technical training, even those created with non-technical users in mind, like OCR4all (Reul et al. 2019). Moreover, the goal of a sufficiently accurate, generalized automatic transcription tool remains out of reach: while existing tools showcase promising performance metrics, they are highly specialized for monolingual datasets restricted in scope to reduce typeface variability.^[1] Emerging from the field to resolve these issues is READ co-op’s Transkribus online suite of historical document management tools. Transkribus offers a flexible base transcription platform (built on ABBYY FineReader) which can be tailored to optimize performance for a particular dataset. End users can select from pre-trained, customized models to run the transcription using one that aligns best with the target document’s features (chronological, linguistic, typographical, etc.). The GUI is user-friendly, and four models have been made available for early modern Spanish texts through GitHub open access: one trained on 15th–16th-century texts printed in Gothic typefaces (“Spanish Gothic 15th–16th Century”) (Blasut 2022) and three for 17th-century printing conventions: “Spanish Golden Age Prints 1.0,” “Spanish Golden Age Theatre Prints (Spelling Modernization) 1.0” (Cuéllar 2023), and “Spanish Redonda (Round Script) 16th–17th Century” (Bazzaco et al. 2022). The first one has reached the highest level of generalization (99.39 % accuracy). The models designed for later 16th- and 17th-century Garamond or “Redonda” style fonts publicize promising CER levels (0.91–3.10), but ultimately, they remain best geared for the specific texts in the training data (e.g., books of chivalry and theatrical plays). Each model struggled with recognizing framed text in Padilla’s Nobleza virtuosa included in the renAIssance project’s training data, suggesting that the Transkribus models have yet to resolve the problem of layout analysis symptomatic of generalized OCR tools.

Layout analysis and line segmentation represent major obstacles to automatic transcription tools regarding historical documents, as datasets are not yet expansive enough to account for layout diversity, material deterioration, or varying scan qualities. For instance, Transkribus can only improve line segmentation generalization by introducing new data to training processes. The renAIssance transcription platform, instead, adopts a flexible segmentation tool into the workflow that allows end-users to manually adjust line bounding boxes per document to avoid and correct errors prior to running the automated transcription. Enabling non-technical users to prepare target documents for more effective layout analysis during pre-processing is a unique contribution of the renAIssance project, which offers the potential to improve prediction outputs while also learning from the new data input to increase the model’s generalized performance (see Section 4.4).

2.2 Inconsistency and Variability in Early Modern Print Sources

Early modern Spanish orthography differs from contemporary spelling conventions. Certain differences can be attributed to the period’s linguistic diversity before languages became standardized to the extent they have today, while in other instances patterns may be predictable with knowledge of archaic orthographical conventions and the material practicalities of moveable type printing. Prescriptive early modern handbooks like Alonso Víctor de Paredes’s 1680 composition manual, Institucion, y origen del arte de la Imprenta, y reglas generales para los componedores are valuable catalogues of orthographic precepts one can expect to find in printed texts from the period, especially when calibrated by informed expectations about the abundance of examples in the real-world archive that diverge from those guidelines. This background knowledge is invaluable when training software models to identify characters and shorthand techniques no longer typical to current usage, and to reliably convert and transliterate them for a modernized output text (Figure 1).

Figure 1:

Depiction of the type sorts found in an early modern compositor’s case, from Paredes’s Institucion, y origen del arte de la Imprenta … , fol. 9r.

For instance, archival sources often interchangeably render ‘u’ and ‘v’ letters, and there are also multiple characters used for the lowercase letter ‘s.’ In practice, orthographical consistency is further undermined in early modern print documents by any number of pragmatic factors a compositor could encounter in preparing a block for press printing, including the breakage of type letters or running out of a particular letter, among others. Moreover, letter substitutions were often not carried out predictably or consistently, commonly displaying multiple spellings of a word within a single text or even the same sequence of letters rendered by different typeset characters. The inconsistency stems from a lack of distinction between certain letters inherited from antiquity (e.g., ‘u’ and ‘v’), which persisted until the seventeenth century brought a shift towards modern distinctions such as ‘v’ becoming a consonant and ‘u’ a vowel. This variability must be accounted for in the design phase of any AI/ML algorithm used for OCR. renAIssance opted for a different approach than Transkribus and Google Cloud Vision; instead of fine-tuning a base platform like ABBYY FineReader for a specific dataset or attempting a fully generalized model, renAIssance initially focused on a specific task: early modern framed prose (see Figure 4), which generally enjoys a uniform page layout that eases the burden on algorithms when analyzing what parts of an image consist of text. Coding awareness of lines dividing the main corpus, marginalia, and the outer bounds (i.e., layout analysis) allowed renAIssance to quickly train the model to differentiate these elements.

3 Methods

Aware of the collaborative potential of renAIssance, Gleyzer and Granja Ibarreche organized a test challenge for the Google Summer of Code (GSoC) 2024 competition, which participants around the globe attempt to solve in order to be funded as research contributors. The proposal included a base dataset consisting of 32 pages from Luisa de Padilla’s Nobleza Virtuosa (1637): 20 pages with transcriptions to serve as ground truth, and 12 non-transcribed pages against which initial coding attempts should be tested. renAIssance received 28 coding proposals, from which 4 finalists (Khan, Rai, Shekhar Singh, and Yamamoto) were selected. The project was set for a medium scope of approximately 175 hours per contributor, covering the foreseen coding workload necessary to train the model as well as weekly Zoom meetings to facilitate collaboration, address obstacles, and solve problems (the team was dispersed across India, Japan, Spain, the United Kingdom, and the United States).

3.1 An Expanding Dataset: Growing the Pool of Archival and Historical Documents

The rationale for selecting the initially narrow dataset based on Padilla’s Nobleza Virtuosa was to minimize expected printing inconsistencies, as her work is representative of texts with a framed prose layout that have never been transcribed before. It is available in high-quality digital scan format from the Biblioteca Digital Hispánica supported by the National Library of Spain (with subsequent tomes also accessible from the digital repository HathiTrust). These scans allow researchers to access copies in near pristine condition, with well-preserved pages that exhibit minimal ink bleed-through, deterioration, or other textual noise such as handwritten annotations, all of which helped facilitate identifying each ML algorithm’s viability early on. The dataset would be progressively expanded when it became necessary due to diminishing returns in training the algorithm (i.e., the code approached a performance ceiling). To accelerate this process, marginalia present in the original sources were ignored in favor of focusing on the main text, and Spanish accentuation was also overlooked because early modern printing practices do not align with contemporary use of accents (handling them would thus increase ML complexity with no tangible benefit to the final transcription). As the four main approaches began to take shape, the dataset was broadened to include another 32-page extract from Padilla’s Noble perfecto y segunda parte de la Nobleza virtuosa (1639), which was printed following the same framed prose layout as its precursor, to better evaluate the models’ performance.

The dataset was progressively enlarged in two additional phases: the first involved 27 archival sources gathered at the Spanish National Library in Madrid. These texts belong to the PORCONES classification: judicial documents printed by Spanish governmental offices detailing criminal proceedings.^[2] Published between 1621 and 1693, these archival sources used letters and variations in printing practices that resembled – but were not identical to – those from Padilla’s texts. These additions increased the source pool with 363 folios, 9 of which were manually transcribed to provide ground truth for the ML models. The new documents’ relative consistency with the initial materials facilitated further training of the algorithms: some were framed prose while a few were unframed, providing the opportunity to progressively challenge the algorithms (i.e., to test layout analysis on the frameless text) while still maintaining a controlled training environment.

The second phase infused greater complexity via 10 sources that displayed far greater variability in letter rendering, typefaces, and printing practices. The additional corpus spread chronologically to include materials from the mid-sixteenth to the mid-eighteenth centuries: Luís Milán’s Libro intitulado el Cortesano (1561), the extensive proceedings of a 1590 criminal case, Juan Benito Guardiola’s Tratado de la nobleza (1591), the Constituciones Sinodales de Calahorra (1602), Sebastián de Covarrubias’s Tesoro de la lengua castellana o española (1611), the Recopilacion de las leyes destos reynos (1640), Andrés Mendo’s Príncipe perfecto y ministros aiustados (1662), Paredes’s aforementioned Institucion, y origen del arte de la Imprenta … (1680), Antonio de Ezcaray’s Vozes del dolor nacidas de la multitud de pecados (1691), and Fausto Agustín de Buendía’s Instrucción de christiana y política cortesanía con Dios y con hombres (1740). This final expansion represented a sizable training dataset increase for a total corpus consisting of 931 pages, 2,082 folios, and 61 manual transcriptions as ground truth references to train the code. At this stage, printing styles increased in variety, forcing the algorithms to adapt when processing scanned pages. While this represents a growing generalization of the AI/ML model, it remained primarily trained on Renaissance and early Baroque sources.

To resolve the expected issues of variability in the final dataset, four of the more common printing features of this nature (detailed below) were targeted for the ML models to detect and resolve by altering the output text to follow modernized spelling conventions. All four features were addressed throughout this study, from model training to fine-tuning and problem solving. Inevitably, all four assumptions are inherently imperfect: they cannot yield consistently flawless orthographical outputs because no fixed set of simplifying parameters can fully capture the immense potential variability in early modern texts. Nevertheless, the aim was to minimize spelling errors, which would result in a corresponding reduction in the human effort required to manually correct remaining errors in the output transcription. The researchers acknowledge that these assumptions do not currently fit all purposes: setting the parameters for modernizing the text was motivated by the code-training benefits that result from reducing variability. While this approach was most effective for renAIssance’s goals, efforts to refine this code are ongoing. Once the algorithms reach an even higher degree of accuracy, such modernizations and simplifications may be removed to allow the more capable AI/ML model to generate both diplomatic and modernized transcriptions depending on user needs. This limitation of the present study, as well as strategic suggestions for future research, is examined further in the discussion section.

3.1.1 Interchangeability of ‘u’ and ‘v’

Prescriptively, when appearing at the beginning of a word, both the letters ‘u’ and ‘v’ should be rendered as the typeset letter ‘v’ in seventeenth-century orthography. However, some texts apply the same convention when ‘u’ or ‘v’ appear at the beginning of any syllable. Paredes (1680) states that ‘u’ should only appear medially as a vowel, but in practice, the letter ‘u’ can also appear medially for ‘v,’ evidence of which is found in the dataset. For this feature, certain tendencies can be recognized in the practices of individual typesetters, but no discernible pattern emerges across the dataset that could be applied generally. The renAIssance team trained ML models to learn to assume the typeset ‘v’ at the beginning of the word should be converted to ‘u’ and remain ‘v’ when it appears medially.
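As a rough illustration of the target convention stated above, the following Python sketch applies the word-initial rule with a regular expression. It is a rule-based stand-in written for this example, not the trained models themselves, and the function name `modernize_initial_v` is hypothetical. Because it follows the simplifying assumption literally, it also rewrites words whose initial ‘v’ is genuinely consonantal, which is one way the assumption is imperfect.

```python
import re

# Matches runs of Unicode letters (no digits, underscores, or punctuation).
WORD = re.compile(r"[^\W\d_]+")


def modernize_initial_v(text: str) -> str:
    """Rewrite a typeset word-initial 'v' as 'u'; leave medial 'v' untouched.

    Illustrates the stated simplifying assumption only; it does not try to
    decide whether the initial letter is a vowel or a consonant.
    """
    def fix(match):
        word = match.group(0)
        if word[0] == "v":
            return "u" + word[1:]
        if word[0] == "V":
            return "U" + word[1:]
        return word

    return WORD.sub(fix, text)


sample = modernize_initial_v("vn libro que tuvo la Vniversidad")
```

Here `sample` holds the modernized string, with ‘vn’ and ‘Vniversidad’ rewritten and the medial ‘v’ of ‘tuvo’ preserved.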

3.1.2 Optical Similarities Between ‘f’ and the Long ‘s’ (‘ſ’)

For most of the letterpress era (mid-fifteenth through nineteenth centuries), there are two types of the lower-case letter ‘s’ in the compositor’s type case: the short ‘s,’ which corresponds visually with the lowercase ‘s’ in modern typography, and the long ‘s,’ which has an appearance similar to the lower-case letter ‘f.’ Like in modern typography, in the seventeenth century, the lower-case ‘f’ has a cross-stroke that fully bisects the letter’s curved stem (i.e., the cross-stroke appears on both the left- and right-hand side of the stem). The long ‘s’ (rendered as ‘ſ’), however, is distinguished from the lower-case ‘f’ by having a cross-stroke that either only appears on the left side of the stem or is nonexistent (‘ſ’ or ‘ſ’). At the onset of this project, the concern was that the software models would not distinguish between the lower-case ‘f’ and long ‘s’ with acceptable accuracy, especially in instances with textual noise interference due to the aged and degraded physical print source. ML models were trained to output transliterations of any ‘f’, ‘ſ’ or ‘ſ’ and ‘s’ letters based on their positions within words, assuming ‘s’ in an initial or final position and ‘f’ when present in medial position. Due to the frequency of this feature, the models ultimately exceeded this original assumption and learned to distinguish between the long ‘s’ and ‘f’ characters with high accuracy, rendering each respectively as ‘s’ and ‘f’ in the output text (see Section 4.4 in the Findings section).

3.1.3 Interpreting ‘ç’ as ‘z’

Given that ‘ç’ has fallen into disuse in favor of the ‘z’ in modern Spanish spelling, the ML models were trained to always interpret ‘ç’ as ‘z’ and output rendered text accordingly.
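Expressed as a post-processing rule rather than a learned behavior, this convention amounts to a fixed character substitution. The short Python sketch below shows it with a translation table; the function name `cedilla_to_z` is a hypothetical label for this example.

```python
# Fixed substitution table for the cedilla, in both cases.
CEDILLA_TABLE = str.maketrans({"ç": "z", "Ç": "Z"})


def cedilla_to_z(text: str) -> str:
    """Replace every 'ç' with 'z', preserving case."""
    return text.translate(CEDILLA_TABLE)


modernized = cedilla_to_z("coraçon y cabeça")
```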

3.1.4 Use of Tildes (‘∼’) and Macrons (‘¯’) to Truncate Word-Length

To reduce the cost of materials, early modern compositors had a system for abbreviating words to save space on the page. When an ‘m’ or ‘n’ appeared at the end of a syllable and was preceded by a vowel (e.g., ‘contra,’ ‘enmendar,’ ‘compuesto,’ etc.), it could be omitted and signaled diacritically by a tilde (‘∼’) or a macron (‘¯’) over the preceding vowel (e.g., ‘cõtra,’ ‘enmēdar,’ ‘cõpuesto’). Given Spanish grammatical morphology, these abbreviations most often omit ‘n’ rather than ‘m’, which occurs in fewer instances. The ML models were trained to assume the letter ‘n’ follows a vowel with either a tilde or a macron above it and replace the abbreviated form with the word spelled out in its entirety.
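The expansion rule can be sketched with Unicode normalization. In the following Python example (a rule-based illustration under the ‘n’ assumption described above, not the project’s models; the function name is hypothetical), each character is decomposed into a base letter plus combining marks, and a combining tilde or macron that follows a vowel is replaced by ‘n’. A tilde on a consonant, as in ‘ñ’, is left intact.

```python
import unicodedata

COMBINING_TILDE = "\u0303"
COMBINING_MACRON = "\u0304"
VOWELS = "aeiouAEIOU"


def expand_nasal_abbreviations(text: str) -> str:
    """Expand vowel + tilde/macron into vowel + 'n'.

    Always inserts 'n' (never 'm'), mirroring the simplifying assumption,
    so a form such as 'cõpuesto' comes out as 'conpuesto'.
    """
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if ch in (COMBINING_TILDE, COMBINING_MACRON) and out and out[-1] in VOWELS:
            out.append("n")  # the diacritic stood for an omitted nasal
        else:
            out.append(ch)
    # Recompose so that untouched letters such as 'ñ' return to their usual form.
    return unicodedata.normalize("NFC", "".join(out))


expanded = [expand_nasal_abbreviations(w)
            for w in ("c\u00f5tra", "enm\u0113dar", "a\u00f1o")]
```

The third test word shows why the check on the preceding letter matters: without it, ‘año’ would be corrupted.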

3.2 Machine Learning Approaches

The renAIssance team opted for four state-of-the-art machine-learning models for the transcription of historical archival documents that do not follow standard printing practices. Each of the four approaches has strengths and weaknesses. A brief description of each of the four methods employed follows.

3.2.1 SeqCLR: Optical Character Recognition via Self-Supervised Contrastive Learning

A self-supervised approach enables a machine learning model to learn from a small number of images by generating labels from the data automatically. As demonstrated by Aberdam et al. (2021), SeqCLR is a self-supervised OCR process that allows a machine learning model to be trained to recognize features of characters in images by learning to put similar characters closer to each other and push different characters away, a technique generally known as Contrastive Learning (CL). The model uses a set of character images and creates automatically augmented versions (slightly different copies made with alterations such as diffusion or rotation to increase training complexity) to learn to better distinguish new input images. The process is regarded as self-supervised because the labels, as data attached to the inputs to assist the model in distinguishing them from other inputs, are generated by algorithms.
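The idea of pulling similar instances together and pushing different ones apart is commonly expressed as an InfoNCE-style loss. The pure-Python sketch below uses that generic formulation with cosine similarity; it is an assumption made for illustration and does not reproduce SeqCLR’s sequence-level instance mapping. The temperature value and the two-dimensional toy features are arbitrary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Loss is low when the anchor is closer to its positive than to any negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical features: an anchor, an augmented view of it, and two unrelated instances.
anchor = [1.0, 0.0]
augmented_view = [0.9, 0.1]
others = [[0.0, 1.0], [-1.0, 0.0]]

well_aligned = info_nce(anchor, augmented_view, others)

# Treating an unrelated instance as the positive yields a much higher loss.
misaligned = info_nce(anchor, others[0], [augmented_view, others[1]])
```

Minimizing this quantity during training is what moves the augmented view of a character toward its original and away from other characters in feature space.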

However, using SeqCLR is not without difficulties. As demonstrated by Liu et al. (2022), SeqCLR may incorrectly learn to put different kinds of instances (negative pairs) close and the same kinds of instances (positive pairs) away from each other due to data misalignments. When an instance in an image feature and an instance in the corresponding augmented image feature share the same index, they should be paired positively. When instances in different indices or images represent the same information (same characters, word style, etc.), SeqCLR may routinely treat them as negative pairs despite the expectation of them being a positive pair. This means that due to SeqCLR’s inherent architecture, some of the raw input image data may be overlooked during image recognition and require successive architectures to be implemented in the coding phase to improve the final output. Because of this, PerSec was created: a successor model to SeqCLR that improves on the architecture by leveraging Hierarchical CL, which “can achieve better performance than other methods under both un- and semi-supervised learning settings” (Liu et al. 2022, 1703). Unlike traditional CL methods that are only applied to the last layer, PerSec is trained with CL in multiple layers: an intermediate layer (Stroke Context Perceiver) and a last layer (Semantic Context Perceiver). By not focusing solely on features from the last layer but instead adding an intermediate one, PerSec contrasts features within the same image across different levels, thereby improving its discriminative power.

3.2.2 Vision Transformers (ViT) for OCR of Historical Texts

ViTs represent a novel approach to image recognition, adapted from models initially designed for NLP. As detailed by Dosovitskiy et al. (2021) in “An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale”, images are broken down into smaller patches, which are embedded into a linear sequence that is then processed with a Transformer model, akin to how words are grouped to structure sentences in natural languages, as well as how words are embedded in NLP models. Positional embeddings are added to maintain the spatial information of the patches, following a similar logic to processes that add data labeling, so that the Transformer can effectively process and understand the entire image without relying on convolutional layers. ViT performance has been shown to be highly dependent on pretraining with large datasets: models pretrained on datasets such as ImageNet-21k or JFT-300M can achieve a remarkable degree of accuracy that often surpasses traditional CNNs in image classification tasks. Large-scale pretraining, then, is crucial to help the Transformer model learn more generalized features and has led ViT to set new benchmarks in image recognition tasks well-known in the field, such as ImageNet or CIFAR-100.
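The patch step can be illustrated without any deep learning library. The following Python sketch splits a small grayscale grid into non-overlapping square patches, flattens each one, and pairs it with its position index. The learned linear projection and learned positional embeddings of an actual ViT are omitted here; the integer index simply stands in for positional information, and the function name and toy image are assumptions.

```python
def image_to_patches(image, patch):
    """Split an H x W grid into flattened patch x patch blocks, row by row."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


# A hypothetical 4 x 4 image whose pixel values are 0..15, cut into 2 x 2 patches.
toy_image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = image_to_patches(toy_image, patch=2)

# The sequence fed to a Transformer: (position index, flattened patch) pairs.
sequence = list(enumerate(patches))
```

With 16 × 16 pixel patches, as in the title of Dosovitskiy et al. (2021), the same procedure turns a scanned text line into a sequence whose length depends on the image size.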

These results make ViT a highly attractive option for image recognition applications such as the OCR of historical documents sought by the renAIssance project. Early modern literary documents exhibit significant variability in composition, fonts, and material characteristics which traditional OCR tools struggle to handle. Given ViT’s lesser reliance on convolutional layers, it is expected to better manage source material diversity, hence leading to an increase in the accuracy of text recognition and a decrease in the Character Error Rate (CER) and Word Error Rate (WER). Additionally, ViT’s scalability implies an enhanced ability to process large collections of historical texts of various types and prints, which could not only enable further OCR projects but also aid in the preservation and digitization of important historical records.
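For readers unfamiliar with these metrics, CER and WER are conventionally computed as the edit (Levenshtein) distance between the model output and the ground truth, divided by the length of the ground truth in characters or words. The sketch below follows that standard definition in pure Python; the function names are illustrative, and the evaluation scripts used by the project may differ in details such as whitespace or case handling.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of insertions, deletions, and substitutions.

    Works on any pair of sequences: strings (characters) or lists (words).
    """
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                               # deletion
                current[j - 1] + 1,                            # insertion
                previous[j - 1] + (ref_item != hyp_item),      # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# A single long-s/f confusion: one wrong character out of 19, one wrong word out of 3.
char_rate = cer("la nobleza virtuosa", "la noblefa virtuosa")
word_rate = wer("la nobleza virtuosa", "la noblefa virtuosa")
```

The example shows why the two metrics are reported together: one misread letter barely moves CER but counts as a full word error in WER.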

3.2.3 TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models

Another approach to OCR that prioritizes employing pre-trained Transformer models for both image and text processing is TrOCR, or Transformer-based OCR. Traditional OCR systems typically rely on CNNs for image understanding and RNNs for text generation: where CNNs process images by detecting patterns through convolutional layers, RNNs handle the sequenced data by maintaining context through cyclical connections. Instead, TrOCR utilizes a pure Transformer architecture for both tasks, resulting in a unified and highly effective model. As demonstrated by Li et al. (2023) in “TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models,” this method adopts an encoder-decoder architecture in which the encoder is based on a pre-trained ViT. In a similar way to Dosovitskiy et al. (2021), TrOCR is designed to process images by dividing them into smaller patches to capture detailed visual features from the input images. Next, the decoder, a pre-trained text Transformer such as the Bidirectional Encoder Representations from Transformers (BERT) model proposed by Devlin et al. (2019), generates the recognized text from the visual features provided by the encoder. This integrated approach eliminates the need for a separate language model, thereby streamlining the OCR process.

By adopting a unified Transformer architecture for both image and text processing, TrOCR simplifies the overall model while enhancing performance. Leveraging large-scale pre-trained models, TrOCR benefits from extensive visual and language data, improving text recognition accuracy. The model achieves state-of-the-art performance across WER, CER, and F-score (a measure of predictive performance) OCR benchmarks, including printed, handwritten, and scene text recognition, without the need for complex pre- and post-processing steps. The relevance of TrOCR when it comes to historical document OCR is notable given its flexibility for recognizing text in diverse formatting styles, fonts, and varying conditions that challenge traditional OCR tools. TrOCR’s advanced visual and language modelling capabilities increase the potential for precise text extraction, and, like ViT models, it reduces CER and WER while maintaining rich potential for highly scalable applications.

3.2.4 MaskOCR: Text Recognition with Masked Encoder–Decoder Pretraining

MaskOCR represents another innovative approach to text recognition by integrating visual and language pre-training into a unified model. This method significantly enhances the model’s ability to accurately recognize text by leveraging both visual representations and linguistic knowledge, as demonstrated by Lyu et al. (2022) in “MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining.” MaskOCR employs an encoder-decoder architecture, where the feature encoder is pre-trained using a large set of unlabeled text images through a masked image modeling approach. This technique involves hiding (or “masking”) certain parts of the input data, such as sections of an image or text, and training the model to predict the missing parts. This encourages the model to develop a deeper understanding of the underlying structure and context, enhancing its capacity for generalization. As a result, pre-training enables the model to learn robust visual representations without the need for labeled data, thus improving its image recognition capabilities. For the sequence decoder, text data is transformed into synthesized text images, which are then utilized to enhance the language modeling capabilities of the decoder using a masked image-language modeling technique. To preserve the quality of visual representations during pre-training, the encoder is kept frozen: parameters are no longer updated, allowing the model to retain and use the pre-learned features without alteration while learning new tasks on the target dataset.
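The masking step described above can be illustrated with a minimal sketch. It is not MaskOCR's implementation: the function name `mask_patches`, the 30 % mask ratio, and the use of `None` as a mask placeholder are all assumptions chosen for illustration. The sketch only shows how a subset of patches is hidden and kept aside as the prediction target.

```python
import random


def mask_patches(patches, mask_ratio, rng, mask_value=None):
    """Hide a random subset of patches and return (visible, targets).

    `visible` is the input the model would see; `targets` maps each hidden
    position to the original patch the model would be trained to predict.
    """
    n_masked = round(len(patches) * mask_ratio)
    masked_idx = set(rng.sample(range(len(patches)), n_masked))
    visible = [mask_value if i in masked_idx else p for i, p in enumerate(patches)]
    targets = {i: patches[i] for i in masked_idx}
    return visible, targets


# Ten stand-in "patches"; in practice each would be a block of pixels.
patches = list("abcdefghij")
visible, targets = mask_patches(patches, mask_ratio=0.3, rng=random.Random(0))
```

Because the targets come from the image itself, no manual transcription is needed at this stage, which is what allows pre-training on unlabeled text images.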

The unification of vision and language pre-training within the encoder-decoder framework that MaskOCR is based on significantly augments the model’s ability to capture both visual and linguistic features. The method leverages self-supervised learning techniques, such as masked image modeling and masked image-language modeling, to pre-train the model on extensive unlabeled and synthetic data. As such, MaskOCR achieves state-of-the-art performance on benchmark datasets, demonstrating substantial improvements in text recognition accuracy, particularly for Chinese and English text images. Given its demonstrated effectiveness, the renAIssance team prioritized MaskOCR to analyze our dataset consisting of historical Spanish documents: the integration of both visual and language pre-training should more effectively handle complications inherent to the more variable printing practices in early modern archival sources. This unpredictability makes the adaptability of methods like MaskOCR fundamental for the renAIssance project.

4 Findings

The four machine learning approaches that the team selected displayed varying levels of accuracy and potential for improvement. These findings are described below, specifically detailing what was learned from each of the four approaches employed.

4.1 A Convolutional Recurrent Neural Network with a CRAFT Addition

The first approach is based on a recurrent CNN architecture (CRNN), as introduced by Shi et al. (2017), which trained on the dataset after having the source PDFs pre-converted into images for processing. By applying Baek et al.’s (2019) well-regarded Character Region Awareness for Text Detection model (CRAFT), this approach facilitated the precise detection of bounding boxes around text regions, crucial for generating ground truth labels and characterizing text instances effectively. Bounding box coordinates were split and sorted based on vertical proximity and horizontal alignment to ensure that words within sentences were logically sequenced, adhering to the natural reading order of Spanish text. Input data was organized for the CRNN model by associating images with corresponding word names and defining acceptable levels for the Connectionist Temporal Classification (CTC) loss function. CTC is a process that sums the probabilities of all possible alignments of the input sequence (characters in the text) that could result in the target output (transcription), considering all valid ways to map the input sequence to the output sequence. The loss value is differentiable with respect to each node; the closer the loss value is to 0, the more accurate the algorithm is, which allows it to learn to correctly predict the expected letters. The CRNN approach produced a CTC loss that approximated 1 after pre-training and fine-tuning, but further data augmentation was required to increase its accuracy (Figure 2).
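To make the idea of summing over alignments concrete, the following is a minimal brute-force sketch of the CTC loss. It enumerates every frame-level path, which is only feasible for toy inputs; practical implementations use dynamic programming instead. The blank symbol `-`, the function names, and the two-frame probabilities are assumptions for illustration and are not taken from the project's code.

```python
from itertools import product
from math import log

BLANK = "-"  # hypothetical symbol standing for the CTC blank


def collapse(path):
    """Merge repeated symbols, then drop blanks (the CTC many-to-one mapping)."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)


def ctc_loss(frame_probs, target):
    """Negative log of the summed probability of all paths that collapse to target."""
    alphabet = list(frame_probs[0].keys())
    total = 0.0
    for path in product(alphabet, repeat=len(frame_probs)):
        if collapse(path) == target:
            p = 1.0
            for t, symbol in enumerate(path):
                p *= frame_probs[t][symbol]
            total += p
    return -log(total) if total > 0 else float("inf")


# Two frames, alphabet {a, blank}: the paths (a,a), (a,-) and (-,a) all read "a".
frames = [{"a": 0.6, BLANK: 0.4}, {"a": 0.5, BLANK: 0.5}]
loss = ctc_loss(frames, "a")  # -log(0.3 + 0.3 + 0.2)
```

As the per-frame probabilities concentrate on paths that spell the target, the summed probability approaches 1 and the loss approaches 0.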

Figure 2:

Example of completed data augmentation involving +/− 5-degree rotation as well as insertion of randomized noise via Gaussian filter.
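A minimal sketch of this kind of augmentation follows, using plain nested lists as a stand-in for a grayscale image. The nearest-neighbour rotation, the additive Gaussian noise, the noise level `sigma`, and all function names are assumptions; the caption does not specify how the team implemented either step.

```python
import math
import random


def rotate(image, degrees, fill=255):
    """Rotate a grayscale image (list of rows) about its centre, nearest-neighbour."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    cos_a = math.cos(math.radians(degrees))
    sin_a = math.sin(math.radians(degrees))
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel for each output pixel.
            sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
            sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = image[iy][ix]
    return out


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel, clamped to 0..255."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in image]


def augment(image, rng, max_angle=5.0, sigma=10.0):
    """One augmented copy: random rotation within +/- max_angle, then noise."""
    angle = rng.uniform(-max_angle, max_angle)
    return add_gaussian_noise(rotate(image, angle), sigma, rng)
```

Calling `augment` several times on the same word image with different random states yields distinct training samples that share one ground truth label.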

Given that the CRNN model was trained on the target dataset alone, without NLP tools pre-trained on existing datasets, data augmentation processes were necessary to drastically increase the training data pool. This process added complexity to the task, challenging the model in training to identify similar but not identical-looking text, and thus improved its prediction capabilities. Data augmentation led to a decrease in CTC loss to 0.1 in training as well as after testing for 15 epochs, which was also aided by a learning rate scheduler set to 1e−3. The scheduler automatically decreases the model’s learning rate as the number of epochs increases until there is a negligible change in CTC loss, and thus no noticeable improvement in the model. By reducing the learning rate by a factor of 0.5 every 3 epochs (with a minimum rate of 1e−6, every 6 epochs), difficulty is gradually increased to keep challenging the model to become more accurate. A major obstacle for this approach was handling typeface variability, which was solved by re-training the model with the complete renAIssance dataset once it was fully expanded. Figure 3 shows the capability of this CRNN approach after training to transcribe text from a page it has never seen before: accuracy, although not yet perfect given that the model is still considered “in-progress”, is remarkably high with minor errors in “ooligaciones” (failing to properly identify the ‘b’) and “qunie” (instead of ‘quie + n’) and an unexpected result in “ojos” instead of the simple ‘a’ in the source.^[3] This approach achieved a CER of 0.0282, which equates to an accuracy that rose to 95.03 % at the character level. Like the SeqCLR approach, this model outputs single-word data, so WER cannot be calculated as an accuracy metric.
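The step decay described above can be written as a one-line function. This sketch assumes that 1e−3 is the initial rate and that 1e−6 acts as a floor; it leaves out the "every 6 epochs" qualifier because the text does not specify how that applies. The function name and parameters are hypothetical.

```python
def step_decay(epoch, initial=1e-3, factor=0.5, step=3, floor=1e-6):
    """Learning rate at a given epoch: multiply by `factor` every `step` epochs,
    never dropping below `floor`."""
    return max(floor, initial * factor ** (epoch // step))


# Rates over the 15 training epochs (0-indexed): 1e-3 for epochs 0-2,
# 5e-4 for epochs 3-5, and so on.
schedule = [step_decay(epoch) for epoch in range(15)]
```

Under these assumptions the rate is halved four times within 15 epochs and stays well above the floor, so the minimum would only come into play in much longer runs.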

Figure 3:

Example of CRNN model performance on previously unprocessed text, after training was completed through data augmentation.

4.2 SeqCLR and the Limitations of Contrastive Learning

The second approach consists of a deep learning model using self-supervised learning. Instead of using SeqCLR for word-level text recognition as proposed by Liu et al. (2022), the renAIssance team used it for line-level recognition by optimizing the encoder’s output (i.e., frame sequences), integrating Baek et al.’s (2019) CRAFT model and PyTorch (a framework for building deep learning models) to construct a pipeline that could deliver greater accuracy in word-level recognition. Our hypothesis was that a cleaner source frame would make identification easier in later stages. This architecture combines ResNet50 (a popular architecture used for computer vision tasks) and a 2-layer Bidirectional LSTM as the encoder (capable of bidirectional sequential processing both in the order of occurrence and in reverse), as well as an Attention LSTM Decoder. The ML model was trained on the archival sources, and automatically generated a dataset containing up to 700,000 synthetic word images with annotated bounding boxes, which allowed the model to focus on the main text instead of the marginalia.^[4] Essential features from the data were extracted by contrasting text images with their transformed copies, including vertical crop, Gaussian blur, random perspective, and random affine transformations. The decoder was trained using both the automatically generated images as well as the 3,966 archival folio page renAIssance dataset. CER was used to evaluate the model’s recognition accuracy, as it is generally deemed the most appropriate metric in the field for a decoder predicting characters (Figure 4).

Figure 4:

Example of annotated bounding boxes as used to train the model.

The initial learning rate was set to 0.01 for the CL (3 epochs) and fine-tuning phases (10 epochs). The loss, which indicates how well the model learns during training, remained constant, implying that the model did not learn effectively. In fact, SeqCLR performed better without CL than with it, contrary to the results reported in Liu et al. (2022): with CL, CER exceeded 10 %, whereas without CL, CER decreased to just 0.1. Hypothesizing that implementing CL compromised SeqCLR’s performance by forcing the model to learn unrelated information, the team altered two essential values: first, the temperature of the CL loss function, a hyperparameter that determines sensitivity to differences between sequences. Given that the features the model encounters during CL are abstract and difficult to distinguish, temperature was lowered to 0.5, which gradually decreased the loss value during this phase. Second, the team decreased the learning rate during fine-tuning, which dictates how much the model’s parameters are adjusted during training, to 0.0001. These changes brought CL to a 1 % improvement in CER over an approach without CL after 30 fine-tuning epochs, signaling that CL is a valuable tool that can eliminate the need for synthetic data. In parallel with the SeqCLR method, the team explored the potential of PerSec to further increase accuracy with CL. Unlike SeqCLR, PerSec repeatedly demonstrated a gradually decreasing CER during the CL phase in Liu et al. (2022). However, the fine-tuning phase generated poor results, producing identical outputs regardless of input. Further research is required to overcome this deficiency, as the CL components Stroke Context Perceiver and Semantic Context Perceiver may be overfitted (when an algorithm fits too closely to its training data and thus cannot make accurate predictions) to the task, reducing generalization.^[5] Despite these difficulties, this approach still managed to achieve a CER of 0.041, representing a level of accuracy of 95.81 % at the character level. Given that this model uses CRAFT to output data as segmented single words rather than complete sentence lines, WER values do not apply as a calculable metric for accuracy.
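The role of temperature in a contrastive loss can be seen in a small sketch. It uses a generic InfoNCE-style loss over cosine similarities, which is an assumption and not necessarily the exact formulation in the team's code; the vectors and the comparison value of 1.0 are likewise invented for illustration, since only the final temperature of 0.5 is reported.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature):
    """Contrastive loss for one anchor: low when the positive is the most
    similar candidate, with similarities scaled by 1 / temperature."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]
positive = [0.9, 0.1]              # a transformed copy of the same frame
negatives = [[0.0, 1.0], [-1.0, 0.0]]  # frames from other sequences
loss_t1 = info_nce(anchor, positive, negatives, temperature=1.0)
loss_t05 = info_nce(anchor, positive, negatives, temperature=0.5)
```

In this toy case the positive is already the closest candidate, so dividing by a smaller temperature sharpens the contrast and yields a lower loss at 0.5 than at 1.0.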

4.3 Vision Transformers

The third approach focuses on creating a ViT-based model. Transformers excel at handling sets of tokens through the mechanism of attention, involving a massive computational load as the number of tokens increases. Their application to images, however, can be challenging given that they must process rasterizations of pixels containing enormous numbers of individual data points. A simple 250 × 250 pixel image, while exceedingly small for most contemporary commercial uses in 2024, represents an oversized computational load for a Transformer: every pixel must attend to every other pixel in the training phase, leading to ((250)²)² or 3,906,250,000 connections; billions of calculations that make such arithmetical processing impractical for OCR. Instead of performing attention over individual pixels, the ViT operates on image patches, such as 16 × 16 pixels, significantly reducing the computational burden (Dosovitskiy et al. 2021). The renAIssance dataset scans were first processed through morphological dilation and binarization to remove visual artifacts through masking, as seen in Figure 5. The images were then divided and fed to the model, unrolled into vectors, and processed with the Transformer along with embeddings that were encoded considering position and dimension using sine and cosine functions. In this approach to OCR, positional embeddings are crucial for a successful model, given that transformers are inherently permutation invariant: changing the order of the input tokens would not affect the resulting output, so the embeddings inform the model about the position of each patch within the original image.
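The arithmetic above, and the sine/cosine position encoding, can be checked with a short sketch. Two assumptions are made: because 250 is not a multiple of 16, the patch count is rounded up as if the image were padded, and the encoding follows the widely used sinusoidal formula with base 10000, which may differ from the exact variant the team used. Function names are hypothetical.

```python
import math


def attention_pairs(height, width, patch=1):
    """Return (number of tokens, number of token-to-token attention pairs)."""
    tokens = math.ceil(height / patch) * math.ceil(width / patch)
    return tokens, tokens * tokens


def sinusoidal_encoding(position, dim):
    """Sine on even dimensions and cosine on odd ones, at geometrically
    spaced frequencies, so every position receives a distinct vector."""
    encoding = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


pixel_tokens, pixel_pairs = attention_pairs(250, 250)      # 62,500 tokens
patch_tokens, patch_pairs = attention_pairs(250, 250, 16)  # 256 tokens
first_patch = sinusoidal_encoding(0, 4)
second_patch = sinusoidal_encoding(1, 4)
```

Under these assumptions, moving from pixels to 16 × 16 patches reduces the attention pairs from 3,906,250,000 to 65,536, and the encodings of consecutive patches differ, which gives the otherwise order-blind Transformer a way to tell them apart.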

Figure 5:

Example of data processing through morphological dilation, binarization, and segmentation to achieve an artifact-free image for use in model training.
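The two operations named in the caption can be sketched on a tiny image represented as nested lists. The fixed threshold of 128, the 3 × 3 neighbourhood, and the function names are assumptions; the team's actual parameters are not stated, and a production pipeline would normally rely on an image-processing library.

```python
def binarize(image, threshold=128):
    """Map a grayscale image to a mask: 1 = ink (dark pixel), 0 = background."""
    return [[1 if p < threshold else 0 for p in row] for row in image]


def dilate(mask):
    """3x3 dilation: a pixel becomes ink if it or any of its 8 neighbours is ink."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(
                mask[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ))
    return out


# A single dark pixel on a light page grows into a 3x3 block after dilation.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 30
mask = dilate(binarize(page))
```

Dilation thickens faint or broken strokes so that they form connected regions, which makes it easier to separate text from isolated specks before segmentation.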

This model was trained on both non-augmented and augmented datasets with a cross-entropy loss function, which is useful to measure the difference between the predicted probability distribution for each of the tokens in a sequence and the ground truth. However, it evaluates each token independently, which can lead to suboptimal sequence generation, as it doesn’t account for the overall sequence quality. The approach started displaying signs of exposure bias, which, given that it trained on ground truth sequences while it generated its own predictions, can sometimes lead to discrepancies between training and inference performance. To solve this, the team shifted to use a beam search loss, which is often used to generate sequences because it explores multiple potential output sequences and selects the best one based on a scoring mechanism. By evaluating multiple candidate sequences and their likelihoods, the beam search loss optimizes for the best sequences by considering the sequence as a whole, improving coherence and accuracy in text generation. This change mitigated exposure bias and provided a more robust training signal for the model. These refinements achieved a CER of 0.049 and WER of 0.099, indicating a high level of accuracy at 95 % for characters and 90 % at the word level.^[6]
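The difference between scoring tokens one at a time and scoring whole sequences can be illustrated with a minimal beam search. This sketch covers only the search step, not the training objective the team used, and the toy next-token model `toy_model` is invented so that the locally best first token leads to a worse overall sequence.

```python
import math


def beam_search(step_fn, start, steps, beam_width):
    """Keep the `beam_width` highest-scoring partial sequences at each step.

    `step_fn(sequence)` returns a dict of next-token probabilities.
    Returns (best_sequence, log_probability).
    """
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_model(seq):
    """Hypothetical next-token distribution conditioned on the last token."""
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"x": 0.5, "y": 0.5},
        "b": {"x": 0.9, "y": 0.1},
    }
    return table[seq[-1]]


greedy_seq, greedy_score = beam_search(toy_model, "<s>", steps=2, beam_width=1)
beam_seq, beam_score = beam_search(toy_model, "<s>", steps=2, beam_width=2)
```

With a beam width of 1 the search commits to "a" (probability 0.6) and ends with a sequence of probability 0.30, whereas a width of 2 keeps "b" alive and finds "b x" with probability 0.36.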

4.4 Enhancing CNNs with TrOCR

The fourth and final approach began as a comparison between the MaskOCR and TrOCR methods applied to the renAIssance dataset. However, it soon became evident that the latter was a better fit for early modern archival sources, because (1) TrOCR was better optimized for text-to-sequence conversion, (2) it leveraged a pretrained language model that enhanced linguistic structure and context handling, and (3) it proved easier to fine-tune than MaskOCR. For these reasons and given TrOCR’s generalizability as well as its robust initial renderings of seventeenth-century Spanish documents, MaskOCR was de-emphasized. The team expected that TrOCR applied to a CNN should surpass a traditional CNN-based approach, given that the latter might face limitations in handling diverse and complex dataset variability, such as the common printing variances encountered in early modern sources.

Data was pre-processed by converting raw scanned text images into formats that are optimal for model inference: this workflow involved a variety of stages that included PDF preprocessing, rescaling, binarization, noise removal, dilation/erosion, rotation/de-skewing, borders, transparency/alpha channel, and line segmentation. This approach also integrated Baek et al.’s (2019) CRAFT model to segment pages into individual lines to execute preprocessing and increase the chances of optimal detection performance. With text lines detected and segmented, individual line segments were matched with their corresponding source text to create a comprehensive dataset for training the OCR model (see Figure 6). Line segmentation was improved by using OpenCV-based methods to better understand the structure of printed texts and remove outliers. Finally, the TrOCR model was fine-tuned on Spanish text by resizing images and dividing them into 16 × 16 patches as described in the ViT study carried out by Dosovitskiy et al. (2021). By establishing a correct ground truth with the provided transcriptions, the code was trained to account for printing practice variability in order to learn contextual representation at the character level, which was expected to aid in resolving ambiguities such as the interchangeability of ‘u’/‘v’ as well as the characters ‘f’/‘ſ’/‘s’.
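Grouping detected text regions into lines and reading order, as the segmentation step requires, can be sketched as follows. This is not the team's CRAFT or OpenCV code: the box format `(x, y, width, height, word)`, the `line_tolerance` of 10 pixels, and the sample coordinates are all assumptions for illustration.

```python
def reading_order(boxes, line_tolerance=10):
    """Group word boxes into lines by vertical proximity, then sort each line
    left to right. Each box is (x, y, width, height, word)."""
    lines = []
    # Visit boxes from top to bottom by vertical centre.
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        centre = box[1] + box[3] / 2
        if lines and abs(centre - lines[-1]["centre"]) <= line_tolerance:
            lines[-1]["boxes"].append(box)
        else:
            lines.append({"centre": centre, "boxes": [box]})
    return [[b[4] for b in sorted(line["boxes"], key=lambda b: b[0])]
            for line in lines]


# Detections arrive in arbitrary order; coordinates are invented.
detections = [
    (120, 12, 40, 20, "inconstante"),
    (10, 10, 30, 20, "uno"),
    (80, 52, 10, 20, "o"),
    (10, 50, 60, 20, "cessando,"),
]
ordered = reading_order(detections)
```

Once words are ordered this way, each reconstructed line can be paired with the corresponding line of a manual transcription to form a training example.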

Figure 6:

Processing steps from a raw scanned image, to deskewing, binarization, noise removal, border removal, and final line segmentation filtered for padding, thresholds, overlapping texts, marginalia, headers, and footers.

The model was trained to account for printing practice variability within the initial dataset. This helped the algorithm to learn contextual representation at the character level, which was fundamental in aiding the code to resolve expected ambiguities such as the marginalia that the team had chosen to ignore for this project, or the interchangeability of ‘u’/‘v’ as well as the distinction between the ‘f’/‘ſ’ characters common in early seventeenth-century print. This model was able to account for letter exchanges quite adeptly, making some general assumptions for the most likely candidate depending on word position. While this assumption is necessarily generalizing and thus incorrect, it minimized CER by accounting for Spanish word morphology and likely character combinations. This approach would inevitably require some human correction at the conclusion of the transcription, but the consistency the general assumption provides eased the complexity of training the AI. Some cases were easy to code for, such as the use of ‘ç’ or cedilla, which was no longer used after an orthographic reform in the eighteenth century, replaced by ‘z’.

In cases such as Figure 7, the assumption of ‘z’ whenever a ‘ç’ was present posed no difficulty: it required only the correct OCR result followed by a simple substitution. The model did not output “çar, como dixo Ciceron”, but “zar, como dixo Ciceron,” the desired result of the renAIssance team. When it came to ‘u’/‘v’ interchangeability, however, the process was more complex, such as with the example in Figure 8.

Figure 7:

Extract from Padilla’s Nobleza Virtuosa, p. 198.

Instead of transcribing this short passage incorrectly as “vno inconftante ceffando, o necio fi” in Figure 8, the algorithm was able to account for initial position ‘u’/‘v’ assuming a higher likelihood of ‘u’ instead of ‘v’, as well as intra-word ‘ſ’ as ‘s’ in most likely scenarios. Thus, the code correctly transcribed “uno inconstante cessando, o necio si” in this instance. This adaptability is a direct result of the use of TrOCR applied to a CNN, given this method’s aforementioned flexibility to recognize text in different conditions. A similar versatility was observed when it came to ‘ſ’/‘s’ interchangeability.
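For comparison, the positional assumptions discussed here can be written as explicit substitution rules. This is a rule-based baseline and not the team's method, which learned these regularities during fine-tuning; the function name and the consonant list are assumptions, and the input is assumed to preserve the long s (‘ſ’) as its own character.

```python
import re

# Consonants that may follow a word-initial 'v' standing for the vowel 'u'.
CONSONANTS = "bcdfghjklmnpqrstvxz"


def modernize(text):
    """Apply three simple transliteration rules to an early modern reading."""
    text = text.replace("ſ", "s")                    # long s -> s
    text = text.replace("ç", "z").replace("Ç", "Z")  # cedilla -> z
    # Word-initial 'v' before a consonant is read as 'u' (vno -> uno).
    return re.sub(rf"\bv(?=[{CONSONANTS}])", "u", text)


sample = modernize("vno inconſtante ceſſando, o necio ſi")
cedilla = modernize("çar, como dixo Ciceron")
```

Fixed rules of this kind handle the passages in Figures 7 and 8, but they cannot weigh context or word frequency, which is where a model with a pre-trained language component has the advantage.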

Figure 8:

Extract from Padilla’s Nobleza Virtuosa, p. 198.

The model was trained to assume intra-word ‘f’/‘s’ as likely ‘f’, while at the initial or final position of a word the character should most likely be ‘s’ given how pluralization works in Spanish as well as the minimal number of words that end in ‘f’. As seen in Figure 9, the model correctly outputs “stro” for the first word (itself the second component of a word hyphenated in the previous text line) and “antes” for the last one. However, the expectation for the second and third words would have been an output of “gufto” and “confiderando”, yet they were correctly rendered as “gusto” and “considerando.” The multiple rounds employed to pre-train and subsequently train the ML algorithm made their usefulness evident in this case, as the code learned from the context and frequency of certain words and phrases. Despite the assumption about ‘f’/‘s’ positioning, the TrOCR-enhanced CNN was able to spell the words correctly because it recognized that “gusto” and “considerando” are far more likely in the given context than their alternative forms. This suggests that the fine-tuning process allowed the model to balance initial assumptions with learned linguistic patterns, leading to accurate predictions even in cases where the assumptions might have led to errors. This approach achieved a CER of 0.024 and WER of 0.047, indicating an excellent level of accuracy at 97.5 % for characters and a similarly high 95.22 % accuracy at the word level.^[7] Table 1 compares the CER, WER, and accuracy values of this approach with the other three explored by the renAIssance team.

Figure 9:

Extract from Padilla’s Nobleza Virtuosa, p. 198.

Table 1:

A comparison between the CER, WER, and accuracy values for all four AI/ML approaches studied by the renAIssance team.

Approach	CER	WER	Accuracy (character level)
CRNN	0.0282	n/a	95.03 %
SeqCLR	0.0410	n/a	95.81 %
ViT	0.049	0.099	95 %
TrOCR	0.024	0.047	97.5 %
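For readers unfamiliar with the metrics in Table 1, the sketch below computes CER and WER as edit distance divided by reference length, which is their standard definition. The function names are hypothetical, and the reference strings are inferred from the errors described in the text (“ooligaciones” for “obligaciones”, and the missing final ‘a’ in the Covarrubias extract). A common convention reports accuracy as 1 − CER, although the exact convention behind the accuracy column is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


word_cer = cer("obligaciones", "ooligaciones")  # one substitution in 12 characters
reference = ("santa casa de San Salvador de Oviedo. "
             "El primero el gran Diego de Covarruvias")
output = ("santa casa de San Salvador de Oviedo. "
          "El primero el gran Diego de Covarruvis")
line_cer = cer(reference, output)
line_wer = wer(reference, output)  # one wrong word out of 14
```

The second example shows why WER is always the harsher metric: a single missing character counts as one character error but invalidates an entire word.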

As the variability of the dataset progressively increased, this model encountered an obstacle caused by the initial assumptions on out-of-frame marginalia with which it was coded: the algorithm was trained to ignore marginalia and instead focus on the main framed text (see blue/green lines in Figure 6), as this was the printing style used in all of Padilla’s texts. However, upon expanding the dataset to a broader set of source texts as detailed in the Methods section, the model continued to apply this assumption to texts that had no frame or marginalia. Seeing neither of these two parameters, the model opted to ignore the last word of each line of text, expecting it to be text outside the frame as if it were marginalia. This was not an error by the model: the obstacle stemmed from the assumptions with which the code was created, which were accurate for the Padilla texts but not applicable as a general principle to transcribe a more heterogeneous corpus of archival materials. The solution was straightforward: the model was trained to differentiate whether there is a frame encasing the main text as a trigger to account for marginalia, making the algorithm incrementally more generalizable without sacrificing specialized performance achieved on the existing dataset.
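The logic of using a detected frame as the trigger can be sketched as a simple filter. This is a hypothetical illustration of the decision only: the team's solution was learned by the model, and the function name, the box format `(x, y, width, height)`, and the coordinates are assumptions.

```python
def keep_main_text(boxes, frame=None):
    """Return the boxes that belong to the main text.

    `frame` is the printed frame as (x, y, width, height), or None when no
    frame was detected, in which case nothing is treated as marginalia.
    """
    if frame is None:
        return list(boxes)
    fx, fy, fw, fh = frame
    return [
        (x, y, w, h) for (x, y, w, h) in boxes
        if fx <= x and fy <= y and x + w <= fx + fw and y + h <= fy + fh
    ]


words = [(10, 10, 20, 10), (110, 10, 20, 10)]
framed_page = keep_main_text(words, frame=(0, 0, 100, 100))  # drops the margin note
unframed_page = keep_main_text(words)                         # keeps every word
```

Making the frame an explicit condition keeps the marginalia rule for pages printed like Padilla's while preventing it from discarding the last word of each line on unframed pages.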

5 Discussion

The four machine learning-based approaches presented in the findings were successful in providing OCR digitization of early modern historical documents, with varying degrees of broader applicability. Although the TrOCR enhanced CNN model ended up outputting some of the most usable results, this does not imply the other approaches were less valuable; rather, their application may be more suitable to other types of printed text (e.g., from different historical periods, with less noise interference impeding character recognition, or upon further expanding the dataset materials to be used as a richer basis of ground truth that is simply not available within the historical documents selected for this study). There are several inferences that can be made about best practices for future research work that uses machine learning algorithms for humanities research, especially when it comes to transcription of historical sources that target archival materials of varying levels of scan quality.

First and foremost, the consistency and clarity of the archival sources that comprise the dataset is of fundamental importance to the success of the model. The team opted for a tightly controlled and cohesive initial dataset to provide predictable ground truth to easily train the algorithms, enriched by data augmentation to insert some noise and complexity while preserving the same variability limitations when it comes to character recognition. Once the accuracy reached a certain level, the dataset was expanded to allow for more complex training on diversified data to improve the generalization ability of the models. This training granularity allowed the renAIssance team to ascertain when the model became sufficiently accurate on one particular source (Padilla’s books) and only then explore greater adaptability and generalization (e.g., transcribing a more diverse set of print styles but with less accuracy).

Second, initial design choices carefully taken into account at the project’s onset should be reevaluated as the dataset itself evolves. As seen in the fourth approach where the model applied text frame and marginalia assumptions indiscriminately, some project assumptions have the upside of successfully specializing an algorithm to a particular set of texts, but also have the downside of negatively affecting the code’s capacity for generalization. Although guiding principles such as the ones adopted by the renAIssance team were useful for the initial dataset, they could not cover – by design – the vast amounts of possible styles in the early modern print market. Consequently, depending on a research team’s objectives, more sophisticated ML models are needed to completely automate historical document transcription to avoid a manual or case by case process. In the meantime, algorithm fine-tuning is essential to the success of a focused artificial intelligence/humanities research endeavor.

Third, collaborative work throughout the coding phase provides a greater number of opportunities for creativity and problem-solving. While the four approaches to this study were selected to identify the best approach, they greatly informed and complemented each other. Improvements in code, algorithm design, and problem resolution benefited directly from a team structure that encouraged communication, information sharing, and a collaborative research methodology.

Fourth, our study suggests that pre-training the ML models to interpret the expected spelling irregularities in the archival source texts was key to the project’s success. OCR output analysis revealed that the progressive evolution of the dataset allowed the model to be scaled up in a controlled environment, opening the door to further expansions in the future that may process large collections of historical documents while increasing the likelihood of robust and accurate text recognition across diverse historical sources.

Given these inferences, informed by the findings throughout this study, there is strong evidence that certain best practices can increase the probability of success of future research undertakings that integrate artificial intelligence and machine learning with humanities research: patch extraction, positional embeddings, sequencing fed to an encoder, classification tokenization, and pre-training on both small and large databases depending on the scope of the project. Future iterations of research endeavors such as the one presented here may choose to train an ML model not just to read printed texts, but manuscripts that are exponentially harder for either researchers or computers to transcribe. Further development is also needed to optimize the CRAFT model to avoid errors in word-level segmentation, which was an obstacle for the renAIssance team when automating data annotation. Other research endeavors could build upon this ML OCR transcription work and enrich it with a process that checks transcribed output text against a digital dictionary, or even more advanced large language models based on generative AI/ML tools such as the recent ChatGPT 4o.

Ongoing work in optimizing approaches such as the specialized one taken by renAIssance is crucial to achieve greater viability in integrating AI/ML approaches with the Humanities. The team continues to compile materials to create an increasingly comprehensive and diverse early modern Spanish print text dataset to train the model towards the development of a freely shareable transcription tool. This has the potential to enable, accelerate, and democratize access to AI/ML OCR transcription by libraries, archives, and interested institutions around the world that hold thousands of these documents but currently lack the resources to OCR them. Finally, renAIssance is committed to providing fully open access resources, from the algorithms to the end-user tool, as signaled by our code’s availability on GitHub.^[8] Transkribus’s recent shift to a premium subscription model limits free use of the app (albeit to a generous 3,000 pages), which ultimately reduces its utility and accessibility for entities working with large datasets at the scale envisioned in renAIssance’s long-term objectives.

6 Conclusions

The findings in this study are tied to specific documents, as demonstrated by the initial controlled dataset and its careful progressive expansion towards greater variability and thus an increased AI/ML model design challenge. The fact that the four algorithm models in this study achieved an accuracy of at least 95 % represents a remarkable achievement for Digital Humanities. Given the renAIssance team’s target of transcribing specific sources in lieu of creating a fully generalizable ML approach to OCR, this study demonstrates that integrating artificial intelligence into the transcription of historical archival sources is not only feasible, but potentially groundbreaking for the field. Ongoing work on targeted efforts such as renAIssance and Transkribus is crucial, given that OCR tools widely available to regular consumers such as Adobe Acrobat are largely unable to handle transcription challenges presented by historical documents. As the Covarrubias extract in Figure 10 shows, processing it with Adobe’s Acrobat Pro software rendered “fancacafadeSanSaluadordeOuiedo.Klpr’meroelgranDiegodeCouarruuiss.” By contrast, the renAIssance approach rendered a much more accurate “santa casa de San Salvador de Oviedo. El primero el gran Diego de Covarruvis” as its transcription, missing only one vowel (‘a’) at the end of the sentence.

Figure 10:

Extract from Covarrubias’s Tesoro de la lengua castellana o española, p. 8.

Expanding the size of the dataset over the course of the study, and its corresponding increased variability, proved beneficial to make the algorithm more generalizable. Taken together, these findings suggest that despite TrOCR’s higher final accuracy, all four approaches employed by renAIssance have different uses that may be more or less productive depending on the materials to be transcribed. Envisioning the potential of combining such different models relative to a research project’s goals allows humanities researchers to not just conceptualize the positive disruption that ML may bring to so-called ‘traditional’ fields such as literature and history, but to develop new research practices that can bring them into the future of the AI economy. This study shows that as ML improves its capabilities through further development, exploration, and fine-tuning, the development of field-specific artificial intelligence algorithms enables humanities researchers to conduct new forms of investigation. The amount of computation power made possible by machine learning would be impossible or impracticable for a human to do manually (i.e., exploring printing variability akin to the practices detailed in this article, but expanding the archival corpus to thousands of sources that would take a human being a lifetime to gather, catalog, analyze, and study).

The four methods to train an ML model presented in this article can be applied to different datasets, and they all hold value as potential pathways to transcribe different text sources. Altogether, these methods create more efficient and analytically divergent technical possibilities for the study of historical documents. As the work of the renAIssance team in this study shows, the door is open for humanities researchers to develop diverse approaches enhanced and/or enabled by integrating artificial intelligence into academic fields often thought of as ‘traditional’ disciplines. In this age of technological progress and scientific advancement at an increasingly rapid cadence, the Humanities must harness the power of AI/ML and evolve from siloed and heterogeneous ‘traditional’ repositories of knowledge to develop a broadened, diversified, and more accessible digital archive for the twenty-first century.

Corresponding author: Xabier Granja Ibarreche, Modern Languages and Classics, The University of Alabama, Box 870246, 400 McCorvey Drive, Tuscaloosa, AL 35487-0246, USA, E-mail: xgranja@ua.edu

Arsh Khan, Utsav Rai, Shashank Shekhar Singh, Yukinori Yamamoto contributed equally to this work. They are recognized first among the authors due to Google’s stipulations regarding inclusion of Summer of Code contributors in publications that result from work funded by GSoC.

Research ethics: Not applicable.
Informed consent: Not applicable.
Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission. Gleyzer: Conceptualization (Supporting), Funding acquisition, Project administration, Supervision. Granja Ibarreche: Conceptualization (Lead), Formal analysis, Funding acquisition, Project administration, Resources, Supervision, Writing. Khan: Data curation, Investigation, Methodology, Software, Validation. Meadows: Formal analysis, Project administration, Resources, Supervision, Writing. Rai: Data curation, Investigation, Methodology, Software, Validation. Shekhar Singh: Data curation, Investigation, Methodology, Software, Validation. Yamamoto: Data curation, Investigation, Methodology, Software, Validation.
Use of Large Language Models, AI and Machine Learning Tools: None declared.
Conflict of interest: The authors state no conflict of interest.
Research funding: A.K., U.R., S.S.S., and Y.Y. were participants in the renAIssance project, coordinated by the HumanAI Organization, as part of the 2024 Google Summer of Code.
Data availability: Not applicable.

References

Aberdam, Aviad, Ron Litman, Shahar Tsiper, Oron Anschel, Ron Slossberg, Shai Mazor, R. Manmatha, and Pietro Perona. 2021. “Sequence-to-Sequence Contrastive Learning for Text Recognition.” In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15297–307. IEEE. https://doi.org/10.1109/CVPR46437.2021.01505.

Buendía, Fausto Agustín de. 1740. Instrucción de christiana y política cortesanía con Dios y con hombres. Gerona: Jayme Bró.

Baek, Youngmin, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. 2019. “Character Region Awareness for Text Detection.” In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9357–66. Long Beach, CA: IEEE. https://doi.org/10.1109/CVPR.2019.00959.

Bazzaco, Stefano, Ana Milagros Jiménez Ruiz, Mónica Martín Molares, and Ángela Torralba Ruberte. 2022. “Sistemas de reconocimiento de textos e impresos hispánicos de la Edad Moderna. La creación de unos modelos de HTR para la transcripción automatizada de documentos en gótica y redonda (s. XV–XVII).” Historias Fingidas (Special Issue 1 Humanidades Digitales y estudios literarios hispánicos): 67–125. https://doi.org/10.13136/2284-2667/1190.

Blasut, Giada. 2022. “Los modelos de HTR Silves1549_BNE y Spanish Gothic como herramientas de la labor ecdótica.” Historias Fingidas (Special Issue 1 Humanidades Digitales y estudios literarios hispánicos): 175–93. https://doi.org/10.13136/2284-2667/1178.

Christy, Matthew, Anshul Gupta, Elizabeth Grumbach, Laura Mandell, Richard Furuta, and Ricardo Gutierrez-Osuna. 2017. “Mass Digitization of Early Modern Texts with Optical Character Recognition.” Journal of Computing and Cultural Heritage 11 (1): 1–25. https://doi.org/10.1145/3075645.

Cuéllar, Álvaro. 2023. “La Inteligencia Artificial al rescate del Siglo de Oro. Transcripción y modernización automática de mil trescientos impresos y manuscritos teatrales.” Hipogrifo. Revista de literatura y cultura del Siglo de Oro 11 (1): 101–15. https://doi.org/10.13035/H.2023.11.01.08.

Covarrubias, Sebastián de. [1611] 1674. Tesoro de la lengua castellana o española. Madrid: Melchor Sánchez.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In Long and Short Papers, edited by Jill Burstein, Christy Doran, and Thamar Solorio, 4171–86. Vol. 1 of Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, MN: Association for Computational Linguistics.

Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2021. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” Paper Presented at the Ninth International Conference on Learning Representations, Virtual Conference, May 2021. https://doi.org/10.48550/arXiv.2010.11929.

Ezcaray, Antonio de. 1691. Vozes del dolor nacidas de la multitud de pecados…. Sevilla: Thomas López de Haro.

Guardiola, Juan Benito. 1591. Tratado de la nobleza, y de los títulos y ditados que oy día tienen los varones claros y grandes de España. Madrid: Viuda de Alonso Gómez.

Li, Minghao, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and F. Wei. 2023. “TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models.” Proceedings of the AAAI Conference on Artificial Intelligence 37 (11): 13094–102. https://doi.org/10.1609/aaai.v37i11.26538. https://www.microsoft.com/en-us/research/publication/trocr-transformer-based-optical-character-recognition-with-pre-trained-models/.

Liu, H., B. Wang, Z. Bao, M. Xue, S. Kang, D. Jiang, Y. Liu, and B. Ren. 2022. “Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition.” Proceedings of the AAAI Conference on Artificial Intelligence 36 (2): 1702–10. https://doi.org/10.1609/aaai.v36i2.20062.

Lyu, Pengyuan, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, and Jingdong Wang. 2022. “MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining.” arXiv preprint. https://doi.org/10.48550/arXiv.2206.00311.

Martínek, Jiří, Ladislav Lenc, and Pavel Král. 2020. “Building an Efficient OCR System for Historical Documents with Little Training Data.” Neural Computing and Applications 32 (23): 17209–27. https://doi.org/10.1007/s00521-020-04910-x.

Mendo, Andrés. 1662. Príncipe perfecto y ministros aiustados, documentos políticos y morales: en emblemas. León de Francia: Horacio Boissat y George Remeus.

Milán, Luís. 1561. Libro intitulado el Cortesano, dirigido a la Catholica, Real Magestad, del Invictíssimo don Phelipe, por la gracia de Dios Rey de España nuestro señor. Valencia: Casa de Ioan de Arcos.

O’Shea, Keiron, and Ryan Nash. 2015. “An Introduction to Convolutional Neural Networks.” ArXiv e-prints, 1–11. https://doi.org/10.48550/arXiv.1511.08458.

Padilla, Luisa de. 1637. Nobleza virtuosa. Zaragoza: Juan de Lanaja.

Padilla, Luisa de. 1639. Noble perfecto y segunda parte de la Nobleza virtuosa. Zaragoza: Juan de Lanaja.

Paredes, Alonso Víctor de. 1680. Institucion, y origen del arte de la Imprenta, y reglas generales para los componedores. Madrid.

Reul, Christian, Dennis Christ, Alexander Hartelt, Nico Balbach, Maximilian Wehner, Uwe Springmann, Christoph Wick, Christine Grundig, Andreas Büttner, and Frank Puppe. 2019. “OCR4all—An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings.” Applied Sciences 9 (22): 4853. https://doi.org/10.3390/app9224853.

Schmidt, Robin M. 2019. “Recurrent Neural Networks (RNNs): A Gentle Introduction and Overview.” ArXiv e-prints, 1–16.

Shi, Baoguang, Xiang Bai, and Cong Yao. 2017. “An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (11): 2298–304. https://doi.org/10.1109/TPAMI.2016.2646371.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention is All You Need.” In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), edited by Isabelle Guyon, Ulrike Von Luxburg, Samy Bengio, Hannah Wallach, Rob Fergus, S.V.N. Vishwanathan, and Roman Garnett, 5999–6009. Vol. 30 of Advances in Neural Information Processing Systems. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Received: 2024-08-30

Accepted: 2024-11-06

Published Online: 2024-12-02

This work is licensed under the Creative Commons Attribution 4.0 International License.


https://doi.org/10.1515/dsll-2024-0013

Keywords for this article

artificial intelligence; machine learning; archival research; transcription; OCR
