OCR for old prints - opportunities and challenges
The rapid development of information technology we see today has opened a completely new empirical perspective for linguistic research, mainly owing to the possibility of creating models that describe the statistical features of a language. This is possible on the basis of a so-called corpus, that is, a collection of texts representative of a given language. For those dealing with contemporary Polish, an invaluable resource is the National Corpus of Polish, comprising one and a half billion words. For obvious reasons, one cannot expect a corpus of comparable size for historical Polish ever to be created. Nevertheless, some tools that may prove useful to researchers have been developed, including a noteworthy, extensive corpus of 16th-century Polish containing texts from the Dictionary of 16th-Century Polish, collated in accordance with TEI (Text Encoding Initiative) recommendations. Moreover, a computer database for the corpus of Polish texts of the 17th and 18th centuries (up to 1772) is being systematically built, including the source quotations for the entries in the Electronic Dictionary of 17th- and 18th-Century Polish. In the era of mass digitization of library collections, however, the possibilities are much greater still.
POLONA already provides access to thousands of high-resolution digital versions of old prints, and this number grows month by month as the “Patrimonium” project progresses. These resources represent unprecedented potential for building an extremely extensive diachronic corpus of the Polish language – all that is needed is to read the printed texts and save them in a computer-readable form. Given the tremendous progress in algorithms for image processing and analysis, automation of this process seems within reach. What is more, it seems possible to develop software for creating digital versions of documents in XML format from a series of bitmaps. In the case of old prints, however, this still remains a great challenge.
Fragment of the Psalter by Jan Kochanowski (Krakow, 1617) edited in TEI XML format
What is OCR?
The OCR (Optical Character Recognition) process first requires identification of the set of characters to be recognized. The key algorithm boils down to extracting a specific object from a given matrix of pixels (a bitmap) and assigning it to one of these identified characters. Automatic recognition of clearly printed text in a standard Latin typeface, scanned at high resolution, is no longer an issue – modern OCR software can process it with a negligible error rate. In the case of texts published centuries ago, before extensive standardization and automation allowed for a high degree of print uniformity, the situation is far different.
In the traditional printing process developed by Johannes Gutenberg, the original shape of each printed character was precisely engraved in a metal punch. The next stages consisted in copying this shape onto subsequent carriers – striking it into a copper block called a matrix, transferring it to type cast from that matrix, and ultimately stamping it on paper with printing ink. OCR software essentially reverses this process – from the printed shapes it reconstructs the engraved punches.
Claude Garamond’s typeface – punches, matrices, type and print
The first stage in the process of automatic text editing is the digital mapping of the document. The quality of such representation – including sharpness of contours, resolution, color depth – sets the limits for further processing capabilities. In the case of historical and particularly protected objects including old prints, this procedure should be applied only once, therefore it is of pivotal importance for the success of the entire undertaking. From this moment on, the process is applied not to a physical object any more, but to an abstract representation in the form of a set of mathematical matrices.
Contemporary printed documents may easily be placed on a flatbed scanner and pressed flat as necessary. In the case of historic objects, however, the digitization process must be absolutely safe: mechanical intervention should be limited to an absolute minimum, which usually rules out opening an old book flat – any distortions that arise as a result must instead be corrected during digital processing of the output files. Given that we often deal with volumes substantially damaged by the passage of time, advanced processing turns out to be required already at the stage of preliminary image preparation.
In addition to correcting any distortions, bends or corrugations, so as to make the lines of text as straight and horizontal as possible, a number of other issues need to be resolved. These include all damage caused during the use of the document – tears, erasures, fading, stains and other dirt – but also imperfections resulting from the original printing process. The press could spread ink unevenly over the hand-cast, hand-set type, which adhered better or worse to the paper. In addition, hand-made paper might not absorb the hand-mixed ink well enough, or the ink could visibly bleed through to the other side of the sheet.
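To illustrate how such straightening can work in practice, the skew of a page can be estimated from horizontal projection profiles. This is a minimal sketch in Python, assuming a pre-binarized NumPy image and approximating small rotations by a vertical shear; production systems use more robust geometric models:

```python
import numpy as np

def shear(binary, angle_deg):
    """Shift each pixel column vertically in proportion to its x position,
    approximating a rotation by a small angle."""
    shift = np.tan(np.radians(angle_deg))
    out = np.zeros_like(binary)
    for x in range(binary.shape[1]):
        out[:, x] = np.roll(binary[:, x], int(round(x * shift)))
    return out

def skew_score(binary, angle_deg):
    """Variance of the row ink profile after shearing: a deskewed page has
    sharp peaks (text lines) separated by empty valleys."""
    return float(np.var(shear(binary, angle_deg).sum(axis=1)))

def estimate_skew(binary, search_deg=2.0, step_deg=0.1):
    """Brute-force search for the angle that best straightens the lines."""
    angles = np.arange(-search_deg, search_deg + step_deg, step_deg)
    return float(max(angles, key=lambda a: skew_score(binary, a)))
```

Applying `shear` with the negated estimated angle then straightens the text lines before further processing.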
Pre-processing of the fragment of Jan Kochanowski’s Psalter (Kraków, 1587)
This is partly why high quality in the digitization of an old print is so critical – significant color depth allows for analysis enabling optimal normalization and binarization of the image. Ultimately, the entire color space of each element of the matrix is reduced to a single bit: preliminary processing boils down to distinguishing the print from the background. Sometimes, however, fragments of the text are completely illegible or damaged; in such cases accurate binarization is impossible, and such instances should be detected and marked in advance to avoid distortions in the edited text.
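One classical way to choose the ink/background split is Otsu's method, which picks the threshold maximizing the separation between the two classes of pixels. A minimal sketch follows; for degraded pages, real systems typically prefer adaptive, locally varying thresholds:

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: choose the greyscale threshold that maximizes the
    between-class variance of the ink/background split."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = gray.size
    sum_all = float(np.dot(np.arange(256), hist))
    best_t, best_var = 0, 0.0
    w_bg, sum_bg = 0, 0.0
    for t in range(256):
        w_bg += int(hist[t])          # pixels at or below the threshold
        if w_bg == 0:
            continue
        w_fg = total - w_bg           # pixels above the threshold
        if w_fg == 0:
            break
        sum_bg += t * int(hist[t])
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t
```

Pixels at or below the returned threshold are then treated as ink, the rest as background.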
A damaged and illegible fragment of Jan Kochanowski’s Psalter (Kraków, 1583)
Here we proceed to the next stage, that is, image segmentation. A great variety of algorithms may be used for page layout analysis, following two general strategies. The first (“top-down”) consists in examining the structure from the most general level and recursively segmenting the page contents into smaller and smaller fragments until individual lines, words and characters are extracted. The alternative strategy (“bottom-up”) consists in separating individual objects and combining them into larger and larger structures. Each approach has its advantages and disadvantages, so it seems appropriate to develop a suitable, comprehensive combination of both – especially as the goal is not only to read the printed text, but to edit it in TEI XML format.
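The core of the bottom-up strategy can be illustrated by a connected-component pass that groups adjacent ink pixels into candidate glyphs. A simplified sketch, assuming a small binary image given as nested lists:

```python
from collections import deque

def connected_components(binary):
    """Bottom-up segmentation: flood-fill adjacent ink pixels into
    components, returning one bounding box (top, left, bottom, right)
    per candidate glyph."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                queue = deque([(y, x)])
                seen[y][x] = True
                t, l, b, r = y, x, y, x
                while queue:
                    cy, cx = queue.popleft()
                    t, l = min(t, cy), min(l, cx)
                    b, r = max(b, cy), max(r, cx)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((t, l, b, r))
    return boxes
```

The resulting boxes would then be merged into words, lines and blocks by proximity – the "larger and larger structures" mentioned above.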
Page layout in old prints may be sophisticated, and it differs significantly from the layout of contemporary documents, for which OCR software is designed. It is therefore necessary to identify various textual elements, such as decorative initials. A further complication is the use of different fonts, which should also be identified and properly marked in the edited version. It may happen that the algorithm identifying text areas is incapable of distinguishing text from decorative elements or illustrations.
It is only now that classification of individual characters can begin. This procedure is based on a dedicated model which classifies each element according to a set of selected, distinctive features. For old prints, such a model must be trained from scratch. This is not only because outdated typefaces must be properly recognized. The main issue is the absence of a common spelling standard in the past and the dynamically changing lettering, which for obvious reasons drew partly on the handwriting tradition; as a result, the glyphs include a vast array of ligatures and allographic variants such as ſ, the long ‘s’. In addition, writers experimented with specific diacritics, since adapting the Latin alphabet to Polish phonology was by no means an obvious matter. For this reason, the range of characters used was extremely wide.
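By way of illustration only – no particular OCR engine's method is implied here – such a feature-based classifier can be sketched with simple "zoning" features (ink density over a grid) and a nearest-neighbour lookup against labelled training glyphs:

```python
def zone_features(glyph, zones=3):
    """Crude 'zoning' features: ink density in a zones x zones grid,
    one simple choice of distinctive features for a glyph."""
    h, w = len(glyph), len(glyph[0])
    feats = []
    for zy in range(zones):
        for zx in range(zones):
            ys = range(zy * h // zones, (zy + 1) * h // zones)
            xs = range(zx * w // zones, (zx + 1) * w // zones)
            ink = sum(glyph[y][x] for y in ys for x in xs)
            feats.append(ink / ((len(ys) * len(xs)) or 1))
    return feats

def classify(glyph, training):
    """Return the label of the training glyph with the closest features.
    `training` is a list of (glyph_bitmap, label) pairs."""
    f = zone_features(glyph)
    def dist(item):
        g = zone_features(item[0])
        return sum((a - b) ** 2 for a, b in zip(f, g))
    return min(training, key=dist)[1]
```

Real systems use far richer features and statistical models, but the principle – map a glyph to a feature vector, then to the nearest known character – is the same.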
Polish alphabet proposed by Łukasz Górnicki (Krakow, 1594)
For the reasons specified above, choosing a proper character-encoding system is an important issue that should be settled before the model is trained. Even with its extensive range of characters, the Unicode standard might not be sufficient. It seems advisable to use its Private Use Area in accordance with the MUFI (Medieval Unicode Font Initiative) recommendations, but a large number of glyphs would still remain uncovered. Since the target output format of the entire process is XML, defining appropriate character references may prove a solution; however, this too requires a general standard to be established.
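Since XML numeric character references can carry any code point, including Private Use Area assignments, one simple convention is to escape everything outside printable ASCII. A sketch of such an escaper follows; note that the PUA code points themselves remain meaningless without an agreed standard such as MUFI behind them:

```python
def to_char_refs(text):
    """Escape every character outside printable ASCII as an XML numeric
    character reference, so historical glyphs and Private Use Area code
    points survive any XML toolchain unchanged."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp < 0x7F:
            out.append(ch)
        else:
            out.append(f"&#x{cp:04X};")
    return "".join(out)
```

For example, the long s (U+017F) becomes `&#x017F;`, readable in any editor and round-trippable through any XML parser.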
The classification model may be built in many different ways, and there are likewise many methods of selecting and extracting specific features; each OCR system takes a different approach, and choosing the most suitable one requires in-depth analysis and much experimentation. In any case, however, the basis for training such a model is a set of ground truth data – chiefly, a set of digital images subjected to segmentation, together with the correctly recognized text of the corresponding print.
Such a set is further divided into two separate data sets. The first is used for training the model, the second for testing – the effectiveness of the model is evidenced by how well it recognizes text in images it has not processed before. At the very beginning, the ground truth set must be created manually, which is a very laborious and time-consuming task. However, as better and better models are created and tested, collecting such data should proceed ever more efficiently, thanks to the growing body of already generated data and the shorter time required for correction.
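The division described above can be sketched in a few lines – a hypothetical helper that shuffles (image, transcription) pairs and holds out a test portion the model never sees during training:

```python
import random

def split_ground_truth(pairs, test_fraction=0.2, seed=0):
    """Shuffle (image, transcription) pairs reproducibly and split them
    into a training set and a held-out test set."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - test_fraction))
    return pairs[:cut], pairs[cut:]
```

The fixed seed makes the split reproducible, so that successive model versions are always compared on the same held-out pages.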
In the automatic correction of contemporary texts read by OCR, dictionary look-ups play an important role. Here, however, this raises another issue: owing to the absence of spelling standards mentioned above, a multitude of spelling variants may appear. In addition, there may be numerous printing errors; if the resulting edition is to be faithful to the original, these cannot be corrected automatically. For this reason, manual correction is all the more important – especially at the very beginning of the process. The software interface used for this kind of correction should therefore be as user-friendly as possible.
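In line with this requirement of faithfulness, a dictionary check can flag suspect tokens for the human corrector rather than silently rewriting them. A sketch using plain edit distance; the lexicon and distance threshold here are purely illustrative:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def flag_suspects(tokens, lexicon, max_dist=1):
    """Flag tokens absent from the lexicon, attaching near matches as
    suggestions for the human corrector -- never auto-correcting."""
    report = []
    for tok in tokens:
        if tok in lexicon:
            continue
        near = [w for w in lexicon if levenshtein(tok, w) <= max_dist]
        report.append((tok, near))
    return report
```

A historical lexicon would of course have to include the period's spelling variants, so that legitimate old forms are not flagged as errors.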
Research and project works
The development of automatic text recognition for historical documents has become the subject of extensive research and numerous projects, of which two – the European IMPACT (Improving Access to Text) and the American eMOP (Early Modern OCR Project) – are most noteworthy. Both projects resulted in a large number of publications on the subject and contributed to the development of various practical tools.
The IMPACT project ran from 2008 to 2012, with the objective of increasing public access to texts recorded in historical prints. Coordinated by the Koninklijke Bibliotheek, it united twenty-six institutions in different countries and resulted in methods and resources for retrieving texts in nine historical languages, including Polish. Consequently, ground truth data were created in the form of a corpus of historical Polish containing documents published in the years 1617–1756. This may serve as a very useful resource, since other copies of some of those documents are held in Polona, and fragments illegible in one copy can often be deciphered from another.
“On hot springs in Skle” (Zamość, 1617) – two copies of the same edition
As a result of the IMPACT project, a competence centre is to be established, providing recommendations, resources, tools and services helpful in developing all the aforementioned stages of the process of retrieving text from old prints. The OCR software chosen for the project – ABBYY FineReader – is particularly recognized for its high efficiency, and many useful functionalities of this engine were developed in the course of the project. Still, relying on closed source code available only under a commercial license does not seem a good idea for such an innovative and comprehensive undertaking, which still requires a great deal of further development.
A similar project was implemented from 2012 to 2015 by the IDHMC (Initiative for Digital Humanities, Media, and Culture) at Texas A&M University. Its objective was to retrieve text from 307,000 historical documents totalling approximately 45 million pages. In the course of the project a considerable number of difficulties were encountered, resulting in many ingenious solutions. In the end, the project objective was almost fully achieved, although the effectiveness of the automatic process did not reach the target value of 97 percent. The digital versions of the documents to be recognized were black-and-white, low-resolution images, so the techniques developed under the project may prove of limited use for work on high-quality digitizations. Training data and source code for the developed tools are freely available on GitHub. For text recognition, the open-source Tesseract engine was used, whose source code is likewise available on GitHub.
The vast majority of historical objects accessible in the POLONA digital library are in the public domain. This means that their text layer, including source XML files, may be shared on the same basis – a solution that makes the fullest use of the corpus possible. OCR software developed from openly available source code should likewise be published in an open repository; this is the best way to contribute to the further development of this technology.
For the task requiring the greatest human effort – correction of the retrieved text – a large group of people may be involved under a crowdsourcing model. This solution was used in both projects, IMPACT and eMOP alike. The example of the reCAPTCHA project is particularly encouraging: it harnessed the dispersed activity of Internet users in place of the thousands of hours of work needed to verify digitized text. In such a project, however, special emphasis must be placed on the interface and its user-friendliness for those who contribute to the effort.
Today we are witnessing the dynamic development of data analysis and processing methods. For the automated editing of old prints, the methods known as deep learning may be of particular relevance; the application of multi-layer artificial neural networks with LSTM (Long Short-Term Memory) architecture already brings very promising results. Such solutions offer good prospects not only for linguistic and literary research, but for the humanities as a whole.
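The heart of an LSTM network is its gated cell update, which lets the model carry context across a whole line of text. A toy forward step in plain NumPy – just the arithmetic of one cell, not a trained recognizer; the layout stacking the four gates into one weight matrix is one common convention:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward step of an LSTM cell. W (4h x n), U (4h x h) and b (4h)
    stack the input, forget, cell-candidate and output gate parameters."""
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = 1 / (1 + np.exp(-z[:hidden]))                # input gate
    f = 1 / (1 + np.exp(-z[hidden:2 * hidden]))      # forget gate
    g = np.tanh(z[2 * hidden:3 * hidden])            # cell candidate
    o = 1 / (1 + np.exp(-z[3 * hidden:]))            # output gate
    c = f * c_prev + i * g                           # new cell state
    h = o * np.tanh(c)                               # new hidden state
    return h, c
```

In line recognition, `x` would be a column of pixels from the line image, and the step would be applied column by column, so the hidden state accumulates evidence about the glyph currently being read.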
This publication was prepared within the framework of the Competence Centre of the National Library with regard to the digitisation of library resources, co-funded by the Minister of Culture and National Heritage.