FRONTIER application - Spring 2017
With the FRONTIER grant, we are planning to work on two characteristic publications as case studies.
1) Jingbao 晶報 (The Crystal), 1919-1940, an important entertainment newspaper, as an example of a newspaper with a complex page layout.
2) Funü shibao 婦女時報 (The Women’s Eastern Times), 1911-1917, China’s first commercial women’s journal, as an example of a magazine with a typical two-column layout.
The newspaper Jingbao ran for more than 20 years. We have about 17,000 pages on about 9,100 scans available in the ECPO database, with metadata on articles, images and advertisements (intensive approach) for most of the April issues. The first three years were segmented entirely by hand through crowdsourcing. The magazine Funü shibao was published between 1911 and 1917 and appeared in 21 issues with slightly more than 2,200 pages. It is part of the WoMag database, where detailed metadata on articles, images and advertisements (intensive approach) have been recorded.
The State of Research in the Field of Line / Character Detection and Chinese Optical Character Recognition
Background and Problem Statement
Ancient and early Chinese texts survive as manuscripts, carved block prints and letter/character board prints. Common problems in research on optical character recognition (OCR) of ancient and early Chinese prints include irregular ideograms (characters) and broken character shapes in the available images. In addition, the customary writing directions of Chinese texts, their column arrangements, and the prevalence of double-line comments (smaller characters inserted under the base text, often forming two short lines that together are as wide as one line of the main text) are all reasons why current OCR technology cannot effectively recognize Chinese characters in ancient, early modern and Republican-era texts.
The main methods used in current research on OCR of printed Chinese texts are character segmentation and the use of language models built from digital corpora to calibrate the recognition results of OCR software.
Since OCR accuracy is higher for individual characters than for whole passages or entire paragraphs of text, character segmentation is necessary. The character segmentation work here would thus include recognizing the types of printing blocks, identifying separation lines, correcting skew in the flow of the writing, recognizing early types of sub-line punctuation, resolving double-line notes, etc. In the end, every segmented character needs to be put in the right order, which is important because text boxes within a text can have different writing directions.
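The ordering step mentioned above can be illustrated with a minimal sketch. It assumes that character bounding boxes have already been detected and that the text block is written vertically, with columns running from right to left. The function name `vertical_reading_order`, the box format and the `column_gap` tolerance are hypothetical choices made for illustration, not part of any existing tool.

```python
from typing import List, Tuple

# Hypothetical box format: (x, y, width, height), origin at the top-left corner.
Box = Tuple[int, int, int, int]


def vertical_reading_order(boxes: List[Box], column_gap: int = 10) -> List[Box]:
    """Order boxes top-to-bottom within a column, columns right-to-left."""

    def centre_x(box: Box) -> float:
        return box[0] + box[2] / 2

    # Visit the boxes from the rightmost horizontal centre to the leftmost.
    by_x = sorted(boxes, key=centre_x, reverse=True)

    columns: List[List[Box]] = []
    for box in by_x:
        # A box joins the current column if its centre is close to the centre
        # of the first box of that column; otherwise it opens a new column.
        if columns and abs(centre_x(columns[-1][0]) - centre_x(box)) <= column_gap:
            columns[-1].append(box)
        else:
            columns.append([box])

    ordered: List[Box] = []
    for column in columns:
        # Inside a column, read from top to bottom.
        ordered.extend(sorted(column, key=lambda b: b[1]))
    return ordered


sample = [(10, 0, 20, 20), (10, 30, 20, 20), (50, 30, 20, 20), (50, 0, 20, 20)]
result = vertical_reading_order(sample)
```

A text box with a different writing direction (for example a horizontal headline) would need its own ordering rule, so in practice the direction of each box would have to be determined before a function like this is applied.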
Meanwhile, current research on character recognition in Chinese texts employs language models built from digital corpora to post-process the recognition results of OCR software and thereby raise their accuracy. However, Chinese usage has changed over time, and language models built from current Chinese corpora do not meet the needs of improving OCR for these texts. In addition, collecting a suitable Chinese language corpus is not easy, which makes improving OCR with language models quite a challenge.
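As a rough illustration of language-model post-processing, the following sketch re-ranks OCR candidates with a character bigram model. It is an assumption-laden toy example: the function `rescore`, the candidate format (a list of character and confidence pairs per position), the bigram table and the `unseen_logp` penalty are all hypothetical, and a real system would estimate the bigram probabilities from a corpus of the relevant period.

```python
import math
from typing import Dict, List, Tuple


def rescore(candidates: List[List[Tuple[str, float]]],
            bigram_logp: Dict[Tuple[str, str], float],
            unseen_logp: float = -10.0) -> str:
    """Choose the sequence maximising OCR log-confidence plus bigram log-probability."""
    if not candidates:
        return ""
    # best[ch] = (total score, best sequence ending in ch); confidences must be > 0.
    best = {ch: (math.log(p), ch) for ch, p in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for ch, p in position:
            # Viterbi step: find the best predecessor for this candidate.
            score, seq = max(
                (prev_score + bigram_logp.get((prev_ch, ch), unseen_logp), prev_seq)
                for prev_ch, (prev_score, prev_seq) in best.items()
            )
            new_best[ch] = (score + math.log(p), seq + ch)
        best = new_best
    return max(best.values())[1]


# Toy data (invented for illustration): the OCR engine slightly prefers 文,
# but the bigram 婦女 is known to the language model.
ocr_candidates = [[("婦", 0.9)], [("女", 0.4), ("文", 0.6)]]
toy_bigrams = {("婦", "女"): -1.0}
corrected = rescore(ocr_candidates, toy_bigrams)
```

The example also shows why the choice of corpus matters: a bigram table estimated from a corpus that does not reflect the usage of the source texts would assign the `unseen_logp` penalty to many legitimate character pairs.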
For Western languages like English, dictionary-based spelling suggestions are widely used for OCR error correction. For Chinese, however, the dictionary-based method will not work until word segmentation has been applied to the text (as distinct from character segmentation of the image).
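To make the distinction concrete, here is a minimal sketch of text-level word segmentation using forward maximum matching, one simple and well-known technique. It is not necessarily the method the project would use. The function name `max_match`, the tiny lexicon and the `max_len` limit are assumptions for illustration only.

```python
from typing import List, Set


def max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


# Toy lexicon, invented for illustration.
toy_lexicon = {"上海", "婦女", "時報"}
segmented = max_match("上海婦女時報", toy_lexicon)
```

Once the text is split into words, single-character leftovers that are not in the lexicon can be treated as candidates for dictionary lookup and correction, much as misspelled words are in English.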
(Chia-ming Lee, Taida research group)
To see some examples of our approach, please open the samples folder.