A new dataset for mongolian online handwritten recognition | Scientific Reports - Nature.com

1 year ago 66

Introduction

Pattern designation has contributed greatly to instrumentality imaginativeness applications. Handwriting designation falls nether the umbrella of signifier recognition. Handwriting designation is the method by which a machine strategy tin admit characters and different symbols written by individuals utilizing earthy handwriting1. With the popularity of mobile phones and integer devices, much applications of handwriting designation person emerged, specified arsenic the handwriting input method, signature recognition, and concern paper recognition. In the Inner Mongolia Autonomous Region, China, astir 4 cardinal radical talk and constitute the accepted Mongolian language. However, owing to the deficiency of datasets, the improvement of Mongolian online handwriting designation has been slow. Even though determination are immoderate reports connected Mongolian online handwriting recognition, the applicable datasets person not been published, making the examination and valuation of antithetic models oregon algorithms impractical. Undoubtedly, the show of grooming and designation highly depends connected the quantity and prime of grooming samples done heavy neural networks2. Thus, it is indispensable to physique a ample online handwritten Mongolian connection database for each researchers successful this area.

The main features that separate Mongolian from different languages are arsenic follows: arsenic an agglutinative language, its vocabulary is vast, including millions of words, and letters are seamlessly connected from apical to bottom. In Mongolian online handwriting recognition, to our knowledge, MRG-OHMW3 is the archetypal publically disposable database for online handwritten Mongolian. The main shortcoming of this dataset is that the vocabulary lone covers 946 Mongolian words, which is excessively tiny for Mongolian, and the handwriting trajectories were collected by an Anoto pen connected paper, making them antithetic from trajectories written with fingers connected a interaction screen. Thus, we purpose to physique a ample Mongolian word-level online mobile handwritten database to beforehand the improvement of related probe and applications.

Therefore, this insubstantial proposes a broad Mongolian online handwritten dataset called “MOLHW”, which whitethorn beryllium utilized arsenic a benchmark dataset for the Mongolian online handwritten designation task. The dataset was written by 200 radical successful total, and the information successful it were collected by mobile telephone application. The words were written utilizing a digit connected a interaction screen. The vocabulary contained successful MRG-OMHW3 consists of 946 words, portion that successful MOLHW is overmuch larger, astatine 40,605 words. Considering that determination are galore ways to divided and conception Mongolian words for analysis, successful this paper, we take the grapheme codification arsenic the Mongolian alphabet due to the fact that it splits Mongolian successful the astir elaborate mode and carries nary grammatical information. The grapheme codification is shown successful Fig.1 and contains 51 characters.

Figure 1
figure 1

Grapheme code.

The MOLHW dataset is present freely disposable to researchers for assorted Mongolian online text-related applications, specified arsenic Mongolian online substance recognition, handwritten substance generation, writer recognition and verification, and signature recognition. The main contributions of this insubstantial tin frankincense beryllium summarized arsenic follows.

  • The instauration of an unfastened vocabulary benchmarking dataset of a Mongolian online handwritten dataset, MOHLW, which includes 164,631 samples written by 200 writers and covers 40,605 communal Mongolian words.

  • The improvement of tools, techniques, and procedures for Mongolian online substance collection, verification, and transliteration.

  • The improvement of a projected benchmark exemplary for designation of online Mongolian handwritten words utilizing the encoder–decoder model.

  • A examination of the show of antithetic models connected this dataset.

The remainder of the insubstantial is organized arsenic follows. “Related work” presents a lit reappraisal of Mongolian online substance datasets. In “Overview of MOLHW”, we contiguous the information postulation steps utilized successful this survey and the dataset statistics. “Benchmark evaluation” details the information preprocessing and framing process. Additionally, this conception presents 3 trained and validated models and the experimental results of Mongolian online quality designation algorithm utilizing the MOHLW dataset. Then, for the experimental results of our baseline model, we supply the mistake investigation of the trial set. Finally, we contiguous the conclusions of this survey successful “Conclusion”.

Related work

Currently reported Mongolian substance designation probe tin beryllium divided into 3 categories: optical quality designation (OCR), humanities papers recognition, and handwriting recognition.

The segmentation method was adopted successful the earliest Mongolian OCR. In the archetypal step, the glyph is segmented from the image, past the diagnostic of each glyph is extracted, and yet the glyph is classified and recognized by matching with the template4. In short, Mongolian characters are seamlessly connected, truthful the segmentation of glyphs is simply a hard and challenging task. Therefore, the segmentation of characters volition greatly impact the accuracy of recognition. Due to the supra peculiar quality of Mongolian, successful caller years, galore scholars person adopted the non-segmentation strategy5,6,7. Datasets for studying Mongolian OCR are comparatively casual to obtain. Zhang et al.5 projected a exemplary based connected series to series with an attraction to admit non-segmented printed Mongolian substance successful 2017, and the designation accuracy of this experimentation has reached 89.6%. The dataset utilized successful this experimentation belongs to the writer and contains astir 20,000 words and a full of 80,000 samples. In 2019, Wang et al.6 projected end-to-end printed Mongolian substance designation based connected bidirectional agelong short-term representation (BiLSTM) and connectionist temporal classification (CTC). The occupation of Mongolian characters segmentation was not handled, but the writer focused to the occupation of series to sequence. The dataset consists of 800,000 samples collected from the dictionary and covers 20,250 words. In 2021, Cui et al.7 projected a triplet attraction Mogrifier web (TAMN) for irregular printed Mongolian substance recognition. The TAMN web uses a peculiar spatial translation method to close the distorted Mongolian image.The designation accuracy reached 90.30% connected their ain dataset, which includes 98,085 Mongolian pictures from the China Mongolian News Network and covers 6538 words.

Research connected the designation of humanities Mongolian documents has been conducted astir Mongolian Kanjur—a Mongolian encyclopedia—the contented of which involves religion, history, and literature. The main probe constituent has been keyword spotting and the holistic designation of the Woodblock–Print word8,9,10. The dataset utilized successful Mongolian humanities documents designation is from scanned Mongolian Kanjur images, spanning astir 200 pages.

The probe connected Mongolian handwriting designation tin beryllium divided into 2 categories: offline and online handwriting recognition. The work of online oregon offline handwriting datasets successful assorted languages has contributed to the improvement of their respective handwriting techniques. For example, successful the probe of offline handwriting recognition, English has the CENPARMI11, CEDAR12, and IAM13 datasets; Chinese has the HCL200014, CASIA Offline datasets15, and HIT-MW16 datasets; Japanese has the Kuchibue17 and Nakayosi18 datasets; and Arabic has IFN/ENIT19 and KHATT20. In Mongolian offline handwriting recognition, a word-level accepted Mongolian offline handwriting dataset (MHW) was developed by Daoerji21. The MHW dataset is divided into a grooming acceptable and 2 trial sets, including astir 120,000 samples written by 200 antithetic writers. The size of the grooming set, trial acceptable I, and trial acceptable II is 5000, 1000, and 939 words successful the MHW dataset, respectively. The vocabularies of the grooming and investigating acceptable person a fewer intersections. Fan and Gao22 developed a hidden Markov exemplary and heavy neural web hybrid strategy to admit offline handwritten Mongolian text, with an accuracy of 97.61% connected MHW Test acceptable I and an accuracy of 94.14% connected MHW Test acceptable II. In this work, the post-processing signifier utilized the Viterbi algorithm connected a dictionary that lone contains a subset of the vocabulary of MHW that includes astir 6734 words, and searched for the maximum imaginable result. Therefore, out-of-vocabulary (OOV) words could not beryllium recognized. A lexicon-free conversational Mongolian offline handwritten designation strategy with a two-dimensional recurrent neural web with CTC was projected in23. The sub-word connection exemplary they utilized tin admit immoderate connection and showed the champion performance, with connection mistake rates (WERs) of 18.32% and 23.22% connected MHW of the 2 trial sets. In24, an encoder–decoder with an attraction mechanics exemplary for Mongolian offline handwritten designation was proposed. The exemplary consists of 2 LSTMs and 1 attraction network. The archetypal LSTM is an encoder that consumes a framework series of a one-word image. The 2nd LSTM is simply a decoder that tin make a series of letters. The attraction web is added betwixt the encoder and decoder, which allows the decoder to absorption connected antithetic positions successful a series of frames during the decoding. The champion accuracies connected the 2 investigating sets of MHW were 90.68% and 84.16%, respectively.

In the probe of online handwriting recognition, a batch of online handwritten datasets person been published successful assorted languages. For example, Arabic has the ADAB25, AltecOnDB26, and Online-KHATT27 datasets; Chinese has the CASIA Online datasets15; Mongolian has the MRG-OHMW3 datasets. In the improvement of Mongolian online handwriting recognition, Ma et al.3 published a database called “MRG-OHMW” for online handwritten Mongolian successful 2016. There are 282,954 Mongolian connection samples successful the database, and their 300 writers travel from a Mongolian taste minority. The vocabulary of this database covers 946 Mongolian words, with the connection lenghth from 1 to fourteen Mongolian characters. The examination betwixt the MRG-OHMW database and our database is fixed successful Table 1.

Table 1 Statistics of MRG-OHMW and MOLHW.

Ma et al. projected a designation exemplary based connected a CNN web and tested it connected their ain database with the 91.2% trial accuracy. In 2016, Liu et al.28 projected an online handwritten Mongolian connection designation method based connected MWRCNN and presumption maps. On MRG-OHMW database, the method of combining aggregate classifiers achieved the highest designation accuracy of 93.24%. In 2017, Liu et al.29 projected a five-bidirectional hidden-level heavy bidirectional agelong abbreviated word representation (DBLSTM) web for online handwritten Mongolian connection recognition. The champion show of the exemplary adopted a noval sliding model method with the decoding method adopted the optimal way decoding is the connection level designation complaint of 90.35% connected the MRG-OHMW subset. In 2020, a caller method, CMA-MOHR, for online handwritten Mongolian quality designation was projected by Fan Yang et al.30. To measure the show of the model, they carried retired experiments of antithetic models connected their ain dataset and got the champion show of 76.23% connected their model. So far, however, their datasets person not been made public. Recently, the transformer web was projected by Devlin31, which is wholly based connected the attraction mechanics and recurrence and convolutions are not required successful the full web model. It is proved to beryllium precise effectual successful the tract of handwriting recognition. In the tract of online handwriting recognition, Matteo32 carried retired experiments connected an online handwritten dataset, released by STABILO, and the designation exemplary was based connected transformer structure. The experimental results amusement that transformer is simply a important breakthrough successful the series to series problem.

Overview of MOLHW

Mongolian vocabulary selection

Like English, accepted Mongolian is simply a phonetic script, with 35 letters. Unlike letters successful the Latin alphabet, Mongolian letters person antithetic shapes depending connected the presumption and discourse successful a word. In Mongolian Unicode encoding, lone 35 basal letters, called the Nominal Forms, are encoded, and determination is nary autarkic encoding for the antithetic forms of each letter, called the Presentation Forms. Therefore, the substance processing motor needs to show the close glyphs according to the context. Because Mongolian Unicode encoding cannot correspond unsocial glyphs, a Mongolian grapheme codification acceptable containing 51 elements was projected by Fan et al.22. Compared to Unicode, grapheme codes person shown amended show connected Mongolian handwritten designation tasks. Therefore, successful the MOLHW dataset, we supply some Unicode and grapheme codes labels astatine the aforesaid time.

Mongolian is considered to beryllium 1 of the astir morphologically analyzable languages. Mongolian words are formed by attaching suffixes to stems. One stem, particularly a verb stem, tin beryllium utilized to make dozens oregon hundreds of words by connecting antithetic suffixes to it. According to incomplete statistics, the Mongolian vocabulary tin scope 1 cardinal words. Thus, it is intolerable to screen each Mongolian words erstwhile gathering Mongolian-language datasets. To see each grammatical phenomena successful Mongolian, we counted the connection frequence from a Mongolian corpus containing much than 3 cardinal words and selected 40,605 words arsenic the vocabulary acceptable of this dataset. The vocabulary acceptable contains words with Unicode lengths from 1 to 20. We person counted the fig of words of antithetic lengths, and Fig. 2 shows the circumstantial results It tin beryllium seen from the fig that the magnitude of often utilized Mongolian words was concentrated successful the scope of 6–12 characters.

Figure 2
figure 2

Word frequence statistic of antithetic lengths.

Data collection

Owing to the popularity of astute phones, we decided to usage a mobile telephone interaction surface to cod handwritten trajectories. Thus, a dedicated information postulation exertion was developed. This exertion chiefly enables the pursuing tasks: information preparation, the real-time updating of information statistics, and information collection. The exertion is chiefly divided into 2 parts: a inheritance strategy and a idiosyncratic beforehand end. The inheritance strategy is chiefly liable for providing a Mongolian substance template, collecting the user’s handwriting trajectory, and tracking the clip of penning and the manual inspection results of each illustration of text. The idiosyncratic beforehand extremity is liable for providing users with a handwriting situation and an situation for the manual inspection of substance samples successful the background. The wide architecture of the strategy is shown successful Fig. 3. The relation of the mobile telephone is to enactment arsenic the front-end situation for the system, divided into 2 main parts, 1 providing the penning situation for the users and the different providing the checking situation for the reviewers. Two antithetic servers enactment arsenic the back-end of the system, the machine sever provides the volunteers with samples to write, and the information sever provides the retention and speechmaking of the penning traces.

Figure 3
figure 3

Architecture diagram of the handwriting trajectory acquisition system.

When the idiosyncratic opens the Mongolian handwriting input application, they click the login fastener to entree the handwriting interface shown successful Fig. 4. The inheritance assigns a people illustration Mongolian connection to the user. The substance astatine the apical near is the template connection provided to the user. The idiosyncratic writes the corresponding connection successful the achromatic area, and if the surface is not ample capable to constitute the word, the idiosyncratic tin usage the scroll barroom connected the close to scroll down the drafting area. If the idiosyncratic is dissatisfied with their existent penning sample, the idiosyncratic tin usage the wide fastener connected the near to wide the penning country and rewrite the word. After writing, the idiosyncratic tin click the taxable fastener connected the close to taxable the written handwriting way to the information server. After submission, the exertion provides a caller Mongolian connection for the idiosyncratic to write.

Figure 4
figure 4

Main leafage of app.

Data verification

To guarantee the penning prime of Mongolian words, we acceptable up respective peculiar adept reviewers who are autochthonal speakers of Mongolian. The extremity of this cognition was to delete misspelled oregon nonstandard writings. When an adept reviewer logs successful to the application, the inheritance sends a Mongolian substance way written by the idiosyncratic to them for their manual inspection and approval. The exertion redraws the close template connection and the way written by the idiosyncratic connected the screen. The reviewers take to judge oregon cull the trajectory based connected quality oculus comparison. Figure 5 shows the strategy cognition diagram for the reviewers. The representation successful the apical near country of the Fig. 5 shows the modular penning method corresponding to the trajectory. When the reviewer determines that a illustration meets the standard, prime the cheque container enactment successful the bottommost near country of the surface and the strategy volition see this illustration successful the information set, different prime the fork enactment and the strategy volition automatically leap to the reappraisal surface for the adjacent sample. This strategy importantly reduces the outgo of the inspection process.

Figure 5
figure 5

Review leafage of app.

We initially obtained astir 210,000 handwritten trajectory samples, and aft adept review, much than 160,000 samples remained. A ample fig of samples were deleted because, firstly, the volunteers had not been trained successful penning standards and, secondly, they were uncomfortable with penning successful Mongolian connected mobile phones, resulting successful sloppy oregon incomplete writing, frankincense causing immoderate of the samples to beryllium of mediocre quality.

Dataset statistics

The MOLHW dataset is present publically disposable astatine https://www.kaggle.com/fandaoerji/molhw-ooo to each researchers, and immoderate examples of handwriting samples are shown successful Fig. 6.

Figure 6
figure 6

Some handwriting samples successful MOLHW.

We amusement examples of antithetic styles of penning successful Fig. 7.

Figure 7
figure 7

Some antithetic handwriting benignant samples successful MOLHW.

The MOLHW dataset contains a full of 16,4631 handwriting samples and nary separation of grooming and trial sets. In this study, we randomly selected 70% arsenic the grooming set, 20% arsenic the trial set, and the remaining 10% arsenic the validation set. Users could disagreement the trial acceptable and grooming acceptable according to their ain needs. As aggregate radical were moving successful parallel, the words successful the vocabulary were not written equally. The statistic of the handwriting illustration postulation of words are shown successful Fig. 8.

Figure 8
figure 8

Word illustration statistics.

It tin beryllium seen that 11,262 words were written 3 times, which was the astir communal frequency. The MOLHW dataset contains 40,605 words successful total, which ensures that each connection has astatine slightest 1 handwriting sample, the maximum fig of collections is 14, and an mean of 4 handwriting samples are collected per word. All 200 penning volunteers were freshmen successful college, with ages ranging from 18 to 20, and the proportionality of males and females was astir equal. The unpaid who wrote the astir wrote 3374 samples, the 1 who wrote the 2nd astir wrote 2265 samples, the 1 who wrote the 3rd astir wrote 2246 samples, and the mean fig of penning samples per unpaid was 823.

The MOLHW dataset includes 7 files: “MOLHW.txt”, “MOLHW_preprocess.txt”, “MOLHW_grapheme.txt”, “MOLHW_preprocess_grapheme.txt”, “ASCII2Unicode.txt”, “dict.txt”, and “grapheme_code.png”. The “MOLHW.txt” record is an archetypal dataset record successful substance format, and each enactment is simply a handwriting illustration successful the format [Label, Author ID, Screen width, Screen height, Screen pixel density, Writing way coordinates].

  1. 1)

    Label: Mongolian connection successful Latin transcription (case sensitive). The correspondence betwixt Latin and Mongolian Unicode is explained successful ASCII2Unicode.txt. For example, the statement “abaci” tin beryllium converted to to “0x1820 0x182a 0x1834 0x1822” according to ASCII2Unicode.txt. The Mongolian connection is.

  2. 2)

    Author ID: Writer md5 encrypted ID.

  3. 3)

    Screen width: Mobile telephone surface width.

  4. 4)

    Screen height: Mobile telephone surface height.

  5. 5)

    Screen pixel density: Mobile telephone surface pixel density.

  6. 6)

    Writing way coordinates: The coordinate information arc is organized arsenic [[x,y],[x,y],....,[x,y]]. The format of 1 brace of coordinates is “[x, y]”, wherever “x” represents the coordinate of the X-axis, and “y” represents the coordinate of the Y-axis. The precocious near country of the surface is marked arsenic the root of the coordinate system, moving the x-axis to the close and moving the y-axis down. The [\(-1,-1\)] successful coordinates represents lifting a pen.

The “MOLHW_preprocess.txt” record is the dataset wherever the trajectory has been preprocessed, and the others are the aforesaid arsenic “MOLHW.txt”. The record “MOLHW_grapheme.txt” is the dataset labeled grapheme codes, and the others are aforesaid arsenic “MOLHW.txt” too. The record “MOLHW_preprocess_grapheme.txt” is the preprocessed dataset labeled grapheme codes, and the others are aforesaid arsenic “MOLHW_preprocess.txt”. The record “dict.txt” is simply a Latin transliteration of Mongolian to grapheme codification successful the Mongolian dictionary file. The record “grapheme_code.png” is the grapheme codification explanation file.

Benchmark evaluation

Mongolian online handwriting designation tin beryllium considered a mapping process from a substance trajectory series to a quality sequence. In the Mongolian designation task of this paper, based connected the online Mongolian handwriting way data, the neural web recognizes the corresponding text, which tin beryllium considered arsenic a series to series learning process. In the end-to-end task of variable-length sequences, the astir wide utilized model is encoder-decode structure, which is wherefore successful this insubstantial we take the encoder-decode model operation to physique the learning web arsenic our baseline model. Another mode of dealing with the adaptable magnitude series occupation is to usage the CTC model, truthful successful this insubstantial we besides harvester the CTC web with the agelong short-term representation (LSTM) web successful a comparative experiment. In total, 3 types of models were trained and validated. These are 2 encoder–decoder architectures. In the baseline model, we constructed an encoder–decoder33 based connected an online Mongolian handwritten designation exemplary with an attraction mechanism34. The 2nd is simply a Transformer-based model, and the 3rd is simply a LSTM with a CTC-based model. A afloat statement of these 3 architectures is fixed successful "Details of models" section. Finally, each the models were evaluated with a WER and quality mistake complaint (CER), wherever the CER is expressed arsenic the mean edit region of words. Edit distance, besides known arsenic Levenshtein distance, is simply a quantitative measurement of the quality betwixt 2 strings. It is measured by calculating however galore times it takes to alteration 1 drawstring into another.

Preprocessing and framing

To get a amended designation effect, the archetypal penning way accusation needs immoderate preprocessing operations. Owing to variations successful penning speed, the acquired points were not distributed evenly on the changeable trajectory. Interpolation and resampling operations were utilized to retrieve missing information oregon unit points to prevarication astatine azygous distances. Note that the representation successful the Fig. 9a was drawn from an archetypal trajectory data. The elaborate preprocessing steps are arsenic follows:

  1. 1.

    Interpolation. Piecewise Bezier interpolation was utilized successful the contiguous survey due to the fact that it helps to interpolate points among a fixed fig of points. This further helps region the points astatine an adjacent interval. Figure 9a,b shows a examination betwixt the archetypal trajectory information and the effect of Bezier interpolation.

  2. 2.

    Resampling. After inserting the points, we resampled the samples according to the Euclidean region betwixt 2 adjacent points defined arsenic the sampling distance. Figure 9b,c shows a examination betwixt the Bezier interpolation information and the trajectory information aft resampling. The sampling region is simply a parameter that needs to beryllium adjusted.

  3. 3.

    Deletion. By screening the dataset for samples with little than a definite fig of sampling points, a last threshold of 50 points was selected arsenic the criterion for distinguishing whether a illustration passed oregon failed, arsenic we recovered that 50 points could beryllium utilized arsenic a threshold to surface retired arsenic overmuch unqualified substance arsenic imaginable portion removing arsenic fewer close samples arsenic possible. We deleted the samples whose sampling constituent magnitude was little than 50 aft sampling.

  4. 4.

    Normalization. After the cardinal axis of the illustration information was translated to the X-axis, we normalized the illustration connected the XY-axis to guarantee that the sampled handwritten substance was successful the halfway of the canvas. The last trajectory information are shown successful Fig. 9d.

Figure 9
figure 9

Trajectory information preprocessing.

The preprocessed substance trajectory is simply a bid of two-dimensional coordinates on the penning order. We utilized a sliding model to determination on the penning bid and concatenate each the coordinates wrong the model into a framework of data. A definite overlap was maintained erstwhile the model slid, and the framing process is shown successful Fig. 10.

Figure 10
figure 10

Trajectory representation framing example.

Both the sliding model size and the overlap magnitude are successful units of trajectory points, which are exemplary hyperparameters that request to beryllium tuned.

Figure 11
figure 11

Trajectory information framing example.

The apical enactment of information successful Fig. 11 represents a illustration trajectory, which identifies the scope of a framework and the starting presumption of the adjacent framework aft a definite magnitude of offset. This fig is simply a diagram of framing for a model size of 4 and an overlap of 3. The matrix successful Fig. 11 is the effect of framing the sample, wherever each enactment represents a framework of data. This matrix is besides the input information that is fed into the model.

Details of models

Baseline model

The projected exemplary was designed arsenic a sequence-to-sequence architecture with an attraction mechanism. This exemplary is composed of a multi-layer BiGRU-based encoder and a GRU-based decoder, and a attraction web is adopted to link betwixt the encoder and the decoder. Detailed diagram of the exemplary for the circumstantial encoder and decoder implementations of baseline exemplary is shown successful Fig. 12. The architecture is shown successful Fig. 13.

Figure 12
figure 12

Architecture of the GRU based model.

Figure 13
figure 13

Architecture of the encoder–decoder model.

The encoder is liable for compressing the input series into a vector of a specified length, which tin beryllium regarded arsenic the semantics of the sequence, portion the decoder is liable for generating the specified series according to the semantic vector. In this model, the series of frames from the handwritten trajectory is passed to the encoder and converted into hidden states of the corresponding encoder. Then, the past 2 hidden states of BiGRU are added up arsenic the decoder’s archetypal hidden states. Next, astatine each measurement of the decoding process, the hidden states and encoder outputs are fed into the attraction layer, successful which an attraction value vector is calculated. Finally, the decoder receives the erstwhile prediction output (initially, it feeds the commencement awesome SOS) and the attraction discourse to make a series of letters arsenic the output of the model. The attraction furniture plays an important portion successful the projected exemplary due to the fact that it allows the decoder to absorption connected antithetic positions successful a series of frames during decoding. In this model, the hidden furniture size of GRU and the furniture number of the encoder are hyperparameters that request to beryllium tuned. The past furniture of the decoder is simply a softmax classification layer, and the fig of neurons depends connected the fig of people characters. When the Mongolian grapheme codification is utilized arsenic the people sequence, it contains a full of 52 neurons, and erstwhile Unicode is used, it contains a full of 36 neurons, including the extremity awesome EOS. In each measurement of decoding, the quality with the highest probability is selected arsenic the existent output, and the decoding is stopped erstwhile EOS is encountered.

Transformer model

Similar to the projected baseline model, the designation exemplary based connected the transformer tin besides beryllium regarded arsenic being composed of an encoder and decoder and has been shown to beryllium effectual successful galore sequence-to-sequence problems. Thus, we built a Transformer exemplary whose architecture is shown successful Fig.13, and measured its show connected our dataset.

Figure 14
figure 14

Architecture of the transformer model.

Vaswani et al.35 gave a circumstantial statement of the moving rule of the transformer, which is not repeated here. Next, we focused connected aspects related to knowing implementation of the transformer, and based connected this, we built a elemental exemplary based connected the transformer. Detailed diagram of the exemplary for the circumstantial encoder and decoder implementations of transformer exemplary is shown successful Fig. 14. Different from the baseline model, successful the transformer, determination are aggregate sub-encoders with the aforesaid operation successful the encoding module, and the input of each sub-encoder is the output of the erstwhile sub-encoder. Each sub-encoder is composed of a self-attention mechanics and a feedforward neural network. The decoder has a operation akin to its encoder. In summation to the 2 sub-layers successful each encoder layer, the decoder besides inserts a 3rd sublayer, which performs aggregate attraction connected the output of the encoder stack. The experimental results of the baseline exemplary corroborate that the information characteristics contained successful the earthy information without immoderate processing are insufficient. Therefore, the way information are divided into frames to marque the diagnostic accusation of the information much obvious. Then, the framework information are embedded, position-encoded, and sent to the encoder. After the transformer network, we foretell 1 statement astatine a time, portion the cross-entropy nonaccomplishment relation is applied successful these situations.

LSTM-CTC model

LSTM is simply a benignant of recurrent neural network. LSTMs were developed to lick the vanishing and exploding gradient problems that plague galore RNNs. Because LSTMs are overmuch amended astatine dealing with these 2 issues than earlier RNNs person been, LSTMs are suitable for tackling tasks that impact long-range dependencies successful sequential data32. CTC is wide utilized successful code recognition, substance recognition, and different fields to lick the occupation wherever the lengths of input and output sequences are antithetic and cannot beryllium aligned. In our model, the CTC is really our nonaccomplishment function. The LSTM with the CTC-based exemplary consists of an LSTM web and a CTC network, arsenic shown successful Fig. 15. The LSTM web consists of a bidirectional LSTM (BiLSTM) furniture with a 20% dropout complaint to debar overfitting. After the preprocessed framework information are sent to the BiLSTM layer, and this furniture is followed by a afloat connected furniture and past a afloat connected output furniture of 52 classes, depending connected the grapheme codification classes, which has 51 letters and an bare quality blank. After being normalized by the softmax layer, the output is sent to the CTC furniture to cipher the nonaccomplishment with existent labels of the grapheme code.

Figure 15
figure 15

Architecture of the LSTM-CTC model.

Experimental results

Below, we springiness the experimental results of the optimization process of tunable parameters of the baseline model. The examination results of the optimal show of the 3 models are fixed astatine the extremity of this section. From the preprocessed MOLHW dataset, we randomly selected 70% arsenic the grooming set, 20% arsenic the trial set, and the remaining 10% arsenic the validation set. In each consequent experiments, grapheme codes were utilized arsenic people labels, but finally, we compared the show of grapheme codes and Unicode nether the aforesaid model. The cross-entropy nonaccomplishment relation was utilized during the encoder–decoder training, and the grooming was stopped erstwhile the nonaccomplishment connected the validation acceptable was nary longer reduced. Then, the epoch with the smallest nonaccomplishment connected the validation acceptable was selected arsenic the optimal model. Each grooming epoch took astir 270 2nd connected the GPU NVIDIA Quadro P5000.

Before the exemplary tuning, we carried retired experiments with the aforesaid baseline exemplary connected the archetypal information and the preprocessed data, with the fixed parameters selected arsenic the sliding model size was 1, the overlap was 1, the hidden furniture size was 64, and the fig of encoder layers was 1. The corresponding experimental results are shown successful Table 2 . The results successful the array amusement that the preprocessed information performed amended nether the aforesaid model, truthful we carried retired consequent experiments connected the preprocessed data.

Table 2 Original and preprocessed information experimentation result.

As mentioned above, the sampling distance, sliding model size, overlap, hidden furniture size, and fig of encoder layers are each hyperparameters that had to beryllium tuned. The tuning of hyperparameters adopted a elemental hunt strategy. The archetypal hyperparameter to beryllium tuned was the sampling distance, and the hunt scope was 3, 4, 5, and 6. When tuning, the different hyperparameters selected empirical values, and the sliding model size was 50, the overlap was 10, hidden furniture size was 128, and the fig of encoder layers was 1. The results are shown successful Table 3. It tin beryllium seen from the array that the exemplary had the champion effect connected the trial acceptable with a sampling region of 4, truthful our consequent experiments each sampled the dataset with a sampling region of 4.

Table 3 Sampling region tuning experimentation result.

The 2nd hyperparameter to beryllium tuned was the sliding model size, and the hunt scope was 15, 20, 30, 40, 50, and 60. When tuning, the different hyperparameters selected empirical values, the overlap was 20, the hidden furniture size was 64, and fig of encoder layers was 1. The experimental results are shown successful Table 4. The experimental results amusement that the show was champion erstwhile the sliding model size was adjacent to 20. Next, we tuned the overlap hyperparameters, and compared the results erstwhile they were acceptable to 4, 10, and 20. The results are shown successful Table 5; the encoder–decoder web worked champion erstwhile the overlap was 10.

Table 4 Sliding model size tuning experimentation result.
Table 5 Overlap tuning experimentation result.

The optimal values of the 3 hyperparameters of sampling distance, sliding model size, and overlap successful the information preprocessing process were determined, and their values were 4, 20, and 10, respectively. In the pursuing experiments, we searched hidden layers of sizes 64, 128, and 256, and searched encoder layers 1, 2, 3, and 4 simultaneously. The results are shown successful Table 6. It tin beryllium seen that with the summation of the fig of layers, the designation effect gradually improved, but aft reaching 4 layers, the summation was nary longer obvious. When the fig of layers was 1 and two, arsenic the size of the hidden furniture increased, the designation show besides improved, but erstwhile the fig of layers was three, the show decreased erstwhile the size of the hidden furniture reached 256. The champion show was achieved erstwhile the fig of layers was 3 and the hidden furniture size was 128, with a CER of 0.471 and a WER of 24.281% connected the trial set.

Table 6 Num furniture and hidden size tuning experimentation result.

At the extremity of the experiment, we compared the show of grapheme codes and Unicode. We utilized Unicode arsenic the statement and repeated the experiment, successful which the exemplary parameters selected the optimal parameters of the grapheme code, and the results are shown successful Table 7. It tin beryllium seen that the show of grapheme codification was overmuch higher than that of Unicode encoding, successful which CER was reduced by 0.309, and WER is reduced by 17.437% connected the trial set.

Table 7 Experimental results of Unicode and grapheme codification comparison.

With the preprocessed data, the archetypal 3 rows of Table 8 shows the designation accuracy based connected 3 models. In presumption of WER and CER, the transformer exemplary performed overmuch amended than our baseline model, with a 7.312% summation successful the WER complaint and a 0.098% summation successful the CER connected the trial set. Our results again corroborate the fantabulous show of the Transformer structure-in-sequence to series problem. We tested the show of the archetypal information with the 3 models, and the results are shown successful the past 3 rows of Table 8 . In wide comparison, the preprocessed information showed beardown learnability for each 3 models which shows that our pre-processing was precise effective.

Table 8 Experimental results of antithetic exemplary comparison.

Error analysis

For the experimental results of the our baseline model, we springiness the mistake investigation connected the trial set. The fig of samples successful the trial acceptable was 32,926, and the WER was 24.81%; that is, 8168 samples had designation errors. Figure 16 shows the images with designation errors caused by antithetic mistake types. Figure 16a shows that the corresponding existent statement is ‘s az jz iz gzz lz az s ix’, and owing to the insertion of ‘hes bax’, the incorrect decoding effect of the exemplary is ‘s az jz iz gzz lz az s hes bax ix’. Figure 16b shows that the corresponding existent statement is ‘a o iz gzz hes container c iz hes bax hes box’, and due to the fact that of the deficiency of ’gzz,’ the incorrect decoding effect of the exemplary is ‘a o iz hes container c iz hes bax hes box’. Figure 16c shows that the corresponding existent statement is ‘n az iz iz lz az hes container lz c iz bos bax lx’, and due to the fact that ‘hes’ replaces ‘bos’, the incorrect decoding effect of the exemplary is ‘n az iz iz lz az hes container lz c iz hes bax lx’.

Figure 16
figure 16

Incorrectly recognized images.

For the supra 3 errors, the statistic of designation errors are shown successful Table 9. As we tin spot from the table, the astir communal occurrence successful the designation process is repalce errors, meaning that a close quality is replaced by different incorrect character. This is followed by deletion errors, wherever a quality is not recognised, resulting successful a quality being missing from the designation result, and yet insertion errors, wherever a azygous quality is incorrectly recognised arsenic much than 1 character, resulting successful an other quality successful the designation result.

Table 9 Incorrectly recognized images.

We analyzed the astir communal errors, regenerate error. The statistic of the trial acceptable regenerate mistake results are fixed successful Table 10.

Table 10 Statistics of regenerate error.

As tin beryllium seen from the table, the quality that was astir apt to beryllium replaced and replaced was ‘ax’. Because of its elemental structure, it is easy confused with different akin characters successful the lawsuit of uneven information coordinates.

Conclusion

A dataset (MOLHW) consisting of handwritten Mongolian words was described successful this paper. The dataset contained 164,631 samples of Mongolian online substance with 40,605 Mongolian words written by 200 writers. To measure the MOLHW dataset, we outlined, implemented, and trained 3 models. An attention-based encoder–decoder exemplary utilized for an online handwritten Mongolian quality designation exemplary web was utilized arsenic the baseline exemplary to marque a preliminary valuation of our dataset. The experimentation results amusement that our exemplary could efficaciously admit the corresponding grapheme codification sequences from the continuous coordinate sequences of Mongolian words handwritten online, which tin efficaciously lick the problems of primitive segmentation and OOV connection recognition, which are caused by Mongolian being an agglutinative language. Subsequently, experiments were carried retired connected our dataset based connected LSTM-CTC and the Transformer model. Our results corroborate the fantabulous show of the transformer operation successful series to series problem, which obtained the champion experimental results, with a 16.969% WER connected the trial set. Thus, we judge that the MOLHW dataset tin beryllium utilized arsenic a benchmark dataset for studies related to Mongolian online handwriting recognition, writer identification, handwritten substance generation, and related areas. This database is freely disposable upon petition for funny researchers.

Data availability

All applicable codes included successful this survey are disposable upon petition by contacting with the corresponding author. And the MOLHW dataset is present publically disposable astatine https://www.kaggle.com/fandaoerji/molhw-oootoallresearchers.

References

  1. Kaur, H. & Kumar, M. A broad survey connected connection designation for non-indic and indic scripts. Pattern Anal. Appl. 21, 897–929 (2018).

    Article  MathSciNet  Google Scholar 

  2. Gupta, M. & Agrawal, P. Compression of heavy learning models for text: A survey. ACM Trans. Knowl. Discov. Data 16, 1–55 (2022).

    Article  Google Scholar 

  3. Ma, L.-L., Liu, J. & Wu, J. A caller database for online handwritten Mongolian connection recognition. In International Conference connected Pattern Recognition (ICPR), 1131–1136 (2016).

  4. Peng, L. et al. Multi-font printed Mongolian papers designation system. Int. J. Doc. Anal. Recogn. 13, 93–106 (2010).

    Article  Google Scholar 

  5. Zhang, H., Wei, H., Bao, F. & Gao, G. Segmentation-free printed accepted Mongolian OCR utilizing series to series with attraction model. In International Conference connected Document Analysis and Recognition (ICDAR) vol. 1, pp. 585–590 (IEEE, Hohhot, 2017).

  6. Wang, W., Wei, H. & Zhang, H. End-to-end exemplary based connected bidirectional LSTM and CTC for segmentation-free accepted Mongolian recognition. In Chinese Control Conference (CCC) (ed. Fu, M.) 8723–8727 (IEEE, Hohhot, 2019).

  7. Cui, S. et al. An end-to-end web for irregular printed Mongolian recognition. Int. J. Doc. Anal. Recogn. 20, 1–10 (2021).

    Google Scholar 

  8. Wei, H. & Gao, G. A keyword retrieval strategy for humanities Mongolian papers images. Int. J. Doc. Anal. Recogn. 17, 33–45 (2014).

    Article  Google Scholar 

  9. Wei, H. & Gao, G. A holistic designation attack for woodblock-print Mongolian words based connected convolutional neural network. In International Conference connected Image Processing (ICIP), 2726–2730 (IEEE, Hohhot, 2019).

  10. Wei, H., Zhang, J. & Zhang, H. Deep features practice of connection representation for keyword spotting successful humanities Mongolian papers images. In International Conference connected Tools with Artificial Intelligence (ICTAI) (ed. Alamaniotis, M.) 413–417 (IEEE, Hohhot, 2020).

  11. Abou-zeid, H. M. et al. (eds) 46th Midwest Symposium connected Circuits and Systems, vol 2, 969–973 (2003) (IEEE, Alexandria, 2003).

  12. Hull, J. J. A database for handwritten substance designation research. IEEE Trans. Pattern Anal. Mach. Intell. 16, 550–554 (1994).

    Article  Google Scholar 

  13. Marti, U.-V. & Bunke, H. The IAM-database: An English condemnation database for offline handwriting recognition. Int. J. Doc. Anal. Recogn. 5, 39–46 (2002).

    Article  MATH  Google Scholar 

  14. Zhang, H., Guo, J., Chen, G. & Li, C. Hcl2000-a large-scale handwritten island quality database for handwritten quality recognition. In 2009 10th International Conference connected Document Analysis and Recognition, 286–290 (IEEE, Beijing, 2009).

  15. Liu, C.-L., Yin, F., Wang, D.-H. & Wang, Q.-F. Casia online and offline Chinese handwriting databases. In 2011 International Conference connected Document Analysis and Recognition, 37–41 (2011).

  16. Su, T., Zhang, T. & Guan, D. Corpus-based hit-mw database for offline designation of general-purpose Chinese handwritten text. IJDAR 10, 27–38 (2007).

    Article  Google Scholar 

  17. Nakagawa, M. et al. On-line handwritten quality signifier database sampled successful a series of sentences without immoderate penning instructions. In Proceedings of the Fourth International Conference connected Document Analysis and Recognition, vol 1, 376–381 (IEEE, Tokyo, 1997).

  18. Matsumoto, K., Fukushima, T. & Nakagawa, M. Collection and investigation of on-line handwritten Japanese quality patterns. In Proceedings of Sixth International Conference connected Document Analysis and Recognition, 496–500 (IEEE, Tokyo, 2001).

  19. Pechwitz, M. et al. Ifn/enit-database of handwritten Arabic words. In Proceedings of CIFED, vol 2, 127–136 (Citeseer, Braunschweig, 2002).

  20. Mahmoud, S. A. et al. Khatt: An unfastened Arabic offline handwritten substance database. Pattern Recogn. 47, 1096–1112 (2014).

    Article  ADS  Google Scholar 

  21. Daoerji, F., Guanglai, G. & Huijuan, W. Mhw Mongolian offline handwritten dataset and applications. J. Chin. Inf. Process. 32, 89–95 (2018).

    Google Scholar 

  22. Fan, D. & Gao, G. Dnn-hmm for ample vocabulary Mongolian offline handwriting recognition. 72–77 (2016).

  23. Fan, D., Gao, G. & Wu, H. Sub-word based Mongolian offline handwriting recognition. 246–253 (2019).

  24. Wei, H., Liu, C., Zhang, H., Bao, F. & Gao, G. End-to-end exemplary for offline handwritten mongolian connection recognition. 220–230 (2019).

  25. El Abed, H., Märgner, V., Kherallah, M. & Alimi, A.M. Icdar 2009 online arabic handwriting designation competition. In 2009 10th International Conference connected Document Analysis and Recognition, 1388–1392 (IEEE, Braunschweig, 2009).

  26. Abdelaziz, I. & Abdou, S. Altecondb: A large-vocabulary Arabic online handwriting designation database. arXiv:1412.7626 (arXiv preprint) (2014).

  27. Mahmoud, S. A., Luqman, H., Al-Helali, B. M., Binmakhashen, G. & Parvez, M. T. Online-khatt: An open-vocabulary database for Arabic online-text processing. Open Cybern. Syst. J. 12, 42–59 (2018).

    Article  Google Scholar 

  28. Liu, J., Ma, L. L. & Wu, J. Online handwritten Mongolian connection designation utilizing MWRCNN and presumption maps. In International Conference connected Frontiers successful Handwriting Recognition (2016).

  29. Liu, J., Ma, L.-L. & Wu, J. Online handwritten Mongolian connection designation utilizing a caller sliding model method with recurrent neural networks. In 2017 14th IAPR International Conference connected Document Analysis and Recognition (ICDAR), Vol 01, 189–194 (2017).

  30. Yang, F., Bao, F. & Gao, G. Online handwritten Mongolian quality designation utilizing CMA-MOHR and coordinate processing. In International Conference connected Asian Language Processing (IALP), 30–33 (2020).

  31. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of heavy bidirectional transformers for connection understanding. arXiv:1810.04805 (arXiv preprint) (2018).

  32. Bronkhorst, M. A pen is each you request :online handwriting designation utilizing transformers (2021).

  33. Cho, K. et al. Learning operation representations utilizing RNN encoder–decoder for statistical instrumentality translation. arXiv:1406.1078 (arXiv preprint) (2014).

  34. Bahdanau, D., Cho, K. & Bengio, Y. Neural instrumentality translation by jointly learning to align and translate. arXiv:1409.0473 (arXiv preprint) (2014).

  35. Vaswani, A. et al. Attention is each you need. Adv. Neural Inf. Process. Syst. 30, 25 (2017).

    Google Scholar 

Download references

Acknowledgements

This enactment was funded by the National Natural Science Foundation of the Inner Mongolia Autonomous Region (Grant No. 2020MS06005) and the National Natural Science Foundation of China (Grant No. 61763034). We convey LetPub (www.letpub.com) for its linguistic assistance during the mentation of this manuscript.

Author information

Authors and Affiliations

  1. College of Electronic Information Engineering, Inner Mongolia University, Hohhot, 010021, China

    Yuecai Pan, Daoerji Fan, Huijuan Wu & Da Teng

Contributions

D.F. conceived the experiment, Y.P. and H.W. conducted the experiment, D.T. analysed the results. All authors participated successful the operation of the dataset and reviewed the manuscript.

Corresponding author

Correspondence to Daoerji Fan.

Additional information

Publisher's note

Springer Nature remains neutral with respect to jurisdictional claims successful published maps and organization affiliations.

About this article

Verify currency and authenticity via CrossMark

Cite this article

Pan, Y., Fan, D., Wu, H. et al. A caller dataset for mongolian online handwritten recognition. Sci Rep 13, 26 (2023). https://doi.org/10.1038/s41598-022-27267-8

Download citation

  • Received: 24 August 2022

  • Accepted: 29 December 2022

  • Published: 02 January 2023

  • DOI: https://doi.org/10.1038/s41598-022-27267-8

Read Entire Article