Writer Identification Using Machine Learning Approaches a Comprehensive Review

Data Brief. 2022 Apr; 41: 107947.

Arabic handwritten alphabets, words and paragraphs per user (AHAWP) dataset

Received 2021 Dec 10; Accepted 2022 Feb 8.

Abstract

This article presents a handwritten Arabic alphabets, words and paragraphs dataset (AHAWP). The dataset contains 65 different Arabic alphabets (with variations on begin, end, middle and regular alphabets), 10 different Arabic words (that cover all Arabic alphabets) and 3 different paragraphs. The dataset was collected anonymously from 82 different users. Each user was asked to write each alphabet and word 10 times. A userid uniquely but anonymously identifies the writer of each alphabet, word and paragraph. In total, the dataset consists of 53199 alphabet images, 8144 word images and 241 paragraph images. This dataset can be used for multiple purposes. It can be used for optical handwriting recognition of alphabets and words. It can also be used for writer identification (or verification) of handwritten Arabic text. Using this dataset, it is also possible to evaluate differences in the writing style of an isolated alphabet as compared to the same alphabet written as part of a word or paragraph by the same user. The dataset is publicly available at https://data.mendeley.com/datasets/2h76672znt/1.

Keywords: Handwritten Arabic alphabets, Handwritten Arabic words, Handwritten Arabic paragraphs, Writer identification, Arabic text recognition

Specifications Table

Subject	Computer Science
Specific subject area	Image processing, Optical Handwritten Text Recognition, Writer Identification
Type of data	Image
How the data were acquired	Users completed forms (based on a fixed template) in their own handwriting, and these forms were then scanned.
Data format	Raw
Description of data collection	Data was collected anonymously in a classroom setting. A "userid" was used to uniquely but anonymously identify the writer of each alphabet, word and paragraph. The forms were color scanned at 300 dpi, and the handwritten Arabic alphabets, words and paragraphs were then cropped from these scanned forms.
Data source location	College of Computer Engineering and Science, Prince Mohammad Bin Fahd University; City/Town/Region: Khobar; Country: Saudi Arabia; Latitude and longitude for collected samples/data: 26.14544181406805, 50.091155268800044
Data accessibility	Repository name: Mendeley Data; Data identification number: 10.17632/2h76672znt.1; Direct URL to data: https://data.mendeley.com/datasets/2h76672znt/1

Value of the Data

• This data is useful for optical handwriting recognition of Arabic text [1,2]. It can also be used for writer identification/verification of handwritten Arabic text [3], [4], [5]. This dataset contains handwritten alphabets, words and paragraphs written by the same user, which makes it possible to develop writer identification models trained on alphabets and evaluated on words or sentences [6].
• The data can be used by machine learning researchers/companies to develop models for recognizing handwritten text. It can also be used by researchers/companies to develop forensic models for identifying the writer of a given handwritten text.
• The existing Arabic alphabet identification datasets (Hijja [1], AHCD [7] datasets) do not provide any user information. The existing datasets for writer identification of Arabic text (IFN/ENIT [8], KHATT [9], QUWI [10] datasets) only provide handwritten words or paragraphs and do not contain alphabets. This dataset fills this gap.
• The existing Arabic datasets cannot be used for writer identification based on Arabic alphabets, or to evaluate differences in the writing style of an isolated alphabet as compared to the same alphabet written as part of a word or paragraph by the same user. This dataset can be used for this purpose, as it captures the user information for each alphabet, word and paragraph.

1. Data Description

The dataset contains 65 different Arabic alphabets (with variations on begin, middle, end and regular alphabets), 10 different Arabic words and 3 different paragraphs handwritten by 82 writers. Each writer was asked to write each alphabet and word 10 times. A userid uniquely but anonymously identifies the writer of each alphabet, word and paragraph. In total, the dataset consists of 53199 alphabet images, 8144 word images and 241 paragraph images.
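The reported totals can be compared with the nominal collection plan. The following is a minimal sketch, assuming every writer would have supplied every requested sample and that each paragraph was written once per writer (the source does not state the paragraph repetition count); the variable names are illustrative.

```python
# Nominal collection plan taken from the description above.
WRITERS = 82
REPETITIONS = 10  # each alphabet and word was requested ten times

nominal = {
    "alphabets": 65 * WRITERS * REPETITIONS,
    "words": 10 * WRITERS * REPETITIONS,
    # Assumption: each of the 3 paragraphs was written once per writer.
    "paragraphs": 3 * WRITERS,
}
reported = {"alphabets": 53199, "words": 8144, "paragraphs": 241}

# Gap between the nominal plan and the totals reported for the dataset.
shortfall = {kind: nominal[kind] - reported[kind] for kind in nominal}

for kind in nominal:
    print(f"{kind}: nominal {nominal[kind]}, reported {reported[kind]}, "
          f"shortfall {shortfall[kind]}")
```

Under these assumptions the reported totals fall slightly short of the nominal plan (by 101 alphabet images, 56 word images and 5 paragraph images). The source does not give a reason for the gaps.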

The writers had a fluent Arabic speaking and writing background. They were not advised to use any specific type or color of writing instrument. This resulted in a dataset with several variations in the type and color of writing instruments. A single writer typically used the same type and color of writing instrument to write the alphabets, words and paragraphs.

Fig. 1 shows the set of Arabic alphabets collected in the dataset. Please note that a single alphabet is collected to represent a similarly styled group of alphabets.

Fig. 1. Arabic alphabets with variations collected in the dataset.

The dataset contains 10 different Arabic words (that cover all Arabic alphabets) handwritten by 82 different users. Fig. 2 shows the set of Arabic words collected in the dataset.

Fig. 2. Arabic words collected in the dataset.

The dataset contains 3 different paragraphs handwritten by 82 different users. Fig. 3 shows the fixed text paragraphs collected in the dataset.

Fig. 3. Arabic paragraphs collected in the dataset.

Fig. 4 shows the template of the form used to collect a sample of handwritten Arabic alphabets from a user ("user001"). Fig. 5 shows the template of the form used to collect a sample of handwritten Arabic words from a user ("user003"). Fig. 6 shows the template of the form used to collect a sample of handwritten Arabic fixed text paragraphs from a user ("user005").

Fig. 4. Arabic alphabets handwritten by a user.

Fig. 5. Arabic words handwritten by a user.

Fig. 6. Arabic paragraph handwritten by a user.

The handwritten forms from all 82 users were color scanned at 300 dpi, resulting in an image resolution of 2480 × 3507 pixels. These scanned images are provided as raw data in the folder named "raw_dataset" in the public repository.
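The stated resolution is consistent with an A4 page scanned at 300 dpi. The source does not name the paper size, so A4 (210 mm × 297 mm) is an assumption here, as is the truncation of fractional pixels; the helper name is illustrative.

```python
import math

MM_PER_INCH = 25.4


def scan_size_px(width_mm: float, height_mm: float, dpi: int) -> tuple:
    """Pixel dimensions of a page scanned at `dpi`, truncating partial pixels."""
    return (
        math.floor(width_mm / MM_PER_INCH * dpi),
        math.floor(height_mm / MM_PER_INCH * dpi),
    )


# Assumption: the forms were A4 sheets (210 mm x 297 mm).
print(scan_size_px(210, 297, 300))  # (2480, 3507)
```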

2. Experimental Design, Materials and Methods

The dataset was extracted from the scanned images using Python scripts. The scripts are provided with the dataset in the public repository. The following is a brief description of the scripts:

• 1a_alphabet_extractor_per_alphabet.py: This script extracts alphabets from the scanned JPEG images (as shown in Fig. 4) and organizes them in a folder structure with one folder per alphabet containing that alphabet written by all the users. Each file name has the format "userid_alphabetName_variationName_index", where index increases sequentially for each extracted alphabet from a single page.
• 1a_alphabet_extractor_per_user.py: This script extracts alphabets from the scanned JPEG images (as shown in Fig. 4) and organizes them in a folder structure with one folder per user containing all the alphabets written by that user. Each file name has the format "userid_alphabetName_variationName_index", where index increases sequentially for each extracted alphabet from a single page.
• 2a_alphabets_pre_processing.py: This script is used to pre-process the extracted alphabets. It crops the alphabets from the center of the image (excluding 20 pixels on each side). This was done to remove any borders surrounding the extracted alphabets. The surrounding whitespace around the written alphabets was then removed. The resultant image was converted to grayscale and scaled to a height of 128 pixels (keeping the aspect ratio intact). Please note that keeping the aspect ratio is important so that the handwriting does not get distorted.
• 1w_word_extractor_per_user.py: This script extracts words from the scanned JPEG images (as shown in Fig. 5) and organizes them in a folder structure with one folder per user containing all the words written by that user. Each file name has the format "userid_wordName_index", where index increases sequentially for each extracted word from a single page.
• 2w_words_pre_processing.py: This script is used to pre-process the extracted words. It crops the words from the center of the image (excluding 5 pixels on each side). This was done to remove any borders surrounding the extracted words. The surrounding whitespace around the written words was then removed. The resultant image was converted to grayscale and scaled to a height of 128 pixels (keeping the aspect ratio intact). Please note that keeping the aspect ratio is important so that the handwriting does not get distorted.
• 1p_paragraph_extractor_per_user.py: This script extracts paragraphs from the scanned JPEG images (as shown in Fig. 6) and organizes them in a folder structure with one folder per user containing all the paragraphs written by that user. Each file name has the format "userid_paragraphNumber_index".
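The pre-processing steps described for alphabets and words (border crop, whitespace removal, fixed-height scaling) can be sketched as below. This is not the authors' script: it is a pure-Python illustration that assumes the image has already been converted to grayscale and is stored as a list of pixel rows (0 = black, 255 = white). The function names, the ink threshold of 200 and the nearest-neighbour resampling are all assumptions made for the example.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale values, 0 = black, 255 = white


def crop_margin(img: Image, margin: int) -> Image:
    """Drop `margin` pixels from every side to remove the cell border."""
    return [row[margin:len(row) - margin] for row in img[margin:len(img) - margin]]


def trim_whitespace(img: Image, ink_threshold: int = 200) -> Image:
    """Crop to the bounding box of pixels darker than `ink_threshold`."""
    rows = [y for y, row in enumerate(img) if any(p < ink_threshold for p in row)]
    if not rows:  # blank cell: nothing to trim
        return img
    cols = [x for x in range(len(img[0]))
            if any(row[x] < ink_threshold for row in img)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def scale_to_height(img: Image, height: int = 128) -> Image:
    """Nearest-neighbour resize to a fixed height, keeping the aspect ratio."""
    src_h, src_w = len(img), len(img[0])
    width = max(1, round(src_w * height / src_h))  # width follows the height
    return [
        [img[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def preprocess(img: Image, margin: int) -> Image:
    """Border crop, whitespace trim, then scale to a height of 128 pixels."""
    return scale_to_height(trim_whitespace(crop_margin(img, margin)))


# Synthetic 60 x 100 cell: black border, white background, one 10 x 40 ink stroke.
cell = [[0 if x in (0, 99) or y in (0, 59) else 255 for x in range(100)]
        for y in range(60)]
for y in range(25, 35):
    for x in range(30, 70):
        cell[y][x] = 0

# The source uses a 20-pixel margin for alphabets and 5 pixels for words.
out = preprocess(cell, margin=5)
print(len(out), len(out[0]))  # 128 512
```

Fixing only the height and deriving the width from the original proportions is what keeps the handwriting undistorted, which is the reason the source gives for preserving the aspect ratio.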

Ethics Statements

Informed consent was obtained from all participants that the collected dataset will be used for research purposes and can be made publicly available to the research community. The consent is available on each data collection form page (in the "raw_dataset" folder, under each data collection form). Moreover, the participant information has been fully anonymized.

Since the dataset was collected as part of a classroom assignment of a course, it did not require any prior approval from the ethics committee.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The author acknowledges the support provided by students of the "Machine Learning" class at Prince Mohammad Bin Fahd University (PMU), who helped collect this dataset. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

1. Altwaijry N., Al-Turaiki I. Arabic handwriting recognition system using convolutional neural network. Neural Comput. Appl. 2020;8.

2. Alkhateeb J.H. An effective deep learning approach for improving off-line Arabic handwritten character recognition. Int. J. Softw. Eng. Comput. Syst. 2021;6:53–61.

3. Rehman A., Naz S., Razzak M.I. Writer identification using machine learning approaches: a comprehensive review. Multimedia Tools and Applications. 2019;78:10889–10931.

4. Abdi M.N., Khemakhem M. A model-based approach to offline text-independent Arabic writer identification and verification. Pattern Recognit. 2015;48:1890–1903.

5. He S., Schomaker L. Fragnet: Writer identification using deep fragment networks. IEEE Trans. Inf. Forensics Secur. 2020;15:3013–3022.

6. Khan M.A., Mohammad N., Brahim G.B., Bashar A., Latif G. Writer verification of partially damaged handwritten Arabic documents based on individual character shapes. PeerJ Comput. Sci. (submitted).

7. Najadat H.M., Alshboul A.A., Alabed A.F. Arabic handwritten characters recognition using convolutional neural network. 2019 10th International Conference on Information and Communication Systems (ICICS). 2019; pp. 147–151.

8. Pechwitz M., Maddouri S.S., Märgner V., Ellouze N., Amiri H. IFN/ENIT - database of handwritten Arabic words. In Proc. of CIFED. 2002; pp. 129–136.

9. Mahmoud S.A., Ahmad I., Alshayeb M., Al-Khatib W.G., Parvez M.T., Fink G.A., Märgner V., El Abed H. KHATT: Arabic offline handwritten text database. Proceedings - International Workshop on Frontiers in Handwriting Recognition, IWFHR. 2012; pp. 449–454.

10. Maadeed S.A., Ayouby W., Hassaïne A., Aljaam J.M. QUWI: An Arabic and English handwriting dataset for offline writer identification. 2012 International Conference on Frontiers in Handwriting Recognition. 2012; pp. 746–751.

Articles from Data in Brief are provided here courtesy of Elsevier

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8866147/
