==Introduction==
This documentation is written so that the next person who trains Tesseract will have an easier time getting into the process. It is an extension of the existing Tesseract wiki, which is missing a lot, and I will include other sources for relevant information. It is still recommended to read the [[#Sources|Tesseract wiki]], and also to do Google searches in case I missed something.
===Downloads===
* Cygwin: [https://www.cygwin.com/ link] (both the 32- and 64-bit versions work)
* ocrd-train: [https://github.com/OCR-D/ocrd-train GitHub repo]
* My version of ocrd-train/'''Makefile''': [[Media:Makefile.doc|Download]] (it has a .doc extension, so just copy its contents into your Makefile file) or https://www.dropbox.com/s/bi7gmahjdnquom8/Makefile?dl=0
* My version of ocrd-train/'''generate_line_box.py''': [[Media:generate_line_box.docx|Download]] (it has a .docx extension, so just copy its contents into your .py file) or https://www.dropbox.com/s/c7w16eb9cgt8lfb/generate_line_box.py?dl=0
* Official langdata: [https://github.com/tesseract-ocr/langdata link]
* Official "best" traineddata: [https://github.com/tesseract-ocr/tessdata_best link]
* More scripts: [https://github.com/Shreeshrii/tess4training link]
* Evaluation tool: [https://github.com/Shreeshrii/ocr-evaluation-tools link]
==Installation & Setup==
===Windows===
On Windows you cannot use the training tools required by Tesseract by default, so "installing" on Windows here means setting up for training Tesseract: basic functions such as running OCR on a picture are available even from the CLI (so no third-party program is needed), but you cannot really train it. For the purpose of training I chose Linux, as it is the only OS the Tesseract wiki confirms can actually run the training tools.

Just because the setup I used is Linux-based does not mean you cannot run it on Windows. My tool of choice for this is Cygwin (alternatives: MSYS, MinGW), because Tesseract (and the other required packages) are in the list of packages at setup.

For installing Cygwin and setting up (compiling) Tesseract on it, I used an outdated [[#Sources|Cygwin guide]], because I could not find a more up-to-date one and, for reference, it works well. (It was written for 3.03–3.05; the changes between 3.05 and 4.* are significant, but the way programs are compiled did not change.)
# Run the downloaded Cygwin setup file. Do not install into a path containing spaces (e.g. not /My Cygwin/Cygwin; the way to go: /My_Cygwin/Cygwin).
#* At package selection there will be an option to choose from collections of packages. From these you will need Base, Devel and Graphics (set them to Install). You can skip other collections, but since the guide is outdated I am not sure exactly which ones, so feel free to experiment with what is not needed. (Recommendation by Paul Vorbach: skip Publishing, Gnome, and KDE.) The reason for skipping packages is simply to cut down on install time. All other packages should be left on default.
# Continue the installation until it is done. Errors might come up during installation, so you might have to retry. (It happened to me, but the second time it installed without any errors.)
Now that we have a clean Cygwin setup, let us make sure we have all the packages we need. I just ran the downloaded Cygwin setup file again and at package selection searched for the specific packages needed.

'''Note on versioning:''' As of this writing, the latest version of everything is the best option you have. If you want to know the exact version dependencies, visit the Tesseract wiki, but in my experience Cygwin (or whatever setup you are using) will tell you when you have conflicts because of an old version of this or that.
Now we should have all the packages needed on our Cygwin. Our next step is to build the training tools.
# Locate tessdata in Cygwin (for me it is in: (…)64).
# In tessdata (<code>cd /usr/share/tessdata</code>) execute the following commands:
#* <code>make</code>
#* <code>make install</code>
#* <code>make clean</code>
# After the <code>make</code> commands you should end up with a training folder in tessdata, and in 64 you should have the following three files:
#* tesseract.exe
#* tesstrain.sh
#* tesstrain_utils.sh
# As a last step, make sure you have access to the <code>tesseract</code>, <code>tesstrain.sh</code> and <code>lstmtraining</code> commands (simply typing the command will give back its help page); a quick check is sketched below.
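A minimal sanity check, assuming the build put the tools on your PATH (if a command is not found, run it from the directory where it was built):

<pre>
# each command, typed bare, should answer with its version or help/usage page
tesseract
tesstrain.sh
lstmtraining
</pre>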
This is the end of the setup. You should now have everything you need for using and training Tesseract 4.*.
===Linux===
As I mentioned above, for Windows I simply compiled Tesseract as if I had a Linux machine. So instead of using the Cygwin package manager, you should get the required packages according to your distribution of Linux. The documentation on the Linux installation is decent on the official [https://github.com/tesseract-ocr/tesseract/wiki/Compiling#linux Tesseract wiki], but if it is confusing or just not enough, I recommend this [[#Sources|Linux guide]] (also outdated, but it should still be valid for the setup at least, including the creation of the training tools; it was written for 3.04 and the beginning of 4.*). All commands should be executed as superuser.
# Install Tesseract:
#* Outdated way: <code>apt-get install tesseract-ocr</code>
#* Ubuntu 18.x: <code>apt install tesseract-ocr</code> and <code>apt install libtesseract-dev</code>
#* Other distributions: in the official [[#Sources|Tesseract wiki]] you should be able to find the install commands for your Linux distribution.
#* '''Optional''': get language data in the same fashion, for example <code>apt-get install tesseract-ocr-{LANG}</code> ({LANG} = language code: dan, swe, eng…). It is also possible to get the language data manually ([[#Downloads|here]]).
# The next step is to make sure you have all the tools needed for training Tesseract:
#* imagemagick: <code>apt-get install imagemagick</code>, a tool for image editing and conversion.
#* libicu-dev, libpango1.0-dev, libcairo2-dev: these three libraries are required for Tesseract 4's training: <code>apt-get install libicu-dev</code>, <code>apt-get install libpango1.0-dev</code>, <code>apt-get install libcairo2-dev</code>. ([https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#additional-libraries-required Tesseract wiki])
# To finish the setup we need to make our tools, so go into the Tesseract directory (by default named tessdata) and execute the following commands: <code>make</code>, <code>make training</code>, <code>make training-install</code>.
#* These three commands, executed one after another, should give you access to the following commands (in a proper Linux setup they might be available from anywhere, but they should at least work from the Tesseract directory): <code>tesseract</code>, <code>tesstrain.sh</code>, <code>lstmtraining</code>. A consolidated recap is sketched below.
This is the end of the setup. You should now have everything you need for using and training Tesseract 4.*.
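To recap the Linux steps in one place, here is a minimal sketch for Ubuntu 18.x, using only the packages and commands listed above (run as superuser, or prefix with sudo):

<pre>
# OCR engine and development headers
apt install tesseract-ocr libtesseract-dev
# tools and libraries required for training
apt-get install imagemagick libicu-dev libpango1.0-dev libcairo2-dev
# build the training tools (run from the Tesseract directory)
make
make training
make training-install
</pre>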
===OCR-D/ocrd-train===
ocrd-train is a custom setup for Tesseract training. To install it, you have to copy the master branch from [[#Downloads|github]]. On closer inspection you can see that it also includes a decent amount of training data, but what you should pay attention to are the Makefile and the ground-truth folder. The ground-truth folder is where your training files will be created. The Makefile is the main script you are going to use, and by default it needs some editing (I recommend keeping the original Makefile to compare my script against it). Depending on your Python version (below Python 3), you might need to edit the generate_line_box.py file ([https://github.com/OCR-D/ocrd-train/issues/18 reason]); you can find the already edited version of it at [[#Downloads|downloads]].
Now that you have copied ocrd-train into your setup (in my case Cygwin), you still have to connect these tools to Tesseract and Leptonica. To do so, I simply used commands in the Makefile to compile Tesseract and Leptonica inside my ocrd-train folder, but I am sure that with a little editing you can make the tools use your specific folder structure. To compile Tesseract and Leptonica, use the following commands (order counts; Tesseract will give you an error without Leptonica): <code>make leptonica</code> and <code>make tesseract</code>. These commands will compile versions of Tesseract and Leptonica; the versions can be set as variables inside the Makefile. To make sure you have everything set up, you should have access to the <code>make</code> command (but if you already compiled Tesseract and Leptonica, this is already working for you). The whole sequence is sketched below.
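Putting these steps together, a minimal sketch (the ~/ocrd location is just my folder structure, described below; the repository URL is the one from [[#Downloads|Downloads]]):

<pre>
# copy the master branch of ocrd-train
git clone https://github.com/OCR-D/ocrd-train.git ~/ocrd
cd ~/ocrd
# order counts: Leptonica first, then Tesseract
make leptonica
make tesseract
</pre>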
===Folder Structure===
I will share my folder structure here, but you should know that you can make your own; you will just have to edit the scripts for your own setup. I work with Tesseract in Cygwin, so my folder structure starts there.
'''Bare Tesseract:''' I let everything set itself up, so my Tesseract and Leptonica packages are in their default place. My folder structure starts in /home/{user}/, which can be accessed by ~/.
* ~/langdata: for me it was convenient to store it here
* ~/tesstutorial
* ~/tesstutorial/eval: holds files specific to evaluation
* ~/tesstutorial/output: holds the output of Tesseract training
* ~/tesstutorial/train: holds files specific to training
* ~/tesstutorial/final: holds my finished "working" traineddata (if you plan on doing a lot of training, it is really worth keeping a clean library of your traineddata)
* ~/tmp: just in case you need a temporary folder
I also copy the best_traineddata I want to use into the tessdata folder (by default it should be /usr/share/tessdata). Commands to recreate this layout are sketched below.
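A minimal sketch for recreating this layout (the traineddata file name is a placeholder; use whichever best_traineddata you downloaded):

<pre>
mkdir -p ~/langdata ~/tmp
mkdir -p ~/tesstutorial/eval ~/tesstutorial/output ~/tesstutorial/train ~/tesstutorial/final
# copy the best_traineddata you want to start from into tessdata
cp eng.traineddata /usr/share/tessdata/
</pre>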
'''ocrd-train Tesseract:''' I put my ocrd-train into /home/{user}/, which can be accessed by ~/. In this case the given folder structure is as follows:
* ~/langdata: for me it was convenient to store it here
* ~/ocrd
* ~/ocrd/data: holds the other folders, but also all-boxes, all-lstmf, list.eval, list.train, radical-stroke.txt and unicharset (these files are not there by default; they are products of the training)
* ~/ocrd/data/ground-truth: holds sample data (during training it will hold .box and .lstmf files)
* ~/ocrd/data/{LANG}: {LANG} is your model name, usually a language code. It holds the prototype traineddata.
* ~/ocrd/data/checkpoints: holds checkpoint files produced by <code>lstmtraining</code>
* ~/ocrd/leptonica-1.7: you don't have to make this if you compiled Leptonica with ocrd-train
* ~/ocrd/tesseract-{version}: you don't have to make this if you compiled Tesseract with ocrd-train
* ~/ocrd/usr: after compiling, this folder holds tessdata (<code>cd ~/ocrd/usr/share/tessdata</code>)
* ~/tmp: just in case you need a temporary folder
==Overview==
===Tutorial===
The scope of this tutorial is not to be a step-by-step guide for training Tesseract, but to give a basic understanding of Tesseract 4.0.0's training. I will still provide step-by-step guidance, but the purpose of each step is to understand why it is required. Training Tesseract 4.0.0 can be a very time-consuming exercise because it uses LSTM neural networks for training, but the results are worth it, because it can produce better accuracy, including any specialization you might want to implement. It is also important to understand that there are not many definite answers about training. For example, "What is the best way to train for this smudge on a picture?" cannot be answered, because it depends on the environment you want to use it in. The data that is going to be run through Tesseract is that environment, e.g., "Oh, the smudge is always at the top right corner, because someone is too lazy to use a coaster," so we teach Tesseract, through the many samples we have, that if it sees a smudge at the top right corner, there is nothing under it; it is just a smudge. It is necessary to look through your own processes to understand what to teach Tesseract and how.
===Training===
Tesseract 4.0.0 introduced LSTM neural networks into its OCR engine, but why is that good for us? With LSTM it is possible to achieve even higher accuracy than with the older versions. The drawback of LSTM is that it is a lot more computation-heavy and requires much more training data (sample data). For example, for Latin-based language data they used 400,000 text lines over 4,500 fonts as training data. With the new computing power requirements also comes an increase in training time. Around Tesseract version 3.05, training took approximately minutes to hours, but in 4.* it can take from days up to weeks or even more. (This does not mean every kind of training will take at least a day; it means that if you want proper results, it will take some time to cover your specifications.)
'''Training Options:'''
* '''Fine tune''': Starts from an existing model (best_traineddata is recommended). You can train with this method for minor specializations, for example new characters or problematic fonts. The point is that with this method you can focus on subtle differences. Important to note: out of the three options this is the least training-data heavy (which means it can produce improvements with a comparatively small amount of sample data). [[#Fine Tuning for Font|Example 1]] [[#Fine Tuning for a few characters|Example 2]]
* '''Cut off (the top) layer''': The LSTM neural network works through layers, and they are specific to languages, so if you want to, for example, rework a traineddata, just remove a few top layers and rework it to your own specification; after fine tuning, this method is your best option. It is recommended to use data that is somewhat close to the desired end result, for example: "I want to train my own German data, and I take the Latin data to start training on." '''Important note:''' this requires at least a basic understanding of VGSL ([https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs Tesseract wiki]; if that is unclear, the [[#Sources|publication by Awni Hannun]] might provide more insight about neural networks). [https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-just-a-few-layers Example]
* '''Retrain from scratch''': As the name suggests, this is the most time-heavy of the training options, and without a huge set of representative samples it is a waste of your time. [https://www.endpoint.com/blog/2018/07/09/training-tesseract-models-from-scratch Example]
===Training Process===
'''Important note''': Before you begin, make sure to save and clean your working space. If you repeat the whole training multiple times (for example, trying different image sizes), you need to keep the list.train/list.eval files aside, otherwise you will compare against a different set of eval images (with a small data set this can make a big difference).
'''Prepare Sample data:'''
It is essential to find the right amount and type of sample data for your desired training outcome; this will require either decreasing or increasing the amount of sample data and changing its type. There are some requirements, such as the type of files you can use, and in some cases there are recommendations for the format. The basic idea of any machine learning training is the following: split the data into two parts, use one for training, and use the other to check the result of the training. The problem with training too much on only specific problems is that you get exceptionally good at these, but you overspecialize and get worse at everything else (this is called overfitting). So you get 99.999% accuracy on the training set and 74% on the eval set. What really matters is the real-world data (real-world results are usually a little worse than eval). Generally speaking, the "best practice", according to [https://groups.google.com/forum/#!msg/tesseract-ocr/COJ4IjcrL6s/C1OeE9bWBgAJ the forum] and some other [[#Sources|sources]], is tiff/txt file pairs (tiff: ~30–40px text height; txt: a single line of correct text; PSM=13 (PSM.RAW_LINE); trim white borders; at least 10px of free space between the border and the text); a sketch of preparing such a pair follows. Testing has been done in this matter, and that is where Lorenzo drew his conclusions from ([https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/xk2ErJnFBQAJ link]). I would warn everyone that these results should not be taken as more than a test of a specific scenario with fonts. The information discovered is still relevant, but that does not mean it will fit your situation, so make sure you do your own tests. It seems to be the standard that people use tiff/txt file pairs as sample data, but the quality and format are not standardized, so to quote Lorenzo: "Try different options to see what works best".
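As an illustration of the tiff/txt recommendations above, here is a hedged ImageMagick sketch for turning a scanned line into a training pair (the file names are made up, and the exact numbers should be tuned to your own data; in my copy of ocrd-train the ground-truth text file is named *.gt.txt):

<pre>
# trim white borders, scale so the text height lands around 30-40px,
# then add back a 10px white margin around the line
convert raw_line.png -trim -resize x40 -bordercolor white -border 10 line_001.tif
# the matching single line of correct text
echo "The correct transcription of the line" > line_001.gt.txt
</pre>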
'''Create training files:'''
Working from tiff/txt file pairs (sample data), the following files can and have to be created:
* '''unicharset''': Tesseract needs to be given a constrained list of the characters you want it to recognize; trying to make it choose out of the whole Unicode set would be computationally unfeasible. The unicharset defines the set of graphemes along with information about their basic properties. It is extracted from .box files (with the <code>unicharset_extractor</code> tool).
* '''.box files''': These files are the representation of the OCR segmentation of the provided picture. To understand it better: "This file is the map for the given correct text on the picture." Created by various tools (including OCR-D, where the creation of .box files is automatic).
* '''.lstmf files''': A collection of training information (tailored for LSTM) derived from the tiff/box pair and their text file (in the case of OCR-D these are created automatically).
* '''list.train/list.eval''' (you might find them as {LANG}.training_files.txt): Lists of paths to the wanted .lstmf files. list.train is the list of files we want to do the training on; list.eval is the list of files we want to use for evaluation. Distributing your sample data between train and eval can be done manually, but OCR-D does it automatically according to a ratio (that can be set as a variable). OCR-D by default does a 90/10 split (RATIO_TRAIN := 0.90). If you have some data (1,000–10,000 samples), do an 80/20 split; if you have a lot of data (100k+ samples), 90/10 or even 95/5 may be fine.
* '''radical-stroke.txt''': Even though it is only used for Han language training, Tesseract gives an error if the file is not found for other languages.
* '''all-lstmf/all-boxes''': (specific to OCR-D) Very similar to list.train/list.eval; the only difference is that the all-lstmf/all-boxes files contain all the paths. The purpose is simply to have all of those file paths in one place.
* '''Prototype traineddata (starter training data)''': This traineddata file will be used during training, so it is the last file you need before starting to train Tesseract. It will be created in its own folder (folder name is {LANG}); alongside it you will find {LANG}.charset_size=42 (the size might differ, this is just an example), {LANG}.unicharset, and finally a {LANG}.traineddata.
'''Note:''' In case you plan on doing everything manually, you should still look into OCR-D's Makefile, because at the proto-model build step it shows another command to be used: <code>combine_lang_model</code>. To see all its variables and flags, simply type <code>combine_lang_model</code>. (This is a training tool; if you cannot reach it, try again from your OCR-D folder or from wherever you created your training tools.) A hedged example follows.
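A sketch of what such a manual proto-model build might look like (paths follow my [[#Folder Structure|folder structure]], and "foo" is a placeholder model name; check the command's own help output for the authoritative flag list):

<pre>
combine_lang_model \
  --input_unicharset ~/ocrd/data/unicharset \
  --script_dir ~/langdata \
  --output_dir ~/ocrd/data \
  --lang foo
</pre>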
'''Process of Training:'''
With OCR-D: With this custom script you will be using the <code>make training</code> command. If you modified/added info to the script, it will execute option 1 (if you have variables related to option 1), otherwise it will execute option 2 (if-else).
* Option 1: Include old traineddata (missing reason)
** '''Old traineddata (!)''': Specify the path for the old traineddata (not sure, but I think in my case it was best_traineddata). This is set by "old_traineddata".
** '''.lstm file (!)''': You get a .lstm file by extracting it from a traineddata file (in my case I was extracting it from my prototype traineddata with the <code>combine_tessdata</code> tool). This can be used with <code>--continue_from</code>.
* Option 2: Just train (a hedged command sketch follows this list)
** '''net_spec''': Specification of the LSTM neural network layers (written in VGSL).
** '''learning_rate''': Specifies the learning rate of the neural network (by default it seems to be 20e-4, [[#Learning Rates|more on this]]).
** '''model_output''': Location and name for the produced checkpoints.
** '''traineddata''': Location of your prototype traineddata.
** '''train/eval_listfile''': Files containing a list of .lstmf files (in the case of OCR-D these are the list.train/list.eval files).
** '''max_iterations''': One iteration means one image, so max_iterations should be at least equal to your image count. If you have a lot of images, you may find you do not need to process all of them to reach the "saturation" point where extra training is useless, but normally you want to process all of them, even a few times (until the eval score stabilizes or gets worse for a few iterations).
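To make the option 2 variables concrete, here is a hedged <code>lstmtraining</code> sketch (paths follow my [[#Folder Structure|folder structure]], "foo" is a placeholder model name, and the net_spec is only an example VGSL specification; the number after O1c must match your unicharset size):

<pre>
lstmtraining \
  --traineddata ~/ocrd/data/foo/foo.traineddata \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]" \
  --model_output ~/ocrd/data/checkpoints/foo \
  --learning_rate 20e-4 \
  --train_listfile ~/ocrd/data/list.train \
  --eval_listfile ~/ocrd/data/list.eval \
  --max_iterations 10000
</pre>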
Bare Tesseract: There is no difference when using only Tesseract (= Bare Tesseract); you run the same commands as in the OCR-D version, but you write them by hand. So my suggestion is to open OCR-D's Makefile, look through the commands in there, and try to adapt the variables to your setup.
'''Combine files:''' (Not every file mentioned is required; some are optional.) In practice this is the last step of the process: you combine all those training files you produced and ran the training on. This results in a traineddata file which you can call your own, but if you want to achieve something specific, you will have to make sure your new traineddata produces the required results. This is done through evaluation. Tesseract prints evaluation information during the training, but you cannot rely only on that. It is good practice to also check the training progress with <code>lstmeval</code>, as that will give you a more accurate representation of the evaluation.
'''Evaluation:'''
* <code>lstmeval</code> command: Executes an evaluation.
** '''model''': The model to be evaluated (not entirely sure about this); it can be a checkpoint or a traineddata.
** '''traineddata''': Prototype traineddata. (Optional)
** '''eval_listfile''': List of evaluation files (in OCR-D it is list.eval; in Bare Tesseract it is probably {LANG}.training_files.txt).
The command evaluates based on the character and word error rates. This is a useful step in order to be sure that we achieved actual improvement in the results, but also in between training, because you can evaluate checkpoints too. A hedged invocation is sketched below.
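A sketch under my OCR-D folder structure (the checkpoint name is an example of the {model_output}{error}_{iteration}.checkpoint naming shown later):

<pre>
lstmeval \
  --model ~/ocrd/data/checkpoints/foo0.897_31.checkpoint \
  --traineddata ~/ocrd/data/foo/foo.traineddata \
  --eval_listfile ~/ocrd/data/list.eval
</pre>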
Edit: an evaluation tool: [https://github.com/Shreeshrii/ocr-evaluation-tools link]
'''Am I done training?'''
In my opinion this is the most important step. Reaching the point where you have the final traineddata, tested and verified to be as close to perfect as the current state of Tesseract allows, requires a lot of research and testing dedicated to the specific problem you want to teach Tesseract to correct.
==Training Examples==
===Fine Tuning for a few characters===
If you want to make sure that Tesseract recognizes special characters, you will need to fine-tune Tesseract for those characters. This relies heavily on the unicharset files, so make sure every new character is included properly. According to the forum, the OCR-D Makefile script needs some changes, so don't forget to check [[#Downloads|my version]]. The changes include a fix for the recreation of the list.eval/list.train files on each run, which is bad because it will mix your train and eval files. The fix was a slight change to the <code>lists:</code> target, but we also had to include a new target, <code>train-lists:</code>. With these changes you only have to execute <code>make train-lists</code> once, when you start a new training. (The changes are between lines 83–88.) Now we have the training data and the script ready for training. To start it, use OCR-D's way, which is the <code>make training</code> command (with properly set variables in the Makefile script, it will execute the <code>lstmtraining</code> command with your folder structure); a hedged invocation is sketched below. This way of doing the training adds nothing but crafts the commands for you according to the variables you set, which makes it no better than the manual way, but it definitely makes things easier and smoother.
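For reference, a sketch of such an invocation (the variable names are the ones used in my copy of the Makefile; check your own copy, as names and defaults may differ):

<pre>
# run once when starting a new training, to build the train/eval lists
make train-lists
# then start the actual training; variables override the Makefile defaults
make training MODEL_NAME=foo START_MODEL=eng MAX_ITERATIONS=10000
</pre>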
Don’t forget to do regular evaluation to make sure you are training in the right direction.
As you read this, you might think these cannot be all the required steps, but they are. The difficulty lies not in executing the training, but in providing a proper and large amount of sample data, so that at the end you have a traineddata which gives better results than the previously used traineddata. It is also very important to understand the output of your commands, because you will have to make changes according to it.
You will find your newly created traineddata file in <code>~/ocrd/data</code> (according to my [[#Folder Structure|folder structure]]).
===Fine Tuning for Font===
To understand the files Tesseract uses for training, let's have a look at the example code from the Tesseract wiki. The first sample never ran for me; it gives the following error: <code>/usr/bin/language-specific.sh: line 1175: FONTS: unbound variable</code>. This is because the --fontlist flag was not specified. (It is possible to run <code>tesstrain.sh</code> without specifying any fonts, but this example is about fonts.) If we include the --fontlist flag in the sample code, it runs as intended.
<pre>
tesstrain.sh --fonts_dir /usr/share/fonts \
  --fontlist "Arial" "Impact Condensed" \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata/ \
  --tessdata_dir ./ \
  --output_dir ~/tesstutorial/engtrain
</pre>
Let’s go over the flags. (Note that you can specify these flags out of order.)
* '''fonts_dir''': Directory where your fonts are (in my Cygwin they were in /usr/share/fonts by default).
* '''fontlist''': Specifies which fonts we want to include in our training data. (To list all available fonts: <code>text2image --list_available_fonts --fonts_dir /usr/share/fonts</code>.)
* '''lang''': Specifies which language you want to use. (Use a 3-character language code, such as dan, eng, swe.)
* '''linedata_only''': Used for LSTM training.
* '''noextract_font_properties''': I could not find any information on what it does. My guess is that this flag makes the script skip extracting font properties.
* '''langdata_dir''': Location of langdata (get it from [[#Downloads|here]]).
* '''tessdata_dir''': Location of your tessdata. (Best traineddata (get it [[#Downloads|here]]) is recommended; you can use the default traineddata too.)
* '''output_dir''': Simple folder to keep script-generated files ([[#Folder Structure|folder structure]] explained in the Overview).
This command will create our (default) starter training data for us. This includes:
* '''eng.charset_size=110.txt''' (the number might be different for you)
* '''eng.traineddata''': prototype traineddata
* '''eng.unicharset'''
* '''eng.{FONT}.exp0.lstmf''': a training file for each font
* '''eng.training_files.txt''': holds the locations of the .lstmf files
Now that we have our starter training data let’s make our evaluation data for our starter.
<pre>
tesstrain.sh --fonts_dir /usr/share/fonts \
  --fontlist "Impact Condensed" \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/langdata/ \
  --tessdata_dir ./ \
  --output_dir ~/tesstutorial/engeval
</pre>
If you look closely, the only difference is that <code>--fontlist</code> changed, because while we train with certain fonts, we also want to evaluate on different ones. This example is very minimalistic in this regard. (If it is still unclear, I recommend reading the [[#Training Process|training process]] again, mainly focusing on the part about train and eval sample data.)
'''Note:''' This example is based on the Tesseract wiki, which warns that these sample codes demonstrate the training process but are far from efficient for real-life situations: ''"For making a general-purpose LSTM-based OCR engine, it is woefully inadequate, but makes a good tutorial demo."''
This command will result in the same types of files, but we will leave them alone until we reach further steps in the process of training. The next step would be starting the training, but before that we have to prepare one more file. To start the training we can use best_traineddata (from the repo), but we have to extract the .lstm file from that best_traineddata with <code>combine_tessdata</code> so we can start our training. Extraction: <code>combine_tessdata -e ./eng.traineddata ~/tesstutorial/engoutput/eng.lstm</code> (<code>-e</code> is the flag for extracting files from a traineddata; after the flag, the first path is the best_traineddata and the second path is the output. To see the other options of this tool, simply type <code>combine_tessdata</code>.)
Now we can start the training of Tesseract with the following command:
<pre>
lstmtraining --model_output ~/tesstutorial/engoutput/impact \
  --continue_from ~/tesstutorial/engoutput/eng.lstm \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --old_traineddata ./eng.traineddata \
  --max_iterations 3600 \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt
</pre>
Let’s go over the flags. (Note that you can specify these flags out of order.)
* '''model_output''': Sets the output location and the naming for checkpoints.
* '''continue_from''': Here we specified a .lstm file (extracted from best_traineddata), but you can even specify a checkpoint, meaning that if you have a very promising checkpoint, you can simply continue working on it with this flag.
* '''traineddata''': The starter training data we prepared.
* '''old_traineddata''': The traineddata we started from, which was best_traineddata in this case.
* '''max_iterations'''
* '''train_listfile''': The previously created eng.training_files.txt.
This command will take more time to finish, so be patient. After finding the necessary files and stating the network specification, it will start creating checkpoints and printing them with their stats. For example:
<pre>
2 Percent improvement time=31, best error was 100 @ 0
At iteration 31/100/100, Mean rms=0.399%, delta=0.274%, char train=0.897%, word train=2.463%, skip ratio=0%, New best char error = 0.897
Transitioned to stage 1
wrote best model:/home/kh/tesstutorial/engoutput/impact0.897_31.checkpoint
wrote checkpoint.
</pre>
Reading this output: the three numbers in "iteration 31/100/100" stand for learning_iteration/training_iteration/sample_iteration, while rms, delta, char train, word train, skip ratio and best char error are the error statistics of the current state of training.
Now that we have checkpoints, we can finally evaluate them on the eval training files.
<pre>
lstmeval --model ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
</pre>
The evaluation is very important because it is your only tool to see whether you have reached your desired results. For example, while you fine-tune on a new font (or fonts), you may get worse on all the others, so if you care about other fonts too, make sure you also evaluate on those fonts. Once the desired results are achieved, you can combine your training files into a finalized traineddata. The command goes as follows:
<pre>
lstmtraining --stop_training \
  --continue_from ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --model_output ~/tesstutorial/final/eng.traineddata
</pre>
This will output an eng.traineddata into the ~/tesstutorial/final folder, which is your newly trained traineddata (according to my [[#Folder Structure|folder structure]]).
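To try out the finished traineddata, you can point Tesseract at that folder (the image name is a placeholder):

<pre>
tesseract sample.png output -l eng --tessdata-dir ~/tesstutorial/final
cat output.txt
</pre>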
==Additional information==
===Learning Rates===
The following guides and posts about learning rates are not specific to Tesseract, but to neural networks in general (Tesseract 4.* uses a neural network to learn). So if you want to experiment with learning rates (which could result in faster and more accurate training), you should read the following guides and posts first. These links provide a general idea of how learning rates affect the training process, and also give test results and actual methods to find the sweet spot for learning rates.
* Learning Rate Tuning in Deep Learning: A Practical Guide: [http://mlexplained.com/2018/01/29/learning-rate-tuning-in-deep-learning-a-practical-guide/ link]
* How can I best adjust the learning rate according to the dataset in a deep neural network?: [https://www.quora.com/How-can-I-best-adjust-the-learning-rate-according-to-the-dataset-in-a-deep-neural-network link]
* Setting the learning rate of your neural network: [https://www.jeremyjordan.me/nn-learning-rate/ link]
* How to pick the best learning rate for your machine learning project?: [https://medium.freecodecamp.org/how-to-pick-the-best-learning-rate-for-your-machine-learning-project-9c28865039a8 link]
==Sources==
* Official Tesseract wiki for OS setup: [https://github.com/tesseract-ocr/tesseract/wiki link]
* Official Tesseract wiki for training: [https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 link]
* Official Tesseract forum: [https://groups.google.com/forum/#!forum/tesseract-ocr link]
* Official Tesseract technical documentation: [https://github.com/tesseract-ocr/tesseract/wiki/Technical-Documentation link]
* Cygwin guide by Paul Vorbach: [https://vorba.ch/2014/tesseract-cygwin.html link]
* Linux guide by Ivan Vanney: [https://linuxhint.com/install-tesseract-ocr-linux/ link]
* Tesseract training guide by Kamil Ciemniewski: [https://www.endpoint.com/blog/2018/07/09/training-tesseract-models-from-scratch link]
* Sequence Modeling with CTC (Connectionist Temporal Classification) by Awni Hannun: [https://distill.pub/2017/ctc/ link] (Hannun, "Sequence Modeling with CTC", Distill, 2017.) This publication is relevant for a better understanding of how neural networks work.