How to use Loghi to read handwritten dutch text from hundreds of years ago
Posted on: 1-3-2024
To extract old handwritten dutch text and view the results (PageXML) in PageViewer we need to follow the following steps:
- Clone the Loghi repo
- Download pretrained models
- Edit na-pipeline.sh
- Use GPU to speed things up
- Run the project
- View the results in PageViewer
- Extract text using PageXML tools
Clone the repo
We will be using this project to read old handwritten dutch text: https://github.com/knaw-huc/loghigit clone git@github.com:knaw-huc/loghi.git
Download pretrained models
Download the public pretrained models and other necessities from: surfdrive. I downloaded everything and put them in the same folder as where I put the Loghi project, but feel free to place them wherever.Edit na-pipeline.sh
na-pipeline.sh is a script provided by KNAW, Koninklijke Nederlandse Akademie van Wetenschappen, to transcribe scans/pictures. Set the following three variables, inside na-pipeline.sh, by pointing to the just installed files.(As noted in the README.md from the Loghi project I used general and generic-2023-02-15 for the detection of baselines and HTR respectively.)
LAYPABASELINEMODEL=/home/jdwaal/Workspace/Personal/machine-learning/laypa/general/baseline/config.yaml
LAYPABASELINEMODELWEIGHTS=/home/jdwaal/Workspace/Personal/machine-learning/laypa/general/baseline/model_best_mIoU.pth
HTRLOGHIMODEL=/home/jdwaal/Workspace/Personal/machine-learning/loghi-htr/generic-2023-02-15
Use GPU to speed things up
You can run the script without using the power of your GPU and use the CPU instead, but it runs very slowly. If you have a nvidia GPU you can follow this guide to speed things up:Run nvidia-smi to see if things are working, I had to restart my pc for it to give correct output.
Edit na-pipeline.sh
edit na-pipeline.sh and set the GPU variable equal to 0 if it wasn't already set to it to make sure Loghi uses your GPU.GPU=0
My GPU doesn't support blfoat16 which resulted in the following error.
RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.
To fix it I had to add the following parameter to line 104 of na-pipeline.sh MODEL.AMP_TEST.ENABLED False
This is what line 104 should look like:
--opts MODEL.WEIGHTS "" TEST.WEIGHTS $LAYPABASELINEMODELWEIGHTS MODEL.AMP_TEST.ENABLED False | tee -a $tmpdir/log.txt
Run the project
I used the following image which is a page from an old family members diary.Place this image or another image inside a directory, for example a directory called images
<path_to_na-pipeline.sh> <path_to_directory_of_images>
e.g.
./loghi/na-pipeline.sh /home/jdwaal/Workspace/Personal/machine-learning/images/
Running this command wil result in a new folder called Page which is placed inside your directory containing images. Within this Page directory, two new files will be created. The most interesting one is the .xml file. This is a PageXML file containing the transcribed text including coordinates of where in the image this text is placed.
View the result in PageViewer
Download PageViewer.cd into the just downloaded folder.
Run
java -jar JPageViewer.jar
Select the PageXML file and corresponding image
Turn on 'words' in the menu and hover over the highlighted words!
This is what PageViewer looks like:
Extract text using PageXML tools
Besides viewing the output of na-pipeline.sh in PageViewer you might also just want to get all text from your image.This is possible by using pagexml-tools.
Create a python script
touch index.py
Create a virtualenv
python3.10 -m venv venv
activate your virtualenv
source venv/bin/activate
Install dependencies
install pagexml-tools
install pagexml_slim
run pip freeze -l > requirements.txt
Write the script
Add the following content to the script as noted in the README.md from pagexml and in pretty print.from pagexml.parser import parse_pagexml_file
from pagexml.helper.pagexml_helper import pretty_print_textregion
pagexml_file = "<path-to-page-xml-file>"
page_doc = parse_pagexml_file(pagexml_file)
# iterative over text regions and lines
for tr in page_doc.text_regions:
pretty_print_textregion(tr)
Run the script
python index.py
Result
This is the result: Zondag 21 Mei 1826.
In goede gezondheid en vrolyke gemoedsstemmmy verlie¬
„ten wy (:myne vrouwen ik:/ des morgens 6 here het huis, stap¬
„ten op de Arnhemsche deligence, een oude kast, doch die vry
gemaklyk leed, en bevonden ons aldaar in gezelschap van Me¬
vrouw Thesing geb: d' Abo, eene lieve spraakzame vrouw, met
eene aartige Jongen haar Loontje by Zich, de Heer iuutkens
die toenboven Amersfoord woonden, en een Amsterdamsche
Molenaar; Aan de Miuderpoort; werdt de vragt gecomple¬1
„teerd door eene Arnhemsche farnelle, bestaande in eenen
morsigen papa, eene smerige mama, een goox klein kind, dat
echter onderweg voor 't gehoor niet lastig was, en nog een Schoen¬
„makers dochter; de laatste jong maar lyzig; Hunl: bagage
bestond in eene sparre doos, ter grote van een verhuis kist,
opgevuld met half rotte China s appelen, gedroogde schol¬
broodenz. — onderweg werdt het gezelschap by afwisseling
en kortstondig vermeerderd door een hesse roerman, eene
knappe oude vrouw, een kind met smerige handen, enz:
dles avonds half acht wre waren wy te Arnhem. Ik werdtaar
hetposthuis verwelkomt door den Capn adj: rembgrove, die zynen
Zwager de postmeester Borisius bezogt, men bewees ons daar
veel vriendelykheid. Op myn verzoek kwam spoedig eene
wagen voor, die ons ten 9 ure te velp bragt, alwaar wy afsty¬
ten by Brouwer in de Zwaan, waar het Lindelyk en goed was,
Roisjournaal op een togtje
naar Duitschland, in 1826.
x