How to use Loghi to read handwritten Dutch text from hundreds of years ago

Posted on: 1-3-2024

Loghi PageXML PageViewer AI

To extract old handwritten Dutch text and view the results (PageXML) in PageViewer, follow these steps:

  1. Clone the Loghi repo
  2. Download pretrained models
  3. Edit na-pipeline.sh
  4. Use GPU to speed things up
  5. Run the project
  6. View the results in PageViewer
  7. Extract text using PageXML tools

Clone the repo

We will use the Loghi project to read old handwritten Dutch text: https://github.com/knaw-huc/loghi
git clone git@github.com:knaw-huc/loghi.git

Download pretrained models

Download the public pretrained models and other necessities from SURFdrive. I downloaded everything and put it in the same folder as the Loghi project, but feel free to place the files wherever you like.

Edit na-pipeline.sh

na-pipeline.sh is a script provided by KNAW (Koninklijke Nederlandse Akademie van Wetenschappen, the Royal Netherlands Academy of Arts and Sciences) to transcribe scans and pictures. Inside na-pipeline.sh, set the following three variables so that they point to the files you just downloaded.

(As noted in the README.md of the Loghi project, I used general for baseline detection and generic-2023-02-15 for HTR.)

LAYPABASELINEMODEL=/home/jdwaal/Workspace/Personal/machine-learning/laypa/general/baseline/config.yaml
LAYPABASELINEMODELWEIGHTS=/home/jdwaal/Workspace/Personal/machine-learning/laypa/general/baseline/model_best_mIoU.pth
HTRLOGHIMODEL=/home/jdwaal/Workspace/Personal/machine-learning/loghi-htr/generic-2023-02-15
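
Before running the pipeline it can help to confirm that these three paths really exist. The sketch below is a minimal check and not part of Loghi: the function name missing_paths is made up for illustration, and the paths in the dictionary are the ones from my machine, so replace them with your own.

```python
from pathlib import Path


def missing_paths(settings):
    """Return the names of the settings whose path does not exist on disk."""
    return [name for name, path in settings.items() if not Path(path).exists()]


if __name__ == "__main__":
    # Assumption: these mirror the three variables set in na-pipeline.sh.
    base = "/home/jdwaal/Workspace/Personal/machine-learning"
    settings = {
        "LAYPABASELINEMODEL": f"{base}/laypa/general/baseline/config.yaml",
        "LAYPABASELINEMODELWEIGHTS": f"{base}/laypa/general/baseline/model_best_mIoU.pth",
        "HTRLOGHIMODEL": f"{base}/loghi-htr/generic-2023-02-15",
    }
    for name in missing_paths(settings):
        print(f"{name} points to a path that does not exist")
```

If nothing is printed, all three paths were found.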

Use GPU to speed things up

You can run the script on the CPU instead of the GPU, but it runs very slowly. If you have an NVIDIA GPU, you can follow this guide to speed things up:

nvidia and docker

Run nvidia-smi to check that everything is working. I had to restart my PC before it gave the correct output.
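
The same check can be scripted. This is a small sketch that only tests whether the nvidia-smi command exists and exits without an error; the function name gpu_visible is hypothetical, and a successful exit does not prove that Docker can reach the GPU as well.

```python
import shutil
import subprocess


def gpu_visible(command="nvidia-smi"):
    """Return True if the command is on PATH and exits with status 0."""
    if shutil.which(command) is None:
        return False  # the command is not installed or not on PATH
    result = subprocess.run([command], capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    if gpu_visible():
        print("nvidia-smi ran successfully")
    else:
        print("nvidia-smi is missing or reported an error")
```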

Edit na-pipeline.sh

Edit na-pipeline.sh and set the GPU variable to 0, if it is not already, to make sure Loghi uses your GPU.
GPU=0

My GPU doesn't support bfloat16, which resulted in the following error:

RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.

To fix it, I had to add the following parameter to line 104 of na-pipeline.sh (the line number may differ in other versions of the script): MODEL.AMP_TEST.ENABLED False

This is what line 104 should look like:

--opts MODEL.WEIGHTS "" TEST.WEIGHTS $LAYPABASELINEMODELWEIGHTS MODEL.AMP_TEST.ENABLED False | tee -a $tmpdir/log.txt
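
To find out in advance whether your GPU is affected, you can ask PyTorch directly. The sketch below assumes PyTorch is installed in the Python environment where you run it; Loghi itself runs inside Docker containers, so treat the answer as an indication only. The function name bf16_supported is made up for this example.

```python
def bf16_supported():
    """Return True or False for the current CUDA device, or None if it cannot be checked."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment
    if not torch.cuda.is_available():
        return None  # no usable CUDA device is visible
    return torch.cuda.is_bf16_supported()


if __name__ == "__main__":
    print("bfloat16 supported:", bf16_supported())
```

If this prints False, you will probably need the MODEL.AMP_TEST.ENABLED False workaround as well.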

Run the project

I used the following image, which is a page from an old family member's diary. (Image: page from an old family member's diary)

Place this image, or another one, inside a directory, for example a directory called images, and pass that directory to the script:

<path_to_na-pipeline.sh> <path_to_directory_of_images>

e.g.

./loghi/na-pipeline.sh /home/jdwaal/Workspace/Personal/machine-learning/images/

Running this command will result in a new folder called Page, placed inside the directory containing your images. Within this Page directory, two new files will be created. The most interesting one is the .xml file: a PageXML file containing the transcribed text, including the coordinates of where in the image the text is located.
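
To get a feel for what is inside the PageXML file, the sketch below reads it with Python's standard library and returns the text and the coordinates of every line. It assumes the usual PageXML layout, where each TextLine element has a Coords child with a points attribute and a TextEquiv child containing a Unicode element. The function names are hypothetical, and the XML namespace is ignored because it differs between schema versions.

```python
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_text_lines(pagexml_path):
    """Return a list of (text, points) tuples, one per TextLine element."""
    root = ET.parse(pagexml_path).getroot()
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        text, points = "", ""
        # Look at direct children only, so the text of nested Word
        # elements is not mistaken for the text of the whole line.
        for child in element:
            name = local_name(child.tag)
            if name == "Coords":
                points = child.get("points", "")
            elif name == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        text = sub.text or ""
        lines.append((text, points))
    return lines


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python read_lines.py <path-to-page-xml-file>
    for text, points in read_text_lines(sys.argv[1]):
        print(text, "|", points)
```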

View the result in PageViewer

Download PageViewer.

cd into the folder you just downloaded.

Run

java -jar JPageViewer.jar

Select the PageXML file and corresponding image

Turn on 'words' in the menu and hover over the highlighted words!

This is what PageViewer looks like: (Image: example of PageViewer)

Extract text using PageXML tools

Besides viewing the output of na-pipeline.sh in PageViewer, you might also just want to get all the text from your image.

This is possible by using pagexml-tools.

Create a python script

touch index.py

Create a virtualenv

python3.10 -m venv venv

Activate your virtualenv

source venv/bin/activate

Install dependencies

Install pagexml-tools with pip
Install pagexml_slim with pip
Run pip freeze -l > requirements.txt

Write the script

Add the following content to the script, as described in the README.md of pagexml and under pretty print:
from pagexml.parser import parse_pagexml_file
from pagexml.helper.pagexml_helper import pretty_print_textregion

pagexml_file = "<path-to-page-xml-file>"

page_doc = parse_pagexml_file(pagexml_file)

# iterate over the text regions and print their lines
for tr in page_doc.text_regions:
    pretty_print_textregion(tr)

Run the script

python index.py

Result

This is the result:
    Zondag 21 Mei 1826.
      In goede gezondheid en vrolyke gemoedsstemmmy verlie¬

„ten wy (:myne vrouwen ik:/ des morgens 6 here het huis, stap¬
„ten op de Arnhemsche deligence, een oude kast, doch die vry
gemaklyk leed, en bevonden ons aldaar in gezelschap van Me¬
vrouw Thesing geb: d' Abo, eene lieve spraakzame vrouw, met
eene aartige Jongen haar Loontje by Zich, de Heer iuutkens
die toenboven Amersfoord woonden, en een Amsterdamsche
Molenaar; Aan de Miuderpoort; werdt de vragt gecomple¬1
„teerd door eene Arnhemsche farnelle, bestaande in eenen
morsigen papa, eene smerige mama, een goox klein kind, dat
echter onderweg voor 't gehoor niet lastig was, en nog een Schoen¬
„makers dochter; de laatste jong maar lyzig; Hunl: bagage
bestond in eene sparre doos, ter grote van een verhuis kist,
 opgevuld met half rotte China s appelen, gedroogde schol¬
 broodenz. — onderweg werdt het gezelschap by afwisseling
en kortstondig vermeerderd door een hesse roerman, eene
knappe oude vrouw, een kind met smerige handen, enz:
   dles avonds half acht wre waren wy te Arnhem. Ik werdtaar

 hetposthuis verwelkomt door den Capn adj: rembgrove, die zynen
 Zwager de postmeester Borisius bezogt, men bewees ons daar
 veel vriendelykheid. Op myn verzoek kwam spoedig eene
 wagen voor, die ons ten 9 ure te velp bragt, alwaar wy afsty¬
 ten by Brouwer in de Zwaan, waar het Lindelyk en goed was,

  Roisjournaal op een togtje
naar Duitschland, in 1826.
    x
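
In the output above, several lines end in ¬ and the following line sometimes starts with „. I read these as markers for a word that is split over two lines; that is an interpretation of the output, not something taken from the Loghi documentation. Under that assumption, a small post-processing sketch can glue the two halves back together. The function name join_broken_words is made up for this example.

```python
import re

# A ¬ at the end of a line, the line break(s) after it, and an
# optional „ at the start of the next line.
BROKEN_WORD = re.compile(r"¬[ \t]*\n\s*„?")


def join_broken_words(text):
    """Join words that were split across two lines with ¬ (and optionally „)."""
    return BROKEN_WORD.sub("", text)


if __name__ == "__main__":
    sample = "gemoedsstemmmy verlie¬\n\n„ten wy"
    print(join_broken_words(sample))  # prints "gemoedsstemmmy verlieten wy"
```

A ¬ that is not at the very end of a line, such as the one in gecomple¬1, is left alone, because it is unclear whether it belongs to the text or is a recognition error.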
