OCR and Human Transcription: Working Together in FreePRO

At Free UK Genealogy, our tagline is “Human transcription of family history data”. It reflects our belief that accuracy, care, and human judgement are essential when making records freely available to researchers. With FreePRO – our new project to digitise and publish probate indexes from 1854–1943 – we are bringing this ethos into a new domain: printed probate registers.

Probate records are treasure troves of information. They record not just names and dates, but also occupations, addresses, values of estates, and the relationships between the deceased and their executors. The challenge is that these details are locked away in over 800 hefty volumes, covering nearly 7.5 million entries.

Why OCR matters

Unlike parish registers or census forms, probate indexes were printed, not handwritten. This makes them well suited to Optical Character Recognition (OCR) – the process of turning scanned images into machine-readable text. Trials with software such as Tesseract have shown promising results: for some volumes, OCR can correctly identify names and addresses in the vast majority of entries.

But OCR is never perfect. As trustees discussed in our recent meeting, even good OCR output needs checking. Numbers (such as estate values) are especially error-prone, and place names can still be mangled.

Analysis of free text

The probate entries we are processing with OCR are 'free text', but free text with a consistent structure. Without that consistency, OCR could do little more than reproduce the text of each entry. We are developing a processing framework that analyses each entry and splits it into individual items of data. The first step is to identify and separate the individual entries; errors that split one entry into two, or join two entries together, are flagged and fixed. A second, more detailed check is then run on the edited source, looking for potential errors at the level of individual fields. Finally, 'checking indexes' will tabulate alphabetically the values found for each field, so that inconsistencies in the data show up clearly.
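For readers who like to see the idea in code, here is a minimal Python sketch of those three stages. It is an illustration only, not the FreePRO framework itself. The entry layout is an assumption made for the example: each entry is taken to start on a new line with a surname in capitals, to name a registry after the word "Probate", and to end with an estate value after "Effects £". The function names, the patterns, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Assumed layout (illustration only): a surname in capitals opens each entry,
# "Probate <Registry>" names the registry, "Effects £<amount>" gives the value.
ENTRY_START = re.compile(r"^[A-Z][A-Z'\-]+\b")
EFFECTS = re.compile(r"Effects\s+£\s*([\d,]+)")
REGISTRY = re.compile(r"Probate\s+([A-Z][a-z]+)")

# Invented sample text standing in for OCR output.
SAMPLE = """\
SMITH John of 12 High Street Leeds died 3 March 1901
Probate London 5 May to Mary Smith widow Effects £250
JONES Ann of Bath died 1 June 1901 Probate Londou 2 July
to Thomas Jones farmer Effects £1,200
BROWN Henry of York died 4 April 1901 Probate York 9 May
"""


def split_entries(ocr_text):
    """Stage 1: group OCR lines into entries, one string per entry."""
    entries = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)          # a capitalised surname opens an entry
        else:
            entries[-1] += " " + line     # otherwise continue the previous one
    return entries


def flag_suspect_entries(entries):
    """Flag entries that look split (no estate value) or joined (several)."""
    flagged = []
    for position, entry in enumerate(entries):
        values_found = len(EFFECTS.findall(entry))
        if values_found == 0:
            flagged.append((position, "split?"))
        elif values_found > 1:
            flagged.append((position, "joined?"))
    return flagged


def extract_registries(entries):
    """Stage 2 (one field only): pull the registry name out of each entry."""
    found = []
    for entry in entries:
        match = REGISTRY.search(entry)
        if match:
            found.append(match.group(1))
    return found


def checking_index(values):
    """Stage 3: alphabetical table of each distinct value and its count."""
    return sorted(Counter(values).items())
```

Run on the sample, `split_entries(SAMPLE)` gives three entries, and `flag_suspect_entries` marks the third as a possible split because it has no estate value. The checking index for the registry field lists "London", "Londou", and "York" side by side, so the OCR misreading "Londou" sits directly beneath the correct spelling, where a volunteer can spot it at a glance.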

The role of human transcribers

This is where our volunteers come in. Rather than typing out every entry from scratch, volunteers will review the OCR output text against the original page image. Their task is to:

Correct errors introduced by OCR (misread letters, skipped lines, muddled numbers)
Ensure key details such as names, relationships, and dates are captured correctly
Enter markers into the source text to cope with unusual cases, such as multi-executor entries or entries split across pages

In other words, OCR provides the first draft – humans ensure the final version is accurate, searchable, and trustworthy.

Why this fits our ethos

This “OCR-assisted transcription” approach saves time while staying true to our principles. It means:

Efficiency without compromise – OCR gets us started quickly, but accuracy still comes from human care.
Volunteer expertise is central – just as volunteers in our other projects learn to decipher difficult handwriting, FreePRO volunteers will learn to spot and fix OCR quirks.
Better data for researchers – the end result will be a richly searchable database, far more flexible than existing probate indexes.

The combination of OCR and human transcription is an innovation for Free UK Genealogy. It is also a way of honouring the trust placed in us: to make records free, accurate, and useful for generations to come.

By blending technology with human judgement, FreePRO will allow family historians to search by names, occupations, executors, and even addresses – opening up stories that have been hidden for over a century.