I think there's a point here that might need a bit of clarification for some people,
Gareth, if you don't mind me putting in my two-bob's worth.
The NLA (and probably most other big organisations doing this sort of thing) use both OCR
and an image. So when you are looking at the Welsh journals online at the NLW, it's
the image that you see, and good to behold it is too! But they are also OCR'ing the
same thing, so when you search for a character string, it's the OCR text that you search through.
Sadly, the OCR isn't as good as the human eye-brain combination, so what you see as
luminously readable material on-screen is not always recognised properly within the
digitised record. Errors are far more common with old newspapers, where the original
records are sometimes not too good to begin with.
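For anyone who likes to see this sort of thing spelled out, here is a rough sketch in Python of why a search can miss something you can plainly read in the image. The sample sentence and the misreading "Thornas" (an "m" read as "rn") are made up for illustration, and the `fuzzy_find` helper and its 0.75 threshold are my own assumptions, not anything the NLA or NLW is known to use.

```python
from difflib import SequenceMatcher

def fuzzy_find(query, ocr_text, threshold=0.75):
    """Return the words in ocr_text that are at least `threshold` similar to query.

    Hypothetical helper for illustration only; the threshold is an arbitrary choice.
    """
    q = query.lower()
    hits = []
    for word in ocr_text.split():
        # Strip surrounding punctuation so "Thomas," still compares as "thomas".
        cleaned = word.strip(".,;:!?'\"()").lower()
        if cleaned and SequenceMatcher(None, q, cleaned).ratio() >= threshold:
            hits.append(cleaned)
    return hits

# Made-up OCR output in which "Thomas" has been misread as "Thornas".
ocr_text = "Mr. Lewis Thornas of Talybont opened a new colliery."

# A plain character-string search looks only at the OCR text, so it misses the name.
exact_hit = "thomas" in ocr_text.lower()

# A tolerant comparison can still pick up the near-miss.
fuzzy_hits = fuzzy_find("Thomas", ocr_text)

print("exact match found:", exact_hit)
print("fuzzy matches:", fuzzy_hits)
```

The exact search fails on the misread word while the looser comparison still finds it, which is roughly why correcting the OCR text by hand makes the records easier to find for everyone who searches later.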
What the NLA has done is enable us to see both the image of the records (newspapers,
journals etc) alongside their OCR interpretation of the text. That enables mugs like me to
spend time we haven't got offering corrections to the OCR version of the text based on
what we can see in the image before us. We all hope that makes searching the records
easier and more effective for others after us. Funnily enough, the records in Australian
newspapers that get the most attention for correction of the OCR text are the hatches,
matches etc, primarily because of family historians.
(There's plenty of Welsh interest there too. For example, there's a mountain of
material relating to Lewis Thomas, the bloke from Talybont, Ceredigion, who became a very
prominent coal miner in Queensland in the nineteenth century. And I've found rellies
of mine who emigrated in the 1880s and even traced descendants.)
On 18/03/2012, at 11:16 PM, Gareth wrote:
I must agree that the Optical Character Recognition (OCR) process is
unlikely to achieve 100% perfection - although I hesitate to say 'never'.
I have OCR'd a considerable amount of material into Genuki over the years
and some of the sources have presented major headaches.
I use a programme called OmniPage Pro by Nuance, which is a market leader,
and it is true that when faced with a long run of work it is worthwhile
persevering with the 'learning phase' to the point where most, if not all,
mis-transcriptions are eliminated over time.
The rest can, in theory, be dealt with by a final human proof reading of
course, but it all adds up to a lengthy business - as I found with the Hanes
Eglwysi Annibynnol Cymru project!
I read a lot of books on my Kindle and it is a constant source of surprise
that the proof reading hasn't been carried out as effectively as I would
have expected with a commercially produced book.
And some examples of 'original' Welsh pages reproduced with OCR, seen on the
net today, are gibberish.
The NLW have got it right with their digitisation of Welsh journals project:
perfect reproduction every time.
They also give you the option of switching to 'text mode' with the caution:
"This text was generated automatically from the scanned page and has not
been checked. Typical character accuracy is in excess of 99%, but this
leaves one error per 100 characters."
And, importantly, the search facility they provide has apparently worked
perfectly in the *digital* mode every time I've used it.
Of course you can only 'copy/paste' individual words in Text mode, which may
be its biggest downside in practice.