Character Recognition Program that's Word-Compatible
Thread poster: BrianHayden
BrianHayden
United States
Russian to English
Jan 2, 2014

Is there any way I could scan the pages of a dictionary and then convert them into a (massive) Word file? If so, what would be the cheapest and simplest way?

 
Vadim Kadyrov
Ukraine
Local time: 01:33
Member (2011)
English to Russian
Yes, you can Jan 2, 2014

The best application (I believe) is ABBYY FineReader (you can use version 8, which should be much cheaper than the newest one). You just scan the pages into JPEG files and then use the application to OCR the images.

Still, this is an extremely time-consuming task. Even the best OCR applications won't be able to perfectly reproduce the layout of dictionary pages.


 
BrianHayden
United States
Russian to English
Topic starter
More detail... Jan 2, 2014

I should probably explain my plan better, feasible or unfeasible though it may be. I like Microsoft Word, and I think it's fairly straightforward to use. I've been keeping a dictionary of idioms as a Word file, adding new entries as I encounter new idioms. Keeping a dictionary of idioms and phrases in a Word file is especially convenient, since you can do a Ctrl + F search for a word within the phrase, which is easier than looking up each word of an idiom separately in a standard dictionary, which still may not list the idiom. I've recently found an especially good dictionary with a lot of idioms, and I wanted to scan it in and add it to the Word file somehow. Hand-typing the entries from the dictionary would be murderous. Anything less laborious than hand-typing is okay in my book.
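The Ctrl + F lookup described above amounts to a case-insensitive substring search over the idiom entries. A minimal Python sketch of the same idea follows; the function name `find_idioms` and the sample entries are made up for illustration and are not part of any existing tool.

```python
def find_idioms(entries, query):
    """Return the entries that contain `query`, ignoring case (like Ctrl + F)."""
    needle = query.casefold()
    return [entry for entry in entries if needle in entry.casefold()]


# Hypothetical sample entries, one idiom per line.
SAMPLE_ENTRIES = [
    "бить баклуши: to twiddle one's thumbs",
    "вешать лапшу на уши: to pull the wool over someone's eyes",
    "Держать ухо востро: to keep one's ears open",
]

matches = find_idioms(SAMPLE_ENTRIES, "уши")
```

Like Ctrl + F, this matches any substring, so searching for "уши" also finds "баклуши".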

And I forgot to mention that I need a program that can read Cyrillic and, since this is a dictionary, Cyrillic with accent marks. Does ABBYY FineReader do that? And is it user-friendly?

[Edited at 2014-01-02 08:38 GMT]



 
Rolf Keller
Germany
Local time: 00:33
English to German
OCR needs know-how Jan 2, 2014

Vadim Kadyrov wrote:

You just scan pages into jpeg files and then use this application to OCR the images.


This is possible, but it must be done with care. JPEG files can be (and, with default settings, are) compressed lossily, so the OCR results will not be optimal. BTW, any OCR application should be able to take scanner input directly, so there is no need to scan to files beforehand.
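If the pages have already been scanned to files, one way to spot the JPEGs among them is to look at the first bytes of each file. The Python sketch below is only an illustration: the function names are invented here, and the check tells you the file type, not how strongly a given JPEG was compressed.

```python
from pathlib import Path

JPEG_SIGNATURE = b"\xff\xd8\xff"

# Formats that are lossless (PNG, BMP) or usually saved losslessly by
# scanner software (TIFF; a TIFF container can also hold JPEG data).
OTHER_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"II*\x00": "TIFF",
    b"MM\x00*": "TIFF",
    b"BM": "BMP",
}


def sniff_format(header: bytes) -> str:
    """Guess the image format from the first bytes of a file."""
    if header.startswith(JPEG_SIGNATURE):
        return "JPEG"
    for signature, name in OTHER_SIGNATURES.items():
        if header.startswith(signature):
            return name
    return "unknown"


def lossy_scans(folder):
    """Return the names of files in `folder` that look like JPEGs."""
    flagged = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            with path.open("rb") as handle:
                if sniff_format(handle.read(8)) == "JPEG":
                    flagged.append(path.name)
    return flagged
```

Calling `lossy_scans` on a scan folder (for example a hypothetical `scans/` directory) would list the pages worth rescanning to a lossless format if the OCR results look poor.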

Vadim Kadyrov wrote:

Even the best OCR applications won't be able to perfectly reproduce the layout of dictionary pages.


??? For the stated purpose, you probably don't want to reproduce the original layout but rather produce a clean table (one table row per dictionary entry).

In the worst case you have to mark up the columns manually (in the OCR software) and ignore everything else. Such markup takes about 30 seconds per page, so 240 pages take 2 hours. In many cases the OCR software will do that automatically, though.
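The estimate above is simple arithmetic (240 pages × 30 seconds = 7,200 seconds = 2 hours). A small Python helper, with a function name invented for this example, makes it easy to redo the calculation for a different page count or pace:

```python
def markup_hours(pages, seconds_per_page=30):
    """Estimate the manual column markup time in hours."""
    return pages * seconds_per_page / 3600


estimate = markup_hours(240)  # 240 pages at 30 seconds each
```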

Depending on the dictionary you might have to write a Word macro that tidies up the resulting Word table. This might take one hour or one day.
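The tidy-up does not have to be a Word macro. As an alternative, here is a rough Python sketch that works on the plain-text OCR output instead. It assumes a hypothetical layout in which every entry starts on a line containing the separator " - " and wrapped lines continue the previous entry; a real dictionary will need its own rules. Both function names are made up for this example.

```python
import re


def tidy_entries(ocr_lines, separator=" - "):
    """Merge wrapped OCR lines into (headword, definition) rows.

    Assumption: a line containing `separator` starts a new entry;
    any other non-empty line continues the previous entry.
    """
    rows = []
    for raw in ocr_lines:
        line = re.sub(r"\s+", " ", raw).strip()
        if not line:
            continue  # skip blank lines between entries
        if separator in line:
            headword, definition = line.split(separator, 1)
            rows.append([headword.strip(), definition.strip()])
        elif rows:
            rows[-1][1] += " " + line  # continuation of the previous entry
    return [tuple(row) for row in rows]


def to_tab_separated(rows):
    """Join rows into tab-separated text, one entry per line."""
    return "\n".join("\t".join(row) for row in rows)
```

Tab-separated text of this kind can be pasted into Word and turned into a table with Convert Text to Table. A continuation line that happens to contain the separator would be mistaken for a new entry, so the result still needs a manual check.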


 
Vadim Kadyrov
Ukraine
Local time: 01:33
Member (2011)
English to Russian
The thing I suggested Jan 2, 2014

What I suggested is a general scenario, with the details to be discussed (or suggested) later on. When I saw the topic starter's message, I assumed he wanted to reproduce the hard copy of the dictionary in electronic form (say, some old and really precious edition).

If he wants only some entries from this dictionary to be digitized, the task becomes much easier, of course.

A few words about JPEG images: if the resolution is high, quality-related issues of this file type no longer matter, I believe.

But these are details. I think the topic starter has already seen the "path".


 
esperantisto
Local time: 02:33
Member (2006)
English to Russian
Site localizer
No Jan 2, 2014

BrianHayden wrote:

Keeping a dictionary of idioms and phrases in a Word file is especially convenient, since you can do a Ctrl + F search


If you use one dictionary, this may be fine. However, a translator normally needs more than one dictionary. In such a case, using a dictionary shell is a better solution. My favorite is GoldenDict.

BrianHayden wrote:

program that can read Cyrillic with accent marks. Does ABBYY FineReader do that?


No, FineReader can't produce good output for accented Cyrillic letters. Versions 8 and 9 simply produce unaccented letters, while the later versions 10 and 11 produce recognition errors.

[Edited at 2014-01-02 11:38 GMT]


 
BrianHayden
United States
Russian to English
Topic starter
Dictionary Shell? Jan 2, 2014

What is a dictionary shell?


 
BrianHayden
United States
Russian to English
Topic starter
Accent marks... Jan 2, 2014

esperantisto wrote:

No, FineReader can't produce good output for accented Cyrillic letters. Versions 8 and 9 simply produce unaccented letters, while the later versions 10 and 11 produce recognition errors.


Is there any way around that? It seems that a product that complicated would have some way of dealing with it, especially since in Russian accent marks are occasionally used to disambiguate words in everyday, non-dictionary texts (think of за́мок, замо́к).


 
esperantisto
Local time: 02:33
Member (2006)
English to Russian
Site localizer
Answers Jan 3, 2014

BrianHayden wrote:

What is a dictionary shell?


Well, a dictionary program. A program used to access dictionaries.

BrianHayden wrote:

Is there any way around that? It seems that a product that complicated would have some sort of way of dealing with that, especially since in Russian accent marks are occasionally used to disambiguate words in everyday, non-dictionary texts (think of за́мок, замо́к).


No idea. FineReader can be trained to recognize specific languages with specific characters, but I don’t know if it’s applicable to Russian accents as there are no pre-composed accented Cyrillic letters in Unicode.
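The Unicode point can be checked with Python's standard `unicodedata` module. The sketch below is an illustration only: the helper names are invented, and it covers just the combining acute accent (U+0301) that is used for Russian stress marks. It shows that a Latin "a" plus the accent composes into a single code point under NFC, whereas Cyrillic "а" plus the accent stays as two.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"


def add_stress(word, vowel_index):
    """Insert a combining acute accent after the character at `vowel_index`."""
    return word[: vowel_index + 1] + COMBINING_ACUTE + word[vowel_index + 1 :]


def strip_stress(text):
    """Remove stress marks so that accented and plain spellings compare equal."""
    return text.replace(COMBINING_ACUTE, "")


castle = add_stress("замок", 1)  # stress on the first syllable
lock = add_stress("замок", 3)    # stress on the second syllable

# NFC composes Latin a + acute into one code point; there is no
# precomposed code point for Cyrillic а (U+0430) + acute.
latin = unicodedata.normalize("NFC", "a" + COMBINING_ACUTE)
cyrillic = unicodedata.normalize("NFC", "\u0430" + COMBINING_ACUTE)
```

One practical consequence is that a stressed word is one character longer than its plain spelling, so a search for the plain form will miss stressed entries unless the marks are stripped first, as `strip_stress` does.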


 
Emma Goldsmith
Spain
Local time: 00:33
Member (2004)
Spanish to English
Russian is in the drop-down list of languages in Abbyy Jan 3, 2014

esperantisto wrote:

No idea. FineReader can be trained to recognize specific languages with specific characters, but I don’t know if it’s applicable to Russian accents as there are no pre-composed accented Cyrillic letters in Unicode.


I've got no idea either, but Russian is definitely included in the list of languages that Abbyy will recognise. (Version 11.0)

You can also add a host of symbols/letters as a "user language". For example, I've added µ, α and β because Abbyy doesn't recognise them out of the box.


 

