Description of problem:
gscan2pdf as released has various problems in OCR support:
* encoding of gocr output is wrong (double encoded)
* tesseract is not detected

Version-Release number of selected component (if applicable):
1.0.6

How reproducible:
always

Steps to Reproduce:
0. (install gocr and tesseract)
1. launch gscan2pdf, open an image of a scanned text or scan one

2a. go to ocr tools, select gocr
3a. accented letters in output are bad

2a. go to ocr tools
3a. tesseract doesn't appear on list

The gocr bug is caused by Perl mishandling UTF-8 on stdin/stdout.
Solution: don't have gocr write to stdout; have it write to a temporary file instead.

The tesseract bug is caused by a change in how tesseract reports its version.
Solution: fix the parsing of the tesseract output.
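The "double encoded" symptom can be reproduced outside gscan2pdf. This is a minimal sketch in Python (not the Perl that gscan2pdf is written in). It assumes the mangling is the usual case of UTF-8 bytes being misread as Latin-1 and then written out as UTF-8 again. Both function names are made up for illustration.

```python
def double_encode(text: str) -> str:
    """Simulate the bug: the UTF-8 bytes of `text` are misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_double_encode(text: str) -> str:
    """Reverse the mangling, assuming the Latin-1 misreading above."""
    return text.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    mangled = double_encode("é")
    print(mangled)                      # Ã©
    print(undo_double_encode(mangled))  # é
```

Plain ASCII passes through unchanged, which would explain why the problem only shows up with accented letters.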
Created attachment 4093
patch to fix gocr output and finding of tesseract version

This patch fixes the issues with gocr and the finding of tesseract version (tesseract still fails, but because of a tesseract bug).
CC: (none) => pablo
Created attachment 4096
improved patch, fixes also parsing of hocr (boxed ocr) output

Improved version of the patch; this one:
* fixes the output of gocr to be proper UTF-8 (by using a tmp file instead of Perl stdin/stdout, which is broken)
* fixes finding the tesseract version (now "tesseract -v" does it)
* skips the conversion to tif for tesseract if the file is already tif, png, jpeg or gif (the conversion is done by libleptonica, which tesseract is linked with)
* fixes parsing of the boxed hocr output of tesseract (it uses "ocrx_word" instead of "ocr_word")
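As an illustration of the version detection, here is a small Python sketch (not the Perl code from the patch). It assumes that "tesseract -v" prints a first line of the form "tesseract 3.02.02"; that format, the sample string and the function name are assumptions, not taken from the attachment.

```python
import re
from typing import Optional, Tuple

# Assumed shape of the version line, e.g. "tesseract 3.02.02".
_VERSION_RE = re.compile(r"tesseract\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def parse_tesseract_version(output: str) -> Optional[Tuple[int, ...]]:
    """Return the version as a tuple of ints, or None if no version is found."""
    match = _VERSION_RE.search(output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


if __name__ == "__main__":
    print(parse_tesseract_version("tesseract 3.02.02\n leptonica-1.69\n"))
```

A caller would probably want to capture both stdout and stderr when running the command, since it is not certain which stream a given release writes the version to.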
Attachment 4093 is obsolete: 0 => 1
Depends on: (none) => 10403
Keywords: (none) => PATCH
Keywords: (none) => Triaged
CC: (none) => fundawang, yann
Hi Pablo,

I have been attempting to re-diff your patch for use in the spec, however the part relating to Gscan2pdf.pm_bak does not seem to relate to the version of Gscan2pdf.pm in the package.
The other two sections appear to fit OK:-

#----------------------------------------
diff -ur gscan2pdf-1.0.6_orig/lib/Gscan2pdf/Page.pm gscan2pdf-1.0.6/lib/Gscan2pdf/Page.pm
--- gscan2pdf-1.0.6_orig/lib/Gscan2pdf/Page.pm 2012-07-20 21:16:42.000000000 +0100
+++ gscan2pdf-1.0.6/lib/Gscan2pdf/Page.pm 2013-06-19 23:03:52.954815346 +0100
@@ -126,7 +126,7 @@
 if ( $token->[1] eq 'span'
 and defined( $token->[2]{class} )
 and
- ( $token->[2]{class} eq 'ocr_line' or $token->[2]{class} eq 'ocr_word' )
+ ( $token->[2]{class} eq 'ocr_line' or $token->[2]{class} =~ m/^ocrx*_word$/ )
 and defined( $token->[2]{title} )
 and $token->[2]{title} =~ /bbox\ (\d+)\ (\d+)\ (\d+)\ (\d+)/x )
 {
diff -ur gscan2pdf-1.0.6_orig/lib/Gscan2pdf/Tesseract.pm gscan2pdf-1.0.6/lib/Gscan2pdf/Tesseract.pm
--- gscan2pdf-1.0.6_orig/lib/Gscan2pdf/Tesseract.pm 2012-05-02 20:42:33.000000000 +0100
+++ gscan2pdf-1.0.6/lib/Gscan2pdf/Tesseract.pm 2013-06-19 22:53:19.637731860 +0100
@@ -159,7 +159,7 @@
 my $txt = File::Temp->new( SUFFIX => $suffix );
 ( my $name, my $path, undef ) = fileparse( $txt, $suffix );
- if ( $file !~ /\.tif$/x ) {
+ if ( $file !~ /\.(tif|png|gif|jpeg)$/x ) {
 # Temporary filename for new file
 $tif = File::Temp->new( SUFFIX => '.tif' );
#-------------------------------------------------

Could you take a look please?
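To see what the two changed patterns accept, they can be tried on a few sample strings. This is a rough Python sketch; the patterns are copied from the diff (Python's `re` treats them the same way for this purpose), while the helper names and the sample class and file names are invented for the example.

```python
import re

# Same patterns as in the patch, written for Python's re module.
WORD_CLASS = re.compile(r"^ocrx*_word$")
NO_CONVERT = re.compile(r"\.(tif|png|gif|jpeg)$")


def is_word_class(css_class: str) -> bool:
    """True if the hOCR span class is treated as a word by the patched test."""
    return WORD_CLASS.search(css_class) is not None


def needs_tif_conversion(filename: str) -> bool:
    """True if the patched test would still convert the file to tif first."""
    return NO_CONVERT.search(filename) is None


if __name__ == "__main__":
    for name in ("ocr_word", "ocrx_word", "ocr_line"):
        print(name, is_word_class(name))
    for name in ("page.png", "page.jpeg", "page.jpg", "page.pnm"):
        print(name, needs_tif_conversion(name))
```

Note that the pattern as written is case-sensitive and lists "jpeg" but not "jpg", so a file named page.jpg or PAGE.PNG would still go through the tif conversion. That should be harmless, just an extra step.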
CC: (none) => zen25000
Created attachment 4166
patch to fix gocr output, finding of tesseract version and parsing of hocr (boxed ocr) output

I redid the patch against the sources of 1.0.6-2 and added to the spec file:

Patch1: gscan2pdf-1.0.6-gocr-encoding-and-tesseract-version.patch
...
%prep
%setup -q
%patch1

I did rpmrebuild and it went fine.

For the changelog you can put:
- fixes mangled output of gocr
- recognizes tesseract version 3
- fixes parsing of hocr (some OCRs use 'ocrx_word' instead of 'ocr_word')

I think the recognition of tesseract is fixed upstream in newer versions (the "output on error" that gscan2pdf used to parse to detect the version has changed; on the other hand, tesseract now has a -v switch to print the version); the gocr and ocrx_word bugs, however, probably are not.
Attachment 4096 is obsolete: 0 => 1
Update Advisory
###############

Package gscan2pdf has been submitted to 3/core/updates_testing

- fixes mangled output of gocr
- recognizes tesseract version 3
- fixes parsing of hocr (some OCRs use 'ocrx_word' instead of 'ocr_word')

I'm taking Pablo's word for that :)

rpm:-
gscan2pdf-1.0.6-2.1.mga3.noarch.rpm

src rpm:
gscan2pdf-1.0.6-2.1.mga3.src.rpm

Thanks Pablo :)

Cheers,
Barry
Assignee: bugsquad => qa-bugs
Tested on Mageia 3 i586, the update candidate does not fix the bug for me. I have installed: gscan2pdf-1.0.6-2.1.mga3.noarch tesseract-3.02.02-3.1.mga3.i586 (from core/updates_testing too)
CC: (none) => remi
patch wasn't applied. maybe the %apply_patches macro has a problem? I used %patch1 instead when building locally; and it worked
and you used a wrong patch; take the latest one from here
Created attachment 4169
patch to fix gocr output, finding of tesseract version and parsing of hocr (boxed ocr) output

OK, the problem with %apply_patches is that it _requires_ _all_ patches to be applied with -p1 (mine was made for -p0).

Here is gscan2pdf-1.0.6-mga-gocr-encoding-and-tesseract-version.patch again :)
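For anyone puzzled by the -p0/-p1 difference: the number says how many leading directory components patch strips from the file names in the diff headers. A rough Python model of that follows; the function name and example paths are only illustrative, and the real patch(1) handles more corner cases (absolute paths, repeated slashes).

```python
def strip_components(path: str, count: int) -> str:
    """Drop `count` leading directory components, roughly as `patch -pN` does."""
    parts = path.split("/")
    if count >= len(parts):
        raise ValueError(f"cannot strip {count} components from {path!r}")
    return "/".join(parts[count:])


if __name__ == "__main__":
    # A diff made from above the source tree works with -p1 ...
    print(strip_components("gscan2pdf-1.0.6/lib/Gscan2pdf/Page.pm", 1))
    # ... while a diff made inside the tree (meant for -p0) loses "lib/" under -p1.
    print(strip_components("lib/Gscan2pdf/Page.pm", 1))
```

So a patch whose headers already start at lib/ points at the wrong file names when it is applied with -p1, which fits the failure seen with %apply_patches.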
Attachment 4166 is obsolete: 0 => 1
Pablo, It's really odd - your original patch applies with -p1 after the _bak suffixes are removed from the patch file destinations, but it still failed with %apply_patches IIANM. Anyway I just pushed a new version, but I'm sure something has screwed up the patch again as the patch in svn now appears different to the one in my sources. it's as though the package commit is getting the diff wrong on the patch. I will apply your new patch with a slightly different name and scrap the old name to get out of this mess. We will get there ;)
Ah - on re-checking svn it seems that my last push is OK - I think the problem was with dolphin/kate - I often see it confusing and displaying files with the same name in different paths - it's VERY dangerous and annoying :/

From the BS log the patch applied OK:-

Patch #0 (gscan2pdf-1.0.6-mga-gocr-encoding-and-tesseract-version.patch):
+ /usr/bin/cat /home/iurt/rpmbuild/SOURCES/gscan2pdf-1.0.6-mga-gocr-encoding-and-tesseract-version.patch
+ /usr/bin/patch -p1 --fuzz=0
patching file lib/Gscan2pdf/Tesseract.pm
patching file lib/Gscan2pdf/Page.pm
patching file lib/Gscan2pdf.pm
+ exit 0

...and I double checked it against your new patch and it does the same, since I edited out the '_bak's, so I'll leave it.

@ Remi
Sorry about that - please test gscan2pdf-1.0.6-2.2.mga3 in core/updates_testing.

Barry
I have tried this update
gscan2pdf-1.0.6-2.2.mga3
on Mga3 32-bit, and the results are close to useless. But no worse than previously.

The +ve thing is that tesseract *is* shown (along with gocr) in the Tools/OCR list, and with a language choice.

OCR O/P from gocr is poor for English, worse (unusable) French 2-col.
OCR O/P from tesseract is much better *except* that every word is boxed in gscan2pdf. Using tesseract from command line seems to yield good results.
Maybe the 'boxing' under gscan2pdf would vanish if the file was saved as text. I must try that.

Where do we go next? I doubt the utility of this program for OCR. For its title role - why not? Except that do scan programs not themselves offer PDF O/P?
CC: (none) => lewyssmith
(In reply to Lewis Smith from comment #12)
> I have tried this update
> gscan2pdf-1.0.6-2.2.mga3
> on Mga3 32-bit, and the results are close to useless. But no worse than
> previously.
>
> The +ve thing is that tesseract *is* shown (along with gocr) in the
> Tools/OCR list, and with a language choice.

Confirmed in Cauldron x86_64

> OCR O/P from gocr is poor for English, worse (unusable) French 2-col.
> OCR O/P from tesseract is much better *except* that every word is boxed in
> gscan2pdf.

Confirmed

> Using tesseract from command line seems to yield good results.
> Maybe the 'boxing' under gscan2pdf would vanish if the file was saved as
> text. I must try that.

I tried that but on my quick test the 'text' turned out to be xml, which did display fine in xxe.

>
> Where do we go next? I doubt the utility of this program for OCR. For its
> title role - why not? Except that do scan programs not themselves offer PDF
> O/P?

I think its intention is to produce pdfs with embedded text to allow searching of the text originally in images, in which case there is a use case.

Where to go next? ...
Well, probably an upstream bug report, unless someone here can fix it - Pablo ?? ;)
Well, "close to useless" is an exaggeration. Free software OCRs are not as good as some others; however, tesseract does an acceptable job (provided it knows about your language and font; you can teach it, which is a strength, but the way to teach it is complicated).

As I said, I think the upstream version handles tesseract correctly. The gocr<->perl charset problem was still there, however (probably people using gocr only use plain ASCII and didn't notice). I reported it, with the fix, but have got no response yet.

The tesseract rendering with boxes over each word is not a bug but a feature. However, it is true that it would be nice to have a switch in the OCR dialog to choose boxed or unboxed output, as tesseract can provide both; that would empower the user. I reported that idea to the author of gscan2pdf too, but have got no response yet (it was in the same mail). It would be better if the author added it, as he will know how to do it better and faster.

Currently the OCR dialog has only one OCR-dependent configuration option: the language choice. The thing to add would be, for those OCRs allowing both possibilities, a checkbox to choose the output: boxed/XML or unboxed/plain text.

However, boxed/unboxed is a separate matter; if you want, open a different bug for it.
I have done more careful tests of gscan2pdf v command-line gocr & tesseract.
For gocr, the O/P is the same either way - but less good than with tesseract.
For tesseract, the gscan2pdf boxed O/P is disturbing. What is it meant to do? It seemed to me also that some of these boxed words showed only their initial letter, whereas the command-line O/P was intact. This may be due to my SiS video X problems.

FWIW examples from command line (screen O/P basically similar):
*******************************************
[1a] gocr O/P mixed font single-col English
-------------------------------------------
__ E ldernowe r Ch amp agne 9 elder_lo_er _eDds i_ J_ll bloom _ !/2 lirres (J gDllonJ cold zu_ler J lemon 650 g (J !/2 l6) lo_Js__Dr _ I_6lespoons 2uhire _ine_Dr Dissalve the sugar in a little warm water and allow to cool. Squeeze

[1b] tesseract O/P mixed font single-col English
------------------------------------------------
Elderï¬ower Champagne 4 elderflozver heads in full bloom 4V2 litres (I gallon) cold water 1 lemon 650 g (I V2 lb) loafsugar Z tablespoons white vinegar Dissolve the sugar in a little warm water and allow to cool. Squeeze

[2a] gocr O/P 2-col French
--------------------------
\ p ee p_chee5t non p&_ acc_ g_lpeDec_a_8Jt__oq_ qu_ __g_ece ecp 8 g t_8n_P__ente_ se p8rent d'une nuée de dia_8nts 8u t_on_on supé_ieu_ dtéchelle_ 8p_ès sépar&tion et

[2b] tesseract O/P 2-col French
-------------------------------
8 transparentes se parent d'une nuée de diamants scintillants et fugitifs. Tout est de taille réduite. C'est un monde lilliputien, il n'y a rien d'imposant, d'écrasant ni de menaçant. Dans une petite niche `a hauteur de nos têtes, un décor minéral translucide très pur fait penser `a une vitrine d'exposition de bijoux et joyaux précieux.
Doris n'est pas encore entrée dans la salle; je pose une lampe de poche allumée dans cet écrin, derrière les petites sculptures diapha~ nes en m'écriant comme stupéfait:
**********************************

Not seeing any way to export the OCR'd O/P as text, I still wonder what it serves: the scanned image shows the original text. Yes, I read the Help (Ctrl/H).

Re comment 14, I shall raise a bug for that; and other inconveniences I found? I agree entirely with the spirit of not criticising open-source s/w, and this OCR business is not what the program is really about.

I cannot comment on the "wrong encoding with gocr" unless given a pointer about how to discern that. Otherwise (re tesseract) I am happy to say this bug is MGA3-32 OK if others agree.
Pablo please help me understand what you mean by "3a. accented letters in output are bad". I have followed your steps and launched the OCR Engine window and for me the "GOCR" text in the pull down appears to be ok. I am testing in M3-32, Virtualbox. Thanks
CC: (none) => wilcal.int
Updating to 1.0.6-2.2.mga3 does in fact fix the "3a. tesseract doesn't appear on list" issue.

The updated gscan2pdf produces pure text and PDF files if you choose GOCR. If you choose Tesseract you get a boxed OCR XML text file(?) and a usable PDF file. Is this what you expect?

I somewhat agree with Lewis Smith's comment 12 that the text output of these is somewhat "close to useless", but for me these things have never been quite right.
@William: you need to use a text with accented letters, e.g. "é", to see it. Before the patch you saw (when gocr recognized it) "Ã©" instead of "é", for example.

The test update package does fix the issues reported in this bug report.

The boxed text is not the result of a bug; it is designed that way. Indeed it would be better to also allow plain text output, but that should be reported to the author (it is more a feature request than a bug).
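In the meantime, plain text can be pulled out of the boxed output by hand, since it is hOCR: HTML-like markup with one span per line and per word. A minimal Python sketch follows, assuming word spans carry the class "ocr_word" or "ocrx_word" and sit inside "ocr_line" spans. The class and function names, and the sample markup used to try it, are hypothetical and not real tesseract output.

```python
import re
from html.parser import HTMLParser

WORD_CLASS = re.compile(r"^ocrx*_word$")


class HocrText(HTMLParser):
    """Collect hOCR words, starting a new output line at each ocr_line span."""

    def __init__(self):
        super().__init__()
        self.lines = []   # one list of words per text line
        self.stack = []   # class attribute of every currently open <span>

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        css_class = dict(attrs).get("class") or ""
        self.stack.append(css_class)
        if css_class == "ocr_line":
            self.lines.append([])

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        word = data.strip()
        # Keep only text whose innermost enclosing span is a word span.
        if word and self.stack and WORD_CLASS.match(self.stack[-1]):
            if not self.lines:
                self.lines.append([])
            self.lines[-1].append(word)


def hocr_to_text(hocr: str) -> str:
    """Return the words of an hOCR string as plain text, one line per ocr_line."""
    parser = HocrText()
    parser.feed(hocr)
    parser.close()
    return "\n".join(" ".join(words) for words in parser.lines if words)


SAMPLE = """
<span class='ocr_line' title='bbox 10 10 200 30'>
 <span class='ocrx_word' title='bbox 10 10 90 30'>Elderflower</span>
 <span class='ocrx_word' title='bbox 100 10 200 30'>Champagne</span>
</span>
<span class='ocr_line' title='bbox 10 40 200 60'>
 <span class='ocr_word' title='bbox 10 40 60 60'>1</span>
 <span class='ocr_word' title='bbox 70 40 200 60'>lemon</span>
</span>
"""

if __name__ == "__main__":
    print(hocr_to_text(SAMPLE))
```

This throws away the bbox coordinates, which are the part gscan2pdf needs to place the text over the image, so it is only a workaround for getting readable text out.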
(In reply to Pablo Saratxaga from comment #18)
> @William: you need to use a text with accented letters, eg: "é" to see it.
> Before the patch, you saw (when gocr recognized it) "Ã©" instead of "é", for
> example.
> The test update package does fix the issues reported on this bug report.

Got it, thanks. I've built my own LibreOffice odt document to print then scan. It already contains some bold and italic characters in there. I'll add some accented letters in there and see if it can pick any of those out.

I think this is more of a performance issue then it's a successful upgrade. We could nit pick this thing forever. I've never seen a really good one. First time I've seen one that understands "é" characters. Back to this shortly.
Created attachment 4191
gscan2pdf_test1.zip test scan
Created attachment 4192
gscan2pdf_test2.zip test scan
Pablo, please take a look at my attachments gscan2pdf_test1.zip and gscan2pdf_test2.zip. Contained in the two ZIP files are six files:

gscan2pdf_test1.zip
-------------------
scan.jpg
scan_source.odt

gscan2pdf_test2.zip
-------------------
scan_gocr.pdf
scan_gocr.txt
scan_tesseract.pdf
scan_tesseract.txt

Would this be what you would expect from this application? I'm pretty happy with what I see. The PDFs seem pretty accurate. I agree that the accented characters are sketchy at best but mildly usable. We could beat this thing to death forever. It's an application performance issue, not an update issue. The update seems to work just fine. With your agreement I'd push this application update.

Scanner is an HP All-in-one 5510.
Scanning software is XSane.
Whiteboard: (none) => MGA3-32-OK
gscan2pdf-1.0.6-2.2.mga3.noarch installs just fine in MGA3-64-OK
Whiteboard: MGA3-32-OK => MGA3-32-OK MGA3-64-OK
Advisory uploaded.
Update validated

Advisory:
=================================
This updates gscan2pdf-1.0.6-2.mga3.noarch.rpm to gscan2pdf-1.0.6-2.2.mga3.noarch.rpm. Correcting a problem including tesseract.

Updated packages in core/updates_testing:
========================
gscan2pdf-1.0.6-2.2.mga3.noarch.rpm

from SRPMS:
gscan2pdf-1.0.6-2.2.mga3.src.rpm

Could sysadmin please push from core/updates_testing to core/updates.

Tested on:
Intel Core i7-2600K Sandy Bridge 3.4GHz LGA 1155
GIGABYTE GA-Z68X-UD3-B3 LGA 1155 Intel Z68 SATA 6Gb/s MoBo
GIGABYTE GV-N440D3-1GI GeForce GT 440 (Fermi)
CORSAIR Vengeance 16GB (4 x 4GB)
Virtualbox-4.2.12-2.mga3.x86-64

Thank you!
Keywords: PATCH, Triaged => validated_update
CC: (none) => sysadmin-bugs
Update pushed: http://advisories.mageia.org/MGAA-2013-0056.html
Status: NEW => RESOLVED
CC: (none) => tmb
Resolution: (none) => FIXED