10402 – gscan2pdf: wrong encoding with gocr and tesseract not seen

Status:	RESOLVED FIXED

Alias:	None

Product:	Mageia
Classification:	Unclassified
Component:	RPM Packages
Version:	3
Hardware:	All Linux

Priority:	Normal
Severity:	normal
Target Milestone:	---
Assignee:	QA Team
QA Contact:

URL:
Whiteboard:	MGA3-32-OK MGA3-64-OK
Keywords:	validated_update

Depends on:	10403
Blocks:

Reported:	2013-06-02 22:30 CEST by Pablo Saratxaga
Modified:	2013-07-09 21:46 CEST
CC List:	9 users

See Also:
Source RPM:	gscan2pdf-1.0.6-2.mga3.src.rpm
CVE:
Status comment:

Attachments
patch to fix gocr output and finding of tesseract version (1.36 KB, patch) 2013-06-02 22:38 CEST, Pablo Saratxaga
improved patch, fixes also parsing of hocr (boxed ocr) output (2.25 KB, patch) 2013-06-03 15:45 CEST, Pablo Saratxaga
patch to fix gocr output, finding of tesseract version and parsing of hocr (boxed ocr) output (2.04 KB, patch) 2013-06-25 00:52 CEST, Pablo Saratxaga
patch to fix gocr output, finding of tesseract version and parsing of hocr (boxed ocr) output (2.12 KB, patch) 2013-06-25 22:54 CEST, Pablo Saratxaga
gscan2pdf_test1.zip test scan (413.70 KB, application/zip) 2013-07-05 17:22 CEST, William Kenney
gscan2pdf_test2.zip test scan (737.32 KB, application/zip) 2013-07-05 17:23 CEST, William Kenney

Description Pablo Saratxaga 2013-06-02 22:30:17 CEST

Description of problem:
gscan2pdf as released has various problems in OCR support:
* encoding of gocr output is wrong (double encoded)
* tesseract is not detected

Version-Release number of selected component (if applicable):
1.0.6

How reproducible:
always

Steps to Reproduce:
0. (install gocr and tesseract)
1. launch gscan2pdf, open an image of a scanned text or scan one

2a. go to ocr tools, select gocr
3a. accented letters in output are bad

2a. go to ocr tools
3a. tesseract doesn't appear on list


bug with gocr is due to perl doing a bad job of handling utf8 on stdin/stdout
solution = don't have gocr write to stdout, but to a temporary file instead

bug with tesseract is because the version finding changed.
solution = fix the parsing of tesseract output
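
The "double encoded" symptom can be reproduced outside gscan2pdf. The following is a minimal Python sketch, not the Perl patch itself: it assumes the OCR tool emits UTF-8 bytes and that those bytes get misread one byte per character (as Latin-1) before being encoded again. The function names are hypothetical.

```python
import os
import tempfile


def double_encode(text: str) -> str:
    """Simulate the bug: UTF-8 bytes misread as Latin-1 characters."""
    raw = text.encode("utf-8")      # bytes as the OCR tool wrote them
    return raw.decode("latin-1")    # each byte wrongly treated as one character


def read_ocr_file(path: str) -> str:
    """The workaround in spirit: read the raw bytes from a file, decode once."""
    with open(path, "rb") as handle:
        return handle.read().decode("utf-8")


if __name__ == "__main__":
    # Stand-in for the OCR tool writing its result to a temporary file.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as handle:
        handle.write("nuée".encode("utf-8"))
    try:
        print(repr(double_encode("é")))   # two characters instead of one
        print(read_ocr_file(path))        # accented letter survives intact
    finally:
        os.remove(path)
```

Reading the bytes from a file and decoding them exactly once sidesteps whatever layer was applied to the pipe, which is the idea behind the temporary-file solution.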

Comment 1 Pablo Saratxaga 2013-06-02 22:38:04 CEST

Created attachment 4093
patch to fix gocr output and finding of tesseract version

this patch fixes the issues with gocr and the finding of tesseract version

(tesseract still fails, but because of a tesseract bug)

Pablo Saratxaga 2013-06-02 22:38:11 CEST

CC: (none) => pablo

Comment 2 Pablo Saratxaga 2013-06-03 15:45:19 CEST

Created attachment 4096
improved patch, fixes also parsing of hocr (boxed ocr) output

improved version of the patch; this one does:
* fix output of gocr to proper utf-8 (by using a tmp file instead of perl stdin/stdout, which is broken)
* fix finding of tesseract version (now "tesseract -v" does it)
* no need to convert to tif for tesseract if file is in tif,png,jpeg,gif (conversion done by libleptonica that tesseract is linked with)
* fixed parsing of boxed hocr output of tesseract (it uses "ocrx_word" instead of "ocr_word")
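
As an illustration of the version detection, here is a small Python sketch (the real fix is in Perl and is not reproduced here). It only parses text; the sample banner "tesseract 3.02.02" followed by a leptonica line is an assumption about what "tesseract -v" prints, and parse_tesseract_version is a hypothetical name.

```python
import re
from typing import Optional, Tuple

# Matches a line such as "tesseract 3.02.02" (assumed banner format).
_VERSION_RE = re.compile(r"^tesseract\s+(\d+)\.(\d+)(?:\.(\d+))?", re.MULTILINE)


def parse_tesseract_version(banner: str) -> Optional[Tuple[int, ...]]:
    """Return the version as a tuple of ints, or None if no banner is found."""
    match = _VERSION_RE.search(banner)
    if match is None:
        return None
    # Drop the optional third component when it is absent.
    return tuple(int(part) for part in match.groups() if part is not None)


if __name__ == "__main__":
    sample = "tesseract 3.02.02\n leptonica-1.69\n"   # assumed sample output
    print(parse_tesseract_version(sample))
    print(parse_tesseract_version("some unrelated error text"))
```

When actually running the command, it may be safest to capture both stdout and stderr and feed the combined text to the parser, since which stream carries the banner can differ between builds.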

Attachment 4093 is obsolete: 0 => 1

Pablo Saratxaga 2013-06-03 15:47:38 CEST

Depends on: (none) => 10403

Pablo Saratxaga 2013-06-06 15:21:40 CEST

Keywords: (none) => PATCH

Manuel Hiebel 2013-06-08 16:56:33 CEST

Keywords: (none) => Triaged
CC: (none) => fundawang, yann

Comment 3 Barry Jackson 2013-06-20 00:18:18 CEST

Hi Pablo,
I have been attempting to re-diff your patch for use in the spec, however the part relating to Gscan2pdf.pm_bak does not seem to relate to the version of Gscan2pdf.pm in the package.

The other two sections appear to fit OK:-

#----------------------------------------
diff -ur gscan2pdf-1.0.6_orig/lib/Gscan2pdf/Page.pm gscan2pdf-1.0.6/lib/Gscan2pdf/Page.pm
--- gscan2pdf-1.0.6_orig/lib/Gscan2pdf/Page.pm	2012-07-20 21:16:42.000000000 +0100
+++ gscan2pdf-1.0.6/lib/Gscan2pdf/Page.pm	2013-06-19 23:03:52.954815346 +0100
@@ -126,7 +126,7 @@
     if ( $token->[1] eq 'span'
      and defined( $token->[2]{class} )
      and
-     ( $token->[2]{class} eq 'ocr_line' or $token->[2]{class} eq 'ocr_word' )
+     ( $token->[2]{class} eq 'ocr_line' or $token->[2]{class} =~ m/^ocrx*_word$/  )
      and defined( $token->[2]{title} )
      and $token->[2]{title} =~ /bbox\ (\d+)\ (\d+)\ (\d+)\ (\d+)/x )
     {
diff -ur gscan2pdf-1.0.6_orig/lib/Gscan2pdf/Tesseract.pm gscan2pdf-1.0.6/lib/Gscan2pdf/Tesseract.pm
--- gscan2pdf-1.0.6_orig/lib/Gscan2pdf/Tesseract.pm	2012-05-02 20:42:33.000000000 +0100
+++ gscan2pdf-1.0.6/lib/Gscan2pdf/Tesseract.pm	2013-06-19 22:53:19.637731860 +0100
@@ -159,7 +159,7 @@
  my $txt = File::Temp->new( SUFFIX => $suffix );
  ( my $name, my $path, undef ) = fileparse( $txt, $suffix );
 
- if ( $file !~ /\.tif$/x ) {
+ if ( $file !~ /\.(tif|png|gif|jpeg)$/x ) {
 
   # Temporary filename for new file
   $tif = File::Temp->new( SUFFIX => '.tif' );

#-------------------------------------------------
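
To see what the two hunks above change, here is a Python sketch that mirrors their regular expressions (the package code itself is Perl; the class and function names below are hypothetical). It collects bounding boxes from hOCR spans whose class is ocr_line, ocr_word or ocrx_word, and applies the same extension test as the Tesseract.pm hunk. Note that, as written in the diff, the extension pattern lists jpeg but not jpg, so a file named scan.jpg would presumably still be converted to tif.

```python
import re
from html.parser import HTMLParser

# Same alternatives as the patched Page.pm condition.
BOX_CLASS = re.compile(r"ocr_line|ocrx*_word")
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxes(HTMLParser):
    """Collect (class, (x1, y1, x2, y2)) for line and word spans."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attributes = dict(attrs)
        css_class = attributes.get("class") or ""
        match = BBOX.search(attributes.get("title") or "")
        if BOX_CLASS.fullmatch(css_class) and match:
            self.boxes.append((css_class, tuple(int(v) for v in match.groups())))


def needs_tif_conversion(filename: str) -> bool:
    """Mirror of the patched test: convert unless the suffix is listed."""
    return re.search(r"\.(tif|png|gif|jpeg)$", filename) is None


if __name__ == "__main__":
    parser = HocrBoxes()
    parser.feed('<span class="ocrx_word" title="bbox 10 20 30 40">nuée</span>')
    print(parser.boxes)
    print(needs_tif_conversion("scan.png"), needs_tif_conversion("scan.jpg"))
```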

Could you take a look please ?

CC: (none) => zen25000

Comment 4 Pablo Saratxaga 2013-06-25 00:52:39 CEST

Created attachment 4166
patch to fix gocr output, finding of tesseract version and parsing of hocr (boxed ocr) output

I redid the patch, from the sources of 1.0.6-2;
added to the spec file:

Patch1:         gscan2pdf-1.0.6-gocr-encoding-and-tesseract-version.patch
...

%prep
%setup -q
%patch1

did rpmrebuild and it went fine.
for the changelog you can put:

- fixes mangled output of gocr
- recognizes tesseract version 3
- fixes parsing of hocr (some OCRs use 'ocrx_word' instead of 'ocr_word')

I think the recognition of tesseract is fixed upstream in newer versions (the "output on error" that gscan2pdf
used to parse to detect the version has changed; on the other hand, tesseract now has a -v switch to print the version). The gocr and ocrx_word bugs, however, probably are not.

Attachment 4096 is obsolete: 0 => 1

Comment 5 Barry Jackson 2013-06-25 01:28:21 CEST

Update Advisory
###############

Package gscan2pdf has been submitted to 3/core/updates_testing

- fixes mangled output of gocr
- recognizes tesseract version 3
- fixes parsing of hocr (some OCRs use 'ocrx_word' instead of 'ocr_word')

I'm taking Pablo's word for that :)

rpm:-
gscan2pdf-1.0.6-2.1.mga3.noarch.rpm

src rpm:
gscan2pdf-1.0.6-2.1.mga3.src.rpm

Thanks Pablo :)

Cheers,
Barry

Assignee: bugsquad => qa-bugs

Comment 6 Rémi Verschelde 2013-06-25 17:50:49 CEST

Tested on Mageia 3 i586, the update candidate does not fix the bug for me.
I have installed:
  gscan2pdf-1.0.6-2.1.mga3.noarch
  tesseract-3.02.02-3.1.mga3.i586 (from core/updates_testing too)

CC: (none) => remi

Comment 7 Pablo Saratxaga 2013-06-25 21:46:33 CEST

patch wasn't applied.

maybe the %apply_patches macro has a problem?
I used %patch1 instead when building locally; and it worked

Comment 8 Pablo Saratxaga 2013-06-25 21:53:44 CEST

and you used a wrong patch; take the latest one from here

Comment 9 Pablo Saratxaga 2013-06-25 22:54:33 CEST

Created attachment 4169
patch to fix gocr output, finding of tesseract version and parsing of hocr (boxed ocr) output

ok, the problem with %apply_patches is that it _requires_ _all_ patches to be applied with -p1 (mine was with -p0)

Here it is gscan2pdf-1.0.6-mga-gocr-encoding-and-tesseract-version.patch again :)
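
For readers unfamiliar with strip levels: patch -pN removes N leading path components from the file names in the patch headers before looking for the file. A rough Python sketch of that rule follows (strip_components is a hypothetical helper, and it ignores corner cases such as repeated slashes):

```python
def strip_components(path: str, level: int) -> str:
    """Drop `level` leading path components, roughly as `patch -pN` does."""
    parts = path.split("/")
    if level < 0 or level >= len(parts):
        raise ValueError(f"cannot strip {level} components from {path!r}")
    return "/".join(parts[level:])


if __name__ == "__main__":
    header = "gscan2pdf-1.0.6/lib/Gscan2pdf/Page.pm"
    print(strip_components(header, 0))   # path used as-is
    print(strip_components(header, 1))   # top-level directory removed
```

With a header path like the one above, -p1 yields lib/Gscan2pdf/Page.pm relative to the unpacked source directory, whereas a patch whose headers already start at lib/ would need -p0.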

Attachment 4166 is obsolete: 0 => 1

Comment 10 Barry Jackson 2013-06-25 23:19:10 CEST

Pablo,
It's really odd - your original patch applies with -p1 after the _bak suffixes are removed from the patch file destinations, but it still failed with %apply_patches IIANM.
Anyway I just pushed a new version, but I'm sure something has screwed up the patch again, as the patch in svn now appears different to the one in my sources. It's as though the package commit is getting the diff wrong on the patch.
I will apply your new patch with a slightly different name and scrap the old name to get out of this mess.
We will get there ;)

Comment 11 Barry Jackson 2013-06-25 23:33:10 CEST

Ah - on re-checking svn it seems that my last push is OK - I think the problem was with dolphin/kate - I often see it confusing and displaying files with the same name in different paths - it's VERY dangerous and annoying :/

From the BS log the patch applied OK:-

Patch #0 (gscan2pdf-1.0.6-mga-gocr-encoding-and-tesseract-version.patch):
+ /usr/bin/cat /home/iurt/rpmbuild/SOURCES/gscan2pdf-1.0.6-mga-gocr-encoding-and-tesseract-version.patch
+ /usr/bin/patch -p1 --fuzz=0
patching file lib/Gscan2pdf/Tesseract.pm
patching file lib/Gscan2pdf/Page.pm
patching file lib/Gscan2pdf.pm
+ exit 0

...and I double checked it against your new patch and it does the same, since I edited out the '_bak's, so I'll leave it.

@ Remi

Sorry about that - please test gscan2pdf-1.0.6-2.2.mga3 in core/updates_testing.

Barry

Comment 12 Lewis Smith 2013-06-28 18:54:41 CEST

I have tried this update
 gscan2pdf-1.0.6-2.2.mga3
on Mga3 32-bit, and the results are close to useless. But no worse than previously.

The +ve thing is that tesseract *is* shown (along with gocr) in the Tools/OCR list, and with a language choice.
OCR O/P from gocr is poor for English, worse (unusable) for French 2-col.
OCR O/P from tesseract is much better *except* that every word is boxed in gscan2pdf. Using tesseract from command line seems to yield good results. Maybe the 'boxing' under gscan2pdf would vanish if the file was saved as text. I must try that.

Where do we go next? I doubt the utility of this program for OCR. For its title role - why not? Except that do scan programs not themselves offer PDF O/P?

CC: (none) => lewyssmith

Comment 13 Barry Jackson 2013-06-29 23:49:49 CEST

(In reply to Lewis Smith from comment #12)
> I have tried this update
>  gscan2pdf-1.0.6-2.2.mga3
> on Mga3 32-bit, and the results are close to useless. But no worse than
> previously.
> 
> The +ve thing is that tesseract *is* shown (along with gocr) in the
> Tools/OCR list, and with a language choice.
Confirmed in Cauldron x86_64
> OCR O/P from gocr is poor for English, worse (unuseable) French 2-col.
> OCR O/P from tesseract is much better *except* that every word is boxed in
> gscan2pdf. 
Confirmed
Using tesseract from command line seems to yield good results.
> Maybe the 'boxing' under gscan2pdf would vanish if the file was saved as
> text. I must try that.
I tried that but on my quick test the 'text' turned out to be xml, which did display fine in xxe.
> 
> Where do we go next? I doubt the utility of this program for OCR. For its
> title role - why not? Except that do scan programs not themselves offer PDF
> O/P?
I think its intention is to produce PDFs with embedded text, to allow searching of the text originally in images, in which case there is a use case.

Where to go next? ... Well, probably upstream bug report, unless someone here can fix it - Pablo ?? ;)

Comment 14 Pablo Saratxaga 2013-06-30 01:24:37 CEST

well, "close to useless" is an exaggeration.

free software OCRs are not as good as some others; however tesseract does an acceptable job (provided it knows about your language and font; you can teach it, which is a strength, but the way to teach it is complicated).

As I said, I think the upstream version correctly handles tesseract.
The gocr<->perl charset problem was still there, however (probably people using gocr only use plain ASCII and didn't notice).
I reported it, with the fix, but got no response yet.

The tesseract rendering with boxes over each word is not a bug but a feature.
However, it is true that it would be nice to have a switch in the OCR dialog to choose boxed or unboxed output, as tesseract can provide both; that would empower the user.
I reported that idea to the author of gscan2pdf too, but got no response yet (it was in the same mail).

It would be better if the author adds it, as he will know how to do it better and faster.
Currently the OCR dialog has only one ocr-dependent configuration option: language choice.
The thing to add would be, for those OCRs allowing both possibilities, a checkbox to choose the output: boxed/xml or unboxed/plain text.

However, boxed/unboxed is a separate matter; if you want, open a different bug for it.

Comment 15 Lewis Smith 2013-06-30 09:33:35 CEST

I have done more careful tests of gscan2pdf v command-line gocr & tesseract.
For gocr, the O/P is the same either way - but less good than with tesseract.
For tesseract, the gscan2pdf boxed O/P is disturbing. What is it meant to do? It seemed to me also that some of these boxed words showed only their initial letter, whereas the command-line O/P was intact. This may be due to my SiS video X problems.

FWIW examples from command line (screen O/P basically similar):
*******************************************
[1a] gocr O/P mixed font single-col English
-------------------------------------------
__
E ldernowe r Ch amp agne
9 elder_lo_er _eDds i_ J_ll bloom
_ !/2 lirres (J gDllonJ cold zu_ler
J lemon
650 g (J !/2 l6) lo_Js__Dr
_ I_6lespoons 2uhire _ine_Dr

Dissalve the sugar in a little warm water and allow to cool. Squeeze

[1b] tesseract O/P mixed font single-col English
------------------------------------------------
Elderﬂower Champagne

4 elderflozver heads in full bloom
4V2 litres (I gallon) cold water

1 lemon

650 g (I V2 lb) loafsugar

Z tablespoons white vinegar

Dissolve the sugar in a little warm water and allow to cool. Squeeze

[2a] gocr O/P 2-col French
--------------------------
\       p     ee   p_chee5t  non  p&_  acc_           g_lpeDec_a_8Jt__oq_  qu_  __g_ece  ecp



8                                                                                    g

t_8n_P__ente_  se  p8rent  d'une  nuée  de  dia_8nts          8u t_on_on supé_ieu_ dtéchelle_ 8p_ès sépar&tion et

[2b] tesseract O/P 2-col French
-------------------------------
8

transparentes se parent d'une nuée de diamants
scintillants et fugitifs.

Tout est de taille réduite. C'est un monde
lilliputien, il n'y a rien d'imposant, d'écrasant
ni de menaçant. Dans une petite niche `a hauteur de
nos têtes, un décor minéral translucide très pur
fait penser `a une vitrine d'exposition de bijoux et
joyaux précieux. Doris n'est pas encore entrée dans
la salle; je pose une lampe de poche allumée dans
cet écrin, derrière les petites sculptures diapha~
nes en m'écriant comme stupéfait:
**********************************
Not seeing any way to export the OCR'd O/P as text, I still wonder what it serves: the scanned image shows the original text. Yes, I read the Help (Ctrl/H).

Re comment 14, I shall raise a bug for that; and other inconveniences I found? I agree entirely with the spirit of not criticising open-source s/w, and this OCR business is not what the program is really about.

I cannot comment on the "wrong encodging wih gocr" unless given a pointer about how to discern that. Otherwise (re tesseract) I am happy to say this bug is MGA3-32 OK if others agree.

Comment 16 William Kenney 2013-07-04 19:25:13 CEST

Pablo please help me understand what you mean by
"3a. accented letters in output are bad". I have
followed your steps and launched the OCR Engine
window and for me the "GOCR" text in the pull
down appears to be ok. I am testing in M3-32,
Virtualbox.

Thanks

CC: (none) => wilcal.int

Comment 17 William Kenney 2013-07-04 19:46:51 CEST

Updating to 1.0.6-2.2.mga3 does in fact fix the
"3a. tesseract doesn't appear on list" issue

The updated gscan2pdf produces pure text and PDF
files if you choose GOCR. If you choose Tesseract
you get a boxed OCR XML text file(?) and a usable
PDF file. Is this what you expect? I somewhat
agree with Lewis Smith's comment 12 that the text
output of these is somewhat "close to useless", but
for me these things have never been quite right.

Comment 18 Pablo Saratxaga 2013-07-04 22:32:06 CEST

@William: you need to use a text with accented letters, eg: "é" to see it.
Before the patch, you saw (when gocr recognized it) "ÃÂ©" instead of "é", for example.
The test update package does fix the issues reported on this bug report.

The boxed text is not a bug; it is designed that way.
Indeed it would be better to also allow plain text output, but that should be reported to the author (more a feature request than a bug).

Comment 19 William Kenney 2013-07-05 00:49:03 CEST

(In reply to Pablo Saratxaga from comment #18)

> @William: you need to use a text with accented letters, eg: "é" to see it.
> Before the patch, you saw (when gocr recognized it) "ÃÂ©" instead of "é", for
> example.
> The test update package does fix the issues reported on this bug report.

Got it, thanks. I've built my own LibreOffice odt document to print then scan.
It already contains some bold and italic characters in there. I'll add some
accented letters in there and see if it can pick any of those out.
I think this is more of a performance issue; the upgrade itself looks
successful. We could nit pick this thing forever. I've never seen a really
good one. First time I've seen one that understands "é" characters.
Back to this shortly.

Comment 20 William Kenney 2013-07-05 17:22:52 CEST

Created attachment 4191
gscan2pdf_test1.zip test scan

Comment 21 William Kenney 2013-07-05 17:23:30 CEST

Created attachment 4192
gscan2pdf_test2.zip test scan

Comment 22 William Kenney 2013-07-05 17:24:19 CEST

Pablo please take a look at my attachments
gscan2pdf_test1.zip and gscan2pdf_test2.zip.
Contained in the two ZIP files are six files:

gscan2pdf_test1.zip
-------------------
scan.jpg
scan_source.odt

gscan2pdf_test2.zip
-------------------
scan_gocr.pdf
scan_gocr.txt
scan_tesseract.pdf
scan_tesseract.txt

Would this be what you would expect from this application?

I'm pretty happy with what I see. The PDFs seem pretty
accurate. I agree that the accented characters are
sketchy at best but mildly usable. We could beat this
thing to death forever. It's an application performance
issue, not an update issue. The update seems to work
just fine.

With your agreement I'd push this application update.

Scanner is an HP All-in-one 5510.
Scanning software is XSane

Lewis Smith 2013-07-05 21:31:59 CEST

Whiteboard: (none) => MGA3-32-OK

Comment 23 William Kenney 2013-07-06 17:58:03 CEST

gscan2pdf-1.0.6-2.2.mga3.noarch installs just fine in MGA3-64-OK

Whiteboard: MGA3-32-OK => MGA3-32-OK MGA3-64-OK

Comment 24 claire robinson 2013-07-06 18:26:04 CEST

Advisory uploaded.

Comment 25 William Kenney 2013-07-06 18:49:02 CEST

Update validated

Advisory:
=================================
This updates gscan2pdf-1.0.6-2.mga3.noarch.rpm to
gscan2pdf-1.0.6-2.2.mga3.noarch.rpm. Correcting
a problem including tesseract.


Updated packages in core/updates_testing:
========================

gscan2pdf-1.0.6-2.2.mga3.noarch.rpm

from SRPMS:
gscan2pdf-1.0.6-2.2.mga3.src.rpm


Could sysadmin please push from core/updates_testing to core/updates.

Tested on:
Intel Core i7-2600K Sandy Bridge 3.4GHz LGA 1155
GIGABYTE GA-Z68X-UD3-B3 LGA 1155 Intel Z68 SATA 6Gb/s MoBo
GIGABYTE GV-N440D3-1GI GeForce GT 440 (Fermi)
CORSAIR Vengeance 16GB (4 x 4GB)
Virtualbox-4.2.12-2.mga3.x86-64

Thank you!

William Kenney 2013-07-06 18:49:51 CEST

Keywords: PATCH, Triaged => validated_update
CC: (none) => sysadmin-bugs

Comment 26 Thomas Backlund 2013-07-09 21:46:53 CEST

Update pushed:
http://advisories.mageia.org/MGAA-2013-0056.html

Status: NEW => RESOLVED
CC: (none) => tmb
Resolution: (none) => FIXED
