Bug 7514 - case insensitive [a-z] in sed
Summary: case insensitive [a-z] in sed
Status: RESOLVED INVALID
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages
Version: 2
Hardware: x86_64 Linux
Priority: Normal major
Target Milestone: ---
Assignee: Shlomi Fish
QA Contact:
URL: http://pastebin.com/LHHSHXew
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-09-18 17:12 CEST by Yves Brissaud
Modified: 2014-05-08 18:05 CEST

See Also:
Source RPM: sed-4.2.1-4.mga1
CVE:
Status comment:


Description Yves Brissaud 2012-09-18 17:12:56 CEST
Description of problem:

With Mageia 1 or 2 on x86_64, [a-z] in a sed substitution pattern matches case-insensitively.
On Debian squeeze, [a-z] is case sensitive.
Both have the same version of sed (4.2.1).


Version-Release number of selected component (if applicable):

4.2.1-4.mga1

How reproducible:

Use a range expression such as [a-z] in a sed substitution pattern.

Steps to Reproduce:

With Mageia:
$ echo "A" | sed 's/[a-z]/b/'
> b

With Debian squeeze:
$ echo "A" | sed 's/[a-z]/b/'
> A

(I set the severity to major because the problem can affect scripts such as post-install and configuration scripts.)
Nicolas Vigier 2012-09-18 17:18:19 CEST

CC: (none) => boklm

Comment 1 Yves Brissaud 2012-09-18 17:59:00 CEST
Some additional tests:

* the range form [.-.] does not behave as expected ([a-b], [a-z], ...)

$ echo "B" | sed 's/[b-c]/a/'
> a

* the list form [..] behaves as expected ([abc], for example)

$ echo "B" | sed 's/[bc]/a/'
> B
Comment 2 Arnaud Pharasyn 2012-09-20 19:56:54 CEST
All this apparently strange behaviour is due to the LC_COLLATE environment variable, which affects sed and other commands.

For instance, try the following in your bash shell (I guess it would be the same on Debian, which probably just has no LC_COLLATE defined by default):

$ export LC_COLLATE=fr_FR.UTF8
$ echo "A" | sed 's/[a-z]/b/'
b

$ export LC_COLLATE=C
$ echo "A" | sed 's/[a-z]/b/'
A


You can see that the sorting order matters for the [.-.] form, and of course not for the [..] form, where you explicitly specify the characters to test for. To get the sorting order in a given locale:

$ export LC_COLLATE=fr_FR.UTF8
$ echo $(printf '%s\n' {A..z} | sort)
` ^ _ [ ] a A b B c C d D e E f F g G h H i I j J k K l L m M n N o O p P q Q r R s S t T u U v V w W x X y Y z Z

$ export LC_COLLATE=C
$ echo $(printf '%s\n' {A..z} | sort)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z

This means that in some locales, writing [a-z] is the same as writing [aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYz], which explains the behaviour you observe: every uppercase letter except Z sorts between a and z. It is not really a question of case insensitivity. You can also check this by typing, in your original locale:

$ echo "Y" | sed 's/[a-z]/b/'
b
$ echo "Z" | sed 's/[a-z]/b/'
Z
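
A quick way to see which letters a range picks up in a given locale is to feed each candidate letter to sed and print only the ones that match. The sketch below is an illustration only; matched_by_range is a hypothetical helper name, not part of sed.

```shell
# matched_by_range: hypothetical helper, not part of sed.
# Prints, on one line, each argument that the range [a-z] matches
# under the locale currently in effect.
matched_by_range() {
    for ch in "$@"; do
        # With -n and the p flag, sed prints the line only if the
        # substitution succeeded, i.e. only if [a-z] matched.
        printf '%s\n' "$ch" | sed -n 's/[a-z]/&/p'
    done | tr -d '\n'
    echo
}

# Usage: probe the uppercase letters in the current locale.
matched_by_range A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
```

With LC_ALL=C exported, the call above prints an empty line, since no uppercase letter falls between a and z in byte order. In a locale with the interleaved ordering shown earlier, it would be expected to list the uppercase letters that sort inside the range.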

CC: (none) => eonwir.ardamire+mageia

Manuel Hiebel 2012-09-20 20:48:16 CEST

CC: (none) => mageia
Assignee: bugsquad => shlomif

Comment 3 Yves Brissaud 2012-09-20 21:30:54 CEST
Thanks for all the information.
But I still don't understand whether this is normal or not.

On macOS:

$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=
$ echo "A" | sed 's/[a-z]/b/'
A

$ export LC_COLLATE=fr_FR.UTF8
$ echo $(printf '%s\n' {A..z} | sort)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z

I'll try on Debian tomorrow, but I guess the result will be identical, won't it?
Comment 4 Shlomi Fish 2012-09-20 21:48:13 CEST
From what I know it is dependent on one's locale, so not a bug.
Comment 5 Yves Brissaud 2012-09-20 22:59:49 CEST
But I can reproduce this only on Mageia, not on Debian, Ubuntu or macOS, for example. All four use the same locale (fr_FR.UTF8).

The initial test in the bug report was run with the same locale on each computer.

Perhaps the Mageia package of sed is compiled with a specific flag?
Comment 6 Shlomi Fish 2012-09-21 09:40:17 CEST
Hello Yves,

(In reply to comment #5)
> But I can reproduce this only on Mageia, not on Debian, Ubuntu or macOS, for
> example. All four use the same locale (fr_FR.UTF8).
> 
> The initial test in the bug report was run with the same locale on each
> computer.
> 
> Perhaps the Mageia package of sed is compiled with a specific flag?

Mageia's sed is not built with any special flags:

%configure2_5x	--bindir=/bin
%make LDFLAGS=-s
%make html
%make check

I've now built GNU sed from source under ~/apps/temp-sed, and it yields the same results:

shlomif[rpms]:$mageia/sed$ ( echo "A" | ~/apps/temp-sed/bin/sed 's/[a-z]/b/' )
b
shlomif[rpms]:$mageia/sed$ ( export LC_ALL=C ; echo "A" | ~/apps/temp-sed/bin/sed 's/[a-z]/b/' )
A

So I don't think the problem is in the sed package.

Regards,

-- Shlomi Fish
Comment 7 Yves Brissaud 2012-09-21 09:55:49 CEST
Hi,

I finally reproduced it on Debian with a French locale (my earlier test must have been wrong).
So it really is a locale problem.

Thanks for the time spent on my problem. The bug can be closed, as it's not a real sed bug.

Regards,
Yves
Comment 8 Arnaud Pharasyn 2012-09-21 10:06:22 CEST
Hi,

I was doing more or less the same thing as Shlomi, compiling sed from source without any special flags, and I also saw exactly the same behaviour as with the default sed shipped with Mageia.

Then I read the following page:
http://www.gnu.org/software/sed/manual/html_node/Reporting-Bugs.html
especially the end:

---- begin included text ----
Here are a few commonly reported bugs that are not bugs.

(...)

[a-z] is case insensitive
    You are encountering problems with locales. POSIX mandates that [a-z] uses the current locale's collation order - in C parlance, that means using strcoll(3) instead of strcmp(3). Some locales have a case-insensitive collation order, others don't.

    Another problem is that [a-z] tries to use collation symbols. This only happens if you are on the GNU system, using GNU libc's regular expression matcher instead of compiling the one supplied with GNU sed. In a Danish locale, for example, the regular expression ^[a-z]$ matches the string "aa", because this is a single collating symbol that comes after "a" and before "b"; "ll" behaves similarly in Spanish locales, or "ij" in Dutch locales.

    To work around these problems, which may cause bugs in shell scripts, set the LC_COLLATE and LC_CTYPE environment variables to "C".
---- end included text ----
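
The workaround quoted above can be applied to a single command rather than to the whole shell session. The following is a minimal sketch, assuming a script only needs ASCII ranges; sed_c is a hypothetical wrapper name. It uses LC_ALL because LC_ALL takes precedence over LC_COLLATE and LC_CTYPE when it is already set in the environment.

```shell
# sed_c: hypothetical wrapper, not a real command.
# Runs sed in the C locale so that [a-z] follows byte order,
# whatever locale the caller has configured.
sed_c() {
    # The assignment applies to this one sed invocation only.
    LC_ALL=C sed "$@"
}

printf 'A\n' | sed_c 's/[a-z]/b/'   # prints A
printf 'q\n' | sed_c 's/[a-z]/b/'   # prints b
```

The trade-off is that sed then treats its input as plain bytes, so this is probably best kept for patterns that only deal with ASCII text.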


As a second step, I compiled two versions from the same original source, one with --with-included-regex and one with --without-included-regex. Here we can see the difference on the test case you mentioned:

$ echo "A" | ./sed_with_included_regex 's/[a-z]/b/'
b

$ echo "A" | ./sed_without_included_regex 's/[a-z]/b/'
A

I guess that's quite a subtle difference in behaviour, and one to be aware of when writing scripts that use the [.-.] syntax.
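
For scripts that cannot change the locale, one alternative to a range is a POSIX character class, which tests a character's type rather than its position in the collation order. This is only a sketch; first_lower_to_b is a hypothetical helper name used for illustration.

```shell
# first_lower_to_b: hypothetical helper for illustration.
# Replaces the first lowercase letter on each line with "b".
first_lower_to_b() {
    # [[:lower:]] asks "is this a lowercase letter?" and does not
    # depend on where uppercase letters sort relative to a and z.
    sed 's/[[:lower:]]/b/'
}

printf 'A\n' | first_lower_to_b   # prints A
printf 'a\n' | first_lower_to_b   # prints b
```

Note that in a non-C locale [[:lower:]] may also match accented lowercase letters, so it is not an exact synonym for the ASCII range. Spelling the set out as [abcdefghijklmnopqrstuvwxyz] is the other option, since the [..] list form is not affected by collation.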

Cheers,
Arnaud
Comment 9 Yves Brissaud 2012-09-21 10:11:25 CEST
Hi,

Thanks for this information.

> As a second step, I compiled two versions from the same original source, one
> with --with-included-regex and one with --without-included-regex. Here we can
> see the difference on the test case you mentioned:
> 
> $ echo "A" | ./sed_with_included_regex 's/[a-z]/b/'
> b
> 
> $ echo "A" | ./sed_without_included_regex 's/[a-z]/b/'
> A

OK, that may explain why my Mac with LC_COLLATE=fr_FR does not give the same result as my Mageia system.

Thank you all!

Yves
Comment 10 Shlomi Fish 2012-10-21 12:09:53 CEST
Resolving as invalid then. Thanks for the report.

Status: NEW => RESOLVED
Resolution: (none) => INVALID

Nicolas Vigier 2014-05-08 18:05:39 CEST

CC: boklm => (none)

