Bug 33144 - badblocks returns wrong diagnostics
Summary: badblocks returns wrong diagnostics
Status: UPSTREAM
Alias: None
Product: Mageia
Classification: Unclassified
Component: RPM Packages (show other bugs)
Version: 9
Hardware: All Linux
Priority: Normal normal
Target Milestone: ---
Assignee: Thierry Vignaud
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-26 00:20 CEST by Yves DEMUR
Modified: 2024-06-01 10:30 CEST (History)
2 users (show)

See Also:
Source RPM: e2fsprogs-1.47.0-2.mga9
CVE:
Status comment:


Attachments
tarball containing my developement files (328.24 KB, application/x-compressed-tar)
2024-04-26 00:23 CEST, Yves DEMUR
Details
screen print of github e2fsprogs issue 185 (110.81 KB, application/pdf)
2024-05-26 19:25 CEST, Yves DEMUR
Details
screen print of github tytso/e2fsprog issues (168.54 KB, application/pdf)
2024-05-27 23:32 CEST, Yves DEMUR
Details
Providing small incremental changes starting with badblocks-1.47.0.c (136.19 KB, application/x-compressed-tar)
2024-06-01 00:24 CEST, Yves DEMUR
Details

Description Yves DEMUR 2024-04-26 00:20:49 CEST
Description of problem:
While checking by badblocks in write mode a USB key corrupted whith bad blocks, the numbers returned are not the offsets of the blocks. While using the same checked zone, changing the -c parameter changes the numbers output.

I checked the source file and found other bugs.

The good news is if badblocks returns (0,0,0) errors (and the media is honest), all the media has been correctly checked and is fully good.
The bad news is if there is any error (I/O or corruption), the program may
 - not verify some read block versus the good one written
 - not verify some blocks
 - reports wrong blocks offsets
 in short, the diagnostics are false
... just that ...

I have some USB keys with badblocks. Some of them were returning (0,0,0) errors every 2 badblocks executions (partial media check). The other execution of badblocks found errors. I think remapping or buffering of "fake size" USB keys is sly.

I use a "stable badblocks" TF card with a USB-A adapter.
It is supposed 16384256 blocks of 4096 bytes. I found a zone of badblocks starting at offset 7725542
The badblocks program allows to test a portion of a media by using the last_block first_block optional arguments.

[root@a55 misc]# badblocks -w -b 4096 -c 512 -t 0x55 /dev/sdc 7725542 7725542 # only one block, bad = 7725542
7725543
-----> The returned number has +1 => this number is out of range

[root@a55 misc]# badblocks -w -b 4096 -c 512 -t 0x55 /dev/sdc 7725543 7725532 # 12 blocks, bads = 7725542 7725543
7725554
7725555
-----> The returned numbers have +12 (the number of read blocks) ; the numbers are out of range

[root@a55 misc]# badblocks -w -b 4096 -c 1 -t 0x55 /dev/sdc 7725543 7725532 # 12 blocks, bads = 7725542 7725543
7725543
7725544
-----> The returned numbers have +1

So, changing the -c parameter changes the result, while checking the same zone !
I have found the explanation in the source code. I have found too other bugs which need multiple write or read errors. As These errors are not easy to produce, I have simulated them in the code in order to debug.

I made partially corrected versions to highlight the bugs with the error simulation and the debug prints.
I preserved the possibility of allow to see the differences by only modifying what is necessary and providing intermediate versions.
The proposed version may be cleared of the debug lines.


Version-Release number of selected component (if applicable):
e2fsprogs-1.45.6-6.1.mga8 but I think these bugs have been in the code for a long time
I have called badblocks-1.45.6.c the source version.
I have too used the 1.47.0 updated version ; I reported the differences 1.47.0/1.45.6.c .


How reproducible:
See above


Steps to Reproduce:
1.
2.
3.

Proposed solution:

I have modified the badblocks.c source code and created multiple versions.
The joined badb.tgz contains all my production.

Create a directory named badb, cd into it and untar badb.tgz .
$ cd /tmp ; mkdir badb ; cd badb ; tar -xpvaf .../badb.tgz

The Readme.txt file continues these explanations.

I wish you a good reading ...

Yves DEMUR -- Thu Apr 25 23:30:35 CEST 2024
yves.demur@m4am.net

If you want to download the complete tarball containing the executable binaries, it's here :
http://yves.demur.free.fr/badb.tgz (~10MiB)
Comment 1 Yves DEMUR 2024-04-26 00:23:12 CEST
Created attachment 14515 [details]
tarball containing my developement files
Comment 2 sturmvogel 2024-04-26 05:46:45 CEST
This needs to be reported upstream if it is an upstream code problem. Please create an issue at https://github.com/tytso/e2fsprogs

Source RPM: e2fsprogs-1.45.6-6.1.mga8 e2fsprogs-1.47.0-2.mga9 => e2fsprogs-1.47.0-2.mga9
Status: NEW => UPSTREAM

Comment 3 Yves DEMUR 2024-05-26 17:08:27 CEST
Issue created :
https://github.com/tytso/e2fsprogs/issues/185
No feedback for a month...
Comment 4 Yves DEMUR 2024-05-26 17:40:24 CEST
While waiting for it to be taken into account by the maintainer, I can provide a version 2 (badblocks2.c and badblocks2.8.in), but badblocks2.c must be compiled in the e2fsprogs/misc environment
Comment 5 Morgan Leijström 2024-05-26 17:57:19 CEST
That link https://github.com/tytso/e2fsprogs/issues/185 gives a 404.
Wrong bug number?

CC: (none) => fri

Comment 6 Yves DEMUR 2024-05-26 18:17:37 CEST
I get the 404 error too, when I am not signed in on github.
All other issues are seen but not the 185.
When I sign in, I see it !
I'm not familiar with how to use github. Could you try the link after signing in ?
Comment 7 Morgan Leijström 2024-05-26 19:03:27 CEST
For me it does not help to sign in, but maybe there is yet another parameter, such as being member of the project or registered in the bug.
I too am not familiar with github, only have been commenting a couple bugs...
Comment 8 Yves DEMUR 2024-05-26 19:25:01 CEST
Created attachment 14548 [details]
screen print of github e2fsprogs issue 185

Here is what I see at
https://github.com/tytso/e2fsprogs/issues/185
Comment 9 Morgan Leijström 2024-05-26 19:48:16 CEST
OK thanks.
Please update this bug with any progress upstream.

Assigning this to e2fsprogs maintianer.

Assignee: bugsquad => yvesbrungard

Comment 10 papoteur 2024-05-27 09:06:15 CEST
Assigning to the registered maintainer
mgarepo maintdb get e2fsprogs
tv

Assignee: yvesbrungard => thierry.vignaud

Comment 11 Morgan Leijström 2024-05-27 14:44:52 CEST
Thank you for correcting.

Interesting: why do
http://90.7.108.76/show?distribution=9&architecture=x86_64&graphical=0&rpm=e2fsprogs
say papoteur is the maintainer?
And for Cauldron it say wally.
Comment 12 sturmvogel 2024-05-27 17:09:32 CEST
The github issue with #185 is not visible to the maintainers of e2fsprogs. There is no issue #185. The actual list ends at #184...thats the reason why there is no answer yet...
Comment 13 Morgan Leijström 2024-05-27 17:21:37 CEST
(In reply to Morgan Leijström from comment #11)
> Interesting: why do
> http://90.7.108.76/
> show?distribution=9&architecture=x86_64&graphical=0&rpm=e2fsprogs
> say papoteur is the maintainer?
> And for Cauldron it say wally.

Papoteur answered on qa-list that madb use name of last packager submitting.
Comment 14 Yves DEMUR 2024-05-27 23:24:49 CEST
(In reply to sturmvogel from comment #12)
> The github issue with #185 is not visible to the maintainers of e2fsprogs.
> There is no issue #185. The actual list ends at #184...thats the reason why
> there is no answer yet...

I created issue #185 and I show it in attachment 14548 [details] of comment 8. What should it be done and by whom for it to be visible?
Comment 15 Yves DEMUR 2024-05-27 23:32:30 CEST
Created attachment 14552 [details]
screen print of github tytso/e2fsprog issues

here is what I see at
https://github.com/tytso/e2fsprogs/issues
when I am logged in on github
Comment 16 papoteur 2024-05-28 07:53:33 CEST
I sent a mail to the owner of the repository. He replies, thus I report the exchange:
> As a Mageia's packager, I received a report [1] of errors, including
> demonstration and fix about the badblocks program.
>
> I didn't check anything, but I consider that the analysis is serious enough
> to report it.
>
> The report is also on the github repository [2].
>
> [1] https://bugs.mageia.org/show_bug.cgi?id=33144
>
> [2] https://github.com/tytso/e2fsprogs/issues/185

Thanks for your note.  I've looked at [1], and I have a couple
observations.

First of all, [2] is not visible to me either.  As near as I can
tell, github doesn't think it exists.  It's not deleted; if you create
an issue and then delete it, then you get something like this[3].

[3] https://github.com/tytso/e2fsprogs/issues/187

So I'm not sure what's going on, but it's not something that I as the
github project project administrator can change or fix or look at,
either.

Secondly, badblocks as a program which is at best 3rd tier priority
for me.  It's not a core part of the e2fsprogs functionality, and I've
been quite tempted to just remove it from the package altogether.

Thirdly, the author made a huge number of extraneous changes that has
obscured whatever change he thought was necessary, and I really don't
care enough to try to spend a day or more trying to puzzle through his
modified code to figure out what he might have done.  Essentially, he
violated all of the rules about how to submit changes: (1) don't make
changes to whitespace; (2) make one small change at a time; (3) at a
minimum use diff to send a smallest possible change, if you can't use
git, etc.  There's a basic summary of the guidelines used for
submitting patches[4] to the kernel, which is generally applicable
here as well.

[4] https://www.kernel.org/doc/html/latest/process/submitting-patches.html

Also helpful would be if someone actually gave a reliable reproducer
to the problem which the user is seeing.  A good way to do it in this
case is to sue the dm-flakey[5] target in devicemapper, with the
probability of an error set to 100%.

[5] https://docs.kernel.org/admin-guide/device-mapper/dm-flakey.html

In any case, this is not something that I plan to spend much time on
since I am super busy and super overloaded.  But sending a huge source
file and asking someone to just blindly replace it in either upstream
or in your Linux distribution is not a good idea; consider the recent
the xz security debacle[6].  If someone wants to send me a clean
reproducer, and a simple succint patch, I'll consider it.  Otherwise,
this is not important enough for me to spend my volunteer/personal
time on.

[6] https://www.akamai.com/blog/security-research/critical-linux-backdoor-xz-utils-discovered-what-to-know

Best regards,

						- Ted

CC: (none) => yvesbrungard

Comment 17 Yves DEMUR 2024-05-28 15:53:05 CEST
I thank all the contributors who intervened in this bug report. I recognize that out of ignorance I did not do things the ideal way. I'm open to suggestions on how to help better.

It was out of curiosity that I looked into the reason why the badblocks program returns wrong bad block numbers. I think this bug and wrong I/O error handling has been present since badblocks was first written. This is serious because "mke2fs -c -c ..." and "e2fsck -c ..." run badblocks and use the returned block list.
Temporarily adding print-debug and error simulation instructions to badblocks.c allowed me to catch these programming errors.

Depending on the requested option, badblocks launches one of the three functions
- test_ro (read-only test of the media), or
- test_rw (write test which destroys the media data), or
- test_nd (write test which restores the media unchanged, nd=non_destructive).
These three functions have duplicate parts of code.

According to the consultation of the source and the tests (see Readme.txt), it turns out that
- test_ro has no bugs,
- test_rw has a bug on the numbers returned and on the handling of I/O errors, these two bugs can be corrected simply (see version badblocks_bu2.c),
- test_nd has a bug in the processing of I/O errors, impossible to correct without profoundly modifying the algorithm.

I proposed the badblocks_yd1.c version (and its badblocks_yd1.8.in) which includes the modifications of badblocks_bu2.c but which deactivates the use of test_nd.
The differences between badblocks-1.47.0.c and badblocks_yd1.c are explained in Readme.txt, over the versions that led to badblocks_bu2.c.

To have the non-destructive testing functionality, I was forced to review the algorithm and took the opportunity to simplify by removing code duplication. I then added a few options that I found useful and easy to code. The result remains fully compatible with badblocks-1.47.0.c (default values, options and results) except for the correction of returned numbers and the handling of I/O errors.
Tests confirm the correct functioning of the patched version, as well as backward compatibility with badblocks-1.47.0.c.
Comment 18 Yves DEMUR 2024-05-28 23:16:30 CEST
After checking the versions in github, it turns out that the error about bad block numbers in test_rw was introduced in the version of badblocks.c from 2011-02-20. The line "currently_testing += got;" was placed before (L684) the corruption test loop, which uses the currently_testing+i value as an argument to call bb_output (L699). In the version of 2010-09-25 "currently_testing += got;" is in L661, after the call to bb_output in L659, which is correct.
So the block number error was introduced in 2011 and not during initial programming as I assumed in the previous post.
Comment 19 Morgan Leijström 2024-05-29 07:43:09 CEST
This sounds like a great input for upstream to ponder :)
Comment 20 Yves DEMUR 2024-05-29 11:29:25 CEST
"mke2fs -c -c ..." launches badblocks and its test_rw function; mke2fs then receives an erroneous list of bad corrupted blocks.
The correction of the test_rw source is relatively light:
- use currently_testing+i-got instead of currently_testing+i in the call to bb_output in CORRUPTION_ERROR,
- remove the lines involving the recover_block variable and replace them with processing using the currently_buffer variable (see badblocks_bu2.c) to correctly synchronize pointing to the media and pointing to the buffer.
This can be the subject of a classic corrective patch.

"e2fsck -c -c ..." launches badblocks and its test_nd function; if there are I/O errors, e2fsck then receives an incomplete list of corrupted bad blocks: the I/O error on a block causes the following blocks in the slice to not be checked, which are declared good by default.
I haven't found a simple fix for the test_nd source. In my opinion, the processing involving the recover_block and granularity variables is flawed. This is shown in the test with print-debugs and error simulation.
The only path I saw was to redesign the algorithm, but perhaps another tinkerer would see a simpler way out. The redesign merged the three test_* functions into one.
These are important changes that affect the core processing of badblocks. This could be the subject of a new major version, or even a new command: badblocks2.

If badblocks is a low priority program, the fact that it is used by mke2fs and e2fsck may prompt a review of its classification.
Comment 21 Yves DEMUR 2024-05-30 01:12:27 CEST
The problem of I/O errors in test_nd is complex. It should be seen that the non-destructive test is obtained by the following sequence of operations on each slice (set of blocks_at_once blocks):
1-save the slice in memory (read action),
2-write the test pattern on the media,
3-reading of the media and comparison with the test pattern,
4-restore of the slice (write action).

The error reading a block during 1-save should result in the preservation of the media by preventing 2-write and 4-restore operations on this block. Likewise, it is useless to perform the 3-reading operation on a block whose 2-write operation is in error.

Blocks with I/O errors divide the slice into zones of contiguous blocks without I/O errors. Each operation must only concern one zone, in order to ignore I/O error blocks from previous operations. We see a recursive type of operation emerging.

I therefore developed a recursive function, which divides the processed zone into sub-zones of contiguous blocks without I/O error, and which restarts itself on each sub-zone for the next operation. The 4-restore will be initiated by the 1-save after the return of the 2-write (which initiated the 3-reading).
This function can also be used for test_ro and test_rw without using recursion.

This involves a major modification of the code and therefore significant verification work before validating it.
Comment 22 sturmvogel 2024-05-30 04:41:01 CEST
You should try to open an upstream issue (again) and provide your input or better a valid patch. Make sure that the issue can be seen also after logging out from github.

There is nothing what can be done here.
Comment 23 Yves DEMUR 2024-06-01 00:24:52 CEST
Created attachment 14555 [details]
Providing small incremental changes starting with badblocks-1.47.0.c

to fix bugs of test_rw and make small improvements to the code
Comment 24 Yves DEMUR 2024-06-01 00:29:43 CEST
OK
I'll stop boring you with my comments. I think it's time to close this bug report.

Currently the only safe way to use media is... not to use it if badblocks returns at least one error. It's all or nothing. The absence of errors is reliable. Setting aside bad blocks is not possible. We just know that there are some, without knowing where they really are.

I looked at github. It takes a minimal investment in know-how and good manners to be able to collaborate properly. This amount of work seems excessive to me just to get the badblocks maintainer to modify/add 32 lines in test_ro and test_rw. In addition, you will have to be very diplomatic for him to agree to sacrifice a few minutes of his life to check and validate two bug fixes for a program that he wishes to delete. As for submitting an algorithm modification to rectify test_nd: not even in a dream!

I think we're wasting our time if the maintainer doesn't react to the fact that
- mke2fs and e2fsck are affected by badblocks bugs (mkfs.ext4 is a link to mke2fs),
- for 13 years mke2fs has received from test_rw a false list of bad blocks to discard,
- e2fsck receives from test_nd a fanciful list of bad blocks (e2fsck checks ext2/ext3/ext4).

However, if an experienced githuber feels the patience to try to rectify at least test_ro and test_rw, just to correct the 2 bugs, I provide useful elements in the attachment #14555 [details] (see the Readme.txt). If successful, users of "mke2fs -c -c ..." will benefit from a relevant verification of the media (thus putting aside the real blocks in error), and having optimization until the end of I/O provided by blocks_at_once.
Too bad for users of "e2fsck -c -c ..." who have media with errored blocks, test_nd will remain messed up. I tried hard to understand the programmer's intention of test_nd, but I couldn't. In my opinion the solution requires more than sporadic corrections.

For my part, I created a badblocks2 executable (and its man) which I use to check the newly acquired media. I offer it for download on my site http://yves.demur.free.fr . The version is that of badblocks_yd4.c (see attachment #14515 [details]). This version works identically to badblocks_yd2.c, which has been extensively tested (see Readme.txt of attachment #14515 [details]).

Thank you all.
Comment 25 Morgan Leijström 2024-06-01 10:30:14 CEST
It is very good you bring this problem up.

This is clearly an upstream issue, a you say we can not do much here.

We could close it as wontfix but that would hurt my mind to know an important tool is broken in a way it may couse users to loose data (and get grey hair) and we seem not to care.

If you have the energy to it, maybe you can create interest with some experienced people with good reputation and who likes to devote time/get paid to work on it.  So maybe ask some heavy actor like Redhat or similar; for example post on their forum describing the problem to start the ball rolling.

Note You need to log in before you can comment on or make changes to this bug.