Bug 28091

Summary: freedups ends with error message "Can't use an undefined value as an ARRAY reference at /usr/bin/freedups line 487"
Product: Mageia Reporter: andré blais <andr999>
Component: RPM Packages Assignee: Mageia Bug Squad <bugsquad>
Status: RESOLVED WORKSFORME QA Contact:
Severity: normal    
Priority: Normal CC: lewyssmith
Version: 7 Keywords: UPSTREAM
Target Milestone: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Source RPM: freedups-0.6.14-12.mga7.src.rpm CVE:
Status comment:

Description andré blais 2021-01-14 02:21:51 CET
Description of problem:

Error message :
"Can't use an undefined value as an ARRAY reference at /usr/bin/freedups line 487."

Version-Release number of selected component (if applicable):
see source rpm

How reproducible:

Unknown.

Freedups processed normally, for a few hours, on a partition with many thousands of files.
There was very frequent display of numbers associated with files being compared.
There were frequent messages indicating 2 files had been linked.
The error message occurred just after a message that 2 files had been linked.
(i.e. there was no subsequent display of numbers associated with files being compared.)

Maybe the error occurred instead of exiting normally when it finished ?

Files were compared by name and save date only, and not byte-by-byte comparison.

Comment 1 Lewis Smith 2021-01-14 16:23:13 CET
Thank you for this report. I am intrigued that any system really has duplicated files, but the application site: http://www.stearns.org/freedups/ claims Gigabytes of them are possible.
The higher level page is interesting & important: http://www.stearns.org/
----------------------------------------------
Before looking at the bug per se, worth noting:
This command does not have a man page; do:
 $ freedups -h

> Files were compared by name and save date only, and not byte-by-byte
> comparison
does not look possible. The site page says:
"What has to be true for two files to get linked together?
	- They have to be files (i.e. not character or block devices, no
pipes, no directories, no symlinks).
	- They have to have at least one byte.  I don't want to link
all 0 byte files on the system together.
	- They have to have the same size.
	- They have to have the same user owner, group owner and mode.
	- They have to be readable by the current user.
	- The contents of the files have to be identical.
	- They have to be on the same partition.
	- That partition must support hardlinks.  Ext2, ext3 and
reiserfs do."
None of that is modifiable.
What you cite are *extra* constraints:
"       - Optionally (--minsize=1000), the files have to be larger than
the given number of bytes.
	- Optionally (--datesequal=yes), the files have to have identical
modification timestamps.
	- Optionally (--filenamesequal=yes), the filenames have to be 
identical (in different directories, obviously)."
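As a rough illustration of the criteria quoted above, here is a minimal Python sketch. It is not freedups's own code (freedups is a Perl script); the function name `may_link` and its parameters are hypothetical, chosen only to mirror the quoted list. It does not test whether the filesystem supports hardlinks.

```python
import filecmp
import os
import stat


def may_link(path_a, path_b, min_size=0, dates_equal=False, names_equal=False):
    """Return True if two paths meet the linking criteria quoted above.

    Hypothetical helper for illustration; not part of freedups.
    """
    # lstat() so that symlinks are rejected rather than followed.
    st_a = os.lstat(path_a)
    st_b = os.lstat(path_b)

    # Regular files only (no devices, pipes, directories or symlinks).
    if not (stat.S_ISREG(st_a.st_mode) and stat.S_ISREG(st_b.st_mode)):
        return False
    # Same size, and larger than min_size (default 0 means "at least one byte").
    if st_a.st_size != st_b.st_size or st_a.st_size <= min_size:
        return False
    # Same user owner, group owner and mode.
    if (st_a.st_uid, st_a.st_gid, st_a.st_mode) != (st_b.st_uid, st_b.st_gid, st_b.st_mode):
        return False
    # Same partition: hard links cannot cross devices.
    if st_a.st_dev != st_b.st_dev:
        return False
    # Readable by the current user.
    if not (os.access(path_a, os.R_OK) and os.access(path_b, os.R_OK)):
        return False
    # Optional extra constraints (modelled on --datesequal and --filenamesequal).
    if dates_equal and int(st_a.st_mtime) != int(st_b.st_mtime):
        return False
    if names_equal and os.path.basename(path_a) != os.path.basename(path_b):
        return False
    # Finally, the contents must be identical, byte for byte.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The cheap metadata checks come first so that the byte-by-byte comparison runs only for pairs that pass everything else; whether freedups orders its checks this way is an assumption.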

It wants the '-a' parameter to actually do linking; otherwise it does a dummy run.
----
The error message reported is real enough, and begs a question:
* What happens if you re-run the same command? If it had finished the first time, it should do nothing. If it continues to process duplicate files, presumably it had not finished. From its page:
"Can this be safely run more than once?
	Definitely.  Freedups is smart enough to recognize that two
files are already linked together and just moves on to the next pair."
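How freedups recognizes an existing link is not stated on its page. One common way, sketched here in Python as an assumption, is to compare device and inode numbers: two names that share both refer to the same file. The helper name `already_linked` is hypothetical.

```python
import os


def already_linked(path_a, path_b):
    """Return True if both names are hard links to the same inode.

    Hypothetical helper for illustration; not part of freedups.
    """
    st_a = os.stat(path_a)
    st_b = os.stat(path_b)
    # Same device and same inode number means the same underlying file.
    return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
```

A pair for which this returns True can be skipped without reading either file, which would be consistent with repeat runs being fast.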

Most likely this is an application problem about which Mageia can do nothing. Normally we would say "Raise a bug upstream", but I see no place for doing so. 
However, the higher-level page says:
"If you need to get a hold of me, try:
email: wstearns@pobox.com
    I'll have this address forever."
If the problem recurs, try that; explain carefully the exact command you used, the total number of files involved (approximately), for example, from the starting directory:
 $ tree -a | wc -l
 $ find -type f | wc -l
and the number linked:
 $ find -type f -links +1 | wc -l
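If both figures are wanted from a single pass over a very large tree, the two find commands can be approximated with a short Python sketch. The function name `count_files` is hypothetical, and the counts are intended to match `find -type f` and `find -type f -links +1` only under the assumption that symlinks are not followed.

```python
import os
import stat


def count_files(top):
    """Return (regular_files, files_with_more_than_one_link) under top."""
    total = 0
    linked = 0
    # os.walk() does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            # Count regular files only, like "find -type f".
            if stat.S_ISREG(st.st_mode):
                total += 1
                # A link count above 1 means another name shares the inode.
                if st.st_nlink > 1:
                    linked += 1
    return total, linked
```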

CC: (none) => lewyssmith

Comment 2 andré blais 2021-01-18 02:43:39 CET
(In reply to Lewis Smith from comment #1)
> Thank you for this report. I am intrigued that any system really has
> duplicated files, but the application site: http://www.stearns.org/freedups/
> claims Gigabytes of them are possible.

I have saved a G or more of space on occasion.
Note that some libraries have duplicate files in different directories.
As well, since I use freedups from time to time after backups from all regular partitions to a single removable backup partition, some duplicates will come from different partitions.

> > Files were compared by name and save date only, and not byte-by-byte
> > comparison

There is a --paranoia or -p option which defaults on, and allows suppressing byte-by-byte comparison.  Although I found on retesting that I wasn't using it, and it is no longer recognized as a legitimate option.

The process creates a check-sum of all files before comparing them, and would only do a byte-by-byte comparison for files which seem to be identical.
By adding requiring the same name and change date, fewer files would be compared byte-by-byte.  From past experience, without these requirements, many more files would be seen as identical.
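The two-stage process described above (checksum first, byte-by-byte comparison only for apparent matches) can be sketched in Python as follows. This is an illustration under assumptions, not freedups's Perl code: the names `md5_of` and `duplicate_pairs` are hypothetical, and the order in which freedups applies the name and date constraints is not confirmed.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_pairs(paths, names_equal=False, dates_equal=False):
    """Return (first, other) pairs whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for path in paths:
        st = os.stat(path)
        # Files can only be candidates if size and checksum both match.
        key = [st.st_size, md5_of(path)]
        # Optional constraints split the groups further, so fewer pairs
        # reach the expensive byte-by-byte comparison below.
        if names_equal:
            key.append(os.path.basename(path))
        if dates_equal:
            key.append(int(st.st_mtime))
        groups[tuple(key)].append(path)

    pairs = []
    for candidates in groups.values():
        first = candidates[0]
        for other in candidates[1:]:
            # Final check: compare full contents, not just the checksum.
            if filecmp.cmp(first, other, shallow=False):
                pairs.append((first, other))
    return pairs
```

With `names_equal=True`, identical files that have different names fall into separate groups and are never compared, which matches the observation that fewer files get linked when the name and date constraints are in effect.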

...
> What you cite are *extra* constraints:
...
> 	- Optionally (--datesequal=yes), the files have to have identical
> modification timestamps.
> 	- Optionally (--filenamesequal=yes), the filenames have to be 
> identical (in different directories, obviously)."
> 
> It wants the '-a' parameter to actually do linking; otherwise it does a
> dummy run.
> ----
> The error message reported is real enough, and begs a question:
> * What happens if you re-run the same command? If it had finished the first
> time, it should do nothing. If it continues to process duplicate files,
> presumably it had not finished. From its page:
> "Can this be safely run more than once?
> 	Definitely.  Freedups is smart enough to recognize that two
> files are already linked together and just moves on to the next pair."

Many files were already linked during the backup process, so it is important that freedups is able to recognize this.

I did rerun the command, with :
freedups -a -d -f /run/media/andr/u500

On the first rerun, it showed linking 2 files, immediately followed by exactly the same error message.

On the second rerun, it printed a few dots, then exited almost immediately, without an error message.

Note that the partition in question contains over 12 million files, with most of 10 years of backups.  Each backup is done to a different directory, with each source partition to its own subdirectory.
Many older files no longer relevant have been removed, since the media is about 80% full.

I ran fsck, just in case, and there were no errors.

> 
> Most likely this is an application problem about which Mageia can do
> nothing. Normally we would say "Raise a bug upstream", but I see no place
> for doing so. 

I have run freedups many times in the last few years, without an error message.
The last version from the developer (and that installed) is from 2014.

I suspect that the error is due to some change in packaging.
Otherwise maybe some sort of overflow due to the large number of files.

Comment 3 Lewis Smith 2021-01-18 21:01:55 CET
> the partition in question contains over 12 million files
Ah! Does the number of files increase with time?

> I have run freedups many times in the last few years, without an error
> message
> I suspect that the error is due to some change in packaging
The package has not changed since Mageia 5-6-7 nor for Mageia 8.
It has worked *up to now*, so it is clearly OK; and you have hit some limit in the application not seen before.

So I repeat the suggestion:
"If you need to get a hold of me, try:
email: wstearns@pobox.com
    I'll have this address forever."
Just for interest. You won through.

> There is a --paranoia or -p option which defaults on, and allows
> suppressing byte-by-byte comparison
"Options (default value in parentheses; 1=Enabled, 0=Disabled)
  --paranoid|-p			Recheck all file stats and completely compare every byte of the files just before linking.  This should definitely be left on unless you are _positive_ that the md5 checksum cache is correct and there's no chance that files will be modified behind freedups' back. (1)"

Whether or not it is a valid option now (since 2014), unless you specifically disabled it, it does imply byte-by-byte comparison of (and only for) files with the same MD5 checksum.

> By adding requiring the same name and change date, fewer files would
> be compared byte-by-byte.
From its site page (comment 1), I saw these parameters as being extra constraints, which would limit linking of otherwise identical files. But I see your point: if it speeds things up for you, these checks must be done before the others.

But these are not the issue. You won through.

Please make the e-mail enquiry of the author, and post the result if you get a reply. In the meantime, can we close this 'works for me' ? The bug remains for reference.

Keywords: (none) => UPSTREAM

Comment 4 andré blais 2021-01-19 02:03:48 CET
(In reply to Lewis Smith from comment #3)
> > the partition in question contains over 12 million files
> Ah! Does the number of files increase with time?

Yes .. every time a file changes, as happens with regular updates, there is one more file to compare.
It has to calculate a checksum for every inode at least (if not doing it for every file), before it starts the comparison.
If that is causing the problem, it isn't using all the memory available, since I have 12G.  But not surprising, since the last update to freedups was in 2014.
I will be removing many files that I am sure I won't be needing.
 
...

> > By adding requiring the same name and change date, fewer files would
> > be compared byte-by-byte.
> From its site page (comment 1), I saw these parameters as being extra
> constraints, which would limit linking of otherwise identical files. But I
> see your point: if it speeds things up for you, these checks must be done
> before the others.

Before I added the -d date/time and -f file name constraints, it was linking many small identical files with very different dates and file names.

> Please make the e-mail enquiry of the author, and post the result if you get
> a reply. In the meantime, can we close this 'works for me' ? The bug remains
> for reference.

I'll close the bug here, since on the third try it did work without error.
It was much faster the second time, and even more so the third time.

Status: NEW => RESOLVED
Resolution: (none) => WORKSFORME

Comment 5 Lewis Smith 2021-01-19 19:56:36 CET
Thank you for agreeing to close it.

> It was much faster the second time, and even more so the third time
Because it had done most of the work, and was just mopping up what was left.

Summary: freedups terminates with the following error message => freedups ends with error message "Can't use an undefined value as an ARRAY reference at /usr/bin/freedups line 487"