diff --git a/ChangeLog b/ChangeLog
index eacdbd3..00b9a26 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,8 @@
+2021-04-21 Eoghan Murray
+  * change terminology to prefer 'exclude' over 'remove' when talking
+    about the file list, in case there's confusion between 'remove' and
+    'delete'
+  * as per above, rename option -removeidentinode to -excludeidentinode
 2018-11-12 Paul Dreik
   * release of 1.4.1
   * fixes build failure on 32 bit platforms
diff --git a/README.md b/README.md
index 646c203..5af1e96 100644
--- a/README.md
+++ b/README.md
@@ -43,13 +43,13 @@ Look for duplicate files in directory /home/pauls/bilder:
     $ rdfind /home/pauls/bilder/
     Now scanning "/home/pauls/bilder", found 3301 files.
     Now have 3301 files in total.
-    Removed 0 files due to nonunique device and inode.
-    Now removing files with zero size...removed 3 files
+    Excluded 0 files due to nonunique device and inode.
+    Now excluding files with zero size...excluded 3 files
     Total size is 2861229059 bytes or 3 Gib
-    Now sorting on size:removed 3176 files due to unique sizes.122 files left.
-    Now eliminating candidates based on first bytes:removed 8 files.114 files left.
-    Now eliminating candidates based on last bytes:removed 12 files.102 files left.
-    Now eliminating candidates based on md5 checksum:removed 2 files.100 files left.
+    Now sorting on size:excluded 3176 files due to unique sizes.122 files left.
+    Now eliminating candidates based on first bytes:excluded 8 files.114 files left.
+    Now eliminating candidates based on last bytes:excluded 12 files.102 files left.
+    Now eliminating candidates based on md5 checksum:excluded 2 files.100 files left.
     It seems like you have 100 files that are not unique
     Totally, 24 Mib can be reduced.
     Now making results file results.txt
@@ -77,15 +77,15 @@ Rdfind uses the following algorithm. If N is the number of files to search throu
 2. For each argument, list the directory contents recursively and assign it to the file list. Assign a directory depth number, starting at 0 for every argument.
 3. If the input argument is a file, add it to the file list.
 4. Loop over the list, and find out the sizes of all files.
-5. If flag -removeidentinode true: Remove items from the list which already are added, based on the combination of inode and device number. A group of files that are hardlinked to the same file are collapsed to one entry. Also see the comment on hardlinks under ”caveats below”!
-6. Sort files on size. Remove files from the list, which have unique sizes.
+5. If flag -excludeidentinode true: Exclude items already added, based on the combination of inode and device number. A group of files that are hardlinked to the same file are collapsed to one entry. Also see the comment on hardlinks under ”caveats below”!
+6. Sort files on size. Exclude files which have unique sizes.
 7. Sort on device and inode(speeds up file reading). Read a few bytes from the beginning of each file (first bytes).
-8. Remove files from list that have the same size but different first bytes.
+8. Exclude files that have the same size but different first bytes.
 9. Sort on device and inode(speeds up file reading). Read a few bytes from the end of each file (last bytes).
-10. Remove files from list that have the same size but different last bytes.
+10. Exclude files that have the same size but different last bytes.
 11. Sort on device and inode(speeds up file reading). Perform a checksum calculation for each file.
-12. Only keep files on the list with the same size and checksum. These are duplicates.
-13. Sort list on size, priority number, and depth. The first file for every set of duplicates is considered to be the original.
+12. Exclude files with a unique combination of size and checksum. The rest are duplicates.
+13. Sort remaining duplicates on size, priority number, and depth. The first file for every set of duplicates is considered to be the original.
 14. If flag ”-makeresultsfile true”, then print results file (default).
 15. If flag ”-deleteduplicates true”, then delete (unlink) duplicate files. Exit.
 16. If flag ”-makesymlinks true”, then replace duplicates with a symbolic link to the original. Exit.
@@ -153,7 +153,7 @@ Here is a small benchmark. Times are obtained from ”elapsed time” in the tim
 
 ### Caveats / Features
 
-A group of hardlinked files to a single inode are collapsed to a single entry if `-removeidentinode true`. If you have two equal files (inodes) and two or more hardlinks for one or more of the files, the behaviour might not be what you think. Each group is collapsed to a single entry. That single entry will be hardlinked/symlinked/deleted depending on the options you pass to `rdfind`. This means that rdfind will detect and correct one file at a time. Running multiple times solves the situation. This has been discovered by a user who uses a ”hardlinks and rsync”-type of backup system. There are lots of such backup scripts around using that technique, Apple time machine also uses hardlinks. If a file is moved within the backuped tree, one gets a group of hardlinked files before the move and after the move. Running rdfind on the entire tree has to be done multiple times if -removeidentinode true. To understand the behaviour, here is an example demonstrating the behaviour:
+A group of hardlinked files to a single inode are collapsed to a single entry if `-excludeidentinode true`. If you have two equal files (inodes) and two or more hardlinks for one or more of the files, the behaviour might not be what you think. Each group is collapsed to a single entry. That single entry will be hardlinked/symlinked/deleted depending on the options you pass to `rdfind`. This means that rdfind will detect and correct one file at a time. Running multiple times solves the situation. This has been discovered by a user who uses a ”hardlinks and rsync”-type of backup system. There are lots of such backup scripts around using that technique, Apple time machine also uses hardlinks. If a file is moved within the backuped tree, one gets a group of hardlinked files before the move and after the move. Running rdfind on the entire tree has to be done multiple times if -excludeidentinode true. To understand the behaviour, here is an example demonstrating the behaviour:
 
     $ echo abc>a
     $ ln a a1
@@ -171,7 +171,7 @@ A group of hardlinked files to a single inode are collapsed to a single entry if
 
 Everything is as expected.
 
-    $ rdfind -removeidentinode true -makehardlinks true ./a* ./b*
+    $ rdfind -excludeidentinode true -makehardlinks true ./a* ./b*
     $ stat --format="name=%n inode=%i nhardlinks=%h" a* b*
     name=a inode=58930 nhardlinks=4
     name=a1 inode=58930 nhardlinks=4
diff --git a/Rdutil.cc b/Rdutil.cc
index f098d2c..d1915ea 100644
--- a/Rdutil.cc
+++ b/Rdutil.cc
@@ -298,7 +298,7 @@ Rdutil::sort_on_depth_and_name(std::size_t index_of_first)
 }
 
 std::size_t
-Rdutil::removeIdenticalInodes()
+Rdutil::excludeIdenticalInodes()
 {
   // sort list on device and inode.
   auto cmp = cmpDeviceInode;
@@ -319,7 +319,7 @@ Rdutil::removeIdenticalInodes()
 }
 
 std::size_t
-Rdutil::removeUniqueSizes()
+Rdutil::excludeUniqueSizes()
 {
   // sort list on size
   auto cmp = cmpSize;
@@ -341,7 +341,7 @@ Rdutil::removeUniqueSizes()
 }
 
 std::size_t
-Rdutil::removeUniqSizeAndBuffer()
+Rdutil::excludeUniqSizeAndBuffer()
 {
   // sort list on size
   const auto cmp = cmpSize;
@@ -432,7 +432,7 @@ Rdutil::cleanup()
 }
 #if 0
 std::size_t
-Rdutil::remove_small_files(Fileinfo::filesizetype minsize)
+Rdutil::exclude_small_files(Fileinfo::filesizetype minsize)
 {
   const auto size_before = m_list.size();
   const auto begin = m_list.begin();
diff --git a/Rdutil.hh b/Rdutil.hh
index b39e2e9..f7309ef 100644
--- a/Rdutil.hh
+++ b/Rdutil.hh
@@ -44,21 +44,21 @@ public:
   /**
    * for each group of identical inodes, only keep the one with the highest
    * rank.
-   * @return number of elements removed
+   * @return number of elements excluded
    */
-  std::size_t removeIdenticalInodes();
+  std::size_t excludeIdenticalInodes();
 
   /**
-   * remove files with unique size from the list.
+   * exclude files with unique size from the list.
    * @return
    */
-  std::size_t removeUniqueSizes();
+  std::size_t excludeUniqueSizes();
 
   /**
-   * remove files with unique combination of size and buffer from the list.
+   * exclude files with unique combination of size and buffer from the list.
   * @return
   */
-  std::size_t removeUniqSizeAndBuffer();
+  std::size_t excludeUniqSizeAndBuffer();
 
   /**
    * Assumes the list is already sorted on size, and all elements with the same
@@ -70,14 +70,14 @@ public:
   */
  void markduplicates();
 
-  /// removes all items from the list, that have the deleteflag set to true.
+  /// excludes all items from the list that have the deleteflag set to true.
  std::size_t cleanup();
 
  /**
-   * Removes items with file size less than minsize
-   * @return the number of removed elements.
+   * Excludes items with file size less than minsize
+   * @return the number of excluded elements.
   */
-  std::size_t remove_small_files(Fileinfo::filesizetype minsize);
+  std::size_t exclude_small_files(Fileinfo::filesizetype minsize);
 
  // read some bytes. note! destroys the order of the list.
  // if lasttype is supplied, it does not reread files if they are shorter
diff --git a/rdfind.1 b/rdfind.1
index d390370..f338525 100644
--- a/rdfind.1
+++ b/rdfind.1
@@ -72,8 +72,8 @@ is disabled.
 .BR \-followsymlinks " " \fItrue\fR|\fIfalse\fR
 Follow symlinks. Default is false.
 .TP
-.BR \-removeidentinode " " \fItrue\fR|\fIfalse\fR
-Removes items found which have identical inode and device ID. Default
+.BR \-excludeidentinode " " \fItrue\fR|\fIfalse\fR
+Excludes items found which have identical inode and device ID. Default
 is true.
 .TP
 .BR \-checksum " " \fImd5\fR|\fIsha1\fR|\fIsha256\fR
diff --git a/rdfind.cc b/rdfind.cc
index facdda7..b7fee25 100644
--- a/rdfind.cc
+++ b/rdfind.cc
@@ -58,7 +58,7 @@ usage()
     << " -maxsize N (N=0) ignores files with size N "
        "bytes and larger (use 0 to disable this check).\n"
     << " -followsymlinks true |(false) follow symlinks\n"
-    << " -removeidentinode (true)| false ignore files with nonunique "
+    << " -excludeidentinode (true)| false ignore files with nonunique "
       "device and inode\n"
    << " -checksum md5 |(sha1)| sha256\n"
    << " checksum type\n"
@@ -101,7 +101,7 @@ struct Options
   bool deleteduplicates = false; // delete duplicate files
   bool followsymlinks = false; // follow symlinks
   bool dryrun = false; // only dryrun, dont destroy anything
-  bool remove_identical_inode = true; // remove files with identical inodes
+  bool exclude_identical_inode = true; // exclude files with identical inodes
   bool usemd5 = false; // use md5 checksum to check for similarity
   bool usesha1 = false; // use sha1 checksum to check for similarity
   bool usesha256 = false; // use sha256 checksum to check for similarity
@@ -163,7 +163,10 @@ parseOptions(Parser& parser)
   } else if (parser.try_parse_bool("-n")) {
     o.dryrun = parser.get_parsed_bool();
   } else if (parser.try_parse_bool("-removeidentinode")) {
-    o.remove_identical_inode = parser.get_parsed_bool();
+    // backwards compatibility
+    o.exclude_identical_inode = parser.get_parsed_bool();
+  } else if (parser.try_parse_bool("-excludeidentinode")) {
+    o.exclude_identical_inode = parser.get_parsed_bool();
   } else if (parser.try_parse_bool("-deterministic")) {
     o.deterministic = parser.get_parsed_bool();
   } else if (parser.try_parse_string("-checksum")) {
@@ -334,9 +337,9 @@ main(int narg, const char* argv[])
   // list.
   gswd.markitems();
 
-  if (o.remove_identical_inode) {
-    // remove files with identical devices and inodes from the list
-    std::cout << dryruntext << "Removed " << gswd.removeIdenticalInodes()
+  if (o.exclude_identical_inode) {
+    // exclude files with identical devices and inodes from the list
+    std::cout << dryruntext << "Excluded " << gswd.excludeIdenticalInodes()
               << " files due to nonunique device and inode." << std::endl;
   }
 
@@ -344,7 +347,7 @@ main(int narg, const char* argv[])
             << " bytes or ";
   gswd.totalsize(std::cout) << std::endl;
 
-  std::cout << "Removed " << gswd.removeUniqueSizes()
+  std::cout << "Excluded " << gswd.excludeUniqueSizes()
             << " files due to unique sizes from list. ";
   std::cout << filelist.size() << " files left." << std::endl;
 
@@ -375,8 +378,8 @@ main(int narg, const char* argv[])
     // read bytes (destroys the sorting, for disk reading efficiency)
     gswd.fillwithbytes(it[0].first, it[-1].first, o.nsecsleep);
 
-    // remove non-duplicates
-    std::cout << "removed " << gswd.removeUniqSizeAndBuffer()
+    // exclude non-duplicates
+    std::cout << "excluded " << gswd.excludeUniqSizeAndBuffer()
               << " files from list. ";
     std::cout << filelist.size() << " files left." << std::endl;
   }
diff --git a/testcases/checksum_speedtest.sh b/testcases/checksum_speedtest.sh
index 205ee28..44e535c 100755
--- a/testcases/checksum_speedtest.sh
+++ b/testcases/checksum_speedtest.sh
@@ -23,7 +23,7 @@ fi
 
 for checksumtype in md5 sha1 sha256; do
   dbgecho "trying checksum $checksumtype"
-  time $rdfind -removeidentinode false -checksum $checksumtype speedtest/largefile1 speedtest/largefile2 > rdfind.out
+  time $rdfind -excludeidentinode false -checksum $checksumtype speedtest/largefile1 speedtest/largefile2 > rdfind.out
 done
 
 dbgecho "all is good in this test!"
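
For illustration only, not part of the patch: a minimal, self-contained sketch of the backwards-compatibility handling shown in the parseOptions() hunk above, where the old -removeidentinode spelling and the new -excludeidentinode spelling both set the same option. The Options struct and the hand-rolled flag loop below are simplified stand-ins, not rdfind's actual Parser interface.

    #include <cstring>
    #include <iostream>

    // Simplified stand-in for rdfind's Options struct (only the field touched here).
    struct Options
    {
      bool exclude_identical_inode = true;
    };

    // Hypothetical helper, not rdfind's Parser API: interpret a "true"/"false" argument.
    bool
    parseBool(const char* value)
    {
      return std::strcmp(value, "true") == 0;
    }

    int
    main(int argc, const char* argv[])
    {
      Options o;
      // Walk "-flag value" pairs; both spellings feed the same member, so existing
      // invocations that pass -removeidentinode keep working after the rename.
      for (int i = 1; i + 1 < argc; i += 2) {
        if (std::strcmp(argv[i], "-excludeidentinode") == 0 ||
            std::strcmp(argv[i], "-removeidentinode") == 0) { // old name, backwards compatibility
          o.exclude_identical_inode = parseBool(argv[i + 1]);
        }
      }
      std::cout << std::boolalpha
                << "exclude_identical_inode=" << o.exclude_identical_inode << '\n';
      return 0;
    }

Accepting the old spelling as a silent alias means scripts that still pass -removeidentinode continue to work, while the documentation and test cases switch to the new name.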