
Ported to Ubuntu (Linux Mint Cinnamon) #5

@roddyongithub

Hello Tomer,
I have started porting the find-dupes script to Ubuntu. I completely changed the parsing and made some minor changes. It's working, yay! But I'm a complete newbie to awk.
I would love to add some functionality, such as sorting by size and printing the size in the report, but sorting via asort and asorti seems to mess up the arrays. Arrays are my biggest problem in general; for example, I have no clue what this line is doing:
file_size[$5, ++file_size[$5, "length"]] = dir "/" fname
Can you please help? I'd love to clean up my 9TB of data ;-)
And what is the best way to publish my Ubuntu port?
Thanks a lot.
Yes, it's still a mess ...

# Call like this, from within the folder to scan:
# shopt -s globstar && ls -aldp ./**/* | grep -v '/$' \
#  | awk '{print $1,"\t",$2,"\t",$3,"\t",$4,"\t",$5,"\t",$6,"\t",$7,"\t",$8,"\t",$9,$10,$11,$12,$13,$14,$15;}' \
#  | awk -F $'\t' -f [path]/find-dupes.awk

BEGIN {
    OFS = "\t"
    md5_exec = "md5sum"
}

## Parse tab separated input
NF {
    gsub(/^[ ]+|[ ]+$/, "", $9)                              # remove leading/trailing spaces
    n = split($9, a, "/"); fname = a[n];                     # get name of file
    dir = substr($9, 1, length($9)-length(fname)-1)          # get name of folder
    file_size[$5, ++file_size[$5, "length"]] = dir "/" fname # array file & size
    if(file_size[$5, "length"] > 1 && $5 > 35)               # when duplicate found
        sizes[$5]                                            # create size in sizes
}

END {
    ## Find the files that have identical sizes, and then get their MD5 hash:
    for(size in sizes)
        for(i=1; i<=file_size[size, "length"]; i++) {
            file = file_size[size, i]
            cmd = md5_exec " '" file "'"           # breaks on names containing '
            if ((cmd | getline line) > 0) {
                split(line, a, " "); hash = a[1]   # md5sum prints "<hash>  <file>"
                print hash " -" size "bytes: " file
                file_hash[hash, ++file_hash[hash, "length"]] = file
                if (file_hash[hash, "length"] > 1)
                    hashes[hash]
            }
            close(cmd)                             # release the pipe before the next file
        }

    ## Report files that have identical MD5 hashes:                  
    print "\n#### Duplicates ###"
    for(hash in hashes) {
        print "MD5 " hash ":"
        for(i=1; i<=file_hash[hash, "length"]; i++)
            print OFS file_hash[hash, i]
    }
}
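
For reference, the two points raised above (what the `file_size[$5, ++file_size[$5, "length"]]` line does, and how to sort by size without disturbing the arrays) can be tried in isolation. This is only a minimal sketch, not part of find-dupes: the `report_by_size` function, the `order` array and the simplified "size name" input format (no spaces in names) are assumptions made for illustration. It uses a hand-written insertion sort on a separate helper array instead of asort/asorti, so it should also run on awk implementations that lack those functions.

```shell
# Hypothetical helper (not part of the original script): reads "size name"
# lines and prints every group of same-sized files, largest size first.
report_by_size() {
    awk '
    {
        size = $1; name = $2
        # file_size[size, "length"] is a per-size counter. The ++ bumps it
        # before it is used as an index, so names land in the slots
        # (size,1), (size,2), ... in the order they were read.
        file_size[size, ++file_size[size, "length"]] = name
        if (file_size[size, "length"] == 2)   # second file of this size
            order[++n] = size                 # remember each repeated size once
    }
    END {
        # Insertion sort on the small helper array only, descending by size.
        # The + 0 forces a numeric comparison; file_size is never reordered.
        for (i = 2; i <= n; i++) {
            v = order[i]
            for (j = i - 1; j >= 1 && order[j] + 0 < v + 0; j--)
                order[j + 1] = order[j]
            order[j + 1] = v
        }
        for (i = 1; i <= n; i++) {
            size = order[i]
            for (k = 1; k <= file_size[size, "length"]; k++)
                print size " bytes: " file_size[size, k]
        }
    }'
}
```

Calling it as `printf '100 a.txt\n200 b.txt\n100 c.txt\n' | report_by_size` lists only the two 100-byte files, because 200 occurs once. The same pattern (collect the keys in a plain numbered array, sort that, then look the data up by key) could be applied to `sizes` or `hashes` in the END block.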
