Help on writing a check script for duplicate file entries

Discussion in 'Linux' started by ikajaste, Jan 5, 2009.

  1. ikajaste

    ikajaste

    As there seem to be some potential problems with the storage expansion system (aufs), I want to write a script that checks whether the source paths of the union file system contain files with the same name. Details of the problem can be found in the thread "Details on the memory extension (aufs)".

    So the idea is to check whether the paths /home/user and /media/disk contain files with the same name. I'm not very good with the command line, so help is appreciated. Here's what I have so far:

    "find /home/user" and "find /media/disk" list all the files under each path, and "diff" could then compare the two listings. However, I have some unsolved problems with this approach:

    1) Output of find contains the exact path (for example "/media/disk/somedir/some.file" instead of just "somedir/some.file"), so the two paths will always look different.

    2) diff reports the lines that differ, not the lines that are the same. I only want output if the two listings have any lines in common. (I'm thinking diff might not be the correct tool for the comparison - any suggestions?)

    3) diff also compares the order of the lines, not just whether the same lines occur in both listings.

    4) How do I combine the output of the two finds and pass it to diff directly? (Currently I'm using two files as an intermediate step, but that's not good.)
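
    The four points above can be handled in one pipeline without diff. What follows is only a minimal sketch, assuming a POSIX shell and no newlines in file names; the function name common_paths is made up for illustration. Running find from inside each tree gives relative paths (point 1), and "sort | uniq -d" prints only the lines that occur in both listings (points 2 and 3), with no intermediate files (point 4).

```shell
# Sketch only: common_paths is a hypothetical helper name, not from the thread.
# Prints every relative file path that exists under both directory trees.
common_paths() {
    {
        # cd into each tree so find prints paths relative to it.
        (cd "$1" && find . -type f)
        (cd "$2" && find . -type f)
    } |
        sed 's|^\./||' |  # drop the leading "./" that find adds
        sort |            # bring identical paths next to each other
        uniq -d           # keep only paths that appear twice, i.e. in both trees
}
```

    It would be called as "common_paths /home/user /media/disk". In bash, "comm -12" with two sorted listings fed through process substitution should give the same result.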
     
    ikajaste, Jan 5, 2009
    #1
  2. daldred

    daldred

    You might be best off installing fdupes: it identifies duplicate files by size and md5sum, so it will also pick up any near-miss on file naming.
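
    To illustrate the checksum idea, here is a rough sketch that groups files by MD5 sum. It is an approximation of the principle, not how fdupes is implemented; the function name same_content is made up, and md5sum from coreutils is assumed to be installed.

```shell
# Sketch only: same_content is a hypothetical helper name.
# Prints the md5sum line of every file under the given directories
# whose checksum is shared with at least one other file.
same_content() {
    find "$@" -type f -exec md5sum {} + |
        sort |  # identical checksums end up on adjacent lines
        awk '
            # Same checksum as the previous line: print both, the first only once.
            $1 == prev { if (!shown) print prevline; print; shown = 1 }
            $1 != prev { shown = 0 }
            { prev = $1; prevline = $0 }
        '
}
```

    It would be called as "same_content /home/user /media/disk". With fdupes itself, the rough equivalent should be "fdupes -r /home/user /media/disk".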
     
    daldred, Jan 5, 2009
    #2
  3. ikajaste

    ikajaste

    The problem with fdupes is that I need to find files with the same name regardless of their content, since under the right conditions you might end up with a different file of the same name hidden by aufs. I didn't see an option to turn off fdupes' MD5 and byte-by-byte comparison.
     
    ikajaste, Jan 5, 2009
    #3
  4. ronime

    ronime

    Code:
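    cd /tmp
    # Collect the base name of every regular file in both trees.
    find /home/user -type f | sed 's+^.*/++' > files.list
    find /media/disk -type f | sed 's+^.*/++' >> files.list
    sort -o files.list files.list
    # Keep only the names that occur more than once.
    uniq -d files.list > dupes.list
    # Read line by line so that names containing spaces stay intact.
    while IFS= read -r f
    do
    # Quote the name so the shell does not split or expand it.
    find /home/user -type f -name "$f"
    find /media/disk -type f -name "$f"
    done < dupes.list
    rm files.list dupes.list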
    
     
    ronime, Jan 5, 2009
    #4