2008/9/29 Timothy Murphy <gayleard at eircom.net>:
> What is the best way of eliminating duplicate photos
> on a number of machines, all running Linux (Fedora or CentOS)?
> I suppose one could ask the same question about files generally;
> how to tag or delete duplicates.
Brute force?
On each machine:
find . -type f -exec md5sum {} \; | sed "s/\$/ $(hostname)/" > filelist.$$
will create an md5 checksum of each file examined, two spaces, the
filename, space, the hostname. For filenames including newline
characters, you're on your own.
"$$" is "hopefully unique enough for this small sample". Use something
distinct on each machine for safety.
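One way to get "something distinct" is to name the list after the host itself. A rough sketch only: make_filelist is a made-up helper name, it assumes GNU find and md5sum, and it skips any existing filelist.* so the output file does not end up checksumming itself.

```shell
# Hypothetical helper: checksum every regular file under "$1" (default: .)
# and write the result to filelist.<hostname> in the current directory.
make_filelist() {
    dir=${1:-.}
    host=$(hostname)
    # Skip earlier filelist.* output so the list does not include itself.
    find "$dir" -type f ! -name 'filelist.*' -exec md5sum {} + |
        sed "s/\$/ $host/" > "filelist.$host"
}
```

Run "make_filelist ." on each machine, then copy the resulting filelist.* files to one place.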
Gather those files together and print duplicates:
sort filelist.* | uniq -w 32 -D
which will print each line where any two lines have the same first 32
characters -- pick a different number if you prefer sha1sum or cksum
or sum instead of md5sum.
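As an illustration of picking a different number: sha1sum prints a 40-character hex digest, so the comparison width becomes 40. This is only a sketch; list_dupes is a made-up name, and -w and -D are GNU uniq options.

```shell
# Hypothetical helper: print every line whose first N characters also
# start some other line.
# $1 = number of leading characters to compare; remaining args = list files.
list_dupes() {
    width=$1
    shift
    sort "$@" | uniq -w "$width" -D
}
```

So "list_dupes 40 filelist.*" for lists built with sha1sum, or "list_dupes 32 filelist.*" for the md5sum lists above.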
From that list, pick which of the matching filename-hostname pairs you
want to get rid of. Check that they really are the same, and rm.
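For the "check, then rm" step, something like the following might do. It is a sketch under assumptions: rm_if_same is a made-up helper, and both copies have to be readable from one machine (copied over, or on a network mount), since cmp needs to read both.

```shell
# Hypothetical helper: delete "$2" only if it is byte-for-byte identical
# to "$1", which is kept.
rm_if_same() {
    keep=$1
    dupe=$2
    # Same path (or a hard link to the same inode): refuse rather than
    # risk deleting the only copy.
    if [ "$keep" -ef "$dupe" ]; then
        echo "skipping: $dupe is the same file as $keep" >&2
        return 1
    fi
    # cmp -s prints nothing and exits 0 only when the contents match exactly.
    if cmp -s -- "$keep" "$dupe"; then
        rm -- "$dupe"
    else
        echo "skipping: $dupe differs from $keep" >&2
        return 1
    fi
}
```

Usage would be "rm_if_same photo-to-keep.jpg photo-to-drop.jpg"; anything that is not an exact match is left alone.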
This refers to byte-identical files (within the limit of the
checksum). "duplicate photos" may not match that, if someone has
messed with metadata or anything else internal.
If you're worried about that, you could strip exif data before
checksumming and have a slightly better chance of catching more
repeats.
Good luck,
f