rsync with (some) steroids
I came up with a way of using rsync to compare folders and list the files which actually have different contents and, for each such file, which folder has the newest version. This means that if, given two folders A/ and B/, all the files with differences are newer in (say) folder B/, I can just copy them over to folder A/ instead of doing individual merges. This is super useful for merging files that exist on both desktop and laptop, but that are not under version control. #rsync #diff #merge #bash #script
Say you have a folder foobar in two locations: local/foobar and remote/foobar. We are going to cook up a bash script to: list the files that exist only on the local side, do the same for the remote side, list all files that exist on both sides, but with different sizes (and hence, different content), and list all files that exist on both sides, with the same size, but differing contents. In the last two scenarios, the script will also indicate which file (the one on the local side or the one on the remote side) is the newest, i.e., has the latest modification timestamp. Note that this will be a purely informational script: no actual file transfers will take place.
Notational remark. As I mention in the tagline, I initially developed this script to aid me in synchronising unversioned files between my desktop and laptop computers; hence the designations above of local and remote versions of a given folder. However, as I also mention in the tagline, this script can also be applied to two different folders on the same machine. For this reason, although below I speak only of the “local side” and the “remote side,” these can also be taken to mean the “left folder” (i.e., the source folder), and the “right folder” (i.e., the destination folder).1
More concretely, the list of files will contain at the beginning of each line a “status indicator,” as follows:
L: the file exists only on the local side.
R: the file exists only on the remote side.
Sl: the file exists on both sides, but with different sizes (and thus, different contents), and the most recent version is the one on the local side.
Sr: like the above, but the most recent version is the one on the remote side.
Cl: the file exists on both sides, with the same size, but different contents, and the most recent version is the one on the local side.
Cr: like the above, but the most recent version is the one on the remote side.
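Since every line starts with a fixed indicator followed by a space, the listing is easy to post-process. As a minimal sketch (the function name count_by_status is hypothetical and not part of the script described here), the following tallies how many files fall under each indicator:

```shell
# Hypothetical helper: reads a status listing on stdin and prints one
# "<indicator> <count>" line per indicator, always in the order below.
count_by_status() {
    local -A counts=()
    local status _path s
    # The first word is the indicator; the rest of the line (which may
    # contain spaces) is the file name and is not needed here.
    while read -r status _path; do
        [[ -n $status ]] || continue
        counts[$status]=$(( ${counts[$status]:-0} + 1 ))
    done
    for s in L R Sl Sr Cl Cr; do
        printf '%s %d\n' "$s" "${counts[$s]:-0}"
    done
}

# Possible usage, assuming the final script is saved as compare.sh:
#   ./compare.sh | count_by_status
```

It relies on associative arrays, so it assumes bash 4 or later.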
In what follows, we will designate the local side by $LOCAL_DIR and the remote side by $REMOTE_DIR. While most of the shell commands listed below could be used at the command line, they will likely be more useful when used in a script.2
Files only on local side. The list of files that exist only on the local side can be obtained as follows:
rsync -rn --ignore-existing --out-format='%n' "$LOCAL_DIR"/ "$REMOTE_DIR"/ \
| grep -v "skipping non-regular file" | grep -E -v '/$' | sed -e 's/^/L /'
The -r option is a shortcut for --recursive, and instructs rsync to descend into directories. -n is a shortcut for --dry-run, which means just simulate, don’t actually transfer any files. --ignore-existing instructs rsync to ignore files that exist on both sides. --out-format is used to make rsync output just the name of the file. The first grep hides the messages caused by “special files” being found (e.g. sockets), and the second hides the listing of directories, so that only files are shown.
By default rsync will consider two files equal if they have the same size and modification time, and different otherwise. Those files deemed different are the ones that are marked for transfer from $LOCAL_DIR to $REMOTE_DIR. As the above command ignores files that exist on both sides, it will list the files that exist only on $LOCAL_DIR. The final sed command prefixes an L to each line of output of the rsync command.
Files only on remote side. The above explanation also hints at how to get the list of files that exist only on $REMOTE_DIR: simply reverse $LOCAL_DIR and $REMOTE_DIR (and modify the sed command to prefix an R). Here’s the result:
rsync -rn --ignore-existing --out-format='%n' "$REMOTE_DIR"/ "$LOCAL_DIR"/ \
| grep -v "skipping non-regular file" | grep -E -v '/$' | sed -e 's/^/R /'
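Since the two commands differ only in the order of the directories and in the prefix, they can be folded into one function. The sketch below is one possible way to do it; the name list_only_on and its three positional parameters are my own invention for illustration, and it assumes the prefix contains no characters that are special to sed.

```shell
# Hypothetical wrapper: list regular files found under SRC but not under DST,
# each prefixed with TAG. This is a dry run: nothing is transferred.
list_only_on() {
    local src=$1 dst=$2 tag=$3
    rsync -rn --ignore-existing --out-format='%n' "$src"/ "$dst"/ \
        | grep -v "skipping non-regular file" \
        | grep -E -v '/$' \
        | sed -e "s/^/$tag /"
}

# Possible usage:
#   list_only_on "$LOCAL_DIR" "$REMOTE_DIR" L
#   list_only_on "$REMOTE_DIR" "$LOCAL_DIR" R
```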
Files on both sides, with different sizes. We begin by obtaining the list of files that exist on both sides, but are different. This will include files with different sizes, but also files with the same size, but different content. This is done as follows:
readarray -t common_file_list < <(rsync -rn --existing --out-format='%n' \
"$LOCAL_DIR"/ "$REMOTE_DIR"/ | grep -v "skipping non-regular file")
Before explaining the rsync command proper, we must explain readarray: it reads the output of the command inside the <( ... ) process substitution (in this case, rsync) and places it, one line per element, into the array common_file_list. This is done because the output of this command will be needed below. As for the rsync command, it is similar to the two previous ones, except that the --ignore-existing option is replaced with --existing. It does what you would expect: it checks only files that exist on both sides, i.e., it ignores files that exist only on the local, or only on the remote side. Thus we are left with a list of files that exist on both sides, and are considered different. Hence, they may be files with different sizes (and thus, necessarily different content), or they may be files with the same size but different content and/or different last modification timestamps.
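To see readarray in isolation, here is a small sketch in which a made-up function, fake_listing, stands in for the rsync command. The -t option strips the trailing newline from each line, and a name containing spaces still ends up as a single array element. Note that, by default, piping into readarray (as in cmd | readarray -t arr) would fill the array in a subshell and lose it, which is why the process substitution form is used.

```shell
# Stand-in for the rsync command: prints three "file names", one per line.
# The middle name contains a space to show that each line stays one element.
fake_listing() {
    printf '%s\n' 'notes.txt' 'my report.txt' 'src/main.c'
}

readarray -t demo_list < <(fake_listing)

printf 'count: %d\n' "${#demo_list[@]}"
printf 'second: %s\n' "${demo_list[1]}"
```

This prints "count: 3" followed by "second: my report.txt". File names that themselves contain newlines would be split across elements, so the approach assumes there are none.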
We next modify the above rsync command to skip (i.e., consider equal) files that have the same size. This is done with the --size-only option:
readarray -t common_files_with_diff_sizes_list < <(rsync -rn --existing --out-format='%n' \
--size-only "$LOCAL_DIR"/ "$REMOTE_DIR"/ | grep -v "skipping non-regular file")
As the name of the array, common_files_with_diff_sizes_list, might indicate, this yields a list of files that exist on both sides, but with different sizes. The only thing left is to find out, for each of those files, which version is newer: the one on the local side, or the one on the remote side. This requires two things. First, instructing rsync to use only the files listed in the common_files_with_diff_sizes_list array, which is accomplished with the --files-from option. And second, the --update option, which makes rsync only transfer a file from the source location to the destination if, besides being different, the file in the source is newer (i.e. has a later modification time) than the file in the destination. (Strictly speaking, --update only skips files that are newer on the destination, so a file with identical modification times but different sizes on the two sides is still transferred, and will show up as newer on both sides.)
Here is the code to discover the files with different sizes on both sides, having the file on the local side newer than that on the remote side:
readarray -t common_files_with_diff_sizes_list_newest_on_local \
< <(rsync -rn --update --out-format='%n' \
--files-from=<( printf "%s\n" "${common_files_with_diff_sizes_list[@]}" ) \
"$LOCAL_DIR"/ "$REMOTE_DIR"/ | grep -v "skipping non-regular file")
Reversing $LOCAL_DIR and $REMOTE_DIR gets us the opposite, viz., the files for which the version on the remote side is newer:
readarray -t common_files_with_diff_sizes_list_newest_on_remote \
< <(rsync -rn --update --out-format='%n' \
--files-from=<( printf "%s\n" "${common_files_with_diff_sizes_list[@]}" ) \
"$REMOTE_DIR"/ "$LOCAL_DIR"/ | grep -v "skipping non-regular file")
To print this information, alongside the correct prefix, is simple:3
# Print only when the array is non-empty: with no arguments, printf would
# still print its format once, yielding a bare "Sl " or "Sr " line.
if (( ${#common_files_with_diff_sizes_list_newest_on_local[@]} > 0 )); then
    printf "Sl %s\n" "${common_files_with_diff_sizes_list_newest_on_local[@]}"
fi
if (( ${#common_files_with_diff_sizes_list_newest_on_remote[@]} > 0 )); then
    printf "Sr %s\n" "${common_files_with_diff_sizes_list_newest_on_remote[@]}"
fi
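The guards around printf matter because printf prints its format string once even when it is given no further arguments, so an empty array would otherwise produce a bare prefix line. As a sketch of how the "which side is newer" query and the guarded printing could be packaged, here are two helpers. Both names, newer_on and print_with_prefix, are hypothetical and not part of the script described here.

```shell
# Hypothetical helper: from the relative paths given after SRC and DST, print
# those that rsync would copy from SRC to DST under --update (dry run only).
newer_on() {
    local src=$1 dst=$2
    shift 2
    (( $# > 0 )) || return 0    # empty list: do not call rsync at all
    rsync -rn --update --out-format='%n' \
        --files-from=<( printf '%s\n' "$@" ) \
        "$src"/ "$dst"/ | grep -v "skipping non-regular file"
}

# Hypothetical helper: print every argument after the first on its own line,
# prefixed by the first argument. With no names, the loop prints nothing.
print_with_prefix() {
    local prefix=$1 f
    shift
    for f in "$@"; do
        printf '%s %s\n' "$prefix" "$f"
    done
}

# Possible usage with the array built earlier:
#   readarray -t newest_on_local < <(newer_on "$LOCAL_DIR" "$REMOTE_DIR" \
#       "${common_files_with_diff_sizes_list[@]}")
#   print_with_prefix Sl "${newest_on_local[@]}"
```

Using a for loop instead of a single printf call trades a little speed for not needing an explicit emptiness check.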
Files on both sides, with equal size but different content. To discover whether the files that have the same size on both sides actually have the same contents, we first have to compute the list of files that have the same size on both sides. We already have a list of files that exist on both sides and are considered different (common_file_list), along with the (sub)list of files that have different sizes on each side (common_files_with_diff_sizes_list). So we rely on a for loop to remove from the former the elements of the latter:
common_files_with_equal_size_list=()
for f in "${common_file_list[@]}"
do
    if printf '%s\0' "${common_files_with_diff_sizes_list[@]}" \
        | grep -Fxqz -- "$f"; then
        # If the current file has different sizes on both sides, skip it.
        continue
    fi
    common_files_with_equal_size_list+=( "$f" )
done
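The loop above starts one grep per file, which is fine for a handful of files. For longer lists, a possible alternative (a different technique from the one used in the script, shown only as a sketch) is to index the names to be removed as keys of an associative array and test membership directly. The array names and sample values below are made up for illustration, and the approach assumes bash 4 or later and no empty file names.

```shell
# Sample data standing in for the two arrays discussed above.
all_common=( 'a.txt' 'dir/b c.txt' 'd.txt' )
diff_sizes=( 'dir/b c.txt' )

# Index the "different sizes" names once, as keys of an associative array.
declare -A has_diff_size=()
for f in "${diff_sizes[@]}"; do
    has_diff_size[$f]=1
done

# Keep only the names that are not keys of that index.
equal_size=()
for f in "${all_common[@]}"; do
    [[ -n ${has_diff_size[$f]:-} ]] || equal_size+=( "$f" )
done

printf '%s\n' "${equal_size[@]}"
```

With the sample data, this prints a.txt and d.txt, one per line.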
The common_files_with_equal_size_list array contains the list of files for which rsync must actually check file contents, which requires the option --checksum:
readarray -t common_files_with_same_size_and_diff_content \
< <(rsync -n --checksum --out-format='%n' \
--files-from=<( printf "%s\n" "${common_files_with_equal_size_list[@]}" ) \
"$LOCAL_DIR"/ "$REMOTE_DIR"/)
And to discover which version, local or remote, is newer, we again use the --update option. So to get the list of files that are newer on the local side, we do:
rsync -rn --update --out-format='%n' \
--files-from=<( printf "%s\n" "${common_files_with_same_size_and_diff_content[@]}" ) \
"$LOCAL_DIR"/ "$REMOTE_DIR"/ | sed -e 's/^/Cl /'
And finally, as the reader is probably guessing, to get the files newer on the remote side, we reverse $LOCAL_DIR and $REMOTE_DIR:
rsync -rn --update --out-format='%n' \
--files-from=<( printf "%s\n" "${common_files_with_same_size_and_diff_content[@]}" ) \
"$REMOTE_DIR"/ "$LOCAL_DIR"/ | sed -e 's/^/Cr /'
And we are done! The RsyncSteroids.tar.gz file provides a script with all the code, and two directories, A/ and B/, with a simple file with different contents. You can use it to experiment with the code, in particular by adding more files to the directories, with different sizes, modification times, etc. I hope it will be useful!
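If you would rather start from scratch than from the archive, a throwaway pair of directories covering each scenario can be built along these lines. This is only a sketch: the function name make_fixture, the file names, and the timestamps are all made up, and it does not reproduce the contents of the archive.

```shell
# Hypothetical helper: create a temporary directory holding A/ and B/, with
# one file per scenario, and print the path of that temporary directory.
make_fixture() {
    local root
    root=$(mktemp -d) || return 1
    mkdir -p "$root/A" "$root/B"
    printf 'only here\n'   > "$root/A/only_in_A.txt"
    printf 'only there\n'  > "$root/B/only_in_B.txt"
    printf 'short\n'       > "$root/A/size.txt"      # sizes differ
    printf 'much longer\n' > "$root/B/size.txt"
    printf 'abc\n'         > "$root/A/content.txt"   # same size, other bytes
    printf 'abd\n'         > "$root/B/content.txt"
    # touch -t takes [[CC]YY]MMDDhhmm; make the copies in B/ the newer ones.
    touch -t 202301010000 "$root/A/size.txt" "$root/A/content.txt"
    touch -t 202306150000 "$root/B/size.txt" "$root/B/content.txt"
    printf '%s\n' "$root"
}

# Possible usage:
#   root=$(make_fixture)
#   LOCAL_DIR=$root/A REMOTE_DIR=$root/B
```

Swapping the two timestamps is a quick way to check that the indicators flip from the remote side to the local side.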
June 15, 2023.