rsync with (some) steroids

I came up with a way of using rsync to compare two folders and list the files that actually differ in content, along with which folder holds the newest version of each. This means that if, given two folders A/ and B/, all the files with differences are newer in (say) folder B/, I can just copy them over to folder A/ instead of doing individual merges. This is super useful for merging files that exist on both desktop and laptop but are not under version control. #rsync #diff #merge #bash #script

Say you have a folder foobar in two locations: local/foobar and remote/foobar. We are going to cook up a bash script to: list the files that exist only on the local side, do the same for the remote side, list all files that exist on both sides but with different sizes (and hence, different content), and list all files that exist on both sides with the same size but differing contents. In the last two scenarios, the script will also indicate which file (the one on the local side or the one on the remote side) is the newest, i.e., has the latest modification timestamp. Note that this will be a purely informational script: no actual file transfers will take place.

Notational remark. As I mention in the tagline, I initially developed this script to aid me in synchronising unversioned files between my desktop and laptop computers; hence the designations above of local and remote versions of a given folder. However, as I also mention in the tagline, this script can also be applied to two different folders on the same machine. For this reason, although below I speak only of the “local side” and the “remote side,” these can also be taken to mean the “left folder” (i.e., the source folder) and the “right folder” (i.e., the destination folder).

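More concretely, the list of files will contain at the beginning of each line a “status indicator,” as follows:

L    the file exists only on the local side
R    the file exists only on the remote side
Sl   the file exists on both sides with different sizes, and the local version is newer
Sr   the file exists on both sides with different sizes, and the remote version is newer
Cl   the file exists on both sides with the same size but different content, and the local version is newer
Cr   the file exists on both sides with the same size but different content, and the remote version is newer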

In what follows, we will designate the local side by $LOCAL_DIR and the remote side by $REMOTE_DIR. While most of the shell commands to be listed hereinafter could be used at the command line, they will likely be more useful when used in a script.

Files only on local side. The list of files that exist only on the local side can be obtained as follows:

rsync -rn --ignore-existing --out-format='%n' "$LOCAL_DIR"/ "$REMOTE_DIR"/ \
| grep -v "skipping non-regular file" | grep -E -v '/$' | sed -e 's/^/L   /'

The -r option is a shortcut for --recursive, and instructs rsync to descend into directories. -n is a shortcut for --dry-run, which means just simulate, don’t actually transfer any files. --ignore-existing instructs rsync to skip files that already exist on the destination side, i.e., files that exist on both sides. --out-format='%n' makes rsync output just the name of each file. The first grep hides the messages caused by “special files” being found (e.g. sockets), and the second hides the listing of directories, so that only files are shown.

By default rsync considers two files equal if they have the same size and modification time, and different otherwise. The files deemed different are the ones marked for transfer from $LOCAL_DIR to $REMOTE_DIR. As the above command ignores files that exist on both sides, it lists the files that exist only on $LOCAL_DIR. The final sed command prefixes an L to each line of rsync’s output.

Files only on remote side. The above explanation also hints at how to get the list of files that exist only on $REMOTE_DIR: simply swap $LOCAL_DIR and $REMOTE_DIR (and modify the sed command to prefix an R). Here’s the result:

rsync -rn --ignore-existing --out-format='%n' "$REMOTE_DIR"/ "$LOCAL_DIR"/ \
| grep -v "skipping non-regular file" | grep -E -v '/$' | sed -e 's/^/R   /'
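To try the two listings on something small, here is a minimal sketch that builds two throwaway directories with mktemp and runs the same pipeline in both directions. The helper name list_only_on and the file names are made up for illustration; they are not part of the original script.

```bash
#!/usr/bin/env bash
# Sketch: list files that exist only on one side.
# list_only_on SRC DST PREFIX is a hypothetical helper wrapping the pipeline above.
list_only_on() {
  local src=$1 dst=$2 prefix=$3
  rsync -rn --ignore-existing --out-format='%n' "$src"/ "$dst"/ \
    | grep -v "skipping non-regular file" | grep -E -v '/$' \
    | sed -e "s/^/$prefix   /"
}

# Throwaway test directories (illustrative layout).
LOCAL_DIR=$(mktemp -d)
REMOTE_DIR=$(mktemp -d)
echo "shared" > "$LOCAL_DIR/both.txt"
echo "shared" > "$REMOTE_DIR/both.txt"
echo "a" > "$LOCAL_DIR/only_local.txt"
mkdir "$REMOTE_DIR/sub"
echo "b" > "$REMOTE_DIR/sub/only_remote.txt"

only_local=$(list_only_on "$LOCAL_DIR" "$REMOTE_DIR" L)
only_remote=$(list_only_on "$REMOTE_DIR" "$LOCAL_DIR" R)
printf '%s\n' "$only_local" "$only_remote"

rm -rf "$LOCAL_DIR" "$REMOTE_DIR"
```

With this layout, the output should contain one L line for only_local.txt and one R line for sub/only_remote.txt, while both.txt and the sub/ directory itself are not listed.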

Files on both sides, with different sizes. We begin by obtaining the list of files that exist on both sides, but are different. This will include files with different sizes, but also files with the same size, but different content. This is done as follows:

readarray -t common_file_list < <(rsync -rn --existing --out-format='%n' \
  "$LOCAL_DIR"/ "$REMOTE_DIR"/ | grep -v "skipping non-regular file")

Before explaining the rsync command proper, we must explain readarray: it reads lines from its standard input, which the < <(...) construct (process substitution) connects to the output of the command inside the parentheses, in this case rsync, and stores them in the array common_file_list, one line per element. The -t option strips the trailing newline from each line. This is done because the output of this command will be needed below. As for the rsync command, it is similar to the two previous ones, except that the --ignore-existing option is replaced with --existing. It does what you would expect: it checks only files that exist on both sides, i.e., it ignores files that exist only on the local, or only on the remote side. Thus we are left with a list of files that exist on both sides, and are considered different. Hence, they may be files with different sizes (and thus, necessarily different content), or they may be files with the same size but different content and/or different last modification timestamps.
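The reason for feeding readarray through process substitution rather than a plain pipe can be seen in a small sketch. Here the function fake_listing is a hypothetical stand-in for the rsync command, so the example runs without any directories to compare.

```bash
#!/usr/bin/env bash
# Stand-in for the rsync output: three "file names", one containing a space.
fake_listing() { printf '%s\n' 'a.txt' 'dir/b c.txt' 'd.txt'; }

# Process substitution: readarray runs in the current shell, so the array survives.
readarray -t file_list < <(fake_listing)
echo "elements: ${#file_list[@]}"
echo "second:   ${file_list[1]}"

# With a plain pipe, readarray runs in a subshell and the array is lost
# (unless the lastpipe shell option is enabled).
unset piped_list
fake_listing | readarray -t piped_list
echo "after pipe: ${#piped_list[@]}"
```

Since readarray splits on newlines, file names containing spaces are kept intact, but a file name that itself contains a newline would be split into two elements.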

We next modify the above rsync command to skip—i.e., consider equal—files that have the same size. This is done with the --size-only option:

readarray -t common_files_with_diff_sizes_list < <(rsync -rn --existing --out-format='%n' \
  --size-only "$LOCAL_DIR"/ "$REMOTE_DIR"/ | grep -v "skipping non-regular file")

As the name of the array—common_files_with_diff_sizes_list—might indicate, this yields a list of files that exist on both sides, but with different sizes. The only thing left is to find out for each of those files, which version is newer: the one on the local side, or the one on the remote side. This requires two things. First, instructing rsync to use only the files listed in the common_files_with_diff_sizes_list array—which is accomplished with the --files-from option. And second, the --update option, which makes rsync only transfer a file from the source location to the destination if, besides being different, the file in the source is newer (i.e. has a later modification time) than the file in the destination.

Here is the code to discover the files with different sizes on both sides, having the file on the local side newer than that on the remote side:

readarray -t common_files_with_diff_sizes_list_newest_on_local \
  < <(rsync -rn --update --out-format='%n' \
  --files-from=<( printf "%s\n" "${common_files_with_diff_sizes_list[@]}" ) \
  "$LOCAL_DIR"/ "$REMOTE_DIR"/ | grep -v "skipping non-regular file")

Reversing $LOCAL_DIR and $REMOTE_DIR gets us the opposite, viz., the files for which the version on the remote side is newer:

readarray -t common_files_with_diff_sizes_list_newest_on_remote \
  < <(rsync -rn --update --out-format='%n' \
  --files-from=<( printf "%s\n" "${common_files_with_diff_sizes_list[@]}" ) \
  "$REMOTE_DIR"/ "$LOCAL_DIR"/ | grep -v "skipping non-regular file")

Printing this information with the correct prefix is simple:

# Print only non-empty arrays: given an empty array, printf would still
# output the prefix once, followed by nothing.
if (( ${#common_files_with_diff_sizes_list_newest_on_local[@]} > 0 )); then
  printf "Sl  %s\n" "${common_files_with_diff_sizes_list_newest_on_local[@]}"
fi
if (( ${#common_files_with_diff_sizes_list_newest_on_remote[@]} > 0 )); then
  printf "Sr  %s\n" "${common_files_with_diff_sizes_list_newest_on_remote[@]}"
fi
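The effect of running --update in both directions can be checked on a pair of throwaway directories. The following is a minimal sketch, not the script itself: the file names, the helper newest_in, and the array diff_sizes (a stand-in for common_files_with_diff_sizes_list) are assumptions made for the example, and modification times are forced with touch -t.

```bash
#!/usr/bin/env bash
# Sketch: find out which side holds the newer copy by running
# rsync --update in both directions. All file names are made up.
LOCAL_DIR=$(mktemp -d)
REMOTE_DIR=$(mktemp -d)

# Two files present on both sides, with different sizes.
echo "short"            > "$LOCAL_DIR/newer_local.txt"
echo "a longer version" > "$REMOTE_DIR/newer_local.txt"
echo "short"            > "$REMOTE_DIR/newer_remote.txt"
echo "a longer version" > "$LOCAL_DIR/newer_remote.txt"

# Force the modification times (format: YYYYMMDDhhmm).
touch -t 202301020000 "$LOCAL_DIR/newer_local.txt"  "$REMOTE_DIR/newer_remote.txt"
touch -t 202301010000 "$REMOTE_DIR/newer_local.txt" "$LOCAL_DIR/newer_remote.txt"

# Stand-in for common_files_with_diff_sizes_list.
diff_sizes=( newer_local.txt newer_remote.txt )

# newest_in SRC DST: names from diff_sizes that rsync --update would
# transfer from SRC to DST (hypothetical helper).
newest_in() {
  rsync -n --update --out-format='%n' \
    --files-from=<( printf '%s\n' "${diff_sizes[@]}" ) "$1"/ "$2"/
}

readarray -t newest_on_local  < <(newest_in "$LOCAL_DIR" "$REMOTE_DIR")
readarray -t newest_on_remote < <(newest_in "$REMOTE_DIR" "$LOCAL_DIR")

(( ${#newest_on_local[@]}  > 0 )) && printf 'Sl  %s\n' "${newest_on_local[@]}"
(( ${#newest_on_remote[@]} > 0 )) && printf 'Sr  %s\n' "${newest_on_remote[@]}"

rm -rf "$LOCAL_DIR" "$REMOTE_DIR"
```

One caveat worth keeping in mind: --update only skips files that are newer on the destination. If both copies have exactly the same modification time but different sizes, rsync will want to transfer the file in either direction, so such a file would show up under both Sl and Sr.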

Files on both sides, with equal size but different content. To discover if the files that have the same size on both sides actually have the same contents or not, we have to compute the list of files that have the same size on both sides. We already have a list of files that exist on both sides (common_file_list), along with the (sub)list of files that have different sizes on each side—common_files_with_diff_sizes_list. So we rely on a for loop to remove from the former, the elements of the latter:

common_files_with_equal_size_list=()
for f in "${common_file_list[@]}"
do
  if printf '%s\0' "${common_files_with_diff_sizes_list[@]}" \
    | grep -Fxqz -- "$f"; then
    # If the current file has different sizes on both sides, skip it.
    continue;
  fi

  common_files_with_equal_size_list+=( "$f" )
done
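The loop above spawns one grep per file, which is fine for small trees. For larger lists, one possible alternative is a lookup table in an associative array. This is a sketch under two assumptions: bash 4 or later is available, and no file name is the empty string. The sample arrays and the name has_diff_size are made up for the example.

```bash
#!/usr/bin/env bash
# Sketch: remove from common_file_list every entry of
# common_files_with_diff_sizes_list, using an associative array (bash >= 4).
common_file_list=( 'a.txt' 'dir/b c.txt' 'd.txt' 'e.txt' )   # sample data
common_files_with_diff_sizes_list=( 'dir/b c.txt' 'e.txt' )  # sample data

# Lookup table: one key per file known to have different sizes.
declare -A has_diff_size=()
for f in "${common_files_with_diff_sizes_list[@]}"; do
  has_diff_size["$f"]=1
done

common_files_with_equal_size_list=()
for f in "${common_file_list[@]}"; do
  # Skip files that are present in the lookup table.
  [[ -n ${has_diff_size["$f"]:-} ]] && continue
  common_files_with_equal_size_list+=( "$f" )
done

printf '%s\n' "${common_files_with_equal_size_list[@]}"
```

With the sample data, only a.txt and d.txt remain in common_files_with_equal_size_list.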

The common_files_with_equal_size_list array contains the list of files for which rsync must actually check file contents—which requires the option --checksum:

readarray -t common_files_with_same_size_and_diff_content \
  < <(rsync -n --checksum --out-format='%n' \
  --files-from=<( printf "%s\n" "${common_files_with_equal_size_list[@]}" ) \
  "$LOCAL_DIR"/ "$REMOTE_DIR"/)
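To see what --checksum buys us, consider a minimal sketch with two same-size files on each side: one pair with identical bytes but different timestamps, and one pair with different bytes. The file names and the arrays equal_size and differing are illustrative assumptions, with equal_size standing in for common_files_with_equal_size_list.

```bash
#!/usr/bin/env bash
# Sketch: --checksum separates "same size, same bytes" from
# "same size, different bytes". All file names are made up.
LOCAL_DIR=$(mktemp -d)
REMOTE_DIR=$(mktemp -d)

# Identical bytes on both sides, but different modification times.
echo "hello" > "$LOCAL_DIR/same_content.txt"
echo "hello" > "$REMOTE_DIR/same_content.txt"
touch -t 202301010000 "$REMOTE_DIR/same_content.txt"

# Same size (6 bytes each), different bytes.
echo "hello" > "$LOCAL_DIR/diff_content.txt"
echo "world" > "$REMOTE_DIR/diff_content.txt"

# Stand-in for common_files_with_equal_size_list.
equal_size=( same_content.txt diff_content.txt )

readarray -t differing < <(rsync -n --checksum --out-format='%n' \
  --files-from=<( printf '%s\n' "${equal_size[@]}" ) \
  "$LOCAL_DIR"/ "$REMOTE_DIR"/)

(( ${#differing[@]} > 0 )) && printf '%s\n' "${differing[@]}"

rm -rf "$LOCAL_DIR" "$REMOTE_DIR"
```

Only diff_content.txt should be reported: same_content.txt differs in timestamp alone, and with --checksum rsync decides on content rather than on size and modification time.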

And to discover which version, local or remote, is newer, we again use the --update option. So to get the list of files that are newer on the local side, we do:

rsync -rn --update --out-format='%n' \
  --files-from=<( printf "%s\n" "${common_files_with_same_size_and_diff_content[@]}" ) \
  "$LOCAL_DIR"/ "$REMOTE_DIR"/ | sed -e 's/^/Cl  /'

And finally—as the reader is probably guessing—to get the files newer on the remote side, we reverse $LOCAL_DIR and $REMOTE_DIR:

rsync -rn --update --out-format='%n' \
  --files-from=<( printf "%s\n" "${common_files_with_same_size_and_diff_content[@]}" ) \
  "$REMOTE_DIR"/ "$LOCAL_DIR"/ | sed -e 's/^/Cr  /'

And we are done! The RsyncSteroids.tar.gz file provides a script with all the code, and two directories, A/ and B/, with a simple file with different contents. You can use it to experiment with the code, in particular by adding more files to the directories, with different sizes, modification times, etc. I hope it will be useful!

June 15, 2023. Got feedback? Great, email is your friend!