Creating differential backups with hard links and rsync

You can use a hard link in Linux to create two file names that both point to the same physical location on a hard disk. For instance, if I type:

> echo xxxx > a
> cp -l a b
> cat a
xxxx
> cat b
xxxx

I create a file named “a” that contains the string “xxxx”. Then I create a hard link “b” that points to the same spot on the disk. Now whatever I write to file “a” also appears in file “b”, and vice versa:

> echo yyyy > b
> cat b
yyyy
> cat a
yyyy
> echo zzzz > a
> cat a
zzzz
> cat b
zzzz
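
One way to confirm that two names really are the same file is to compare their inode numbers and the link count. The following is a minimal sketch, assuming GNU coreutils stat (its -c format option) and a scratch directory made with mktemp; the variable names are only illustrative.

```shell
# Work in a throwaway directory so nothing real is touched.
work=$(mktemp -d)
cd "$work" || exit 1

echo xxxx > a
ln a b                      # same effect as "cp -l a b"

# %i prints the inode number, %h the number of hard links to the file.
inode_a=$(stat -c %i a)
inode_b=$(stat -c %i b)
links=$(stat -c %h a)

echo "inode of a: $inode_a"
echo "inode of b: $inode_b"
echo "link count: $links"
```

Both names should report the same inode number, and the link count should be 2. "ls -li" shows the same information in its first and third columns.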

Copying to a hard link updates the data on the disk that each hard link points to:

> rm -f a b c
> echo xxxx > a
> echo yyyy > c
> cp -l a b
> cat a b c
xxxx
xxxx
yyyy

“a” and “b” point to the same file on disk; “c” is a separate file. If I copy file “c” to “b”, that also updates “a”:

> cp c b 
> cat a b c
yyyy
yyyy
yyyy
> echo zzzz > c
> cat a b c
yyyy
yyyy
zzzz 

What most people don’t know is that rsync behaves differently. When rsync updates a target file that is a hard link, it creates a new target file instead of writing through the link, but only if it decides the file has changed (by default, if the size or modification time differs):

> rm a
> rm b
> echo xxxx > a
> cp -l a b
> cat a
xxxx
> cat b
xxxx
> echo yyyy > c
> cat c
yyyy
> rsync -av c b
sending incremental file list
c
sent 87 bytes  received 31 bytes  236.00 bytes/sec
total size is 5  speedup is 0.04
> cat b
yyyy
> cat c
yyyy
> cat a
xxxx

File “b” is no longer a hard link of “a”; it’s a new file. If I update “a”, it no longer updates “b”:

> echo zzzz > a
> cat a b c
zzzz
yyyy
yyyy
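
The link breaks because rsync, by default, writes the incoming data to a temporary file and then renames that file over the target, rather than overwriting the target in place. The sketch below imitates that pattern with plain cp and mv, so the effect can be seen without rsync. The temporary name .b.tmp is made up for the example and is not the name rsync itself uses.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

echo xxxx > a
ln a b                      # "a" and "b" now share one inode
echo yyyy > c

# Write the new data to a temporary file, then rename it over "b".
# The rename replaces the directory entry "b"; the data that "a"
# points to is never touched.
cp c .b.tmp
mv -f .b.tmp b

content_a=$(cat a)
content_b=$(cat b)
links_a=$(stat -c %h a)     # GNU stat assumed; %h is the hard link count
echo "a=$content_a b=$content_b links(a)=$links_a"
```

After the rename, “a” still holds the old contents and its link count drops back to 1. rsync’s --inplace option turns this behaviour off and writes into the existing file, which would defeat the backup scheme described in this article.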

However, if the file that I’m rsync-ing matches “b”, then rsync does NOT break the hard link; it leaves the file alone:

> rm a
> rm b
> rm c
> echo xxxx > a
> cp -al a b
> cp -p a c
> cat a b c
xxxx
xxxx
xxxx

At this point “a” and “b” both point to the same file on the disk, which contains the string “xxxx”. “c” is a separate file that also contains the string “xxxx” and has the same permissions and timestamp as “a”.

> rsync -av c b
sending incremental file list
sent 39 bytes  received 12 bytes  102.00 bytes/sec
total size is 5  speedup is 0.10
> cat a b c
xxxx
xxxx
xxxx

At this point I’ve rsynced file “c” to “b”, but since “c” has the same size and timestamp as “b”, rsync considers the file up to date and does nothing at all. It doesn’t break the hard link. If I change “b” it still updates “a”:

> echo yyyy > b
> cat a b c
yyyy
yyyy
xxxx

This is how many modern file system backup programs work. On day 1 you make an rsync copy of your entire file system:

backup@backup_server> DAY1=`date +%Y%m%d%H%M%S`
backup@backup_server> rsync -av -e ssh earl@192.168.1.20:/home/earl/ /var/backups/$DAY1/

On day 2 you make a hard link copy of the backup, then a fresh rsync:

backup@backup_server> DAY2=`date +%Y%m%d%H%M%S`
backup@backup_server> cp -al /var/backups/$DAY1 /var/backups/$DAY2
backup@backup_server> rsync -av -e ssh --delete earl@192.168.1.20:/home/earl/ /var/backups/$DAY2/

“cp -al” makes a hard link copy of the previous day’s entire backup tree, then rsync runs against that copy. If a file is unchanged, rsync does nothing and the file remains a hard link. However, if the file changed, rsync creates a new copy of the file in the target directory and leaves the previous day’s version untouched. If a file was deleted from /home/earl, rsync deletes the hard link from that day’s copy; earlier snapshots still hold the file.

In this way, the $DAY1 directory has a snapshot of the /home/earl tree as it existed on day 1, and the $DAY2 directory has a snapshot of the /home/earl tree as it existed on day 2, but only the files that changed take up additional disk space. If you need to find a file as it existed at some point in time, you can look at that day’s tree. If you need to restore yesterday’s backup, you can rsync the tree from yesterday. You don’t have to store a full copy of the data for each day; you only use additional disk space for files that changed or were added.
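
To check that a hard-link snapshot really costs almost no extra space, you can ask du about both trees in a single invocation; it counts each hard-linked file only once and charges it to the first tree listed. This is a small local sketch with made-up directory and file names (day1, day2, big.bin), assuming GNU cp for the -al options.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# A pretend "day 1" backup with about 1 MB of data.
mkdir day1
head -c 1048576 /dev/urandom > day1/big.bin
echo "small file" > day1/notes.txt

# The "day 2" snapshot: same files, hard-linked rather than copied.
cp -al day1 day2
sync                        # flush so block counts are up to date

# First line is day1, second line is what day2 adds on top of it.
du -sk day1 day2
day1_kb=$(du -sk day1 day2 | awk 'NR == 1 { print $1 }')
day2_kb=$(du -sk day1 day2 | awk 'NR == 2 { print $1 }')
echo "day1 uses ${day1_kb} KB, day2 adds ${day2_kb} KB"
```

Running du on day2 alone would report the full size again, because within that one invocation each file is seen for the first time.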

I use this technique to keep 90 daily backups of a 500GB file system on a 1TB drive.
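
Keeping a fixed number of daily backups means old snapshot directories have to be removed at some point. The article does not say how that is done, so the following is only one possible sketch: prune_snapshots is a hypothetical helper that assumes every entry under the backup root is a snapshot directory whose name is a sortable timestamp, as produced by the date format used above. Removing an old snapshot is safe because a file’s data stays on disk until the last hard link to it is gone.

```shell
# Delete the oldest snapshot directories under $1, keeping the newest $2.
# Assumes names sort chronologically (e.g. 20240101000000).
prune_snapshots() {
    root=$1
    keep=$2
    total=$(ls -1 "$root" | wc -l)
    excess=$((total - keep))
    [ "$excess" -gt 0 ] || return 0
    ls -1 "$root" | sort | head -n "$excess" | while IFS= read -r name; do
        rm -rf -- "$root/$name"
    done
}

# Demonstration in a scratch directory with three fake snapshots.
backups=$(mktemp -d)
mkdir "$backups/20240101000000"
echo data > "$backups/20240101000000/file"
cp -al "$backups/20240101000000" "$backups/20240102000000"
cp -al "$backups/20240102000000" "$backups/20240103000000"

prune_snapshots "$backups" 2
ls -1 "$backups"
cat "$backups/20240103000000/file"
```

In the demonstration the oldest directory disappears, while the file is still readable through the two remaining snapshots.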

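One caveat: a hard link does not consume an inode of its own, because every name for a file shares that file’s inode. Directories cannot be hard-linked, though, so each day’s tree needs a new inode for every directory, plus one for every new or changed file. If you’re using a file system such as ext3 or ext4, which has a fixed number of inodes set when the file system is created, you should allocate extra inodes on the backup volume when you create it. If you’re using a file system that allocates inodes dynamically, such as XFS, zfs or btrfs, then you don’t need to worry about this.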
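
To see how many inodes a hard-link snapshot really costs, you can count distinct inode numbers before and after the copy. In this sketch (GNU find, stat and cp assumed; count_inodes and the file names are made up for the example) only the directories of the new snapshot need fresh inodes, because the linked files share theirs with the originals.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# A small tree: two directories and three files.
mkdir -p day1/sub
echo one > day1/f1
echo two > day1/sub/f2
echo three > day1/sub/f3

# Count the distinct inode numbers used by the given trees.
count_inodes() {
    find "$@" -exec stat -c %i {} + | sort -u | wc -l
}

before=$(count_inodes day1)
cp -al day1 day2
after=$(count_inodes day1 day2)
new_inodes=$((after - before))
echo "day1 uses $before inodes; the hard-link copy added $new_inodes"
```

On a real backup volume, "df -i" reports how many inodes are in use and how many remain.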