Creating differential backups with hard links and rsync

You can use a hard link in Linux to create two file names that both refer to the same data on disk. For instance, if I type:

> echo xxxx > a
> cp -l a b
> cat a
xxxx
> cat b
xxxx

I create a file named “a” that contains the string “xxxx”. Then “cp -l” (a GNU cp option, equivalent here to “ln a b”) creates a hard link “b” that points to the same data on disk. Now whatever I write to file “a” also appears in file “b”, and vice versa:

> echo yyyy > b
> cat b
yyyy
> cat a
yyyy
> echo zzzz > a
> cat a
zzzz
> cat b
zzzz
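
One way to confirm that two names really are the same file, rather than two copies that happen to match, is to compare inode numbers. The sketch below is only an illustration: the helper name same_file is made up for this example, and it assumes both names live on the same file system, since inode numbers are only unique within one file system.

```shell
# Hypothetical helper: succeeds when both names refer to the same inode.
# Assumes both paths are on the same file system.
same_file() {
    ino1=$(ls -di -- "$1" | awk '{print $1}')
    ino2=$(ls -di -- "$2" | awk '{print $1}')
    [ -n "$ino1" ] && [ "$ino1" = "$ino2" ]
}

demo=$(mktemp -d)          # scratch directory so nothing real is touched
echo xxxx > "$demo/a"
ln "$demo/a" "$demo/b"     # hard link, same effect as "cp -l a b"
cp "$demo/a" "$demo/c"     # ordinary copy with its own inode

same_file "$demo/a" "$demo/b" && echo "a and b share an inode"
same_file "$demo/a" "$demo/c" || echo "a and c are separate files"
```

Running “ls -li” on the files shows the same information by hand: the first column is the inode number and the link count appears after the permissions.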

Copying to a hard link updates the data on the disk that each hard link points to:

> rm -f a b c
> echo xxxx > a
> echo yyyy > c
> cp -l a b
> cat a b c
xxxx
xxxx
yyyy

“a” and “b” point to the same file on disk; “c” is a separate file. If I copy file “c” to “b”, that also updates “a”:

> cp c b 
> cat a b c
yyyy
yyyy
yyyy
> echo zzzz > c
> cat a b c
yyyy
yyyy
zzzz 

What most people don’t know is that rsync behaves differently. When rsync decides that a target file needs updating, by default it does not overwrite the file in place; it writes a new file and renames it over the target. If the target was a hard link, that link is broken. This only happens when the source and target differ:

> rm a
> rm b
> echo xxxx > a
> cp -l a b
> cat a
xxxx
> cat b
xxxx
> echo yyyy > c
> cat c
yyyy
> rsync -av c b
sending incremental file list
c
sent 87 bytes  received 31 bytes  236.00 bytes/sec
total size is 5  speedup is 0.04
> cat b
yyyy
> cat c
yyyy
> cat a
xxxx

File “b” is no longer a hard link of “a”; it’s a new file. If I update “a”, “b” no longer changes:

> echo zzzz > a
> cat a b c
zzzz
yyyy
yyyy
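
The difference comes from how the target is written. cp and shell redirection open the existing file and overwrite its data in place, so every name linked to that inode sees the change. By default rsync instead builds the updated file under a temporary name and then renames it over the target, which gives the target name a new inode. The sketch below imitates that with plain cp and mv, so it runs without rsync; the temporary name .b.tmp is made up for illustration and is not the name rsync uses.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1

echo xxxx > a
ln a b                 # b is a hard link to a
echo yyyy > c

# Imitate rsync's default behaviour: write a temporary file, then rename it
# over the target. The rename replaces the name "b"; it never touches a's data.
cp c .b.tmp
mv .b.tmp b

cat a b c
```

Here “a” keeps “xxxx” while “b” and “c” both contain “yyyy”. rsync’s --inplace option makes it overwrite the target directly instead, in which case the hard link is kept and the change shows through every linked name, so it should not be used for this kind of backup.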

However, if the file that I’m rsync-ing has the same size and timestamp as “b”, then rsync does NOT break the hard link; it leaves the file alone:

> rm a
> rm b
> rm c
> echo xxxx > a
> cp -al a b
> cp -p a c
> cat a b c
xxxx
xxxx
xxxx

At this point “a” and “b” both point to the same file on the disk, which contains the string “xxxx”. “c” is a separate file that also contains the string “xxxx” and has the same permissions and timestamp as “a”.

> rsync -av c b
sending incremental file list
sent 39 bytes  received 12 bytes  102.00 bytes/sec
total size is 5  speedup is 0.10
> cat a b c
xxxx
xxxx
xxxx

At this point I’ve rsynced file “c” to “b”, but since “c” has the same contents and timestamp as “a” and “b”, rsync does nothing at all. It doesn’t break the hard link. If I change “b”, it still updates “a”:

> echo yyyy > b
> cat a b c
yyyy
yyyy
xxxx
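
Strictly speaking, rsync does not compare file contents in this case. By default it applies a quick check and skips any file whose size and modification time already match the target, which is why the example uses “cp -p” to preserve the timestamp. The following sketch approximates that test in plain shell. The function name looks_unchanged is hypothetical, and it relies on the -nt test operator, which bash and dash provide but POSIX does not require.

```shell
# Hypothetical approximation of rsync's default quick check:
# same size, and neither file newer than the other.
looks_unchanged() {
    size1=$(($(wc -c < "$1")))
    size2=$(($(wc -c < "$2")))
    [ "$size1" -eq "$size2" ] &&
    ! [ "$1" -nt "$2" ] &&
    ! [ "$2" -nt "$1" ]
}

demo=$(mktemp -d)
echo xxxx > "$demo/a"
cp -p "$demo/a" "$demo/c"        # same size, timestamp preserved
looks_unchanged "$demo/a" "$demo/c" && echo "c looks unchanged, would be skipped"

echo longer-text > "$demo/c"     # size now differs
looks_unchanged "$demo/a" "$demo/c" || echo "c differs, would be transferred"
```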

This is how many modern file system backup programs work. On day 1 you make an rsync copy of your entire file system:

backup@backup_server> DAY1=`date +%Y%m%d%H%M%S`
backup@backup_server> rsync -av -e ssh earl@192.168.1.20:/home/earl/ /var/backups/$DAY1/

On day 2 you make a hard link copy of the backup, then a fresh rsync:

backup@backup_server> DAY2=`date +%Y%m%d%H%M%S`
backup@backup_server> cp -al /var/backups/$DAY1 /var/backups/$DAY2
backup@backup_server> rsync -av -e ssh --delete earl@192.168.1.20:/home/earl/ /var/backups/$DAY2/

“cp -al” makes a hard link copy of the previous day’s backup tree, then rsync runs against that copy. If a file is unchanged, rsync does nothing and the file remains a hard link to the day 1 copy. However, if the file’s contents changed, rsync creates a new copy of the file in the target directory, leaving the day 1 version untouched. If a file was deleted from /home/earl, rsync deletes the hard link from that day’s copy only; the file is still present in the day 1 tree.
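
The two steps can be wrapped in one function that works for the first run and for every later run. This is a minimal sketch, not a finished backup script: the function name snapshot and the variables SRC and BACKUP_ROOT are made up for illustration, it assumes BACKUP_ROOT contains nothing but snapshot directories named by timestamp, and it does no locking or error recovery.

```shell
# Placeholders to adapt; the defaults mirror the commands shown above.
SRC=${SRC:-earl@192.168.1.20:/home/earl/}
BACKUP_ROOT=${BACKUP_ROOT:-/var/backups}

snapshot() {
    new="$BACKUP_ROOT/$(date +%Y%m%d%H%M%S)"
    # Timestamp names sort chronologically, so the last one is the newest.
    prev=$(find "$BACKUP_ROOT" -mindepth 1 -maxdepth 1 -type d | sort | tail -n 1)
    if [ -n "$prev" ]; then
        cp -al "$prev" "$new"     # hard link copy of the previous snapshot
    else
        mkdir -p "$new"           # first run: nothing to link against
    fi
    # Unchanged files stay linked; changed files are replaced by new copies.
    rsync -a --delete -e ssh "$SRC" "$new/"
}
```

Calling snapshot once a day from cron would produce one directory per day. Two runs within the same second would collide on the directory name, so a real script would want to guard against that.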

In this way, the $DAY1 directory has a snapshot of the /home/earl tree as it existed on day 1, and the $DAY2 directory has a snapshot of the /home/earl tree as it existed on day 2, but only the files that changed take up additional disk space. If you need to find a file as it existed at some point in time, you can look at that day’s tree. If you need to restore yesterday’s backup, you can rsync the tree from yesterday. You don’t have to store a copy of all of the data from each day; you only use additional disk space for files that changed or were added.
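
A quick way to see the space savings is du, which counts a hard-linked file only once per invocation, so when given several snapshots it charges shared files to the first one it encounters. The sketch below builds two throwaway snapshots in a temporary directory. The names day1 and day2 and the 1 MiB test file are made up for illustration, and “cp -al” assumes GNU cp.

```shell
demo=$(mktemp -d)
mkdir "$demo/day1"
# 1 MiB of random data stands in for a real file.
dd if=/dev/urandom of="$demo/day1/big" bs=1024 count=1024 2>/dev/null

cp -al "$demo/day1" "$demo/day2"      # hard link copy: no file data duplicated
echo small > "$demo/day2/new"         # a file added on day 2

# day1 is charged for "big"; day2 only for its directory and the new file.
du -sk "$demo/day1" "$demo/day2"
```

Running du on day2 alone would report the full size again, because within that single invocation “big” has not been seen before.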

I use this technique to keep 90 daily backups of a 500GB file system on a 1TB drive.
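
Keeping a fixed number of snapshots means removing the oldest ones as new ones arrive. Deleting a snapshot directory only frees the data of files that no other snapshot still links to. A minimal sketch, assuming the snapshot root contains nothing but timestamp-named snapshot directories like the ones created above; the function name prune_snapshots and its arguments are hypothetical:

```shell
# Usage: prune_snapshots ROOT KEEP
# Removes the oldest snapshot directories under ROOT until KEEP remain.
prune_snapshots() {
    root=$1
    keep=$2
    total=$(find "$root" -mindepth 1 -maxdepth 1 -type d | wc -l)
    excess=$((total - keep))
    [ "$excess" -gt 0 ] || return 0
    # Timestamp names sort chronologically, so the first lines are the oldest.
    find "$root" -mindepth 1 -maxdepth 1 -type d | sort | head -n "$excess" |
    while IFS= read -r dir; do
        rm -rf -- "$dir"
    done
}
```

With a dedicated backup directory, something like “prune_snapshots /path/to/snapshots 90” after each run would hold the history at 90 days. It is worth trying it on a scratch directory first, since it deletes every subdirectory it considers old.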

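One caveat: hard links themselves do not use up inodes, because every name linked to a file shares that file’s inode. Each snapshot does, however, need a new inode for every directory (“cp -al” cannot hard link directories) and for every file that changed or was added, so a tree with many directories multiplied by many snapshots can consume a lot of them. If you’re using a file system such as ext3 or ext4, which fixes the number of inodes when the file system is created, you should allocate extra inodes on the backup volume when you create it. If you’re using a file system that allocates inodes dynamically, such as zfs or btrfs, then you don’t need to worry about this.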

4 thoughts on “Creating differential backups with hard links and rsync”

  1. I am so glad that I use ZFS (on FreeBSD) and can use zfs send/receive along with snapshots and have it all done pretty simply. Daily snaps on the primary FS, send/receive to the second FS which is on an iSCSI array.

    I used to use the hard link workflow to back up a large web site (hundreds of thousands of files? Over a million? I cannot recall exactly) and I recall that a LOT of space was used just to keep the links/inodes.

  2. btrfs will do much the same thing as zfs. As for the rsync approach, OS X’s Time Machine uses basically the same technique, but OS X allows hard links for directories, so it saves a bunch of space over rsync.

  3. Rsync will handle the hardlinks for you so you no longer need to do the “cp -al”.

    Just create a new destination folder (example: the “today’s date” directory) and point rsync’s --link-dest option at the “yesterday’s date” (previous backup) directory. Rsync will compare against the “link-dest” folder, and if the file has not changed it will place a hard link to the link-dest file in your destination folder. If the file is changed or new, it will do the normal copy to your destination folder.

    The result is the same, but you do not need the --delete since you just created the destination folder and it starts off blank.

    Rsync is an excellent tool. And since it works at the file level it is file-system independent.

    If you have a choice in the file system you use, like Bill said, ZFS is great at backup and replication and may remove your need to use a tool like rsync.

    Cheers,

    Marty

  4. There are also tools like ‘rbackup’ and ‘rsnapshot’ that can totally automate this for you. rsnapshot gives you consistent 4-hourly/daily/weekly/monthly/yearly snapshots (configurable timeframes by editing the cron jobs and config) while using rsync as the backend. It will even fetch stuff for you over SSH tunnels if you configure both sides appropriately, and it can call external scripts if you’re up to writing a few for specific things (e.g. back up raw DBs by locking all tables, backing up the raw DB files, then releasing the locks, or just do dumps per DB or even per table, however you like).

    PS: There are also some quite useful de-dupe tools like ‘hardlink’ which will scan a bunch of dirs and de-dupe them by hardlinking the files for you. Great if you’ve already got a bunch of straight backups you want to build some hardlink rsync history off.