Creating differential backups with hard links and rsync

You can use a hard link in Linux to create two file names that both point to the same physical location on a hard disk. For instance, if I type:

> echo xxxx > a
> cp -l a b
> cat a
xxxx
> cat b
xxxx

I create a file named “a” that contains the string “xxxx”. Then I create a hard link “b” that points to the same spot on the disk. Now if I write to file “a”, whatever I write also appears in file “b”, and vice versa:

> echo yyyy > b
> cat b
yyyy
> cat a
yyyy
> echo zzzz > a
> cat a
zzzz
> cat b
zzzz
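
You can confirm that “a” and “b” really are one file by comparing inode numbers. With GNU coreutils (the exact columns vary by platform), “ls -li” should show the same inode number in the first column for both names and a link count of 2:

> ls -li a b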

Copying to a hard link updates the data on the disk that each hard link points to:

> rm a b c
> echo xxxx > a
> echo yyyy > c
> cp -l a b
> cat a b c
xxxx
xxxx
yyyy

“a” and “b” point to the same file on disk; “c” is a separate file. If I copy file “c” onto “b”, that also updates “a”:

> cp c b 
> cat a b c
yyyy
yyyy
yyyy
> echo zzzz > c
> cat a b c
yyyy
yyyy
zzzz 

What most people don’t know is that rsync is an exception to this rule. By default rsync writes its changes to a new temporary file and then renames that file over the target, so if you use rsync to sync two files and the target is a hard link, rsync will replace the target with a brand-new file, but only if the contents of the two files are not the same:

> rm a
> rm b
> echo xxxx > a
> cp -l a b
> cat a
xxxx
> cat b
xxxx
> echo yyyy > c
> cat c
yyyy
> rsync -av c b
sending incremental file list
c
sent 87 bytes  received 31 bytes  236.00 bytes/sec
total size is 5  speedup is 0.04
> cat b
yyyy
> cat c
yyyy
> cat a
xxxx

File “b” is no longer a hard link to “a”; it’s a new file. If I update “a” it no longer updates “b”:

> echo zzzz > a
> cat a b c
zzzz
yyyy
yyyy
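
Checking the inode numbers again confirms this; “a” and “b” should now show two different inode numbers, each with a link count of 1 (again assuming GNU coreutils):

> ls -li a b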

However, if the file that I’m rsync-ing is the same as “b”, then rsync does NOT break the hard link; it leaves the file alone:

> rm a
> rm b
> rm c
> echo xxxx > a
> cp -al a b
> cp -p a c
> cat a b c
xxxx
xxxx
xxxx

At this point “a” and “b” both point to the same file on the disk, which contains the string “xxxx”. “c” is a separate file that also contains the string “xxxx” and has the same permissions and timestamp as “a”.

> rsync -av c b
sending incremental file list
sent 39 bytes  received 12 bytes  102.00 bytes/sec
total size is 5  speedup is 0.10
> cat a b c
xxxx
xxxx
xxxx

At this point I’ve rsynced file “c” to “b”, but since “c” has the same contents and timestamp as “a” and “b”, rsync does nothing at all. It doesn’t break the hard link. If I change “b” it still updates “a”:

> echo yyyy > b
> cat a b c
yyyy
yyyy
xxxx
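
Incidentally, if you want to see what rsync decides to do in cases like this, the -i (--itemize-changes) flag prints a change summary for every file it updates, and combined with -n (--dry-run) nothing is modified, so it’s a safe way to preview whether rsync is going to replace a target and break a hard link:

> rsync -avin c b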

This is how many modern file system backup programs work. On day 1 you make an rsync copy of your entire file system:

backup@backup_server> DAY1=`date +%Y%m%d%H%M%S`
backup@backup_server> rsync -av -e ssh earl@192.168.1.20:/home/earl/ /var/backups/$DAY1/

On day 2 you make a hard link copy of the backup, then a fresh rsync:

backup@backup_server> DAY2=`date +%Y%m%d%H%M%S`
backup@backup_server> cp -al /var/backups/$DAY1 /var/backups/$DAY2
backup@backup_server> rsync -av -e ssh --delete earl@192.168.1.20:/home/earl/ /var/backups/$DAY2/

“cp -al” makes a hard link copy of the entire /home/earl/ directory structure from the previous day, then rsync runs against the copy of the tree. If a file remains unchanged then rsync does nothing — the file remains a hard link. However, if the file’s contents changed, then rsync will create a new copy of the file in the target directory. If a file was deleted from /home/earl then rsync deletes the hard link from that day’s copy.
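
You can see the space savings directly with “du”. Run against a single day’s tree, du reports the full size of the backup, but run against both trees in one invocation, GNU du counts each hard-linked file only once, so the second tree only accounts for the files that changed (a quick sanity check, assuming GNU du):

backup@backup_server> du -sh /var/backups/$DAY1
backup@backup_server> du -sh /var/backups/$DAY1 /var/backups/$DAY2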

In this way, the $DAY1 directory has a snapshot of the /home/earl tree as it existed on day 1, and the $DAY2 directory has a snapshot of the /home/earl tree as it existed on day 2, but only the files that changed take up additional disk space. If you need to find a file as it existed at some point in time you can look at that day’s tree. If you need to restore yesterday’s backup you can rsync the tree from yesterday, but you don’t have to store a copy of all of the data from each day; you only use additional disk space for files that changed or were added.
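
For example, to restore yesterday’s snapshot back to the original machine you could run the original rsync in reverse (a sketch only; double-check the paths and consider adding --dry-run first):

backup@backup_server> rsync -av -e ssh /var/backups/$DAY1/ earl@192.168.1.20:/home/earl/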

I use this technique to keep 90 daily backups of a 500GB file system on a 1TB drive.

One caveat: These backup trees do use up inodes. The hard links themselves don’t consume new inodes, but every day’s tree re-creates the full directory structure (directories can’t be hard-linked), and each of those directories needs an inode of its own. If you’re using a file system with a fixed number of inodes, such as ext3 or ext4, you should allocate extra inodes on the backup volume when you create it. If you’re using a file system that allocates inodes dynamically, such as zfs or btrfs, then you don’t need to worry about this.
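
You can keep an eye on inode usage on the backup volume with “df -i”; on GNU/Linux the IUse% column shows the percentage of inodes consumed:

backup@backup_server> df -i /var/backups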

19 thoughts on “Creating differential backups with hard links and rsync”

  1. I am so glad that I use ZFS (on FreeBSD) and can use zfs send/receive along with snapshots and have it all done pretty simply. Daily snaps on the primary FS, send/receive to the second FS which is on an iSCSI array.

    I used to use the hard link workflow to back up a large web site (hundreds of thousands of files? Over a million? I cannot recall exactly) and I recall that a LOT of space was used just to keep the links/inodes.

  2. btrfs will do much the same thing as zfs. As for rsync, OS X’s Time Machine uses basically the same approach, but OS X allows hard links to directories, so it saves a bunch of space over the rsync method.

  3. Rsync will handle the hard links for you, so you no longer need to do the “cp -al”.

    Just create a new destination folder (for example, the “today’s date” directory) and use the rsync --link-dest=“yesterday’s date” (previous backup) option. Rsync will compare against the “link-dest” folder, and if a file has not changed it will place a hard link to the link-dest file in your destination folder. If the file is changed or new it will do the normal copy to your destination folder. (A minimal sketch of this appears at the end of this comment.)

    The result is the same, but you do not need the --delete since you just created the destination folder and it starts off blank.

    Rsync is an excellent tool. And since it works at the file level it is file-system independent.

    If you have a choice in the file system you use, then, as Bill said, ZFS is great at backup and replication and may remove your need to use a tool like rsync.

    Cheers,

    Marty
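
    For reference, a minimal sketch of the --link-dest approach Marty describes, reusing the $DAY1/$DAY2 variables and paths from the article (rsync creates the new destination directory, and --link-dest points at the previous day’s backup):

    backup@backup_server> DAY2=`date +%Y%m%d%H%M%S`
    backup@backup_server> rsync -av -e ssh --link-dest=/var/backups/$DAY1 earl@192.168.1.20:/home/earl/ /var/backups/$DAY2/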

  4. There are also tools like ‘rbackup’ and ‘rsnapshot’ that can totally automate this for you. rsnapshot gives you consistent 4-hourly/daily/weekly/monthly/yearly snapshots (the timeframes are configurable by editing the cron jobs and config) while using rsync as the backend. It will even fetch files for you over SSH tunnels if you configure both sides appropriately, and it can call external scripts if you’re up to writing a few for specific tasks (e.g. back up raw DBs by locking all tables, backing up the raw DB files, then releasing the locks, or just do dumps per DB or even per table, however you like).

    PS: There are also some quite useful de-dupe tools like ‘hardlink’ which will scan a bunch of dirs and de-dupe them by hardlinking the files for you. Great if you’ve already got a bunch of straight backups you want to build some hardlink rsync history off.

  5. Pingback: Backup de un Servidor Linux | LuisPa

  6. Earl C. Ruby III,
    I don’t get it. You just say:
    ” … , but you don’t have to store a copy of all of the data from each day, you only use additional disk space for files that changed or were added.

    I use this technique to keep 90 daily backups of a 500GB file system on a 1TB drive.”

    I don’t know how you do that. I know [rsync] can mirror the src and dest sides while only copying the few modified files, but it’s still a full copy, not an incremental copy. Isn’t it?

    If I want an incremental copy then I would need 3 arguments, but rsync only takes 2??

    Can you explain in more detail?

    • The key paragraph in the article is:

      “cp -al” makes a hard link copy of the entire /home/earl/ directory structure from the previous day, then rsync runs against the copy of the tree. If a file remains unchanged then rsync does nothing — the file remains a hard link. However, if the file’s contents changed, then rsync will create a new copy of the file in the target directory. If a file was deleted from /home/earl then rsync deletes the hard link from that day’s copy.

      If you want to test this, manually do all of the steps up to the DAY2 rsync, check your available disk space, then do the DAY2 rsync. When rsync is done you should still have two copies of the backup but no additional space will be used. Change one file and do the DAY2 rsync again. The changed file will appear in the DAY2 backup, and your free disk space will decrease by an amount equal to the size of the changed file minus the amount of space used by the deleted hard link.

      • I think I understand better than before. The key is the “hard link”, and copying a “hard link file” doesn’t increase the allocated size on disk.

        But moving those daily backups to another filesystem will be another issue; I don’t think the hard link structure can be cloned between filesystems.

        Sorry for my poor English, and thank you for solving my question.

        • Correct, copying a hard link just uses up enough disk space for the link, and they’re tiny.

          Hard links can only point to a file on the same file system that the link is on — you can’t have a hard link on one file system pointing to a file on another file system.

          Your backup file system has one copy of the original file and multiple hard links that point to the file. If you want to copy the entire backup directory to a new disk use rsync with the “-H” (preserve hard links) parameter, e.g.:

          rsync -avH /old/backup/file/system/ /new/backup/file/system/
          

          The first copy of a file will be a full copy, but the hard links will be copied as hard links, so the files on the new file system will take the same amount of space as they did on the old file system.

  7. Thanks for a great article.

    I just have a quick question. I have replaced the paths with mine and it seems to work, but I don’t know which commands should be scheduled to run. Should I run all of them every day? Obviously not the ones that set the day, but the copy of the Day 1 backup and the Day 2 rsync commands?

    These are my paths:

    DAY1=`date +%Y%m%d%H%M%S`
    rsync -av /home/cle/rsync-source/ /home/cle/rsync-backups/$DAY1/

    DAY2=`date +%Y%m%d%H%M%S`
    cp -al /home/cle/rsync-backups/$DAY1 /home/cle/rsync-backups/$DAY2
    rsync -av --delete /home/cle/rsync-source/ /home/cle/rsync-backups/$DAY2/

    So should I run these daily?

    rsync -av /home/cle/rsync-source/ /home/cle/rsync-backups/$DAY1/
    cp -al /home/cle/rsync-backups/$DAY1 /home/cle/rsync-backups/$DAY2
    rsync -av --delete /home/cle/rsync-source/ /home/cle/rsync-backups/$DAY2/

    Thanks in advance.

    Regards
    Chris

    • If you’re looking for a ready-made script that does all this it’s a bit more complicated than that, because you don’t know when the last backup was or if there was a previous backup. You really want to do something like this every day:

      TODAY=`date +%Y%m%d%H%M%S`
      BACKUP_COUNT=`ls -d1 /home/cle/rsync-backups/backup-* 2> /dev/null | wc -l`
      if [ $BACKUP_COUNT -eq 0 ]; then
          rsync -av --delete /home/cle/rsync-source/ /home/cle/rsync-backups/backup-$TODAY/
      else
          latest_directory_regardless_of_when_it_was_backed_up=`ls -d1 /home/cle/rsync-backups/backup-* | tail -1`
          cp -al $latest_directory_regardless_of_when_it_was_backed_up /home/cle/rsync-backups/backup-$TODAY
          rsync -av --delete /home/cle/rsync-source/ /home/cle/rsync-backups/backup-$TODAY/
      fi
      

      Or just install rsnapshot and use that.
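
      If you do schedule the script above, save it as an executable file, for example /usr/local/bin/daily-backup.sh (a hypothetical path), and run it once a day from the backup user’s crontab:

      30 2 * * * /usr/local/bin/daily-backup.sh

      That entry runs the backup every day at 2:30 AM.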

  8. Any idea how to “shrink” days 1-N into folder N optimally (i.e. to go file by file and, once a file is in folder N, remove its traces in folders 1..N-1)? The goal is to create a full backup in day N using all the previous days without using additional disk space.

    • Each file is only stored once on disk. If a file appears in every folder, only the first copy takes space; the other copies are just hard links.

      Each directory contains a full backup on the date it was made.

  9. Hi Earl – what to do when the backup drive is full?
    I don’t think I can simply delete backup files older than, say, 6 months …
    That might create a situation where I have only hard links to a file, but the file itself is gone?
    Is that true?

    • The only way to recover space on a full hard drive is to delete data.

      You can think of hard links as pointers to the same chunk of data on your hard drive. If you have 3 links pointing to the data, and you delete 1, you still have two pointers and the data. Delete another link and you have 1 pointer and the data. Delete the last link and you effectively delete the data.

      If you delete your oldest set of hard links, the only disk space you will recover is for chunks of data where those links were the last links pointing to the data. In your theoretical case, if you made a backup, then accidentally deleted a file from your main hard drive, then continued to make backups for 6 months, you’d have 6 months to find and restore the file. After six months the file would be gone when you deleted the last pointer to the file.

      It really depends on how far back you want to keep backups. I wrote a script that I currently use which backs up about 1.2TB of critical files to an encrypted external 2TB hard drive once a day. Once a month it takes yesterday’s backup and renames it to a “monthly archival” backup. Daily backups are deleted after I have 20 days of backups. When the disk hits 95% full the script deletes the oldest monthly archival backups until the disk is only 90% full. Every now and then I swap in a second 2TB drive, so I have 2 physical media with backups. Using this method I have daily backups going back 20 days and monthly backups going back a year.

      For me this works. If I accidentally delete a file I can recover it from yesterday’s backup. If I lose a whole drive I can restore all critical files from the previous day. If I delete a file and don’t notice for 11 months I can find the file and recover it. If I delete a file and don’t notice for a year it’s probably gone for good, but if I haven’t noticed for a year that it’s gone I probably won’t miss it.

      If I wanted to keep every copy of every file around forever, I could just make daily backups until the drive got full, then swap in a new drive, repeat. That’s not my use case, but it might be yours.

      Hope that helps.

  10. Hi Earl,
    Is there a way to pre-check that the backup disk (mounted over NFS) has sufficient disk space, to avoid a “no disk space” issue? I don’t want to use rsync --dry-run as it takes lots of time.

    • The script I wrote and use checks to see if the volume is 95% full or more. If it is, it starts deleting the oldest backups until free space is > 5%. I’m using an external 2TB drive and I’ve got daily backups of the last 20 days and monthly backups going back almost a year.

      If you want an exact check then no, I’m not aware of any way other than --dry-run.
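
      A rough sketch of that kind of threshold check (hypothetical paths, assuming GNU df with --output support and backup directory names that sort oldest-first):

      USED=`df --output=pcent /var/backups | tail -1 | tr -d ' %'`
      while [ $USED -ge 95 ]; do
          OLDEST=`ls -d1 /var/backups/backup-* | head -1`
          rm -rf $OLDEST
          USED=`df --output=pcent /var/backups | tail -1 | tr -d ' %'`
      done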
