The world needs optimists and pessimists. The optimists create things like operating systems, and the pessimists create the backup software. - T.E. Ronneberg

the brown-dragon blog

Taking Backups from the Command Line

2010-10-11

Some people aren't built for GUI's

The Story

In 1999, I bought a Nero CD Writer. At first I used it occasionally, only cutting CD's when really needed to transfer my work and occasionally creating a music mix.

But the habit grew on me. I began to copy more and more. A couple of years later found me backing up my entire PC pretty regularly - and it ran into a dozen or so CD's every time. If you've ever cut CD's you will know it's a tiresome and boring process and so, after another few years, I grew tired of it and bought an external hard disk.

The hard disk spoilt me. Backups were too easy. The habit grew and grew until now it is so bad that I run a daily backup onto a 2TB external HDD and a weekly backup onto an 'offsite disaster HDD'.

So you can see, I am pretty committed to my backups. They give me a warm, fuzzy, feeling of comfort and I love 'em. Ahhhh...

The Worm in The Apple

Things were going swimmingly along but they say good times never last. In my case, the good times ended when I moved to Windows 7.

To explain why, you need to understand that I love The Command Line. I do most of my work without opening any fancy "Windows" or clicking any of these new-fangled "Buttons" and Keep Off My Lawn You Pesky Kids. So naturally, my backup was a rsync script running under cygwin. It was simple, fast, and I loved it!

Windows 7 destroyed all that.

It turns out that for some reason, stat() is horribly slow on cygwin under Win 7. Since rsync uses stat() (a lot!), my simple, fast script became a horribly slow script. Instead of taking 4-5 minutes, my backups now took over 30 minutes!

SyncBack to the rescue!

A good friend of mine then suggested I try SyncBack. Not being a fan of GUI applications I wasn't very keen but I quickly got over my reservations. SyncBack is great! It was simple to use and fast! So I was back in business!

Until that fateful day...

Some people aren't built for GUI's

One problem I still struggled with SyncBack was going through the list of differences.

SyncBack Differences Screen

The list can get pretty large and it is difficult to do anything useful with. On the command line I could do all sorts of things to manage this list (filter, grep, etc), but here I was forced to try and scroll through a thousand items just to check that I wasn't overwriting anything by mistake.

Of course, you can guess what happened. I got tired of looking through the list and figured it was ok. I mean, what could go wrong? Right?

What went wrong

    Step 1: Delete a file by mistake.
    Step 2: Think: "Yikes!"
    Step 3: Panic subsides. Remember backup. File safe. Relief.
    Step 4: Mental note-to-self : copy file from backup
    Step 5: Work on other things.
    Step 6: Forget mental note-to-self.
    Step 7: Run backup. Ignore impossible to analyse list of changes.
    Step 8: When backup finishes deleting backup copy, remember mental note.
    Step 9: Tear hair out. Weep uncontrollably.

bd-sync

So after I lost a file, I bit the bullet and just wrote my own backup system. It's a lovely bit of C code that works with a couple of sed scripts for filtering and executing. My process now looks like this:

    (generate list of items to be copied/updated/deleted in a nice text file)
        ==>(filter/grep/review. Edit if necessary) <== CRUCIAL STEP
                ==> (if happy, continue. List gets converted to copy and delete instructions)
                        ==> (run copy script)
                        ==> (run delete script) <== SEPARATED OUT FOR FURTHER SAFETY

The process looks a bit long but it's actually very very quick. My code runs in under 3 minutes, and the entire process (with the simple review) generally takes no more than another minute. The whole flow works like this:

    bd-sync <src> <dest>
        ==> sed -f filter-dir.sed
                ==> ./g
                        ==> ./g.sh (generated copy script)
                        ==> ./d.sh (generated delete script)

I still recommend SyncBack for anyone looking for a backup solution who isn't such a command line freak. It's a clean and beautiful program and I have no hesitation in giving them my seal of approval.

But as for me, I'm back to the command line - and loving it!

bd-sync.c

/*
 * bd-sync
 *  Create a simple-to-process list of differences that need to be mirrored
 *  across a source and backup destination.
 *
 *  This program is needed because the stat() call in Windows7/cygwin is
 *  terribly slow. As a comparison, it takes 23 minutes to run rsync on
 *  my backup while this program takes roughly 2 minutes on the same.
 *
 * Notes:
 * 1. Ignore time differences upto 2 seconds (NTFS write error margin).
 *    NB. Does not handle Daylight Savings Time or FAT times
 *        simply because I don't have use for either.
 *        (Should be easy enough to add if needed)
 *
 * http://the-brown-dragon.com/
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>
#include <windows.h>

typedef void*          vptr_t;
typedef int8_t         byte_t;
typedef int8_t*        byteptr_t;
typedef char           char_t;
typedef char_t*        strptr_t;
typedef int            bool_t;

/**
 * Ignore time differences upto 2 seconds (NTFS write error margin).
 */
#define SLACKTIME 50000000

/**
 * Simple error handling.
 */
static char_t errmsg [2048];
#define ERRRET(r,...) do {\
    if (errno) {sprintf(errmsg,__VA_ARGS__);perror(errmsg);}\
    else {fprintf (stderr,__VA_ARGS__);fprintf(stderr,"\n");}\
    return (r);\
} while (0);

/**
 * Statistics for improving performance.
 */
#ifdef STATS

    #define NEWCLOCK(t) clock_t t
    #define STARTCLOCK(t) t = clock ()
    #define STOPCLOCK(d,t) d = (clock () - t)
    #define ACCUMCLOCK(a,t) do { a##_a = ((a##_a * a##_c) + (clock () - t))/(++(a##_c)); } while (0)
    #define DEFACCUM(w) static double w##_a; static uint32_t w##_c
    #define GETACCUM(w) w##_a

#else

    #define NEWCLOCK(t)
    #define STARTCLOCK(t)
    #define STOPCLOCK(d,t)
    #define ACCUMCLOCK(a,t)
    #define DEFACCUM(w)
    #define GETACCUM(w)

#endif

DEFACCUM (OPEN);
DEFACCUM (READ);
DEFACCUM (CLOSE);
DEFACCUM (STAT);
DEFACCUM (MAKEPATH);
DEFACCUM (ALLOCFI);
DEFACCUM (PRINT);
DEFACCUM (SORTFNAMES);

#ifdef STATS
static inline void dump_stats (void)
{
    printf ("Open:%f\n", GETACCUM(OPEN));
    printf ("Read:%f\n", GETACCUM(READ));
    printf ("Close:%f\n", GETACCUM(CLOSE));
    printf ("Stat:%f\n", GETACCUM(STAT));
    printf ("MakePath:%f\n", GETACCUM(MAKEPATH));
    printf ("AllocFI:%f\n", GETACCUM(ALLOCFI));
    printf ("Print:%f\n", GETACCUM(PRINT));
    printf ("Sort:%f\n", GETACCUM(SORTFNAMES));
}
#endif

/**
 * Basic allocator.
 */
static inline void* xmalloc (size_t pSize)
{
        vptr_t  v;
    if ((v = malloc (pSize))) return v;
    perror ("Out of Memory!");
    exit (EXIT_FAILURE);
}

/**
 * We need to call windows API to get the performance we need.
 * Hence we need windows path support.
 */
static inline strptr_t cygwin_to_win32_path (strptr_t pCygPath)
{
    if (strstr (pCygPath, "/cygdrive/") != pCygPath) return pCygPath;

    pCygPath [9] = pCygPath [10];
    pCygPath [10] = ':';

    return pCygPath + 9;
}

/**
 * Gets file information by calling the windows API.
 */
static inline int get_file_attr (strptr_t pFName,
                                 WIN32_FILE_ATTRIBUTE_DATA *pFAttrs)
{
    if (GetFileAttributesExA(pFName,GetFileExInfoStandard,pFAttrs)) return 0;
    { /* Maybe it's in unicode */
            wchar_t buf[1024];
        if (MultiByteToWideChar (CP_UTF8, 0, pFName, -1, buf, 1024)) {
            if (GetFileAttributesExW(buf,GetFileExInfoStandard,pFAttrs)) return 0;
        }
    }

    switch (GetLastError()) {
        case ERROR_ACCESS_DENIED:
        case ERROR_SHARING_VIOLATION:
        case ERROR_LOCK_VIOLATION:
        case ERROR_SHARING_BUFFER_EXCEEDED:
            return EACCES;
        case ERROR_BUFFER_OVERFLOW:
            return ENAMETOOLONG;
        case ERROR_NOT_ENOUGH_MEMORY:
            return ENOMEM;
        default:
            return ENOENT;
    }
}

/**
 * Walks a given directory.
 */
int walk_dir (strptr_t pDir,
              int (*walk_func) (strptr_t pDirElem, vptr_t pCtx),
              vptr_t pCtx)
{
        DIR           *d;
        struct dirent *de;
        int            r;
        NEWCLOCK(t);

    STARTCLOCK(t);
    if (!(d = opendir (pDir))) ERRRET (1, "Cannot open %s!", pDir);
    ACCUMCLOCK(OPEN,t);

    STARTCLOCK(t);
    while ((errno = 0, de = readdir (d)) != NULL) {
        ACCUMCLOCK(READ,t);
        if ((r = walk_func (de->d_name, pCtx))) return r;
        STARTCLOCK(t);
    }

    if (errno) ERRRET (2, "Error while walking %s!", pDir);

    STARTCLOCK(t);
    (void)closedir (d);
    ACCUMCLOCK(CLOSE,t);

    return 0;
}

/**
 * Here is where we store the file information for comparison.
 */
struct tFileInfo {
    char_t   uFlag;
    strptr_t uFName;
    uint64_t uTime;
    uint64_t uSize;

    struct tFileInfo *uNext;
};

/**
 * Root of the comparison path;
 */
struct tRoot {
    strptr_t          uRoot;
    struct tFileInfo *uHead;
};

/**
 * Application contains source and destination.
 */
struct tAppInfo {
    struct tRoot uSrc;
    struct tRoot uDst;
};

/**
 * Actual walk to save file information.
 */
int walk_save (strptr_t pDirElem, vptr_t pCtx)
{
        WIN32_FILE_ATTRIBUTE_DATA  a;
        struct tRoot              *r = pCtx;
        struct tFileInfo          *fi, *p;
        strptr_t                   o;
        NEWCLOCK(t);

    if ((!strcmp (pDirElem, "."))
            || (!strcmp (pDirElem, ".."))) return 0;

    STARTCLOCK(t);
    fi = xmalloc (sizeof (struct tFileInfo));
    memset (fi, 0, sizeof (struct tFileInfo));
    ACCUMCLOCK(ALLOCFI,t);
    STARTCLOCK(t);
    fi->uFName = xmalloc ((strlen (r->uRoot) + strlen (pDirElem) + 2) * sizeof (char_t));
    sprintf (fi->uFName, "%s/%s", r->uRoot, pDirElem);
    ACCUMCLOCK(MAKEPATH,t);

    STARTCLOCK(t);
    p = r->uHead;
    if (p && (strcmp (p->uFName, fi->uFName) < 0)) {
        fi->uNext = r->uHead;
        r->uHead  = fi;
    } else {
        while (p->uNext && (strcmp (p->uNext->uFName, fi->uFName) > 0)) p = p->uNext;
        fi->uNext = p->uNext;
        p->uNext  = fi;
    }
    ACCUMCLOCK(SORTFNAMES,t);

    STARTCLOCK(t);
    if ((errno = get_file_attr (fi->uFName, &a))) ERRRET (0, "Cannot read %s!", fi->uFName);
    fi->uTime  = ((uint64_t)a.ftLastWriteTime.dwLowDateTime) | (((uint64_t)a.ftLastWriteTime.dwHighDateTime) << 32);
    fi->uSize  = ((uint64_t)a.nFileSizeLow) | (((uint64_t)a.nFileSizeHigh) << 32);
    ACCUMCLOCK(STAT,t);

    o = r->uRoot; r->uRoot = fi->uFName;
    fi->uFlag = (a.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) ? 'd' : 'f';
    if (fi->uFlag == 'd') (void)walk_dir (fi->uFName, walk_save, pCtx);
    r->uRoot = o;

    return 0;
}

/**
 * Useful dump function.
 */
static inline void dump_fileinfo (struct tFileInfo *pFI)
{
        SYSTEMTIME ft;
        FILETIME   ftt;

    ftt.dwLowDateTime  = (0xFFFFFFFF & pFI->uTime);
    ftt.dwHighDateTime = (pFI->uTime >> 32);

    if (!(FileTimeToSystemTime (&ftt, &ft))) memset (&ft, 0, sizeof (SYSTEMTIME));
    printf ("%c   %d-%02d-%02dT%02d:%02d:%02d.%02d\t%lld\t%s\n",
            pFI->uFlag,
            (int)ft.wYear, (int)ft.wMonth, (int)ft.wDay,
            (int)ft.wHour, (int)ft.wMinute, (int)ft.wSecond, (int)ft.wMilliseconds,
            pFI->uSize, pFI->uFName);
}

/**
 * Dump application information header
 */
static inline void dump_appinfo (struct tAppInfo *pAI)
{
        struct tFileInfo *fi;

    printf ("#!bd-sync\n");

    printf ("#S[ource]     : %s\n", pAI->uSrc.uRoot);
    for (fi = pAI->uSrc.uHead;fi;fi = fi->uNext) dump_fileinfo (fi);

    printf ("#D[estination]: %s\n", pAI->uDst.uRoot);
    for (fi = pAI->uDst.uHead;fi;fi = fi->uNext) dump_fileinfo (fi);
}

/**
 * Dump file difference.
 */
static inline void dump_filediff (struct tFileInfo *pSrc, struct tFileInfo *pDst)
{
    printf ("%c %s*%s\n", pSrc ? pSrc->uFlag : pDst->uFlag,
                          pSrc ? pSrc->uFName : "",
                          pDst ? pDst->uFName : "");
}

/**
 * Comparision results.
 */
enum eCmpInfo {
    INFO_IDENTICAL,
    INFO_NOMATCH,
    INFO_NOSRC,
    INFO_NODST,
};

/**
 * Simple offsets into path for comparison.
 */
static int sSrcOffset, sDstOffset;

/**
 * Comparison function.
 */
static inline enum eCmpInfo cmp_info (struct tFileInfo *pSrc,
                                      struct tFileInfo *pDst)
{
        int      ci;
        uint64_t d;

    if (!pSrc) return INFO_NOSRC;
    if (!pDst) return INFO_NODST;

    ci = strcmp (pSrc->uFName + sSrcOffset, pDst->uFName + sDstOffset);

    if (ci > 0) return INFO_NOSRC;
    else if (ci < 0) return INFO_NODST;

    if ((pSrc->uFlag == 'd') && (pDst->uFlag == 'f')) return INFO_NOSRC;
    if ((pSrc->uFlag == 'f') && (pDst->uFlag == 'd')) return INFO_NODST;

    if ((pSrc->uFlag == 'd') && (pDst->uFlag == 'd')) return INFO_IDENTICAL;

    d = pSrc->uTime > pDst->uTime ? pSrc->uTime - pDst->uTime :
            pDst->uTime - pSrc->uTime;

    if ((pSrc->uSize == pDst->uSize) && (d < SLACKTIME)) return INFO_IDENTICAL;

    return INFO_NOMATCH;
}

/**
 * Dumps the entire diff information.
 */
static inline int dump_diffinfo (struct tAppInfo *pAppInfo)
{
        struct tFileInfo *src  , *dst;
        enum eCmpInfo     ci;

    printf ("#!bd-sync\n");
    printf ("#S[ource]     : %s\n", pAppInfo->uSrc.uRoot);
    printf ("#D[estination]: %s\n", pAppInfo->uDst.uRoot);

    // Setup closure
    sSrcOffset = strlen (pAppInfo->uSrc.uRoot);
    sDstOffset = strlen (pAppInfo->uDst.uRoot);

    // Walk source and dest
    src = pAppInfo->uSrc.uHead;
    dst = pAppInfo->uDst.uHead;
    while (src || dst) {
        switch ((ci = cmp_info (src, dst))) {
            case INFO_IDENTICAL:
                // Match! Advance both
                src = src->uNext;
                dst = dst->uNext;
                break;
            case INFO_NOMATCH:
                // No match! Display both files
                dump_filediff (src, dst);
                src = src->uNext;
                dst = dst->uNext;
                break;
            case INFO_NOSRC:
                // Only display dest
                dump_filediff (NULL, dst);
                dst = dst->uNext;
                break;
            case INFO_NODST:
                // Only display src
                dump_filediff (src, NULL);
                src = src->uNext;
                break;
            default:
                ERRRET (1, "Comparison not understood (%d)!", ci);
        }
    }

    // All done!
    return 0;

}

/**
 * Setup the root paths.
 */
static inline int setup_root (struct tRoot *pRoot, strptr_t pDir)
{
        char  buf [MAX_PATH];

    /* No cleanup - this memory is needed for lifetime of app anyway */
    pRoot->uHead = xmalloc (sizeof (struct tFileInfo));
    memset (pRoot->uHead, 0, sizeof (struct tFileInfo));
    pRoot->uHead->uFName   = xmalloc (MAX_PATH);

    if (!getcwd (buf, MAX_PATH)) ERRRET (0, "Failed to get full path of current dir!");
    if (chdir (pDir)) ERRRET (0, "Failed to open directory %s!", pDir);
    if (!getcwd (pRoot->uHead->uFName, MAX_PATH)) ERRRET (0, "Failed to get full path of %s!", pDir);
    if (chdir (buf)) ERRRET (0, "Failed to return to current dir!");
    pRoot->uRoot =
    pRoot->uHead->uFName = cygwin_to_win32_path (pRoot->uHead->uFName);

    return 1;
}

/**
 * Parses the arguments.
 */
static inline struct tAppInfo* parse_args (int argc, char *argv[])
{
        struct tAppInfo *ai;

    if (argc != 3) ERRRET (NULL, "Please provide <src> <dest>");

    ai = xmalloc (sizeof (struct tAppInfo));
    memset (ai, 0, sizeof (struct tAppInfo));

    if (!setup_root (&ai->uSrc, argv [1])) goto err;
    if (!setup_root (&ai->uDst, argv [2])) goto err;

    return ai;

err:
    free (ai);
    return NULL;
}

/**
 * Load a particular root (src or dest).
 */
static inline int load (struct tRoot *pRoot)
{
    return walk_dir (pRoot->uHead->uFName, walk_save, pRoot);
}

/**
 * Reverses the gathered list to get it into correct order.
 */
static struct tFileInfo* reverse_list (struct tFileInfo *pNode)
{
        struct tFileInfo *pr, *nxt;

    if (!pNode || !pNode->uNext) return pNode;

    pr = NULL;
    while (pNode) {
        nxt          = pNode->uNext;
        pNode->uNext = pr;
        pr           = pNode;
        pNode        = nxt;
    }
    return pr;
}

/**
 * Populates a particular root (src or dest).
 */
static inline int populate (struct tRoot *pRoot)
{
        int        r;
    if (!(r = load (pRoot))) pRoot->uHead = reverse_list (pRoot->uHead);
    return r;
}

/**
 * Main entry point.
 */
int main (int argc, char *argv[])
{
        struct tAppInfo *ai;

    if (!(ai = parse_args (argc, argv))) return EXIT_FAILURE;

    if (populate (&ai->uSrc)) return EXIT_FAILURE;
    if (populate (&ai->uDst)) return EXIT_FAILURE;

    if (dump_diffinfo (ai)) return EXIT_FAILURE;

    if (getenv ("DEBUGDUMP")) {
        printf ("\n\n#---- DEBUG DUMP ----\n");
        dump_appinfo (ai);
    }

#ifdef STATS
    dump_stats ();
#endif

    return EXIT_SUCCESS;
}

filter-dir.sed

# Filters out the entire list of paths
# provided by bd-sync into a neat list
# of unique directories. (much easier
# to review)
#
# Author: http://the-brown-dragon.com/
#######################################

# Remove any comments in input list
/^#/d
# Remove destination path (same as src)
s/^\(....*\*\)..*$/\1/
# File? Then find containing directory
/^f/s,\(.*\)/[^/]*$,\1,
# Cleanup markers (file/directory and src/dest)
s/..//;s/\*$//
# Don't bother with the details under .git repositories
s,.git/.*,.git,
# Filter for unique paths
G;s/$/\n/
/^\([^\n]*\)\n.*\1\n/!{
s/\n.*//
H
}
# Display all filtered paths
$!d
g
s/\n//

g

#!/usr/bin/bash
#
# This is a simple bash script to convert
# the list provided by bd-sync into instructions
# for copying and deleting.
#
# NB: Scripts are generated in two stages -
#     The copying script itself generates
#     the deletion script. This is for two
#     reasons:
#       1. Allows a review of what is about
#          to be deleted, which is generally
#          irreversible so needs confirmation.
#       2. The flow of copying and deleting
#          is in opposite directions:
#          Copying: FIRST create main directories
#                   THEN create sub-dirs and files
#          Delete : FIRST delete sub-dirs and files
#                   THEN delete main directories
#                   Which is why we have 'the
#                   reversing step' below.
#
# Author: http://the-brown-dragon.com/
#######################################

LATESTLOG="/b_d/e.d/sync/latest"
[ -z $1 ] || LATESTLOG="$1"
OUTFILE="g.sh"
DELFILE="d.sh"
SEDFILE=delthis.sed

cat > $SEDFILE <<SEDFILEENDS
1{s/.*/#!\/usr\/bin\/bash\n> $DELFILE/;p;d}
s/\\\$/\\\\\$/g
/#S\[ource\]     : /{
s/#S\[ource\]     : //
h;d
}
/#D\[estination\]: /{
s/#D\[estination\]: //
H;d
}
/^#/d
/^d/{
s/..//
/\*$/{
G
s/\(.*\)\n\(.*\)\n\(.*\)/\2\n\3\n\1/
s/\(.*\)\n\(.*\)\n\1\(.*\)/\2\3/
s/\\\$/\\\\\\\\\$/g
s/!/\\\\!/g
s/\(.*\)\*$/mkdir -p "\1" || echo "Failed to create directory \1"/
}
/^\*/{
s/\\\$/\\\\\\\\\$/g
s/!/\\\\!/g
s/\*\(.*\)/echo rmdir "\\\\"\1\\\\"" >> $DELFILE/
}
p;d
}
s/..//
/\*$/{
s/\(.*\)\*$/\1\n\1/
G
s/\(.*\)\n\(.*\)\n\(.*\)\n\(.*\)/\3\n\4\n\1\n\2/
s/\(.*\)\n\(.*\)\n\(.*\)\n\1\(.*\)/cp -P "\3" "\2\4" || echo "Failed to copy \3"\ntouch -h -r "\3" "\2\4"/
}
/^\*/{
s/\\\$/\\\\\\\\\$/g
s/!/\\\\!/g
s/\*\(.*\)/echo "rm \\\\"\1\\\\"" >> $DELFILE/
}
/..*\*..*/s/\(.*\)\*\(.*\)/cp -P "\1" "\2" || echo "Failed to copy \1"\ntouch -h -r "\1" "\2"/


# The reversing step! Refer main script comment for reason.
\$s/.*/&\nsed -i '1!G;h;\$!d;s\/^\/#!\\\\\/usr\\\\\/bin\\\\\/bash\\\\n\/' $DELFILE/
SEDFILEENDS

sed -f $SEDFILE $LATESTLOG > $OUTFILE

rm $SEDFILE

Other Posts

(ordered by Tags then Date)