
cpio VS tar and cp


I see no reason to use cpio other than for ripping open RPM files, via disrpm or rpm2cpio, but there may be corner cases in which cpio is preferable to tar.

History and popularity

Tar and cpio are competing archive formats that were both introduced in Version 7 Unix in 1979 and then included in POSIX.1-1988, though only tar remained in the next standard, POSIX.1-2001.

Cpio's file format has changed several times and has not remained fully compatible between versions. For example, there is now an ASCII-encoded representation of binary file information data.

Tar is more universally known, has become more versatile over the years, and is more likely to be supported on a given system. Cpio is still used in a few areas, such as the Red Hat package format (RPM), though RPM v5 (which is admittedly obscure) uses xar instead of cpio.

Both live on most Unix-like systems, though tar is more common. Here are Debian's install stats:

#rank  name    inst    vote    old  recent  no-files  (maintainer)
   13  tar   189206  172133   3707   13298        68  (Bdale Garbee)
   61  cpio  189028   71664  96346   20920        98  (Anibal Monsalve Salazar)

Modes

Copy-out: This is for archive creation, akin to tar -pc

Copy-in: This is for archive extraction, akin to tar -px

Pass-through: This is basically both of the above, akin to tar -pc … |tar -px but in a single command (and therefore microscopically faster). It's similar to cp -pdr, though both cpio and (especially) tar have more customizability. Also consider rsync -a, which people often forget since it's more typically used across a network connection.
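The three modes can be sketched as small shell functions. This is only an illustration: the function names and the directory and archive arguments are hypothetical, and it assumes a cpio that accepts the traditional -o, -i, -p, -d, and -m flags.

```shell
#!/bin/sh
# Sketch of cpio's three modes. Function names and arguments are
# placeholders for illustration, not part of cpio itself.

# Copy-out: file names on stdin, archive on stdout (roughly `tar -pc`).
cpio_out() {  # usage: cpio_out SRC_DIR ARCHIVE
    src=$1
    archive=$2
    # Make the archive path absolute, since we cd into SRC_DIR below.
    case $archive in /*) ;; *) archive=$PWD/$archive ;; esac
    ( cd "$src" && find . -depth -print | cpio -o > "$archive" )
}

# Copy-in: archive on stdin, files recreated under DEST_DIR (roughly `tar -px`).
# -d creates leading directories, -m restores modification times.
cpio_in() {  # usage: cpio_in ARCHIVE DEST_DIR
    archive=$1
    dest=$2
    case $archive in /*) ;; *) archive=$PWD/$archive ;; esac
    mkdir -p "$dest" && ( cd "$dest" && cpio -idm < "$archive" )
}

# Pass-through: file names on stdin, files copied straight into DEST_DIR
# with no intermediate archive (roughly `cp -pdr`).
cpio_pass() {  # usage: cpio_pass SRC_DIR DEST_DIR
    src=$1
    dest=$2
    mkdir -p "$dest"
    case $dest in /*) ;; *) dest=$PWD/$dest ;; esac
    ( cd "$src" && find . -depth -print | cpio -pdm "$dest" )
}
```

For example, `cpio_out src backup.cpio` followed by `cpio_in backup.cpio restore` should leave `restore` looking like `src`, and `cpio_pass src mirror` should do the same in one step.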

I have not compared their performance, but I expect they'll be quite similar in CPU, memory, and archive size (after compression).


tar(1) is just as good as cpio(1), if not better. One can argue that it is, in fact, better than cpio because it is ubiquitous and vetted. There's got to be a reason why we have tarballs everywhere.


Why is cpio better than tar? A number of reasons.

  1. cpio preserves hard links, which is important if you're using it for backups.
  2. cpio doesn't have that annoying filename length limitation. Sure, gnutar has a "hack" that allows you to use longer filenames (it creates a temporary file in which it stores the real name), but it's inherently not portable to non-GNU tars.
  3. By default, cpio preserves timestamps.
  4. When scripting, it has much better control over which files are and are not copied, since you must explicitly list the files you want copied. For example, which of the following is easier to read and understand?

    find . -type f -name '*.sh' -print | cpio -o | gzip >sh.cpio.gz

    or on Solaris:

    find . -type f -name '*.sh' -print >/tmp/includeme
    tar -cf - . -I /tmp/includeme | gzip >sh.tar.gz

    or with gnutar:

    find . -type f -name '*.sh' -print >/tmp/includeme
    tar -cf - . --files-from=/tmp/includeme | gzip >sh.tar.gz

    A couple of specific notes here: for large lists of files, you can't put find in backquotes, because the command-line length limit will be overrun; you must use an intermediate file. Separate find and tar commands are inherently slower, since the actions are done serially.

    Consider this more complex case where you want a tree completely packaged up, but some files in one tar, and the remaining files in another.

    find . -depth -print >/tmp/files
    egrep    '\.sh$' /tmp/files | cpio -o | gzip >with.cpio.gz
    egrep -v '\.sh$' /tmp/files | cpio -o | gzip >without.cpio.gz

    or under Solaris:

    find . -depth -print >/tmp/files
    egrep    '\.sh$' /tmp/files >/tmp/with
    egrep -v '\.sh$' /tmp/files >/tmp/without
    tar -cf - . -I /tmp/with    | gzip >with.tar.gz
    tar -cf - .    /tmp/without | gzip >without.tar.gz
    ##          ^^-- no there's no missing argument here.  It's just empty that way

    or with gnutar:

    find . -depth -print >/tmp/files
    egrep    '\.sh$' /tmp/files >/tmp/with
    egrep -v '\.sh$' /tmp/files >/tmp/without
    tar -cf - . -I /tmp/with    | gzip >with.tar.gz
    tar -cf - . -X /tmp/without | gzip >without.tar.gz

    Again, some notes: Separate find and tar commands are inherently slower. Creating more intermediate files creates more clutter. gnutar feels a little cleaner, but the command-line options are inherently incompatible!

  5. If you need to copy a lot of files from one machine to another in a hurry across a busy network, you can run multiple cpio's in parallel. For example:

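    find . -depth -print >/tmp/files
    # Give split an explicit prefix so the pieces land in /tmp/filesaa, /tmp/filesab, ...
    split /tmp/files /tmp/files
    for F in /tmp/files?? ; do
      cat "$F" | cpio -o | ssh destination "cd /target && cpio -idum" &
    done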

    Note that it would help if you could split the input into evenly sized pieces. I created a utility called 'npipe' to do this. npipe reads lines from stdin, creates N output pipes, and feeds the lines to them as each line is consumed. This way, if the first entry is a large file that takes 10 minutes to transfer and the rest are small files that take 2 minutes to transfer, you don't get stalled waiting for the large file plus another dozen small files queued up behind it. You end up splitting by demand, not strictly by number of lines or bytes in the list of files. Similar functionality could be accomplished with GNU xargs' parallel forking capability, except that it puts arguments on the command line instead of streaming them to stdin.

    find . -depth -print >/tmp/files
    npipe -4 /tmp/files 'cpio -o | ssh destination "cd /target && cpio -idum"'

    How is this faster? Why not use NFS? Why not use rsync? NFS is inherently very slow, but more importantly, any single tool is inherently single-threaded. rsync reads in the source tree and writes to the destination tree one file at a time. If you have a multiprocessor machine (at the time I was using 16 CPUs per machine), parallel writing becomes very important. I cut the copy of an 8GB tree down to 30 minutes; that's 4.6MB/sec! Sure, it sounds slow since a 100Mbit network can easily do 5-10MB/sec, but it's the inode creation time that makes it slow; there were easily 500,000 files in this tree. Since inode creation was the bottleneck, I needed to parallelize that operation. By comparison, copying the files in a single-threaded manner would take 4 hours. That's 8x faster!

    A secondary reason that this was faster is that parallel TCP pipes are less vulnerable to a lost packet here and there. If one pipe gets stalled because of a lost packet, the others will generally not be affected. I'm not really sure how much of a difference this made, but for finely multi-threaded kernels, this can again be more efficient since the workload can be spread across all those idle CPUs.
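The parallel copy in item 5 can be tried locally with a sketch like the one below. It is not the npipe approach: it deals names out round-robin rather than by demand, and it makes several assumptions that are not from the post, namely GNU split (for `-n r/N`), a local destination directory standing in for the ssh target, and file names without embedded newlines. The function name `parallel_copy` is made up for illustration.

```shell
#!/bin/sh
# parallel_copy SRC DEST [JOBS]: copy a tree using several cpio processes.
# Hypothetical helper; assumes GNU split and a local DEST instead of ssh.
parallel_copy() {
    src=$1
    dest=$2
    njobs=${3:-4}
    mkdir -p "$dest"
    case $dest in /*) ;; *) dest=$PWD/$dest ;; esac
    lists=$(mktemp -d) || return 1
    (
        cd "$src" || exit 1
        # Pass 1: create the directory skeleton serially, so the parallel
        # workers never race each other to create the same directory.
        find . -type d -print | cpio -pd "$dest"
        # Pass 2: deal the non-directory names round-robin into $njobs lists.
        find . ! -type d -print | split -n "r/$njobs" - "$lists/part."
        # One background cpio per list, then wait for all of them.
        for list in "$lists"/part.*; do
            cpio -pdm "$dest" < "$list" &
        done
        wait
    )
    rm -rf "$lists"
}
```

To go over the network as in the post, the worker line would become `cpio -o < "$list" | ssh destination "cd /target && cpio -idum" &`. Creating directories in a separate first pass is a design choice of this sketch, not something the original commands do.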

In my experience, cpio does an overall better job than tar, and its arguments are more portable (they don't change between versions of cpio!). It may not be found on some systems (it's not installed by default on Red Hat), but then again Solaris doesn't come with gzip by default either.
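For comparison with the scripting examples in item 4: where GNU tar is available, it can also take its file list from stdin, which avoids the intermediate file. The sketch below assumes GNU tar (for `--null`, `--no-recursion`, and `-T -`) and a find that supports `-print0`; the function name `sh_tarball` is hypothetical. It does nothing for the portability complaint, since these options are not available in every tar.

```shell
#!/bin/sh
# sh_tarball DIR ARCHIVE: archive only the *.sh files under DIR.
# Hypothetical helper; assumes GNU tar and find -print0.
sh_tarball() {
    dir=$1
    archive=$2
    case $archive in /*) ;; *) archive=$PWD/$archive ;; esac
    # --null and --no-recursion must come before -T so they apply to the
    # names read from stdin; NUL separators keep odd file names intact.
    ( cd "$dir" && find . -type f -name '*.sh' -print0 |
        tar -cf - --null --no-recursion -T - | gzip > "$archive" )
}
```

Calling `sh_tarball . sh.tar.gz` is meant to be the streaming counterpart of the `find ... | cpio -o | gzip >sh.cpio.gz` pipeline above.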