Presenting two ways to skim tar-format files: by parsing the headers directly, and by using the purpose-built Archive::Tar module.

The file date of an archive is useful to keep around, for chronological listings or for judging its age at a glance. It is, however, often lost as the files get downloaded, copied or moved. An obvious fix is to reset the date to that of the most recent member contained within, and a script to this end is what I implemented years ago. If there is, or ever was, a proper tool for that already, I wouldn't know.

But old TODOs came to my attention again recently. What better time to clean up some old code, perl-based and all? In particular, there was this bit to decompress the files with an external utility:

... ? ... : $file =~ /bz2$/i ? open($fh, '-|', 'bzcat', '--', $file) : open($fh, '<', $file);
DOS-like logic, based on file suffix? Very un-unix and un-cool. IO::Uncompress::AnyUncompress to the rescue!
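A minimal sketch of that switch (mine, with $file standing in for the archive path): the module sniffs the compression format from the data itself, so the suffix dispatch simply disappears and uncompressed files fall through untouched.

use IO::Uncompress::AnyUncompress;

# AnyUncompress detects the format from the magic bytes; plain
# (uncompressed) input is passed through transparently by default.
my $fh = IO::Uncompress::AnyUncompress->new($file)
    or die "cannot open $file: $IO::Uncompress::AnyUncompress::AnyUncompressError\n";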

Minutes later, there it is: version II, shorter and neater by a fair bit.

#! /usr/bin/perl
# touch tar archive mtime timestamp
# Usage: $0 [-z] [-n] files ...
#   -z  also check gzip archive time
#   -n  don't actually touch, show what would be done
use strict;
use warnings;
use Getopt::Std;
getopts('zn', \my %opt);
use List::Util q(max);
use IO::Uncompress::AnyUncompress;

use constant HDR_UNPACK => 'a100 a8 a8 a8 a12 a12 a8 a1 a100 a6 a2 a32 a32 a8 a8 a155 a12';
use constant HDR_FIELDS => qw(name mode uid gid size mtime chksum type linkname magic version uname gname devmajor devminor prefix _pad);
use constant HDR_OCTALS => qw(mode uid gid size mtime chksum devmajor devminor);

sub tar_header {
    my ($hdr, $r) = @_;
    @$r{HDR_FIELDS,} = map /([^\0]*)/, unpack(HDR_UNPACK, $hdr);
    #return if $$r{magic} !~ /^ustar/;   # some tar-s use weird magic
    return if $$r{_pad} ne '' || grep /[^0-7 ]/, @$r{HDR_OCTALS,};
    $_ = oct for @$r{HDR_OCTALS,};       # fix octal fields
    substr($hdr, 148, 8) = ' ' x 8;
    $$r{chksum} == unpack('%C*', $hdr) || $$r{chksum} == unpack('%c*', $hdr)
}

sub tar_time {
    my ($fnam, $skip, $mtime, $ztime, $fh, $buf) = (shift, 0, 0);
    $fh = new IO::Uncompress::AnyUncompress($fnam) or return;
    $ztime = $opt{z} && ($fh->getHeaderInfo || {})->{Time} || 0;
    while (read($fh, $buf, 0x200) == 0x200) {
        next if $skip-- > 0;
        next if !tar_header($buf, \my %h);
        $mtime = $h{mtime} if $mtime < $h{mtime};
        next if $h{type} && $h{type} ne "L";
        $skip = ($h{size} + 0x1ff) >> 9;
        $skip = 0 if $skip && seek($fh, $skip << 9, 1);
    }
    return { 'gzip' => $ztime, 'tar' => $mtime };
}

foreach (@ARGV) {
    my ($r, $t);
    -f && ($r = tar_time($_)) && ($t = max values %$r) && ($t != (stat)[9])
       && ($opt{n} || utime($t, $t, $_))
       && printf "%-60s %s (%s)\n", $_, scalar localtime($t),
          $t == $r->{tar} ? q(tar time) : q(zip time);
}
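For anyone not fluent in ustar: the archive is a sequence of 512-byte (0x200) blocks; each member starts with one header block whose checksum is computed with the chksum field (offset 148) blanked to spaces, and a few historic tars summed signed bytes, hence the double unpack('%C*')/unpack('%c*') check; member data is then padded out to whole blocks, which is what the shift arithmetic rounds up to. A tiny illustration of mine, not from the original script:

# Round a member size up to whole 512-byte blocks, as $skip does above.
for my $size (0, 1, 511, 512, 513, 10_000) {
    my $blocks = ($size + 0x1ff) >> 9;      # same as ceil($size / 512)
    printf "%6d bytes -> %3d block(s)\n", $size, $blocks;
}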

Hacking on them tar headers is entertaining for sure, but let's try Archive::Tar now, a module purpose-built for tasks like this. And behold: version III.

#! /usr/bin/perl
# Usage: $0 [-z] [-n] files ...
#   -z  also check gzip archive time
#   -n  don't actually touch, show what would be done
use Getopt::Std;
getopts('zn', \my %opt);
use List::Util q(max);
use IO::Uncompress::AnyUncompress;
use Archive::Tar;
$Archive::Tar::WARN = 0;

foreach (grep -f, @ARGV) {
    my $fh = new IO::Uncompress::AnyUncompress($_) or next;
    my $zt = $opt{z} && ($fh->getHeaderInfo || {})->{Time} || 0;
    my $tt = max map $_->{mtime}, Archive::Tar->list_archive($fh, 0, [q(mtime)]);
    my $t  = max $zt, $tt;
    $t && ($t != (stat)[9])
       && ($opt{n} || utime($t, $t, $_))
       && printf "%-60s %s (%s)\n", $_, scalar localtime($t),
          ($t == $tt) ? q(tar time) : q(zip time);
}
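For reference, list_archive() is a class method: given a file or handle, a compression flag and an array ref of property names, it returns one hashref per member carrying just those properties. A standalone sketch, with foo.tar as a made-up file name:

use Archive::Tar;
use List::Util q(max);

# Ask only for 'mtime' of each member; 0 means the input is not compressed.
my @members = Archive::Tar->list_archive('foo.tar', 0, ['mtime']);
my $newest  = max map { $_->{mtime} } @members;
print scalar localtime($newest), "\n";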
One-third of the previous size! Cut loose the reporting and the gzip-time foo, and we'd arrive in one-liner territory. But this brevity comes with gotchas of its own.

Giving it a second glance, the original script seems to do just fine as it is. Some TODOs may stay a while longer, I think.

Re: Dating .tar Archives
by sundialsvc4 (Abbot) on Dec 31, 2014 at 15:37 UTC

    As an aside, I once had to deal with a site that needed to store things such that there were reliable modification dates for the stored material, without the benefit of Microsoft SharePoint or any other such goodness. What I wound up doing was to Zip-compress the material and then store the Zip files. The header of a Zip archive entry includes the file mod-date at the time the archive was created. So, even though the mod-date of the archive file would change, it carried within itself an easily queried date-stamp that would not. I couldn't use tar in this particular project, but it would have done the same.
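    For illustration (not from the reply itself): those per-entry dates are easy to read back from Perl with Archive::Zip; stuff.zip is a made-up name.

    use Archive::Zip;

    # Each zip member records its own last-modification time.
    my $zip = Archive::Zip->new('stuff.zip') or die "cannot read stuff.zip\n";
    printf "%-40s %s\n", $_->fileName, scalar localtime($_->lastModTime)
        for $zip->members;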

      tar files are not guaranteed to have any kind of "archive header". GNU tar has an optional "volume header", used to help identify the members and sequence of multi-volume[1] archives. Other versions of tar might not have this feature. Of course, if a tar file does have a volume header, its mtime would be the creation date of the archive, so one could stop reading the file once the volume header is read (a rough sketch of such a check follows the footnote).

      ---

      1 "multi-volume" means the tar file may not be the complete archive. The archive may been created in size limited chunks. Use of this feature makes the individual tar files usefully extractable (as opposed to splitting a single tar file, which would require re-assmbly, first). Of course, a file that crosses volume boundries can only be partially extracted.