comment on

Presenting two ways to skim tar format files: via direct parsing and using the specific module.

The file date of an archive is useful to keep around for chronological listings, or determining its age at a glance. It is however often times lost as the files get downloaded, copied or moved. An obvious fix is to reset the date to that of the most recent member contained within. And a script to this end is what I implemented, years ago. If there is or was a proper tool for that already, I wouldn't know.

But old TODOs came to my attention again recently. What better time to clean up some old code, perl-based and all? In particular, there was this bit to decompress the files with an external utility:

... ? ...
: $file =~ /bz2$/i ? open($fh, '-|', 'bzcat', '--', $file)
: open($fh, '<', $file);
[download]

DOS-like logic, based on file suffix? Very un-unix and un-cool. IO::Uncompress::AnyUncompress to the rescue!

Minutes later, there it is — the version II — shorter and neater by a fair bit.

#! /usr/bin/perl

# touch tar archive mtime timestamp

# Usage: $0 [-z] [-n] files ...
#       -z  also check gzip archive time
#       -n  don't actually touch, show what would be done

use strict;
use warnings;

use Getopt::Std;
getopts('zn', \my %opt);

use List::Util q(max);
use IO::Uncompress::AnyUncompress;

use constant HDR_UNPACK => '  a100   a8  a8  a8  a12   a12     a8   a1
+     a100    a6      a2   a32   a32       a8       a8   a155  a12';
use constant HDR_FIELDS => qw(name mode uid gid size mtime chksum type
+ linkname magic version uname gname devmajor devminor prefix _pad);
use constant HDR_OCTALS => qw(mode uid gid size mtime chksum devmajor 
+devminor);

sub tar_header {
        my ($hdr, $r) = @_;

        @$r{HDR_FIELDS,} = map /([^\0]*)/, unpack(HDR_UNPACK, $hdr);

        #return if $$r{magic} !~ /^ustar/;      # some tar-s use weird
+ magic
        return if $$r{_pad} ne '' || grep /[^0-7 ]/, @$r{HDR_OCTALS,};

        $_ = oct for @$r{HDR_OCTALS,};          # fix octal fields

        substr($hdr, 148, 8) = ' 'x8;
        $$r{chksum} == unpack('%C*', $hdr) || $$r{chksum} == unpack('%
+c*', $hdr)
}

sub tar_time {
        my ($fnam, $skip, $mtime, $ztime, $fh, $buf) = (shift, 0, 0);

        $fh = new IO::Uncompress::AnyUncompress($fnam) or return;
        $ztime = $opt{z} && ($fh->getHeaderInfo||{})->{Time} || 0;

        while (read($fh, $buf, 0x200) == 0x200) {
                next if $skip-- > 0;
                next if !tar_header($buf, \my %h);

                $mtime = $h{mtime} if $mtime < $h{mtime};

                next if $h{type} && $h{type} ne "L";
                $skip = ($h{size} + 0x1ff) >> 9;
                $skip = 0 if $skip && seek($fh, $skip<<9, 1);
        }

        return { 'gzip' => $ztime, 'tar' => $mtime };
}

foreach (@ARGV) {
        my ($r, $t);
        -f && ($r = tar_time($_)) && ($t = max values %$r) &&
        ($t != (stat)[9]) && ($opt{n} || utime($t, $t, $_)) &&
        printf "%-60s %s (%s)\n", $_, scalar localtime($t),
                $t == $r->{tar} ? q(tar time) : q(zip time);
}
[download]

Hacking on them tar headers is entertaining for sure, but let's try Archive::Tar now — a module purpose-built for tasks like that. And behold: the version III.

#! /usr/bin/perl

# Usage: $0 [-z] [-n] files ...
#       -z  also check gzip archive time
#       -n  don't actually touch, show what would be done

use Getopt::Std;
getopts('zn', \my %opt);

use List::Util q(max);
use IO::Uncompress::AnyUncompress;

use Archive::Tar;
$Archive::Tar::WARN = 0;

foreach (grep -f, @ARGV) {
        my $fh = new IO::Uncompress::AnyUncompress($_) or next;
        my $zt = $opt{z} && ($fh->getHeaderInfo||{})->{Time} || 0;
        my $tt = max map $_->{mtime}, Archive::Tar->list_archive($fh, 
+0, [q(mtime)]);
        my $t = max $zt, $tt;

        $t && ($t != (stat)[9]) && ($opt{n} || utime($t, $t, $_)) &&
        printf "%-60s %s (%s)\n", $_, scalar localtime($t),
                ($t == $tt) ? q(tar time) : q(zip time);
}
[download]

One-third of the previous size! Cut loose the reporting, the gzip-time foo, and we'd arrive at a one-liner territory. But this brevity has some gotchas. Let's see:

Lots of memory is consumed reading big archives. Apparently list_archive method reads the uncompressed data in full. Is there no "metadata-only" flag one could peruse?
Another thing, list_archive has special cased the [q(name)] to return a flat list instead of hashes. Why not support both [qw(...)] and q(item) requests? Then one might simply write:
my $t = max Archive::Tar->list_archive($file, 1, "mtime");
The lzma/xz modules need to be installed separately for those to work. Release 5.20.1 does not (yet?) include them.

Giving it a second glance, the original script seems to do fine as it was. Some TODOs may stay a while longer, I think.

In reply to Dating .tar Archives by oiskuu

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.