Presenting two ways to skim tar-format files: by parsing the headers directly, and by using the purpose-built Archive::Tar module.

The file date of an archive is useful to keep around, for chronological listings or for judging its age at a glance. It is, however, often lost as the files get downloaded, copied or moved. An obvious fix is to reset the date to that of the most recent member contained within, and a script to this end is what I implemented years ago. If there is, or ever was, a proper tool for that already, I wouldn't know.

But old TODOs came to my attention again recently. What better time to clean up some old code, perl-based and all? In particular, there was this bit to decompress the files with an external utility:

... ? ... : $file =~ /bz2$/i ? open($fh, '-|', 'bzcat', '--', $file) : open($fh, '<', $file);
DOS-like logic, based on file suffix? Very un-unix and un-cool. IO::Uncompress::AnyUncompress to the rescue!
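A minimal sketch of that switch (mine, with $file standing in for the archive path): the module sniffs the compression format from the data itself, so the suffix dispatch simply disappears and uncompressed files fall through untouched.

use IO::Uncompress::AnyUncompress;

# AnyUncompress detects the format from the magic bytes; plain
# (uncompressed) input is passed through transparently by default.
my $fh = IO::Uncompress::AnyUncompress->new($file)
    or die "cannot open $file: $IO::Uncompress::AnyUncompress::AnyUncompressError\n";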

Minutes later, there it is: version II, shorter and neater by a fair bit.

#! /usr/bin/perl
# touch tar archive mtime timestamp
# Usage: $0 [-z] [-n] files ...
#   -z  also check gzip archive time
#   -n  don't actually touch, show what would be done
use strict;
use warnings;
use Getopt::Std;
getopts('zn', \my %opt);
use List::Util q(max);
use IO::Uncompress::AnyUncompress;

use constant HDR_UNPACK => 'a100 a8 a8 a8 a12 a12 a8 a1 a100 a6 a2 a32 a32 a8 a8 a155 a12';
use constant HDR_FIELDS => qw(name mode uid gid size mtime chksum type linkname magic version uname gname devmajor devminor prefix _pad);
use constant HDR_OCTALS => qw(mode uid gid size mtime chksum devmajor devminor);

sub tar_header {
    my ($hdr, $r) = @_;
    @$r{HDR_FIELDS,} = map /([^\0]*)/, unpack(HDR_UNPACK, $hdr);
    #return if $$r{magic} !~ /^ustar/;   # some tar-s use weird magic
    return if $$r{_pad} ne '' || grep /[^0-7 ]/, @$r{HDR_OCTALS,};
    $_ = oct for @$r{HDR_OCTALS,};       # fix octal fields
    substr($hdr, 148, 8) = ' ' x 8;
    $$r{chksum} == unpack('%C*', $hdr) || $$r{chksum} == unpack('%c*', $hdr)
}

sub tar_time {
    my ($fnam, $skip, $mtime, $ztime, $fh, $buf) = (shift, 0, 0);
    $fh = new IO::Uncompress::AnyUncompress($fnam) or return;
    $ztime = $opt{z} && ($fh->getHeaderInfo || {})->{Time} || 0;
    while (read($fh, $buf, 0x200) == 0x200) {
        next if $skip-- > 0;
        next if !tar_header($buf, \my %h);
        $mtime = $h{mtime} if $mtime < $h{mtime};
        next if $h{type} && $h{type} ne "L";
        $skip = ($h{size} + 0x1ff) >> 9;
        $skip = 0 if $skip && seek($fh, $skip << 9, 1);
    }
    return { 'gzip' => $ztime, 'tar' => $mtime };
}

foreach (@ARGV) {
    my ($r, $t);
    -f && ($r = tar_time($_)) && ($t = max values %$r) && ($t != (stat)[9])
       && ($opt{n} || utime($t, $t, $_))
       && printf "%-60s %s (%s)\n", $_, scalar localtime($t),
          $t == $r->{tar} ? q(tar time) : q(zip time);
}
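For anyone not fluent in ustar: the archive is a sequence of 512-byte (0x200) blocks; each member starts with one header block whose checksum is computed with the chksum field (offset 148) blanked to spaces, and a few historic tars summed signed bytes, hence the double unpack('%C*')/unpack('%c*') check; member data is then padded out to whole blocks, which is what the shift arithmetic rounds up to. A tiny illustration of mine, not from the original script:

# Round a member size up to whole 512-byte blocks, as $skip does above.
for my $size (0, 1, 511, 512, 513, 10_000) {
    my $blocks = ($size + 0x1ff) >> 9;      # same as ceil($size / 512)
    printf "%6d bytes -> %3d block(s)\n", $size, $blocks;
}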

Hacking on them tar headers is entertaining for sure, but let's try Archive::Tar now, a module purpose-built for tasks like this. And behold: version III.

#! /usr/bin/perl
# Usage: $0 [-z] [-n] files ...
#   -z  also check gzip archive time
#   -n  don't actually touch, show what would be done
use Getopt::Std;
getopts('zn', \my %opt);
use List::Util q(max);
use IO::Uncompress::AnyUncompress;
use Archive::Tar;
$Archive::Tar::WARN = 0;

foreach (grep -f, @ARGV) {
    my $fh = new IO::Uncompress::AnyUncompress($_) or next;
    my $zt = $opt{z} && ($fh->getHeaderInfo || {})->{Time} || 0;
    my $tt = max map $_->{mtime}, Archive::Tar->list_archive($fh, 0, [q(mtime)]);
    my $t  = max $zt, $tt;
    $t && ($t != (stat)[9])
       && ($opt{n} || utime($t, $t, $_))
       && printf "%-60s %s (%s)\n", $_, scalar localtime($t),
          ($t == $tt) ? q(tar time) : q(zip time);
}
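For reference, list_archive() is a class method: given a file or handle, a compression flag and an array ref of property names, it returns one hashref per member carrying just those properties. A standalone sketch, with foo.tar as a made-up file name:

use Archive::Tar;
use List::Util q(max);

# Ask only for 'mtime' of each member; 0 means the input is not compressed.
my @members = Archive::Tar->list_archive('foo.tar', 0, ['mtime']);
my $newest  = max map { $_->{mtime} } @members;
print scalar localtime($newest), "\n";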
One-third of the previous size! Cut loose the reporting and the gzip-time foo, and we'd arrive in one-liner territory. But this brevity comes with gotchas of its own.

Giving it a second glance, the original script seems to do just fine as it is. Some TODOs may stay a while longer, I think.

Re: Dating .tar Archives
by sundialsvc4 (Abbot) on Dec 31, 2014 at 15:37 UTC

    As an aside, I once had to deal with a site that needed to store things such that there were reliable modification dates for the stored material, without the benefit of Microsoft SharePoint or any other such goodness. What I wound up doing was to Zip-compress the material and then store the Zip files. The header of a Zip archive entry includes the file mod-date at the time the archive was created. So, even though the mod-date of the archive file would change, it carried within itself an easily queried date-stamp that would not. I couldn't use tar in this particular project, but it would have done the same.
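    For illustration (not from the reply itself): those per-entry dates are easy to read back from Perl with Archive::Zip; stuff.zip is a made-up name.

    use Archive::Zip;

    # Each zip member records its own last-modification time.
    my $zip = Archive::Zip->new('stuff.zip') or die "cannot read stuff.zip\n";
    printf "%-40s %s\n", $_->fileName, scalar localtime($_->lastModTime)
        for $zip->members;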

      tar files are not guaranteed to have any kind of "archive header". GNU tar has an optional "volume header", used to help identify the members and sequence of multi-volume[1] archives. Other versions of tar might not have this feature. Of course, if a tar file does have a volume header, its mtime would be the creation date of the archive, so one could stop reading the file once the volume header is read (a rough sketch of such a check follows the footnote).

      ---

      1 "multi-volume" means the tar file may not be the complete archive. The archive may been created in size limited chunks. Use of this feature makes the individual tar files usefully extractable (as opposed to splitting a single tar file, which would require re-assmbly, first). Of course, a file that crosses volume boundries can only be partially extracted.