nimdokk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to work up a regular expression that will capture the file name from the following output. This is to be used in a Perl routine to capture filenames from within a Zip file. (There may be modules that will do what I need, but I cannot install them on the system, and because of the way things are run it would not be a good idea to run them out of my home directory, so extra modules are not an option at this point in time.) The data would look like this:

0 Stored 0 0% 04-20-05 08:43 00000000 test 1 2 3.zip
704106 DeflatN 83362 89% 04-04-05 19:00 8e76dc22 file1.dat

What I need to get out would be the file name, so in the first example, it would be "test 1 2 3.zip" and in the second "file1.dat". If I knew there would never be spaces in the names, no problem, but that's not something I can guarantee. Also, file names might be in a mix of upper and lower case. What I have come up with is the following (untested as of yet):

$line =~ /\d\s{1,2}\w{6,7}\s{1,}%\s{2}\d{2}-\d{2}-\d{2}\s{2}\d{2}:\d{2}\s{2}\w{8}\s{1,}(\w {1,}\.\w{3})/;
push @temp, $1;

This would be nested inside of a foreach loop that loops through a temp file containing the data. @temp is then returned back to the main script calling this function. My main question is in looking for a better way to handle the regular expression match.

Any ideas would be welcome. Thanks

Re: Regular Expressions Matching with Perl
by ikegami (Patriarch) on Apr 20, 2005 at 17:40 UTC

    What about

    chomp($line); push(@temp, (split(' ', $line, 8))[-1]);

    Tested.
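A minimal sketch of that one-liner inside a loop, for anyone who wants to try it. It assumes the listing is column-aligned with leading spaces in front of short sizes; the sub name `filenames_from_listing` is made up for illustration, and the two records are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the eighth whitespace-separated field of each line.
sub filenames_from_listing {
    my @lines = @_;
    my @names;
    for my $line (@lines) {
        chomp $line;
        # split ' ' skips leading whitespace; the limit of 8 leaves the rest
        # of the line, embedded spaces included, in the last field.
        push @names, (split ' ', $line, 8)[-1];
    }
    return @names;
}

my @sample = (
    "       0  Stored        0   0%  04-20-05 08:43  00000000  test 1 2 3.zip\n",
    "  704106  DeflatN   83362  89%  04-04-05 19:00  8e76dc22  file1.dat\n",
);

print "$_\n" for filenames_from_listing(@sample);
```

This only holds while the first seven columns never contain a space themselves.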

      Is it guaranteed that the compression method field will never nest a space, like "LZW cmp" or something?

      When building a regexp against sample data (as opposed to "against a specification") my approach tends to be exactly the opposite of Fletch's - make the regexp constrain as much as possible, so that I can warn if I ever see new data that violates my expectations:

      $line =~ m{^
          \s* \d+               # size?
          \s+ \w+               # compression method
          \s+ \d+               # compressed size?
          \s+ \d+ %             # compression ratio
          \s+ \d+ - \d+ - \d+   # date
          \s+ \d+ : \d+         # time
          \s+ [0-9a-f]{8}       # checksum?
          \s+ (.*)              # filename
      $}xi or warn "Couldn't match input line '$line'";
      $filename = $1;

      It is worth checking whether it is possible to store a filename with some odd characters to see what happens, such as a newline, backslash etc. Similarly it is worth looking for boundary conditions on other fields - if the size is more than 8 digits does it still retain at least one following space?

      Hugo
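A sketch of how the validating style might be used over several lines, so that anything unexpected is reported rather than silently mangled. The second input line is a made-up non-record, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Strict pattern: every column must look as expected, or the line is rejected.
my $entry = qr{
    ^ \s* \d+               # uncompressed size
    \s+ \w+                 # compression method
    \s+ \d+                 # compressed size
    \s+ \d+ %               # compression ratio
    \s+ \d+ - \d+ - \d+     # date
    \s+ \d+ : \d+           # time
    \s+ [0-9a-f]{8}         # checksum
    \s+ (.*) $              # filename, up to end of line
}xi;

my @lines = (
    "  704106  DeflatN   83362  89%  04-04-05 19:00  8e76dc22  file1.dat",
    "this is not a listing line",    # made-up input that should be rejected
);

my (@names, @rejected);
for my $line (@lines) {
    if ($line =~ $entry) {
        push @names, $1;
    }
    else {
        push @rejected, $line;
        warn "Couldn't match input line '$line'\n";
    }
}

print "matched: @names\n";
```

Keeping the rejected lines around makes it easy to see which expectation about the format was wrong.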

        Is it guaranteed that the compression method field will never nest a space, like "LZW cmp" or something?

        yes, I think it's always one word (and probably specifically for easy parsing, judging by the odd names).

        When building a regexp against sample data my approach tends to be exactly the opposite of Fletch's

        I call the two approaches "Extraction" (/:.{15}(.*)/) and "Validation" (yours). Which I use is determined by the situation. Sometimes, there's a happy middle that's a mixture of both (Fletch's /[[:xdigit:]]{8}\s+(.*)$/).

Re: Regular Expressions Matching with Perl
by Transient (Hermit) on Apr 20, 2005 at 17:36 UTC
    Are there always 8 columns of data? Is the file name always last?

    If so (untested) -
    $filename = (split( /\s+/,$line, 8 ))[-1]
    If it's always at a certain index, you could go off of that, also. Regexp's are useful, but not always necessary!

    Update:
    As ikegami points out, this will not work quite the way you want because of the "null field" that split produces in front of the leading spaces, as described in the split documentation.

    $filename = (split( ' ',$line, 8 ))[-1]
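The difference between the two calls can be seen in a short sketch, assuming a record that starts with spaces (as the first sample line does in a column-aligned listing):

```perl
use strict;
use warnings;

my $line = "       0  Stored        0   0%  04-20-05 08:43  00000000  test 1 2 3.zip";

# A /\s+/ pattern treats the leading run of spaces as a separator, so the
# first field is an empty string and every column shifts right by one.
my @by_regex = split /\s+/, $line, 8;

# The special string ' ' discards leading whitespace before splitting.
my @by_space = split ' ', $line, 8;

my $last_regex = $by_regex[-1];
my $last_space = $by_space[-1];

print "with /\\s+/: '$last_regex'\n";
print "with ' '  : '$last_space'\n";
```

With the pattern form, the eighth field still contains the checksum column in front of the name; with the `' '` form it is the name alone.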
      This won't work because of the leading spaces. See my reply for the fix.
      Right, and when I tried it on a sample containing three files, it worked perfectly on the first line, but on the second two, it grabbed the 7th column as well. It looks like this is definitely taking me in the right direction. I'd tried the split route first but the spaces in the filename would throw everything off. I'll play around with some of these suggestions and see what I can come up with. Thanks for the quick responses.

      Update: I tried the updated line and it worked beautifully, need to do some more testing, but I think I've got a winner here. Thanks again.

      It will always be in the 8th column - I'll give that a shot and see. Just tried my regexp and it didn't do a thing. I'll give this a shot and see what happens.
Re: Regular Expressions Matching with Perl
by davidrw (Prior) on Apr 20, 2005 at 17:44 UTC
    If you were able to use modules, i would suggest Archive::Zip. But here is a non-module solution:
    $line =~ s/^\s+//g;                            # strip leading whitespace
    $line =~ s/\s+$//g;                            # strip trailing whitespace
    my @cols = split(/ +/, $line);                 # split on spaces
    my $filename = join(' ', splice(@cols, 7) );   # piece back together the filename
    push @temp, $filename;                         # store filename
    Could possibly try a fixed-width solution, but i'd be worried that it wouldn't work if the first or third columns varied too much in size.

    Update: I think i overthought a little and forgot about the LIMIT parameter to split() -- probably better than breaking and re-gluing the filename.
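One concrete reason the LIMIT form is safer, sketched with a made-up record whose file name contains two consecutive spaces: splitting everything and re-joining with a single space cannot restore the original spacing, while LIMIT leaves the tail of the line untouched.

```perl
use strict;
use warnings;

# Made-up record: the file name contains two consecutive spaces.
my $line = "  704106  DeflatN   83362  89%  04-04-05 19:00  8e76dc22  my  file.dat";

# Split everything, then glue columns 8 onward back together.
my @cols    = split ' ', $line;
my $reglued = join ' ', @cols[7 .. $#cols];

# Let LIMIT stop the splitting after seven separators.
my $limited = (split ' ', $line, 8)[-1];

print "reglued: '$reglued'\n";
print "limited: '$limited'\n";
```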
      I've thought about Archive::Zip as well, but it's not one on our system. :-)
        There is an alternative way to "install" pure-Perl modules: copy and paste their source code directly into your script. It will require some changes, but usually small, trivial ones.
Re: Regular Expressions Matching with Perl
by Fletch (Bishop) on Apr 20, 2005 at 17:40 UTC

    Regexen should be as short as possible, but no shorter.

    my( $file ) = /[[:xdigit:]]{8}\s+(.*)$/;

      Too short. That fails if the file size is 10,000,000 bytes or more.
      my ($file) = $line =~ /:.{15}(.*)/;
      would be minimal.

        Erm, it's matching against the 8 hex digits of the CRC-32. How is that affected by the file size? Granted on further examination of unzip -v output it should be anchored off the date as well.

        /:\d\d\s\s[[:xdigit:]]{8}\s\s(.*)$/
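The file-size objection can be checked with a small sketch. The record below is made up: its size column is eight digits wide, and eight decimal digits are also eight hex digits. The column spacing is an assumption, with two spaces on either side of the checksum.

```perl
use strict;
use warnings;

# Made-up record whose size column is eight digits wide.
my $line = "12345678  DeflatN 1234567  90%  04-04-05 19:00  8e76dc22  big.dat";

# Unanchored: the leftmost run of eight hex digits is the size, not the CRC.
my ($loose) = $line =~ /[[:xdigit:]]{8}\s+(.*)$/;

# Anchored off the minutes of the time stamp.
my ($anchored) = $line =~ /:\d\d\s\s[[:xdigit:]]{8}\s\s(.*)$/;

print "loose:    '$loose'\n";
print "anchored: '$anchored'\n";
```

The unanchored pattern matches at the size column and captures everything after it, while the anchored one captures only the name.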
Re: Regular Expressions Matching with Perl
by NateTut (Deacon) on Apr 20, 2005 at 17:40 UTC
    I would think substr would be a good choice too since you seem to be dealing with fixed length fields/records.
      Unfortunately, I think substr will fail when the raw size of the file is 10MB or more, when the compressed size of the file is 10MB or more, or when the compression ratio is 100%. (I've seen it round to 100% once.)
        You wouldn't need to run substr on the whole file at one time, you would run it against each line of the file separately like this:
        use strict;
        use warnings;
        use Data::Dumper;

        my @Temp;
        use constant FileNameStart => 58;   # column where the file name begins

        while (<DATA>) {
            chomp;
            push(@Temp, substr($_, FileNameStart));
        }
        print(Dumper(@Temp));

        __DATA__
               0  Stored        0   0%  04-20-05 08:43  00000000  test 1 2 3.zip
          704106  DeflatN   83362  89%  04-04-05 19:00  8e76dc22  file1.dat