Regular Expressions Matching with Perl

nimdokk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to work up a regular expression that will capture the file name from the following output. This is to be used in a Perl routine to capture filenames from within a Zip file. (There may be modules that will do what I need, but I cannot install them on the system, and because of the way things are run it would not be a good idea to run them out of my home directory, so extra modules are not an option at this point in time.) The data would look like this:

      0  Stored       0   0%  04-20-05  08:43  00000000   test 1 2 3.zip
 704106  DeflatN  83362  89%  04-04-05  19:00  8e76dc22   file1.dat

What I need to get out would be the file name, so in the first example it would be "test 1 2 3.zip" and in the second "file1.dat". If I knew there would never be spaces in the names, no problem, but that's not something I can guarantee. Also, file names might be in a mix of upper and lower case. What I have come up with is the following (untested as of yet):

$line =~ /\d\s{1,2}\w{6,7}\s{1,}%\s{2}\d{2}-\d{2}-\d{2}\s{2}\d{2}:\d{2}\s{2}\w{8}\s{1,}(\w{1,}\.\w{3})/;
push @temp, $1;

This would be nested inside of a foreach loop that loops through a temp file containing the data. @temp is then returned back to the main script calling this function. My main question is in looking for a better way to handle the regular expression match.

Any ideas would be welcome. Thanks


Re: Regular Expressions Matching with Perl by ikegami (Patriarch) on Apr 20, 2005 at 17:40 UTC
What about `chomp($line); push(@temp, (split(' ', $line, 8))[-1]);` Tested.
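A minimal sketch of how that one-liner might sit inside the loop the question describes. The sub name `extract_names`, the digit check used to skip non-data lines, and the field-count check are assumptions added for illustration, not part of the original posts.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 8th whitespace-separated column of each
# listing line, keeping any spaces inside the file name.
sub extract_names {
    my @lines = @_;
    my @names;
    for my $line (@lines) {
        chomp $line;
        # Assumption: data lines begin with optional spaces and a size.
        next unless $line =~ /^\s*\d+\s/;
        # split ' ' skips leading whitespace; the limit of 8 leaves the
        # rest of the line (the file name) untouched in the last field.
        my @fields = split ' ', $line, 8;
        next unless @fields == 8;
        push @names, $fields[7];
    }
    return @names;
}

my @sample = (
    "      0  Stored       0   0%  04-20-05  08:43  00000000   test 1 2 3.zip\n",
    " 704106  DeflatN  83362  89%  04-04-05  19:00  8e76dc22   file1.dat\n",
);
print "$_\n" for extract_names(@sample);
```

Lines with fewer than eight fields are skipped rather than pushed as `undef`, which is one possible way to handle headers or totals in the listing.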
Re^2: Regular Expressions Matching with Perl by hv (Prior) on Apr 21, 2005 at 09:54 UTC
Is it guaranteed that the compression method field will never nest a space, like "LZW cmp" or something? When building a regexp against sample data (as opposed to "against a specification") my approach tends to be exactly the opposite of Fletch's - make the regexp constrain as much as possible, so that I can warn if I ever see new data that violates my expectations:

$line =~ m{^
    \s* \d+               # size?
    \s+ \w+               # compression method
    \s+ \d+               # compressed size?
    \s+ \d+ %             # compression ratio
    \s+ \d+ - \d+ - \d+   # date
    \s+ \d+ : \d+         # time
    \s+ [0-9a-f]{8}       # checksum?
    \s+ (.*)              # filename
$}xi or warn "Couldn't match input line '$line'";
$filename = $1;

It is worth checking whether it is possible to store a filename with some odd characters to see what happens, such as a newline, backslash etc. Similarly it is worth looking for boundary conditions on other fields - if the size is more than 8 digits does it still retain at least one following space?
Hugo
Re^3: Regular Expressions Matching with Perl by ikegami (Patriarch) on Apr 21, 2005 at 14:18 UTC
> Is it guaranteed that the compression method field will never nest a space, like "LZW cmp" or something?
Yes, I think it's always one word (and probably specifically for easy parsing, judging by the odd names).
> When building a regexp against sample data my approach tends to be exactly the opposite of Fletch's
I call the two approaches "Extraction" (`/:.{15}(.*)/`) and "Validation" (yours). Which I use is determined by the situation. Sometimes, there's a happy middle that's a mixture of both (Fletch's `/[[:xdigit:]]{8}\s+(.*)$/`).
Re: Regular Expressions Matching with Perl by Transient (Hermit) on Apr 20, 2005 at 17:36 UTC
Are there always 8 columns of data? Is the file name always last? If so (untested): `$filename = (split( /\s+/,$line, 8 ))[-1]` If it's always at a certain index, you could go off of that, also. Regexps are useful, but not always necessary! Update: As ikegami points out, this will not work quite the way you want because of the "null field" that split perceives in front of the leading spaces, as described in the split documentation. Use `$filename = (split( ' ',$line, 8 ))[-1]` instead.
Re^2: Regular Expressions Matching with Perl by ikegami (Patriarch) on Apr 20, 2005 at 17:42 UTC
This won't work because of the leading spaces. See my reply for the fix.
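To make the leading-space problem concrete, here is a small sketch comparing the two separators on the second sample line. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = " 704106  DeflatN  83362  89%  04-04-05  19:00  8e76dc22   file1.dat";

# A regex separator treats the leading space as a delimiter, so field 0 is
# the empty string and the limit of 8 is used up one column too early.
my @by_regex = split /\s+/, $line, 8;

# The special ' ' separator skips leading whitespace before splitting.
my @by_space = split ' ', $line, 8;

print "regex: [$by_regex[-1]]\n";   # regex: [8e76dc22   file1.dat]
print "space: [$by_space[-1]]\n";   # space: [file1.dat]
```

With `/\s+/` the last field still contains the checksum column, which matches the symptom of grabbing the 7th column along with the name.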
Re^2: Regular Expressions Matching with Perl by nimdokk (Vicar) on Apr 20, 2005 at 17:49 UTC
Right, and when I tried it on a sample containing three files, it worked perfectly on the first line, but on the second two, it grabbed the 7th column as well. It looks like this is definitely taking me in the right direction. I'd tried the split route first but the spaces in the filename would throw everything off. I'll play around with some of these suggestions and see what I can come up with. Thanks for the quick responses. Update: I tried the updated line and it worked beautifully. I need to do some more testing, but I think I've got a winner here. Thanks again.
Re^2: Regular Expressions Matching with Perl by nimdokk (Vicar) on Apr 20, 2005 at 17:41 UTC
It will always be in the 8th column - I'll give that a shot and see. Just tried my regexp and it didn't do a thing. I'll give this a shot and see what happens.
Re: Regular Expressions Matching with Perl by davidrw (Prior) on Apr 20, 2005 at 17:44 UTC
If you were able to use modules, I would suggest Archive::Zip. But here is a non-module solution:

$line =~ s/^\s+//g;                           # strip leading whitespace
$line =~ s/\s+$//g;                           # strip trailing whitespace
my @cols = split(/ +/, $line);                # split on spaces
my $filename = join(' ', splice(@cols, 7));   # piece back together the filename
push @temp, $filename;                        # store filename

Could possibly try a fixed-width solution, but I'd be worried that it wouldn't work if the first or third columns varied too much in size. Update: I think I overthought a little and forgot about the LIMIT parameter to split(), which is probably better than breaking and re-gluing the filename.
Re^2: Regular Expressions Matching with Perl by nimdokk (Vicar) on Apr 20, 2005 at 17:46 UTC
I've thought about Archive::Zip as well, but it's not one we have on our system. :-)
Re^3: Regular Expressions Matching with Perl by salva (Canon) on Apr 21, 2005 at 10:12 UTC
There is an alternative way to "install" pure-Perl modules: copy and paste their source code directly into your script. It will require some changes, but usually small, trivial ones.
Re: Regular Expressions Matching with Perl by Fletch (Bishop) on Apr 20, 2005 at 17:40 UTC
Regexen should be as short as possible, but no shorter. `my( $file ) = /[[:xdigit:]]{8}\s+(.*)$/;`
Re^2: Regular Expressions Matching with Perl by ikegami (Patriarch) on Apr 20, 2005 at 17:49 UTC
Too short. That fails if the file size is 10,000,000 bytes or more. `my ($file) = $line =~ /:.{15}(.*)/;` would be minimal.
Re^3: Regular Expressions Matching with Perl by Fletch (Bishop) on Apr 20, 2005 at 18:21 UTC
Erm, it's matching against the 8 hex digits of the CRC-32. How is that affected by the file size? Granted, on further examination of `unzip -v` output it should be anchored off the time as well. `/:\d\d\s\s[[:xdigit:]]{8}\s\s(.*)$/`
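One way to see the file-size objection concretely: with an eight-digit size, the size column itself looks like eight hex digits followed by whitespace, so an unanchored pattern matches there first. The listing line below is made up for illustration, `[[:xdigit:]]` is used because it is the POSIX hex-digit class Perl recognizes, and the anchored pattern is a slightly loosened variant (`\s+` rather than exactly two spaces), not the one posted.

```perl
use strict;
use warnings;

# Hypothetical listing line for a file of 10,000,000 bytes or more.
my $big = "12345678  DeflatN  9876543  20%  04-04-05  19:00  8e76dc22   big file.dat";

# Unanchored: the leftmost match is the size column, not the CRC.
my ($loose) = $big =~ /[[:xdigit:]]{8}\s+(.*)$/;

# Anchored on the minutes of the time stamp, so only the CRC can match.
my ($anchored) = $big =~ /:\d\d\s+[[:xdigit:]]{8}\s+(.*)$/;

print "loose:    [$loose]\n";
print "anchored: [$anchored]\n";   # anchored: [big file.dat]
```

The loose capture starts at the compression method and runs to the end of the line, while the anchored one returns only the name.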
Re^4: Regular Expressions Matching with Perl by ikegami (Patriarch) on Apr 20, 2005 at 18:39 UTC
Re: Regular Expressions Matching with Perl by NateTut (Deacon) on Apr 20, 2005 at 17:40 UTC
I would think substr would be a good choice too since you seem to be dealing with fixed length fields/records.
Re^2: Regular Expressions Matching with Perl by ikegami (Patriarch) on Apr 20, 2005 at 17:44 UTC
Unfortunately, I think `substr` will fail when the raw size of the file is 10MB or more, when the compressed size of the file is 10MB or more, or when the compression ratio is 100%. (I've seen it round to 100% once.)
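A small sketch of that concern, assuming a wider size column simply pushes every later column right by one character. How `unzip` actually pads such values is not confirmed here, the second line is hypothetical, and the offset of 58 is simply where the name starts in the posted sample lines.

```perl
use strict;
use warnings;

use constant NAME_OFFSET => 58;    # column where names start in the sample

my $normal = " 704106  DeflatN  83362  89%  04-04-05  19:00  8e76dc22   file1.dat";
# Hypothetical: an 8-digit size shifts the remaining columns by one.
my $wide   = "12345678  DeflatN  83362  89%  04-04-05  19:00  8e76dc22   file1.dat";

for my $line ($normal, $wide) {
    my $by_offset = substr $line, NAME_OFFSET;       # fixed column
    my $by_split  = (split ' ', $line, 8)[-1];       # 8th field
    print "substr: [$by_offset]  split: [$by_split]\n";
}
```

On the shifted line the fixed offset picks up a stray leading space, while the field-based split is unaffected by column widths.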
Re^3: Regular Expressions Matching with Perl by NateTut (Deacon) on Apr 20, 2005 at 18:57 UTC
You wouldn't need to run substr on the whole file at one time, you would run it against each line of the file separately like this:

use strict;
use warnings;
use Data::Dumper;
my @Temp;
use constant FileNameStart => 58;
while(<DATA>)
{
   chomp();
   push(@Temp, substr($_,FileNameStart));
}
print(Dumper(@Temp));
__DATA__
      0  Stored       0   0%  04-20-05  08:43  00000000   test 1 2 3.zip
 704106  DeflatN  83362  89%  04-04-05  19:00  8e76dc22   file1.dat
Re^4: Regular Expressions Matching with Perl by ikegami (Patriarch) on Apr 20, 2005 at 19:28 UTC