Help with a regular expression for file name parsing

bontchev has asked for the wisdom of the Perl Monks concerning the following question:

Re: Help with a regular expression for file name parsing by BrowserUk (Patriarch) on Dec 07, 2011 at 07:11 UTC
This works with the samples supplied:

    print $data;;
    #some "random stuff"
    @include "some file" did you parse that?
    #more 'random' stuff
    @include 'another file' you sure?
    #and more random stuff
    @include yet\ another\ file positive?

    print for $data =~ m[\@include\s('[^']+'|"[^"]+"|.+?(?<!\\))\s]g;;
    "some file"
    'another file'
    yet\ another\ file

Spreading that out a bit:

    m[
        \@include \s        ## the introducer followed by a space
        (                   ## capture
            '[^']+'         ## a single-quoted string with no embedded single quotes
            |               ## or
            "[^"]+"         ## a double-quoted string with no embedded double quotes
            |               ## or
            .+? (?<!\\)     ## a minimal-length string that ends in a space that isn't escaped
        )
        \s
    ]gx;;

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
The start of some sanity?
Re^2: Help with a regular expression for file name parsing by TJPride (Pilgrim) on Dec 07, 2011 at 11:30 UTC
Your regular expression works, but the code is rather a muddle. Here's a version that he can use to test with:

    $data = join '', <DATA>;
    print "$_\n" for $data =~ m[\@include\s('[^']+'|"[^"]+"|.+?(?<!\\))\s]g;

    __DATA__
    #some "random stuff"
    @include "some file" did you parse that?
    #more 'random' stuff
    @include 'another file' you sure?
    #and more random stuff
    @include yet\ another\ file positive?
Re^3: Help with a regular expression for file name parsing by bontchev (Sexton) on Dec 07, 2011 at 12:07 UTC
When tested with this version, the output is just `1`
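One likely cause, assuming the snippet was copied from the page with a `+` line-wrap marker left sitting between the closing `]` and the `g` flag: Perl then no longer sees `g` as a modifier. It parses the expression as a match in scalar context plus the bareword `g`, which is 1 + 0. The sketch below reproduces that under this assumption; the two-line sample data and the variable names are illustrative, and `use strict` is left off on purpose because the bareword would not compile with it.

```perl
use warnings;
# No "use strict" on purpose: the bareword g below only compiles without it.

my $data = qq{\@include "some file" did you parse that?\n}
         . qq{\@include 'another file' you sure?\n};

my @broken;
{
    no warnings;    # hide the bareword and non-numeric warnings this line triggers
    # Parsed as (match in scalar context) + "g", i.e. 1 + 0
    @broken = ( $data =~ m[\@include\s('[^']+'|"[^"]+"|.+?(?<!\\))\s] +g );
}

# With the stray "+" removed, g is a modifier again and the captures come back
my @fixed = $data =~ m[\@include\s('[^']+'|"[^"]+"|.+?(?<!\\))\s]g;

print "with '+g': @broken\n";
print "with 'g':  @fixed\n";
```

If that is what happened, deleting the `+` (and any line break in front of `g`) should bring back the list of file names.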
Re^4: Help with a regular expression for file name parsing by TJPride (Pilgrim) on Dec 07, 2011 at 13:23 UTC
Re: Help with a regular expression for file name parsing by Anonymous Monk on Dec 07, 2011 at 07:19 UTC
Surely such a format has a parser already, but anyway, I'm sure this will work (untested):

    my $pat = qr~
        \@include
        \s+
        (
            (?: '[^']*' )
            |
            (?: "[^"]*" )
            |
            (?: (?:\\.) | [^\\s] )+
        )
    ~x;

`[^"` makes a good search term for finding regexes for similar formats, like ?node_id=3989;BIT=%5B%5E%22 -> Re^3: More robust link finding than HTML::LinkExtor/HTML::Parser?, Re: skip over an escaped single quote
Re^2: Help with a regular expression for file name parsing by Anonymous Monk on Dec 07, 2011 at 12:59 UTC
And tested, though I had forgotten to escape a `\` in `[^\\s]`:

    #!/usr/bin/perl --
    #~ 2011-12-07-04:10:56PDT by Anonymous Monk
    #~ perltidy -csc -otr -opr -ce -nibc -i=4
    use strict;
    use warnings;
    use autodie;    # dies if open/close... fail

    Main( @ARGV );
    exit( 0 );

    sub Main {
        if ( @_ == 2 ) {
            NotDemoMeaningfulName(@_);
        } else {
            Demo();
            print '#' x 33, "\n", Usage();
        }
    } ## end sub Main

    sub NotDemoMeaningfulName {
        my ( $inputFile, $outputFile ) = @_;
        open my ($inFh),  '<', $inputFile;
        open my ($outFh), '>', $outputFile;
        while ( defined( my $data = <$inFh> ) ) {
            print $outFh "$_\n" for $data =~ m~
                \@include
                \s+
                (
                    (?: '[^']*' )
                    |
                    (?: "[^"]*" )
                    |
                    (?: (?:\\.) | [^\\\s] )+
                )
            ~xg;
            #~ for $data =~ m[\@include\s('[^']+'|"[^"]+"|.+?(?<!\\))\s]g;
            # /\@include\s+('[^']+'|"[^"]+"|.+?(?<!\\))\s+/g
        }
        close $inFh;
        close $outFh;
    } ## end sub NotDemoMeaningfulName

    sub Usage {
        <<"__USAGE__";
    $0
    $0 dataFile newDataFile
    __USAGE__
    } ## end sub Usage

    sub Demo {
        my ( $Input, $WantedOutput ) = DemoData();
        NotDemoMeaningfulName( \$Input, \my $Output );
        require Test::More;
        Test::More::is( $Output, $WantedOutput, ' NotDemoMeaningfulName Works As Designed' );
        Test::More::done_testing();
        print "\n$Output\n";
    } ## end sub Demo

    sub DemoData {
        #~ http://perlmonks...
        my $One = <<'__One__';
    @include test
    #some "random stuff"
    @include "some file" did you parse that?
    #more 'random' stuff
    @include 'another file' you sure?
    #and more random stuff
    @include yet\ another\ file positive?
    __One__
        #~ http://perlmonks...
        my $Two = <<'__Two__';
    test
    "some file"
    'another file'
    yet\ another\ file
    __Two__
        return $One, $Two;
    } ## end sub DemoData
    __END__
    $ perl pm.re.942167.pl
    ok 1 - NotDemoMeaningfulName Works As Designed
    1..1

    test
    "some file"
    'another file'
    yet\ another\ file

    #################################
    pm.re.942167.pl
    pm.re.942167.pl dataFile newDataFile
Re^3: Help with a regular expression for file name parsing by bontchev (Sexton) on Dec 09, 2011 at 08:12 UTC
"and tested"

Not tested enough, I'm afraid. Your code doesn't properly handle even such trivial cases as `@include file`
Re^4: Help with a regular expression for file name parsing by Anonymous Monk on Apr 19, 2012 at 06:46 UTC
Re: Help with a regular expression for file name parsing by TJPride (Pilgrim) on Dec 07, 2011 at 14:12 UTC
There are really two parts to this. The first is to match the three patterns; the second is to eliminate the unwanted wrapper or backslash characters. I tried to figure out a regex that would do both at once, but either it's impossible or my knowledge of regex isn't up to the task. So I cheated.

    use strict;
    use warnings;

    my $data = join '', <DATA>;
    my $file;

    while ($data =~ m/\@include (".*?"|'.*?'|(?:[^\s\\]|\\ )+)/g) {
        $file = $1;
        $file =~ s/["'\\]+//g;
        print "$file\n";
    }

    __DATA__
    #some "random stuff"
    @include "some file" did you parse that?
    #more 'random' stuff
    @include 'another file' you sure?
    #and more random stuff
    @include yet\ another\ file positive?

CAVEAT: Assumes that ", ', and \ will never appear within filenames themselves. If they can, this gets much more complex.
Re^2: Help with a regular expression for file name parsing by bontchev (Sexton) on Dec 09, 2011 at 08:22 UTC
Thanks, you've been the most helpful one so far. Sadly, the above solution doesn't solve the problem properly either. However, I managed to combine it with another of the regular expressions that was proposed, plus some code for better resolving the escape sequences in the string, plus a better way of removing the quotes (only from the ends of the string, not from everywhere). Here is what I managed to come up with:

    use strict;
    use warnings;

    while (my $data = <DATA>) {
        if ($data =~ /\@include/i) {
            $data =~ m/\@include\s+('^'+'|"^"+"|.+?(?<!\\))\s/gi;
            my $fname = $1;
            $fname =~ s/\\(rnt'"\\ )/"qq|\\$1|"/gee;
            $fname =~ s/^"(.*)"$/$1/s or $fname =~ s/^'(.*)'$/$1/s;
            print "File name: <$fname>\n";
        }
    }

    __DATA__
    #some "random stuff"
    @include "some file" did you parse that?
    #more 'random' stuff
    @include 'another file' you sure?
    #and more random stuff
    @include yet\ another\ file positive?
    #@Include file
    # @include "\"another one\"" hmmm...
    # some stuff

The "if" is there because, as I've mentioned above, I have to do some other processing of the lines, too. This code mostly works although, as you say, it doesn't properly handle file names containing escaped quotes. Perhaps I should give up the idea of parsing this in some clever way and just process the part after the "@include" character by character?
Re^3: Help with a regular expression for file name parsing by bontchev (Sexton) on Dec 09, 2011 at 08:26 UTC
Sigh, the site mangled the code I posted. :-( I guess I've used the wrong tag. Let's try again:

    use strict;
    use warnings;

    while (my $data = <DATA>) {
        if ($data =~ /\@include/i) {
            $data =~ m/\@include\s+('[^']+'|"[^"]+"|.+?(?<!\\))\s/gi;
            my $fname = $1;
            $fname =~ s/\\([rnt'"\\ ])/"qq|\\$1|"/gee;
            $fname =~ s/^"(.*)"$/$1/s or $fname =~ s/^'(.*)'$/$1/s;
            print "File name: <$fname>\n";
        }
    }

    __DATA__
    #some "random stuff"
    @include "some file" did you parse that?
    #more 'random' stuff
    @include 'another file' you sure?
    #and more random stuff
    @include yet\ another\ file positive?
    #@Include file
    # @include "\"another one\"" hmmm...
    # some stuff
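On the escaped-quote cases, a common alternative to scanning character by character is to let each quoted branch accept either an ordinary character or a backslash followed by any character. What follows is only a minimal sketch under stated assumptions: a backslash escapes exactly one following character, and unescaping simply drops the backslash (so `\n` becomes `n`, unlike the `qq`-based substitution above). The helper name `include_name` and the sample lines are hypothetical, not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the file name found on one line, or undef if none.
sub include_name {
    my ($line) = @_;
    return undef unless $line =~ m{
        \@include \s+
        (?:
            " ( (?: [^"\\]  | \\. )* ) "    # group 1: double-quoted, escapes allowed
          | ' ( (?: [^'\\]  | \\. )* ) '    # group 2: single-quoted, same rule
          |   ( (?: [^\s\\] | \\. )+ )      # group 3: bare word, escaped spaces allowed
        )
    }xi;
    my $name = $1 // $2 // $3;    # whichever branch matched
    $name =~ s/\\(.)/$1/gs;       # drop each backslash, keep the escaped character
    return $name;
}

my @samples = (
    q{@include "some file" did you parse that?},
    q{@include 'another file' you sure?},
    q{@include yet\ another\ file positive?},
    q{#@Include file},
    q{# @include "\"another one\"" hmmm...},
    q{# some stuff},
);

for my $line (@samples) {
    my $name = include_name($line);
    print defined $name ? "File name: <$name>\n" : "No include: $line\n";
}
```

Because the surrounding quotes sit outside the capture groups, no separate quote-stripping step is needed, and an escaped quote inside a quoted name does not end the match early.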
Re^4: Help with a regular expression for file name parsing by BrowserUk (Patriarch) on Dec 09, 2011 at 09:35 UTC