regex help, please

mchampag has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: regex help, please by BioLion (Curate) on Dec 21, 2009 at 17:24 UTC
As SuicideJunkie points out, your substitution is failing because your regex doesn't match. I first thought it was because the `.+` in the middle of your pattern is 'greedy', so it matches right to the end of the string and your `Track` part can't match. However, this is not right, because perlre tells us that

" . Match any character (except newline) "

So really it is failing because you don't have 'Track' on the same line as the 'CD_TEXT' bit. After some playing around I got to this, which does the job and should get you to the answer:

    use warnings;
    use strict;

    my $string = <<END;
    CD_DA

    CD_TEXT {
      LANGUAGE_MAP {
        0: 9
      }
      LANGUAGE 0 {
        TITLE "Multi-01"
        PERFORMER ""
        SIZE_INFO { 1, 1, 19, 0, 3, 2, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 3, 7, 0, 0, 0,
                    0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0}
      }
    }

    // Track 1
    END

    ## include all characters between markers
    if ( $string =~ m|(CD_TEXT[\W\w]+)Track| ){
        print "\'$1\' Matched!\n";
    }
    else {
        print "\'$string\' did not match...\n";
    }

    __END__
    'CD_TEXT {
      LANGUAGE_MAP {
        0: 9
      }
      LANGUAGE 0 {
        TITLE "Multi-01"
        PERFORMER ""
        SIZE_INFO { 1, 1, 19, 0, 3, 2, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 3, 7, 0, 0, 0,
                    0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0}
      }
    }

    // ' Matched!

Hope this 'elfs.

Just a something something...
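For comparison, here is a minimal sketch of the same point using the `/s` modifier, which makes `.` match newlines as well. The text in `$toc` is a shortened, made-up stand-in for the toc file, not the original poster's data, and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Shortened, hypothetical stand-in for the toc file contents.
my $toc = "CD_DA\n\nCD_TEXT {\n  LANGUAGE 0 {\n  }\n}\n\n// Track 1\n";

# Without /s, '.' stops at the first newline, so 'Track' is never reached.
my $plain = $toc =~ m{CD_TEXT.+Track} ? 'matched' : 'no match';

# With /s, '.' also matches "\n", so the pattern can span lines.
my $dotall = $toc =~ m{CD_TEXT.+Track}s ? 'matched' : 'no match';

# [\W\w] matches any character at all, newline included, even without /s.
my $class = $toc =~ m{CD_TEXT[\W\w]+Track} ? 'matched' : 'no match';

print "plain: $plain\n";
print "with /s: $dotall\n";
print "char class: $class\n";
```

Either `/s` or the `[\W\w]` class gets the pattern across line breaks; which one reads better is mostly a matter of taste.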
Re^2: regex help, please by mchampag (Acolyte) on Dec 21, 2009 at 17:52 UTC
It DID 'elf! Adding the ' 1\n' to force the match to stop at '// Track 1', I changed my s/// to:

`$tocfile =~ s/CD_TEXT[\W\w]+(\/\/ Track 1\n)/$1/m;`

And it's now doing what I mean. Thanks for the remedial lesson on the . metacharacter!!!

-Matt
Re: regex help, please by Marshall (Canon) on Dec 21, 2009 at 17:33 UTC
You asked about an alternate approach. I have often found that data like yours is best parsed line by line instead of being slurped into a single variable. Of course your mileage will definitely vary! Anyway, a "read the line and throw it away if not needed" loop is much faster than a "slurp" and substitution. A simple formulation of this is shown below.

Update: I would also add Flipin good, or a total flop? as another way, along the lines of shmem's approach. With the "flip flop" syntax, the two dot (..) version tests the end condition on the same line that matched the start, so a range can begin and end on a single line; the three dot (...) version waits until the next line before testing the end condition.

    #!/usr/bin/perl -w
    use strict;

    my $line;
    while ($line = <DATA>)
    {
        # \s* tolerates optional leading whitespace
        skip_record() if $line =~ m/^\s*CD_TEXT/;
        print $line;
    }

    # Discard lines until the "// Track" marker or end of input.
    sub skip_record
    {
        while ( defined( $line = <DATA> ) and $line !~ m|^\s*//\s*Track| ) { }
        $line = '' unless defined $line;
    }

    #prints:
    # CD_DA
    #
    # // Track 1

    __DATA__
    CD_DA

    CD_TEXT {
      LANGUAGE_MAP {
        0: 9
      }
      LANGUAGE 0 {
        TITLE "Multi-01"
        PERFORMER ""
        SIZE_INFO { 1, 1, 19, 0, 3, 2, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 3, 7, 0, 0, 0,
                    0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0}
      }
    }

    // Track 1
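A small sketch of how the two flip-flop forms differ, as I understand perlop: `..` tests the end condition on the very line that matched the start, while `...` waits for the next line. The input lines and the START/END markers are invented for the demonstration.

```perl
use strict;
use warnings;

# Invented input: line 2 contains both the start and the end marker.
my @lines = ( 'a', 'START x END', 'b', 'END', 'c' );

my ( @kept_two, @kept_three );

for (@lines) {
    # '..' checks /END/ on the same line that matched /START/,
    # so the range opens and closes on line 2 alone.
    push @kept_two, $_ unless /START/ .. /END/;
}

for (@lines) {
    # '...' does not check /END/ until the following line,
    # so the range runs from line 2 through the standalone 'END'.
    push @kept_three, $_ unless /START/ ... /END/;
}

print "two dots kept:   @kept_two\n";
print "three dots kept: @kept_three\n";
```

For the toc file the distinction does not matter, since the `CD_TEXT {` line and the closing brace are on different lines.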
Re^2: regex help, please by mchampag (Acolyte) on Dec 21, 2009 at 18:18 UTC
Thank you very much for both the line-by-line test and flip-flop hints. This problem had to do with fixing an existing script which choked on unexpected input (the CD_TEXT field), and so I was looking for as non-invasive a solution as possible, but I'll definitely keep your tips in mind for next time.

-Matt
Re: regex help, please by shmem (Chancellor) on Dec 21, 2009 at 17:45 UTC
Alternate approach: `perl -ne 'print unless /^CD_TEXT \{/ .. /^}/' tocfile`
Re: regex help, please by johngg (Canon) on Dec 21, 2009 at 20:25 UTC
Marshall pointed you towards a flip-flop approach to the problem and shmem gave you an example flip-flop one-liner where your data was still in a real file rather than a scalar. You can still use the flip-flop method with multi-line data held in a scalar by opening a filehandle on a reference to the scalar and reading it as you would a disk-based file.

    $ perl -E '
    > $tocFile = <<'EOD';
    > CD_DA
    >
    > CD_TEXT {
    >   LANGUAGE_MAP {
    >     0: 9
    >   }
    >   LANGUAGE 0 {
    >     TITLE "Multi-01"
    >     PERFORMER ""
    >     SIZE_INFO { 1, 1, 19, 0, 3, 2, 0, 0, 0, 0, 0, 0,
    >                 0, 0, 0, 0, 0, 0, 0, 3, 7, 0, 0, 0,
    >                 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0}
    >   }
    > }
    >
    > // Track 1
    > EOD
    > open $tocFH, q{<}, \ $tocFile or die $!;
    > while ( <$tocFH> ) { print unless m{^CD_T} .. m{^\}} }'
    CD_DA


    // Track 1
    $

I hope this is of interest.

Cheers,

JohnGG
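If the result needs to end up back in a scalar rather than on STDOUT, the same idea can be wrapped in a small function. This is only a sketch under `use strict`: `strip_cd_text` is a hypothetical helper name, and the sample text is a shortened stand-in for the real toc data.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of $text without the CD_TEXT block.
sub strip_cd_text {
    my ($text) = @_;

    # Open a read handle on the scalar itself (an in-memory file).
    open my $in, '<', \$text or die "Cannot open in-memory handle: $!";

    my $out = '';
    while ( my $line = <$in> ) {
        # True from the "CD_TEXT {" line through the "}" at column 0.
        next if $line =~ /^CD_TEXT \{/ .. $line =~ /^\}/;
        $out .= $line;
    }
    close $in;
    return $out;
}

# Shortened, made-up stand-in for the slurped toc file.
my $tocfile = "CD_DA\n\nCD_TEXT {\n  LANGUAGE 0 {\n  }\n}\n\n// Track 1\n";
my $cleaned = strip_cd_text($tocfile);
print $cleaned;
```

This relies on the inner closing braces being indented, so that only the final brace matches `/^\}/`.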
Re: regex help, please by SuicideJunkie (Vicar) on Dec 21, 2009 at 17:03 UTC
In what way is it not working? Is it running too slowly because the greedy `.+` first tries to match the entire file, and then tries the entire file-1, and so on? Is it removing too much because there is a second `// Track` in the file that it can match? Is it removing too little because one of your titles/performers happens to include the sequence `// Track` (while trying a non-greedy "`.+?`")?

Without getting into real parsing, I'd suggest starting with a non-greedy +. Match explicitly against the closing bracket and the newlines to reduce (unfortunately not eliminate) the chances of matching something you don't expect, and move the actual `// Track` check into a lookahead, so that it is not part of the match and won't be deleted. (Also, lookaheads are good to learn for the future.)
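A rough sketch of what that suggestion could look like, assuming the closing brace of the CD_TEXT block sits at the start of a line. The sample data is made up, including a decoy `// Track` inside a title, and `$cleaned` is just an illustrative name.

```perl
use strict;
use warnings;

# Made-up sample; note the decoy "// Track" inside the title.
my $tocfile = <<'END';
CD_DA

CD_TEXT {
  LANGUAGE 0 {
    TITLE "Intro // Track zero"
  }
}

// Track 1
END

# CD_TEXT.+?  : non-greedy run of any characters (/s lets '.' match "\n")
# \n\}\n+     : a closing brace at the start of a line, then newline(s)
# (?=...)     : "// Track" must come next, but is not consumed or deleted
(my $cleaned = $tocfile) =~ s{CD_TEXT.+?\n\}\n+(?=// Track)}{}s;

print $cleaned;
```

Requiring the brace and newlines before the lookahead is what keeps the non-greedy match from stopping at the decoy in the title; as noted above, this reduces the chance of a wrong match without eliminating it.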
Re^2: regex help, please by mchampag (Acolyte) on Dec 21, 2009 at 17:18 UTC
The regex apparently isn't matching anything. If I print $tocfile, the CD_TEXT field is still there. I can see I have some book larnin ahead of me. Thanks for the tip about lookaheads.