mchampag has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE: Solved! Thanks, BioLion!

Brothers and sisters, I have slurped into a multiline scalar named $tocfile the following text, which I want to mangle:

CD_DA

CD_TEXT {
  LANGUAGE_MAP {
    0: 9
  }
  LANGUAGE 0 {
    TITLE "Multi-01"
    PERFORMER ""
    SIZE_INFO { 1, 1, 19, 0, 3, 2, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 0, 3, 7, 0, 0, 0,
                0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0}
  }
}

// Track 1

I want to keep the first and last lines with two newlines between them, and I want to delete the section beginning "CD_TEXT". Here's my regex:

$tocfile =~ s/CD_TEXT.+(\/\/ Track)/$1/m;

With it, I'm trying to replace everything from 'CD_TEXT' through '// Track' with '// Track', but it isn't working. Can someone please illuminate me as to why not, or suggest an alternate approach?

Humble thanks,
Matt

Re: regex help, please
by BioLion (Curate) on Dec 21, 2009 at 17:24 UTC

    As SuicideJunkie points out, your substitution is failing because your regex doesn't match. At first I thought it was because the .+ in the middle of your pattern is 'greedy', so it matches right to the end of the string and your Track part can't match. However, this is not right, because perlre tells us that

    " . Match any character (except newline) "
    So really it is failing because you don't have 'Track' on the same line as the 'CD_TEXT' bit.
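A minimal sketch of that point, using a made-up three-line string rather than the real file: the /m modifier only changes where ^ and $ match, while /s is the modifier that lets . match a newline. The variable names here are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the slurped file.
my $sample = "CD_TEXT {\n}\n// Track 1\n";

# /m only changes what ^ and $ mean; "." still stops at a newline,
# so the pattern cannot reach "// Track" on a later line.
my $with_m = $sample =~ /CD_TEXT.+\/\/ Track/m ? 1 : 0;

# /s lets "." match a newline as well, so the pattern can span lines.
my $with_s = $sample =~ /CD_TEXT.+\/\/ Track/s ? 1 : 0;

print "with /m: ", ( $with_m ? "match" : "no match" ), "\n";
print "with /s: ", ( $with_s ? "match" : "no match" ), "\n";
```

This should print "no match" for /m and "match" for /s.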

    Some playing around and I got to this, which does the job, and should get you to the answer:

    use warnings;
    use strict;

    my $string = <<END;
    CD_DA

    CD_TEXT {
      LANGUAGE_MAP {
        0: 9
      }
      LANGUAGE 0 {
        TITLE "Multi-01"
        PERFORMER ""
        SIZE_INFO { 1, 1, 19, 0, 3, 2, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 3, 7, 0, 0, 0,
                    0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0}
      }
    }

    // Track 1
    END

    ## include *all* characters between markers
    if ( $string =~ m|(CD_TEXT[\W\w]+)Track|){
        print "\'$1\' Matched!\n";
    }
    else {
        print "\'$string\' did not match...\n";
    }

    __END__
    'CD_TEXT {
      LANGUAGE_MAP {
        0: 9
      }
      LANGUAGE 0 {
        TITLE "Multi-01"
        PERFORMER ""
        SIZE_INFO { 1, 1, 19, 0, 3, 2, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 3, 7, 0, 0, 0,
                    0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0}
      }
    }

    // ' Matched!

    Hope this 'elfs.

    Just a something something...

      It DID 'elf! Adding the ' 1\n' to force the match to stop at '// Track 1', I changed my s/// to:

      $tocfile =~ s/CD_TEXT[\W\w]+(\/\/ Track 1\n)/$1/m;

      And it's now doing what I mean. Thanks for the remedial lesson on the . metacharacter!!!

      -Matt
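For anyone wondering why the ' 1\n' was needed: [\W\w]+ is greedy, so with more than one track in the file it would run on to the last '// Track'. A rough sketch follows, assuming a hypothetical two-track file (the TRACK AUDIO lines are made up for illustration). A non-greedy .+? with the /s modifier is one alternative that stops at the first marker without hard-coding the track number.

```perl
use strict;
use warnings;

# Hypothetical two-track input; not the poster's real file.
my $toc = <<'END_TOC';
CD_DA

CD_TEXT {
}

// Track 1
TRACK AUDIO

// Track 2
TRACK AUDIO
END_TOC

# Greedy: [\W\w]+ grabs as much as it can, so the match ends at the
# LAST "// Track" and everything for track 1 is deleted as well.
( my $greedy = $toc ) =~ s/CD_TEXT[\W\w]+(\/\/ Track)/$1/;

# Non-greedy: .+? (with /s so "." crosses newlines) stops at the
# FIRST "// Track", leaving both tracks in place.
( my $lazy = $toc ) =~ s/CD_TEXT.+?(\/\/ Track)/$1/s;

print "--- greedy ---\n$greedy";
print "--- non-greedy ---\n$lazy";
```

With a single track, as in the original question, both forms give the same result.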

Re: regex help, please
by Marshall (Canon) on Dec 21, 2009 at 17:33 UTC
    You asked about an alternate approach. I have often found that data like you have is best parsed line by line instead of being slurped into a single variable. Of course your mileage will definitely vary!

    Anyway, reading each line and throwing it away if it is not needed is much faster than a "slurp" and substitution. A simple formulation of this is shown below.

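    Update: I would also add "Flipin good, or a total flop?" as another way, along the lines of shmem's approach. The three-dot (...) version of the "flip flop" operator does not test its end condition on the line that matched the start condition, so its range always spans more than one line; the two-dot (..) version tests both conditions on the same line, so its range can start and end on a single line.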

    #!/usr/bin/perl -w
    use strict;

    my $line;
    while ($line = <DATA>)
    {
        skip_record() if $line =~ m/^\s*CD_TEXT/;
        last unless defined $line;   # hit end of file while skipping
        print $line;
    }

    # Discard lines up to the "// Track" marker (or end of file).
    # The marker line itself is left in $line for the caller to print.
    sub skip_record
    {
        while (defined($line = <DATA>) and $line !~ m|^\s*//\s*Track|){};
    }

    #prints:
    # CD_DA
    #
    # // Track 1
    __DATA__
    CD_DA

    CD_TEXT {
      LANGUAGE_MAP {
        0: 9
      }
      LANGUAGE 0 {
        TITLE "Multi-01"
        PERFORMER ""
        SIZE_INFO { 1, 1, 19, 0, 3, 2, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 3, 7, 0, 0, 0,
                    0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0}
      }
    }

    // Track 1
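The difference between the two flip-flop forms mentioned in the update can be sketched on a made-up list of lines where the start and end patterns both match the same line. The data and variable names are hypothetical; only the .. and ... operators themselves come from Perl.

```perl
use strict;
use warnings;

# Hypothetical lines: "BLOCK { x }" matches both the start pattern
# and the end pattern.
my @lines = ( "keep A\n", "BLOCK { x }\n", "keep B\n", "}\n", "keep C\n" );

# Two dots: the end pattern is tested on the same line that matched
# the start pattern, so the range closes at once and only the BLOCK
# line is dropped.
my @two_dot = grep { !( /BLOCK \{/ .. /\}/ ) } @lines;

# Three dots: the end pattern is first tested on the NEXT line, so
# the range stays open until the lone "}" line.
my @three_dot = grep { !( /BLOCK \{/ ... /\}/ ) } @lines;

print "two dots:\n",   @two_dot;
print "three dots:\n", @three_dot;
```

Here the two-dot version keeps "keep A", "keep B", "}" and "keep C", while the three-dot version keeps only "keep A" and "keep C".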

      Thank you very much for both the line-by-line test and flip-flop hints. This problem had to do with fixing an existing script which choked on unexpected input (the CD_TEXT field), and so I was looking for as non-invasive a solution as possible, but I'll definitely keep your tips in mind for next time.

      -Matt

Re: regex help, please
by shmem (Chancellor) on Dec 21, 2009 at 17:45 UTC

    Alternate approach:

    perl -ne 'print unless /^CD_TEXT {/ .. /^}/' tocfile

Re: regex help, please
by johngg (Canon) on Dec 21, 2009 at 20:25 UTC

    Marshall pointed you towards a flip-flop approach to the problem and shmem gave you an example flip-flop one-liner where your data was still in a real file rather than a scalar. You can still use the flip-flop method with multi-line data held in a scalar by opening a filehandle on a reference to the scalar and reading it as you would a disk-based file.

    $ perl -E '
    > $tocFile = <<EOD;
    > CD_DA
    >
    > CD_TEXT {
    >   LANGUAGE_MAP {
    >     0: 9
    >   }
    >   LANGUAGE 0 {
    >     TITLE "Multi-01"
    >     PERFORMER ""
    >     SIZE_INFO { 1, 1, 19, 0, 3, 2, 0, 0, 0, 0, 0, 0,
    >                 0, 0, 0, 0, 0, 0, 0, 3, 7, 0, 0, 0,
    >                 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0}
    >   }
    > }
    >
    > // Track 1
    > EOD
    > open $tocFH, q{<}, \ $tocFile or die $!;
    > while ( <$tocFH> ) { print unless m{^CD_T} .. m{^\}} }'
    CD_DA


    // Track 1
    $

    I hope this is of interest.

    Cheers,

    JohnGG

Re: regex help, please
by SuicideJunkie (Vicar) on Dec 21, 2009 at 17:03 UTC

    In what way is it not working?

    Is it running too slowly because the greedy .+ first tries to match the entire file, and then tries the entire file-1, and so on?

    Is it removing too much because there is a second // Track in the file that it can match?

    Is it removing too little because one of your titles/performers happens to include the sequence // Track (while trying a non-greedy ".+?")?

    Without getting into real parsing, I'd suggest starting with a non-greedy +. Match explicitly against the closing bracket and the newlines to reduce (unfortunately not eliminate) the chances of matching something you don't expect, and move the actual // Track check into a lookahead, so that it is not part of the match and won't be deleted. (Also, lookaheads are good to learn for the future)
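One way the suggestion above might look in practice is sketched below. It assumes the closing brace of the CD_TEXT block sits alone at the start of a line and that the nested braces are indented, as in a shortened copy of the posted sample. This is an illustration of the idea, not a tested fix for every TOC file.

```perl
use strict;
use warnings;

# Shortened copy of the sample from the question (SIZE_INFO omitted).
my $tocfile = <<'END_TOC';
CD_DA

CD_TEXT {
  LANGUAGE_MAP {
    0: 9
  }
  LANGUAGE 0 {
    TITLE "Multi-01"
    PERFORMER ""
  }
}

// Track 1
END_TOC

# CD_TEXT       start of the block to delete
# .+?           as little as possible (/s lets "." cross newlines)
# ^\}           a closing brace at the start of a line (/m anchors ^ there)
# \n+           the newline after it plus any blank lines
# (?=// Track)  lookahead: "// Track" must follow, but is not consumed
$tocfile =~ s{CD_TEXT.+?^\}\n+(?=// Track)}{}ms;

print $tocfile;
```

Because the lookahead does not consume '// Track', nothing needs to be captured and put back in the replacement.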

      The regex apparently isn't matching anything. If I print $tocfile, the CD_TEXT field is still there.

      I can see I have some book larnin ahead of me. Thanks for the tip about lookaheads.