Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks

I am having some problems with pattern matching.

I need to write some Perl that searches through an HTML document to find snippets of text delimited by two strings, and to assign anything between these two delimiting strings to an array.
The snippets of text are spread over 4 lines.
The opening delimiter is <!-- Start_of_revision--> on the first line, and the closing delimiter is <!-- End_of_revision-->, on the fourth line.
There can be multiple instances of this 4-line snippet in an HTML document, and each time I detect one of these instances, I would like to put anything between the delimiters into an array, thus building up a list of revisions.

I have tried:

@revision_array=(); if (m/<!-- Start_of_revision-->(.+)<!-- End_of_revision-->/mg) { push (@revision_array, $1); }

But alas, nothing is detected. I'm guessing this is a hopelessly naive approach, but I really haven't got the time to work it out myself, so I was wondering if you guys could help me out?

C J

Re: pattern matching
by ikegami (Patriarch) on Jul 18, 2005 at 17:13 UTC

    You have two errors.

    By default, "." doesn't match newlines. Use the "s" regexp modifier so that it does.

    The second problem is that the regexp will only ever return one match. Because ".+" is greedy, it will start matching at the first "<!-- Start_of_revision-->" and will continue until the last "<!-- End_of_revision-->". Append a question mark to the quantifier (as in ".*?") to make it non-greedy.

    @revision_array=(); while (m/<!-- Start_of_revision-->(.*?)<!-- End_of_revision-->/sg) { push (@revision_array, $1); }

    The following snippet does the same thing, but it should be faster (at the cost of using more memory):

    @revision_array = m/<!-- Start_of_revision-->(.*?)<!-- End_of_revision-->/sg;

    By the way, I removed the "m" regexp modifier since it's useless for regexps which use neither "^" nor "$".
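
    To see the effect of both fixes, here is a minimal sketch run against a made-up document with two revisions. The sample text and the variable names $html, @greedy and @lazy are assumptions for illustration only; the real input would be the slurped HTML file.

```perl
use strict;
use warnings;

# Hypothetical sample document with two revision blocks.
my $html = join '',
    "<p>intro</p>\n",
    "<!-- Start_of_revision-->\nrev 1\n<!-- End_of_revision-->\n",
    "<p>middle</p>\n",
    "<!-- Start_of_revision-->\nrev 2\n<!-- End_of_revision-->\n";

# Greedy quantifier: a single match that runs from the first start
# marker all the way to the last end marker.
my @greedy = $html =~ /<!-- Start_of_revision-->(.+)<!-- End_of_revision-->/sg;

# Non-greedy quantifier: one match per revision block.
my @lazy = $html =~ /<!-- Start_of_revision-->(.*?)<!-- End_of_revision-->/sg;

printf "greedy:     %d match(es)\n", scalar @greedy;
printf "non-greedy: %d match(es)\n", scalar @lazy;
```

    In list context, a match with the "g" modifier returns every captured group at once, which is why no explicit loop is needed here.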

Re: pattern matching
by halley (Prior) on Jul 18, 2005 at 17:16 UTC
    If your start marker and your end marker are always on separate lines (which isn't always a good assumption or design for HTML-alikes), then the flip-flop can be useful:
    while (<>) { push(@array, $_) if /one/ .. /two/; do_something() if /two/; }

    --
    [ e d @ h a l l e y . c c ]

Re: pattern matching
by JediWizard (Deacon) on Jul 18, 2005 at 17:18 UTC

    First: It is really not recommended to use a regex to parse HTML (it tends to turn into a nightmare).

    Second: You need to use the s modifier on your regex in order to make . match the newline character.

    Third: using .+ is greedy, and will match from the first <!-- Start_of_revision--> to the last <!-- End_of_revision--> (which I am sure is not what you want).

    Fourth: The code below is untested

    @revision_array=(); my $start = qr/<!-- Start_of_revision-->/; my $end = qr/<!-- End_of_revision-->/; if (m/$start((?:(?!$end).)+)$end/msg) { push (@revision_array, $1); }

    Update: And to get all of them, that if should be a while
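
    As a rough sketch of that update, the loop form might look like the following. The sample string in $html is invented for the example, and the match is bound to $html rather than $_ purely to keep the snippet self-contained.

```perl
use strict;
use warnings;

# Hypothetical input; in practice this would be the slurped HTML file.
my $html = join '',
    "<!-- Start_of_revision-->\nfirst\n<!-- End_of_revision-->\n",
    "<p>unrelated markup</p>\n",
    "<!-- Start_of_revision-->\nsecond\n<!-- End_of_revision-->\n";

my $start = qr/<!-- Start_of_revision-->/;
my $end   = qr/<!-- End_of_revision-->/;

my @revision_array;

# In scalar context, /g resumes where the previous match ended, so the
# loop body runs once per revision block. The (?!$end) lookahead stops
# the capture from running past the nearest end marker.
while ( $html =~ /$start((?:(?!$end).)+)$end/sg ) {
    push @revision_array, $1;
}

print scalar(@revision_array), " revision(s) found\n";
```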


    They say that time changes things, but you actually have to change them yourself.

    —Andy Warhol

Re: pattern matching
by cmeyer (Pilgrim) on Jul 18, 2005 at 17:48 UTC

    While the other solutions offered work on a file that is already entirely slurped into memory, it is possible to combine the flip flop operator (".." in scalar context) with line oriented file processing. You could try something like:

    my ( @revision_array, @current_revision );
    while (<DATA>) {
        my $status;
        if ( $status = /<!-- Start_of_revision-->/ .. /<!-- End_of_revision-->/
             and $status != 1
             and $status !~ /E/ ) {
            push @current_revision, $_;
        }
        if ( $status =~ /E/ ) {
            push @revision_array, join '', @current_revision;
            undef @current_revision;
        }
    }
    my $count;
    for my $revision (@revision_array) {
        print 'revision ', ++$count, ": \n$revision\n";
    }
    __DATA__
    <head>something</head>
    <!-- Start_of_revision-->
    some revised text
    <!-- End_of_revision-->
    some other regular tags and stuff
    <br>
    <!-- Start_of_revision-->
    another revision
    <!-- End_of_revision-->
    the rest of the document

    That obscure line about testing $status to see if it contains the letter 'E' is to detect the state change from "inside the matching area" to "outside the matching area".
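
    To make the role of that 'E' more concrete, the following small sketch records the value the flip-flop returns for each line of a made-up input. The @lines data and the @trace array are assumptions for illustration and are not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical input lines: one revision block with text on either side.
my @lines = (
    "before\n",
    "<!-- Start_of_revision-->\n",
    "text a\n",
    "text b\n",
    "<!-- End_of_revision-->\n",
    "after\n",
);

my @trace;
for (@lines) {
    # Scalar-context ".." is false (empty string) outside the range and
    # a sequence number inside it; the last line of the range gets "E0"
    # appended to its number.
    my $status = /<!-- Start_of_revision-->/ .. /<!-- End_of_revision-->/;
    push @trace, $status;

    chomp( my $shown = $_ );
    printf "%-4s %s\n", ( $status eq '' ? "''" : $status ), $shown;
}
```

    The start marker line gets the value 1 and the end marker line gets a value ending in "E0", which is why the program above skips the status 1 and any status containing 'E' when collecting the revision text.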

    For more information, see 'perldoc perlop'. Search for "Range Operators", and pay attention to the paragraphs about scalar context. If you are crazy, and enjoy using OO beyond its range of usefulness, then you might like Bit::FlipFlop.

    -Colin.


Re: pattern matching
by mvaline (Friar) on Jul 18, 2005 at 17:26 UTC
    It looks to me like you're trying to re-invent a source-code control system. If you really want to track revisions, you could do so without cluttering up your HTML documents by using a real system like CVS.