natol44 has asked for the wisdom of the Perl Monks concerning the following question:

Hello!

Not sure how to write the topic...

I have an HTML page that contains several occurrences of some text I need to record.

<START>TEXT1<END><br> <various data><br> <START>TEXT2<END><br> <various data><br> <START>TEXT3<END><br><br>
etc etc.

And so I need TEXT1, TEXT2, TEXT3 etc, recorded in a file.

I first reformatted the page content into ONE line by stripping the carriage returns. So I now have one string and want to extract substrings from it. Why did I do this? Because I imagine it will be easier to extract the substrings.

Then:

    open (FICH, "$file");
    $all = <FICH>;
    close (FICH);
    my $good = $1 if ($all =~ m/$start(.*?)$end/);
This will give me the first TEXT ($good) occurrence. But how to get all the next ones?

Thanks!

PS. Each page contains up to 50 substrings that I need to extract, and I will have a large number of pages to process one by one. The substrings are to be recorded CSV-style, one substring per line, for later use in an Excel sheet.

Replies are listed 'Best First'.
Re: Search all occurences of text delimited by START and END in a string
by flexvault (Monsignor) on May 12, 2015 at 13:29 UTC

    Hello natol44,

    Have you looked at the optional third parameter of 'index'? 'index' is very fast, and since you have one large string, you can loop to the end by starting each new search ('index') at the position where the previous match ended. (untested code):

    my $pos = 0;
    while ( $pos < length( $all ) ) {
        my $si = index ( $all, "<START>", $pos );
        if ( $si < 0 ) { last; }    # no more start markers
        my $sj = index ( $all, "<END>", $si+1 );
        if ( $sj < 0 ) { last; }    # unmatched <START>: stop rather than loop forever
        if ( $sj > $si+1 ) {
            # ... Collect your string with 'substr' etc.
        }
        $pos = $sj + length( "<END>" );    # set search for finding next text.
    }
    Note: There are working examples in the Perl Cookbook, and you can find more using Super Search.
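
For readers who want something runnable, here is a minimal sketch of the index/substr loop described above, with the 'substr' step filled in. The helper name extract_between and the sample string are assumptions made for illustration, not part of the original reply.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns every substring of $text
# that lies between $start and the next $end, scanning with index/substr.
sub extract_between {
    my ($text, $start, $end) = @_;
    my @found;
    my $pos = 0;
    while (1) {
        my $si = index($text, $start, $pos);
        last if $si < 0;                    # no further start marker
        my $from = $si + length $start;     # first character after the start marker
        my $sj = index($text, $end, $from);
        last if $sj < 0;                    # start marker without a closing end marker
        push @found, substr($text, $from, $sj - $from);
        $pos = $sj + length $end;           # resume just past the end marker
    }
    return @found;
}

# Sample input modelled on the question; prints one extracted text per line.
my $sample = '<START>TEXT1<END><br> <various data><br> <START>TEXT2<END><br>';
print "$_\n" for extract_between($sample, '<START>', '<END>');
```

Using the marker lengths instead of hard-coded offsets keeps the loop correct if the delimiters change, and bailing out on a missing end marker avoids rescanning the same start marker indefinitely.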

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

      Thank you all, and especially Ed, whose method was the best for me!

Re: Search all occurences of text delimited by START and END in a string
by choroba (Cardinal) on May 12, 2015 at 13:14 UTC
    What is the real value of START and END? If they are HTML tags, possibly with some attributes, you should use an HTML parser instead of regular expressions.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Search all occurences of text delimited by START and END in a string
by pme (Monsignor) on May 12, 2015 at 13:28 UTC
    Hi natol44,

    You can simply process your file line by line.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use diagnostics;

    my $start = '<START>';
    my $end = '<END>';

    foreach (<DATA>) {
        chomp;
        print "$1\n" if /$start(.+)$end/;
    }

    __DATA__
    <START>TEXT1<END>
    <various data>
    <START>TEXT2<END>
    <various data>
    <START>TEXT3<END>
    Or you may use HTML::Parser if your html files are not well formatted.
      While your answer is accurate to within the posted spec, for good form it's probably better to make it newline-tolerant and to encourage people who interpolate variables into regexes to escape metacharacters.
      #!/usr/bin/perl
      use strict;
      use warnings;
      use diagnostics;

      my $start = '<START>';
      my $end = '<END>';

      my $data = do {
          local $/;
          <DATA>;
      };

      while ($data =~ /\Q$start\E(.+?)\Q$end\E/sg) {
          print "$1\n";
      }

      __DATA__
      <START>TEXT1<END>
      <various data>
      <START>TEXT2<END>
      <various data>
      <START>TEXT3<END>
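
To illustrate why the non-greedy quantifier and the \Q...\E escapes matter, here is a small sketch. The one-line sample and the bracketed delimiters '[START]' and '[END]' are hypothetical, chosen only because they contain regex metacharacters.

```perl
use strict;
use warnings;

# Illustrative data only: two delimited items on the same line.
my $line = '<START>TEXT1<END> filler <START>TEXT2<END>';

# Greedy (.+) runs to the LAST <END>, so both items collapse into one capture.
my @greedy = $line =~ /<START>(.+)<END>/g;

# Non-greedy (.+?) stops at the FIRST <END> after each <START>.
my @lazy = $line =~ /<START>(.+?)<END>/sg;

print scalar(@greedy), " greedy match: $greedy[0]\n";
print scalar(@lazy), " non-greedy matches: @lazy\n";

# Hypothetical delimiters that contain regex metacharacters.
my ($start, $end) = ('[START]', '[END]');
my $text = '[START]alpha[END] [START]beta[END]';

# Unescaped, [START] is a character class matching a single letter;
# \Q...\E makes the delimiters match literally.
my @raw    = $text =~ /$start(.+?)$end/sg;
my @quoted = $text =~ /\Q$start\E(.+?)\Q$end\E/sg;

print "unescaped: @raw\n";
print "escaped:   @quoted\n";
```

In list context a /g match returns all captures at once, which is a compact alternative to the while loop when the whole page is already in memory.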
      If they are concerned about holding the whole file in memory, there is a convenient choice for record separator:
      #!/usr/bin/perl
      use strict;
      use warnings;
      use diagnostics;

      my $start = '<START>';
      my $end = '<END>';

      local $/ = $end;
      while (<DATA>) {
          while (/\Q$start\E(.+?)\Q$end\E/sg) {
              print "$1\n";
          }
      }

      __DATA__
      <START>TEXT1<END>
      <various data>
      <START>TEXT2<END>
      <various data>
      <START>TEXT3<END>
      where I've kept the regex as is, since the last record will not be <END>-delimited, and so there'd otherwise be a failure for an unmatched <START>.
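
Tying this back to the original poster's stated goal (many pages, one substring per line for Excel), the following is a rough sketch under stated assumptions. The helper names extract_all, csv_field and pages_to_csv are hypothetical, the file names are taken from the command line, and each value is written as a double-quoted CSV field.

```perl
use strict;
use warnings;

# Hypothetical helper: all substrings between the literal $start and $end markers.
# Must be called in list context to get the captures back.
sub extract_all {
    my ($html, $start, $end) = @_;
    return $html =~ /\Q$start\E(.+?)\Q$end\E/sg;
}

# Hypothetical helper: quote one value as a CSV field (embedded quotes are doubled).
sub csv_field {
    my ($value) = @_;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Hypothetical driver: read each page whole, write one extracted field per line.
sub pages_to_csv {
    my ($out_fh, $start, $end, @files) = @_;
    for my $file (@files) {
        open my $in, '<', $file or die "Cannot open $file: $!";
        my $html = do { local $/; <$in> };    # slurp the whole page
        close $in;
        print {$out_fh} csv_field($_), "\n" for extract_all($html, $start, $end);
    }
}

# Process the files named on the command line (does nothing when none are given).
pages_to_csv(\*STDOUT, '<START>', '<END>', @ARGV);
```

Slurping each page means there is no need to strip line endings first, since the /s modifier lets the capture span line breaks. For anything beyond simple fields, a dedicated CSV module would be a safer choice than hand-rolled quoting.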

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.