Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to get some data out of some HTML files I've got (quite a few). Problem is, the data isn't on the same line, so a regular expression won't help me. Basically, each string I want to extract looks like this:
<b>Title: </b>
STRING_I_WANT<br>
or sometimes
<b>Title: </b>STRING_I_WANT<br>
Is there any way to grab everything between "Title:" and the next HTML line break tag when they're not on the same line? Thanks.

Re: Help with getting everything from X until Y
by rob_au (Abbot) on May 29, 2003 at 03:59 UTC
    In addition to the other answers here which offer regular expressions and alternate line break characters which match over multiple lines, you may want to consider the use of the .. range operator:

    #!/usr/bin/perl
    while ( <DATA> ) {
        print if /<b>/ .. /<br>/;
    }
    __DATA__
    a whole lot of worthless stuff
    <b>Title: </b>
    STRING_I_WANT<br>
    more worthless meaningless stuff
    or sometimes
    <b>Title: </b>ANOTHER STRING_I_WANT<br>

    And the resultant output ...

    /users/rc6286 > perl test.perl
    <b>Title: </b>
    STRING_I_WANT<br>
    <b>Title: </b>ANOTHER STRING_I_WANT<br>

    Alternatively, if you need to ensure that the match spans multiple lines, rather than a single line that happens to contain both the start and stop patterns, use the ... range operator. It does not test the second operand until the next iteration, so a range can never begin and end on the same line. See the perlop man page under the heading "Range Operators" for further details.
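To make the difference between the two operators concrete, here is a minimal sketch. The helper name select_range and the sample lines are made up for illustration; only the /<b>/ and /<br>/ patterns come from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input (an assumption): one title on a single line, one split in two.
my @lines = (
    "junk 1\n",
    "<b>Title: </b>ONE<br>\n",
    "junk 2\n",
    "<b>Title: </b>\n",
    "TWO<br>\n",
    "junk 3\n",
);

# Hypothetical helper: returns the lines selected between /<b>/ and /<br>/,
# using the three-dot operator when $three_dots is true, else the two-dot one.
# Note: a flip-flop keeps its state between calls, so this sketch relies on
# every range being closed by the end of the input.
sub select_range {
    my ($three_dots, @input) = @_;
    my @picked;
    for (@input) {
        my $in_range = $three_dots
            ? scalar( /<b>/ ... /<br>/ )   # right side first tested on the NEXT line
            : scalar( /<b>/ ..  /<br>/ );  # right side tested on the same line too
        push @picked, $_ if $in_range;
    }
    return @picked;
}

print "two dots:\n",   select_range( 0, @lines );
print "three dots:\n", select_range( 1, @lines );
```

With this input the two-dot form closes the first range on the same line that opened it, so "junk 2" is skipped. The three-dot form keeps that range open until "TWO<br>", so "junk 2" is included as well.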

     

    perl -le 'print+unpack"N",pack"B32","00000000000000000000001001100001"'

      Thanks gents - these are all quite useful. I tamed-down the HTML a bit in my original post. It was actually a bit uglier. Here's my results w/ the 3 examples:
      local $/ = "<br>";
      while ( <DATA> ) {
          print $1 if /Title:(.*?)<br>/s;
      }
      This works nicely, but I got a lot of whitespace. Also, for some reason, I was getting an extra </b> in addition to the STRING_I_WANT. Easy to take out, but see my question at the bottom if you could.

      #2
      $whole_file =~ /Title:.*?<\/b>(.*?)<br>/ms;
      I actually assumed this would be the easiest way. I tried:
      undef $/;
      chomp($whole_file = <IN>);
      $whole_file =~ s/.*?Title:.*?<\/b>(.*?)<br>/$1/;
      print "$whole_file";
      The only thing this got me was the whole file printed out. :(

      #3
      while ( <DATA> ) {
          print if /<b>/ .. /<br>/;
      }
      This worked well. It would print out the whole line if it found something that fit in the range description. But I'm curious about one thing in this and the first example: how would I put the value into a variable when using the following?
      print if /<b>/ .. /<br>/;
      Thanks for your help. I'm making some progress but I'm still "in the books", so there are quite a few tricks I've yet to learn.
        When the while loop is written in this manner, each line that is read is assigned to Perl's default variable, $_. When subsequent matching and printing are performed without naming a variable or string to act upon, $_ is assumed and it is this which is acted upon. To keep the value, copy $_ into a variable of your own inside the loop.

        See the perlvar and perlop man pages for further detail.
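As a rough sketch of putting the range-operator output into a variable instead of printing it: the helper names collect_blocks and title_of and the sample lines are assumptions for illustration. It relies on the documented behaviour that the flip-flop's return value ends in "E0" on the last line of a range.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input (an assumption), one title on one line and one split in two.
my @lines = (
    "junk 1\n",
    "<b>Title: </b>ONE<br>\n",
    "junk 2\n",
    "<b>Title: </b>\n",
    "TWO<br>\n",
    "junk 3\n",
);

# Hypothetical helper: gathers each <b> ... <br> run of lines into one string.
sub collect_blocks {
    my (@input) = @_;
    my @blocks;
    my $current = '';
    for my $line (@input) {
        # In scalar context .. is the flip-flop; it returns a sequence number
        # while inside a range and appends "E0" on the range's final line.
        my $state = ( $line =~ /<b>/ ) .. ( $line =~ /<br>/ );
        next unless $state;
        $current .= $line;
        if ( $state =~ /E0$/ ) {
            push @blocks, $current;
            $current = '';
        }
    }
    return @blocks;
}

# Hypothetical helper: pulls the text between </b> and <br> out of one block,
# trimming surrounding whitespace.
sub title_of {
    my ($block) = @_;
    my ($title) = $block =~ /Title:.*?<\/b>\s*(.*?)\s*<br>/s;
    return $title;
}

my @titles = map { title_of($_) } collect_blocks(@lines);
print "$_\n" for @titles;
```

Each element of @titles is then an ordinary scalar that can be stored, compared, or written elsewhere.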

         

        perl -le 'print+unpack"N",pack"B32","00000000000000000000001001100010"'

        Modifiers matter. And didn't you say there was more than one occurrence in this file? Here's an untested example:

        undef $/;
        chomp($whole_file = <IN>);
        while ($whole_file =~ /Title:.*?<\/b>(.*?)<br>/sg) {
            print $1 . "\n";
        }
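Here is a small sketch of why a match loop is a better fit than the earlier s/// attempt. The sample string and variable names are assumptions. Even with /s added, a single substitution only rewrites the text up to the first <br> and leaves the rest of the file in place, whereas a /sg match in a loop hands back each capture on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample file contents (an assumption) with two titles split across lines.
my $whole_file = "junk 1\n<b>Title: </b>\nONE<br>\njunk 2\n"
               . "<b>Title: </b>\nTWO<br>\njunk 3\n";

# Substitution with /s: replaces everything up to the first <br> with the
# capture, but the remainder of the file is still there afterwards.
my $copy = $whole_file;
$copy =~ s/.*?Title:.*?<\/b>(.*?)<br>/$1/s;

# Match with /s and /g in a loop: each pass resumes after the previous match.
# The \s* on either side of the capture trims the stray whitespace.
my @titles;
while ( $whole_file =~ /Title:.*?<\/b>\s*(.*?)\s*<br>/sg ) {
    push @titles, $1;
}

print "after s///s the text still contains junk 2\n" if $copy =~ /junk 2/;
print "titles: @titles\n";
```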
Re: Help with getting everything from X until Y
by Enlil (Parson) on May 29, 2003 at 03:13 UTC
    Since you are using the <br> tag as the delimiter up to which you are getting stuff (the next found instance of an HTML line break tag), you could use it to break up your data by setting $/. By that I mean something like the following:
    use strict;
    use warnings;
    local $/ = "<br>";
    while ( <DATA> ) {
        print $1 if /Title:(.*?)<br>/s;
    }
    __DATA__
    a whole lot of worthless stuff
    <b>Title: </b>
    STRING_I_WANT<br>
    more worthless meaningless stuff
    or sometimes
    <b>Title: </b>ANOTHER STRING_I_WANT<br>

    Update: This should take care of the extra stuff. I reread your requirement (everything from "Title:" to <br>) and had done what that specified instead of retrieving just the strings you wanted. The following only retrieves those strings and shows you how to put them in a variable (basically, assign $1 to something); see rob_au's response, which answers the question you asked. Good luck, here is the code:

    -enlil
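A minimal sketch of the record-separator approach that assigns $1 to a variable and keeps only the wanted string. This is an illustration under assumptions, not the code from the update above: the helper name titles_by_record and the sample text are invented, and an in-memory filehandle stands in for a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads <br>-terminated records and returns the titles.
sub titles_by_record {
    my ($html) = @_;

    # Open a read handle on the string itself (available since Perl 5.8).
    open my $fh, '<', \$html or die "cannot open in-memory handle: $!";

    local $/ = "<br>";    # each read now returns text up to and including <br>
    my @titles;
    while ( my $record = <$fh> ) {
        if ( $record =~ /Title:.*?<\/b>\s*(.*?)\s*<br>/s ) {
            my $title = $1;    # copy the capture before the next match
            push @titles, $title;
        }
    }
    close $fh;
    return @titles;
}

# Sample text (an assumption).
my $sample = "a whole lot of worthless stuff\n<b>Title: </b>\nSTRING_I_WANT<br>\n"
           . "more worthless meaningless stuff\n"
           . "<b>Title: </b>ANOTHER STRING_I_WANT<br>\n";

print "$_\n" for titles_by_record($sample);
```

Matching through </b> before the capture is what drops the stray closing tag, and the \s* on both sides removes the surrounding whitespace.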

Re: Help with getting everything from X until Y
by perrin (Chancellor) on May 29, 2003 at 03:28 UTC
    Just slurp the file into a scalar and use a regex as usual:

    $whole_file =~ /Title:.*?<\/b>(.*?)<br>/sg;
Re: Help with getting everything from X until Y
by Cody Pendant (Prior) on May 29, 2003 at 05:32 UTC
    >Problem is, the data isn't on the same line, so a regular >expression won't help me.

    I just want to check that we're not missing something fundamental here -- this sentence definitely implies that the original poster believes regexes are unable to match patterns which cross a linebreak.

    In case that truly was his or her belief -- they can. Specifically, the dot character in regexes matches everything except a linebreak, unless the /s modifier is used.

    So to match this:

    <b>Title: </b>
    STRING_I_WANT<br>
    you can just use
    m#<b>Title:.*?</b>(.*?)<br>#s;
    right?
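A quick way to check this is to run the same pattern with and without /s. This is only a sketch; the sample string and variable names are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The title text sits on the line after the closing </b>.
my $text = "<b>Title: </b>\nSTRING_I_WANT<br>";

# Without /s the dot stops at the newline, so the match fails and the
# list assignment leaves $without undefined.
my ($without) = $text =~ m#<b>Title:.*?</b>(.*?)<br>#;

# With /s the dot also matches the newline, so the capture succeeds.
my ($with) = $text =~ m#<b>Title:.*?</b>(.*?)<br>#s;

print "without /s: ", ( defined $without ? "matched" : "no match" ), "\n";
print "with /s: ",    ( defined $with    ? "matched" : "no match" ), "\n";
```

The capture made with /s still begins with the newline, so it usually wants trimming before use.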
    --
    “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
    M-J D