Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys! Had a favor (again!) to ask of you smart people - I have two problems:

1. I have an HTML page in the form -
JUNK JUNK NAME NAME PANE
And I want to extract the "NAME NAME" part only (not the stuff above or below NAME NAME either). How would I go about doing that? I tried a regex like m/(.*)PANE\s*([^<]+)/gi; but that didn't work. Any thoughts? BTW, "PANE" is the keyword which is always there; the NAME NAME changes.

2. I have the same HTML page, which has a lot of text but also has a website URL (not a hyperlink) surrounded by either "" or < >, and I want to extract it. The thing is that when the < > is there I can get the URL easily, but when I use a regex to find the URL inside "", I also get the other content on the page that is in quotes.

Any help would be appreciated.

Thanks.

Replies are listed 'Best First'.
Re: Reg Ex problems....
by Enlil (Parson) on Jan 09, 2003 at 05:59 UTC
    This works (I have to run, but will explain later if needed). Yes, it assumes a lot (for instance, that the stuff in NAME NAME will be all \w chars), but maybe it will give you a nudge in the right direction:
    use strict;
    my $stuff = <<EOF;
    JUNK
    JUNK
    NAME NAME
    PANE
    THIS IS OTHER JUNK
    BAH
    EOF
    if ($stuff =~ m/(\w+ \w+)\nPANE/) {
        print $1;
    }
    As for part 2: you are probably using a greedy .*, which you can fix by changing it to .*? or, better yet, [^"]+
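To illustrate the greedy versus non-greedy point for part 2, here is a minimal sketch. The sample text and the URL in it are made up, and anchoring on an http:// prefix is an assumption about what the URLs on the page look like, not something taken from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample: two quoted strings, only one of which is a URL.
my $text = 'He said "hello there" and then "http://www.example.com/page" was shown.';

# Greedy: .* runs to the LAST quote, so it swallows everything in between.
my ($greedy) = $text =~ /"(.*)"/;

# Non-greedy: .*? stops at the first closing quote, but it still grabs
# the first quoted string, which is not the URL.
my ($lazy) = $text =~ /"(.*?)"/;

# Negated class plus an assumed http:// prefix: the match cannot cross
# a quote, and ordinary quoted text is skipped.
my ($url) = $text =~ m{"(https?://[^"]+)"};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "url:    $url\n";
```

Only the last pattern prints just the URL. If the URLs do not all start with http:// or https://, the prefix would need adjusting.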

    update: You say that you are left with a result like this:

    abcddeds. name name pane pane date
    so you can try something like so:
    use strict;
    my $string = ' abcddeds. name name PANE pane date';
    if ($string =~ /(\w+   # one or more word chars (alphanumeric plus _)
                     \s+   # at least one space
                     \w+   # one or more word chars
                    )      # close capturing parens
                    \s+    # another space
                    pane   # matches pane
                   /ix     # "i" makes it case insensitive, "x" makes it so I can add comments
       ) {
        print $1;
    }
    You should really take a look at perlre, and try to figure out why what I initially wrote failed against what you say the results looked like. Again, though, I took a lot of liberty in assuming that "name name" would contain only alphanumeric chars. The i modifier was added because initially you had PANE, and now it is pane.
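As a hedged illustration of why the first pattern missed, here is a sketch that runs both patterns against the flattened string. The exact whitespace in the real data is unknown, so the sample string below is an assumption.

```perl
use strict;
use warnings;

# Assumed shape of the text after the HTML tags were stripped.
my $string = ' abcddeds. name name pane pane date';

# First attempt: needs a single literal space between the two words,
# a newline right after them, and an upper-case PANE.
my $first = $string =~ m/(\w+ \w+)\nPANE/ ? $1 : 'no match';

# Updated attempt: any run of whitespace, and case-insensitive.
my $second = $string =~ m/(\w+\s+\w+)\s+pane/i ? $1 : 'no match';

print "first:  $first\n";
print "second: $second\n";
```

With this sample the first pattern reports no match, while the second captures the two words in front of "pane".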

    -enlil

      Enlil, thanks for the reply, but unfortunately it didn't work for me. I rechecked my code and it seems that after removing all the HTML tags from the HTML page, I have a result like
      abcddeds. name name pane pane date
      All I need to get is the "name name" before the "pane". Do you think the empty spaces could be causing a problem? Thanks.
Re: Reg Ex problems....
by seattlejohn (Deacon) on Jan 09, 2003 at 07:24 UTC
    For #1: I think you want the trailing s modifier on any pattern you use, since the text you're matching against contains newlines that you want to treat as normal whitespace characters. Perhaps something like this would work:
    m/\n([^\n]*)\nPANE/s

    For #2: It sounds like your regex for identifying a URL might make some erroneous assumptions. Perhaps if you posted the specific code someone could offer more detailed assistance.

            $perlmonks{seattlejohn} = 'John Clyman';

      The m and s modifiers are the everlasting objects of confusion for regexes. All the s modifier does is make . match *everything*, including newline. You're right that he probably wants to use s in his pattern, but in your pattern you've changed the dot to [^\n], and now the s has no effect.

      Though, I'm one of those that propagate a wide use of s, simply because it's so often forgotten. So ++ for you for pointing it out. :)
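A small sketch of the point about the s modifier, using a made-up three-line string: the modifier changes what the dot matches and nothing else, so a pattern built on [^\n] behaves the same with or without it.

```perl
use strict;
use warnings;

# Hypothetical multi-line input.
my $text = "JUNK\nNAME NAME\nPANE\n";

# Without /s the dot stops at a newline, so .* cannot span two lines.
my $without_s = $text =~ /JUNK(.*)PANE/  ? 'match' : 'no match';

# With /s the dot also matches "\n", so the same pattern now succeeds.
my $with_s    = $text =~ /JUNK(.*)PANE/s ? 'match' : 'no match';

# [^\n] never matches a newline, so /s makes no difference here.
my ($no_flag) = $text =~ /\n([^\n]*)\nPANE/;
my ($s_flag)  = $text =~ /\n([^\n]*)\nPANE/s;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
print "captures:   $no_flag | $s_flag\n";
```

Both captures come out identical, which is why the trailing s on the [^\n] version is harmless but does nothing.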

      ihb
Re: Reg Ex problems....
by helgi (Hermit) on Jan 09, 2003 at 10:59 UTC
    Don't use a regex to parse HTML. Use a module that understands HTML. I suggest HTML::TokeParser::Simple.

    --
    Regards,
    Helgi Briem
    helgi AT decode DOT is