Reg Ex problems....

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys! Had a favor (again!) to ask of you smart people - I have two problems:

1. I have an HTML page in the form -

JUNK JUNK

NAME NAME
PANE

And I want to extract only the "NAME NAME" part (not the stuff above or below it). How would I go about doing that? I tried a regex like `m/(.*)PANE\s*([^<]+)/gi;` but that didn't work. Any thoughts? BTW, "PANE" is the keyword that is always there; only the NAME NAME changes.

2. The same HTML page has a lot of text, but it will also have a website URL (not a hyperlink) surrounded by either " " or < >, and I want to extract it. When the < > is there I can get the URL easily, but when I use a regex to find the URL inside " ", I also get the other content on the page that is in quotes.

Any help would be appreciated.

Thanks.


Replies are listed 'Best First'.
Re: Reg Ex problems.... by Enlil (Parson) on Jan 09, 2003 at 05:59 UTC
This works (I have to run, but I will explain later if needed). Yes, it assumes a lot (for instance, that the stuff in NAME NAME will be all \w chars), but maybe it will give you a nudge in the right direction:

    use strict;
    my $stuff = <<EOF;
    JUNK JUNK
    NAME NAME
    PANE
    THIS IS OTHER JUNK BAH
    EOF
    if ($stuff =~ m/(\w+ \w+)\nPANE/) {
        print $1;
    }

As for part 2: you are probably using a greedy `.*`, which would probably be alleviated by changing it to `.*?` or, better yet, `[^"]+`.

Update: You say that you are left with a result like this:

    abcddeds. name name pane pane date

so you can try something like this:

    use strict;
    my $string = ' abcddeds. name name PANE pane date';
    if ($string =~ /(\w+    # one or more word chars (alphanumerics plus _ are matched)
                     \s+    # at least one space
                     \w+    # one or more word chars
                    )       # close capturing parens
                    \s+     # another space
                    pane    # matches pane
                   /ix      # "i" makes it case insensitive, "x" makes it so I can add comments
       ) {
        print $1;
    }

You should really take a look at perlre, and try to figure out why what I initially wrote failed against what you say the results looked like. Again, though, I took a lot of liberty in assuming that "name name" would contain only alphanumeric chars. The i modifier was added because initially you had PANE, and now it is pane.

-enlil
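To illustrate the greedy-versus-negated-class point for part 2, here is a minimal sketch. The sample text and the URLs in it are made up for illustration, and the assumption that every URL of interest starts with http:// or https:// is mine, not the original poster's.

```perl
use strict;
use warnings;

# Hypothetical sample text: one quoted phrase that is not a URL,
# one quoted URL, and one URL in angle brackets.
my $text = 'He said "hello there" and see "http://www.example.com/a" or <http://www.example.org/b>';

# Greedy: .* runs from the first double quote to the last one,
# so unrelated quoted text gets swallowed along with the URL.
my ($greedy) = $text =~ /"(.*)"/;

# Negated class: [^">\s]+ cannot cross a quote, a > or whitespace,
# and requiring the scheme skips quoted text that is not a URL.
# (Assumption: URLs of interest begin with http:// or https://.)
my @urls = $text =~ m{["<](https?://[^">\s]+)[">]}g;

print "greedy: $greedy\n";
print "url:    $_\n" for @urls;
```

Anchoring on the scheme is what keeps the ordinary quoted text out of the results; if the URLs on the real page can lack a scheme, the pattern would need a different anchor.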
Re: Re: Reg Ex problems.... by Anonymous Monk on Jan 09, 2003 at 06:20 UTC
Enlil, thanks for the reply, but unfortunately it didn't work for me. I rechecked my code, and it seems that after removing all the HTML tags from the HTML page, I have a result like `abcddeds. name name pane pane date`. All I need to get is the "name name" before the "pane". Do you think the empty spaces could be causing a problem? Thanks.
Re: Reg Ex problems.... by seattlejohn (Deacon) on Jan 09, 2003 at 07:24 UTC
For #1: I think you want the trailing `s` modifier on any pattern you use, since the text you're matching against contains newlines that you want to treat as normal whitespace characters. Perhaps something like this would work: `m/\n([^\n]*)\nPANE/s`

For #2: It sounds like your regex for identifying a URL might make some erroneous assumptions. Perhaps if you posted the specific code, someone could offer more detailed assistance.

$perlmonks{seattlejohn} = 'John Clyman';
Re: Re: Reg Ex problems.... by ihb (Deacon) on Jan 09, 2003 at 15:41 UTC
The `m` and `s` modifiers are the everlasting objects of confusion for regexes. What the `s` modifier does is nothing but making `.` match everything, including newline. You're right that he probably wants to use `s` in his pattern, but in your pattern you've changed the dot to `[^\n]`, and now the `s` has no effect. Though, I'm one of those that propagate a wide use of `s`, simply because it's so often forgotten. So ++ for you for pointing it out. :)

ihb
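A small sketch of the distinction drawn above, using a made-up three-line string: `s` only changes what `.` matches, while `m` only changes where `^` and `$` match. The variable names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical three-line input, similar in shape to the stripped page text.
my $text = "name name\npane\ndate";

# Without /s, . stops at a newline, so .+ captures only the first line.
my ($no_s) = $text =~ /(.+)/;

# With /s, . also matches newlines, so .+ captures the whole string.
my ($with_s) = $text =~ /(.+)/s;

# A pattern with no dot behaves the same with or without /s.
my ($plain)  = $text =~ /([^\n]*)\npane/;
my ($with_s2) = $text =~ /([^\n]*)\npane/s;

# /m lets ^ and $ match at line boundaries inside the string.
my $no_m   = $text =~ /^pane$/  ? 1 : 0;
my $with_m = $text =~ /^pane$/m ? 1 : 0;

print "no /s:   [$no_s]\n";
print "same result without a dot: ", ($plain eq $with_s2 ? "yes" : "no"), "\n";
print "^pane\$ matches without /m: $no_m, with /m: $with_m\n";
```

So for a line-oriented match such as "the line before pane", it is `m` (with `^` and `$`) or an explicit `\n` that does the work; `s` matters only once a `.` has to cross a line break.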
Re: Reg Ex problems.... by helgi (Hermit) on Jan 09, 2003 at 10:59 UTC
Don't use a regex to parse HTML. Use a module that understands HTML. I suggest HTML::TokeParser::Simple.

-- Regards, Helgi Briem helgi AT decode DOT is