regex and HTML

toadi has asked for the wisdom of the Perl Monks concerning the following question:

 <tr bgcolor="#CCCCCC">
   <td align="right" width="32%" bgcolor="#CCCCCC"><b><font size="1" face="Verdana, Arial, Helvetica, sans-serif">Firstname:</font></b></td>
   <td width="68%" bgcolor="#CCCCCC"><font face="Verdana, Arial, Helvetica, sans-serif" size="1">&nbsp;</font></td>
 </tr>
 <tr bgcolor="#F4F4ED">
   <td align="right" width="32%"><b><font face="Verdana, Arial, Helvetica, sans-serif" size="1">Lastname:</font></b></td>
   <td width="68%"><font face="Verdana, Arial, Helvetica, sans-serif" size="1">Luc Bomans&nbsp;</font></td>
 </tr>

I need to parse the first and last name out of an HTML page, and I still need to know which one is the first name and which is the last name when matching.
I've tried a lot, but my regexes suck on these big things!
Can someone please point me to the light...

--
My opinions may have changed,
but not the fact that I am right


Replies are listed 'Best First'.
Re: regex and HTML by Coyote (Deacon) on Apr 09, 2001 at 19:56 UTC
Try using one of the HTML parsing modules, such as HTML::Parser or HTML::TokeParser, to do this. Parsing HTML with regexen is a perilous endeavor. You will get your project done much faster and with far fewer errors if you take the virtuous (i.e., lazy) route and use one of these modules. ---- Coyote
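As a rough illustration of the module route, here is a minimal sketch using HTML::TokeParser (from the CPAN HTML-Parser distribution). The reduced sample markup and the helper name `names_from_table` are assumptions made for this example, and it assumes each label cell is immediately followed by its value cell, as in the question's table.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Reduced sample modeled on the table rows in the question.
my $html = <<'HTML';
<tr><td align="right"><b><font size="1">Firstname:</font></b></td>
    <td><font size="1">&nbsp;</font></td></tr>
<tr><td align="right"><b><font size="1">Lastname:</font></b></td>
    <td><font size="1">Luc Bomans&nbsp;</font></td></tr>
HTML

# Hypothetical helper: returns a hash ref mapping "Firstname" and
# "Lastname" to the text of the cell that follows each label cell.
sub names_from_table {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "cannot parse document";
    my %name;
    while ($p->get_tag('td')) {
        my $label = $p->get_trimmed_text('/td');
        next unless $label =~ /^(Firstname|Lastname):$/;
        my $key = $1;
        $p->get_tag('td') or last;        # assumption: value cell comes next
        my $value = $p->get_trimmed_text('/td');
        $value =~ s/\xA0/ /g;             # &nbsp; is decoded to U+00A0
        $value =~ s/^\s+|\s+$//g;         # trim what is left
        $name{$key} = $value;
    }
    return \%name;
}

my $name = names_from_table($html);
print "First: '$name->{Firstname}'\n";
print "Last:  '$name->{Lastname}'\n";
```

Because the label is captured along with the value, the code always knows which name is which, even when one of the cells is empty.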
Re: regex and HTML by jeroenes (Priest) on Apr 09, 2001 at 20:03 UTC
You can simplify your regexes a lot by deleting all the HTML tags first. After that, you only have to look at the structure of the remaining text and whitespace. You could try:

    undef $/;
    my $html = <DATA>;
    $html =~ s/<.+?>//sg;   # strips all HTML, in a quick'n'dirty way
    $html =~ s/\s+/ /sg;    # normalizes whitespace

The final match is left as an exercise...

Jeroen
"We are not alone"(FZ)
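One possible way to finish that exercise is sketched below. It assumes the labels are literally `Firstname:` and `Lastname:`, that `&nbsp;` is the only entity in the text, and that no tag contains a literal `>` inside an attribute; the helper name `extract_names` and the reduced sample are made up for the example.

```perl
use strict;
use warnings;

# Reduced sample modeled on the table rows in the question.
my $html = <<'HTML';
<tr><td align="right"><b><font size="1">Firstname:</font></b></td>
    <td><font size="1">&nbsp;</font></td></tr>
<tr><td align="right"><b><font size="1">Lastname:</font></b></td>
    <td><font size="1">Luc Bomans&nbsp;</font></td></tr>
HTML

# Hypothetical helper: strip the tags, normalize whitespace, then take
# whatever text sits between one label and the next (or the end).
sub extract_names {
    my ($text) = @_;
    $text =~ s/<.+?>//sg;     # quick'n'dirty tag strip
    $text =~ s/&nbsp;/ /g;    # assumption: the only entity present
    $text =~ s/\s+/ /g;       # normalize whitespace
    my %name;
    while ($text =~ /(Firstname|Lastname):\s*(.*?)\s*(?=(?:Firstname|Lastname):|\z)/g) {
        $name{$1} = $2;       # label => value (possibly empty)
    }
    return \%name;
}

my $name = extract_names($html);
print "First: '$name->{Firstname}'\n";
print "Last:  '$name->{Lastname}'\n";
```

The lookahead stops each value at the next label, so an empty first name stays empty instead of swallowing the following label.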
Re: Re: regex and HTML by Beatnik (Parson) on Apr 09, 2001 at 20:14 UTC
PerlFAQ9 has an entry on stripping HTML. Greetz Beatnik ... Quidquid perl dictum sit, altum viditur.
Re: Re: regex and HTML by toadi (Chaplain) on Apr 09, 2001 at 23:49 UTC
lol, I got the same idea as you and just stripped the HTML. But your regex is prettier than mine, so I won't post it :P -- My opinions may have changed, but not the fact that I am right
Re: regex and HTML by premchai21 (Curate) on Apr 09, 2001 at 19:51 UTC
If it's well-formed, XHTML-wise, try XML::Parser.
Re: regex and HTML by suaveant (Parson) on Apr 09, 2001 at 20:54 UTC
You could do something like...

    /Firstname:(?:\s|<.*?>)*([^&<]+)/s;
    print "First: $1\n";
    /Lastname:(?:\s|<.*?>)*([^&<]+)/s;
    print "Last: $1\n";

I tried this, it works... It looks for Firstname:, then goes through as many combinations of whitespace characters and HTML tags as it finds. When it runs out of those, it grabs all the characters until it hits an & or a <. If the name is always followed by an `&nbsp;`, you could do

    /Firstname:(?:\s|<.*?>)*(.*?)&nbsp;/s;
    print "First: $1\n";
    /Lastname:(?:\s|<.*?>)*(.*?)&nbsp;/s;
    print "Last: $1\n";

The name is in $1.

- Ant
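A slightly tightened variant of the label-then-tags pattern is sketched below; it is an illustration, not a tested replacement for the code above. It assumes tags never contain a literal `>`, so `<[^>]*>` can stand in for `<.*?>`, and it lets the capture be empty. With an empty cell such as the Firstname row in the question, a `.*?` under /s can apparently stretch across the `&nbsp;` and later tags, so a one-or-more capture may end up grabbing text from a later cell. The helper name `field_after` and the reduced sample are made up for the example.

```perl
use strict;
use warnings;

# Reduced sample modeled on the table rows in the question.
my $html = <<'HTML';
<tr><td align="right"><b><font size="1">Firstname:</font></b></td>
    <td><font size="1">&nbsp;</font></td></tr>
<tr><td align="right"><b><font size="1">Lastname:</font></b></td>
    <td><font size="1">Luc Bomans&nbsp;</font></td></tr>
HTML

# Hypothetical helper: returns the text that follows "$label:" once any
# run of whitespace and tags has been skipped, or undef if the label is
# not found. The value may legitimately be the empty string.
sub field_after {
    my ($doc, $label) = @_;
    return undef unless $doc =~ /\Q$label\E:(?:\s|<[^>]*>)*([^&<]*)/;
    my $value = $1;
    $value =~ s/\s+$//;       # drop trailing whitespace before & or <
    return $value;
}

for my $label (qw(Firstname Lastname)) {
    my $value = field_after($html, $label);
    print "$label: '", (defined $value ? $value : '(not found)'), "'\n";
}
```

Passing the label in as an argument keeps the first/last distinction explicit, and checking the return value avoids printing a stale `$1` when a match fails.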