toadi has asked for the wisdom of the Perl Monks concerning the following question:

<tr bgcolor="#CCCCCC">
  <td align="right" width="32%" bgcolor="#CCCCCC"><b><font size="1" face="Verdana, Arial, Helvetica, sans-serif">Firstname:</font></b></td>
  <td width="68%" bgcolor="#CCCCCC"><font face="Verdana, Arial, Helvetica, sans-serif" size="1">&nbsp;</font></td>
</tr>
<tr bgcolor="#F4F4ED">
  <td align="right" width="32%"><b><font face="Verdana, Arial, Helvetica, sans-serif" size="1">Lastname:</font></b></td>
  <td width="68%"><font face="Verdana, Arial, Helvetica, sans-serif" size="1">Luc Bomans&nbsp;</font></td>
</tr>
I need to parse the first and last name out of an HTML page, and I still need to know which one is the first name and which is the last name when matching.
I've tried a lot, but my regexes suck on these big things!
Can someone please point me to the light...


--
My opinions may have changed,
but not the fact that I am right

Replies are listed 'Best First'.
Re: regex and HTML
by Coyote (Deacon) on Apr 09, 2001 at 19:56 UTC
    Try using one of the HTML parsing modules, such as HTML::Parser or HTML::TokeParser, to do this. Parsing HTML with regexen is a perilous endeavor. You will get your project done much faster and with far fewer errors if you take the virtuous (i.e., lazy) route and use one of these modules.
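
    As a rough illustration of the module route, here is a minimal sketch using HTML::TokeParser on a trimmed-down copy of the markup from the question. The helper name extract_fields() and the assumption that every row holds exactly two cells (a label and a value) are mine, not part of the module or the original post.

    ```perl
    use strict;
    use warnings;
    use HTML::TokeParser;

    # extract_fields() is a hypothetical helper name. It assumes every row
    # has exactly two cells: a label such as "Lastname:" and its value.
    sub extract_fields {
        my ($html) = @_;
        my $p = HTML::TokeParser->new(\$html) or die "cannot parse HTML";
        my %field;
        while ($p->get_tag('tr')) {
            my @cells;
            for (1 .. 2) {
                $p->get_tag('td') or last;
                my $text = $p->get_text('/td');  # text up to </td>, entities decoded
                $text =~ tr/\xA0/ /;             # &nbsp; decodes to \xA0
                $text =~ s/^\s+|\s+$//g;         # trim both ends
                push @cells, $text;
            }
            next unless @cells == 2;
            (my $label = $cells[0]) =~ s/:$//;   # "Lastname:" -> "Lastname"
            $field{$label} = $cells[1];
        }
        return %field;
    }

    # Trimmed-down version of the markup from the question.
    my $html = q{
    <tr><td><b><font size="1">Firstname:</font></b></td>
        <td><font size="1">&nbsp;</font></td></tr>
    <tr><td><b><font size="1">Lastname:</font></b></td>
        <td><font size="1">Luc Bomans&nbsp;</font></td></tr>
    };

    my %name = extract_fields($html);
    print "First: '$name{Firstname}'\n";
    print "Last:  '$name{Lastname}'\n";
    ```

    Because the label is read from the same row as the value, this keeps track of which name is which without depending on the font and color attributes.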

    ----
    Coyote

Re: regex and HTML
by jeroenes (Priest) on Apr 09, 2001 at 20:03 UTC
    You can simplify your regexes a lot by deleting all the HTML tags first. After that, you only have to look at the structure of the remaining text and whitespace. You could try:
    undef $/;                # slurp mode: read the whole input at once
    my $html = <DATA>;
    $html =~ s/<.+?>//sg;    # strips all HTML, in a quick'n'dirty way
    $html =~ s/\s+/ /sg;     # normalizes whitespace
    The final match is left as an exercise...
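
    One possible way to finish that exercise is sketched below. The helper name names_from_html() is hypothetical, and the sketch assumes that "Firstname:" comes before "Lastname:" and that the last name is the final piece of text in the fragment; a real page would need tighter anchors.

    ```perl
    use strict;
    use warnings;

    # names_from_html() is a hypothetical helper name. It assumes the
    # "Firstname:" label comes before "Lastname:" and that the last name
    # is the final piece of text in the fragment.
    sub names_from_html {
        my ($html) = @_;
        $html =~ s/<.+?>//sg;     # strip tags, quick'n'dirty
        $html =~ s/&nbsp;/ /g;    # treat the non-breaking space entity as a space
        $html =~ s/\s+/ /g;       # normalize whitespace
        my ($first) = $html =~ /Firstname:\s*(.*?)\s*Lastname:/;
        my ($last)  = $html =~ /Lastname:\s*(.*?)\s*$/;
        return ($first, $last);
    }

    # Trimmed-down version of the markup from the question.
    my $sample = q{
    <tr><td><b><font size="1">Firstname:</font></b></td>
        <td><font size="1">&nbsp;</font></td></tr>
    <tr><td><b><font size="1">Lastname:</font></b></td>
        <td><font size="1">Luc Bomans&nbsp;</font></td></tr>
    };

    my ($first, $last) = names_from_html($sample);
    print "First: '$first'\n";
    print "Last:  '$last'\n";
    ```

    Once the tags are gone, the two labels are the only landmarks left, so the match leans on them rather than on the markup.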

    Jeroen
    "We are not alone"(FZ)

      lol,
      Got the same idea as you. Just stripped the HTML. But your regex is prettier than mine, so I won't post it :P

      --
      My opinions may have changed,
      but not the fact that I am right

Re: regex and HTML
by premchai21 (Curate) on Apr 09, 2001 at 19:51 UTC
Re: regex and HTML
by suaveant (Parson) on Apr 09, 2001 at 20:54 UTC
    You could do something like...
    print "First: $1\n" if /Firstname:(?:\s|<.*?>)*([^&<]+)/s;
    print "Last: $1\n"  if /Lastname:(?:\s|<.*?>)*([^&<]+)/s;
    I tried this, it works... It looks for Firstname:, then skips over as many whitespace characters and HTML tags as it finds. When it runs out of those, it grabs all the characters until it hits an & or a <.

    If the name always has an &nbsp; you could do

    print "First: $1\n" if /Firstname:(?:\s|<.*?>)*(.*?)&nbsp;/s;
    print "Last: $1\n"  if /Lastname:(?:\s|<.*?>)*(.*?)&nbsp;/s;
    Name is in $1
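
    A small sketch of the &nbsp;-terminated pattern wrapped in a function, so that a failed match gives undef instead of a stale $1. The helper name field_after() is hypothetical. One thing worth noting about the sample in the question: its Firstname cell contains only &nbsp;, so the [^&<]+ form has nothing to capture there and can backtrack past the cell into later text, while the &nbsp; form returns an empty string.

    ```perl
    use strict;
    use warnings;

    # field_after() is a hypothetical helper name. It returns the text that
    # follows "$label:" once whitespace and tags are skipped, up to the next
    # &nbsp;, or undef when the label or the &nbsp; is missing.
    sub field_after {
        my ($label, $html) = @_;
        my ($value) = $html =~ /\Q$label\E:(?:\s|<.*?>)*(.*?)&nbsp;/s;
        return $value;
    }

    # Trimmed-down version of the markup from the question.
    my $page = q{
    <tr><td><b><font size="1">Firstname:</font></b></td>
        <td><font size="1">&nbsp;</font></td></tr>
    <tr><td><b><font size="1">Lastname:</font></b></td>
        <td><font size="1">Luc Bomans&nbsp;</font></td></tr>
    };

    for my $label (qw(Firstname Lastname)) {
        my $value = field_after($label, $page);
        print "$label: ", (defined $value ? "'$value'" : 'not found'), "\n";
    }
    ```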
                    - Ant