sugarkannan has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am a newbie to Perl. I require your guidance with the
following problems. Any suggestions are appreciated.

1. I have the following piece of code, but I am unable to
get the required chunk of data, i.e. the values between
color="BA3B38"> and </td></tr>. I am unable to get
past the first <br> in the first line. I tried
removing \n using translation, $content =~ tr[\n][]d;
but the program still didn't work.
$content='<td valign="top" align="justify"><font face="arial" size="2" color="BA3B38">Java and databases make a powerful combination. <br> Getting the two sides <br> to work together <br> Takes some effort to work together JDBC 2 javax.SQL </td></tr>';
$content =~ m{<td valign="top" align="justify"><font face="arial" size="2" color="BA3B38">([.*?]+)<\td><\tr>};
print "$1\n\n";
2. I have the following chunk of HTML. From it I need the UK
Pound price and the USD price. The problem is that we are
unable to get the price portion (i.e. the numbers 12 and 27),
and here too we run into the newline character problem when
trying to extract the whole chunk and process the values one by one.
<tr><td><font face="verdana" size="1"><b>Product Name</b> </td><td>:</td>
<td><div align="justify"><font face="verdana" size="1">Queen of Alls</div></td>
<td rowspan="5" valign="top"><img src="pictures/queen.gif"></td></tr>
</tr>
<tr><td><font face="verdana" size="1"><b>Price(UK Pound)</b></td><td>:</td>
<td><font face="verdana" size="1">12.00</td></tr>
<tr><td><font face="verdana" size="1"><b>Price($)</b></td><td>:</td>
<td><font face="verdana" size="1">$27.27</td></tr>
<tr valign="TOP"><td><font face="verdana" size="1"><b>Description</b></td><td>:</td></tr>
3. "Parsing of undecoded UTF-8 will give garbage when
decoding entities at D:/Perl/site/lib/LWP/Protocol.pm
line 114." Is there any solution to this problem?

Thanks,
Sugar

Replies are listed 'Best First'.
Re: Problems with LWP and REGEX
by pileofrogs (Priest) on Nov 18, 2005 at 20:40 UTC

    OK, I think I can help you with your problems 1 & 2, but I don't know anything about 3.

    1

    Your first question is about matching multiple times on multiple lines in one string, right? Your question would apply to data in the form of:

    FOO=1
    FOO=2
    FOO=3

    right?

    #! /usr/bin/perl -w
    use strict;

    my $string = "FOO=1\nFOO=2\nFOO=3\n";
    my @list;
    while ($string =~ /^FOO=(\d)/mg) {
        push (@list,$1);
    }
    print join(",",@list)."\n";
    # prints out 1,2,3

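    Note the 'mg' at the end of the regex. The 'm' puts the regex in multiline mode, so ^ matches at the start of every line rather than only at the start of the string, and the 'g' means global matching, so each pass through the while loop picks up the next match.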
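
    To see what each modifier contributes, here is a small sketch using the same sample string. It relies on /g returning all captures at once in list context; the array names are just for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "FOO=1\nFOO=2\nFOO=3\n";

# With /m, ^ matches at the start of every line; in list context,
# /g returns all of the captured values at once.
my @with_m = $string =~ /^FOO=(\d)/mg;

# Without /m, ^ matches only at the very start of the string,
# so only the first line can match.
my @without_m = $string =~ /^FOO=(\d)/g;

print join(",", @with_m), "\n";       # prints 1,2,3
print join(",", @without_m), "\n";    # prints 1
```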

    Does that answer question number 1?

    2

    You're scanning the html code block for the pound and dollar values, right? I don't know if this is the right way to do this, but I usually do this kind of thing like this.

    This assumes all dollar values are prefixed by $ and pound values are not, and that there aren't any other numbers in there that look kind of like money.

    my $dollars = 0;
    my $pounds = 0;
    while (<FILE>) {
        if (/>(\d+\.\d{2})</) {
            # no dollar sign, must be pounds
            $pounds = $1;
        }
        elsif (/>\$(\d+\.\d{2})/) {
            # it's got a dollar sign, must be dollars
            $dollars = $1;
        }
    }
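
    If the whole chunk is already in one string rather than being read line by line, the same two patterns can be run against the string directly. The newlines between tags should not get in the way, since neither pattern needs to span a line. A rough sketch under the same assumptions about the data as above ($html is a made-up variable holding a cut-down copy of the chunk from the question):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variable: a cut-down copy of the HTML from the question.
# The single-quoted heredoc keeps the $ signs literal.
my $html = <<'END_HTML';
<tr><td><font face="verdana" size="1"><b>Price(UK Pound)</b></td><td>:</td>
<td><font face="verdana" size="1">12.00</td></tr>
<tr><td><font face="verdana" size="1"><b>Price($)</b></td><td>:</td>
<td><font face="verdana" size="1">$27.27</td></tr>
END_HTML

# A price directly after ">" with no dollar sign is taken to be pounds.
my ($pounds) = $html =~ />(\d+\.\d{2})</;

# A price directly after ">$" is taken to be dollars.
my ($dollars) = $html =~ />\$(\d+\.\d{2})</;

print "pounds=$pounds dollars=$dollars\n";    # prints pounds=12.00 dollars=27.27
```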
      Thanks, Monks!
      Thank you for your valuable info!
      Expecting more from you all!

      -Sugar
Re: Problems with LWP and REGEX
by ptum (Priest) on Nov 18, 2005 at 20:59 UTC

    As a general solution to any HTML-parsing problem, I've had good success with HTML::TreeBuilder. Once you get the response object back (I assume you're using LWP or WWW::Mechanize), you can elementify it and then step through the resulting tree, looking for your code or content with the as_HTML() or as_text() methods (HTML::Element).

    I never really got TreeBuilder working well when dealing with nested tables, but that may have been more a lack of diligence on my part than any particular deficiency in TreeBuilder or HTML::Parser.
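
    For what it's worth, a rough sketch of that approach applied to the price rows from the question might look like the following. It assumes HTML::TreeBuilder is installed from CPAN. The sample is cut down and wrapped in a <table> tag, which the original chunk did not have, and the %price hash is only for illustration. It walks a flat table only, so it does not address the nested-table case.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Cut-down sample from the question, wrapped in <table> (an addition here)
# so the rows sit inside a table.
my $html = <<'END_HTML';
<table>
<tr><td><font face="verdana" size="1"><b>Price(UK Pound)</b></td><td>:</td>
<td><font face="verdana" size="1">12.00</td></tr>
<tr><td><font face="verdana" size="1"><b>Price($)</b></td><td>:</td>
<td><font face="verdana" size="1">$27.27</td></tr>
</table>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my %price;    # label cell text => value cell text
for my $row ($tree->look_down(_tag => 'tr')) {
    # as_text() returns the cell content with the markup stripped.
    my @cells = map { $_->as_text } $row->look_down(_tag => 'td');
    next unless @cells >= 3;    # expect label, ":", value
    $price{ $cells[0] } = $cells[2];
}

print "$_ => $price{$_}\n" for sort keys %price;

$tree->delete;    # release the tree when finished with it
```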

      Thanks for your valuable info. -Sugar