Jrchak has asked for the wisdom of the Perl Monks concerning the following question:

Ok, well, after polishing my primary regular expression and really getting into the "meat" of my app, I realized I was faced with a real dilemma. Here is a sample of the HTML I am sifting through:
<p><b>Area:</b> <br><i>total:</i> 9,629,091 sq km <br><i>land:</i> 9,158,960 sq km <br><i>water:</i> 470,131 sq km <p><b>Area - comparative:</b> about one-half the size of Russia; about three-tenths the size of Africa; about one-half the size of South America (or slightly larger than Brazil); slightly larger than China; about two and one-half times the size of Western Europe <p><b>Land boundaries:</b> <br><i>total:</i> 12,248 km <br><i>border countries:</i> Canada 8,893 km (including 2,477 km with Alaska), Cuba 29 km (US Naval Base at Guantanamo Bay), Mexico 3,326 km <br><i>note:</i> Guantanamo Naval Base is leased by the US and thus remains part of Cuba <p><b>Coastline:</b> 19,924 km
Here is my regex to extract the data I want and some sample use of the code:
    sub extractData {
        my ($start, $end, $rawData) = @_;
        my $content;
        # Capture the text between $start and $end (first match only).
        if ($rawData =~ /$start\s*(.+?)\s*$end/) {
            $content = $1;
        }
        else {
            print "no location given, jackass.";
            $content = 0;
        }
        return $content;
    }

    $Start = "<i>total:</i>";
    $End = "<br>";
    $Area_total = extractData($Start, $End, $rawHtml);
This works and copies "9,629,091 sq km" to $Area_total. My problem is that after extracting that, I can't seem to get similar data, like the total under Land boundaries. I try the same code and it fails. I also tried to incorporate the <br> or the Land boundaries heading above it, and got the same result. I think that the newline screws it up somehow. Any thoughts?

Replies are listed 'Best First'.
Re: Extracting similar data from html
by ichimunki (Priest) on Jan 24, 2001 at 07:56 UTC
    Won't it be nice when the CIA puts this information into XML documents available to the public? ;)

    If the HTML is all in one scalar, your RE is going to match the first thing it can and then stop. And unless you change your scalar or further differentiate the $start $end tests, the first thing is going to be the same each time you check it. You either need to break the HTML into smaller chunks, or use something like HTML::TokeParser to wade through this.
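    One way to break the HTML into smaller chunks is sketched below: split on <p>, key each chunk by its <b>...</b> heading, and run the label match inside one chunk only. This is a rough sketch, not a tested solution for the real pages. The names $raw_html, %section and extract_field are hypothetical, and the sample is a shortened copy of the HTML in the question.

```perl
use strict;
use warnings;

# Shortened copy of the sample HTML from the question (assumption: the
# real pages follow the same <p><b>Heading:</b> ... <br><i>label:</i> layout).
my $raw_html = join ' ',
    '<p><b>Area:</b> <br><i>total:</i> 9,629,091 sq km',
    '<br><i>land:</i> 9,158,960 sq km',
    '<br><i>water:</i> 470,131 sq km',
    '<p><b>Land boundaries:</b> <br><i>total:</i> 12,248 km',
    '<br><i>note:</i> Guantanamo Naval Base is leased by the US',
    '<p><b>Coastline:</b> 19,924 km';

# Break the page into one chunk per <p>, keyed by its bold heading.
my %section;
for my $chunk (split /<p>/, $raw_html) {
    next unless $chunk =~ m{^\s*<b>(.+?):</b>\s*(.*)}s;
    $section{$1} = $2;
}

# Hypothetical helper: look for <i>label:</i> inside one section only,
# so "total" under Area and "total" under Land boundaries stay distinct.
sub extract_field {
    my ($heading, $label, $sections) = @_;
    my $chunk = $sections->{$heading};
    return unless defined $chunk;
    return $chunk =~ m{<i>\Q$label\E:</i>\s*(.+?)\s*(?:<br>|\z)}s ? $1 : undef;
}

print extract_field('Area', 'total', \%section), "\n";
print extract_field('Land boundaries', 'total', \%section), "\n";
```

    Because each lookup is confined to one heading's chunk, the two totals no longer collide even though both belong to the same general category.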
      What would be the best way to break the code into smaller chunks? The best way that I can think of wouldn't help, because these two examples would still end up in the same chunk: they are both in the same general category, Geography.

      And to tell you the truth, I had the same idea about them publishing to XML the first time I checked out the pages, and I even predicted on the spot that for their next release they will also release it in XML. If they don't, it's just damn dehumifiying. =)
        MeowChow's advice on how to work with the single RE looks good to me, except that if you are going to use it on all the different countries, the implication is that the matches will all be valid and found in the same order for each page (so if a visual survey confirms this will work, by all means use that). This is the problem with RE-based parsing of HTML/XML: it seems like every time you solve one problem, you find at least one more that impacts your last solution. The HTML::TokeParser module is really easy to use and will make this whole job a lot easier.
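        For comparison, here is a minimal sketch of the HTML::TokeParser approach, assuming the module is installed from CPAN (it ships with the HTML-Parser distribution, not with core Perl). The names $raw_html and %data are hypothetical, and the sample is a shortened copy of the HTML in the question. It walks the <b> and <i> tags and files each value under the most recent bold heading.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # CPAN module, part of the HTML-Parser distribution

# Shortened copy of the sample HTML from the question.
my $raw_html = join ' ',
    '<p><b>Area:</b> <br><i>total:</i> 9,629,091 sq km',
    '<br><i>land:</i> 9,158,960 sq km',
    '<br><i>water:</i> 470,131 sq km',
    '<p><b>Land boundaries:</b> <br><i>total:</i> 12,248 km',
    '<br><i>note:</i> Guantanamo Naval Base is leased by the US',
    '<p><b>Coastline:</b> 19,924 km';

# A reference to a scalar is parsed as a literal document.
my $parser = HTML::TokeParser->new(\$raw_html)
    or die "Cannot create parser";

my (%data, $heading);
while (my $tag = $parser->get_tag('b', 'i')) {
    my $name  = $tag->[0];                               # 'b' or 'i'
    my $label = $parser->get_trimmed_text("/$name");     # text inside the tag
    $label =~ s/:\z//;
    my $value = $parser->get_trimmed_text('br', 'p');    # text up to next <br> or <p>

    if ($name eq 'b') {
        # A bold label starts a new section; it may carry its own value.
        $heading = $label;
        $data{$heading}{value} = $value if length $value;
    }
    elsif (defined $heading) {
        # An italic label is a field of the current section.
        $data{$heading}{$label} = $value;
    }
}

print $data{'Area'}{total}, "\n";
print $data{'Land boundaries'}{total}, "\n";
print $data{'Coastline'}{value}, "\n";
```

        Since every field is stored under its heading, the order of the sections on a given country's page stops mattering.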
Re: Extracting similar data from html
by MeowChow (Vicar) on Jan 24, 2001 at 12:07 UTC
    Simply use the m//g operator in a list context, to extract multiple matches, as below:
    my @info = $rawdata =~ /$Start\s*(.+?)\s*$End/g;
    print join ';', @info;
    If you plan on running this particular regex frequently, and performance is a concern, you may also want to consider compiling it with qr//:
    my $regex = qr/$Start\s*(.+?)\s*$End/;
    my @info = $rawdata =~ /$regex/g;
    print join ';', @info;
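    A self-contained version of the above, run against a shortened copy of the sample HTML, might look like the sketch below. The variable $raw_html and the %total hash are hypothetical additions, and \Q...\E is added on the assumption that the delimiters should be matched literally. The second loop is one possible way to remember which heading each total came from, so the results do not depend on match order.

```perl
use strict;
use warnings;

# Shortened copy of the sample HTML from the question.
my $raw_html = join ' ',
    '<p><b>Area:</b> <br><i>total:</i> 9,629,091 sq km',
    '<br><i>land:</i> 9,158,960 sq km',
    '<p><b>Land boundaries:</b> <br><i>total:</i> 12,248 km',
    '<br><i>note:</i> Guantanamo Naval Base is leased by the US',
    '<p><b>Coastline:</b> 19,924 km';

my $Start = '<i>total:</i>';
my $End   = '<br>';

# Compile once; \Q...\E quotes any regex metacharacters in the delimiters.
my $regex = qr/\Q$Start\E\s*(.+?)\s*\Q$End\E/;

# m//g in list context returns every captured value, in document order.
my @info = $raw_html =~ /$regex/g;
print join(';', @info), "\n";    # 9,629,091 sq km;12,248 km

# Variation: capture the nearest preceding <b>heading:</b> as well.
# (?:(?!<b>).)*? refuses to cross into the next bold heading.
my %total;
while ($raw_html =~ m{<b>([^<]+):</b>(?:(?!<b>).)*?<i>total:</i>\s*(.+?)\s*<br>}gs) {
    $total{$1} = $2;
}
for my $heading (sort keys %total) {
    print "$heading: $total{$heading}\n";
}
```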
Re: Extracting similar data from html
by dkubb (Deacon) on Jan 24, 2001 at 14:19 UTC
    In a post here I listed some of the top 5 most-used Perl modules for parsing HTML.

    I think you will find that using CPAN's HTML modules will be your best bet, rather than using regular expressions to parse out the sample HTML you posted.