OK, well, after polishing my primary regular expression and really getting into the "meat" of my app, I realized I was faced with a real dilemma. Here is a sample of the HTML I am sifting through:
<p><b>Area:</b> <br><i>total:</i> 9,629,091 sq km <br><i>land:</i> 9,158,960 sq km <br><i>water:</i> 470,131 sq km <p><b>Area - comparative:</b> about one-half the size of Russia; about three-tenths the size of Africa; about one-half the size of South America (or slightly larger than Brazil); slightly larger than China; about two and one-half times the size of Western Europe <p><b>Land boundaries:</b> <br><i>total:</i> 12,248 km <br><i>border countries:</i> Canada 8,893 km (including 2,477 km with Alaska), Cuba 29 km (US Naval Base at Guantanamo Bay), Mexico 3,326 km <br><i>note:</i> Guantanamo Naval Base is leased by the US and thus remains part of Cuba <p><b>Coastline:</b> 19,924 km
Here is my regex to extract the data I want and some sample use of the code:
sub extractData {
    my ($start, $end, $rawData) = @_;
    my $content;
    # Capture whatever sits between $start and $end, trimming whitespace.
    if ($rawData =~ /$start\s*(.+?)\s*$end/) {
        $content = $1;
    } else {
        print "no location given, jackass.";
        $content = 0;
    }
    return $content;
}

my $Start = "<i>total:</i>";
my $End = "<br>";
my $Area_total = extractData($Start, $End, $rawHtml);
This works and copies "9,629,091 sq km" to $Area_total. My problem is that, having extracted that, I can't seem to get similar data, such as the total under Land boundaries. I try the same code and it fails. I also tried to incorporate the
or the Land boundaries above it and have the result. I think that the newline screws it up somehow. Any thoughts?
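
For what it's worth, here is a minimal sketch of one way this could be approached. It rests on two observations. First, the pattern "<i>total:</i>" occurs more than once in the sample, and a regex match always finds the leftmost occurrence, so the Area total is returned every time. Second, "." does not match a newline unless the /s modifier is used. The helper name extract_in_section, its $section parameter, and the line breaks in the sample data are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original code): return the text
# between $start and $end, searching only after the heading $section.
sub extract_in_section {
    my ($section, $start, $end, $raw) = @_;
    # \Q...\E quotes regex metacharacters in the interpolated strings.
    # /s lets "." match newlines, so the match can span several lines.
    if ($raw =~ /\Q$section\E.*?\Q$start\E\s*(.+?)\s*\Q$end\E/s) {
        return $1;
    }
    return;    # undef in scalar context when nothing matches
}

# Sample data with assumed line breaks, shortened from the HTML above.
my $raw_html = <<'HTML';
<p><b>Area:</b>
<br><i>total:</i>
9,629,091 sq km
<br><i>land:</i>
9,158,960 sq km
<p><b>Land boundaries:</b>
<br><i>total:</i>
12,248 km
<br><i>border countries:</i>
Canada 8,893 km
HTML

my $area_total = extract_in_section('<b>Area:</b>', '<i>total:</i>', '<br>', $raw_html);
my $land_total = extract_in_section('<b>Land boundaries:</b>', '<i>total:</i>', '<br>', $raw_html);

print "Area total: $area_total\n";
print "Land boundaries total: $land_total\n";
```

One caveat with this sketch: if a section has no matching label, the lazy ".*?" will run on into the next section that does, so the result should be sanity-checked. For anything beyond a quick scrape, a real HTML parser is likely to be more robust than a regex.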

In reply to Extracting similar data from html by Jrchak
