in reply to Extracting similar data from html

Won't it be nice when the CIA puts this information into XML documents available to the public? ;)

If the HTML is all in one scalar, your RE is going to match the first thing it can and then stop. And unless you change your scalar or further differentiate the $start and $end tests, the first thing is going to be the same each time you check it. You either need to break the HTML into smaller chunks, or use something like HTML::TokeParser to wade through this.
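
To make the "same first thing every time" behaviour concrete, here is a minimal sketch. The sample markup and the values of $start and $end are made up for illustration; the real pages will differ. It also shows one further option not mentioned above, the /g modifier in scalar context, which makes each match resume where the previous one ended.

```perl
use strict;
use warnings;

# Hypothetical sample; the real page markup will differ.
my $html  = '<b>Location:</b> Southern Asia <b>Area:</b> 647,500 sq km';
my $start = '<b>';
my $end   = '</b>';

# Without /g, every attempt starts at the beginning of the scalar,
# so the same first label comes back each time.
my @first;
for (1 .. 2) {
    push @first, $1 if $html =~ /\Q$start\E(.*?)\Q$end\E/;
}

# With /g in scalar context, each match resumes where the last one ended.
my @all;
while ($html =~ /\Q$start\E(.*?)\Q$end\E/g) {
    push @all, $1;
}

print "no /g: @first\n";
print "/g:    @all\n";
```

This only helps if the $start and $end markers are reliable across the whole page, which is exactly the part that tends to go wrong with real HTML.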

Re: Re: Extracting similar data from html
by Jrchak (Initiate) on Jan 24, 2001 at 08:45 UTC
    What would be the best way to break the code into smaller chunks? The best way I can think of wouldn't help, because these two examples would still end up in the same chunk: they are both in the same general category, Geography.

    And to tell you the truth, I had the same idea about them publishing to XML the first time I checked out the pages, and I even predicted on the spot that their next release will also come out in XML. If they don't, it's just damn dehumanizing. =)
      MeowChow's advice on how to work with the single RE looks good to me, but using it on all the different countries assumes that every match will be valid and will appear in the same order on each page (if a visual survey confirms that, by all means use it). This is the problem with RE-based parsing of HTML/XML: it seems like every time you solve one problem, you find at least one more that breaks your last solution. The HTML::TokeParser module is really easy to use and will make this whole job a lot easier.
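
For what it's worth, a rough sketch of the HTML::TokeParser approach might look like the following. The HTML fragment is invented for illustration (bold labels followed by text inside a paragraph); the actual pages will need their own tag choices. HTML::TokeParser is not a core module, so this assumes it has been installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical fragment standing in for one section of a country page.
my $html = <<'HTML';
<p><b>Location:</b> Southern Asia, north and west of Pakistan</p>
<p><b>Area:</b> 647,500 sq km</p>
HTML

# Passing a scalar reference parses the string directly.
my $p = HTML::TokeParser->new(\$html)
    or die "Could not create parser";

my %field;
while ($p->get_tag('b')) {
    my $label = $p->get_trimmed_text('/b');   # text inside <b>...</b>
    $p->get_tag('/b');                        # step past the closing tag
    my $value = $p->get_trimmed_text('/p');   # text up to the end of the paragraph
    $label =~ s/:\z//;                        # drop the trailing colon
    $field{$label} = $value;
}

print "$_ => $field{$_}\n" for sort keys %field;
```

Because the parser walks the tags in order, it does not matter whether every page lists the fields in the same sequence, which is the weak spot of the single-RE approach.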