in reply to Re^3: Perl Possibilities
in thread Perl Possibilities
and given the variety of distinct sources (which presumably use distinct HTML/CSS formats and styles), I'd expect a variety of structural differences in the tags that appear in and around the patterns of interest, e.g. "... Board recommends a vote FOR Proposal No. 2. ..."
BTW, on the matter of "html" vs. "txt", it doesn't matter what a given file name looks like - what matters is what the content looks like. If the content has HTML tags, it's HTML data, and needs to be treated as such, regardless of what the file name might be.
If texts of this sort typically include a single table near the top of the document that lists the proposals with number, name, and result, your best bet may be Corion's idea about HTML::TableExtract. It's then just a matter of knowing which table in the overall file is the one you want.
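Here is a minimal sketch of that table approach, assuming the CPAN module HTML::TableExtract is installed. The sample HTML and the column headings "Number", "Name" and "Result" are made up for illustration; your real files will almost certainly word them differently. Passing `headers` is one way to pick out the right table without knowing its position in the file.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical input: a layout table followed by the proposals table.
my $html = <<'HTML';
<html><body>
<table><tr><td>Site navigation</td><td>Login</td></tr></table>
<table>
  <tr><th>Number</th><th>Name</th><th>Result</th></tr>
  <tr><td>1</td><td>Election of Directors</td><td>Passed</td></tr>
  <tr><td>2</td><td>Ratification of Auditors</td><td>Passed</td></tr>
</table>
</body></html>
HTML

# Only tables whose header row contains all three (assumed) headings match.
my $te = HTML::TableExtract->new( headers => [ 'Number', 'Name', 'Result' ] );
$te->parse($html);

my @rows;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # Empty cells come back as undef; normalise and trim whitespace.
        my @cells = map { defined $_ ? $_ : '' } @$row;
        s/^\s+|\s+$//g for @cells;
        push @rows, \@cells;
    }
}

print join( ' | ', @$_ ), "\n" for @rows;
```

If the headings vary too much from source to source, the module also accepts `depth` and `count` options to select a table by its position instead.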
Aside from that, any other practical approach will involve parsing the HTML first to get its plain-text content before you do anything that involves string comparisons or regex matches.
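To illustrate the "parse first, match afterwards" order, here is a rough sketch that assumes the CPAN module HTML::Parser is installed. The two sample snippets and the regex are my own inventions, meant to mimic two sources that wrap the same sentence in different tags; adjust the pattern to whatever your real texts say.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input: the same kind of sentence, marked up two different ways.
my $html = <<'HTML';
<p>Board recommends a vote <b>FOR</b> Proposal No. 2.</p>
<div><font size="2">Board recommends a vote</font> <i>AGAINST</i>
 Proposal No. 3.</div>
HTML

# Collect only the text content, with entities decoded ('dtext').
my $text   = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= shift }, 'dtext' ],
);
$parser->ignore_elements(qw(script style));    # skip non-visible content
$parser->parse($html);
$parser->eof;

# Collapse whitespace so line breaks inside a sentence don't matter.
$text =~ s/\s+/ /g;

# Now it is safe to run regex matches against the plain text.
my @votes;
while ( $text =~ /recommends a vote (FOR|AGAINST) Proposal No\. (\d+)/g ) {
    push @votes, [ $1, $2 ];
}

print "Proposal $_->[1]: $_->[0]\n" for @votes;
```

The same regex run against the raw HTML would miss both sentences, because the tags sit in the middle of the phrase being matched.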