HTML::Parser, actually.

Be aware, though, that this module may be a little difficult to wrap your brain around if you're new to Perl. If you're not comfortable with sub-classing and basic OO, it may be a little overwhelming. That said, there are some examples, and if you dig in the docs, you'll find something you can carve up for your purposes. However, it's not something you'll be able to do in 20 minutes...
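To give a feel for what the sub-classing involves, here is a minimal sketch, assuming HTML::Parser is installed from CPAN. The package name LinkGrabber, the helper grab_links, and the sample HTML are made up for illustration. It relies on the module calling the start, text, and end methods of a subclass when no handlers are passed to new().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical subclass: remembers the HREF and link text of every <a> tag.
package LinkGrabber;
use base 'HTML::Parser';

# Called for each opening tag; $tag is lowercased, $attr is a hash ref
# whose keys are lowercased attribute names.
sub start {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'a' && defined $attr->{href};
    push @{ $self->{links} }, { href => $attr->{href}, text => '' };
    $self->{in_link} = 1;
}

# Called for runs of text; it may arrive in several chunks, so append.
sub text {
    my ($self, $text) = @_;
    $self->{links}[-1]{text} .= $text if $self->{in_link};
}

# Called for each closing tag.
sub end {
    my ($self, $tag) = @_;
    $self->{in_link} = 0 if $tag eq 'a';
}

package main;

# Hypothetical helper: parse a string and return the collected links.
sub grab_links {
    my ($html) = @_;
    my $p = LinkGrabber->new;
    $p->{links} = [];
    $p->parse($html);
    $p->eof;
    return @{ $p->{links} };
}

print "$_->{href} => $_->{text}\n"
    for grab_links('<TR><TD><A HREF="a.html">one</A></TD></TR>');
```

The parser object is a plain hash reference, so the subclass can stash its own state in it, as long as the keys don't collide with the module's internal ones.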

I can't tell exactly what you're trying to do from what you've provided, but there is a module called HTML::TableParser. Since you're using <TR>/</TR> tags, you're dealing with table rows. HTML::TableParser is useful for yanking the data out of tables. The problem is that if you need the HREF attribute of any links, HTML::TableParser won't give it to you. In the luke_repwalker.pl script, I had a similar problem. At the bottom of that code is a package you may be able to extract and, with a little tinkering, use to pull out both the table text and the HREF links.
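This is not the package from luke_repwalker.pl, just a rough sketch of the same idea: pulling both the cell text and the HREF values out of table rows with HTML::Parser's handler interface. The helper name table_rows and the layout of the returned hashes are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns one hash per <tr>, holding the text of
# each cell and any HREF values found inside that row.
sub table_rows {
    my ($html) = @_;
    my (@rows, $row);
    my $in_cell = 0;

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'tr') {
                # A new row starts, even if the previous </tr> was omitted.
                $row = { cells => [], links => [] };
                push @rows, $row;
                $in_cell = 0;
            }
            elsif ($row && ($tag eq 'td' || $tag eq 'th')) {
                push @{ $row->{cells} }, '';
                $in_cell = 1;
            }
            elsif ($row && $tag eq 'a' && defined $attr->{href}) {
                push @{ $row->{links} }, $attr->{href};
            }
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tag) = @_;
            $in_cell = 0 if $tag eq 'td' || $tag eq 'th';
            if ($tag eq 'tr') {
                $row     = undef;
                $in_cell = 0;
            }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            # Text can arrive in pieces, so append to the current cell.
            $row->{cells}[-1] .= $text if $in_cell;
        }, 'dtext' ],
    );

    $p->parse($html);
    $p->eof;
    return @rows;
}

for my $r (table_rows('<table><tr><td>Name<td><a href="n.html">Link</a></tr></table>')) {
    print "cells: @{ $r->{cells} }\n";
    print "links: @{ $r->{links} }\n";
}
```

Because a new <tr> or <td> simply starts a new row or cell, the sketch does not depend on the end tags being present.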

Unless I'm making it more complicated than what you're trying to do, this may be of help. If you need some additional assistance with getting that working, drop me a /msg or an e-mail and we'll see what we can get going.

Using regexps to extract HTML *can* work, but it's not the best idea. Certain tags aren't balanced pairs, which can really mess you up. Also, people often write the starting tags but leave out the ending tags. Most browsers, being the accommodating beasts they are, don't care about the missing end tags. This is particularly true of table rows and data cells. As such, unless you can be assured that the HTML strictly follows the spec, with every tag properly closed, using regexps is risky business.
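To make the risk concrete, here is a small sketch using only core Perl. The helper name rows_by_regex and both sample strings are invented for the example. The second string leaves out a </tr> and the </td> tags, which a browser would still show as two rows.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: a non-greedy match on <tr>...</tr> pairs.
sub rows_by_regex {
    my ($html) = @_;
    return $html =~ m/<tr>(.*?)<\/tr>/igs;
}

my $strict = '<tr><td>one</td></tr><tr><td>two</td></tr>';
my $sloppy = '<tr><td>one<tr><td>two</tr>';

my @good = rows_by_regex($strict);
my @bad  = rows_by_regex($sloppy);

print "strict: ", scalar @good, " rows\n";    # 2
print "sloppy: ", scalar @bad,  " row(s)\n";  # 1, both rows merged into one
```

With the sloppy input, the single match is `<td>one<tr><td>two`, so the two rows have been silently glued together.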

This code was something I came up with, based on a /msg from Sharky_The_Dog. I realize that it could be collapsed into one statement, but that wasn't the point (and, dang it, tilly, I know $filename and $match could be 'use vars'!). It's also based on the fact that Sharky says his HTML is machine generated, and legal.
#!/usr/local/bin/perl -w

use strict;

my $filename = 'filename with HTML';
my $match    = 'your criteria here';

{
    my $data;

    #
    #  Use braces to localize the $/ assignment, so we don't get bitten later.
    #
    {
        local $/ = undef;
        open (FH, "<$filename") || die "Can't open $filename: $!";
        $data = <FH>;
        close FH;
    }

    #
    #  @list will contain the contents of all the <tr>/</tr> pairs
    #
    my @list = $data =~ m/<tr>(.*?)<\/tr>/igs;

    #
    #  @newlist will contain all the <tr>/</tr> pairs that match our search criteria
    #
    my @newlist = grep { /$match/i } @list;

    #
    #  Display the number of <tr>/</tr> pairs total, and the number that matched the search criteria
    #
    print "Items total : ", scalar @list, "\n";
    print "Items found : ", scalar @newlist, "\n";
}
--Chris

e-mail jcwren

In reply to (jcwren) RE: Re: File/String search... by jcwren
in thread File/String search... by Sharky_The_Dog
