Now you simply have to extract the data fields.

There are several ways to do it.

As we've gotten rid of a lot of crap by using only one line of the original HTML, you could use a regular expression here, but in general that is not a good idea for decomposing HTML.

Personally I like HTML::TreeBuilder::XPath, which you would have to install from CPAN.

Here is how you would then extract the name from one of the files with it:

    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;

    my $tree = HTML::TreeBuilder::XPath->new;

    # use real file name here
    open(my $fh, "<", "file.html") or die $!;
    $tree->parse_file($fh);

    # findnodes returns a list; take the first matching node
    my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
    print $name->as_text;

As you can see, you simply use an XPath expression to identify the node you want.

So how do you determine that expression?

I use a Firefox plugin called XPather, which lets you simply click on an HTML element and extract the corresponding XPath.

So you load the file you want to parse in Firefox, click on the stuff you want, get the XPath and use that in the Perl script.
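
Since you have a few thousand files, here is a rough sketch of how the same lookup could be wrapped in a sub and run over all of them. The helper name extract_field and the pages/*.html location are only assumptions for illustration, and the XPath is the one from the example above, so adjust all three to your data.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: returns the text of the first node matching $xpath
# in $file, or undef if nothing matches.
sub extract_field {
    my ($file, $xpath) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file($file) or die "Cannot parse $file: $!";

    my ($node) = $tree->findnodes($xpath);
    my $text = defined $node ? $node->as_text : undef;

    $tree->delete;    # free the parse tree before moving to the next file
    return $text;
}

# Assumed layout: all saved pages sit in ./pages/ with an .html extension.
for my $file (sort glob "pages/*.html") {
    my $name = extract_field($file, '/html/body/table/tr[1]/td[2]');
    print "$file: ", (defined $name ? $name : '(no match)'), "\n";
}
```

Returning undef for a missing node instead of dying means one odd file will not stop the whole run; you just get a "(no match)" line to look at afterwards.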

Hope that gets you started...


In reply to Re^3: HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files.. by morgon
in thread HTML-Parser: a newbie question: need to extract exactly line 999 out of 5000 files.. by Perlbeginner1
