Here's my go:


#!/usr/bin/perl 

use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html = q{
<a name="a"></a>
<h2>A</h2>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
  <tr>
    <td>
      <table width="100%" cellpadding="5" cellspacing="0" border="1">
      <tr>
        <td width="33%" valign="top" class="clsTableBody">
          <a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">
            Abbott, Evelyn
          </a><br />
          <span>110136892</span><br />
          <a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">
            Agnew, Thomas
          </a><br />
          <span>110377660</span><br />
         </td>
        <td width="34%" valign="top" class="clsTableBodyClear">
          <a href="pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf" target="_blank">
            Allison, David
          </a><br />
          <span>108116112</span><br />
          <a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">
            Allison, Gary Owen
          </a><br />
          <span>116815754</span><br />
        </td>
        <td width="33%" valign="top" class="clsTableBody">
          <a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">
            Arsenault, Michael
          </a><br />
          <span>108318866</span><br />
          <a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">
            Arsenault, Normand A.
          </a><br />
          <span>113069066</span><br />
        </td>
      </tr>
      </table>
    </td>
  </tr>
</table>
};

my $p = HTML::TokeParser::Simple->new(\$html);

# parse until second table
my $table_count = 2;
while (my $t = $p->get_tag('table')){
  last unless --$table_count;
}
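# map each link's href to the number in the <span> that follows it
my (%href, $this_href, $number);
while (my $t = $p->get_token){
  if ($t->is_start_tag('a')){
    # remember the most recent link target
    $this_href = $t->get_attr('href');
    next;
  }
  if ($t->is_start_tag('span')){
    # the span text is the number belonging to that link
    $number = $p->get_trimmed_text('/span');
    $href{$this_href} = $number;
    next;
  }
  # stop at the end of the inner table
  last if $t->is_end_tag('table');
}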

for my $key (keys %href){
  print "$key -> $href{$key}\n";
}

output:

---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754

> Terminated with exit code 0.
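
The pairs above come out in hash order, which Perl does not guarantee, so they do not follow the order of the links in the page. If document order matters, one option is to push each pair onto an array instead of storing it in a hash. This is only a sketch: the `@pairs` name and the shortened sample markup (with made-up file names and numbers) are my own, and it assumes every number sits in a span right after its link, as in the HTML above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Shortened, hypothetical sample in the same shape as the markup above.
my $html = q{
<table>
  <tr>
    <td>
      <a href="pdf\first.pdf" target="_blank">First, Person</a><br />
      <span>111</span><br />
      <a href="pdf\second.pdf" target="_blank">Second, Person</a><br />
      <span>222</span><br />
    </td>
  </tr>
</table>
};

my $p = HTML::TokeParser::Simple->new(\$html);

# collect [href, number] pairs in the order they appear
my (@pairs, $this_href);
while (my $t = $p->get_token) {
  if ($t->is_start_tag('a')) {
    $this_href = $t->get_attr('href');
    next;
  }
  if ($t->is_start_tag('span')) {
    # ignore a span that has no link in front of it
    next unless defined $this_href;
    push @pairs, [ $this_href, $p->get_trimmed_text('/span') ];
    undef $this_href;
    next;
  }
  last if $t->is_end_tag('table');
}

print "$_->[0] -> $_->[1]\n" for @pairs;
```

An array of pairs also keeps duplicate hrefs, which a hash would silently overwrite.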

In reply to Re: Process a HTML file to get information from it. by wfsp
in thread Process a HTML file to get information from it. by Griffler
