
Assuming each letter is in an H2 tag (and that these are the only H2 tags) and that every section has the same structure, this should do the trick.

We collect the data into a HoH (%href), keyed first by letter and then by link, with the number as the value.

Hope this helps.

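use HTML::TokeParser::Simple;

# $html is assumed to already hold the page source as a string
my $p = HTML::TokeParser::Simple->new(\$html);
my (%href, $this_href, $number, $letter);

while (my $t = $p->get_token) {

  # each <h2> starts a new letter section
  if ($t->is_start_tag('h2')) {
    $letter = $p->get_trimmed_text('/h2');
    next;
  }

  if ($t->is_start_tag('a')) {
    # skip bookmarks (<a name="...">)
    next if $t->get_attr('name');
    $this_href = $t->get_attr('href');
    next;
  }

  # the text of each <span> is stored as the number for the most recent link
  if ($t->is_start_tag('span')) {
    $number = $p->get_trimmed_text('/span');
    # ignore any span seen before the first heading or link
    next unless defined $letter and defined $this_href;
    $href{$letter}{$this_href} = $number;
    next;
  }

}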
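
The loop only fills %href, so here is a minimal sketch of how a HoH of that shape could be printed in the letter / link -> number layout. The sample data and the helper name format_hoh are hypothetical, not part of the script above, and the keys are sorted here only to make the order predictable.

```perl
use strict;
use warnings;

# Hypothetical sample data in the same shape as %href:
# letter => { link => number }
my %href = (
    A => { 'pdf\first.pdf' => 101, 'pdf\second.pdf' => 102 },
    B => { 'pdf\third.pdf' => 103 },
);

# Build the report as a string: one line per letter,
# followed by that letter's links, indented.
sub format_hoh {
    my ($hoh) = @_;
    my $out = '';
    for my $letter (sort keys %$hoh) {
        $out .= "$letter\n";
        for my $link (sort keys %{ $hoh->{$letter} }) {
            $out .= "    $link -> $hoh->{$letter}{$link}\n";
        }
    }
    return $out;
}

print format_hoh(\%href);
```

Passing a reference (\%href) keeps the helper reusable for any hash of hashes; drop the inner sort if the order of the links within a letter does not matter.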

output

---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
A
    pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
    pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
    pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
    pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
    pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
    pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
B
    pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
    pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
    pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
    pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
    pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
    pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
C
    pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
    pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
    pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
    pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
    pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
    pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754

> Terminated with exit code 0..

In reply to Re^3: Process a HTML file to get information from it. by wfsp
in thread Process a HTML file to get information from it. by Griffler
