Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract some numbers from an HTML file. There are five similar lines in the file that look like this:
<a href="/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=8">
I need to get the week, mid1, and mid2 values, so for the line above I need to put 8, 2, and 8 into three separate variables. How can I search for these lines and store the three values?

Broken HTML (was Re: Extract numbers.....)
by merlyn (Sage) on Nov 04, 2001 at 02:14 UTC
    Meta discussion:
    <a href="/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=8">
    is actually broken HTML. It needs to be entitized as:
    <a href="/f1/show?page=matchup&amp;lid=206&amp;week=8&amp;mid1=2&amp;mid2=8">
    Sure, some browsers error-correct that broken HTML and try to guess what the original codes meant, but please don't write code that generates garbage like the first example. See the W3C HTML spec for confirmation and details.

    -- Randal L. Schwartz, Perl hacker
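For anyone generating such links from Perl, here is a minimal sketch of the entitizing step described above. The helper name `escape_attr` is hypothetical (it is not from this thread or from any module), and the sketch assumes the attribute value is written inside double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): escape the characters that are
# special inside a double-quoted HTML attribute value.
sub escape_attr {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must run first, or later entities get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

my $url = '/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=8';
print '<a href="', escape_attr($url), '">', "\n";
```

If HTML::Entities is installed, its encode_entities function is the more usual way to do this; the hand-rolled version above is only meant to show the idea without extra dependencies.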

      And for the benefit of other readers, you can avert this issue entirely by avoiding & in your URL strings. If you're using a proper, robust CGI parameter parser (e.g. CGI.pm), this can be re-written as:
      <a href="/f1/show?page=matchup;lid=206;week=8;mid1=2;mid2=8">
      Much cleaner!
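To illustrate what the receiving side has to do, here is a rough sketch of a parser that accepts either separator. CGI.pm handles this for you; the `parse_query` helper below is hypothetical and exists only to show the idea. It assumes simple name=value pairs and does no URL-decoding.

```perl
use strict;
use warnings;

# Hypothetical helper: split a query string on either '&' or ';'.
# Assumes plain name=value pairs; no %XX or '+' decoding is attempted.
sub parse_query {
    my ($query) = @_;
    my %param;
    for my $pair (split /[&;]/, $query) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;   # skip empty pairs
        $param{$name} = defined $value ? $value : '';
    }
    return %param;
}

my %p = parse_query('page=matchup;lid=206;week=8;mid1=2;mid2=8');
print "week[$p{week}] mid1[$p{mid1}] mid2[$p{mid2}]\n";
```

The same call works unchanged on the original '&'-separated form, which is why switching separators is safe with a parser of this kind.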
Re: Extract numbers.....
by Chady (Priest) on Nov 04, 2001 at 00:13 UTC

    If the lines are strictly identical, this will, stupidly, work:

    while (my $line = <DATA>) {
        # only look at lines that start with an anchor tag
        next unless ($line =~ /^<a href=/);
        my @vars = split('&', $line);
        # peel the name=value pairs off the end: mid2, then mid1, then week
        my ($mid2) = (split('=', pop(@vars)))[1] =~ /^(\d+)/;
        my $mid1 = (split('=', pop(@vars)))[1];
        my $week = (split('=', pop(@vars)))[1];
        # do something with the vars...
    }
    __DATA__
    blah blah
    <a href="/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=8">
    more lines

    Of course, this is far from perfect, and I'm creating a throwaway array @vars...

    Update: you could also just use something from CPAN, like the Link Extractor, to pull out the links.


    He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

    Chady | http://chady.net/
      I'm a big fan of HTML::TokeParser for this sort of data extraction work ...

      #!/usr/bin/perl -w
      use HTML::TokeParser;
      use strict;

      # qq{} so that \n really is a newline (it is literal inside single quotes)
      my $html = qq{<a href="/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=8">\n};
      my @results;

      my $parser = HTML::TokeParser->new(\$html) || die $!;
      while (my $token = $parser->get_token) {
          my $type = shift @{ $token };
          if ($type eq "S") {    # start tag
              my ($tag, $attr, $attrseq, $text) = @{ $token };
              # the defined() check avoids a warning for anchors without an href
              if (($tag eq "a") && defined $attr->{'href'}
                  && ($attr->{'href'} =~ /^\/f1\/show\?/i)) {
                  my $a_href = $attr->{'href'};
                  $a_href =~ s/.+\?//g;    # keep only the query string
                  my %vars = map {
                      ((split(/=/, $_))[0]) => ((split(/=/, $_))[1])
                  } split(/&/, $a_href);
                  push (@results, \%vars);
              };
          };
      };

      foreach my $vars (@results) {
          print $_, " ", ${$vars}{$_}, "\n" foreach keys %{$vars};
      };
      exit 0;

      This code also checks that the href of the anchor matches what we want (e.g. /f1/show) and allows for multiple matches in the HTML. If you need to extract this data from live web pages, the assignment of $html can easily be replaced with something similar to this:

      use LWP::Simple;
      my $html = get('http://url.to.webpage.com/');
      # get() returns undef on failure and does not set $!
      die "LWP::Simple failed to retrieve source HTML" unless defined $html;

      There has also been a tutorial written on this module here by crazyinsomniac which includes an excellent step-by-step example for building a program based on HTML::TokeParser.

       

      Ooohhh, Rob no beer function well without!

Re: Extract numbers.....
by dvergin (Monsignor) on Nov 04, 2001 at 00:35 UTC
    There's a lot you don't say about how much variation might occur in your data. But here's some code:
    #!/usr/bin/perl -w
    use strict;

    # Dummy up some data
    my $html_page = <<END_HTML;
    <html>
    <head><title>My test page</title></head>
    <body>
    Stuff stuff stuff
    <a href="/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=4">
    Grut garble glump
    <a href="/f1/show?page=matchup&lid=206&week=13&mid1=11&mid2=8">
    Anderanda manda ander
    <a href="/f1/show?page=matchup&lid=206&week=4&mid1=7&mid2=7">
    Bottom
    </body>
    </html>
    END_HTML

    # if you KNOW the bits will always be in the order above...
    while ( $html_page =~ /&week=(\d+)&mid1=(\d+)&mid2=(\d+)/g ) {
        my ($week, $mid1, $mid2) = ($1, $2, $3);
        print "week[$week] mid1[$mid1] mid2[$mid2]\n";
        # do stuff
    }

    # if the bits can occur in any order...
    while ( $html_page =~ /(<a[^>]+>)/g ) {
        my $anchor_txt = $1;
        my ($week, $mid1, $mid2);
        my $found_all = 1;
        unless ( ($week) = $anchor_txt =~ /week=(\d+)/ ) {$found_all = 0}
        unless ( ($mid1) = $anchor_txt =~ /mid1=(\d+)/ ) {$found_all = 0}
        unless ( ($mid2) = $anchor_txt =~ /mid2=(\d+)/ ) {$found_all = 0}
        if ($found_all) {
            print "week[$week] mid1[$mid1] mid2[$mid2]\n";
            # do stuff
        } else {
            print "Oops! Missing bit or extraneous tag.\n";
        }
    }
    The link extractor   $html_page =~ /(<a[^>]+>)/g   is not safe for all purposes. But it is a bit more robust than Chady's quick-and-dirty grabber, which assumes that the complete text of only one tag will appear on any one line. From the looks of it, it's likely that the one-line proviso is fine for your data.

    If the solutions proposed so far are not flexible enough to handle your data, give another holler...

    Hope this helps. David