Extract numbers.....

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Broken HTML (was Re: Extract numbers.....) by merlyn (Sage) on Nov 04, 2001 at 02:14 UTC
Meta discussion: `<a href="/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=8">` is actually broken HTML. It needs to be entitized as: `<a href="/f1/show?page=matchup&amp;lid=206&amp;week=8&amp;mid1=2&amp;mid2=8">` Sure, some browsers error-correct that broken HTML and try to interpret what the original code meant, but please don't write code that generates garbage like the first example. See the W3 HTML spec for confirmation and details. -- Randal L. Schwartz, Perl hacker
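As a minimal sketch of the entitizing described above, the helper below escapes a URL before it is written into an `href` attribute. The name `escape_attr` is hypothetical and not from the thread; it uses only core Perl so it runs anywhere. If HTML::Entities is installed, its `encode_entities` function is the more usual choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): escape the characters that are
# special inside a double-quoted HTML attribute value.
sub escape_attr {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# The raw URL from the original question, written out as a valid anchor tag.
my $url = '/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=8';
print '<a href="', escape_attr($url), '">', "\n";
```

The escaping is applied once, at the point where the URL is emitted into HTML; the URL itself stays unescaped everywhere else in the program.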
Re: Broken HTML (was Re: Extract numbers.....) by Fastolfe (Vicar) on Nov 04, 2001 at 03:24 UTC
And for the benefit of other readers, you can avert this issue entirely by avoiding `&` in your URL strings. If you're using a proper, robust CGI parameter parser (e.g. CGI.pm), this can be re-written as: `<a href="/f1/show?page=matchup;lid=206;week=8;mid1=2;mid2=8">` Much cleaner!
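To illustrate why the semicolon form costs nothing on the receiving end, here is a rough sketch of a parser that accepts either separator. `parse_query` is a hypothetical name, not CGI.pm's API, and the sketch deliberately skips URL-decoding and repeated keys, which a real parameter parser handles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not CGI.pm): split the query part of a URL on
# either '&' or ';' and return a hash reference of key => value.
# Assumes no URL-encoded characters and no repeated keys.
sub parse_query {
    my ($url) = @_;
    my ($query) = $url =~ /\?(.*)\z/s;
    return {} unless defined $query;

    my %param;
    for my $pair (split /[&;]/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && length $key;
        $param{$key} = defined $value ? $value : '';
    }
    return \%param;
}

# Both spellings of the link yield the same parameters.
for my $url ('/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=8',
             '/f1/show?page=matchup;lid=206;week=8;mid1=2;mid2=8') {
    my $p = parse_query($url);
    print "week[$p->{week}] mid1[$p->{mid1}] mid2[$p->{mid2}]\n";
}
```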
Re: Extract numbers..... by Chady (Priest) on Nov 04, 2001 at 00:13 UTC
If they are strictly identical... this will, stupidly, work:

    while (my $line = <DATA>) {
        next unless ($line =~ /^<a href=/);
        my @vars = split('&', $line);
        my ($mid2) = (split('=', pop(@vars)))[1] =~ /^(\d+)/;
        my $mid1 = (split('=', pop(@vars)))[1];
        my $week = (split('=', pop(@vars)))[1];
        # do something with the vars...
    }
    __DATA__
    blah blah
    <a href="/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=8">
    more lines

Of course, this is far from perfect, and I'm creating a useless array @vars...

Update: you could just use something from CPAN, like the Link Extractor, to do the links.

He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life. Chady | http://chady.net/
Re: Re: Extract numbers..... by rob_au (Abbot) on Nov 04, 2001 at 07:50 UTC
I'm a big fan of HTML::TokeParser for this sort of data extraction work ...

    #!/usr/bin/perl -w
    use HTML::TokeParser;
    use strict;

    my $html = '<a href="/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=8">\n';
    my @results;

    my $parser = HTML::TokeParser->new(\$html) || die $!;
    while (my $token = $parser->get_token) {
        my $type = shift @{ $token };
        if ($type eq "S") {
            my ($tag, $attr, $attrseq, $text) = @{ $token };
            if (($tag eq "a") && ($attr->{'href'} =~ /^\/f1\/show\?/i)) {
                my $a_href = $attr->{'href'};
                $a_href =~ s/.+\?//g;
                my %vars = map { ((split(/=/, $_))[0]) => ((split(/=/, $_))[1]) } split(/&/, $a_href);
                push (@results, \%vars);
            };
        };
    };

    foreach my $vars (@results) {
        print $_, " ", ${$vars}{$_}, "\n" foreach keys %{$vars};
    };

    exit 0;

This code also checks that the href of the anchor matches what we want (e.g. `/f1/show`) and allows for multiple matches in the HTML.

Also, if you need to extract this data from live web pages, the assignment of `$html` can easily be replaced with something similar to this:

    use LWP::Simple;
    my $html = get('http://url.to.webpage.com/');
    die "LWP::Simple failed to retrieve source HTML - $!" unless ($html);

crazyinsomniac has also written a tutorial on this module, which includes an excellent step-by-step example of building a program based on HTML::TokeParser.

Ooohhh, Rob no beer function well without!
Re: Extract numbers..... by dvergin (Monsignor) on Nov 04, 2001 at 00:35 UTC
There's a lot you don't say about how much variation might occur in your data. But here's some code:

    #!/usr/bin/perl -w
    use strict;

    # Dummy up some data
    my $html_page = <<END_HTML;
    <html>
    <head><title>My test page</title></head>
    <body>
    Stuff stuff stuff
    <a href="/f1/show?page=matchup&lid=206&week=8&mid1=2&mid2=4">
    Grut garble glump
    <a href="/f1/show?page=matchup&lid=206&week=13&mid1=11&mid2=8">
    Anderanda manda ander
    <a href="/f1/show?page=matchup&lid=206&week=4&mid1=7&mid2=7">
    Bottom
    </body>
    </html>
    END_HTML

    # if you KNOW the bits will always be in the order above...
    while ( $html_page =~ /&week=(\d+)&mid1=(\d+)&mid2=(\d+)/g ) {
        my ($week, $mid1, $mid2) = ($1, $2, $3);
        print "week[$week] mid1[$mid1] mid2[$mid2]\n";
        # do stuff
    }

    # if the bits can occur in any order...
    while ( $html_page =~ /(<a[^>]+>)/g ) {
        my $anchor_txt = $1;
        my ($week, $mid1, $mid2);
        my $found_all = 1;
        unless ( ($week) = $anchor_txt =~ /week=(\d+)/ ) {$found_all = 0}
        unless ( ($mid1) = $anchor_txt =~ /mid1=(\d+)/ ) {$found_all = 0}
        unless ( ($mid2) = $anchor_txt =~ /mid2=(\d+)/ ) {$found_all = 0}
        if ($found_all) {
            print "week[$week] mid1[$mid1] mid2[$mid2]\n";
            # do stuff
        } else {
            print "Oops! Missing bit or extraneous tag.\n";
        }
    }

The link extractor `$html_page =~ /(<a[^>]+>)/g` is not safe for all purposes. But it is a bit more robust than Chady's quick-and-dirty grabber, which assumes that the complete text of only one tag will appear on any one line. From the looks of it, it's likely that the one-line proviso is fine for your data.

If the solutions proposed so far are not flexible enough to handle your data, give another holler...

Hope this helps.

David
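One possible variation on the "any order" loop above: a single global match in list context can fill a hash with every numeric `key=value` pair in a tag, after which the required keys are checked in one pass. This is only a sketch; `numeric_params` and the sample tags are made up for illustration, and the pattern assumes the values are plain digits. Because it accepts `?`, `&` or `;` before a key, it also copes with links written with `&amp;` or with semicolon separators.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect every numeric key=value pair found in one
# anchor tag. A /g match in list context returns the captures as a flat
# (key, value, key, value, ...) list, which initialises the hash directly.
sub numeric_params {
    my ($anchor) = @_;
    my %param = $anchor =~ /[?&;](\w+)=(\d+)/g;
    return \%param;
}

# Made-up sample data: parameters in different orders, one tag incomplete.
my $html = join "\n",
    '<a href="/f1/show?page=matchup&lid=206&week=13&mid1=11&mid2=8">',
    '<a href="/f1/show?mid2=7&amp;mid1=7&amp;week=4">',
    '<a href="/f1/show?page=matchup&week=2">';

while ($html =~ /(<a\b[^>]+>)/g) {
    my $p = numeric_params($1);
    my @missing = grep { !exists $p->{$_} } qw(week mid1 mid2);
    if (@missing) {
        print "Skipping tag, missing: @missing\n";
        next;
    }
    print "week[$p->{week}] mid1[$p->{mid1}] mid2[$p->{mid2}]\n";
}
```

Non-numeric parameters such as `page=matchup` are simply not captured, so they never appear in the hash.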