Still empty...here is a live and working example

I added your changes (thanks for the tips...helps a lot) but I still get nothing from my print statements. Here is a working example of the page and text I am really trying to get:

#!perl
use LWP::Simple;
use strict;
use warnings;

#get the values out
my %categories = ();
my $page = get 'http://www.handango.com/PlatformSoftware.jsp?platformId=1&siteId=1&zsortParams=true';
while ($page=~m/class="smallprint">([^"]*)"\s+siteId=([^>]*)</sg)
{
        $categories{$1} = $2;
        print "[$1] and [$2]\n";
}

foreach my $idkey (keys %categories)
{
        print "$idkey,$categories{$idkey}\n";
}

The URL seems to work just fine from a browser, so that doesn't seem to be the problem. Thanks for your help.

Michael Jensen
michael at inshift.com
http://www.inshift.com

Re: Still empty...here is a live and working example by tadman (Prior) on Jun 06, 2002 at 08:17 UTC
Remember, to a regex 'A' and 'a' are as different as '~' and 'ā' unless you say otherwise. It's almost always a good idea to include /i in a regexp that can be subject to random influences, such as users. I noticed a few "SmallPrint" entries in your HTML.

Before I get to that, let's just take this one step at a time. Here is, I believe, an example of the data you are trying to parse:

<a href="PlatformSoftwareSection.jsp?siteId=1&jid=94DDB69B3747X42D738A8A4E54CDD8A4&platformId=1&special=&bySection=1&sectionId=2167&catalog=1&amp;title=FireViewer+Videos+%26+Images ">
<span class="smallprint">E-Books & Document Readers</span></a>,

Here's something that might do the job:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my %categories; # No need for '= ()'
my $page = get('http://www.handango.com/PlatformSoftware.jsp?platformId=1&siteId=1&zsortParams=true');

while ($page =~ /
        sectionId=(\d+)              # Section ID (all digits)
        [^>"]+">                     # Remainder of param and tag
        \s+                          # Some whitespace
        <span\s+class="smallprint">  # SPAN tag
        ([^<]+)                      # "Stuff" up to next tag
        <                            # Start of next tag
        /xig)
{
        $categories{$1} = $2;
        print "[$1] and [$2]\n";
}

foreach my $idkey (keys %categories)
{
        print "$idkey,$categories{$idkey}\n";
}

You'll note I took the liberty of redefining your regex completely. In this case, I'm scooping the "sectionId" variable (numeric only) followed by any amount of "stuff", then grabbing the non-tagged content of the 'span' tag.

It works, as best as I can tell, but isn't very adaptable. This lack of adaptability makes this program 'brittle' (translation: liable to break completely because of a small change in input) and only qualifies this for use as a Quick Hack. I'd hate to think that this would become a piece of code that would still be in use six months from now. This is a very assumptive piece of code, and that's not good.
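To make the brittleness concrete, here is a small offline sketch (no network access). It runs a pattern of the same shape, capturing the span text with `([^<]+)`, against markup like the sample quoted above and against two slightly altered variants. The helper name `extract_categories` and the three sample strings are made up for illustration; they are not taken from the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref mapping sectionId => category name.
sub extract_categories {
    my ($html) = @_;
    my %found;
    while ($html =~ /
        sectionId=(\d+)               # Section ID (all digits)
        [^>"]+">                      # Remainder of the href and the tag
        \s+                           # Some whitespace
        <span\s+class="smallprint">   # SPAN tag
        ([^<]+)                       # Text up to the next tag
        <                             # Start of the next tag
    /xig) {
        $found{$1} = $2;
    }
    return \%found;
}

# Made-up sample in the same shape as the markup quoted in this reply.
my $as_quoted = q{<a href="PlatformSoftwareSection.jsp?siteId=1&sectionId=2167&catalog=1 "> <span class="SmallPrint">E-Books &amp; Document Readers</span></a>};

# Same link, but the class attribute uses single quotes.
my $single_quotes = q{<a href="PlatformSoftwareSection.jsp?siteId=1&sectionId=2167&catalog=1"> <span class='smallprint'>Games</span></a>};

# Same link, but with no whitespace between the A tag and the SPAN tag.
my $no_space = q{<a href="PlatformSoftwareSection.jsp?siteId=1&sectionId=2167&catalog=1"><span class="smallprint">Games</span></a>};

for my $case (
    [ 'as quoted',     $as_quoted     ],
    [ 'single quotes', $single_quotes ],
    [ 'no whitespace', $no_space      ],
) {
    my ($label, $html) = @$case;
    my $found = extract_categories($html);
    printf "%-14s %d match(es)\n", $label, scalar keys %$found;
}
```

Only the first string should match. The other two are still perfectly ordinary HTML, yet the pattern silently finds nothing in them, which is the same symptom as the original empty output.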
You can assume things won't change in the HTML in the next few hours or days, or even weeks, but any time-frame longer than that is really going out on a limb. If this is a long term thing, I'd suggest doing it properly, perhaps by using HTML::Parser or HTML::LinkExtor and some more robust code that can handle slight changes in formatting better.
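For what it's worth, a minimal sketch of the HTML::Parser route might look something like the following. It assumes the category name always sits in a "smallprint" span inside a link whose href carries a numeric sectionId, which is a guess based on the one sample above. The embedded markup, the handler names (`on_start`, `on_text`, `on_end`) and the state variables are all made up for illustration, and the sample stands in for the fetched page so the script runs offline.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Made-up markup standing in for the fetched page.
my $html = <<'HTML';
<a href="PlatformSoftwareSection.jsp?siteId=1&sectionId=2167&catalog=1">
  <span class="SmallPrint">E-Books &amp; Document Readers</span></a>,
<a href='PlatformSoftwareSection.jsp?sectionId=2170&siteId=1'><span
   class='smallprint'>Games</span></a>
HTML

my %categories;
my ($section_id, $in_span, $name);   # parser state shared by the handlers

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ \&on_start, 'tagname, attr' ],
    text_h  => [ \&on_text,  'dtext' ],       # dtext = entity-decoded text
    end_h   => [ \&on_end,   'tagname' ],
);
$parser->attr_encoded(1);   # keep attribute values raw so the href is matched as written

sub on_start {
    my ($tag, $attr) = @_;
    if ($tag eq 'a') {
        my $href = defined $attr->{href} ? $attr->{href} : '';
        # Becomes undef when the link carries no sectionId.
        ($section_id) = $href =~ /\bsectionId=(\d+)/i;
    }
    elsif ($tag eq 'span' && defined $section_id
           && lc($attr->{class} || '') eq 'smallprint') {
        $in_span = 1;
        $name    = '';
    }
}

sub on_text {
    my ($text) = @_;
    $name .= $text if $in_span;   # text may arrive in several chunks
}

sub on_end {
    my ($tag) = @_;
    if ($tag eq 'span' && $in_span) {
        $name =~ s/^\s+|\s+$//g;              # trim surrounding whitespace
        $categories{$section_id} = $name;
        $in_span = 0;
    }
    elsif ($tag eq 'a') {
        undef $section_id;
    }
}

$parser->parse($html);
$parser->eof;

foreach my $idkey (sort keys %categories) {
    print "$idkey,$categories{$idkey}\n";
}
```

Because the parser does the tokenizing, quote style, attribute order, letter case and whitespace between tags no longer matter. The remaining assumptions are about the page's structure, not its exact formatting. To run it against the live page, the here-document would be replaced by the result of `get(...)` from LWP::Simple, with a check that the result is defined.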
working great...will work on it to make it less fragile by inblosam (Monk) on Jun 06, 2002 at 16:20 UTC
Thanks everybody! The new regexp was the ticket. Works great now, and I will modify it to handle it if the page starts coming out different. I appreciated the tips on working with hashes and regexp. Thanks!

Michael Jensen
michael at inshift.com
http://www.inshift.com