inblosam has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to get a key and value pair from a "get". I think I am pretty close, but the script doesn't print anything. Basically, I match on a word that always precedes the first value I want and capture that value, then do the same for a second value further along, and store the first capture as the key and the second as the value. Any help is very much appreciated!

#!perl
use LWP::Simple;
#being a good boy
use strict;
use warnings;

my %categories = ();
my $page = get 'http://www.inshift.com/Products.html';

while ($page=~m/href=\"(\w+?)".+?class=\"rightnav\"\>(\w+?)\</sg){
    $categories{$1}=$2;
    #this print doesn't come up at all!
    print "[$1] and [$2]";
}

my @pairlist = keys(%categories);
my $idkey = ();
foreach $idkey (@pairlist) {
    #this one doesn't show up at all either
    print "$idkey,$categories{$idkey}\n";
}


Michael Jensen
michael at inshift.com
http://www.inshift.com

Re: Key/Value pair from GET
by Zaxo (Archbishop) on Jun 06, 2002 at 07:16 UTC

    I'd recommend HTML::LinkExtor for this, it's the wheel you're reinventing. Its pod gives application examples.

    After Compline,
    Zaxo
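For readers who want to see the suggestion in action, here is a minimal sketch of HTML::LinkExtor, assuming the HTML::Parser and URI distributions are installed. The sample markup and the `%section_for` hash are made up for illustration. Note that LinkExtor only hands back the link attributes (such as href), not the text between the anchor tags, so it covers the "key" half of the problem only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical sample markup standing in for the fetched page.
my $html = <<'EOT';
<a href="thepage.jsp?siteId=1&amp;sectionId=443"><span class="small">Geography</span></a>
<a href="about.html">About</a>
EOT

# With no callback, the extractor collects the links internally.
my $extor = HTML::LinkExtor->new;
$extor->parse($html);
$extor->eof;

my %section_for;    # href => sectionId (illustrative name)
for my $link ($extor->links) {
    my ($tag, %attr) = @$link;    # each link is [tag, attr => url, ...]
    next unless $tag eq 'a' && defined $attr{href};

    # Let URI split the query string instead of a hand-written regex.
    my %params = URI->new($attr{href})->query_form;
    $section_for{ $attr{href} } = $params{sectionId}
        if exists $params{sectionId};
}

print "$section_for{$_} <- $_\n" for sort keys %section_for;
```

In a real script the `$html` string would come from `get(...)` as in the original post.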

      This is a better example of what I am trying to do (should have been more clear, sorry). I am trying to get values of variables from a link along with the name of the link (Geography below). This might be in the HTML:
      <a href="thepage.jsp?siteId=1&sectionId=443&amp;"> <span class="small">Geography</span></a>,

      So I want to get 443 and Geography, in the form: 443,Geography. I still get nothing printed out on my command prompt. Here is my modified code to fit this model better:
      #!perl
      use LWP::Simple;
      use strict;
      use warnings;

      #get the values out easily!
      my %categories = ();
      my $page = get 'http://www.thepage.com/thepage.html';

      while ($page=~m/sectionId\=(\w+?)".+?\"small\">(\w+?)\</sg){
          $categories{$1}=$2;
          print "[$1] and [$2]";
      }

      my @pairlist = keys(%categories);
      my $idkey = ();
      foreach $idkey (@pairlist) {
          print "$idkey,$categories{$idkey}\n";
      }


      Michael Jensen
      michael at inshift.com
      http://www.inshift.com
        It's not hard to see why your regex goes wrong in the example given. Your regex expects "sectionId=", followed by one or more word characters, followed by a double quote. However, the data has "sectionId=", followed by 443 (which are word characters), followed by an ampersand. An ampersand is of course not a double quote.

        Abigail
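The failure described above can be reproduced without any network access. The sketch below runs the pattern from the post against the sample anchor, then tries a relaxed variant; the relaxed pattern is only one possible fix and is an assumption, not something given in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample anchor quoted in the question.
my $html = '<a href="thepage.jsp?siteId=1&sectionId=443&amp;"> '
         . '<span class="small">Geography</span></a>,';

# Pattern from the post: demands a double quote right after the ID.
my $original = qr/sectionId\=(\w+?)".+?\"small\">(\w+?)\</s;

# Relaxed variant (an assumption): digits for the ID, then skip
# whatever is left of the URL up to the closing quote.
my $relaxed = qr/sectionId=(\d+)[^"]*".+?"small">([^<]+)</s;

print "original: ", ($html =~ $original ? "match" : "no match"), "\n";

if ($html =~ $relaxed) {
    print "relaxed:  $1,$2\n";
}
```

The original pattern fails because "443" is followed by "&", not a double quote; the relaxed one lets `[^"]*` absorb the rest of the query string.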

Re: Key/Value pair from GET
by tadman (Prior) on Jun 06, 2002 at 07:24 UTC
    If that print isn't showing up, the likely problem is that your regex is failing. Your regex could be failing because the data isn't quite in the format you expect. You would be better off being more liberal with your expression, which right now is limited to "word" characters by virtue of \w.

    Your second print is likely not showing up because there are no "pairs", since none were added in the first loop.

    A quick fix looks something like this:
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my %categories;
    my $page = get('http://www.inshift.com/Products.html');

    while ($page=~m/href="([^"]*)"\s+class="rightnav">([^>]*)</sg) {
        $categories{$1} = $2;
        print "[$1] and [$2]\n"; # Note "\n"
    }

    foreach my $idkey (keys %categories) {
        print "$idkey,$categories{$idkey}\n";
    }
    A few notes on the changes:
    • Don't assign default values to variables which don't need them. Hashes and arrays start empty, and scalars are undef. In your example you assigned an empty list to a scalar, which is pretty much nonsense.
    • Where appropriate or practical, you can put your declarations within the looping structure, such as foreach.
    • Don't create scratchpad variables like @pairlist which are only used once. Just inline the code that makes them right into the loop. Of course, if the calculation of this is really complicated, maybe you would reconsider, but a simple keys call does not usually qualify.
    • If you print without newlines, your output is buffered and may not show up at all until the program finishes. You can change this behaviour using $| if you like ($OUTPUT_AUTOFLUSH for those who use English).
    • Be sure to think out your regex and test it on sample data before throwing up your hands in despair. In this case, you expected \w to capture a lot more than was in your example.
    A minor error, really.
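The buffering note in the list above can be tried out with a few lines. This is a minimal sketch, assuming output goes to STDOUT; the loop and the one-second pause are only there to make the effect visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn on autoflush for the currently selected handle (STDOUT by default).
# With "use English;" the same variable is spelled $OUTPUT_AUTOFLUSH.
$| = 1;

for my $step (1 .. 3) {
    print "[$step]";    # no newline, yet it appears right away
    sleep 1;
}
print "\n";
```

With the `$| = 1;` line commented out and output going to a terminal, the three bracketed numbers would typically appear together once the final newline is printed.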
      I added your changes (thanks for the tips...helps a lot) but I still get nothing from my print statements. Here is a working example of the page and text I am really trying to get:
      #!perl
      use LWP::Simple;
      use strict;
      use warnings;

      #get the values out
      my %categories = ();
      my $page = get 'http://www.handango.com/PlatformSoftware.jsp?platformId=1&siteId=1&zsortParams=true';

      while ($page=~m/class="smallprint">([^"]*)"\s+siteId=([^>]*)</sg) {
          $categories{$1} = $2;
          print "[$1] and [$2]\n";
      }

      foreach my $idkey (keys %categories) {
          print "$idkey,$categories{$idkey}\n";
      }

      The URL seems to work just fine from a browser, so that doesn't seem to be the problem. Thanks for your help.

      Michael Jensen
      michael at inshift.com
      http://www.inshift.com
        Remember, to a regex 'A' and 'a' are as different as '~' and 'â' unless you specify otherwise. It's almost always a good idea to include /i in a regexp that can be subject to random influences, such as users. I noticed a few "SmallPrint" entries in your HTML.

        Before I get to that, let's just take this one step at a time. Here is, I believe, an example of the data you are trying to parse:
        <a href="PlatformSoftwareSection.jsp?siteId=1&jid=94DDB69B3747X42D738A8A4E54CDD8A4&platformId=1&amp;special=&amp;bySection=1&amp;sectionId=2167&amp;catalog=1&amp;title=FireViewer+Videos+%26+Images "> <span class="smallprint">E-Books & Document Readers</span></a>,
        Here's something that might do the job:
        #!/usr/bin/perl -w
        use strict;
        use LWP::Simple;

        my %categories; # No need for '= ()'
        my $page = get('http://www.handango.com/PlatformSoftware.jsp?platformId=1&siteId=1&zsortParams=true');

        while ($page =~ /
                sectionId=(\d+)              # Section ID (all digits)
                [^>"]+">                     # Remainder of param and tag
                \s+                          # Some whitespace
                <span\s+class="smallprint">  # SPAN tag
                ([^<]*)                      # "Stuff" up to next tag
                <                            # Start of next tag
            /xig) {
            $categories{$1} = $2;
            print "[$1] and [$2]\n";
        }

        foreach my $idkey (keys %categories) {
            print "$idkey,$categories{$idkey}\n";
        }
        You'll note I took the liberty of redefining your regex completely. In this case, I'm scooping the "sectionId" variable (numeric only) followed by any amount of "stuff", then grabbing the non-tagged content of the 'span' tag. It works, as best as I can tell, but isn't very adaptable.

        This lack of adaptability makes this program 'brittle' (translation: liable to break completely because of a small change in input) and only qualifies this for use as a Quick Hack. I'd hate to think that this would become a piece of code that would be used over six months from now. This is a very assumptive piece of code, and that's not good. You can assume things won't change in the HTML in the next few hours or days, or even weeks, but any time-frame longer than that is really going out on a limb.

        If this is a long term thing, I'd suggest doing it properly, perhaps by using HTML::Parser or HTML::LinkExtor and some more robust code that can handle slight changes in formatting better.
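As a rough idea of what the HTML::Parser route might look like, here is a minimal sketch, assuming HTML::Parser is installed. The sample markup, the `$current_id` and `$text` state variables, and the small regex used to pull sectionId out of the href are all illustrative choices, not code from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample markup standing in for the fetched page.
my $html = <<'EOT';
<a href="thepage.jsp?siteId=1&amp;sectionId=443"> <span class="small">Geography</span></a>
<a href="other.jsp?siteId=1">No section here</a>
EOT

my %categories;
my ($current_id, $text);    # set only while inside an <a> with a sectionId

my $parser = HTML::Parser->new(
    api_version => 3,
    # Opening tag: remember the sectionId if this anchor carries one.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        return unless $tag eq 'a' && defined $attr->{href};
        if ($attr->{href} =~ /[?&;]sectionId=(\d+)/) {
            ($current_id, $text) = ($1, '');
        }
    }, 'tagname, attr' ],
    # Text: accumulate it, whatever tags it happens to be nested in.
    text_h => [ sub {
        $text .= $_[0] if defined $current_id;
    }, 'dtext' ],
    # Closing </a>: trim the collected text and store the pair.
    end_h => [ sub {
        my ($tag) = @_;
        return unless $tag eq 'a' && defined $current_id;
        $text =~ s/^\s+|\s+$//g;
        $categories{$current_id} = $text;
        undef $current_id;
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

print "$_,$categories{$_}\n" for sort keys %categories;
```

Because the handlers react to tags rather than to exact character sequences, changes in whitespace, attribute order, or the class name on the span should not break this the way they would break a single long regex.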
Re: Key/Value pair from GET
by Aristotle (Chancellor) on Jun 06, 2002 at 22:36 UTC
    I don't like using long and complex regexes that may break. The URI module is built to parse URIs; why forgo it? On the other hand, all solutions to parse HTML properly require a lot of coding; so to pick out the URLs, we make as few assumptions as possible so that a very simple (and therefore robust) regex will do the trick. The following code only assumes there are links that have a sectionId parameter; we want its value, and we want the link text, minus any tags (whether <span> pairs or something completely different).
    #!/usr/bin/perl -w
    use strict;
    use URI;

    # in the real world, it comes from somewhere else
    my $page_content = <<'EOT';
    <a href="thepage.jsp?siteId=1&sectionId=443&amp;"> <span class="small">Geography</span></a>
    EOT

    # the following regex will simply catch any anchor tags,
    # making no assumptions about their structure
    while($page_content =~ /<a\s[^>]*?href="([^"]+)"[^>]*>(.*?)<\/a>/sgi) {
        my $url = URI->new($1);
        my %params = $url->query_form; # now we let URI do the dirty job for us
        next unless exists $params{sectionId}; # was there a sectionId parameter?
        # if we're still in the loop here, there was
        my $text = $2;
        $text =~ s/<[^>]*>//sgi; # strip all tags from the anchor's inner text
        $text =~ s/^\s+//sgi;    # then strip any whitespace from the front
        $text =~ s/\s+$//sgi;    # and from the end
        print "$params{sectionId},$text\n"; # some pretty output
    }
    ____________
    Makeshifts last the longest.