Remember, unless you specify otherwise, 'A' and 'a' are as different to a regex as '~' and 'â'. It's almost always a good idea to include /i in a regexp that can be subject to random influences, such as users. I noticed a few "SmallPrint" entries in your HTML.
Before I get to that, let's just take this one step at a time. Here is, I believe, an example of the data you are trying to parse:
<a href="PlatformSoftwareSection.jsp?siteId=1&jid=94DDB69B3747X42D738A8A4E54CDD8A4&platformId=1&special=&bySection=1&sectionId=2167&catalog=1&amp;title=FireViewer+Videos+%26+Images">
<span class="smallprint">E-Books &
Document Readers</span></a>,
Here's something that might do the job:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my %categories; # No need for '= ()'
my $url = 'http://www.handango.com/PlatformSoftware.jsp?platformId=1&siteId=1&zsortParams=true';
my $page = get($url);
die "Couldn't fetch $url\n" unless defined $page; # get() returns undef on failure

while ($page =~ /
    sectionId=(\d+)              # Section ID (all digits)
    [^>"]+">                     # Remainder of param and tag
    \s+                          # Some whitespace
    <span\s+class="smallprint">  # SPAN tag
    ([^<]*)                      # "Stuff" up to next tag
    <                            # Start of next tag
/xig)
{
    $categories{$1} = $2;
    print "[$1] and [$2]\n";
}

foreach my $idkey (keys %categories)
{
    print "$idkey,$categories{$idkey}\n";
}
You'll note I took the liberty of redefining your regex completely. In this case, I'm scooping the "sectionId" variable (numeric only) followed by any amount of "stuff", then grabbing the non-tagged content of the 'span' tag.
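If you want to poke at the pattern without hitting the network, here is a minimal sketch that runs it against an inline string. The sample HTML is a shortened, made-up version of the link above, and scan_categories is just a name I picked for the wrapper; neither comes from your code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: shaped like the real link, but shortened, and with
# the mixed-case "SmallPrint" that /i is there to catch.
my $sample = <<'HTML';
<a href="PlatformSoftwareSection.jsp?siteId=1&platformId=1&sectionId=2167&catalog=1">
<span class="SmallPrint">E-Books &amp; Document Readers</span></a>,
HTML

# Returns a hash ref of sectionId => span text for every match in $page.
sub scan_categories {
    my ($page) = @_;
    my %found;
    while ($page =~ /
        sectionId=(\d+)              # Section ID (all digits)
        [^>"]+">                     # Remainder of param and tag
        \s+                          # Some whitespace
        <span\s+class="smallprint">  # SPAN tag
        ([^<]*)                      # "Stuff" up to next tag
        <                            # Start of next tag
    /xig) {
        $found{$1} = $2;
    }
    return \%found;
}

my $found = scan_categories($sample);
# Entities such as &amp; are captured as-is; the regex does not decode them.
print "$_ => $found->{$_}\n" for sort keys %$found;
```

Take the /i off and the "SmallPrint" sample above stops matching, which is the point of the opening remark.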
It works, as best as I can tell, but isn't very adaptable.
This lack of adaptability makes the program 'brittle' (translation: liable to break completely because of a small change in input), which qualifies it for use only as a Quick Hack. I'd hate to think this piece of code would still be in use six months from now. It makes a lot of assumptions, and that's not good. You can assume the HTML won't change in the next few hours or days, or even weeks, but any time-frame longer than that is really going out on a limb.
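To make 'brittle' concrete, here is a rough sketch that feeds the same pattern a few variations of one link. The variants are invented for illustration (I haven't seen the site produce them), and scan_categories is again just my wrapper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as the script above, wrapped in a sub so it can be reused.
sub scan_categories {
    my ($page) = @_;
    my %found;
    while ($page =~ /
        sectionId=(\d+)              # Section ID (all digits)
        [^>"]+">                     # Remainder of param and tag
        \s+                          # Some whitespace
        <span\s+class="smallprint">  # SPAN tag
        ([^<]*)                      # "Stuff" up to next tag
        <                            # Start of next tag
    /xig) {
        $found{$1} = $2;
    }
    return \%found;
}

# Hypothetical variations; only 'original' has the layout the pattern expects.
my %variants = (
    original      => qq{<a href="x.jsp?sectionId=2167&catalog=1">\n<span class="smallprint">E-Books</span></a>},
    single_quotes => qq{<a href='x.jsp?sectionId=2167&catalog=1'>\n<span class='smallprint'>E-Books</span></a>},
    no_whitespace => qq{<a href="x.jsp?sectionId=2167&catalog=1"><span class="smallprint">E-Books</span></a>},
    id_last       => qq{<a href="x.jsp?catalog=1&sectionId=2167">\n<span class="smallprint">E-Books</span></a>},
);

for my $name (sort keys %variants) {
    my $found = scan_categories($variants{$name});
    my @pairs = map { "$_ => $found->{$_}" } sort keys %$found;
    printf "%-14s %s\n", $name, @pairs ? join(', ', @pairs) : '(no match)';
}
```

The single-quoted and no-whitespace variants simply fail to match. The last one is worse: when sectionId is the final parameter, the [^>"]+ piece still insists on at least one character, so the regex backtracks and hands you 216 instead of 2167 without complaint.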
If this is a long term thing, I'd suggest doing it properly, perhaps by using HTML::Parser or HTML::LinkExtor and some more robust code that can handle slight changes in formatting better.
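As a starting point, here is a minimal sketch of the HTML::Parser route. It assumes the category name is whatever text sits inside the link, and that the ID is the sectionId parameter of the href. The extract_categories name and the sample HTML are mine, not from your page, so treat this as an outline rather than a drop-in replacement.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Returns a hash ref of sectionId => link text for every <a> whose href
# carries a sectionId parameter. (Hypothetical helper for illustration.)
sub extract_categories {
    my ($html) = @_;
    my %categories;
    my $current_id;   # sectionId of the <a> we are currently inside, if any
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'a' && defined $attr->{href};
            if ($attr->{href} =~ /[?&;]sectionId=(\d+)/i) {
                $current_id = $1;
                $text       = '';
            }
        }, 'tagname, attr' ],
        text_h => [ sub {
            my ($dtext) = @_;   # text with entities already decoded
            $text .= $dtext if defined $current_id;
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'a' && defined $current_id;
            $text =~ s/\s+/ /g;          # collapse runs of whitespace
            $text =~ s/^\s+|\s+$//g;     # trim both ends
            $categories{$current_id} = $text;
            undef $current_id;
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return \%categories;
}

# Made-up sample with quirks that would trip up the regex version:
# single quotes, mixed case, and line breaks inside the tag and the text.
my $html = <<'HTML';
<a href="Section.jsp?siteId=1&amp;sectionId=2167&amp;catalog=1"><span
   class='SmallPrint'>E-Books &amp;
Document Readers</span></a>,
<a href="Other.jsp?siteId=1">No section here</a>
HTML

my $categories = extract_categories($html);
print "$_,$categories->{$_}\n" for sort keys %$categories;
```

Because the parser deals with quoting, case and whitespace for you, the code no longer cares whether the span is there at all. The trade-off is that it picks up every link with a sectionId, so you may want to filter further if the page has other links of that shape.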