Howdy Monks,

I am working on a parser to grab all the unsubscribe links from a big text file. The text file is a mix of plain text and HTML. I am able to use HTML::LinkExtor to grab most of the links; however, at this point it returns both 'a href' and 'img src' links. I'm only interested in the 'a href' links, and once I have these, I would like to narrow them down with a regex.

As of now it looks like this:

#!/usr/bin/perl

use HTML::LinkExtor;
use URI::URL;

$p = HTML::LinkExtor->new(\&cb, "http://www.x10.com");
sub cb {
    my($tag, %links) = @_;
    print "$tag @{[%links]}\n";
}

$p->parse_file("rfl.txt");

#@glob = $p;
#for($i=0; $i<@glob; $i++){
#    $_ = @glob[$i];
#    if(/account.cgi/){
#        $counter = 1 - $counter;
#        print $_ ;
#    }
#}

I plan to uncomment the regex portion when I get better results.

I know there are a lot of errors, and I appreciate any guidance. Incidentally, I can't use strict, because I get these errors when I do:

Global symbol "$p" requires explicit package name at link.pl line 9.
Global symbol "$p" requires explicit package name at link.pl line 14.
Execution of link.pl aborted due to compilation errors.

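For reference, those two messages are what strict reports for a variable that was never declared. A minimal sketch of the same script with `$p` declared via `my` follows. It assumes nothing else in the script changes, and it leaves out `use URI::URL` only because the sketch never calls it directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Declaring $p with "my" gives strict the explicit declaration it asks for.
my $p = HTML::LinkExtor->new(\&cb, "http://www.x10.com");

# Same callback as before: print the tag name and its attribute/URL pairs.
sub cb {
    my ($tag, %links) = @_;
    print "$tag @{[%links]}\n";
}

# Same input file as in the original script.
$p->parse_file("rfl.txt");
```

The loop variables in the commented-out section (`@glob`, `$i`, `$counter`) would need the same treatment once that part is uncommented.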
So my main objectives are to remove any 'img src' references, and make sure that all the URLs are stored properly in an array which I can parse further.
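One way those two objectives might be approached, as a rough sketch rather than a tested solution: have the callback ignore everything except `a` tags and push each `href` onto an array, then filter that array with `grep`. The sample HTML, the `example.com` URLs, and the names `collect_href`, `@links`, and `@unsub` are made up for illustration; the real script would call `parse_file("rfl.txt")` instead of parsing a string.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my @links;    # every href found on an <a> tag ends up here

# The callback receives the (lowercased) tag name followed by
# attribute => URL pairs for that tag.
sub collect_href {
    my ($tag, %attr) = @_;
    return unless $tag eq 'a' && defined $attr{href};
    push @links, "$attr{href}";    # quotes stringify the URI object
}

my $p = HTML::LinkExtor->new(\&collect_href, "http://www.x10.com");

# Hypothetical sample input standing in for rfl.txt.
my $html = <<'HTML';
<a href="http://www.example.com/account.cgi?id=1">unsubscribe</a>
<img src="http://www.example.com/logo.gif">
<a href="/other.html">other</a>
HTML

$p->parse($html);
$p->eof;

# Narrow the collected links down with a regex.
my @unsub = grep { /account\.cgi/ } @links;
print "$_\n" for @unsub;
```

Collecting in the callback avoids the `@glob = $p` step, which would only copy the parser object into the array rather than the links.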

Here is the top portion of my current results. I also noticed that some of the URLs are not returned or are incomplete.

a href http://www.x10.com/3D%22http://hop.clickbank.net/?aaso2/intelli%22
a href http://www.x10.com/3D%22http://www.consumerinfo.com/home_pca.asp?sc=3D141 =
a href http://www.x10.com/3D%22http://hop.clickbank.net/?aaso2/webpd%22
a href http://www.x10.com/3D%22http://www.x10.com/xcam2_allspecial33.htm%22
a href http://www.x10.com/3D%22http://www.teamnova.com/encore/combo.cfm?siteid=3 D=
a href http://www.x10.com/3D%22http://hop.clickbank.net/?aaso2/intel=
a href http://www.x10.com/3D%22http://hop.clickbank.net/?aaso2/intel=
img src http://www.x10.com
img src http://www.x10.com
a href http://www.x10.com/jecn@allaboutspe=
img src http://www.x10.com
img src http://www.x10.com
img src http://www.x10.com
a href http://www.x10.com/3D%22http://www.consumerinfo.com/home_pca.asp?sc=3D14=
img src http://www.x10.com/=
img src http://www.x10.com

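The `3D%22` prefixes and the trailing `=` signs look like quoted-printable encoding (`=3D` stands for `=`, and a `=` at the end of a line is a soft line break), which would fit if rfl.txt is a dump of saved e-mail. That is only a guess about the input. Under that assumption, a sketch of decoding the text with MIME::QuotedPrint before handing it to the parser could look like this; the sample string and variable names are hypothetical.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);
use HTML::LinkExtor;

# Hypothetical sample shaped like a quoted-printable mail body:
# "=3D" encodes "=", and "=" at end of line is a soft line break.
my $raw = qq{<a href=3D"http://hop.clickbank.net/?aaso2/intel=\nli">x</a>\n};

# Undo the quoted-printable encoding before looking for links.
my $decoded = decode_qp($raw);

my @links;
my $p = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @links, "$attr{href}" if $tag eq 'a' && defined $attr{href};
});

$p->parse($decoded);
$p->eof;

print "$_\n" for @links;
```

If only some parts of the file are quoted-printable, decoding the whole thing could mangle stray `=` characters elsewhere, so this would need checking against the real data.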
I appreciate any help you can give.

Bests,
amearse


In reply to Jiggy w/ LinkExtor by amearse
