Firstly, the reason I'm looping through it is that this is test code to work out a rule for a certain page layout.

I'm writing a program to pull headlines from non-RSS newspapers so I'm looking *only* for the headlines. What I have found is that there is some kind of designation, be it graphic or comment, in the HTML code that I can look for and then start my headline link search after that.
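To make the "find the designation, then search after it" idea concrete, here is a minimal sketch. The marker comment, the sample HTML and the variable names are all made up for illustration; a real page would have its own marker.

```perl
use strict;
use warnings;

# Hypothetical page fragment; the marker comment is invented for this sketch.
my $html = join '',
    qq{<a href="/about.html">About us</a>\n},
    qq{<!-- headlines start -->\n},
    qq{<a href="/story1.html">First headline</a>\n};

my $marker = '<!-- headlines start -->';   # assumed designation
my $pos    = index($html, $marker);

# Only search the part of the page after the marker (whole page if it is absent).
my $tail = $pos >= 0 ? substr($html, $pos + length $marker) : $html;

# No capture groups + /g in list context returns every whole match.
my @links = $tail =~ m{<a\s+href[^>]*>.+?</a>}gis;
print "$_\n" for @links;
```

The navigation link before the marker is skipped because the search only ever sees the text after it.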

As this is test code, I saved a copy of the HTML as a text file and was reading it into an array, then parsing from there.

As for the parser, I'll give it a shot, but judging from your output I'd still have to search for all the href links, as it's pulling all the <tag> stuff out. That's not what I'm after. I just want the href links and the text between them. That's why I was using:

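m/(<a href[^>]*>)(.+<\/a>)/io;
thereby giving me my link in $1, and the text between plus the closing tag in $2. (The slash in the closing tag has to be escaped, since / is also the pattern delimiter.)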

Then I am throwing $1 and $2 into a hash to eliminate duplicate headlines.
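Roughly, the dedupe step looks like the sketch below. The sample lines are made up, the hash name is just for illustration, and I've swapped in a non-greedy .+? and ! delimiters here, so it is not exactly the pattern quoted above.

```perl
use strict;
use warnings;

# Made-up sample lines standing in for the saved page.
my @lines = (
    '<a href="/story1.html">First headline</a>',
    '<p>no link on this line</p>',
    '<a href="/story1.html">First headline</a>',
    '<a href="/story2.html">Second headline</a>',
);

my %headlines;   # opening tag => link text plus closing tag
for my $line (@lines) {
    if ($line =~ m!(<a href[^>]*>)(.+?</a>)!i) {
        $headlines{$1} = $2;   # a repeated key overwrites, so duplicates collapse
    }
}

print $_, $headlines{$_}, "\n" for sort keys %headlines;
```

Keying on the opening tag means two links to the same URL count as one headline even if the link text differs; keying on $1 . $2 instead would treat those as distinct.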

I will have a look at the parser though. It would be nice to make this easier. : )

What I'm really stumped about though is why the code I posted was concatenating the values on the matches. Unless my PC was seriously overheated and something was going wrong, I can't see why those wouldn't be unique matches every time, since the regex is being sent different data to check on each pass.

Any ideas on that?

Update: After much thought I have figured out where my thinking went wrong with my original question.

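When I was asking why m!(<a[^>]*>)(.+?)!iog was not matching $3, $4, etc. with the global, but merely $1 and $2, it finally occurred to me that all I'm *asking* it to match is $1 and $2. The pattern only has two sets of capturing parentheses, so /g just refills $1 and $2 on each successive match rather than numbering onward.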
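A small sketch of what /g does with two capture groups, in both contexts. The sample line is made up, and I've added a trailing </a> to the pattern so the non-greedy (.+?) has something to stop at; that part is my assumption, not the pattern from the original question.

```perl
use strict;
use warnings;

# Made-up line with two links on it.
my $line = '<a href="/a.html">Alpha</a> <a href="/b.html">Beta</a>';

# Scalar context: each pass of the loop is one match, and $1/$2 are refilled.
my @pairs;
while ($line =~ m!(<a[^>]*>)(.+?)</a>!ig) {
    push @pairs, [$1, $2];
}

# List context: every capture from every match comes back in one flat list.
my @flat = $line =~ m!(<a[^>]*>)(.+?)</a>!ig;

printf "%s => %s\n", @$_ for @pairs;
print scalar(@flat), " captures in list context\n";
```

Either way there is never a $3: the second link simply shows up as a new $1 and $2 on the next pass, or as elements three and four of the returned list.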

Some people fall from grace. I prefer a running start...


In reply to Re: (jeffa) Re: Problems splitting HTML in to hash table by Popcorn Dave
in thread Problems splitting HTML in to hash table by Popcorn Dave
