in reply to (jeffa) Re: Problems splitting HTML in to hash table
in thread Problems splitting HTML in to hash table

Firstly, the reason I'm looping through it is that this is test code to work out a rule for a certain page layout.

I'm writing a program to pull headlines from non-RSS newspapers, so I'm looking *only* for the headlines. What I have found is that there is some kind of designation, be it a graphic or a comment, in the HTML code that I can look for, and then start my headline link search after that.

As this is test code, I saved a copy of the HTML as a text file and was reading it into an array, then parsing from there.

As for the parser, I'll give it a shot, but judging from your output I'd still have to search for all the href links, as it's pulling all the <tag> stuff out. That's not what I'm after. I just want the href links and the text between them. That's why I was using:

m/(<a href[^>]*>)(.+<\/a>)/io;
thereby giving me my link, the text between, and the closing tag.

Then I am throwing $1 and $2 into a hash table to eliminate duplicate headlines.
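Roughly, the loop looks like this (a minimal sketch of what I just described; @html and %headlines are placeholder names, not my actual code):

# Sketch: match each line, key the hash on the captured text
# so duplicate headlines collapse to a single entry.
my %headlines;
for my $line (@html) {
    if ($line =~ m/(<a href[^>]*>)(.+<\/a>)/i) {
        $headlines{$2} = $1;
    }
}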

I will have a look at the parser though. It would be nice to make this easier. : )

What I'm really stumped about, though, is why the code I posted was concatenating the values on the matches. Unless my PC was seriously overheated and something was going wrong, I can't see why those wouldn't be unique matches every time, as you're sending it different data to check.

Any ideas on that?

Update: After much thought I have figured out where my thinking went wrong with my original question.

When I was asking why m!(<a[^>]*>)(.+?)!iog was not matching $3, $4, etc. with the global, but merely $1 and $2, it finally occurred to me that all I'm *asking* it to match is $1 and $2.
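In other words, with two capture groups /g never fills $3, $4, and so on; each successive match just repopulates $1 and $2. To collect every pair you loop, or take them all in list context ($html here is a placeholder for the slurped page):

while ($html =~ m!(<a[^>]*>)(.+?)</a>!ig) {
    print "link=$1 text=$2\n";   # fresh $1/$2 on every match
}
my @pairs = $html =~ m!(<a[^>]*>)(.+?)</a>!ig;   # or all at once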

Some people fall from grace. I prefer a running start...

(jeffa) 3Re: Problems splitting HTML in to hash table
by jeffa (Bishop) on Jun 11, 2002 at 19:20 UTC
    Sorry, but i didn't ask why you are looping, i asked why you are looping *like that*. But the point is mu. Read on. ;)

    "looks like I'd still have to search for all the href links as it's pulling all the stuff out..."

    That's much more trivial to do than you make it sound. Now, i don't know what a 'headline' is, so i am going to assume it is the text between the anchor tags. All you need to do is this:

    # create the parser, etc.
    my %hash;
    while (my $tag = $parser->get_tag('a')) {
        $hash{$parser->get_text} = $tag->[1]->{href};
    }
    for (keys %hash) {
        print qq|<a href="$hash{$_}">$_</a>\n|;
    }
    Every time you add a key to a hash, non-unique keys will overwrite the ones that already exist - i see no good reason to encapsulate this in a subroutine call.

    If you want unique URLs instead, simply switch $parser->get_text with $tag->[1]->{href} (and the keys with the values in the for loop - sketched after the next example). If you want to parse the href links even further, then i suggest the URI module:

    use URI;
    use Data::Dumper;   # for the Dumper() call below
    # etc.
    my @list;
    while (my $tag = $parser->get_tag('a')) {
        my $uri = URI->new($tag->[1]->{href});
        push @list, {
            path  => $uri->path(),
            query => { $uri->query_form() },
            text  => $parser->get_text(),
        };
    }
    print Dumper \@list;
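    And the keyed-on-href swap mentioned above would look something like this (same assumed HTML::TokeParser-style loop, just a sketch):

    my %hash;
    while (my $tag = $parser->get_tag('a')) {
        $hash{ $tag->[1]->{href} } = $parser->get_text;
    }
    for (keys %hash) {
        # now $_ is the href and the value is the text
        print qq|<a href="$_">$hash{$_}</a>\n|;
    }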
    There are soooo many cool modules out there to make your life easier. I personally have more fun writing 'glue code' than 'doing it all by hand'. Doing the latter is a good way to learn, but after that, i say it is better and faster to use the help of the CPAN (and all the wonderful folks who contribute).

    "What I'm really stumped about though is why the code I posted was concatenating the values on the matches ...Any ideas on that?"

    Nope, sorry. When i see someone doing it the wrong way, instead of trying to understand their logic i try to show them a more right way. It would take far too much energy to do the former, plus a liberal amount of PSI::ESP.

    I know this came off as grumpy - but i really do wish you the best in your endeavor. Good luck!

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Thanks for all that!

      Firstly, the reason I am looping like that is that I'm reading a file into an array and incrementing an index until I find my target text; from that index I know how far to count to find what I'm after. There may be a more efficient way to do it, but for now I want it to work. : )
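      Something like this, in sketch form (@lines and $marker are placeholders; $marker stands in for whatever designation a given page uses):

      my $start;
      for my $i (0 .. $#lines) {
          if ($lines[$i] =~ /\Q$marker\E/) {
              $start = $i;   # the headline link search begins here
              last;
          }
      }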

      As for my problem, I've at least found it. For some reason the author of this particular page had put all their news headlines, links and text, on one long line. Now that I know that, I *think* I can take it from there.
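      That one long line would also explain the concatenating: a greedy .+ runs to the *last* </a> on the line, so every headline gets swallowed into one match. A quick sketch:

      my $line = '<a href="/a">One</a> <a href="/b">Two</a>';
      $line =~ m!(<a href[^>]*>)(.+</a>)!i;
      # greedy: $2 is 'One</a> <a href="/b">Two</a>'
      while ($line =~ m!(<a href[^>]*>)(.+?)</a>!ig) {
          print "$2\n";   # non-greedy with /g: 'One', then 'Two'
      }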

      And you didn't sound grumpy at all. For now, I think I'm going to steer clear of the modules to practice my regexes, as I'm still a bit rusty on some of the finer points. However, once this thing is running, I will definitely look at the module aspect to see if I can shorten the code.

      At present I've got 79 newspaper websites that I want to look at, but I've managed to pare them down to 19 rules, so that isn't too bad, I think.

      Oh, btw, is the PSI::ESP module in the Acme section of CPAN? I think I could really use that for some *serious* debugging... ; )

      Some people fall from grace. I prefer a running start...