How to prevent a value from being repeated?

turbolofi has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks!
I'm building a screen-scraper for Web of Science (http://isiknowledge.com - requires subscription).
So far it's pretty simple: it retrieves a search results page, then fetches each record linked from that page and extracts data from it, all done in a nice loop.
However, I have some problems with my regexps, specifically with preventing a value from being repeated.
Please take a look at the following code:

#!/usr/bin/perl -w
# Get urls from result page
use warnings;
use strict;
use LWP::Simple;
use HTML::Entities; # for htmldecode

my (
    $resultpage, $resultpage_url, $i, $new_url, $new_title, $title, @urls, $n, $host,
    $article_title, $authors, $source_journal, $source_volume, $source_issue, $source_pages, $source_publish_date
);

$host = "http://apps.isiknowledge.com";
$resultpage_url = 'http://apps.isiknowledge.com/summary.do?product=UA&search_mode=GeneralSearch&qid=2&SID=S1NIfp9Koh1L9D1D5I4&page=1&action=changePageSize&pageSize=10';
$n = 0;
main();

sub main{
    $resultpage = get ("$resultpage_url") or die "couldn't retrieve";

    while ($resultpage =~ m{<a class=\"smallV110\" href=\"(.*?)\">}gis) {
        push(@urls, "$1\n");
        for $i ($urls[$n]) {
            $new_url = get ($host . decode_entities($urls[$n]));
            sleep 1; # be nice to the server
            $new_url =~ m{<title>(.*?)<\/title>}gis; # capture page title
            $new_title = $1;
            print "$new_title\n";
            $new_url =~ m{\<td class\=\"FullRecTitle\">(.*?)</td>}gis; # Capture article title
            $article_title = $1;
            print "$article_title\n";
            # should capture journal name - still struggling with this regexp. Is repeated!
            $new_url =~ m{(journal of hazardous)}gis;
            $source_journal = $1;
            print "$source_journal\n";
            
        };
        $n++;
    };
};

In all but the first iteration, the last regexp isn't matched, resulting in a repeat of the article title. (I guess that the backreference from the regexp to capture the article title is assigned to $source_journal, which is then printed)
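
The symptom can be reproduced without the scraper. The following is only a minimal sketch: the two strings in `@pages` are made up and stand in for fetched record pages, and `@pages` and `@seen` are hypothetical names. It suggests that a failed match leaves `$1` holding the value from the previous successful match.

```perl
use strict;
use warnings;

# Made-up stand-ins for two fetched record pages (assumption, not real data).
my @pages = (
    'Title: Foo; Journal of Hazardous Materials',
    'Title: Bar; some other journal',
);

my @seen;
for my $page (@pages) {
    $page =~ m{Title: (\w+)};             # succeeds for both pages and sets $1
    $page =~ m{(journal of hazardous)}i;  # fails for the second page; $1 is left as it was
    push @seen, $1;                       # unconditional use of $1, as in the scraper
    print "$1\n";
}
```

For the second string the journal pattern does not match, so the value printed is the one captured by the title pattern.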

A solution could be to write the last regexp properly, so that a new value is always assigned to the variable $source_journal, but as I'm planning on parsing many records, I'd rather have a 'fool-proof' solution that would also work in the rare case where a carefully written regexp fails to match.

I guess I'm looking for tips on how to better store and print the values, perhaps also to better match several regexps at once.

Any tips/suggestions will be much appreciated!

Re: How to prevent a value from being repeated? by almut (Canon) on May 12, 2009 at 13:52 UTC
"In all but the first iteration, the last regexp isn't matched, resulting in a repeat of the article title. (I guess that the backreference from the regexp to capture the article title is assigned to $source_journal, which is then printed)"

Captures like `$1` are only valid if a regex actually matched... So the solution would be to test if it matched, and assign some other (default) value to `$source_journal` in case it didn't.

    if ($new_url =~ m{(...)}) {
        $source_journal = $1;
    }
    else {
        $source_journal = '(not specified)';
    }
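
With many fields to extract, the same test could be wrapped in a small helper so that every field gets either its own capture or a default. This is a rough sketch rather than the code from the thread: `capture_or`, `%record` and the sample `$page` string are hypothetical. It relies on a match in list context returning its capture groups on success and an empty list on failure, so `$1` is never read after a failed match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture group of $re in $text,
# or $default when the pattern does not match.
sub capture_or {
    my ($text, $re, $default) = @_;
    my ($value) = $text =~ $re;    # list context: empty list on failure
    return defined $value ? $value : $default;
}

# Made-up sample standing in for a fetched record page.
my $page = '<title>Record 2</title><td class="FullRecTitle">Some article</td>';

my %record = (
    page_title     => capture_or($page, qr{<title>(.*?)</title>}is,                 '(not specified)'),
    article_title  => capture_or($page, qr{<td class="FullRecTitle">(.*?)</td>}is, '(not specified)'),
    source_journal => capture_or($page, qr{(journal of hazardous)}is,              '(not specified)'),
);

print "$_: $record{$_}\n" for sort keys %record;
```

Collecting the fields in a hash per record also keeps the values of one record together, which may make printing or storing them later easier.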
Re^2: How to prevent a value from being repeated? by turbolofi (Acolyte) on May 12, 2009 at 13:55 UTC
That sounds like a good plan, I'll try that. Thanks! The code works perfectly, thanks again.