Hi monks!
I'm building a screen-scraper for Web of Science (http://isiknowledge.com - requires subscription).
So far it's pretty simple: it retrieves a search result page, then retrieves each record linked from that page and extracts data from it, all done in a nice loop.
However, I have some problems with my regexps, specifically with preventing a value from being repeated.
Please take a look at the following code:
    #!/usr/bin/perl -w
    # Get urls from result page
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::Entities; # for htmldecode

    my (
        $resultpage, $resultpage_url, $i, $new_url, $new_title, $title, @urls,
        $n, $host, $article_title, $authors, $source_journal, $source_volume,
        $source_issue, $source_pages, $source_publish_date
    );

    $host = "http://apps.isiknowledge.com";
    $resultpage_url = 'http://apps.isiknowledge.com/summary.do?product=UA&search_mode=GeneralSearch&qid=2&SID=S1NIfp9Koh1L9D1D5I4&page=1&action=changePageSize&pageSize=10';
    $n = 0;

    main();

    sub main{
        $resultpage = get ("$resultpage_url") or die "couldn't retrieve";
        while ($resultpage =~ m{<a class=\"smallV110\" href=\"(.*?)\">}gis) {
            push(@urls, "$1\n");
            for $i ($urls[$n]) {
                $new_url = get ($host . decode_entities($urls[$n]));
                sleep 1; # be nice to the server
                $new_url =~ m{<title>(.*?)<\/title>}gis; # capture page title
                $new_title = $1;
                print "$new_title\n";
                $new_url =~ m{\<td class\=\"FullRecTitle\">(.*?)</td>}gis; # Capture article title
                $article_title = $1;
                print "$article_title\n";
                $new_url =~ m{(journal of hazardous)}gis; # should capture journal name - still struggling with this regexp. Is repeated!
                $source_journal = $1;
                print "$source_journal\n";
            };
            $n++;
        };
    };

In all but the first iteration, the last regexp isn't matched, resulting in a repeat of the article title. (I guess that the backreference from the regexp to capture the article title is assigned to $source_journal, which is then printed)
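
To check that guess, here is a minimal sketch that needs no network access. The `$page` string is a made-up stand-in for one fetched record, not real Web of Science output, and the 'N/A' placeholder is just an assumption. It shows that a failed match leaves `$1` holding the previous capture, and that testing the match result before reading `$1` avoids the repeat. The sketch also drops the `/g` flag, since in scalar context `/g` makes each match resume where the previous one on the same string ended, which can cause misses of its own.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one fetched record page.
my $page = '<title>Full record</title>'
         . '<td class="FullRecTitle">Sorption of metals</td>';

my ($article_title, $unguarded, $guarded);

# A successful match sets $1.
$page =~ m{<td class="FullRecTitle">(.*?)</td>}is;
$article_title = $1;

# This pattern does not occur in $page, so the match fails
# and $1 still holds the article title from the match above.
$page =~ m{(journal of hazardous)}is;
$unguarded = $1;

# Guarded version: only read $1 when the match succeeded.
if ($page =~ m{(journal of hazardous)}is) {
    $guarded = $1;
}
else {
    $guarded = 'N/A';    # placeholder for a missing field
}

print "title:     $article_title\n";
print "unguarded: $unguarded\n";    # repeats the article title
print "guarded:   $guarded\n";
```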

A solution could be to write the last regexp properly, so that a new value is always assigned to the variable $source_journal. But as I'm planning on parsing many records, I'd rather have a 'fool-proof' solution that also works in the rare case where a carefully written regexp fails to match.
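
One way such a 'fool-proof' capture might look is to assign the match in list context, which yields an empty list when the pattern fails, so the variable ends up undefined instead of inheriting an old `$1`. The helper name `extract_first` and the sample `$page` string below are hypothetical; this is a sketch of the idea, not a finished parser.

```perl
use strict;
use warnings;

# Return the first capture group of $regex in $text, or undef on no match.
# extract_first is a hypothetical helper name.
sub extract_first {
    my ($text, $regex) = @_;
    my ($value) = $text =~ $regex;    # list context: empty list when the match fails
    return $value;
}

# Made-up sample page that contains a title but no journal name.
my $page = '<td class="FullRecTitle">Sorption of metals</td>';

my $title   = extract_first($page, qr{<td class="FullRecTitle">(.*?)</td>}is);
my $journal = extract_first($page, qr{(journal of hazardous)}is);

print 'title:   ', defined $title   ? $title   : '(not found)', "\n";
print 'journal: ', defined $journal ? $journal : '(not found)', "\n";
```

Because a miss shows up as undef, the caller can decide per field whether to skip it, print a placeholder, or log the record for a closer look.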

I guess I'm looking for tips on how to better store and print the values, perhaps also to better match several regexps at once.
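
For storing the values and running several regexps per record, one possible layout is a table of named patterns that fills one hash per record. Everything below is an assumption made for illustration: the field names, the `parse_record` helper, the journal pattern, and the two sample pages. Real pages would come from `get` as in the script above.

```perl
use strict;
use warnings;

# Field name => pattern with exactly one capture group.
# These patterns and the sample pages are assumptions for illustration.
my %pattern_for = (
    page_title     => qr{<title>(.*?)</title>}is,
    article_title  => qr{<td class="FullRecTitle">(.*?)</td>}is,
    source_journal => qr{(journal of hazardous \w+)}is,
);

# Apply every pattern to one page and return a hash reference of results.
sub parse_record {
    my ($html) = @_;
    my %record;
    for my $field (keys %pattern_for) {
        my ($value) = $html =~ $pattern_for{$field};
        $record{$field} = $value;    # undef when the pattern did not match
    }
    return \%record;
}

my @pages = (
    '<title>Record 1</title><td class="FullRecTitle">First</td> Journal of Hazardous Materials',
    '<title>Record 2</title><td class="FullRecTitle">Second</td>',
);

my @records = map { parse_record($_) } @pages;

for my $record (@records) {
    for my $field (sort keys %$record) {
        my $value = defined $record->{$field} ? $record->{$field} : '(not found)';
        print "$field: $value\n";
    }
    print "\n";
}
```

Adding a new field then only means adding one entry to `%pattern_for`, and a missed pattern can never pick up a value from another field.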

Any tips/suggestions will be much appreciated!

In reply to How to prevent a value from being repeated? by turbolofi
