Good day, dear community,
I am currently working on a small harvester built on WWW::Mechanize.
target: http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=519
This site holds more than 2700 records on foundations; all of it is free to use, with no copyright limitations.
What I have so far: the harvesting task itself should be no problem, I think.
I will use WWW::Mechanize, mainly because I want to drive the form-based search with it and select the individual entries (the records).
I am fairly sure the whole algorithm fits into two nested loops: the outer loop runs the form-based search, and the inner loop processes the search results.
The outer loop would use the select() and submit_form() functions, but on the second search form on the page. Can we use DOM processing here? And how can we get the selection values?
The inner loop over the results would use the follow_link() function to get to the actual entries, using the following call:
$mech->follow_link( url_regex => qr/webgrab_path=http:\/\/esv2000.*\?Id=\d+$/, n => $result_nbr );
This ensures that the Mechanize browser is forwarded to the entry page. The url_regex looks for links that contain the webgrab_path ... Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which of the matching links to follow next. This should step through all 2700+ records of the database.
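To check the link pattern without touching the network, here is a minimal sketch in core Perl only. It assumes the host is spelled esv2000, as in the target URL above, and that there is no space between the question mark and Id. The second and third sample hrefs are made up purely to show non-matches.

```perl
use strict;
use warnings;

# Pattern for entry links: host spelled esv2000 as in the target URL,
# and no space between "\?" and "Id".
my $entry_re = qr{webgrab_path=http://esv2000.*\?Id=\d+$};

# The first href is the target URL from this post; the other two are
# hypothetical examples that should NOT match.
my @hrefs = (
    'http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=519',
    'http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de',
    'http://www.edi.admin.ch/esv/index.html?webgrab_path=http://evs2000.edi.admin.ch/d/entry.asp?Id=519',
);

# Keep only the hrefs that look like database entries.
my @entries = grep { $_ =~ $entry_re } @hrefs;
print "$_\n" for @entries;
```

The same compiled pattern can be passed as url_regex to follow_link(), so it can be tested on a few copied links before the harvester is run.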
By the way: if there are several result pages, we could use the same idea to traverse all of them.
For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which should work fine on this page), because it gives us powerful DOM selection methods via XPath.
The actual looping through the pages should be doable in a few lines of Perl, 20 lines at most.
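As a rough sketch of the XPath idea, the snippet below parses an inline HTML string with XML::LibXML and collects label/value pairs from table rows. The markup, the "label" class and the field names are hypothetical stand-ins; the real entry pages may be structured differently, so the XPath expressions would need adjusting.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for an entry page; the real markup may differ.
my $html = <<'HTML';
<html><body>
<table>
  <tr><td class="label">Name</td><td>Example Foundation</td></tr>
  <tr><td class="label">Ort</td><td>Bern</td></tr>
</table>
</body></html>
HTML

# recover => 2 lets the parser accept sloppy HTML without printing warnings.
my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

# Every row that has a label cell becomes one key/value pair.
my %record;
for my $row ( $doc->findnodes('//tr[td[@class="label"]]') ) {
    my ( $label, $value ) = map { $_->textContent } $row->findnodes('./td');
    $record{$label} = $value;
}

print "$_: $record{$_}\n" for sort keys %record;
```

In the harvester itself, the string handed to load_html would presumably be $mech->content for the entry page that was just fetched.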
But wait: processing the entry pages will then be the most complex part of the script.
Approaches: in principle we could run the same algorithm with a single while loop if we use the back() function smartly.
Can you give me a hint for the beginning, namely the processing of the entry pages, doing this in Perl with WWW::Mechanize?
I look forward to hearing from you.
regards
pb