Re: web search for certain data
by Corion (Patriarch) on Nov 23, 2011 at 19:31 UTC
|
Maybe it would be helpful if you showed us a small program that reproduces the problem, and if you also told us what parts of the HTML seem to be missing?
|
|
I have not yet written anything related to my final task, but I have started here. In my output text I get the data as parent node, but when I click on the site's view source I see totally different entries for all the meta name content.
open(Mytempoutput, ">tempoutput.txt");
print Mytempoutput (get $url);
Is there any other way to just read an HTML page into a text file and then print the characters starting at a pattern?
|
|
How is the code you posted related to the problem you described in the top post? What parts of HTML does your incomplete code not download?
I'm asking you these questions so you can help us help you better. Please show us the relevant code, the data, a description of how your program fails to do what you need, and the desired result. We can only help you if we are able to reproduce your problem.
|
Re: web search for certain data
by cavac (Prior) on Nov 23, 2011 at 20:08 UTC
|
I just looked at the source of the page you provided. You're quite lucky in this case: the webmaster seems to have dumped all relevant information into the body as well as into the meta tags.
If I were you, I'd look into something like
- WWW::Mechanize::GZip to fetch the files, "click" links and fill out forms.
- Regular expressions to get the information line-by-line from the META tags into a hash.
- Perl functions like open, print and close to write a CSV file. You know: for each of the required fields, write the properly quoted content, add a semicolon, and don't forget the newline at the end.
I'm being deliberately vague, but following the list should give you a (simple, ugly) quick-hack solution for your problem. It will break if the webmaster removes or changes the meta lines. But if this is a one-time job, it should work.
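A minimal sketch of that quick hack, under stated assumptions: the page source is stood in for by an inline string (in practice it would come from LWP or WWW::Mechanize), and the meta names, the field list and the file name output.csv are all invented for illustration. The regular expression also assumes that name comes before content and that both use double quotes.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the downloaded page source.
# The meta names below are made up for this example.
my $html = <<'HTML';
<html><head>
<meta name="citation_title" content="A sample article">
<meta name="citation_author" content="Doe, J.; Roe, R.">
<meta name="citation_year" content="2011">
</head><body></body></html>
HTML

# Collect name => content pairs from the META tags into a hash.
# Quick hack: a repeated name overwrites the earlier value.
my %meta;
while ($html =~ /<meta\s+name="([^"]+)"\s+content="([^"]*)"/gi) {
    $meta{$1} = $2;
}

# Quote one field: wrap it in double quotes and double any embedded quotes.
sub csv_quote {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Hypothetical list of required fields, in output order.
my @fields = qw(citation_title citation_author citation_year);

# One semicolon-separated record, terminated by a newline.
my $line = join(';', map { csv_quote($meta{$_}) } @fields) . "\n";

open(my $out, '>', 'output.csv') or die "Cannot open output.csv: $!";
print {$out} $line;
close($out) or die "Cannot close output.csv: $!";
```

Quoting every field keeps the semicolon inside the author value from being read as a separator. For anything beyond a one-off, a CSV module from CPAN is probably the safer choice.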
Of course, if you need something flexible and reliable that will work for some years, you should really look into real HTML parsers.
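As one possible illustration of the parser route, here is a sketch using HTML::TokeParser from the HTML-Parser distribution on CPAN (the choice of module is an assumption, not something the thread settled on). The sample markup and meta names are again invented. Unlike the regex hack, this copes with attribute order, quote style and tag case.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # part of the HTML-Parser distribution on CPAN

# Hypothetical sample: attribute order, quoting and case differ on purpose.
my $html = <<'HTML';
<html><head>
<META content="Doe, J." name='citation_author'>
<meta name="citation_title" content="Cats &amp; dogs">
</head><body></body></html>
HTML

# Parse from a string; pass a file name instead to parse a saved page.
my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my %meta;
while (my $tag = $parser->get_tag('meta')) {
    my $attr = $tag->[1];    # hash ref of attributes, keys lowercased
    next unless defined $attr->{name} && defined $attr->{content};
    $meta{ $attr->{name} } = $attr->{content};
}

print "$_ = $meta{$_}\n" for sort keys %meta;
```

The parser also decodes entities in attribute values, so the title comes back with a plain ampersand rather than the escaped form.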
Don't use '#ff0000':
use Acme::AutoColor; my $redcolor = RED();
All colors subject to change without notice.
Re: web search for certain data
by choroba (Cardinal) on Nov 23, 2011 at 19:33 UTC
|
Can you show your code? What parts of the page did get into the output? Your client (LWP) might not be able to run JavaScript.
|
|
I have not yet written anything related to my final task, but I have started here. In my output text I get the data as parent node, but when I click on the site's view source I see totally different entries for all the meta name content.
open(Mytempoutput, ">tempoutput.txt"); print Mytempoutput (get $url);
Is there any other way to just read an HTML page into a text file, then print the characters from a meta name content and write them to Excel?
|
|
What does this output mean?
|
|
Re: web search for certain data
by Marshall (Canon) on Nov 24, 2011 at 06:15 UTC
|
I am going to suggest an idea that you may not have thought of. The US NIH (National Institutes of Health) maintains a database of medical abstracts, PubMed. This LINK is the same article.
There are a huge number of tools to access this database. pubcrawler is an example.
What I am suggesting is that using an even bigger database, of which the Karger articles would be a subset, may be the way to go.
I've written 3 posts about PubMed: Marshall re: Pub Med. (Hit the search button when this screen appears). The node titles that I replied to are not informative to say the least! But I would suggest that you read these posts because I think that you will find some interesting pointers. Basically there are Perl tools and source code that will do what you want in an even more efficient and better way than scanning/parsing html pages.
I don't want to act as free advertisement for commercial products here, but there are professional level complete applications that do what you want and even more, but they will cost about $700.
Update:
Even this node title, "Re: web search for certain data", would not lead one to believe that the real problem is searching medical databases. Again, the big, super big medical databases provide applications and Perl code to access them efficiently. One of my articles talks about how to be a "polite consumer".