Getting Text from Website

LaneM1234 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Getting Text from Website by davido (Cardinal) on Jun 21, 2012 at 19:00 UTC
Could this itself be a homework assignment?

The easiest approach is to spot a pattern in the URLs. A good "RESTful" site would make each resource available at its own URL. For example, "`.../public_html/1112hcal/06222012/assignment.html`" would lead directly to the assignment for 6/22/2012. Just mouse over a few of the assignment links to see if their URLs follow some predictable pattern. Then you can simply check a given date to see if it produces an assignment. Some form of this strategy is likely to work well for you. If it turns out there is a predictable URL for each homework assignment, just use LWP::Simple (which you've already discovered) to grab the assignment. Maybe even poll, each day, the URL that would correspond to that date, or the URL that would correspond to the next un-fetched assignment.

The harder way is to use something like HTML::LinkExtor to extract the links on an index page and determine which of the links pertain to homework assignments. Even that isn't too difficult; it's just not as automation-friendly as a nice RESTful approach.

To give any more specific advice I think we would need to see what you have written so far, and exactly where you are stuck.

Dave
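A minimal sketch of the date-in-URL idea from this reply, in case the site does turn out to work that way. The base URL, the MMDDYYYY directory naming, and the `assignment_url` helper are all assumptions made for illustration; none of them come from the real site.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical layout: one directory per day, named MMDDYYYY.
# Check a few real links first; the actual site may not work this way.
my $BASE = 'http://www.example.com/homework';

# Build the candidate URL for a given epoch time (local time zone).
sub assignment_url {
    my ($epoch) = @_;
    my $stamp = strftime('%m%d%Y', localtime $epoch);
    return "$BASE/$stamp/assignment.html";
}

# Show the URL that would be polled today.
print assignment_url(time), "\n";
```

The resulting URL can be handed to `get` from LWP::Simple, which returns undef on failure, so an undefined result simply means there is no page for that date yet.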
Re: Getting Text from Website by zentara (Cardinal) on Jun 21, 2012 at 19:08 UTC
- Automating-Web-based-Data-Retrieval-with-Perl.htm
- Download MIT OpenCourseware
- Simple link extraction tool
- How do I return the text of a link after using find_link with WWW::Mechanize?

Those ought to get you started.

I'm not really a human, but I play one on earth.
Re: Getting Text from Website by LaneM1234 (Initiate) on Jun 21, 2012 at 19:07 UTC
Unfortunately, the actual pages are arbitrarily long and include multiple days of homework, so there is no link pattern per day. What I need to do is download the website's code, search through it to find the date I'm looking for (e.g. "4/23/012"), and then copy not only the matched string but a little more of the code so that the assignment is included. So far I have been able to get this:

    use LWP::Simple;

    my $url = 'http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm';
    my $content = get $url;
    die "Couldn't get $url" unless defined $content;

So I am able to get the downloaded HTML code, but I don't know how to then find the information I am looking for (the date and the assignment) and copy it into a new txt file.

And no, not a HW assignment. School's out for summer!
Re^2: Getting Text from Website by aaron_baugher (Curate) on Jun 21, 2012 at 22:30 UTC
There are two main ways to do it:

1. Use some sort of HTML or DOM parsing module, have it pick out the element of the page containing your date, and then go up the chain of parents until you get something that contains all the info you want. Looking at the page in your link, the date would be in a `<span>` tag, which is inside a `<p>`, which is inside a `<td>`, which is inside a `<tr>`, which appears to contain the info you want. So you'd have to point to the correct span and then get its parent's parent's parent, and either parse the text out of that or plug that `<tr>` into a `<table>` of your own. Exactly how to do that will depend on what module you use. With something like Mojo::DOM, it could look something like this (I've barely used it, but it looks a lot like jQuery, which I'm familiar with, so I think this is close):

    use Mojo::DOM;

    my $dom = Mojo::DOM->new($content);    # $content: the HTML you already fetched
    for my $e ($dom->find('span')->each) {
        if ($e->text =~ /\Q$mydate\E/) {
            # all_text includes text from nested elements; text alone would not
            my $mytext = $e->parent->parent->parent->all_text;
            # do stuff with $mytext
        }
    }

2. Parse the data from the raw HTML with your own regular expressions. See the example below. Regexes like this tend to be tricky to create and brittle, because they're liable to break as soon as the page design changes at all. (So will a DOM/parser method if the nesting of the elements changes, but a regex may break just because they start capitalizing a tag.) But for a quick-and-dirty hack for your own use, it gets the job done.

    #!/usr/bin/env perl
    use Modern::Perl;
    use LWP::Simple;

    my $date = $ARGV[0] || '4/23/012';
    my $page = get('http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm');
    die "Couldn't get page" unless $page;

    # Lazily skip ahead from the date, then capture the contents of a
    # <span ...> that is followed by </p>
    my ($assignment) = $page =~ m{ \Q$date\E .+? <span .+?>(.+?)</span>\s*</p> }sx;
    die "No assignment found for $date" unless defined $assignment;
    say $assignment;

Aaron B.
Available for small or large Perl jobs; see my home node.
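A minimal offline sketch of the regex approach (option 2), extended to write the result to a text file as the original question asks. The sample HTML is a made-up stand-in for the real page, and `find_assignment`, `save_assignment`, and the `homework.txt` file name are hypothetical names chosen for illustration. It uses only core Perl so it can be tried without a network connection.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up stand-in for the downloaded page; the real markup may differ.
my $page = <<'HTML';
<table>
<tr><td><p><span>Mon 4/23/012</span></p></td>
    <td><p><span>Read 7.1; problems 1-15 odd</span></p></td></tr>
<tr><td><p><span>Tue 4/24/012</span></p></td>
    <td><p><span>Quiz review</span></p></td></tr>
</table>
HTML

# Return the contents of the first <span> after the date that is
# followed by </p>, or undef if the date does not appear.
sub find_assignment {
    my ($html, $date) = @_;
    my ($text) = $html =~ m{ \Q$date\E .+? <span [^>]*> (.+?) </span> \s* </p> }sx;
    return $text;
}

# Write "date: assignment" to a file, replacing any previous contents.
sub save_assignment {
    my ($file, $date, $text) = @_;
    open my $out, '>', $file or die "Can't write $file: $!";
    print {$out} "$date: $text\n";
    close $out or die "Can't close $file: $!";
}

my $date  = '4/23/012';
my $found = find_assignment($page, $date);
if (defined $found) {
    save_assignment('homework.txt', $date, $found);
    print "Saved assignment for $date\n";
}
else {
    print "No assignment found for $date\n";
}
```

To run it against the live page, replace `$page` with the `$content` fetched by LWP::Simple. The pattern will probably need adjusting to the real markup, which is exactly the brittleness described above.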
Re^2: Getting Text from Website by bitingduck (Deacon) on Jun 22, 2012 at 06:25 UTC
Since it looks like calc HW you're getting and not computer HW, here's a hint:

    #!/usr/bin/perl
    # find calc homework
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $date = '4/27/012';    # set the date, you can do this dynamically
    my $url  = 'http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm';

    # get the page and make a tree structure out of it
    my $tree = HTML::TreeBuilder->new_from_url($url);

    # break the table into rows
    my @elements = $tree->find_by_tag_name('tr');

    # loop through the rows looking for the date
    # and use as_trimmed_text to get rid of all the extra htmlness
    foreach (@elements) {
        if ((my $hw = $_->as_trimmed_text()) =~ m%$date%) {
            print $hw . "\n";
        }
    }

It takes approach 1 that aaron_baugher describes, but mostly ignores the details of the page structure. We know it's a table and we want the rows. Knowing that the first column is just the day and date, I'm going to assume we want to keep them anyway. The `find_by_tag_name` call just gets all the rows and all the stuff inside them. There's a bunch of `<p>` and `<span>` tags that really aren't interesting, so I take the lazy approach and use `as_trimmed_text` to throw those away and just keep the contents of the two cells all together. It's also useful to know that HTML::TreeBuilder gets a bunch of methods from HTML::Element.

Update: tweaked the code formatting to keep the comments from wrapping.

...And to note that some of your assignments have links in them; you can use HTML::Element to dig those out before you apply `as_trimmed_text`, or dig them out in any of various other ways.
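One possible way to dig the links out with HTML::Element methods before flattening a row to text, as the closing note suggests. This is only a sketch: the sample HTML and the `links_for_date` helper are invented for illustration, and it parses an inline string with `new_from_content` rather than fetching the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample row; the real page's markup may differ.
my $html = <<'HTML';
<table>
<tr><td><p><span>Fri 4/27/012</span></p></td>
    <td><p><span>Do the <a href="review.pdf">review sheet</a></span></p></td></tr>
</table>
HTML

# Return [link text, href] pairs from every row whose text mentions the date.
sub links_for_date {
    my ($tree, $date) = @_;
    my @links;
    for my $row ($tree->find_by_tag_name('tr')) {
        next unless index($row->as_trimmed_text, $date) >= 0;
        for my $anchor ($row->look_down(_tag => 'a')) {
            my $href = $anchor->attr('href');
            push @links, [ $anchor->as_trimmed_text, $href ] if defined $href;
        }
    }
    return @links;
}

my $tree = HTML::TreeBuilder->new_from_content($html);
printf "%s -> %s\n", @$_ for links_for_date($tree, '4/27/012');
$tree->delete;    # free the parse tree when done
```

The hrefs come back exactly as written in the page, so relative links would still need to be resolved against the page URL before fetching them.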
Re: Getting Text from Website by ansh batra (Friar) on Jun 21, 2012 at 19:01 UTC
Dude, post the code you have tried. If you want, you can download the whole web page with the wget command. Specify whether you want some part of the page or the whole thing.

P.S. Your requirements are not clear.
Re: Getting Text from Website by Anonymous Monk on Jun 22, 2012 at 03:19 UTC
See: Parsing HTML / Re^4: Parsing HTML, A regex question ...