Could this itself be a homework assignment?
The easiest approach is to spot a pattern in the URLs. A good "RESTful" site would make each resource available at its own URL. For example, ".../public_html/1112hcal/06222012/assignment.html" would lead directly to the assignment for 6/22/2012. Just mouse over a few of the assignment links to see whether their URLs follow a predictable pattern. Then you can simply check a given date to see if it produces an assignment. Some form of this strategy is likely to work well for you.
If it turns out there is a predictable URL for each homework assignment, just use LWP::Simple (which you've already discovered) to grab the assignment. You could even poll once a day, requesting either the URL that would correspond to that date or the URL that would correspond to the next un-fetched assignment.
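As a rough sketch of the date-to-URL idea: the base URL and the MMDDYYYY directory layout below are placeholders I made up to mirror the example path above, not the real site's scheme, so adjust them to whatever pattern the links actually show.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical base URL; replace with the real site's prefix.
my $base = 'http://example.com/homework';

# Build the assignment URL for an epoch time, assuming an
# MMDDYYYY directory per day (an assumption, not a known layout).
sub assignment_url {
    my ($time) = @_;
    my $stamp = strftime('%m%d%Y', localtime $time);
    return "$base/$stamp/assignment.html";
}

# Print today's candidate URL.
print assignment_url(time), "\n";
```

From there, something like `get(assignment_url(time))` from LWP::Simple would fetch the page, returning undef when nothing exists for that date.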
The harder way is to use something like HTML::LinkExtor to extract the links on an index page and determine which of the links pertain to homework assignments. Even that isn't too difficult; it's just not as automation-friendly as a nice RESTful approach.
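A minimal sketch of the link-extraction route, run against a made-up snippet of HTML instead of a live page. The markup and the "hw/" filter are assumptions for illustration only; the real index page will need its own test for what counts as a homework link.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Stand-in for a fetched index page (invented markup, not the real site).
my $html = <<'HTML';
<a href="hw/0622.html">Homework 6/22</a>
<a href="about.html">About</a>
<img src="logo.png">
HTML

# With no callback, the parser collects links for retrieval via ->links.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

# Each entry is [ $tag, $attr_name => $url, ... ]; keep <a href> values only.
my @hrefs;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
}

# Hypothetical filter: treat anything under hw/ as an assignment link.
my @homework = grep { m{^hw/} } @hrefs;
print "$_\n" for @homework;
```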
To give any more specific advice I think we would need to see what you have written so far, and exactly where you are stuck.
Unfortunately, the actual pages are arbitrarily long and include multiple days of homework, so there is no link pattern per day. What I need to do is download the page's HTML, search through it to find the date I'm looking for (e.g. "4/23/012"), and then copy not only the matched string but a little more of the HTML so that the assignment is included.
I have so far been able to get
my $url = 'http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm';
use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content;
So I am able to get the downloaded HTML, but I don't know how to then find the information I am looking for (the date and the assignment) and copy it into a new txt file.
And no, not a HW assignment. School's out for summer!
There are two main ways to do it:
1. Use some sort of HTML or DOM parsing module, have it pick out the element of the page containing your date, and then go up the chain of parents until you get something that contains all the info you want. Looking at the page in your link, the date would be in a <span> tag, which is inside a <p>, which is inside a <td>, which is inside a <tr>, which appears to contain the info you want. So you'd have to point to the correct span and then get its parent's parent's parent, and either parse the text out of that or plug that <tr> into a <table> of your own. Exactly how to do that will depend on what module you use. With something like Mojo::DOM, it could look something like this (I've barely used it, but it looks a lot like jQuery, which I'm familiar with, so I think this is close):
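use Mojo::DOM;
my $dom = Mojo::DOM->new($content);    # $content holds the fetched HTML
for my $e ($dom->find('span')->each) {
    # all_text includes text from nested elements; \Q..\E quotes the date
    if ($e->all_text =~ /\Q$mydate\E/) {
        # span -> p -> td -> tr; to_string returns the row's HTML
        my $myhtml = $e->parent->parent->parent->to_string;
        # do stuff with $myhtml
    }
}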
2. Parse the data from the raw HTML with your own regular expressions. See example below. Regexes like this tend to be tricky to create and brittle, because they're liable to break as soon as the page design changes at all. (So will a DOM/parser method if the nesting of the elements changes, but a regex may break just because they start capitalizing a tag.) But for a quick-and-dirty hack that you're using for your own use, it gets the job done.
#!/usr/bin/env perl
use Modern::Perl;
use LWP::Simple;
my $date = $ARGV[0] || '4/23/012';
my $page = get('http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm');
die "Couldn't get page" unless $page;
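# Lazily match from the date to the next <span>...</span> that closes a paragraph.
my( $assignment ) = $page =~ m{ \Q$date\E .+? <span .+?>(.+?)</span>\s*</p> }sx;
die "No assignment found for $date" unless defined $assignment;
say $assignment;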
Aaron B.
Available for small or large Perl jobs; see my home node.
Since it looks like calc HW you're getting and not computer HW, here's a hint:
#!/usr/bin/perl
# find calc homework
use strict;
use warnings;
use HTML::TreeBuilder;
my $date='4/27/012'; #set the date, you can do this dynamically
my $url= 'http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm';
# get the page and make a tree structure out of it
my $tree= HTML::TreeBuilder->new_from_url($url);
#break the table into rows
my @elements = $tree->find_by_tag_name('tr');
#loop through the rows looking for the date
#and use the as_trimmed_text to get rid of all the extra htmlness
foreach (@elements){
if((my $hw=$_->as_trimmed_text())=~m%$date%){
print $hw."\n";
}
}
It takes approach 1 that aaron_baugher describes, but mostly ignores the details of the page structure. We know it's a table and we want the rows. Knowing that the first column is just the day and date, I'm going to assume we want to keep them anyway. The find_by_tag_name just gets all the rows and all the stuff inside them. There's a bunch of <p> and <span> tags that really aren't interesting, so I take the lazy approach and use as_trimmed_text to throw those away and just keep the contents of the two cells all together. It's also useful to know that HTML::TreeBuilder gets a bunch of methods from HTML::Element.
Update: tweaked the code formatting to keep the comments from wrapping
...And to note that some of your assignments have links in them; you can use HTML::Element to dig those out before you apply as_trimmed_text, or dig them out in any of various other ways.
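One possible way to dig the links out, sketched against an invented table row that mimics the nesting described earlier in the thread (the markup, the href, and the assignment text are all made up). It uses look_down from HTML::Element to collect the anchors inside the matching row.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up row resembling the described structure (not the real page).
my $html = <<'HTML';
<table><tr>
  <td><p><span>Fri 4/27/012</span></p></td>
  <td><p><span>Read 7.3; see <a href="notes/7-3.pdf">notes</a></span></p></td>
</tr></table>
HTML

my $date = '4/27/012';
my $tree = HTML::TreeBuilder->new_from_content($html);

my @found;
for my $row ($tree->find_by_tag_name('tr')) {
    # Skip rows whose text does not mention the date.
    next unless $row->as_trimmed_text =~ /\Q$date\E/;
    # In list context, look_down returns every matching descendant.
    for my $anchor ($row->look_down(_tag => 'a')) {
        my $href = $anchor->attr('href');
        push @found, $href if defined $href;
    }
}
print "$_\n" for @found;

# Free the tree's memory once finished.
$tree->delete;
```

The hrefs come back exactly as written in the page, so relative ones would still need to be resolved against the page URL before fetching.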
Dude, post the code you have tried. If you want, you can download the whole web page with the wget command. Specify whether you want some part of the page or the whole thing.
P.S. Your requirements are not clear.