LaneM1234 has asked for the wisdom of the Perl Monks concerning the following question:

Perl Monks!

I am very new to Perl and am trying to create a script that will let me download my homework assignments for a specific day from my teacher's website. He puts our HW on his website, http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm. I would like to make a script that, when given a date, finds the corresponding assignment and prints it to a blank text file. I am able to create all of the mechanics except for the part that copies the assignment.

I have been able to use LWP::Simple to fetch the page text, but I don't know how to make the script pick out the corresponding assignment, nor how to print that into a blank text file. I don't think this is very complicated, but I'm really bad at Perl, so any and all help would be appreciated!

Replies are listed 'Best First'.
Re: Getting Text from Website
by davido (Cardinal) on Jun 21, 2012 at 19:00 UTC

    Could this itself be a homework assignment?

    The easiest case is if you can spot a pattern in the URLs. A good "RESTful" site would make each resource available at its own URL. For example, ".../public_html/1112hcal/06222012/assignment.html" would lead directly to the assignment for 6/22/2012. Just mouse over a few of the assignment links to see if there is some predictable pattern to their URLs. Then you can simply check a given date to see if it produces an assignment. Some form of this strategy is likely to work well for you.

    If it turns out there is a predictable URL for each homework assignment, just use LWP::Simple (which you've already discovered) to grab the assignment. You could even poll, once a day, the URL that would correspond to that date, or the URL that would correspond to the next un-fetched assignment.
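
    A minimal sketch of that idea, assuming the per-day URL layout from the example above. That layout is hypothetical (the real site may not work this way), and the helper name assignment_url and the accepted date format are made up for illustration:

```perl
#!/usr/bin/perl
# Sketch only: the per-day URL layout is the hypothetical one from the
# example above, not something the real site is known to provide.
use strict;
use warnings;

my $base = 'http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal';

# Turn a date such as "6/22/2012" into the assumed per-day URL.
sub assignment_url {
    my ($date) = @_;
    my ($m, $d, $y) = $date =~ m{^(\d{1,2})/(\d{1,2})/(\d{4})$}
        or die "Expected a date like 6/22/2012, got '$date'\n";
    return sprintf '%s/%02d%02d%04d/assignment.html', $base, $m, $d, $y;
}

# Fetch only when a date is passed on the command line.
if (@ARGV) {
    require LWP::Simple;    # loaded lazily; only needed for the fetch
    my $url  = assignment_url($ARGV[0]);
    my $page = LWP::Simple::get($url);
    die "No assignment found at $url\n" unless defined $page;
    print $page;
}
```

    Run with a date argument it tries the fetch; without one it only defines the helper, which makes the URL-building part easy to check on its own.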

    The harder way is to use something like HTML::LinkExtor to extract the links on an index page and determine which of the links pertain to homework assignments. Even that isn't too difficult; it's just not as automation-friendly as a nice RESTful approach.
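
    A rough sketch of the link-extraction route, run against a small stand-in page rather than the real site. The sample markup, the base URL, and the "/hw/" filter are all assumptions for illustration; you would adapt the filter to whatever the real assignment links look like:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Stand-in for the index page; the real page's markup may differ.
my $html = <<'HTML';
<table>
<tr><td><a href="hw/ws12.pdf">Worksheet 12</a></td></tr>
<tr><td><a href="http://example.com/other.html">Unrelated</a></td></tr>
<tr><td><img src="logo.gif"></td></tr>
</table>
HTML

my $base = 'http://example.com/1112hcal/1112hcal.htm';    # hypothetical base URL
my @links;

# The callback runs once per link-bearing tag, receiving the tag name
# followed by its link attributes as name/value pairs.
my $parser = HTML::LinkExtor->new(
    sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' && defined $attr{href};
        push @links, "$attr{href}";    # stringify the absolute URI
    },
    $base,    # relative links are resolved against this
);
$parser->parse($html);
$parser->eof;

# Keep only links that look like assignments; this pattern is a guess.
my @homework = grep { m{/hw/} } @links;
print "$_\n" for @homework;
```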

    To give any more specific advice I think we would need to see what you have written so far, and exactly where you are stuck.


    Dave

Re: Getting Text from Website
by zentara (Cardinal) on Jun 21, 2012 at 19:08 UTC
Re: Getting Text from Website
by LaneM1234 (Initiate) on Jun 21, 2012 at 19:07 UTC
    Unfortunately, the actual pages are arbitrarily long and include multiple days of homework, so there is no link pattern per day. What I need to do is download the page's HTML, search through it for the date I'm looking for (e.g. "4/23/012"), and then copy not only the matched string but a little more of the markup so that the assignment is included. So far I have been able to get this working:
        use LWP::Simple;

        my $url = 'http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm';
        my $content = get $url;
        die "Couldn't get $url" unless defined $content;
    So I am able to get the downloaded html code but don't know how to then find the information I am looking for (the date and the assignment) and then copy that into a new txt file. And no, not a HW assignment. School's out for summer!

      There are two main ways to do it:

      1. Use some sort of HTML or DOM parsing module, have it pick out the element of the page containing your date, and then go up the chain of parents until you get something that contains all the info you want. Looking at the page in your link, the date would be in a <span> tag, which is inside a <p>, which is inside a <td>, which is inside a <tr>, which appears to contain the info you want. So you'd have to point to the correct span and then get its parent's parent's parent, and either parse the text out of that or plug that <tr> into a <table> of your own. Exactly how to do that will depend on what module you use. With something like Mojo::DOM, it could look something like this (I've barely used it, but it looks a lot like jQuery, which I'm familiar with, so I think this is close):

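          # $dom is a Mojo::DOM object for the page; $mydate is the date string
          for my $e ($dom->find('span')->each) {
              if ($e->text =~ /\Q$mydate\E/) {
                  # all_text includes text from nested elements; text alone
                  # would only return text sitting directly inside the <tr>
                  my $myhtml = $e->parent->parent->parent->all_text;
                  # do stuff with $myhtml
              }
          }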

      2. Parse the data from the raw HTML with your own regular expressions. See example below. Regexes like this tend to be tricky to create and brittle, because they're liable to break as soon as the page design changes at all. (So will a DOM/parser method if the nesting of the elements changes, but a regex may break just because they start capitalizing a tag.) But for a quick-and-dirty hack that you're using for your own use, it gets the job done.

          #!/usr/bin/env perl
          use Modern::Perl;
          use LWP::Simple;

          my $date = $ARGV[0] || '4/23/012';
          my $page = get('http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm');
          die "Couldn't get page" unless $page;

          # From the date, skip ahead to the next <span>...</span> that ends its <p>
          my( $assignment ) = $page =~ m{ \Q$date\E .+? <span .+?>(.+?)</span>\s*</p> }sx;
          die "No assignment found for $date" unless defined $assignment;
          say $assignment;

      Aaron B.
      Available for small or large Perl jobs; see my home node.

      Since it looks like calc HW you're getting and not computer HW, here's a hint:
          #!/usr/bin/perl
          # find calc homework
          use strict;
          use warnings;
          use HTML::TreeBuilder;

          my $date = '4/27/012';    # set the date, you can do this dynamically
          my $url  = 'http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm';

          # get the page and make a tree structure out of it
          my $tree = HTML::TreeBuilder->new_from_url($url);

          # break the table into rows
          my @elements = $tree->find_by_tag_name('tr');

          # loop through the rows looking for the date
          # and use the as_trimmed_text to get rid of all the extra htmlness
          foreach (@elements) {
              if ( ( my $hw = $_->as_trimmed_text() ) =~ m%$date% ) {
                  print $hw . "\n";
              }
          }

      It takes approach 1 that aaron_baugher describes, but mostly ignores the details of the page structure. We know it's a table and we want the rows. Knowing that the first column is just the day and date, I'm going to assume we want to keep them anyway. The find_by_tag_name just gets all the rows and all the stuff inside them. There's a bunch of <p> and <span> tags that really aren't interesting, so I take the lazy approach and use as_trimmed_text to throw those away and just keep the contents of the two cells all together. It's also useful to know that HTML::TreeBuilder gets a bunch of methods from HTML::Element.

      Update: tweaked the code formatting to keep the comments from wrapping

      ...And to note that some of your assignments have links in them. You can use HTML::Element to dig those out before you apply as_trimmed_text, or dig them out in any of various other ways.
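
      One possible way to do that, sketched against a small stand-in table rather than the live page. The sample markup and the output file name homework.txt are assumptions for illustration. It collects the href of each link in a matching row before flattening the row to text, then writes both to a text file:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Stand-in for two rows of the calendar; the real markup may differ.
my $html = <<'HTML';
<table>
<tr><td><p><span>Fri 4/27/012</span></p></td>
    <td><p><span>Read 7.3; do <a href="ws/7-3.pdf">worksheet 7.3</a></span></p></td></tr>
<tr><td><p><span>Mon 4/30/012</span></p></td>
    <td><p><span>Review for quiz</span></p></td></tr>
</table>
HTML

my $date    = '4/27/012';
my $outfile = 'homework.txt';    # hypothetical output file name

my $tree = HTML::TreeBuilder->new_from_content($html);

my (@lines, @hrefs);
for my $row ($tree->find_by_tag_name('tr')) {
    my $text = $row->as_trimmed_text;
    next unless $text =~ /\Q$date\E/;

    # Collect link targets first; as_trimmed_text drops the tags.
    push @hrefs, grep { defined } map { $_->attr('href') }
        $row->look_down(_tag => 'a');
    push @lines, $text;
}
$tree->delete;    # free the parse tree

# Write the matching rows, then any links found in them.
open my $fh, '>', $outfile or die "Cannot write $outfile: $!";
print {$fh} "$_\n" for @lines;
print {$fh} "Link: $_\n" for @hrefs;
close $fh or die "Cannot close $outfile: $!";
```

      To run it against the real page you would swap new_from_content for the new_from_url call shown above; the rest should carry over as long as each assignment still lives in its own table row.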

Re: Getting Text from Website
by ansh batra (Friar) on Jun 21, 2012 at 19:01 UTC

    Dude, post some of the code you have tried.
    If you want, you can download the whole web page using the wget command.
    Specify whether you want some part of the web page or the whole thing.

    P.S. Your requirements are not clear.

Re: Getting Text from Website
by Anonymous Monk on Jun 22, 2012 at 03:19 UTC