in reply to Getting Text from Website

Unfortunately, the actual pages are arbitrarily long and each one covers multiple days of homework, so there is no link pattern per day. What I need to do is download the page's HTML, search it for the date I'm looking for (e.g. "4/23/012"), and then copy not only the matched string but enough of the surrounding code that the assignment is included. So far I have been able to get this far:
    use LWP::Simple;

    my $url = 'http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm';
    my $content = get $url;
    die "Couldn't get $url" unless defined $content;
So I can download the HTML, but I don't know how to find the information I'm looking for (the date and the assignment) and copy it into a new text file. And no, this is not a HW assignment. School's out for summer!

Re^2: Getting Text from Website
by aaron_baugher (Curate) on Jun 21, 2012 at 22:30 UTC

    There are two main ways to do it:

    1. Use some sort of HTML or DOM parsing module, have it pick out the element of the page containing your date, and then go up the chain of parents until you reach an element that contains all the info you want. Looking at the page in your link, the date is in a <span> tag, which is inside a <p>, which is inside a <td>, which is inside a <tr>, and the <tr> appears to contain the info you want. So you'd have to point to the correct span, get its parent's parent's parent, and then either parse the text out of that or plug that <tr> into a <table> of your own. Exactly how to do that depends on which module you use. With something like Mojo::DOM, it could look something like this (I've barely used it, but it looks a lot like jQuery, which I'm familiar with, so I think this is close):

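    use Mojo::DOM;

    # assumes $content holds the downloaded HTML and $mydate the date string
    my $dom = Mojo::DOM->new($content);
    for my $e ($dom->find('span')->each) {
        if ($e->text =~ /\Q$mydate\E/) {
            # span -> p -> td -> tr; all_text also collects text from
            # nested elements, where text would only return the <tr>'s own
            my $myhtml = $e->parent->parent->parent->all_text;
            # do stuff with $myhtml
        }
    }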

    2. Parse the data from the raw HTML with your own regular expressions. See the example below. Regexes like this tend to be tricky to write and brittle, because they're liable to break as soon as the page design changes at all. (So will a DOM/parser method if the nesting of the elements changes, but a regex may break just because they start capitalizing a tag.) But for a quick-and-dirty hack for your own use, it gets the job done.

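    #!/usr/bin/env perl
    use Modern::Perl;
    use LWP::Simple;

    my $date = $ARGV[0] || '4/23/012';
    my $page = get('http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm');
    die "Couldn't get page" unless $page;

    # From the date, skip ahead to the next <span> and capture its contents.
    # \Q...\E keeps the date from being read as regex syntax.
    my( $assignment ) = $page =~ m{ \Q$date\E .+? <span .+?>(.+?)</span>\s*</p> }sx;
    die "No assignment found for $date" unless defined $assignment;
    say $assignment;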
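To make the brittleness point concrete, and to cover the "copy it into a new text file" part of the question, here is a minimal sketch of approach 2 that runs against a made-up row instead of the live page. The sample markup, the find_assignment helper, and the assignment.txt file name are all assumptions for illustration, not anything taken from the real site. The /i flag is one way to survive a page that capitalizes its tags.

```perl
use strict;
use warnings;

# Hypothetical sample row, loosely modelled on the nesting described
# above. Note the upper-case tags, which a case-sensitive regex misses.
my $page = <<'HTML';
<TR><TD><P><SPAN class="d">Mon 4/23/012</SPAN></P></TD>
<TD><P><SPAN class="a">Section 7.2: 1-15 odd</SPAN></P></TD></TR>
HTML

# Hypothetical helper: return the contents of the first <span> that
# follows the date, or undef when nothing matches.
sub find_assignment {
    my ($html, $date) = @_;
    my ($assignment) = $html =~ m{
        \Q$date\E        # the literal date
        .+?              # skip ahead, as little as possible
        <span [^>]* >    # the next opening span tag
        (.+?)            # capture its contents
        </span> \s* </p>
    }isx;
    return $assignment;
}

my $date       = '4/23/012';
my $assignment = find_assignment($page, $date);
die "No assignment found for $date\n" unless defined $assignment;

# Save the date and the assignment to a plain text file.
my $outfile = 'assignment.txt';    # assumed file name
open my $out, '>', $outfile or die "Can't write $outfile: $!";
print {$out} "$date: $assignment\n";
close $out or die "Can't close $outfile: $!";
```

Opening the file with '>>' instead of '>' would append each lookup to the same file rather than overwrite it.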

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re^2: Getting Text from Website
by bitingduck (Deacon) on Jun 22, 2012 at 06:25 UTC
    Since it looks like calc HW you're getting and not computer HW, here's a hint:
    #!/usr/bin/perl
    # find calc homework
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $date = '4/27/012';    # set the date, you can do this dynamically
    my $url  = 'http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm';

    # get the page and make a tree structure out of it
    my $tree = HTML::TreeBuilder->new_from_url($url);

    # break the table into rows
    my @elements = $tree->find_by_tag_name('tr');

    # loop through the rows looking for the date
    # and use as_trimmed_text to get rid of all the extra htmlness
    foreach (@elements) {
        if ((my $hw = $_->as_trimmed_text()) =~ m%$date%) {
            print $hw . "\n";
        }
    }

    It takes approach 1 that aaron_baugher describes, but mostly ignores the details of the page structure. We know it's a table and we want the rows. Knowing that the first column is just the day and date, I'm going to assume we want to keep them anyway. The find_by_tag_name just gets all the rows and all the stuff inside them. There's a bunch of <p> and <span> tags that really aren't interesting, so I take the lazy approach and use as_trimmed_text to throw those away and just keep the contents of the two cells all together. It's also useful to know that HTML::TreeBuilder gets a bunch of methods from HTML::Element.

    Update: tweaked the code formatting to keep the comments from wrapping

    ...And to note that some of your assignments have links in them. You can use HTML::Element to dig those out before you apply as_trimmed_text, or dig them out in any of various other possible ways.
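One possible way to dig the links out, sketched against a made-up row instead of the live page: look_down (a method HTML::TreeBuilder gets from HTML::Element) collects the <a> elements inside the matching row, and attr('href') reads each target. The sample HTML and the notes/ch7.pdf link are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample row containing a link; not taken from the real page.
my $html = <<'HTML';
<table>
  <tr>
    <td><p><span>Fri 4/27/012</span></p></td>
    <td><p><span>Read <a href="notes/ch7.pdf">chapter 7 notes</a></span></p></td>
  </tr>
</table>
HTML

my $date = '4/27/012';
my $tree = HTML::TreeBuilder->new_from_content($html);

my @found;
for my $row ($tree->find_by_tag_name('tr')) {
    my $text = $row->as_trimmed_text;
    next unless $text =~ /\Q$date\E/;

    # Collect the href of every <a> inside this row, skipping anchors
    # that have no href attribute.
    my @links = grep { defined }
                map  { $_->attr('href') }
                $row->look_down(_tag => 'a');

    push @found, { text => $text, links => \@links };
}
$tree->delete;    # free the parse tree once the strings are copied out

for my $hit (@found) {
    print "$hit->{text}\n";
    print "  link: $_\n" for @{ $hit->{links} };
}
```

The hrefs come back exactly as written in the page, so relative links like the one above would still need to be resolved against the page URL (for example with the URI module) before they can be fetched.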