There are two main ways to do it:

1. Use an HTML or DOM parsing module: have it pick out the element of the page containing your date, then go up the chain of parents until you reach an element that contains all the info you want. Looking at the page in your link, the date is in a <span> tag, which is inside a <p>, which is inside a <td>, which is inside a <tr>, and that <tr> appears to contain the info you want. So you'd point to the correct span, get its parent's parent's parent, and then either parse the text out of that or plug that <tr> into a <table> of your own. Exactly how to do that depends on which module you use. With something like Mojo::DOM, it could look something like this (I've barely used it, but it looks a lot like jQuery, which I'm familiar with, so I think this is close):

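    for my $e ($dom->find('span')->each) {
        if ($e->text =~ /\Q$mydate\E/) {
            # span -> p -> td -> tr
            my $row = $e->parent->parent->parent;
            # all_text includes text from nested elements (text alone does not);
            # use $row->to_string instead if you want the row's HTML
            my $mytext = $row->all_text;
            # do stuff with $mytext
        }
    }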
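If you want to try approach 1 without hitting the real site, here is a minimal, self-contained sketch. The sample HTML and the row_text_for_date helper are made up for illustration (the real page's markup may differ), and it assumes a reasonably recent Mojolicious is installed. Instead of counting parents, it asks for the nearest <tr> ancestor, which should keep working if an extra wrapper element shows up.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup standing in for the real calendar page.
my $html = <<'HTML';
<table>
  <tr>
    <td><p><span>4/23/12</span></p></td>
    <td><p><span>Read section 9.3</span></p></td>
  </tr>
  <tr>
    <td><p><span>4/24/12</span></p></td>
    <td><p><span>Quiz on chapter 9</span></p></td>
  </tr>
</table>
HTML

# Hypothetical helper: return the text of the table row whose <span>
# contains the given date, or nothing if no such row exists.
sub row_text_for_date {
    my ($html, $date) = @_;
    my $dom = Mojo::DOM->new($html);

    for my $span ($dom->find('span')->each) {
        next unless $span->text =~ /\Q$date\E/;

        # Nearest enclosing <tr>, however deeply the span is nested.
        my $row = $span->ancestors('tr')->first or next;

        # Collapse runs of whitespace so the result is easy to compare.
        my $text = $row->all_text;
        $text =~ s/\s+/ /g;
        $text =~ s/^ | $//g;
        return $text;
    }
    return;
}

print row_text_for_date($html, '4/23/12'), "\n";
```

Quoting the date with \Q...\E keeps characters in it from being treated as regex syntax.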

2. Parse the data from the raw HTML with your own regular expressions. See example below. Regexes like this tend to be tricky to write and brittle, because they're liable to break as soon as the page design changes at all. (So will a DOM/parser method if the nesting of the elements changes, but a regex may break just because they start capitalizing a tag.) But as a quick-and-dirty hack for your own use, it gets the job done.

    #!/usr/bin/env perl
    use Modern::Perl;
    use LWP::Simple;

    my $date = $ARGV[0] || '4/23/012';
    my $page = get('http://staweb.sta.cathedral.org/departments/math/mhansen/public_html/1112hcal/1112hcal.htm');
    die "Couldn't get page" unless $page;

    my( $assignment ) = $page =~ m{ $date .+? <span .+?>(.+?)</span>\s*</p> }sx;
    say $assignment;
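To see the regex approach work without fetching anything, here is a small sketch that runs against inline sample markup using only core Perl. The sample HTML and the assignment_for helper are invented for illustration. The tags are deliberately upper-case to show the brittleness mentioned above: the /i flag is what lets the pattern survive that particular change, and \Q...\E keeps the date from being read as regex syntax.

```perl
use strict;
use warnings;

# Hypothetical sample markup; note the capitalized tags.
my $page = <<'HTML';
<TR>
  <TD><P><SPAN class="d">4/23/12</SPAN></P></TD>
  <TD><P><SPAN class="a">Read section 9.3</SPAN></P></TD>
</TR>
HTML

# Hypothetical helper: return the contents of the first span that
# follows the given date, or undef if the pattern does not match.
sub assignment_for {
    my ($page, $date) = @_;
    my ($assignment) = $page =~ m{
        \Q$date\E           # the date, taken literally
        .+?                 # skip ahead as little as possible
        <span \b [^>]* >    # the next opening span tag, any attributes
        (.+?)               # capture its contents
        </span> \s* </p>    # closing span, then closing paragraph
    }six;
    return $assignment;
}

my $found = assignment_for($page, '4/23/12');
print defined $found ? "$found\n" : "No assignment found\n";
```

Dropping the i from the flags makes the same call return undef on this sample, which is the kind of silent breakage to watch for.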

Aaron B.
Available for small or large Perl jobs; see my home node.


In reply to Re^2: Getting Text from Website by aaron_baugher
in thread Getting Text from Website by LaneM1234
