Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I hope everyone is well :-D.

I have a question: I'm creating a class scheduler, for which I have to get the class information from another website. When you search this website for a class by subject, it returns the class information in 2 rows per class, i.e. the first row has the class number, name, etc. with hyperlinks, and the second row contains the location, time, instructor, etc.

I've been able to write the code to get all the hyperlinks and the text around them for EACH class on the results page, but I don't know how to parse the page to get the location and other info on the second row (MARKED in the code below).

The format of the results is something like this (this is for one result; multiple results repeat similar code):
<TABLE BORDER=0 CELLPADDING=2 CELLSPACING=2 WIDTH=100%>
<TR><TD>
<TABLE bgcolor=#ffffcc BORDER=0 CELLPADDING=1 CELLSPACING=1 width=100%>
(***********I GOT THIS SECTION COVERED ******************)
<tr BGCOLOR=#ffffcc>
<!--UMBEG-5 -->
<!--UMBEG-2 -->
<!--<td headers="header1" height=20><a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125>U1</a></td>-->
<td headers="header1" height=20> U1</td>
<!--UMEND-2 -->
<!--<td headers="header2" height=20>&nbsp;&nbsp;<a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125>11125</a></td>-->
<td headers="header2" height=20>&nbsp;&nbsp;11125</td>
<!--UMEND-5-->
<!--UMBEG-5-->
<td headers="header3"><a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125> ART-ED</a></td>
<td headers="header4"><a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125> 501</a></td>
<td headers="header5"><a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125> Student Teaching N-9</a></td>
<!--UMBEG-5 -->
<!--<td headers="header6"><a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125>01</a></td>-->
<td headers="header6"> 01</td>
<!--<td headers="header7"><a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125>LEC</a></td>-->
<td headers="header7"> LEC</td>
<!--<td headers="header8"><a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125>Open</a></td>-->
<td headers="header8"> Open</td>
<!--<td headers="header9"><a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125>21</a></td>-->
<td headers="header9"> 21</td>
<!--UMBEG-1 -->
<!-- <td><a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125>0</a></td> -->
<!--<td><a href=ClassSearchDetail.asp?AppFrom=Sched&SearchType=B&CurrCareer=&CurrInst=UMAMH&CurrInstDescr=U%2E+of+Massachusetts+Amherst&CurrTerm=1033&CurrTermDescr=Spring+2003&CurrRow=&CurrClass=11125>N</a></td> -->
<td>N</td>
<!--UMEND-1 -->
<!--UMEND-5-->
</tr>
(****** NEED HELP HERE to get Location, TIMe, etc****************)
<tr BGCOLOR=#ffffcc>
<!--UMBEG-6-->
<!--<td >&nbsp;</td>-->
<!--UMEND-6 -->
<!-- UMBEG-3 -->
<td>Location: TBA</td>
<!--UMBEG-6 -->
<td colspan=2>Time:1:00AM&nbsp;1:00AM</td>
<td colspan=2>Days: TBA</td>
<!--UMEND-6 -->
<!--<td colspan=2><font face="Arial, Helvetica, sans-serif" size=1></font></td>-->
<!--UMBEG-6 -->
<td colspan=4>Instructor: TBA</td>
<!--UMEND-6 -->
<!--UMEND-3 -->
<!--UMBEG-4-->
<!--Removed the End of row tag here so we can include Topic On this Line -->
<!--</tr> -->
<!--UMBEG-4-->
<TD>&nbsp;</TD>
</TR>
</TABLE>
</TD></TR>
</TABLE>

I know it's probably simple and I'm missing how to search the page, but I want to keep in mind not to search the WHOLE page after each result (the first row), just the row AFTER it (sometimes I get 100 results, so I don't want to waste resources every time).

Anyways, just some ideas on what I should use for this would be greatly helpful.

Thanks and you guys rule! Surya

Re: Best way to search for specifics in a webpage?
by tachyon (Chancellor) on Dec 26, 2002 at 10:03 UTC

    Personally I would use HTML::TableExtract, which is a subclass of HTML::Parser, for everything. If you want a fragile regex solution you could do the following (assuming the HTML page is in the scalar $html):

    my ($location)   = $html =~ m/Location:\s*([^<]+)/i;
    my ($time)       = $html =~ m/Time:\s*([^<]+)/i;
    my ($days)       = $html =~ m/Days:\s*([^<]+)/i;
    my ($instructor) = $html =~ m/Instructor:\s*([^<]+)/i;

    # if you want plain text you will need to do this
    $location   = unescapeHTML($location);
    $time       = unescapeHTML($time);
    $days       = unescapeHTML($days);
    $instructor = unescapeHTML($instructor);

    # this unescapes common cases, not all possible cases. For perfection -> CPAN
    sub unescapeHTML {
        my ($unescape) = @_;
        return undef unless defined($unescape);
        $unescape =~ s[&(.*?);]{
            local $_ = $1;
            /^amp$/i           ? '&' :
            /^quot$/i          ? '"' :
            /^gt$/i            ? '>' :
            /^lt$/i            ? '<' :
            /^nbsp$/i          ? ' ' :
            /^#(\d+)$/         ? chr($1) :
            /^#x([0-9a-f]+)$/i ? chr(hex($1)) :
            "&$_;"    # unknown entity: leave it untouched
        }gex;
        return $unescape;
    }

    If you use arrays rather than scalars for location et al and add a /g you will get all the locations on the page...

    my @location = $html =~ m/Location:\s*([^<]+)/gi;
    # first match will be in $location[0] and last match (not surprisingly) in $location[-1]
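
To keep each class number together with the details from the row that follows it, the same regex idea can be run in a `while` loop with `/g`, so the page is walked once from top to bottom rather than rescanned for every result. The following is only a rough sketch under assumptions: the sample `$html` is a hypothetical, trimmed-down page, the class number is assumed to live in the `header2` cell as in the posted markup, and the hash keys are made up for illustration. It is just as fragile as any other regex approach; in particular, a row missing one of the labels would be paired with the next class's data.

```perl
use strict;
use warnings;

# Hypothetical, trimmed-down page with two results (two rows per class).
my $html = <<'HTML';
<tr><td headers="header2" height=20>&nbsp;&nbsp;11125</td></tr>
<tr><td>Location: TBA</td><td colspan=2>Time:1:00AM&nbsp;1:00AM</td>
<td colspan=2>Days: TBA</td><td colspan=4>Instructor: TBA</td></tr>
<tr><td headers="header2" height=20>&nbsp;&nbsp;11200</td></tr>
<tr><td>Location: Room 12</td><td colspan=2>Time:9:05AM&nbsp;9:55AM</td>
<td colspan=2>Days: MWF</td><td colspan=4>Instructor: Staff</td></tr>
HTML

my @classes;

# With /g in scalar context each iteration resumes where the previous
# match ended, so the page is never searched again from the top.
while ($html =~ m{
        headers="header2" [^>]* > (?:&nbsp;|\s)* (\d+)   # class number, first row
        .*? Location:   \s* ([^<]+)                      # second row starts here
        .*? Time:       \s* ([^<]+)
        .*? Days:       \s* ([^<]+)
        .*? Instructor: \s* ([^<]+)
    }gisx)
{
    my %class;
    @class{qw(number location time days instructor)} = ($1, $2, $3, $4, $5);
    for (values %class) {
        s/&nbsp;/ /g;      # only the entity that appears in the sample
        s/^\s+|\s+$//g;    # trim surrounding whitespace
    }
    push @classes, \%class;
}

# One line per class: number, then the four second-row fields.
printf "%s: %s, %s, %s, %s\n", @{$_}{qw(number location time days instructor)}
    for @classes;
```

Each element of `@classes` is a hash reference, so `$classes[0]{location}` is the location of the first result and `$classes[-1]{instructor}` the instructor of the last one.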

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      tachyon, Thanks! Totally appreciate it ..... :-D. Quick general question -- how do you guys know so much about Perl? Is it experience or hobby or .....? Would love to be able to solve problems and be comfortable with the language. Thanks again, Surya

        In terms of languages, Perl has a lot to offer for people who need to get the job done - besides the enormous power of the language itself you have CPAN and the community. CPAN is *the* best resource of free, high-quality library functions for any language IMHO. The modules on CPAN are as eclectic as Perl itself and cover almost anything you can think of doing. Perlmonks is one of the best support forums for any language, and you have others like comp.lang.perl.misc if you like newsgroups and don't mind the odd flame.

        By trade I am a doctor of medicine but have been running an IT company and doing systems admin for quite a while now. Programming for 25 years now, and almost exclusively in Perl for the last 3.

        As with anything, the more you do the better you get. The beauty of Perl is that (with modules) you can get amazing results very early on, with the community there to help you with problems. BTW New Monks is probably worth a read (especially the bit about how to ask questions), as are the CGI Help Guide, A Guide to Installing Modules, and Use strict, warnings and diagnostics or die.

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Best way to search for specifics in a webpage?
by Anonymous Monk on Dec 26, 2002 at 08:48 UTC
    Sorry .... for some reason it didn't post with my login. I apologize. Thanks in advance. Surya