PerlMonks  

Re: Parsing COD text help

by dimmesdale (Friar)
on Jul 26, 2004 at 23:46 UTC ( [id://377616] )


in reply to Parsing COD text help

My sincere thanks to all those who contributed. First, I feel I need to apologize for the sloppiness of the code above (with the goto and otherwise messy logic). I started out trying to parse a few lines with a few regexes, but as I started checking what was being assigned I quickly saw there were tons of special cases to be handled: the number of sections varies, sometimes data isn't present, sometimes it is marked TBD, some data is marked "Reserved", etc. So my attempt piled hack upon hack (the gotos started out as next, but that was when the while was an infinite loop!).

Well, enough excuses. I came up with something that seems to work pretty well. I guess I should explain why I am doing this. I want to make a program that will ask me what classes I want to take and then tell me all the possible schedule combinations (if any) I can have with those classes. The schedule combinations part I already finished in Java (which I did to teach myself the language, because C/C++, Perl, Scheme, and Q/PBASIC aren't good enough for UVa -- but that's another discussion!).
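For readers who want the schedule-combination idea in Perl rather than Java, here is a minimal sketch. It is not the Java program mentioned above. The course names, section IDs, the one-letter day codes (R for Thursday), and the integer times are all made-up assumptions, and the helper names (meetings_conflict, sections_conflict, schedules) are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: course => { section id => [ [days, start, end], ... ] }
my %sections = (
    'APMA 111' => {
        '1' => [ [ 'MWF', 900,  950 ] ],
        '2' => [ [ 'MWF', 1000, 1050 ] ],
    },
    'CS 101' => {
        '1' => [ [ 'MWF', 900, 950 ] ],
        '2' => [ [ 'TR',  930, 1045 ] ],
    },
);

# True if two meetings share at least one day and their time ranges overlap.
sub meetings_conflict {
    my ($m1, $m2) = @_;
    my %days = map { $_ => 1 } split //, $m1->[0];
    return 0 unless grep { $days{$_} } split //, $m2->[0];
    return ( $m1->[1] < $m2->[2] && $m2->[1] < $m1->[2] ) ? 1 : 0;
}

# True if any meeting of section $s1 conflicts with any meeting of section $s2.
sub sections_conflict {
    my ($s1, $s2) = @_;
    for my $m1 (@$s1) {
        for my $m2 (@$s2) {
            return 1 if meetings_conflict( $m1, $m2 );
        }
    }
    return 0;
}

# Build every conflict-free choice of one section per wanted course.
# Each schedule is a list of [ course, section id ] pairs.
sub schedules {
    my ($data, @wanted) = @_;
    my @partial = ( [] );
    for my $course (@wanted) {
        my @next;
        for my $sched (@partial) {
            for my $id ( sort keys %{ $data->{$course} } ) {
                my $cand = $data->{$course}{$id};
                # Drop the candidate if it clashes with anything already chosen.
                next if grep {
                    sections_conflict( $cand, $data->{ $_->[0] }{ $_->[1] } )
                } @$sched;
                push @next, [ @$sched, [ $course, $id ] ];
            }
        }
        @partial = @next;
    }
    return @partial;
}

for my $sched ( schedules( \%sections, 'APMA 111', 'CS 101' ) ) {
    print join( ', ', map { "$_->[0] section $_->[1]" } @$sched ), "\n";
}
```

With the sample data this prints three combinations; pairing section 1 of both courses is rejected because both meet MWF 900-950. The search extends partial schedules one course at a time, so a clash prunes every combination built on it.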

Anyways, the code, for those interested. I ended up using HTML::TokeParser::Simple, which I held off on at first, not having used it before and wanting to get something tested as quickly as possible (a misapplication of laziness, I suppose), but the lovely examples helped me through it ...

#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser::Simple;
use Data::Dumper;

# $courses{mneumonic}{sectID} =
#   [ "Section number", "Credit", "CurrEnroll", "MaxEnroll",
#     [start time], [end time], [days], [location], [instructor] ]
#
use constant SECT_NUMBER  => 0;
use constant CREDIT_HOURS => 1;
use constant CURR_ENROLL  => 2;
use constant MAX_ENROLL   => 3;
use constant START_TIME   => 4;
use constant END_TIME     => 5;
use constant DAYS         => 6;
use constant LOCATION     => 7;
use constant INSTRUCTOR   => 8;

my $file   = 'APMA.txt';
my $stream = HTML::TokeParser::Simple->new( $file );

my ($class, $title);
my (%courses, $mneumonic, $sectID);

# Flag to tell program if last $title was a match against /Day/
# If so, location follows next (no consistent marker otherwise)
my $wasJustDays = 0;

while( my $t = $stream->get_token ) {
    if( $t->is_start_tag( 'a' ) and
        $t->return_attr( 'href' ) =~ m/course_nbr/ ) {
        # And thus begins a new Course ...
        $mneumonic = $stream->get_text( '/a' );
    }
    elsif( $t->is_start_tag( 'span' ) ) {
        $class = $t->return_attr( 'class' );
        $title = $t->return_attr( 'title' );

        if( defined $title and not defined $class ) {
            # These would be all the rest of the fields,
            # ... Schedule number, credit hours, etc.
            if ($title =~ /Schedule Number/) {
                $sectID = $stream->get_text( '/span' );
            }
            elsif ($title =~ /Section Number/) {
                $courses{$mneumonic}{$sectID}->[SECT_NUMBER] =
                    $stream->get_text( '/span' );
            }
            elsif ($title =~ /Credit Hours/) {
                $courses{$mneumonic}{$sectID}->[CREDIT_HOURS] =
                    $stream->get_text( '/span' );
            }
            elsif ($title =~ /Time/) {
                # List assignment leaves both undef when the match fails
                # (e.g. TBD) instead of reusing stale $1/$2
                my ($start, $end) =
                    $stream->get_text( '/span' ) =~ /(\d+)-(\d+)/;
                push @{$courses{$mneumonic}{$sectID}->[START_TIME]}, $start;
                push @{$courses{$mneumonic}{$sectID}->[END_TIME]},   $end;
            }
            elsif ($title =~ /Day/) {
                push @{$courses{$mneumonic}{$sectID}->[DAYS]},
                    $stream->get_text( '/span' );
                $wasJustDays = 1; # See note at variable declaration
            }
            elsif ($title =~ /Instructor/) {
                push @{$courses{$mneumonic}{$sectID}->[INSTRUCTOR]},
                    $stream->get_text( '/span' );
            }
            elsif ($title =~ m<Enrollment:Authorized/Actual>) {
                # Same guard as above: undef rather than stale captures
                my ($max, $curr) =
                    $stream->get_text( '/span' ) =~ m<(\d+)/(\d+)>;
                $courses{$mneumonic}{$sectID}->[MAX_ENROLL]  = $max;
                $courses{$mneumonic}{$sectID}->[CURR_ENROLL] = $curr;
            }
            else {
                if ($wasJustDays == 1) {
                    push @{$courses{$mneumonic}{$sectID}->[LOCATION]},
                        $title . ": " . $stream->get_text( '/span' );
                    $wasJustDays = 0; # See note at variable declaration
                }
            }
        }
        elsif( defined $class and $class eq 'title' ) {
            # This is the name of the course; e.g., Linear Algebra
            # Ignore for the time being ...
            #print $stream->get_text( '/span' ), "\n";
        }
    }
}
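Once %courses is filled, it can be walked with the same index constants. The following is a minimal sketch of that, listing sections that still have seats. The sample entries (course name, schedule numbers, rooms, times) are invented for illustration and do not come from the real APMA.txt, and open_sections is a hypothetical helper. The // operator needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use constant SECT_NUMBER  => 0;
use constant CREDIT_HOURS => 1;
use constant CURR_ENROLL  => 2;
use constant MAX_ENROLL   => 3;
use constant START_TIME   => 4;
use constant END_TIME     => 5;
use constant DAYS         => 6;
use constant LOCATION     => 7;
use constant INSTRUCTOR   => 8;

# Hypothetical sample in the layout described above; all values are made up.
my %courses = (
    'APMA 308' => {
        '12345' => [ '1', '3', 28, 30, ['0900'], ['0950'], ['MWF'],
                     ['Room: A 101'], ['Staff'] ],
        '12346' => [ '2', '3', 30, 30, ['1100'], ['1150'], ['MWF'],
                     ['Room: B 202'], ['Staff'] ],
    },
);

# Return one summary line per section that still has seats.
sub open_sections {
    my ($courses) = @_;
    my @lines;
    for my $mneumonic ( sort keys %$courses ) {
        for my $sectID ( sort keys %{ $courses->{$mneumonic} } ) {
            my $s = $courses->{$mneumonic}{$sectID};

            # Skip sections whose enrollment numbers were missing.
            next unless defined $s->[CURR_ENROLL]
                    and defined $s->[MAX_ENROLL];
            next unless $s->[CURR_ENROLL] < $s->[MAX_ENROLL];

            # Days and times are parallel lists; '?' marks a missing entry.
            my @when;
            for my $i ( 0 .. $#{ $s->[DAYS] || [] } ) {
                push @when, sprintf '%s %s-%s',
                    $s->[DAYS][$i]       // '?',
                    $s->[START_TIME][$i] // '?',
                    $s->[END_TIME][$i]   // '?';
            }

            push @lines, sprintf '%s section %s (%s): %d/%d enrolled, %s',
                $mneumonic, $s->[SECT_NUMBER], $sectID,
                $s->[CURR_ENROLL], $s->[MAX_ENROLL], join( ', ', @when );
        }
    }
    return @lines;
}

print "$_\n" for open_sections( \%courses );
```

With the sample data, only schedule number 12345 is reported, since 12346 is full. Data::Dumper's Dumper(\%courses) is the quicker way to eyeball the whole structure while debugging the parser.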
