My sincere thanks to all those who contributed. I feel, first, I need to explain the sloppiness of the code above (with the GOTO and otherwise messy logic). I started out trying to parse a few lines with a few regexes, then as I started checking what was being assigned I quickly saw there were tons of special cases to be handled: the number of sections varies, sometimes data isn't present, sometimes it is marked with TBD, some data is marked "Reserved", etc. So my attempt put hack upon hack (the gotos started out as next, but that was when while was an infinite loop!).

Well, enough excuses. I came up with something that seems to work pretty well. I guess I should explain why I am doing this. I want to make a program that will ask me what classes I want to take and then tell me all the possible schedule combinations (if any) I can have with those classes. The schedule combinations part I already finished in Java (which I did to teach myself the language, because C/C++, Perl, Scheme, and Q/PBASIC aren't good enough for UVa -- but that's another discussion!).

Anyway, here is the code, for those interested. I ended up using HTML::TokeParser::Simple, which I had refrained from at first, not having used it before and wanting to get something tested as quickly as possible (a misapplication of laziness, I suppose), but the lovely examples helped me through it ...

    #!/usr/bin/perl
    use strict;
    use warnings;

    use HTML::TokeParser::Simple;
    use Data::Dumper;

    # $courses{mneumonic}{sectID} =
    #   [ "Section number", "Credit", "CurrEnroll", "MaxEnroll",
    #     [start time], [end time], [days], [location], [instructor] ]
    #
    use constant SECT_NUMBER  => 0;
    use constant CREDIT_HOURS => 1;
    use constant CURR_ENROLL  => 2;
    use constant MAX_ENROLL   => 3;
    use constant START_TIME   => 4;
    use constant END_TIME     => 5;
    use constant DAYS         => 6;
    use constant LOCATION     => 7;
    use constant INSTRUCTOR   => 8;

    my $file   = 'APMA.txt';
    my $stream = HTML::TokeParser::Simple->new( $file );

    my ($class,$title);
    my (%courses, $mneumonic, $sectID);

    # Flag to tell program if last $title was a match against /Day/
    # If so, location follows next (no consistent marker otherwise)
    my $wasJustDays = 0;

    while( my $t = $stream->get_token ) {
        if( $t->is_start_tag( 'a' ) and $t->return_attr( 'href' ) =~ m/course_nbr/ ) {
            # And thus begins a new Course ...
            $mneumonic = $stream->get_text( '/a' );
        }
        elsif( $t->is_start_tag( 'span' ) ) {
            $class = $t->return_attr( 'class' );
            $title = $t->return_attr( 'title' );

            if( defined $title and not defined $class ) {
                # These would be all the rest of the fields,
                # ... Schedule number, credit hours, etc.
                if ($title =~ /Schedule Number/) {
                    $sectID = $stream->get_text( '/span' );
                }
                elsif ($title =~ /Section Number/) {
                    $courses{$mneumonic}{$sectID}->[SECT_NUMBER] = $stream->get_text( '/span' );
                }
                elsif ($title =~ /Credit Hours/) {
                    $courses{$mneumonic}{$sectID}->[CREDIT_HOURS] = $stream->get_text( '/span' );
                }
                elsif ($title =~ /Time/) {
                    $stream->get_text( '/span' ) =~ /(\d+)-(\d+)/;
                    push @{$courses{$mneumonic}{$sectID}->[START_TIME]}, $1;
                    push @{$courses{$mneumonic}{$sectID}->[END_TIME]}, $2;
                }
                elsif ($title =~ /Day/) {
                    push @{$courses{$mneumonic}{$sectID}->[DAYS]}, $stream->get_text( '/span' );
                    $wasJustDays = 1; # See note at variable declaration
                }
                elsif ($title =~ /Instructor/) {
                    push @{$courses{$mneumonic}{$sectID}->[INSTRUCTOR]}, $stream->get_text( '/span' );
                }
                elsif ($title =~ m<Enrollment:Authorized/Actual>) {
                    $stream->get_text( '/span' ) =~ m<(\d+)/(\d+)>;
                    $courses{$mneumonic}{$sectID}->[MAX_ENROLL]  = $1;
                    $courses{$mneumonic}{$sectID}->[CURR_ENROLL] = $2;
                }
                else {
                    if ($wasJustDays == 1) {
                        push @{$courses{$mneumonic}{$sectID}->[LOCATION]},
                            $title . ": " . $stream->get_text( '/span' );
                        $wasJustDays = 0; # See note at variable declaration
                    }
                }
            }
            elsif( defined $class and $class eq 'title' ) {
                # This is the name of the course; e.g., Linear Algebra
                # Ignore for the time being ...
                #print $stream->get_text( '/span' ), "\n";
            }
        }
    }
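For anyone who wants to consume the result, here is a minimal sketch of walking the %courses structure that the parser fills in. The sample data (course name, section IDs, times, location text) is made up for illustration and is not taken from the real APMA.txt, and the helpers open_seats and meetings are hypothetical names of my own, not part of any module. It treats missing slots as "unknown" because some fields are absent or TBD in the real data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same slot indices as in the parser above.
use constant {
    SECT_NUMBER => 0, CREDIT_HOURS => 1, CURR_ENROLL => 2, MAX_ENROLL => 3,
    START_TIME  => 4, END_TIME     => 5, DAYS        => 6, LOCATION   => 7,
    INSTRUCTOR  => 8,
};

# Hypothetical sample data in the shape the parser produces.
my %courses = (
    'APMA 101' => {
        '12345' => [ '1', '3', '25', '30',
                     ['0900'], ['0950'], ['MWF'], ['Room: OLS 120'], ['Staff'] ],
        '12346' => [ '2', '3', '30', '30',
                     ['1100'], ['1215'], ['TR'], undef, ['TBD'] ],
    },
);

# Seats left in a section, or nothing if enrollment data is missing.
sub open_seats {
    my ($sect) = @_;
    my ($max, $curr) = @{$sect}[MAX_ENROLL, CURR_ENROLL];
    return unless defined $max and defined $curr;
    return $max - $curr;
}

# One "DAYS START-END" string per meeting; '?' stands in for a missing time.
sub meetings {
    my ($sect) = @_;
    my @start = @{ $sect->[START_TIME] || [] };
    my @end   = @{ $sect->[END_TIME]   || [] };
    my @days  = @{ $sect->[DAYS]       || [] };
    my @out;
    for my $i (0 .. $#days) {
        my $s = defined $start[$i] ? $start[$i] : '?';
        my $e = defined $end[$i]   ? $end[$i]   : '?';
        push @out, "$days[$i] $s-$e";
    }
    return @out;
}

for my $mneumonic (sort keys %courses) {
    for my $sectID (sort keys %{ $courses{$mneumonic} }) {
        my $sect = $courses{$mneumonic}{$sectID};
        my $open = open_seats($sect);
        printf "%s [%s] section %s: %s, open seats: %s\n",
            $mneumonic, $sectID, $sect->[SECT_NUMBER],
            join('; ', meetings($sect)) || 'no meeting times',
            defined $open ? $open : 'unknown';
    }
}
```

Keeping the index constants in one place means the reader of the data never has to remember that, say, slot 6 holds the days.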
edit: removed readmore

In reply to Re: Parsing COD text help by dimmesdale
in thread Parsing COD text help by dimmesdale
