Monks, I am trying to parse the data at virginia.edu/.../webcod.enf.txt. An example of the format is:
92861 APMA 109 0001 GI LC CALCULUS I 4.0 0900-0950 M W F OLS 011 OBERHAUSER JP 055 002 O 0830-0920 T OLS 005
90063 APMA 109 0002 GI LC CALCULUS I 4.0 1000-1050 M W F OLS 120 BECK M 055 004 O 0830-0920 R OLS 120
91589 APMA 109 0003 GI LC CALCULUS I 4.0 1100-1150 M W F OLS 120 BECK M 055 006 O 0830-0920 T MEC 205
93778 APMA 109 0004 GI LC CALCULUS I 4.0 1200-1250 M W F OLS 120 BECK M 055 004 O 0830-0920 T MEC 205
(The HTML version is easier to parse mentally, ..., but it seemed like a harder problem initially.)
Lines that begin with a 5-digit number mark the start of a particular section (i.e., an instance of a certain course being taught). APMA 109 is the mnemonic for the course; then come the section number, the course name, etc. Parsing is made more difficult because certain information is sometimes absent. I've made an attempt at it, but when I dump my hash, it's nowhere close to what it should be. Your help is most appreciated. Code ...
#!/usr/bin/perl
# parseCOD.perl
# Attempts to parse .txt COD files for UVa
use strict;
use warnings;
use Data::Dumper;

# coures{mneumonic}->[sections] =
# [ "Name", "ID", "Credit", "CurrEnroll", "MaxEnroll", [start time], [end time], [days], [location], [instructor] ]
#
# So, for instance, to find out all the professors of each section of SPAN411:
# for (@{$coures{SPAN411}->[9]}) { print }
#
my %courses;
my ($id, $mneumonic, $sect, $maxEnroll, $currEnroll, $name, $credit);
my (@startTime, @endTime, @days, @location, @instructor);
my $line;
my $file = "webcod.enf.txt";
my $i = 1;

open COD,$file or die "No go fo' $file\n";
open OUT,">out.txt" or die "No go fo' out.txt\n";

for (1..10) { my $junk = <COD>; } # Don't need first ten lines ...

while ($line = <COD>) {
  SECTION:
  if ($line =~ /\d\d\d\d\d/) { # Start of a course section
    if ($line =~ /(\d\d\d\d\d)\s((?:\W|\S){2,5}\S*?\d\d\d\w?)\s+(\d\d\d\d)\s(.*)(\d\.\d)/) {
      $id = $1; $mneumonic = $2; $sect = $3; $name = $4; $credit = $5;
    }
    else { die "Line not properly parsed! Choking, choking ... dead. Line:\n'$line'"; }

    # Now, get times, dates, etc.
    $line = <COD>;
    if ($line =~ /TBA/) { $line = <COD>; goto SECTION; }
    if ($line =~ /\s*(\d+)-(\d+)\s*([MTWRFS ]+?)\s*(\w\w\w\s*\w?\d*\w?)\s*(\w*[,'-]?\w*\s*\w*)\s*(\d\d\d)\s*(\d\d\d)/) {
      $startTime[0] = $1; $endTime[0] = $2; $days[0] = $3; $location[0] = $4;
      $instructor[0] = $5; $maxEnroll = $6; $currEnroll = $7;
    }
    else { die "Oof! Malformed line. Line:\n'$line'"; }

    while ($line = <COD>) {
      if ($line =~ /\d\d\d\d\d/) { $line = <COD>; goto SECTION; }
      if ($line =~ /GRAD ENGR/) { $line = <COD>; goto SECTION; }
      if ($line =~ /TBA/) { $line = <COD>; goto SECTION; }
      if ($line =~ /RESTRICTED TO/) { $line = <COD>; goto SECTION; }
      if ($line =~ /^$/) { $line = <COD>; goto SECTION; }
      if ($line =~ /\s+([a-zA-Z]+\s+[a-zA-Z])/) { push @instructor, $1; }
      elsif ($line =~ /\s*(\d+)-(\d+)\s*([MTWRFS ]+?)\s*(\w\w\w\s*\w?\d*)\s*(\w*[,'-]?\w*\s*\w*)/) {
        push @startTime, @$1; push @endTime, @$2; push @days, @$3; push @location, @$4;
        if ($5) { push @instructor, @$5; }
        else { push @instructor, @{$instructor[0]}; }
      }
      else { die "Doh! Silly out of format line. Line:\n'$line'"; }
    }
  }
}
continue {
  # Remove spaces from $mneumonic
  $mneumonic =~ s/\s+//g;
  $courses{$mneumonic}->[$sect] = [$name, $id, $credit, $currEnroll, $maxEnroll,
                                   \@startTime, \@endTime, \@days, \@location, \@instructor];
  print "Assigned ", $i++, "\n";
}
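
Two things in the script above would each produce a wrong dump on their own. Every section stores references to the same five file-scoped arrays (\@startTime and friends), so all sections end up pointing at one shared set of data. And under use strict, push @startTime, @$1 tries to dereference the captured string as an array. What follows is only a minimal sketch of one alternative, not a drop-in fix: it splits the text before each 5-digit ID and builds a fresh structure per section. The name parse_sections, the hash-of-records layout, and both regexes are my own assumptions, checked only against the sample data shown; instructor and enrollment fields are left out because their layout is not clear from the sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post). Returns a hash ref:
#   { MNEMONIC => [ undef, { id, name, credit, meetings => [...] }, ... ] }
sub parse_sections {
    my ($text) = @_;
    my %courses;

    # Split before every stand-alone 5-digit number (assumed to be a section ID).
    for my $chunk (split /(?=\b\d{5}\b)/, $text) {
        my ($id, $mnemonic, $sect, $name, $credit) =
            $chunk =~ /^(\d{5})\s+([A-Z]+\s*\d{3}\w?)\s+(\d{4})\s+(.*?)\s+(\d\.\d)\b/s
            or next;    # skip preamble or anything that is not a section header
        $mnemonic =~ s/\s+//g;    # "APMA 109" becomes "APMA109"

        # Declared inside the loop, so each section gets its own fresh list.
        my @meetings;
        while ($chunk =~ /(\d{4})-(\d{4})\s+((?:[MTWRFS]\s+)+)([A-Z]{3}\s+\w+)/g) {
            # Copy the captures before doing anything else with them.
            my ($start, $end, $days, $where) = ($1, $2, $3, $4);
            push @meetings, {
                start    => $start,
                end      => $end,
                days     => [ split ' ', $days ],
                location => $where,
            };
        }

        # "0001" + 0 gives the numeric array index 1.
        $courses{$mnemonic}[ $sect + 0 ] = {
            id       => $id,
            name     => $name,
            credit   => $credit,
            meetings => \@meetings,
        };
    }
    return \%courses;
}

# Two sections taken from the sample data in the post.
my $sample = join ' ',
    '92861 APMA 109 0001 GI LC CALCULUS I 4.0 0900-0950 M W F OLS 011',
    'OBERHAUSER JP 055 002 O 0830-0920 T OLS 005',
    '90063 APMA 109 0002 GI LC CALCULUS I 4.0 1000-1050 M W F OLS 120',
    'BECK M 055 004 O 0830-0920 R OLS 120';

my $courses = parse_sections($sample);
for my $n (1 .. $#{ $courses->{APMA109} }) {
    my $s = $courses->{APMA109}[$n];
    print "$s->{id}: ", scalar @{ $s->{meetings} }, " meetings\n";
}
```

Because the sketch splits on the ID rather than reading line by line, it does not depend on how the file breaks each record across lines. The cost is that any other stand-alone 5-digit number in the real file would be mistaken for a section start.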

In reply to Parsing COD text help by dimmesdale
