There is a lot of extraneous material in that data file. The key part (to me) of data munging is to first break out the records, then process each record in turn. This snippet breaks all the courses out into records, skipping all the dross. From there it should be easy to reliably parse $course, which contains one complete course entry.

my $file = "c:/webcod_enf.txt";
open F, $file or die $!;
local $/ = "\n\n";    # break up into rough 'records' at blank lines
                      # often you can get good records just by setting $/
                      # not with this data though ;-)
while (<F>) {
    # skip the extraneous data, valid chunks will start with ^\s*\d{5}
    unless ( m/^\s*\d{5}/ ) {
        #print "Skipping:\n$_";
        next;
    }
    # now we have chunks of real data to parse. we split it on the unique m/^\s*\d{5}/m
    # numeric feature to break out the individual records. we use a lookahead assertion
    # to do this so we don't lose that data in the split
    for my $course ( split /(?=^\s*\d{5})/m, $_ ) {
        next if $course =~ m/^\s*$/;    # we possibly get a null record to start so skip
        print "$course\n\n";
    }
}

__DATA__
92861 APMA 109 0001 GI LC CALCULUS I 4.0 0900-0950 M W F OLS 011 OBERHAUSER JP 055 002 O 0830-0920 T OLS 005
90063 APMA 109 0002 GI LC CALCULUS I 4.0 1000-1050 M W F OLS 120 BECK M 055 004 O 0830-0920 R OLS 120
91589 APMA 109 0003 GI LC CALCULUS I 4.0 1100-1150 M W F OLS 120 BECK M 055 006 O 0830-0920 T MEC 205
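
As a rough idea of the next step, here is a minimal sketch of pulling a few fields out of one $course chunk. The helper name parse_course and the field names (id, dept, number, section) are my guesses from the sample lines, not anything the data file documents, so adjust the regex to whatever the columns really mean.

```perl
use strict;
use warnings;

# Hypothetical helper: pull a few leading fields out of one $course chunk.
# The field names below are assumptions based on the sample data only.
sub parse_course {
    my ($course) = @_;
    my ( $id, $dept, $number, $section, $rest ) =
        $course =~ m/^\s*(\d{5})\s+([A-Z]+)\s+(\d+)\s+(\d{4})\s+(.*)/s
        or return;    # not a chunk we recognise

    # treat every HHMM-HHMM pair in the remainder as a meeting time
    my @times = $rest =~ m/\b(\d{4}-\d{4})\b/g;

    return {
        id      => $id,
        dept    => $dept,
        number  => $number,
        section => $section,
        times   => \@times,
    };
}

my $course = "92861 APMA 109 0001 GI LC CALCULUS I 4.0 0900-0950 M W F "
           . "OLS 011 OBERHAUSER JP 055 002 O 0830-0920 T OLS 005";

if ( my $rec = parse_course($course) ) {
    # prints "92861 APMA 109: 0900-0950 0830-0920"
    print "$rec->{id} $rec->{dept} $rec->{number}: @{ $rec->{times} }\n";
}
```

Calling this inside the for loop in place of the print would give you one hash per course. It deliberately leaves the title, instructor and room alone, since those look like they need fixed-column handling rather than whitespace splitting.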

cheers

tachyon


In reply to Re: Parsing COD text help by tachyon
in thread Parsing COD text help by dimmesdale
