There is a lot of extraneous material in that data file. To me, the key to data munging is to break out the records first, then process each record in turn. This snippet breaks the courses out into records and skips all the dross. From there it should be easy to parse $course reliably, since each $course contains one complete course entry.

my $file = "c:/webcod_enf.txt";
open F, '<', $file or die "Cannot open $file: $!";

local $/ = "\n\n";  # break up into rough 'records' at blank lines
                    # often you can get good records just by setting $/
                    # not with this data though ;-)

while (<F>) {

    # skip the extraneous data, valid chunks will start with ^\s*\d{5}

    unless ( m/^\s*\d{5}/ ) {
        #print "Skipping:\n$_";
        next;
    }

    # now we have chunks of real data to parse. we split on the unique
    # numeric feature m/^\s*\d{5}/m to break out the individual records.
    # we use a lookahead assertion so we don't lose that data in the split

    for my $course ( split /(?=^\s*\d{5})/m, $_ ) {
        next if $course =~ m/^\s*$/;   # we may get an empty record to start, so skip it
        print "$course\n\n";
    }
}

__DATA__
 92861 APMA 109  0001 GI LC CALCULUS I                        4.0
       0900-0950 M W F  OLS   011  OBERHAUSER      JP 055 002 O
       0830-0920  T     OLS   005


 90063 APMA 109  0002 GI LC CALCULUS I                        4.0
       1000-1050 M W F  OLS   120  BECK            M  055 004 O
       0830-0920    R   OLS   120


 91589 APMA 109  0003 GI LC CALCULUS I                        4.0
       1100-1150 M W F  OLS   120  BECK            M  055 006 O
       0830-0920  T     MEC   205
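
Once you have $course, pulling the fields out might look something like the sketch below. The name parse_course and the hash keys (id, dept, number, section, desc, credits, meetings) are made up for illustration, and the column meanings are guesses from the three sample records only, so check them against the real file. The "GI LC" flags are left inside desc because their meaning isn't clear from the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one $course chunk into a hash reference.
# Column meanings are assumptions based on the sample records only.
sub parse_course {
    my ($course) = @_;

    # first non-blank line is the course header, the rest are meeting lines
    my ($head, @rest) = grep { /\S/ } split /\n/, $course;
    return unless defined $head;

    # id, dept, course number, section, free text, credits at end of line
    my ($id, $dept, $number, $section, $desc, $credits) =
        $head =~ m/^\s*(\d{5})\s+(\S+)\s+(\S+)\s+(\d{4})\s+(.*?)\s+(\d+\.\d+)\s*$/
        or return;

    my @meetings;
    for my $line (@rest) {
        # time range, day letters (M T W R F), building, room;
        # anything after the room (instructor etc.) is ignored here
        my ($start, $end, $days, $building, $room) =
            $line =~ m/^\s*(\d{4})-(\d{4})\s+((?:[MTWRF]\s+)+)(\S+)\s+(\S+)/
            or next;
        $days =~ s/\s+//g;    # "M W F  " becomes "MWF"
        push @meetings, {
            start => $start, end => $end, days => $days,
            building => $building, room => $room,
        };
    }

    return {
        id => $id, dept => $dept, number => $number, section => $section,
        desc => $desc, credits => $credits, meetings => \@meetings,
    };
}

# usage with the first sample record
my $sample = join "\n",
    ' 92861 APMA 109  0001 GI LC CALCULUS I                        4.0',
    '       0900-0950 M W F  OLS   011  OBERHAUSER      JP 055 002 O',
    '       0830-0920  T     OLS   005',
    '';

my $rec = parse_course($sample) or die "could not parse record\n";
print "$rec->{id} $rec->{dept} $rec->{number}: $rec->{desc} ($rec->{credits})\n";
print "  $_->{days} $_->{start}-$_->{end} $_->{building} $_->{room}\n"
    for @{ $rec->{meetings} };
```

Inside the while loop you would call it as `my $rec = parse_course($course) or next;` in place of the print. Chunks that don't match the assumed layout return nothing, so they are skipped instead of producing half-filled records.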

cheers

tachyon

In reply to Re: Parsing COD text help by tachyon
in thread Parsing COD text help by dimmesdale
