
I've had the pleasure of hacking through three types of flat text file recently. They are:

the entire pile of RFCs from http://www.faqs.org

3000 alcoholic beverage recipes (from somewhere I, mysteriously, cannot remember)

CIF files

So I have been parsing a lot of flat text files. The RFCs are HTML, but there's a lot of fluff at the beginning and at the end, so I've been using the loop below to extract the 'meat'. The drinks are also HTML, but have different crap around them. The CIF files are not HTML, and I can't really strip a lot of data from them -- but I want to be able to strip them out of other data.

So with these three programs (in about 2 weeks) I have had to write some sort of "start parsing, parse, stop parsing" loop three times. I've even pondered writing a small module to do it for me (not for CPAN; I'd probably post it here, but it would mostly be something to keep in my homedir to ease future scripts). This is something that has undoubtedly been done zillions of times. After all, what is Perl but a <!- pathologically eclectic RUBBISH lister!!!!!! ->parser?

So whilst working on making my code readable, I stumbled upon (see Using arrays of qr!! to simplify larger RE's for readability (code). and Optimization for readability and speed (code)) the use of arrays of qr!! patterns, iterating through them when matching text. This allows some flexibility (i.e., multiple "start" and "finish" conditions), and it is also pretty clear to read (as it reduces the size of the individual regular expressions).

But looking over the code, I don't get a good "satisfied" feeling about re-using it. So, here it is, and I'd like to know what others would do instead:

my @beginnings =
( qr{This is valid</a>},
  qr{[So]+(?:IS|isnt) this}, 
);

my @endings =
( qr{(?:we) Should not [be] [Pp]arsing after},
  qr{either (o|f) these [Ll]ines},
);

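# returns the line if it matches ANY of the "start" patterns
sub isbeg {
  my $test = shift;
  foreach (@beginnings) { return $test if $test =~ $_ }
  return;
}

# returns the line if it matches ANY of the "finish" patterns
sub isend {
  my $test = shift;
  foreach (@endings) { return $test if $test =~ $_ }
  return;
}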

# here is the part I dislike, partially because of the $parsing variable
# it just doesn't seem as "clean" as something some of
# you would write.

  my @extracted;
  my $parsing;
  foreach my $line (@lines) {
    $parsing++ if isbeg( $line );
    push @extracted, "$line\n" if $parsing;
    last if isend( $line );
  }

I'm familiar with HTML::TokeParser and HTML::Parser, but since I do this a lot on non-HTML files, I'd like to extract the good parts with my loop and use the parsing modules to parse the stuff I want to parse (rather than the gristle).
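
For illustration only, here is a minimal sketch of what the "small module" idea could look like as a single reusable sub. It assumes a line counts as a start (or finish) when it matches any one of the patterns, and that the finish patterns only matter once extraction has begun. The names matches_any and extract_between, and the sample lines and patterns, are hypothetical and not taken from any existing module.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line matches ANY of the given patterns.
sub matches_any {
    my ($line, @patterns) = @_;
    for my $re (@patterns) {
        return 1 if $line =~ $re;
    }
    return 0;
}

# Hypothetical helper: returns the lines from the first "start" match
# through the first later "finish" match, inclusive.
sub extract_between {
    my ($lines, $starts, $ends) = @_;
    my @extracted;
    my $parsing = 0;
    for my $line (@$lines) {
        # Switch on at the first start match; stay on afterwards.
        $parsing ||= matches_any($line, @$starts);
        next unless $parsing;
        push @extracted, $line;
        last if matches_any($line, @$ends);
    }
    return @extracted;
}

# Made-up sample data, just to exercise the sub.
my @lines = (
    'header fluff',
    'BEGIN body',
    'the meat',
    'END body',
    'footer fluff',
);

my @meat = extract_between(
    \@lines,
    [ qr/^BEGIN/, qr/^START/ ],   # any of these starts extraction
    [ qr/^END/,   qr/^STOP/  ],   # any of these stops it
);

print "$_\n" for @meat;
```

Passing the pattern lists as array references keeps the state flag inside the sub, so each script only has to supply its own start and finish patterns. Unlike the loop above, this sketch ignores finish patterns that appear before the first start match.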

thanks
hermano deppon

--
Laziness, Impatience, Hubris, and Generosity.

In reply to Parse Loops with flat text files. (code) by deprecated
