Eight months ago I wrote a Perl script to perform the following functions:-

Open a file, read the input, identify garbage lines via regexes and remove them.

Read the file again (I know, I know, already inefficient), split it into records, split the records into columns, and store specific columns in a hash keyed on record id.

Once it had all the records, the script would insert them into an Oracle database, which another part of the script would query, tag and report on. It worked and I was happy.

In the interim I have learnt a lot about structure, packages, modules, regexes (with a lot of help from the monks - thanks Monks!).

I now feel the need for a rewrite and a new approach, as it needs to be quicker and more robust than it is. Plus it will be an opportunity to learn something new. The garbage identification mechanism needs constant updating as the garbage varies, so I need to state what I will accept rather than what I won't accept. (It's output from Meridian Voice Switches, for those that are interested.) I am also extending the functionality so that it will be able to parse multiple output types.
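To make the "state what I will accept" idea concrete, here is a minimal sketch, not the real script. The key names and sample lines are invented for illustration, and it assumes a line is acceptable when its first word is one of the known keys.

```perl
use strict;
use warnings;

# Hypothetical accept-list; the real one could be loaded from a file.
my @accepted_keys = qw(TYPE TN DES KEY);

# Build one anchored pattern from the list. Longest keys go first so a
# short key can never shadow a longer one that starts the same way.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a } @accepted_keys;

# (?!\S) requires the key to be a whole word: "TNX" must not pass as "TN".
my $accept_re = qr/^(?:$alternation)(?!\S)/;

# Return only the lines that start with an accepted key.
sub keep_accepted {
    return grep { $_ =~ $accept_re } @_;
}

my @kept = keep_accepted(
    "TYPE 2616",
    "OVL111 IDLE",    # invented noise line
    "DES RECPT",
    "TNX 1 2 3",      # looks like a key but is not on the list
);
# @kept now holds "TYPE 2616" and "DES RECPT".
```

With this shape, new garbage needs no new regex; only a genuinely new key type means touching the list.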

I need to end up with a data structure containing data structures of valid records.

Reading around, I get the feeling I should be using IO::Filter, or implementing the Filtered IO idea from TheDamian's OOPerl, to remove the garbage from my file, and then perhaps use Parse::RecDescent to parse the file, validate each record and create the data structure. At the same time I am conscious that I don't want to make it more complex than it needs to be.

This seems like a very common task to want to perform. My questions are:-

What approach would you take to a problem of this kind? What criteria do I use to decide when to stop using plain regexes and consider Parse::RecDescent? I think I am at that point.

Typical output for a record is on set_uk's scratchpad, to show the complexity of the problem.

General rules are:-

The first word at the beginning of each column is its key. There are a lot of valid key types (1000+), and anything not starting with a key is garbage, with two exceptions: if the first column is blank, the data belongs to the key on the previous column; and if the first column is the same as the previous one, the data belongs to the previous column.
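As a rough sketch of how those rules might look with plain regexes, before reaching for a grammar: this assumes "first column" means the first whitespace-separated field of a line, and the keys and sample data are invented. Dropping continuation lines that follow a garbage line is a design choice here, not something the rules dictate.

```perl
use strict;
use warnings;

# Hypothetical accept-list; the real one would hold the 1000+ valid keys.
my %valid_key = map { $_ => 1 } qw(TYPE TN DES KEY);

# Turn a list of lines into a hash ref of key => [ data strings ].
sub parse_lines {
    my @lines = @_;
    my (%record, $current);
    for my $line (@lines) {
        next unless $line =~ /\S/;    # skip empty lines

        # Blank first column: the data belongs to the previous key.
        if ($line =~ /^\s+(\S.*?)\s*$/) {
            push @{ $record{$current} }, $1 if defined $current;
            next;
        }

        my ($key, $data) = $line =~ /^(\S+)\s*(.*?)\s*$/;
        unless ($valid_key{$key}) {
            undef $current;           # garbage: drop it and its continuations
            next;
        }

        # A repeated key simply appends to the same list.
        $current = $key;
        push @{ $record{$key} }, $data if length $data;
    }
    return \%record;
}

my $rec = parse_lines(
    "TYPE 2616",
    "junk line from the switch",
    "KEY 00 SCR 4000",
    "    01 SCR 4001",    # blank first column
    "KEY 02 MSB",         # same key as the previous line
);
# $rec->{KEY} now holds three entries; the junk line is gone.
```

If the rules stay this flat, a loop like this may be all that is needed; nested or context-dependent records would be the point where Parse::RecDescent starts to pay for itself.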

I'd be interested to hear what you think. No doubt there are shortcuts to this type of problem that I am not aware of.

Simon


In reply to To parse or not to parse by set_uk
