So you have at least four different input file formats (maybe there are others you haven't shown?), and there is some minimum set of feature information of interest that they all have in common (but each format also contains other information that is not common and not relevant).

Are you able to know in advance, for a given input file, what sort of format it contains (e.g. based on the file name or which directory it's in)? Or does your task include discovering what format is being used in each file, and then parsing it accordingly?

Assuming that each of the input files is pretty small, I would "slurp" the full file into a single scalar variable rather than read it line-by-line:

    $_ = '';
    if ( open( my $in, "<", $file ) ) {
        local $/;    # slurp mode, limited to this block
        $_ = <$in>;
        close $in;
    }
    if ( ! length $_ ) {    # a file holding just "0" still counts as data
        warn "No data found in $file\n";
        next;
    }
    # now it will be easier to categorize/parse $_ ...

Starting like that, you should be able to create a suitable subroutine for each file type, such that the sub returns a list or hash structure that will go directly into the desired XML output. The sub could just be a regex match or something more complicated (series of matches, and/or split on "\n", and/or Text::xSV parse, and/or whatever).
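To make the "one sub per file type" idea concrete, here is a minimal sketch of a dispatch table. I haven't seen your real formats, so the two formats below ("keyval" and "csv"), the sub names, and the detection regexes are all made up for illustration; swap in whatever actually distinguishes your files.

```perl
use strict;
use warnings;

# Hypothetical formats, invented for illustration only:
#   "keyval": lines like  name = value
#   "csv":    a header row, then one comma-separated data row
my %parser_for = (
    keyval => \&parse_keyval,
    csv    => \&parse_csv,
);

# Guess the format from the slurped content; returns undef if unknown.
sub guess_format {
    my ($text) = @_;
    return 'keyval' if $text =~ /\A\s*\w+\s*=/;
    return 'csv'    if $text =~ /\A[^\n=]*,/;
    return;
}

# One regex match over the whole string yields key/value pairs.
sub parse_keyval {
    my ($text) = @_;
    my %rec = $text =~ /^[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$/mg;
    return \%rec;
}

# Split on "\n", then pair header names with the first data row.
sub parse_csv {
    my ($text) = @_;
    my ( $head, $row ) = grep { /\S/ } split /\n/, $text;
    my @names  = split /\s*,\s*/, $head;
    my @values = split /\s*,\s*/, $row // '';
    my %rec;
    @rec{@names} = @values;
    return \%rec;
}

# Returns a hash ref of fields, or undef if no parser matches.
sub parse_any {
    my ($text) = @_;
    my $format = guess_format($text) or return;
    return $parser_for{$format}->($text);
}
```

With the file content already slurped into $_, the main loop only needs something like `my $rec = parse_any($_) or next;`, and every parser hands back the same kind of hash for the XML step. Adding a new format means adding one sub and one entry in the table.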

As for your code, I strongly recommend that you start with use strict; and you should add a lot more error checking on things like opening files and doing chdir.
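Roughly what I mean by error checking, as a sketch only: the helper name read_from_dir and the wording of the messages are mine, not anything from your script.

```perl
use strict;
use warnings;

# Hypothetical helper: read one file from a directory, failing loudly
# instead of silently carrying on when chdir or open does not work.
sub read_from_dir {
    my ( $dir, $name ) = @_;
    chdir $dir or die "Cannot chdir to $dir: $!\n";
    open( my $fh, '<', $name ) or die "Cannot open $dir/$name: $!\n";
    local $/;    # slurp mode, restored when the sub returns
    my $text = <$fh>;
    close $fh or warn "Problem closing $dir/$name: $!\n";
    return $text;
}
```

If you would rather skip a bad file than stop the whole run, wrap the call in eval { ... } and warn with $@ when it fails.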

I think your handling of the config file might not provide the right sort of flexibility. If each run only applies to a particular "scanpath" directory, and all such directories are always organized the same way, and only contain files of a particular type, then sure, your approach would be workable. But don't you want a single process that can be run once and cover all types of input, rather than having to run it several times with a different config file each time?


In reply to Re: Perl Parser to Handle Any File Format by graff
in thread Perl Parser to Handle Any File Format by Anonymous Monk
