In my experience parsing and transforming printer files ("report scraping"), I've used regular expression pattern matching more often than substr or unpack. Why? Because there's no guarantee the report data will be consistently aligned in column positions. As it happens, items tend to drift left and right a bit, especially over the lifetime of a report that changes occasionally. Maybe the date was in column positions 33 through 42 for a few years, then somebody modified the report; thereafter, the date was in column positions 23 through 32. Obviously, there could be other variation over time besides the shifting left or right of report items, but this is precisely why, in general, I've found it better to start with regular expression pattern matching right out of the chute. It's more adaptable in the face of variation.
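Here's a quick sketch of the difference. The report lines, the field widths, and the YYYY-MM-DD date format are all made up for illustration, and the function names date_by_column and date_by_regex are hypothetical. The first line has the date in columns 33 through 42; the second has it in columns 23 through 32.

```perl
use strict;
use warnings;

# Hypothetical report lines built with sprintf so the column positions are exact.
# Old layout: date starts at offset 32 (columns 33-42).
my $old = sprintf '%-11s%-21s%-10s%12s', 'INV-1001', 'ACME CO', '2011-03-14', '1,250.00';
# New layout: date starts at offset 22 (columns 23-32).
my $new = sprintf '%-11s%-11s%-10s%12s', 'INV-1002', 'ACME CO', '2012-07-02', '980.50';

# Fixed-position extraction: assumes the date always sits in columns 33-42.
sub date_by_column {
    my ($line) = @_;
    return substr($line, 32, 10);
}

# Pattern-based extraction: finds the first YYYY-MM-DD date wherever it sits.
# Returns undef if the line contains no date.
sub date_by_regex {
    my ($line) = @_;
    my ($date) = $line =~ /\b(\d{4}-\d{2}-\d{2})\b/;
    return $date;
}

for my $line ($old, $new) {
    printf "substr: [%s]  regex: [%s]\n",
        date_by_column($line), date_by_regex($line) // '';
}
```

On the old layout both functions return the date. On the new layout substr picks up part of the amount field instead, while the regex still finds the date.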

I've also found it better (more understandable, more maintainable, etc.) to parse the report into pages or records first, and then to scrape the data from each page or record in a separate step, typically using a function that returns a list or hash of the parsed data.
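Something along these lines is what I mean by the two steps. This is only a sketch under assumptions: I'm pretending the pages are separated by form feeds and that each page holds one record, and the labels, field names, and the functions split_pages and parse_page are invented for the example. A real report will need its own page delimiter and patterns.

```perl
use strict;
use warnings;

# Hypothetical two-page report; pages are separated by form feed characters.
my $report = join "\f",
    "CUSTOMER STATEMENT            PAGE 1\n"
  . "Account: 10023   Date: 2011-03-14\n"
  . "Balance Due:   1,250.00\n",
    "CUSTOMER STATEMENT            PAGE 2\n"
  . "Account:  10057  Date: 2011-03-15\n"
  . "Balance Due:     980.50\n";

# Step 1: break the report into pages, skipping any empty ones.
sub split_pages {
    my ($text) = @_;
    return grep { /\S/ } split /\f/, $text;
}

# Step 2: scrape one page and return the parsed data as a hash.
# A field that isn't found is left undefined.
sub parse_page {
    my ($page) = @_;
    my %rec;
    ($rec{account}) = $page =~ /Account:\s*(\d+)/;
    ($rec{date})    = $page =~ /Date:\s*(\d{4}-\d{2}-\d{2})/;
    ($rec{balance}) = $page =~ /Balance Due:\s*([\d,]+\.\d{2})/;
    $rec{balance} =~ tr/,//d if defined $rec{balance};    # strip thousands separators
    return %rec;
}

for my $page (split_pages($report)) {
    my %rec = parse_page($page);
    print join(', ', map { "$_=" . ($rec{$_} // '?') } sort keys %rec), "\n";
}
```

Keeping the page splitting apart from the field scraping means a layout change usually touches only the patterns in parse_page.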

Jim


In reply to Re^2: Problem with a regex? by Jim
in thread Problem with a regex? by TStanley
