johnmck has asked for the wisdom of the Perl Monks concerning the following question:

I am a Perl neophyte when it comes to reading and parsing files. I am definitely in need of wisdom.

I have a 3000-line text file made up of 46-character lines:

C4432882490H019000020150211ESL6690 0H2015PC
C4833076550HC0P0000201412093J46651 0H2015DX
C6033106980H057130020150323FRE7602 0H2015PC
C663160140MT007015G20141124274847A MT2015PC

Character 1 (or 0?) is a field, Characters 2-16 are a field, 17-30 another, and finally 31-46 are the last field. I want to read the file from /tmp (filename is always the same and always in the same place) and parse it into the fields (the counts never change) and then save it as a comma-delimited version.

Since the fields never change length it would seem that unpack is a better choice than substr, is that correct?

Is there a simple bit of code that someone might get me started with, please? Thank you!

Re: Unpack or substr to create CSV?
by toolic (Bishop) on May 02, 2015 at 16:23 UTC
    Since the fields never change length it would seem that unpack is a better choice than substr, is that correct?
    Perhaps. Read perlpacktut, which compares the various approaches. Here is an example of using unpack to parse each line into an array:
    use warnings;
    use strict;

    while (<DATA>) {
        chomp;
        my @cols = unpack 'A1A15A14A16', $_;
        print join(',', @cols), "\n";
    }

    __DATA__
    C4432882490H019000020150211ESL6690 0H2015PC
    C4833076550HC0P0000201412093J46651 0H2015DX
    C6033106980H057130020150323FRE7602 0H2015PC
    C663160140MT007015G20141124274847A MT2015PC
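To tie this back to the original question (read a file from /tmp, write a comma-delimited copy), here is a minimal sketch rather than a finished program. The helper names fixed_to_csv and convert, and the file names in the comments, are made up for illustration. The template ends in A* instead of a fixed width, so it takes whatever remains of the line whether the records are 43 or 46 characters long. Note that the A format strips trailing spaces from each field.

```perl
use strict;
use warnings;

# Split one fixed-width record into four fields and join them with commas.
# Widths 1/15/14 come from the question; 'A*' takes the rest of the line.
sub fixed_to_csv {
    my ($line) = @_;
    chomp $line;
    return join ',', unpack 'A1 A15 A14 A*', $line;
}

# Convert every line read from $in and write the result to $out.
sub convert {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} fixed_to_csv($line), "\n";
    }
    return;
}

# Demo on an in-memory handle. For real use, open the files instead, e.g.
#   open my $in,  '<', '/tmp/input.txt'  or die "Cannot read: $!";   # hypothetical name
#   open my $out, '>', '/tmp/output.csv' or die "Cannot write: $!";  # hypothetical name
my $sample = "C4432882490H019000020150211ESL6690 0H2015PC\n";
open my $in, '<', \$sample or die "Cannot open in-memory input: $!";
convert($in, \*STDOUT);    # prints: C,4432882490H0190,00020150211ESL,6690 0H2015PC
close $in;
```

This plain join is only safe while no field can contain a comma or a double quote; if that might happen, a CSV module is the better tool.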
Re: Unpack or substr to create CSV?
by hdb (Monsignor) on May 02, 2015 at 16:16 UTC

    I only count 43 per line, could this be a source of your problems? There are many ways to do this, for example using regular expressions:

    use strict;
    use warnings;

    while(<DATA>){
        s/(.)(.{15})(.{14})(.{13})/$1,$2,$3,$4/;
        print;
    }

    __DATA__
    0123456789012345678901234567890123456789012
    C4432882490H019000020150211ESL6690 0H2015PC
    C4833076550HC0P0000201412093J46651 0H2015DX
    C6033106980H057130020150323FRE7602 0H2015PC
    C663160140MT007015G20141124274847A MT2015PC

      Not to mention that it's really easy to add data validation with this approach:

      use strict;
      use warnings;

      while(<DATA>){
          s/\A(.)(.{15})(.{14})(.{13})\Z/$1,$2,$3,$4/
              or die "bad record: '$_'";
          print;
      }

      __DATA__
      0123456789012345678901234567890123456789012
      C4432882490H019000020150211ESL6690 0H2015PC
      C4833076550HC0P0000201412093J46651 0H2015DX
      C6033106980H057130020150323FRE7602 0H2015PC
      C663160140MT007015G20141124274847A MT2015PC

Re: Unpack or substr to create CSV?
by AnomalousMonk (Archbishop) on May 02, 2015 at 16:48 UTC
    ... a better choice ...

    Here's the standard <rant>: What the heck is your criterion for "better"? I would gravitate to an unpack solution for fixed-width records, but might your maintainer better understand substr? If so, substr would be better. Are you concerned about speed? For such a small dataset, I doubt there would be any significant difference between the three approaches mentioned so far in this thread, but the only way to tell is to Benchmark. (Update: Do you want to support data validation at all?) And so on... </rant>
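For anyone who does want numbers, a comparison with the core Benchmark module could look roughly like the sketch below. The sub names via_unpack, via_substr and via_regex are invented for this example, and the 1/15/14/rest layout is taken from the sample data in this thread. Timings will differ from machine to machine, so no results are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = 'C4432882490H019000020150211ESL6690 0H2015PC';

# Three ways to split the same 1/15/14/rest layout.
sub via_unpack { return unpack 'A1 A15 A14 A*', $_[0] }

sub via_substr {
    return (
        substr($_[0], 0,  1),
        substr($_[0], 1,  15),
        substr($_[0], 16, 14),
        substr($_[0], 30),
    );
}

# In list context a successful match returns the four captures.
sub via_regex { return $_[0] =~ /\A(.)(.{15})(.{14})(.*)\z/s }

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'unpack' => sub { my @f = via_unpack($line) },
    'substr' => sub { my @f = via_substr($line) },
    'regex'  => sub { my @f = via_regex($line) },
});
```

One difference to keep in mind when comparing: unpack's A format trims trailing spaces from each field, while substr and the regex return them untouched.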


    Give a man a fish:  <%-(-(-(-<

      I agree with AnomalousMonk: for such a small dataset, any approach is probably good enough. Just use the one you understand best and that your maintainer is likely to understand best. I personally would choose substr, because any time I use unpack I need to go through the documentation again, and substr is marginally better than a regex. But a regex would do just about as well for this data size.

      Je suis Charlie.

        After posting the above, I realized that a regex approach might give you data validation, if this was of any concern, almost for free, so I think now that I might incline in this direction. But again, there are too many unstated conditions and requirements to allow more than a hand-waving consideration of alternatives, although this may be valuable to johnmck.


        Give a man a fish:  <%-(-(-(-<

Re: Unpack or substr to create CSV?
by Laurent_R (Canon) on May 02, 2015 at 20:23 UTC
    Hi,

    as I already said in another post on your thread, performance is probably completely irrelevant for the small dataset you are talking about.

    However, just in case you are interested, I ran a detailed benchmark on a very similar problem a bit less than a year and a half ago. The results are here: Re: Performance problems on splitting long strings. You'll see that unpack won the race, but substr wasn't that far behind.

    It did make some difference to me, however, because I was running the processing of two 6-GB files, with the long string to be split representing at least 75% to 80% of the data volume.

    This was just for your information. Again, I don't think you should care at all about that for your low data volumes.

    Je suis Charlie.
Re: Unpack or substr to create CSV?
by Tux (Canon) on May 03, 2015 at 10:50 UTC
    $ perl -MText::CSV_XS=csv -we'csv (in => sub {[ unpack "AA15A14A*", <> // exit ]})' < test.txt
    C,4432882490H0190,00020150211ESL,"6690 0H2015PC"
    C,4833076550HC0P0,000201412093J4,"6651 0H2015DX"
    C,6033106980H0571,30020150323FRE,"7602 0H2015PC"
    C,663160140MT0070,15G20141124274,"847A MT2015PC"

    Enjoy, Have FUN! H.Merijn
Re: Unpack or substr to create CSV?
by karlgoethebier (Abbot) on May 03, 2015 at 16:20 UTC

    AnomalousMonk wrote:

    "...but might your maintainer understand substr better?"

    I don't know, but as no one provided a solution that uses substr, I wrote one:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my @pairs = ( [ 0, 1 ], [ 1, 15 ], [ 16, 14 ], [ 30, 4 ], [ 35, 8 ] );

    while ( my $line = <DATA> ) {
        for my $pair (@pairs) {
            my $index  = $pair->[0];
            my $offset = $pair->[1];
            print substr $line, $index, $offset;
            print qq( );
        }
        print qq(\n);
    }

    __DATA__
    C4432882490H019000020150211ESL6690 0H2015PC
    C4833076550HC0P0000201412093J46651 0H2015DX
    C6033106980H057130020150323FRE7602 0H2015PC
    C663160140MT007015G20141124274847A MT2015PC

    Output:

    karls-mac-mini:monks karl$ ./substring.pl
    C 4432882490H0190 00020150211ESL 6690 0H2015PC
    C 4833076550HC0P0 000201412093J4 6651 0H2015DX
    C 6033106980H0571 30020150323FRE 7602 0H2015PC
    C 663160140MT0070 15G20141124274 847A MT2015PC
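Note that the pairs table above cuts each line into five pieces and skips the space at offset 34, so the output only looks like four fields. If the four-field, comma-delimited layout from the question is what is wanted, a substr variant might be sketched as follows. It assumes the 1/15/14/13 widths counted earlier in the thread; the helper name split_fixed is made up for this example.

```perl
use strict;
use warnings;

# Field widths as counted in this thread: 1 + 15 + 14 + 13 = 43 characters.
my @widths = (1, 15, 14, 13);

# Cut a record into fields with substr, advancing the offset by each width.
sub split_fixed {
    my ($line, @w) = @_;
    my $offset = 0;
    my @fields;
    for my $width (@w) {
        push @fields, substr($line, $offset, $width);
        $offset += $width;
    }
    return @fields;
}

# Two of the sample records from the question.
my @records = (
    'C4432882490H019000020150211ESL6690 0H2015PC',
    'C663160140MT007015G20141124274847A MT2015PC',
);

for my $line (@records) {
    print join(',', split_fixed($line, @widths)), "\n";
}
# prints:
# C,4432882490H0190,00020150211ESL,6690 0H2015PC
# C,663160140MT0070,15G20141124274,847A MT2015PC
```

Keeping the widths in one list means the offsets never have to be worked out by hand, which is where off-by-one mistakes tend to creep in with substr.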

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»