trolis has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perlmonks,

I have written a simple Perl script (which works just fine) to open a CSV file, skip the two header lines, and reformat the columns from 400;0,34567; to 400 0.34567 (change semicolon to space and comma to dot). I would appreciate your ideas on how to improve the following code:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new;

open (CSV, "<File_name.csv") or die $!;
while (<CSV>) {
    next if (1 .. 2);
    $_ =~ s/\;/ /g;
    $_ =~ s/,/\./g;
    if ($csv->parse($_)) {
        my @columns = $csv->fields();
        print "@columns\n";
    }
    else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close CSV;

Somehow I have a feeling that this code is not efficient, but it works. Is it worth improving it, or should I just use it as it is? I know I could do this simple character replacement directly in vim, or with sed or awk, but I want to use Perl instead. Thank you!

Replies are listed 'Best First'.
Re: ideas on how to improve existing code
by GrandFather (Saint) on Aug 29, 2011 at 09:55 UTC

    There is no efficiency issue of note. In particular, if it does the job you need fast enough, it is sufficiently efficient. However, there is a bunch of stuff that can be improved.

    For a start, you should always use the three-argument version of open and lexical file handles:

    open my $csvIn, '<', $fileName or die "Failed to open $fileName: $!";

    It also helps if error strings provide more than the bare minimum of information. For an open, including the file name that was actually used (rather than hoping it is what you think it ought to be) helps a huge amount.

    Skipping the first two lines in the loop that handles the rest of the lines is OK, but not nearly as clear as skipping them before the loop:

    <$csvIn> for 1 .. 2;
    while (defined (my $line = <$csvIn>)) {

    Using the tr/// (translate) operator for simple character translation is fast and clear:

    $line =~ tr/;,/ ./;

    Although if you were to use regular expressions, you should note that there is no need to escape ; in a pattern, and no need to escape anything in the replacement string (except stuff you'd escape if it were a double quoted string).
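    For instance, the OP's two substitutions can be written without any backslashes:

    $line =~ s/;/ /g;    # ';' is not a regex metacharacter
    $line =~ s/,/./g;    # '.' is literal in the replacement string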

    Using consistent indentation helps a lot!

    I prefer to make function calls explicit by supplying parentheses even if they aren't strictly required. At the very least you should be consistent (cf. $csv->fields and $csv->error_input).

    Taking that all together you'd end up with something like the following (untested) code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new ();
    my $fileName = 'File_name.csv';

    open my $csvIn, '<', $fileName or die "Can't open $fileName: $!";
    <$csvIn> for 1 .. 2;

    while (defined (my $line = <$csvIn>)) {
        $line =~ tr/;,/ ./;
        if ($csv->parse ($line)) {
            my @columns = $csv->fields ();
            print "@columns\n";
        } else {
            my $err = $csv->error_input;
            print "Failed to parse line $.: $err";
        }
    }
    close $csvIn;
    True laziness is hard work
Re: ideas on how to improve existing code
by Tux (Canon) on Aug 29, 2011 at 13:39 UTC

    As you are already familiar with Text::CSV, please keep using its getline method instead of falling back to perl's <> readline. Things turn nasty quite fast when the CSV contains nested quotation, Unicode, or embedded newlines. (This paragraph was updated, as GrandFather's code just copied the OP's reading method.)
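    For instance, this (made-up) record is a single CSV row:

    400;"a note with an
    embedded newline";0,5

    Read with <>, it arrives as two broken fragments; getline returns the one three-field row intact.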

    Check whether you have Text::CSV_XS installed. Text::CSV is just a wrapper module over Text::CSV_PP and/or Text::CSV_XS. The XS version is about 100 times faster than the PP version, and with lots of columns and/or rows the difference adds up quite fast (see this graph).
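    A quick way to check what you have (a sketch; the second one-liner assumes a Text::CSV recent enough to provide the backend class method, which reports the module it loaded):

    $ perl -MText::CSV_XS -e1 && echo XS installed
    $ perl -MText::CSV -le 'print Text::CSV->backend'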

    Update: I extended the speed comparison tests with a plain split approach (which obviously breaks on nested sep_char, newlines, and other problematic content). Even then, Text::CSV_XS can outperform plain perl! See here.

    Use the getline method. Do not mix perl's readline (<>) with the parse method.

    (Already mentioned) Use three-arg open calls and lexical handles.

    Use builtin error reporting (the auto_diag attribute). No need for else-branches at all.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature "say";  # say () is used below
    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new ({
        binary    => 1,  # Always do so
        auto_diag => 1,  # Makes finding bugs a whole lot easier
        });

    my $file = "File_name.csv";  # Now you can use it in error reporting
    open my $fh, "<", $file or die "$file: $!";  # Three-arg open

    $csv->getline ($fh) for 1, 2;  # Skip first two lines

    while (my $row = $csv->getline ($fh)) {
        tr/;,/ ./ for @$row;
        say join "\t" => @$row;
        }
    close $fh;

    Enjoy, Have FUN! H.Merijn

      I'm not sure where you get the "perl parsing method" idea from, as my sample code in reply to the OP is a minor refactoring of the OP's code. It uses $csv->parse ($line) to parse individual lines from the file, as does the OP's code.

      However, despite the OP's proffered guess that the code may be inefficient, the plea in the title of the node is "ideas on how to improve existing code", which I address by pointing to maintenance and reliability related good coding practices. People seem to get hung up on "efficiency" issues without any need at all to make the code execute more quickly. Maintenance and reliability are generally much more important. If the OP were not already using a module for parsing CSV, I'd recommend one just from the point of view of saving programmer time and making the code more reliable and maintainable - "efficiency" is generally completely irrelevant!

      True laziness is hard work

        It was nothing personal, but I am on a crusade to prevent people who already use Text::CSV, Text::CSV_XS and/or Text::CSV_PP from using

        while (<$fh>) {
            my @row = $csv->parse ($_);
            ...

        which is what I was referring to as your "perl parsing method" (or any variation thereof), instead of

        while (my $row = $csv->getline ($fh)) {
            ...

        The latter is not only faster, but immensely safer. Under the hood both <> and $csv->getline use perl's getline function and both respect the (perlio) layers, but the former is not able to detect the difference between an embedded $/ inside quotes and one at the end of a line, so it is very open to erroneous behavior. Besides that, CSV's getline is more lenient towards trailing carriage returns and/or newlines (unless you explicitly set the eol attribute).
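        A small self-contained demo of that difference (the data is made up; an in-memory filehandle stands in for a real file):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Text::CSV_XS;

        my $data = qq{1;"two\nlines";3\n};  # ONE record with an embedded newline
        my $csv  = Text::CSV_XS->new ({ binary => 1, sep_char => ";" });

        # Line-based reading splits the record at the embedded newline,
        # so both halves fail to parse:
        open my $fh, "<", \$data or die $!;
        while (<$fh>) {
            print $csv->parse ($_) ? "parse ok\n" : "parse FAILED\n";
        }
        close $fh;

        # getline sees the quoted newline as part of the field:
        open $fh, "<", \$data or die $!;
        my $row = $csv->getline ($fh);
        printf "%d fields, second field is '%s'\n", scalar @$row, $row->[1];
        close $fh;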


        Enjoy, Have FUN! H.Merijn
      Tux, thanks a lot for taking the time to explain the basics to me!
Re: ideas on how to improve existing code
by Anonymous Monk on Aug 29, 2011 at 09:02 UTC

    It is not clear why you are diddling the data, what you call "format the columns", before parsing. You should do that after parsing the CSV, not before.

    Somehow I have a feeling that this code is not efficient, but it works.

    That's just gas, don't worry about it.

    Is it worth improving it, or should I just use it as it is?

    The code is literally 20 lines; there is nothing to improve.

    The only possible thing is to use "modern perl" practices, meaning the three-argument open and a lexical filehandle, and maybe autodie, because the error message is better:

    $ perl -Mautodie -le "open my $fh => '<', '23232'"
    Can't open '23232' for reading: 'No such file or directory' at -e line 1
      Thanks for the reply and comments.

      My initial goal is to convert those files to the default format which gnuplot accepts (space-delimited columns of data with a dot as decimal separator). I have many CSV files which all look slightly different, so the idea was to make them all "the same". For sure there could be many other ways to complete the task, but somehow I have chosen Perl. The script should grow over time as I plan to implement more features, but I have just started.
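      For what it's worth, a minimal sketch of where that could go, assuming the separator varies per file and is passed as an argument (the file name and separator here are placeholders):

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Text::CSV;

      my ($file, $sep) = @ARGV;                 # e.g.: normalize.pl data.csv ';'
      $sep ||= ';';

      my $csv = Text::CSV->new ({ binary => 1, sep_char => $sep, auto_diag => 1 });

      open my $fh, '<', $file or die "$file: $!";
      $csv->getline ($fh) for 1, 2;             # skip the two header lines
      while (my $row = $csv->getline ($fh)) {
          s/,/./g for @$row;                    # decimal comma -> decimal dot
          print join (' ', @$row), "\n";        # space-delimited for gnuplot
      }
      close $fh;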

Re: ideas on how to improve existing code
by JavaFan (Canon) on Aug 29, 2011 at 11:27 UTC
    Is it worth improving it, or should I just use it as it is?
    You tell us! Does it work for you? Is it fast enough for you? If the answer is "yes" to both questions, leave it alone and do something else with your time.
      As I said, it works, but I was looking, let's say, for a more proper way to get the same result.

      Thanks to everybody, and especially you, GrandFather, for the suggestions & advice.

Re: ideas on how to improve existing code
by i5513 (Pilgrim) on Aug 29, 2011 at 13:23 UTC
    If the CSV is simple enough and you know it comes from a tool which produces CSV files correctly, I don't see why you'd use Text::CSV. Maybe a simple substitution would work?
    open (CSV, "<File_name.csv") or die $!;
    while (<CSV>) {
        next if (1 .. 2);
        s/;/ /g;
        s/(\d),(\d)/$1.$2/g;
        print;
    }
      Note that this breaks on any CSV data that contains quoted strings with embedded commas:
      this,line,has,6,"fields, not",7
      Just use Text::CSV and you don't have to worry about things like that.