MajingaZ has asked for the wisdom of the Perl Monks concerning the following question:

I was recently tasked with making CSV files from a list of tab-delimited files.

I slurped the files, since they are only about 10 MB each, and did my substitutions, but the substitutions are slow.

Sample input file looks like:
Foo\tBar\t ... FieldX\n
etc...

I do not need to check for row consistency within the file. The relevant code:
# Read file from $INPUT
local $/ = undef;
my $file = <$INPUT>;

# Remove existing ", sequences, since after converting to "," delimiters
# an embedded ", could cause parsing problems in the output files
$file =~ s/\",/,/g;

# Replace tab with "," (hard-coded for now)
$file =~ s/\t/\",\"/g;

# Insert a closing " at the end of each line and an opening " at the start of the next
$file =~ s/\n/\"\n\"/g;

# Remove dangling " at EOF if $INPUT ends with \n
$file =~ s/\"\z//;

# Prepend " for the first line
$file = '"'.$file;


I looked at Text::CSV_XS and may end up using that, though I still have a few things to figure out about how to do everything I listed.

However, I'm mainly concerned about why these regexes are so expensive. Any insights would be greatly appreciated.

I ran DProf, and the report shows that 98% of the run time is tied to these regexes.

Re: Making CSV Files from Tab Delimited Files
by graff (Chancellor) on Feb 16, 2011 at 03:16 UTC
    I looked at Text::CSV_XS and may end up using that, though I still have a few things to figure out about how to do everything I listed.

    The whole point about using a standard CPAN module like Text::CSV(_XS) is that someone else has already figured out how to do all the things you listed -- and that code has already been debugged and tested -- so all you need to worry about is learning to use the module correctly.

    Plus, there are probably a few things you haven't thought of yet, and there are some choices you might like to have available as options. The module will support those things too.

Re: Making CSV Files from Tab Delimited Files
by AnomalousMonk (Archbishop) on Feb 15, 2011 at 23:36 UTC
    ... I'm ... concerned about why these regexes are so expensive.
    ... 98% of the run time is tied to these regexes.

    We see data being read and munged in the code shown in the OP. The inference I draw from the OP is that the only code not shown is to open the file, write the munged data and close the file. As such, there's almost nothing happening other than data munging done by regexes and, as jwkrahn points out, a possibly expensive string prepend. I don't understand why one would expect the great majority of time to be spent other than in regex execution given the length of the strings involved.

    ... my substitutions ... are slow.

    But what does 'slow' mean? A day? An hour? A minute? My guess is that the regexes shown in the OP would take on the order of a minute or so per file. Is this MajingaZ's experience? How fast do they need to be?
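
    One way to pin down "slow" is to time the substitutions directly. A minimal sketch using the core Benchmark module (the sample file name is hypothetical):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw( timethis );

    # Slurp one sample file (hypothetical name)
    my $filename = 'sample.tabs';
    open my $fh, '<', $filename or die "Can't open $filename: $!";
    my $data = do { local $/; <$fh> };
    close $fh;

    # Run the OP's substitutions on a fresh copy each iteration
    timethis( 10, sub {
        my $file = $data;
        $file =~ s/\",/,/g;
        $file =~ s/\t/\",\"/g;
        $file =~ s/\n/\"\n\"/g;
        $file =~ s/\"\z//;
        $file = '"' . $file;
    } );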

Re: Making CSV Files from Tab Delimited Files
by CountZero (Bishop) on Feb 16, 2011 at 07:28 UTC
    Reading these files line-by-line with Text::CSV both for input and output is the way to do this. You will be amazed by its speed.
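
    For example, a minimal sketch (file names are hypothetical; the same calls work through the plain Text::CSV front end):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV_XS;

    # One object configured for the tab-delimited input,
    # one for the comma-delimited, fully quoted output
    my $tsv = Text::CSV_XS->new( { sep_char => "\t", binary => 1 } );
    my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n", always_quote => 1 } );

    open my $in,  '<', 'file.tabs' or die "Can't open input: $!";
    open my $out, '>', 'file.csv'  or die "Can't open output: $!";

    # Read a row at a time and write it straight back out as CSV
    while ( my $row = $tsv->getline($in) ) {
        $csv->print( $out, $row );
    }

    close $in;
    close $out or die "Can't close output: $!";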

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Making CSV Files from Tab Delimited Files
by Jim (Curate) on Feb 16, 2011 at 00:15 UTC
    I was recently tasked with making CSV files from a list of tab-delimited files.

    This is a trivial text processing task for which Perl is ideally suited.

    I slurped the files, since they are only about 10 MB each, and did my substitutions, but the substitutions are slow.

    Don't do that. There's no good reason to slurp whole files into memory in this situation. It just makes the task more complicated and is undoubtedly causing the slowness you're observing.

    The following Perl script should perform the conversion correctly and be plenty fast enough.

    #!perl

    use strict;
    use warnings;
    use English qw( -no_match_vars );

    local $INPLACE_EDIT = '.bak';

    while (<ARGV>) {
        s/"/""/g;     # double any embedded quotes (standard CSV escaping)
        s/\t/","/g;   # turn each tab into a quoted-comma delimiter
        s/^/"/;       # opening quote for the first field on the line
        s/$/"/;       # closing quote for the last field (before the newline)
        print;        # write the converted line back (needed for in-place editing)
    }
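
    Run it with the tab-delimited files as arguments (the script name here is just an example), e.g. perl tab2csv.pl file1.txt file2.txt; each file is converted in place, with the original kept under a .bak extension.
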
Re: Making CSV Files from Tab Delimited Files
by jwkrahn (Abbot) on Feb 15, 2011 at 22:54 UTC

    It looks like $file = '"'.$file; would be really slow because you are copying the entire file.

    Anyway, it looks like you only need one regular expression (and transliteration):

    my $file = <$INPUT>;

    $file =~ s/(?:(?<=^)|(?<=\t))([^\t\n]*)(?=[\t\n])/"$1"/mg;

    $file =~ tr/\t/,/;
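
    On the prepend itself, two cheaper alternatives as sketches ($OUTPUT is a hypothetical, already-open write handle):

    # Four-argument substr inserts in place; it still shifts the buffer,
    # but avoids allocating a second full copy of the string
    substr $file, 0, 0, '"';

    # Or skip the string surgery entirely and emit the quote
    # directly when writing the result out
    print {$OUTPUT} '"', $file;
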
Re: Making CSV Files from Tab Delimited Files
by wind (Priest) on Feb 15, 2011 at 23:25 UTC
    Don't know why yours is taking so long, but the following is just an amusing in-place alternative:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $filename = 'file.tabs';

    local @ARGV = ($filename);
    local $^I = '.bac';
    while (<>) {
        chomp;
        $_ = join(',', map { qq{"$_"} } split "\t") . "\n";
        print;
    }
    I agree that Text::CSV_XS is probably the way to go though.