in reply to Re: delimited files
in thread delimited files

The files are flat text files with anywhere from 3 to 50 or more fields. The range of delimiters that I have seen thus far includes - _ * & ^ % $ # @ ! ~ ` < > . : ; € œ þ

Re^3: delimited files
by thundergnat (Deacon) on May 18, 2005 at 15:06 UTC

    If you can assume the same number of fields in each line, you can try counting each possible delimiter over the first 5 lines or so and seeing which one returns a reasonable result.

    This is a little rough, and could use better error checking, but something along these lines should work.

    Run this in the directory containing the csv files. It assumes file extensions of .csv and saves the "corrected" files as filename.csv.new. Modify to suit.

    Update: edited script slightly to remove useless use of array.

    ###############################################
    use warnings;
    use strict;

    my @delimiters = ('_', '*', '&', '^', '%', '$', '#', '@', '!', '~', '`',
                      '<', '>', '.', ':', ';', '€', 'œ', 'þ', ',');
    my @files = glob('*.csv'); # or whatever
    my %likely;

    for my $file (@files){
        # Skip files that can't be opened rather than reading a bad handle.
        my $fh;
        unless (open $fh, '<', $file){
            warn "Couldn't open $file. $!";
            next;
        }
        my %delim_count;
        # Count every candidate delimiter over the first 5 lines.
        for my $count (1..5){
            my $line = <$fh>;
            last unless defined $line;   # file has fewer than 5 lines
            for (@delimiters){
                my $testline = $line;
                # s///g returns the number of substitutions it made.
                $delim_count{$_}{total} += $testline =~ s/\Q$_\E//g;
            }
        }
        # A plausible delimiter shows up more than twice in total, and its
        # total divides evenly by 5 (same count on each of the 5 lines).
        for (@delimiters){
            if (defined $delim_count{$_}{total}
                and $delim_count{$_}{total} > 2
                and $delim_count{$_}{total} % 5 == 0){
                no warnings 'uninitialized';
                $likely{$file} = $_
                    if ($delim_count{$_}{total} > $delim_count{$likely{$file}}{total});
            }
        }
        print "Most likely delimiter for $file is $likely{$file}\n"
            if defined $likely{$file};
    }

    # Loop over @files (not keys %likely) so that files without a clear
    # delimiter reach the "Ambiguous" branch.
    for my $file (@files){
        if (defined $likely{$file}){
            next if ($likely{$file} eq ',');   # already comma delimited
            print "Updating $file....\n";
            my ($csv,$output);
            unless (open $csv, '<', $file){
                warn "Couldn't open $file. $!";
                next;
            }
            unless (open $output, '>', "$file.new"){
                warn "Couldn't open $file.new for writing. $!";
                next;
            }
            # Replace every occurrence of the guessed delimiter with a comma.
            while (<$csv>){
                s/\Q$likely{$file}\E/,/g;
                print $output $_;
            }
        }else{
            print "Ambiguous delimiter for file $file\n";
        }
    }
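
    The "total divisible by 5" test above can be fooled when a character happens to add up to a multiple of 5 without appearing equally often on each line. A stricter variant of the same idea, as a rough sketch only: require the identical non-zero count on every sampled line, and give up on a tie. The sub name guess_delimiter and the sample data are made up for illustration; they are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the script above): return the candidate
# that occurs the same non-zero number of times on every sampled line.
# Returns undef if nothing qualifies or if the top count is tied.
sub guess_delimiter {
    my ($lines, $candidates) = @_;
    my ($best, $best_count, $tie) = (undef, 0, 0);
    for my $delim (@$candidates) {
        # Count occurrences of this candidate on each sampled line.
        my @counts = map { my $n = () = /\Q$delim\E/g; $n } @$lines;
        next unless @counts and $counts[0] > 0;
        # Skip candidates whose count differs from line to line.
        next if grep { $_ != $counts[0] } @counts;
        if ($counts[0] > $best_count) {
            ($best, $best_count, $tie) = ($delim, $counts[0], 0);
        }
        elsif ($counts[0] == $best_count) {
            $tie = 1;
        }
    }
    return $tie ? undef : $best;
}

# Made-up sample: ';' appears twice on every line, '.' varies.
my @sample = ('a;b;c.d', 'e;f;g', 'h;i;j.k.l');
my $guess  = guess_delimiter(\@sample, [';', '.', ',']);
print defined $guess ? "Guessed delimiter: $guess\n" : "Ambiguous\n";
```

    This still assumes the delimiter never shows up inside a field. If that can happen in your data, neither approach will be reliable without looking at quoting as well.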