saldoman has asked for the wisdom of the Perl Monks concerning the following question:

I have a situation where I want to determine what a file record is delimited by. I have files that have various delimiters and I need to figure it out on the fly. I need to change whatever delimiters that are not comma into commas. Is there a regular exporession that could determine a files field delimiters?

Replies are listed 'Best First'.
Re: delimited files
by polettix (Vicar) on May 18, 2005 at 13:08 UTC
    Maybe Crystal::Sphere could help.

    Jokes apart, you first need to establish if your valid characters and your possible delimiters are disjointed sets. If one file can use ";" as a separator, while it's a perfectly valid char inside the fields of another... it's something that I would not automate.

    Otherwise, if these sets do not intersect (e.g. you only have digits as values inside fields, and possible separators are ";,|:"), you can simply transform each separator character into a comma:

    while (<>) { tr/;:|/,/; print; }

    Update: removed spurious line from example.

    Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')

    Don't fool yourself.
Re: delimited files
by tbone1 (Monsignor) on May 18, 2005 at 13:14 UTC
    A way to determine a file's delimiter on the fly? Not that I know of. In fact, I'm not sure it can be done. Think about it, you could have one file, call it a.txt, that is a CSV file. Your next file, let's call it b.txt just to be creative, could have commas in it but be a pipe-delimited file. b.txt could have underscores, which are the delimiters for the next file ...

      BERTIE: What should we call this one, Jeeves?
      JEEVES: C, sir?
      BERTIE: Well, I suppose "Caeser" is as good a name as any.

    ... called caeser.txt, which just so happens to contain those commas and quotes which are used to delimit fields in a.txt.

    If anyone can come up with a solution to this, I'd like to shake that person's hand ... or rather, at least know who did it and what their reasoning was.

    (BTW, I apologize for the injection of P.G. Wodehouse into the middle of this, but that is a weakness of mine. Please forgive me.)

    --
    tbone1, YAPS (Yet Another Perl Schlub)
    And remember, if he succeeds, so what.
    - Chick McGee

      Try parsing the file 256 times — once using each ASCII character as the field delimiter — and collect statistics on the number of columns that result. The delimiter character that yields the most uniform number of fields per line is the likely one. (Yes, I've actually done this. Works pretty well sometimes.)
Re: delimited files
by dragonchild (Archbishop) on May 18, 2005 at 13:15 UTC
    To expand further on what the prior responses are saying, a file's structure is determined by agreement between the person making the file and the person reading the file. As far as a computer is concerned, a file is just a stream of 1's and 0's. Nothing less, nothing more. It is you, the programmer, that imposes structure on that stream. If you do not know the structure, then it is up to you to discover it.

    The best place to start is your manager. If the person giving you the task has not provided the information you need to complete said task, they need to be made aware of it ASAP.


    • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
    • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
      I totally agree with you, but in this business that request will go unanswered. I was handed a project with a 48 hour TAT and the hard drive of the customers data. The hard drive consisted of 251 GB of data files. Some of these files are considered metadata files which are the delimited files I was describing earlier. With this much data to work there is no way to go through it manually to figure out the delimiters. This is actually somewhat humorous, I said to my manager that this project is nearly void of any meaningful details. His reponse was " You have to make your own details" Nice!
        Your solution

        • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
        • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
Re: delimited files
by holli (Abbot) on May 18, 2005 at 13:16 UTC
    Given a limited set of delimiters, this might work.
    my @delim = (",", ";", ":", "\t"); my $line = "This;is;delimited;with;a;semicolon,;and;there;is;a;comma"; for ( @delim ) { print "possibly delimited with $_\n" if $line =~ /([^$_]+$_){2,}/; } # prints "possibly delimited with ;"
    This does not handle quoted fields. But in my exp. quoted CSV-files are seldom used, so I'll leave it as it is.


    holli, /regexed monk/
Re: delimited files
by thundergnat (Deacon) on May 18, 2005 at 13:01 UTC

    Without knowing more about the structure of your files and/or the range of possible delimiters, this question is impossible to answer in any meaningful way.

      The files are flat text files with anywhere from 3 to 50 or fields. The range of delimiters that I have seen thus far include - _ * & ^ % $ # @ ! ~ ` < > . : ; € œ þ

        If you can assume the same number of fields in each line, you can you can try counting each possible delimiter for the first 5 lines or so, and seeing which returns a reasonable result.

        This is a little rough, and could use better error checking, but something along these lines.

        Run this in the directory containg the csv files. It assumes file extentions of .csv and saves the "corrected" files as filename.csv.new. Modify to suit.

        Update: edited script slightly to remove useless use of array.

        ############################################### use warnings; use strict; my @delimiters = ('_', '*', '&', '^', '%', '$', '#', '@', '!', '~', '` +', '<', '>', '.', ':', ';', '€', 'œ', 'þ', ','); my @files = glob('*.csv'); # or whatever my %likely; for my $file(@files){ open my $fh, '<', $file or warn "Couldn't open $file. $!"; my %delim_count; for my $count (1..5){ my $line = <$fh>; for (@delimiters){ my $testline = $line; $delim_count{$_}{total} += $testline =~ s/\Q$_\E//g; } } for (@delimiters){ if (defined $delim_count{$_}{total} and ($delim_count{$_}{tota +l}) > 2 and ($delim_count{$_}{total}/5 == int($delim_count{$_}{total} +/5))){ no warnings 'uninitialized'; $likely{$file} = $_ if ($delim_count{$_}{total} > $delim_c +ount{$likely{$file}}{total}); } } print "Most likely delimiter for $file is $likely{$file}\n" } for my $file (keys %likely){ if (defined $likely{$file}){ print "Updating $file....\n"; next if ($likely{$file} eq ','); my ($csv,$output); unless (open $csv, '<', $file){ warn "Couldn't open $file. $!"; next; } unless (open $output, '>', "$file.new"){ warn "Couldn't open $file.new for writing. $!"; next; } while (<$csv>){ s/\Q$likely{$file}\E/,/g; print $output $_; } }else{ print "Ambiguous delimiter for file $file\n"; } }
Re: delimited files
by etm117 (Pilgrim) on May 18, 2005 at 13:50 UTC
    You stated above that files have anywhere from 3 to 50 fields, but you weren't clear on whether a single file would have the same number of fields in each line...
    Therefore, my answer makes the assumption that a file has the same number of fields for each line (though different files may have different numbers of fields). If that is not the case, please ignore the rest of the response.

    You can try and split the first 5 lines of a given file on each of the different known deliminators. For each deliminator tested, if the 5 lines you test return the same length array (greater than 1, as that is the default case if there was nothing to split on), then that deliminator is likely to be the correct one.

    While this isn't 100% guaranteeed, many of the files should process correctly this way, and then you may only have a few files to work by hand after verifying yourself what the true deliminator is.

Re: delimited files
by Roy Johnson (Monsignor) on May 18, 2005 at 13:56 UTC
    Your best bet is probably to look at the first few records of the file, tally up all the [^\w\s] (or whatever you consider potential delimiting) chars, and guess that the most common one is the delimiter.

    I hope you're not saying that the delimiter can change from record to record within a file. That would make it an even dodgier proposition.


    Caution: Contents may have been coded under pressure.
Re: delimited files
by BrowserUk (Patriarch) on May 18, 2005 at 15:05 UTC

    Making a few assumptions about the nature of your "delimited" files:

  • The delimiter char should appear in every line.
  • The delimieter char should appear the same number of times in every line.

    It should be possible to analyse the file to determine the delimiter. Something like this may work (untested):

    #! perl -slw use strict; my %charsByLine; my %freqCharsPerLine; my %chars; while( <> ) { chomp; for( split '', $_ ) { $chars{ $_ } = 1; push @{ $charsByLine{ $_ } }, $.; $freqCharsPerLine{ $. }{ $_ }++ ; } } my $last = $.; ## Eliminate chars that do not appear in every line @{ $charsByLine{ $_ } } != $last and delete $chars{ $_ } for keys %cha +rs; ## Eliminate chars where they appear a different number of time per li +ne for my $char ( keys %chars ) { my $previousCount = $freqCharsPerLine{ 1 }{ $char }; for my $line ( 2 .. $last ) { if( $freqCharsPerLine{ $line }{ $char } != $previousCount ) { delete $chars{ $char }; last; } } } if( keys %chars == 1 ) { print "The delimiter for this file is: ", keys %chars; } elsif( keys %chars ) { print "Candidate delimiters for this file are: ", keys %chars; } else { print "Unable to determine a likely candidate for this file!"; }

    Of course, if the files have header lines, or contain quoted items that can contain the delimiter char, then the above assumptions would need to be modified to account for that. But if there is any consistancy in the format of the files, it should be possible to derive a heuristic that would detect the right character in most cases, and flag any anomolies for manual inspection/determination.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: delimited files
by displeaser (Hermit) on May 18, 2005 at 13:25 UTC
Re: delimited files
by Animator (Hermit) on May 18, 2005 at 14:01 UTC

    Here is a possible idea that could work (I'll leave it up to you to implement it)

    Assumption: one record is stored on each line (or you know exactly how many lines define one record).

    Idea: Create a hash which represents the charachters and the numbers of lines they appear on. (Which means: read the file record per record, and add 1 to the correct hash-element for each unique charachter).

    Now look for the charachters that occurs on every line. (meaning loop the hash and look for the elements where the value equals the total number of lines (after excluding alphanumeric charachters etc))

    Now when you are done with that, then you might or might not be there yet. As in, what if there are two charachters that appear on every line? A solution for this could be that you try each charachter as a delimiter, and count how many fields you get... if there is a big difference in the number of fields then there is a big chance you used the wrong delmiter (except ofcourse when the data is really strange)