Remove Tabs and Newlines Inside Fields of Text Tab Delimited Files from Excel

I don't know if this is something that is common knowledge or not, but I figured this out once, and at the time I was overly pleased with myself. Now, wallowing in my hubris from this past epiphany (it felt like one for me anyway), I thought I'd share. I do realize there are simpler ways to read such files via various CPAN modules, but I still consider the code I write to be pretty amateurish (I'm actually an Electronics Engineer), so I'm still proud I figured this out at all.

As far as I can tell, Text Tab Delimited type files saved out of Excel will have quotes added around individual tab delimited fields that have certain characters inside those fields, presumably so Excel can better handle them itself. My "trick" is essentially exploiting this to efficiently resolve lines that are read in from such files and split on tabs when the fields themselves can have tabs and/or newlines inside of them.
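
To make the problem concrete, here is a minimal sketch of what a naive split does to such a line. The sample line is made up for illustration (three columns, the middle one containing a tab), and the variable names are assumptions rather than anything from a real export.

```perl
use strict;
use warnings;

# Hypothetical exported line: three columns, the second containing an
# embedded tab, so Excel wraps that field in double quotes.
my $line = qq{42\t"left\tright"\tdone};

# A naive split on tabs yields four pieces instead of three columns.
my @pieces = split /\t/, $line;
print scalar(@pieces), " pieces\n";

# The two halves of the broken field each hold an odd number of quotes,
# which is the clue the code below relies on.
for my $piece (@pieces) {
    my $quotes = ($piece =~ tr/"//);   # tr with an empty replacement only counts
    print "$piece => $quotes quote(s)\n";
}
```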

while (my $line = <$TD_fh>)
{
    chomp($line);
    my @data = split /\t/, $line;
    
    # This block of code strips tabs and newlines from inside individual tab delimited data fields.
    # Each individual tab or newline is replaced with a single space character.
    my $last_index = $#data;
    for (my $field_index=0; $field_index<$last_index; $field_index++)
    {
        if (($data[$field_index] =~ tr/"//) % 2 == 1)
        {
            my $new_string = "$data[$field_index] $data[$field_index+1]";
            splice @data, $field_index, 2, $new_string;
            $line = join "\t", @data;
            $last_index--;
            $field_index--;
        }
    }
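    if (($data[$last_index] =~ tr/"//) % 2 == 1)
    {
        my $next_line = <$TD_fh>;
        if (defined $next_line) # At end of file, fall through instead of redoing forever.
        {
            $line .= " " . $next_line;
            redo;
        }
    }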
    
    # Make use of the resolved line and/or data however you please here.
}

So essentially it just comes down to always checking each field for an even number of quotes, and if that isn't found, some action needs to be taken (as far as I can tell, Excel "escapes" quotes themselves by adding another quote beside it). For all but the last field, the code assumes an extra tab was in there and consolidates the next field with the current one being looked at. For the last field, the code assumes an extra newline was in there and consolidates the next line with the current one being looked at. It's not glamorous or overly complex, but that's actually why I was proud of it. Relative to the other things I tried, it's extremely efficient as well.
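
The even/odd rule can be sketched in isolation like this. The helper name `has_unbalanced_quotes` is hypothetical (it does not appear in the code above), and the sample strings are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a piece holds an odd number of double
# quotes, meaning a quoted field was cut short by a tab or newline.
sub has_unbalanced_quotes {
    my ($piece) = @_;
    return (($piece =~ tr/"//) % 2) == 1;
}

# A complete quoted field, even with doubled inner quotes, has an even count.
print has_unbalanced_quotes(q{"He said ""hi"""}) ? "open" : "closed", "\n";

# A field cut short by an embedded tab or newline has an odd count.
print has_unbalanced_quotes(q{"first half}) ? "open" : "closed", "\n";
```

Because Excel doubles any literal quote inside a quoted field, escaped quotes always contribute an even number and never upset the parity.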

I did ultimately wrap this code in a subroutine that returns the resolved array of tab-split fields for a line. The biggest change to the code there is replacing the "redo" line with a recursive call to the subroutine, whose result is assigned to its local "@data" array.

# This subroutine accepts a filehandle and a line read from that filehandle as arguments given in that order, it is meant for "Text (Tab delimited)" type files.
# It strips tabs and newlines from inside individual tab delimited data fields.
# It will modify the line that was passed to it (as if passed by reference) to resolve it, and return an array of the completely resolved line split on tabs.
sub resolve_tab_delimited_file_line
{
    my $fh = $_[0];
    chomp($_[1]); # $_[1] being the read line passed in to this subroutine that is to be modified if necessary (as if passed by reference)
    my @data = split /\t/, $_[1];
    
    my $last_index = $#data;
    for (my $field_index=0; $field_index<$last_index; $field_index++)
    {
        if (($data[$field_index] =~ tr/"//) % 2 == 1)
        {
            splice @data, $field_index, 2, "$data[$field_index] $data[$field_index+1]";
            $_[1] = join "\t", @data;
            $last_index--;
            $field_index--;
        }
    }
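    if (($data[$last_index] =~ tr/"//) % 2 == 1)
    {
        my $next_line = <$fh>;
        if (defined $next_line) # At end of file, stop instead of recursing forever.
        {
            $_[1] .= " " . $next_line;
            @data = &resolve_tab_delimited_file_line;
        }
    }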
    
    return @data;
}
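
A minimal usage sketch, run against an in-memory filehandle so it needs no real file. The sample data is made up, and the subroutine is repeated in condensed form (with an end-of-file guard added as an assumption) only so the example runs on its own.

```perl
use strict;
use warnings;

# Condensed copy of the subroutine above so this sketch is self-contained.
sub resolve_tab_delimited_file_line {
    my $fh = $_[0];
    chomp($_[1]);
    my @data = split /\t/, $_[1];
    my $last_index = $#data;
    for (my $i = 0; $i < $last_index; $i++) {
        if (($data[$i] =~ tr/"//) % 2 == 1) {
            # Odd quote count: an embedded tab split this field, so rejoin it.
            splice @data, $i, 2, "$data[$i] $data[$i+1]";
            $_[1] = join "\t", @data;
            $last_index--;
            $i--;
        }
    }
    if (($data[$last_index] =~ tr/"//) % 2 == 1) {
        # Odd quote count in the last field: an embedded newline split the line.
        my $next_line = <$fh>;
        if (defined $next_line) {   # assumed guard against end of file
            $_[1] .= " " . $next_line;
            @data = &resolve_tab_delimited_file_line;
        }
    }
    return @data;
}

# Made-up sample: a header, a row with an embedded tab, a row with an embedded newline.
my $sample = qq{id\tnote\tstatus\n}
           . qq{1\t"has\ta tab"\tok\n}
           . qq{2\t"spans\ntwo lines"\tok\n};
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @rows;
while (my $line = <$fh>) {
    my @fields = resolve_tab_delimited_file_line($fh, $line);
    push @rows, \@fields;
    print scalar(@fields), " fields: ", join(' | ', @fields), "\n";
}
close $fh;
```

Note that the surrounding quotes Excel added are left in place; only the embedded tabs and newlines are replaced with spaces.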

Any thoughts from the monastery? Just curious if the monks think this is cool, stupid, obvious, irrelevant, or anything else? Or for that matter good or bad? Or can anyone point out any glaring oversight in the code where it isn't dealing with something that would make it choke and die a miserable death if encountered? That last one would make most of my office mates happy. I'm one of only two Perl Advocates swimming in a sea of Python Zealots, a group of which I recently read is sometimes referred to as a "smug". *snicker*

I love it when things get difficult, after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

Re: Remove Tabs and Newlines Inside Fields of Text Tab Delimited Files from Excel by choroba (Cardinal) on Jul 13, 2016 at 15:25 UTC
If you want to see some wild data to test your code against, check the tests of Text::CSV_XS and/or Text::CSV_PP.
($q=q:Sq=~/;[c](.)(.)/;chr(-\|\|-\|5+lengthSq)`"S\|oS2"`map{chr \|+ord }map{substrSq`S_+\|`\|}3E\|-\|`7**2-3:)=~y+S\|`+$1,++print+eval$q,q,a,
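
For comparison, a minimal sketch of reading the same kind of file with Text::CSV_XS, assuming the module is installed from CPAN. The sample data is made up. Unlike the code in the original post, this keeps the embedded tabs and newlines inside the fields rather than replacing them with spaces, and it strips the surrounding quotes.

```perl
use strict;
use warnings;

# Text::CSV_XS is a CPAN module, not core Perl; bail out politely if absent.
eval { require Text::CSV_XS; 1 }
    or do { print "Text::CSV_XS is not installed\n"; exit 0 };

# Made-up sample data: an embedded tab, an embedded newline, a doubled quote.
my $sample = qq{id\tnote\n}
           . qq{1\t"has\ta tab"\n}
           . qq{2\t"spans\ntwo lines"\n}
           . qq{3\t"says ""hi"""\n};
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

# binary => 1 allows embedded newlines; sep_char switches from commas to tabs.
my $csv = Text::CSV_XS->new({ binary => 1, sep_char => "\t", auto_diag => 1 });

my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;   # each $row is an array reference of fields
}
close $fh;
print scalar(@rows), " records read\n";
```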
Re^2: Remove Tabs and Newlines Inside Fields of Text Tab Delimited Files from Excel by perldigious (Priest) on Jul 13, 2016 at 15:56 UTC
You piqued my curiosity. I'm presuming what you meant is that they have a standard predefined set of test files to run code changes against. I see references in the code to test.csv and test.txt, but I don't see links to those files anywhere to look inside and presumably have my face twist in horror (I'm glad I was never dumb enough to try to deal with CSV files myself btw). I have come to the conclusion that I'd like to start learning and using the Perl DBI if I can find some free time at work to do so. Not just to do the types of things I'm already doing in tab delimited files, but if I could actually talk my company's admins into giving me direct read access to our Oracle database I feel like I could suddenly have a lot more power to do useful things. As it is, I suffer from having to output multiple flat files from the web form based database inquiry tool we pay for and then having to parse and cross reference the data in those files myself using Perl... which is actually why I learned, and learned to love, Perl in the first place.
Re^3: Remove Tabs and Newlines Inside Fields of Text Tab Delimited Files from Excel by hippo (Archbishop) on Jul 13, 2016 at 16:10 UTC
Most modules on CPAN come with tests (and those which don't should). When the module is installed by the user there are four steps which should always be run on the unpacked tarball. These are:
1. Produce the makefile or the build script
2. Make/build the distro from the source
3. Run the tests
4. Install into @INC
The tests in step 3 are the ones which we're talking about here and by convention live in the top-level `t/` directory. For example, here's where they are in Text::CSV_XS. You can view the results of the tests being run on a wide variety of platforms at CPAN Testers. You can use these tests as a basis for some you run against your own code to see how it handles the various edge cases, etc.
Re^4: Remove Tabs and Newlines Inside Fields of Text Tab Delimited Files from Excel by perldigious (Priest) on Jul 13, 2016 at 16:22 UTC
Re^5: Remove Tabs and Newlines Inside Fields of Text Tab Delimited Files from Excel by stevieb (Canon) on Jul 19, 2016 at 12:13 UTC
Re^3: Remove Tabs and Newlines Inside Fields of Text Tab Delimited Files from Excel by choroba (Cardinal) on Jul 13, 2016 at 16:07 UTC
It seems the tests either create the files themselves or contain the data in the __DATA__ section.