
I don't know if this is something that is common knowledge or not, but I figured this out once, and at the time I was overly pleased with myself. Now, wallowing in my hubris from this past epiphany (it felt like one for me anyway), I thought I'd share. I do realize there are simpler ways to read such files via various CPAN modules, but I still consider the code I write to be pretty amateurish (I'm actually an Electronics Engineer), so I'm still proud I figured this out at all.

As far as I can tell, "Text (Tab delimited)" files saved out of Excel will have quotes added around individual tab delimited fields that have certain characters inside them, presumably so Excel can better handle them itself. My "trick" is essentially exploiting this to efficiently resolve lines that are read in from such files and split on tabs, when the fields themselves can have tabs and/or newlines inside of them.

while (my $line = <$TD_fh>)
{
    chomp($line);
    my @data = split /\t/, $line;
    
    # This block of code strips tabs and newlines from inside individual tab delimited data fields.
    # Each individual tab or newline is replaced with a single space character.
    my $last_index = $#data;
    for (my $field_index=0; $field_index<$last_index; $field_index++)
    {
        if (($data[$field_index] =~ tr/"//) % 2 == 1)
        {
            my $new_string = "$data[$field_index] $data[$field_index+1]";
            splice @data, $field_index, 2, $new_string;
            $line = join "\t", @data;
            $last_index--;
            $field_index--;
        }
    }
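    if (@data && ($data[$last_index] =~ tr/"//) % 2 == 1)
    {
        # Only merge and redo if there is another line to read; otherwise a file
        # that ends inside an open quoted field would loop forever.
        my $next_line = <$TD_fh>;
        if (defined $next_line)
        {
            $line .= " " . $next_line;
            redo;
        }
    }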
    
    # Make use of the resolved line and/or data however you please here.
}

So essentially it just comes down to always checking each field for an even number of quotes; if a field has an odd number, some action needs to be taken (as far as I can tell, Excel "escapes" a literal quote by doubling it, so a complete field always has an even count). For all but the last field, the code assumes a tab was embedded in the field and merges the next field into the current one. For the last field, the code assumes a newline was embedded and merges the next line into the current one. It's not glamorous or overly complex, but that's actually why I was proud of it. Relative to the other things I tried, it's extremely efficient as well.
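The quote-count test can be tried on its own. Here is a minimal sketch; the helper name has_open_quote and the sample fields are made up for illustration, and the quoting behaviour is assumed to be as described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the code above): true when a field holds
# an odd number of double quotes, meaning a quoted field is still open.
sub has_open_quote
{
    my ($field) = @_;
    return (($field =~ tr/"//) % 2) == 1;
}

# Made-up sample fields.
my @samples = (
    'plain',                     # no quotes at all
    '"has ""escaped"" quotes"',  # six quotes, so the field is complete
    '"starts a field',           # one quote, so the rest follows a tab or newline
);

for my $field (@samples)
{
    printf "%-28s %s\n", $field, has_open_quote($field) ? 'incomplete' : 'complete';
}
```

Only the last sample is reported as incomplete, which is the case where the loop merges in the next field (or the next line).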

I did ultimately turn this code into a subroutine that returns the resolved array of tab-split fields for a line. The biggest change is just replacing the "redo" line with a recursive call to the subroutine, whose result is assigned to its local "@data" array.

# This subroutine accepts a filehandle and a line read from that filehandle as arguments given in that order, it is meant for "Text (Tab delimited)" type files.
# It strips tabs and newlines from inside individual tab delimited data fields.
# It will modify the line that was passed to it (as if passed by reference) to resolve it, and return an array of the completely resolved line split on tabs.
sub resolve_tab_delimited_file_line
{
    my $fh = $_[0];
    chomp($_[1]); # $_[1] being the read line passed in to this subroutine that is to be modified if necessary (as if passed by reference)
    my @data = split /\t/, $_[1];
    
    my $last_index = $#data;
    for (my $field_index=0; $field_index<$last_index; $field_index++)
    {
        if (($data[$field_index] =~ tr/"//) % 2 == 1)
        {
            splice @data, $field_index, 2, "$data[$field_index] $data[$field_index+1]";
            $_[1] = join "\t", @data;
            $last_index--;
            $field_index--;
        }
    }
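    if (@data && ($data[$last_index] =~ tr/"//) % 2 == 1)
    {
        # Only recurse if there is another line to read; otherwise a file that
        # ends inside an open quoted field would recurse forever.
        my $next_line = <$fh>;
        if (defined $next_line)
        {
            $_[1] .= " " . $next_line;
            @data = &resolve_tab_delimited_file_line; # called with the current @_ (same filehandle and line)
        }
    }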
    
    return @data;
}
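To see the subroutine work without needing a real Excel export, it can be fed from an in-memory filehandle. This is only a sketch: the sample data is made up, the subroutine is restated in condensed form so the snippet runs on its own, and the defined() check before recursing is an addition to guard against the file ending inside an open quoted field.

```perl
use strict;
use warnings;

# Condensed restatement of the subroutine so this sketch is self-contained.
sub resolve_tab_delimited_file_line
{
    my $fh = $_[0];
    chomp($_[1]);
    my @data = split /\t/, $_[1];
    return @data unless @data;    # blank line: nothing to resolve

    my $last_index = $#data;
    for (my $i = 0; $i < $last_index; $i++)
    {
        if (($data[$i] =~ tr/"//) % 2 == 1)
        {
            # Odd quote count: a tab was inside the field, so merge the next field in.
            splice @data, $i, 2, "$data[$i] $data[$i+1]";
            $_[1] = join "\t", @data;
            $last_index--;
            $i--;
        }
    }
    if (($data[$last_index] =~ tr/"//) % 2 == 1)
    {
        # Odd quote count in the last field: a newline was inside it.
        my $next_line = <$fh>;
        if (defined $next_line)    # added guard, see the note above
        {
            $_[1] .= " " . $next_line;
            @data = &resolve_tab_delimited_file_line;    # reuses the current @_
        }
    }
    return @data;
}

# Made-up sample: record 1 has a tab inside its second field and a
# newline inside its third field; record 2 is an ordinary line.
my $sample = join '',
    qq{1\t"a\tb"\t"line one\n},
    qq{line two"\n},
    qq{2\tplain\tdone\n};

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @records;
while (my $line = <$fh>)
{
    my @fields = resolve_tab_delimited_file_line($fh, $line);
    push @records, [@fields];
    print join(' | ', @fields), "\n";
}
close $fh;
```

With this sample the three physical lines come back as two records of three fields each. Note that the quotes Excel added around the fields are left in place; only the embedded tab and newline are turned into spaces.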

Any thoughts from the monastery? Just curious if the monks think this is cool, stupid, obvious, irrelevant, or anything else? Or for that matter good or bad? Or can anyone point out any glaring oversight in the code where it isn't dealing with something that would make it choke and die a miserable death if encountered? That last one would make most of my office mates happy. I'm one of only two Perl Advocates swimming in a sea of Python Zealots, a group of which I recently read is sometimes referred to as a "smug". *snicker*

I love it when things get difficult, after all, difficult pays the mortgage. - Dr. Keith Whites
I hate it when things get difficult, so I'll just sell my house and rent cheap instead. - perldigious

In reply to Remove Tabs and Newlines Inside Fields of Text Tab Delimited Files from Excel by perldigious
