I don't know if this is something that is common knowledge or not, but I figured this out once, and at the time I was overly pleased with myself. Now, wallowing in my hubris from this past epiphany (it felt like one for me anyway), I thought I'd share. I do realize there are simpler ways to read such files via various CPAN modules, but I still consider the code I write to be pretty amateurish (I'm actually an Electronics Engineer), so I'm still proud I figured this out at all.
As far as I can tell, Text Tab Delimited type files saved out of Excel will have quotes added around individual tab delimited fields that have certain characters inside those fields, presumably so Excel can better handle them itself. My "trick" is essentially exploiting this to efficiently resolve lines that are read in from such files and split on tabs when the fields themselves can have tabs and/or newlines inside of them.
while (my $line = <$TD_fh>) { chomp($line); my @data = split /\t/, $line; # This block of code strips tabs and newlines from inside individu +al tab delimited data fields. # Each individual tab or newline is replaced with a single space c +haracter. my $last_index = $#data; for (my $field_index=0; $field_index<$last_index; $field_index++) { if (($data[$field_index] =~ tr/"//) % 2 == 1) { my $new_string = "$data[$field_index] $data[$field_index+1 +]"; splice @data, $field_index, 2, $new_string; $line = join "\t", @data; $last_index--; $field_index--; } } if (($data[$last_index] =~ tr/"//) % 2 == 1) { $line .= " " . <$TD_fh>; redo; } # Make use of the resolved line and/or data however you please her +e. }
So essentially it just comes down to always checking each field for an even number of quotes, and if that isn't found, some action needs to be taken (as far as I can tell, Excel "escapes" quotes themselves by adding another quote beside it). For all but the last field, the code assumes an extra tab was in there and consolidates the next field with the current one being looked at. For the last field, the code assumes an extra newline was in there and consolidates the next line with the current one being looked at. It's not glamorous or overly complex, but that's actually why I was proud of it. Relative to the other things I tried, it's extremely efficient as well.
I did ultimately write this code in to a subroutine that returned the resolved split on tabs array for a line, but the biggest change to the code there is just replacing the "redo" line with a recursive call to the subroutine assigned to its local "@data" array.
# This subroutine accepts a filehandle and a line read from that fileh +andle as arguments given in that order, it is meant for "Text (Tab de +limited)" type files. # It strips tabs and newlines from inside individual tab delimited dat +a fields. # It will modify the line that was passed to it (as if passed by refer +ence) to resolve it, and return an array of the completely resolved l +ine split on tabs. sub resolve_tab_delimited_file_line { my $fh = $_[0]; chomp($_[1]); # $_[1] being the read line passed in to this subrou +tine that is to be modified if necessary (as if passed by reference) my @data = split /\t/, $_[1]; my $last_index = $#data; for (my $field_index=0; $field_index<$last_index; $field_index++) { if (($data[$field_index] =~ tr/"//) % 2 == 1) { splice @data, $field_index, 2, "$data[$field_index] $data[ +$field_index+1]"; $_[1] = join "\t", @data; $last_index--; $field_index--; } } if (($data[$last_index] =~ tr/"//) % 2 == 1) { $_[1] .= " " . <$fh>; @data = &resolve_tab_delimited_file_line; } return @data; }
Any thoughts from the monastery? Just curious if the monks think this is cool, stupid, obvious, irrelevant, or anything else? Or for that matter good or bad? Or can anyone point out any glaring oversight in the code where it isn't dealing with something that would make it choke and die a miserable death if encountered? That last one would make most of my office mates happy. I'm one of only two Perl Advocates swimming in a sea of Python Zealots, a group of which I recently read is sometimes referred to as a "smug". *snicker*
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Remove Tabs and Newlines Inside Fields of Text Tab Delimited Files from Excel
by choroba (Cardinal) on Jul 13, 2016 at 15:25 UTC | |
by perldigious (Priest) on Jul 13, 2016 at 15:56 UTC | |
by hippo (Archbishop) on Jul 13, 2016 at 16:10 UTC | |
by perldigious (Priest) on Jul 13, 2016 at 16:22 UTC | |
by stevieb (Canon) on Jul 19, 2016 at 12:13 UTC | |
by choroba (Cardinal) on Jul 13, 2016 at 16:07 UTC |