Re: Re: Re: making a single column out of a two-column text file

Yes, you have missed something vital. If you require that either column may be randomly blank, and you are using spaces and not "\t" tabs as the separator, you have invalid and unparsable data. Unless you have either fixed column widths or some defined separator structure you are up the proverbial. Consider this:

A     B
      C
D
   E

You are chopping off leading spaces, which will move both C and E into col 1, but there is no way to assign either to a column unless you have fixed widths or, say, a tab separator. If the data is really this:

A\tB
\tC
D\t
   E\t

which is what it should be, you are fine. Just split on the "\t".
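For what it's worth, here is a minimal sketch of how that split could feed the single-column output this thread is about. The helper name single_column is made up for illustration, the sample lines are assumed to match the tab-separated layout above, and the reading order (all of col 1, then all of col 2) is my assumption about what you want.

```perl
use strict;
use warnings;

# Hypothetical helper: turn tab-separated two-column lines into one column,
# all of column 1 first, then all of column 2. Blank cells are skipped.
sub single_column {
    my @lines = @_;
    my ( @col1, @col2 );
    for my $line (@lines) {
        chomp $line;
        # A limit of -1 keeps a trailing empty field, as in "D\t"
        my ( $left, $right ) = split /\t/, $line, -1;
        for my $cell ( $left, $right ) {
            next unless defined $cell;
            $cell =~ s/^\s+|\s+$//g;    # trim stray spaces inside a cell
        }
        push @col1, $left  if defined $left  && length $left;
        push @col2, $right if defined $right && length $right;
    }
    return ( @col1, @col2 );
}

# Sample data, assumed to look like the tab-separated example above
my @sample = ( "A\tB\n", "\tC\n", "D\t\n", "   E\t\n" );
print "$_\n" for single_column(@sample);
```

Trimming happens only after the split, so the leading spaces in front of E can no longer shift anything between columns.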

Did you generate the data yourself? If not, virtually any programmer with half a brain would write column data like this:

# first replace any tabs in the data with 4 spaces
s/\t/    /g for @cols;
my $row = join "\t", @cols;
print SOMEFILE $row, "\n";

This gives you a file you can parse unambiguously, as each and every tab represents a column break. Thus if @cols = ( '', '', 'foo', 'bar', '' ) the resulting record will be "\t\tfoo\tbar\t". A split /\t/, $record, -1 on this record will give back the original col fields unambiguously regardless of the contents of @cols (the -1 limit matters: without it split silently drops trailing empty fields). The price you pay is that you can't allow tabs in your data. If you have to have tabs you would generally substitute in some token (it must be very improbable in the data) on the way in and swap it back on the way out.

# one of the fields is itself a literal tab
my @cols = ( "foo", "\t", "bar" );
print "original '@cols' ", scalar @cols, "\n";

# swap embedded tabs for a token before joining on tab
s/\t/<%tab%>/g for @cols;
my $row = join "\t", @cols;
print "row      '$row'\n";

# split on tab, then restore the embedded tabs
my @ret_cols = split "\t", $row;
s/<%tab%>/\t/g for @ret_cols;

print "retrieve '@ret_cols' ", scalar @ret_cols, "\n";

__DATA__
original 'foo      bar' 3
row      'foo    <%tab%>    bar'
retrieve 'foo      bar' 3
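One caveat when round-tripping records whose last column is empty: by default Perl's split discards trailing empty fields, and a negative limit keeps them. A small sketch using the five-element @cols example from the text above (the variable names @default and @all are just for illustration):

```perl
use strict;
use warnings;

my @cols = ( '', '', 'foo', 'bar', '' );
my $row  = join "\t", @cols;             # "\t\tfoo\tbar\t"

my @default = split /\t/, $row;          # trailing empty field is dropped
my @all     = split /\t/, $row, -1;      # negative limit keeps it

print scalar(@default), " fields by default\n";           # prints 4
print scalar(@all),     " fields with a limit of -1\n";   # prints 5
```

Leading empty fields survive either way, so only a blank final column is at risk.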

I suspect that you do not realise that the original programmer used "\t" as the col separator. When you use "\s" in a split it will split on tabs, spaces and newlines. I would try a straight split "\t" and skip the s/^\s+//, which may well produce the results you want.
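To make the difference concrete, here is a rough comparison on one assumed sample line (an empty first column followed by C). It is only a sketch of the two approaches, not your actual code:

```perl
use strict;
use warnings;

# Assumed sample line: empty first column, then "C" in the second
my $line = "\tC";

( my $trimmed = $line ) =~ s/^\s+//;      # the tab is whitespace too, so it goes
my @by_space = split /\s+/, $trimmed;     # ("C"): C has moved into column 1
my @by_tab   = split /\t/,  $line, -1;    # ("", "C"): C stays in column 2

print "whitespace split: ", scalar(@by_space), " field(s)\n";
print "tab split:        ", scalar(@by_tab),   " field(s)\n";
```

The tab split keeps the empty first field, which is the only thing telling you that C belongs to col 2.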

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print


Replies are listed 'Best First'.

Re: Re: Re: Re: making a single column out of a two-column text file
by allolex (Curate) on Feb 27, 2003 at 01:40 UTC

Thanks again for your help. I went and took a closer look at the input data, which is ghostscript output from pdftotext. od showed me that the program really is producing spaces in all the spots I thought it was. (I think I originally figured this out when I loaded the file into a text editor and found myself using 'delete' a lot.) Anyway, so what you said doesn't apply 100 percent to this problem, but it will help me a lot in the future, I'm sure.

I really have to agree with you about the intelligence matter, but I rely on ghostscript a lot, so I'm stuck. Plus all of this is going to come in handy when I create my research database.

od -c output:

--- START UGLY BIT ---

0013040   h   i   c   h       w   e       c   h   o   s   e       b   e
0013060   c   a   u   s   e       o   f       i   t   s       r   e   l
0013100   a   t   i   v   e   l   y       s   i   m   -
0013120                   t   i   f   i   c   a   t   i   o   n       w
0013140   a   s       e   a   s   i   e   r       w   i   t   h       s
0013160   p   o   n   t   a   n   e   o   u   s       s   p   e   e   c

--- END UGLY BIT ---

--
Allolex
