in reply to Re: making a single column out of a two-column text file
in thread making a single column out of a two-column text file

Thanks for the help. I really like the idea, but this one looks like it will swap legitimate instances of text in the first column and not in the second. Either column has to be able to be 'blank' without affecting the other one. Or did I miss something important? (a likelihood)

Hooroo

--
Allolex

Re: Re: Re: making a single column out of a two-column text file
by tachyon (Chancellor) on Feb 26, 2003 at 15:19 UTC

    Yes, you have missed something vital. If you require that either column may be randomly blank, and you are using spaces and not "\t" tabs as the separator, you have invalid and unparsable data. Unless you have either fixed column widths or some defined separator structure you are up the proverbial. Consider this:

    A B C D E

    You are chopping off leading spaces which will move both C and E into col 1 but there is no way to assign either to a column unless you have a fixed width or say a tab separator. If the data is really this:

    A\tB \tC D\t E\t

    which is what it should be, you are fine. Just split on the "\t".
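
    As a rough illustration of why a tab separator keeps blank columns intact, here is a minimal sketch. The sample rows and the helper name two_cols are made up for the example and are not taken from the real data.

```perl
use strict;
use warnings;

# Hypothetical sample rows: two columns separated by exactly one tab,
# where either column may be empty.
my @lines = ("A\tB", "\tC", "D\t", "\tE");

# Split one line into exactly two columns. The limit of 2 keeps a
# trailing empty field, which a plain split would discard.
sub two_cols {
    my ($line) = @_;
    my ($left, $right) = split /\t/, $line, 2;
    return (defined $left ? $left : '', defined $right ? $right : '');
}

for my $line (@lines) {
    my ($left, $right) = two_cols($line);
    print "col1='$left' col2='$right'\n";
}
```

    An empty field on either side of the tab comes back as an empty string, so neither column disturbs the other.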

    Did you generate the data yourself? If not, virtually any programmer with half a brain would do column data like:

    # first remove tabs from the data and sub in 4 spaces
    s/\t/    /g for @cols;
    my $row = join "\t", @cols;
    print SOMEFILE $row, "\n";

    This gives you a file you can parse unambiguously, as each and every tab represents a column break. Thus if @cols = ( '', '', 'foo', 'bar', '' ) the resulting record will be "\t\tfoo\tbar\t". A split /\t/, $row, -1 on this record will give back the original col fields unambiguously regardless of the contents of @cols (the -1 limit matters: without it split silently drops trailing empty fields). The price you pay is that you can't allow tabs in your data. If you have to have tabs you would generally substitute in some token (must be very improbable in data) on the way in and remove it on the way out.
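
    A minimal sketch of that round trip, using the same five-field example. The helper names to_row and from_row are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Join fields into one tab-separated record; tabs inside the data
# are replaced by 4 spaces first, so every tab is a column break.
sub to_row {
    my @cols = @_;
    s/\t/    /g for @cols;
    return join "\t", @cols;
}

# Split a record back into fields. The -1 limit keeps trailing
# empty fields, which split drops when no limit is given.
sub from_row {
    my ($row) = @_;
    return split /\t/, $row, -1;
}

my @cols = ('', '', 'foo', 'bar', '');
my $row  = to_row(@cols);

my @back = from_row($row);
print scalar(@back), " fields recovered\n";            # prints 5

my @short = split /\t/, $row;                          # no limit
print scalar(@short), " fields without the limit\n";   # prints 4
```

    When reading such records back from a file, chomp the line before splitting so the newline does not end up in the last field.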

    @cols = ( "foo", "\t", "bar" );
    print "original '@cols' ", scalar @cols, "\n";
    s/\t/<%tab%>/g for @cols;
    $row = join "\t", @cols;
    print "row '$row'\n";
    @ret_cols = split "\t", $row;
    s/<%tab%>/\t/g for @ret_cols;
    print "retrieve '@ret_cols' ", scalar @ret_cols, "\n";

    __DATA__
    original 'foo bar' 3
    row 'foo <%tab%> bar'
    retrieve 'foo bar' 3

    I suspect that you do not realise that the original programmer used "\t" as the col separator. When you use "\s" in a split it will split on tabs, spaces and newlines. I would try a straight split "\t" and drop the s/^\s+//, which may well produce the results you want.
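
    To make the difference concrete, here is a small sketch comparing the two approaches on one line. The sample line is invented, and the trim-then-split code is only my guess at roughly what the earlier attempt did.

```perl
use strict;
use warnings;

# Hypothetical line: empty first column, text in the second column,
# with a single tab as the column separator.
my $line = "\tsecond column text";

# Trim leading whitespace, then split on any run of whitespace.
(my $trimmed = $line) =~ s/^\s+//;
my @by_space = split /\s+/, $trimmed;

# Plain tab split with no trimming.
my @by_tab = split /\t/, $line, -1;

print "by \\s+: ", scalar(@by_space), " fields, first is '$by_space[0]'\n";
print "by \\t:  ", scalar(@by_tab), " fields, first is '$by_tab[0]'\n";
```

    The whitespace split yields three word fragments and loses the fact that column 1 was empty. The tab split yields two fields, the first of them empty.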

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Thanks again for your help. I went and took a closer look at the input data, which is ghostscript output from pdftotext. od showed me that the program really is producing spaces in all the spots I thought it was. (I think I originally figured this out when I loaded the file into a text editor and found myself using 'delete' a lot.) Anyway, so what you said doesn't apply 100 percent to this problem, but it will help me a lot in the future, I'm sure.

      I really have to agree with you about the intelligence matter, but I rely on ghostscript a lot, so I'm stuck. Plus all of this is going to come in handy when I create my research database.

      od -c output:

      --- START UGLY BIT ---
      0013040 h i c h w e c h o s e b e
      0013060 c a u s e o f i t s r e l
      0013100 a t i v e l y s i m -
      0013120 t i f i c a t i o n w
      0013140 a s e a s i e r w i t h s
      0013160 p o n t a n e o u s s p e e c
      --- END UGLY BIT ---
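
      Since the separator really is spaces, the fixed-width route mentioned earlier in the thread is probably the one left. This is only a sketch under a big assumption: that the second column always starts at the same character offset. The offset of 40 and the names cut_line and single_column are hypothetical, and the real offset would have to be measured in the pdftotext output.

```perl
use strict;
use warnings;

# ASSUMPTION: the second column starts at character offset 40.
# This width is hypothetical; measure it in the real file.
my $LEFT_WIDTH = 40;

# Cut one line into (left, right) by position. The "A" format strips
# trailing spaces, and a short line simply yields an empty right column.
sub cut_line {
    my ($line) = @_;
    my ($left, $right) = unpack "A$LEFT_WIDTH A*", $line;
    $left  = '' unless defined $left;
    $right = '' unless defined $right;
    $right =~ s/^\s+//;
    return ($left, $right);
}

# Return the whole left column followed by the whole right column,
# skipping positions where a column was blank.
sub single_column {
    my (@left, @right);
    for my $line (@_) {
        my ($l, $r) = cut_line($line);
        push @left,  $l if length $l;
        push @right, $r if length $r;
    }
    return (@left, @right);
}
```

      Called as single_column(@lines) on chomped input lines, this returns the left column's text first and then the right column's, so a blank in one column never shifts the other.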

      --
      Allolex