making a single column out of a two-column text file

allolex has asked for the wisdom of the Perl Monks concerning the following question:

Re: making a single column out of a two-column text file by Enlil (Parson) on Feb 26, 2003 at 01:33 UTC
use strict;

my $pattern = "A29 A*";
my (@lefthand, @righthand);

while ( <DATA> ) {
    chomp;
    my ($lhs, $rhs) = unpack($pattern, $_);
    push(@lefthand, $lhs);
    push(@righthand, $rhs);
}

print "\nLEFTHAND STUFF:\n";
print join "\n", @lefthand;
print "\nRIGHTHAND STUFF:\n";
print join "\n", @righthand;

__DATA__
LHC                          RHC
By deleting the initial      You have removed the
whitespace you are going     whitespace that was
to move what was in the      in front of the left
right hand column all        hand column. This
the way to the left.         approach might be better
                             than what you were trying.
This is because              If nothing else it is a
                             new direction.

-enlil
Re: Re: making a single column out of a two-column text file by allolex (Curate) on Feb 26, 2003 at 03:22 UTC
Well, thanks! This one works well, but the disadvantage is you have to provide a column offset as a constant. I was hoping to find something that found the gap by itself, but you were definitely right about me needing to think in different ways and try a different approach. And this is a very practical approach, quite in the spirit of what I am learning Perl for. -- Allolex
Re: Re: Re: making a single column out of a two-column text file by dbp (Pilgrim) on Feb 26, 2003 at 05:49 UTC
This solution fixes the above problem. It actually should work for pretty much any input so long as it is close to natural English. To be more specific, it will work where the start of the right-hand column is the offset into each line of text that is most likely to have a space in front of it. I tried it out on some arbitrary input. I can't guarantee anything, but it seems likely to work in most cases. If there were more than two columns, it would mess things up. It requires two passes through the text, one to figure out the column index of the right-hand column, the second to split the text into two arrays.

#!/usr/bin/perl -w
use strict;

my %colcnt;
my (@text, @lhs, @rhs);
my $max;
my $maxcnt = 0;

while (<DATA>) {
    chomp;
    push @text, $_;    # save text for later
    my @chars = split //, $_;
    for (my $i = 1; $i < @chars; ++$i) {    # skip first char (no pred)
        $colcnt{$i}++ if $chars[$i] ne ' ' && $chars[$i - 1] eq ' ';
        $colcnt{$i} |= 0;    # make sure it is init for warnings
        $max = $i and $maxcnt = $colcnt{$i} if $colcnt{$i} > $maxcnt;
    }
}

foreach (@text) {
    my ($lhs, $rhs) = unpack("A$max A*", $_);
    push @lhs, $lhs;
    push @rhs, $rhs;
}

print "LHS:\n";
print join "\n", @lhs;
print "\n\nRHS:\n";
print join "\n", @rhs;

__DATA__
This script handles             about how
arbitrary spacing               much
for column skip as              white space is
long as there are               in each column. I
enough lines of text            can't figure out
and a "normal" dist of spaces.  how to
It also doesn't make            do it in one pass though.
any assumpions

Update: As a result of jdporter's approach below, I was thinking that you could make this better by looking for more than one space before the column start (not just $i - 1). If you can set a definite minimum width on the size of the column gap you will improve the probability of this script working.
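The idea in the update (only count a position as a column start when it is preceded by a gap of some minimum width) could be sketched roughly as follows. This is an illustration rather than dbp's code: `find_column_start` and `$min_gap` are hypothetical names, and the sketch assumes tabs have already been expanded to spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based offset that most often begins a
# word preceded by at least $min_gap spaces, or undef if nothing qualifies.
sub find_column_start {
    my ($min_gap, @lines) = @_;
    my %count;
    for my $line (@lines) {
        # Each match is a run of $min_gap or more spaces followed by a
        # non-space; pos() is then the offset of that non-space character.
        while ($line =~ / {$min_gap,}(?=\S)/g) {
            $count{ pos($line) }++;
        }
    }
    # Highest count wins; ties go to the leftmost offset.
    my ($best) = sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;
    return $best;
}

# Sample data built with sprintf so the right-hand column starts at offset 16.
my @lines = (
    sprintf('%-16s%s', 'alpha beta', 'one two'),
    sprintf('%-16s%s', 'gamma',      'three'),
    sprintf('%-16s%s', '',           'four five'),
);

my $start = find_column_start(2, @lines);
for my $line (@lines) {
    my ($lhs, $rhs) = unpack "A$start A*", $line;
    print "[$lhs] [$rhs]\n";
}
```

With a minimum gap of 2, the single spaces between ordinary words no longer vote, so only genuine gutters are counted. The price is the one dbp names: you have to know a lower bound on the gutter width in advance.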
Re: making a single column out of a two-column text file by diotalevi (Canon) on Feb 26, 2003 at 01:41 UTC
As a one-liner: `perl -ne '/(\S*)\s+(\S*)/; push @a,$1 if length $1; push @b,$2 if length $2; END{ print "First column output:\n\n@a\n\nSecond column output:\n\n@b\n\n" }'`

And more readably:

while (<>) {
    # Added a conditional per merlyn's advice
    /(\S*)\s+(\S*)/ or next;
    # /(\S*)\s+(\S*)/;
    push @a, $1 if length $1;
    push @b, $2 if length $2;
}
print "First column output:\n\n@a\n\n",
      "Second column output:\n\n@b\n\n";

Seeking Green geeks in Minnesota
•Re: Re: making a single column out of a two-column text file by merlyn (Sage) on Feb 26, 2003 at 01:54 UTC
`while (<>) { /(\S*)\s+(\S*)/; push @a, $1 if length $1; push @b, $2 if length $2; }`

Broken if the regex ever fails to match. Please don't use $1 except in the conditional testing the regex that you think might match. -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.
Re^3: making a single column out of a two-column text file by diotalevi (Canon) on Feb 26, 2003 at 01:56 UTC
Better as `next unless /(\S*)\s+(\S*)/`?
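A minimal sketch of the habit merlyn describes: touch `$1` and `$2` only inside the branch where the match is known to have succeeded, because after a failed match they do not describe the current line. The helper name `first_two_words` is hypothetical, and the pattern is a simplified stand-in (`\S+` on both sides), not the one from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the first two whitespace-separated words of
# each line, reading $1/$2 only when the match succeeded.
sub first_two_words {
    my (@first, @second);
    for my $line (@_) {
        if ($line =~ /(\S+)\s+(\S+)/) {
            push @first,  $1;
            push @second, $2;
        }
        # Lines that do not match are skipped; $1/$2 are never read for them.
    }
    return (\@first, \@second);
}

my ($first, $second) = first_two_words('a b', 'lonely', 'c d');
print "@$first | @$second\n";    # prints "a c | b d"
```

The `next unless /.../` form suggested above achieves the same thing inside a `while (<>)` loop; the `if` form simply keeps the captures and their use visibly together.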
•Re: Re^3: making a single column out of a two-column text file by merlyn (Sage) on Feb 26, 2003 at 02:20 UTC
Re: Re: making a single column out of a two-column text file by hv (Prior) on Feb 26, 2003 at 02:36 UTC
`/(\S*)\s+(\S*)/ or next;`

The OP was specifically asking about the problem when one of the columns is empty. It seems this solution will just skip any such line, which I don't think was the OP's intention.

Update: as jasonk points out, this is rubbish. I'd misread the `*` as `+`.

Hugo
Re: Re: Re: making a single column out of a two-column text file by jasonk (Parson) on Feb 26, 2003 at 02:45 UTC
No it won't, it matches 0 or more non-white-spaces, followed by one or more white-spaces, followed by 0 or more non-white-spaces. If the first column is empty, that counts as 0 non-white-spaces and $1 will contain the empty string. If the second column is empty that counts as 0 non-white-spaces and $2 will contain the empty string. The only lines that will get skipped are lines that don't contain at least one white-space character.
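The behaviour described above can be checked with a short script. This is a sketch: `two_fields` is a hypothetical wrapper, and the pattern is written out from the description (zero or more non-whitespace, one or more whitespace, zero or more non-whitespace).

```perl
use strict;
use warnings;

# Hypothetical helper: apply the pattern as described and return both
# captures, or an empty list when the line contains no whitespace at all.
sub two_fields {
    my ($line) = @_;
    return $line =~ /(\S*)\s+(\S*)/ ? ($1, $2) : ();
}

for my $line ('left right', '   right', 'left   ', 'nowhitespace') {
    my @f = two_fields($line);
    print @f ? "[$f[0]] [$f[1]]\n" : "no match\n";
}
# prints:
# [left] [right]
# [] [right]
# [left] []
# no match
```

An empty column therefore shows up as an empty capture, not as a skipped line; only a line with no whitespace at all fails to match.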
Re: Re: making a single column out of a two-column text file by allolex (Curate) on Feb 26, 2003 at 03:41 UTC
Hi Diotalevi, Unfortunately, this one didn't quite do the trick. It matches the first word of each column on the same line and prints out the first word/unit in each line, like this: `Indice 1. KETER 4. HESED 131 1. Quando la luce dell'infinito 2. Abbiamo diversi e curiosi orologi 23. L'analogia dei contrari 133 24. Sauvez la faible Aischa 136 2. HOKMAH 25. Questi misteriosi iniziati 139 26. Tutte le tradizioni della terra 141 3. In hanc utilitatem clementes angeli`

Output: `First column output: Indice 1. 2. 3. Second column output: 1. 4. Quando Abbiamo 24. 2. 26. In`

The output reminds me of one of William Burroughs' ideas. Really cool, actually, but not what I had in mind. Obviously better than I could come up with, though. Plus you didn't have the input file to test it. And of course, the big AND... you came up with your code in about five minutes, well 16 minutes, but you were answering other questions, too. -- Allolex
Re^3: making a single column out of a two-column text file by diotalevi (Canon) on Feb 26, 2003 at 05:18 UTC
Oh... yeah, that data is completely different than I expected. Ah well. That unpack solution someone else posted was nice for fixed length fields.
Re: making a single column out of a two-column text file by jdporter (Paladin) on Feb 26, 2003 at 05:01 UTC
Well, the sample data you showed presents a bit of a challenge, if we're to make a general solution. It would be easy, for example, to mistakenly interpret the column of page numbers on the right as a separate column of text. My solution below requires one parameter - the minimum width (in spaces) between legitimate text columns.

sub multicolumn_to_single_column {
    my $min_gutter_width = shift;
    my @lines = @_;

    # the first task is to make a regex pattern from the input data,
    # so that it knows where all the space and non-space columns are.
    my $mask = '';
    $mask |= $_ for @lines;
    # this is a hack: it only works because the ascii space character
    # has only one bit set. Note that this solution probably won't
    # handle input well that uses tabs for spacing.

    my @pattern = do {
        my $p;
        map { $p++ % 2 ? ".{$_}" : "(.{$_})" }
        map { length }
        split /( {$min_gutter_width,})/, $mask
    };
    # you could dump @pattern here to see what's really going on.

    my $ncols = 1+@pattern >> 1;
    my $pattern = join '', @pattern;

    # Now that we have the pattern, use it to parse all the input lines
    # into rows of columns of text. The separating spaces are ignored.
    my @rows = map {
        $_ .= ' ' x (length($mask) - length($_));
        [ /$pattern/ ]    # oooooo!
    } @lines;

    # Finally, we're left with the simple matter of inverting the matrix
    # for output.
    map { my $c = $_; map { $rows[$_][$c] } 0 .. $#rows } 0 .. $#{$rows[0]}
}

# example.
# Note that given sample data requires a gutter width of 5.
# Any less, and the page number column on the right is seen
# as a separate column; any more, and the two columns won't
# be seen as distinct.
my @lines = <DATA>;
chomp @lines;

for ( multicolumn_to_single_column( 5, @lines ) ) {
    print "$_\n";
}

__DATA__
Indice
1. KETER                                    4. HESED                             131
1. Quando la luce dell'infinito
2. Abbiamo diversi e curiosi orologi        23. L'analogia dei contrari          133
                                            24. Sauvez la faible Aischa          136
2. HOKMAH                                   25. Questi misteriosi iniziati       139
                                            26. Tutte le tradizioni della terra  141
3. In hanc utilitatem clementes angeli

Output: `Indice 1. KETER 1. Quando la luce dell'infinito 2. Abbiamo diversi e curiosi orologi 2. HOKMAH 3. In hanc utilitatem clementes angeli 4. HESED 131 23. L'analogia dei contrari 133 24. Sauvez la faible Aischa 136 25. Questi misteriosi iniziati 139 26. Tutte le tradizioni della terra 141`

jdporter The 6th Rule of Perl Club is -- There is no Rule #6.
Re: Re: making a single column out of a two-column text file by allolex (Curate) on Feb 26, 2003 at 13:32 UTC
Lovely! And now all I need to do is figure out how to differentiate between a single-column formatted text (such as the title/author section) and the text itself (in two columns). I've learned so much here. I particularly like the modularity (clarity and reusability) of the code you wrote. -- Allolex
Re: making a single column out of a two-column text file by BrowserUk (Patriarch) on Feb 26, 2003 at 15:56 UTC
If you're trying to come up with a generic algorithm for this, the biggest problem is deciding where the column breaks are and how many there are. This is especially problematic when the possibility for more than two columns or unevenly spaced columns exists. A possible approach to solving this would be to make a first pass over the data using Perl's bitwise string manipulation, ORing (`|`) each line against a mask of spaces. Once a pass is complete, any chars in your mask that remain as spaces are good candidates for column breaks. The longer the sample of data being processed, the more accurate the mask will be, with the non column-break chars tending towards values of chr(255). This assumes that you have already ensured that any tabs in the input data have been expanded to spaces appropriately. In the code below I've shown the output as the first line of the input for comparison. Whether this is useful to you will depend upon your requirements and chosen algorithm. Read more... (4 kB)

..and remember there are a lot of things monks are supposed to be but lazy is not one of them

Examine what is said, not who speaks.

1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong. 2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible 3) Any sufficiently advanced technology is indistinguishable from magic. Arthur C. Clarke.
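The mask idea described above might look something like the following sketch. It is not BrowserUk's code, which is not reproduced here; `column_mask` and `gaps` are hypothetical names, and the input is assumed to be plain ASCII with tabs already expanded to spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: OR every line into one mask string. A position stays
# a space only if no line has a non-space character there. Shorter lines are
# treated as NUL-padded by the string OR, which leaves spaces untouched.
sub column_mask {
    my $mask = '';
    $mask |= $_ for @_;
    return $mask;
}

# Hypothetical helper: return [start, width] for every run of at least
# $min_width spaces that survives in the mask.
sub gaps {
    my ($min_width, $mask) = @_;
    my @gaps;
    while ($mask =~ /( {$min_width,})/g) {
        push @gaps, [ $-[1], length $1 ];
    }
    return @gaps;
}

# Sample data built with sprintf so the right-hand column starts at offset 12.
my @lines = (
    sprintf('%-12s%s', 'one two', 'alpha'),
    sprintf('%-12s%s', 'three',   'beta gamma'),
    sprintf('%-12s%s', '',        'delta'),
);

my $mask = column_mask(@lines);
printf "gap at %d, width %d\n", @$_ for gaps(2, $mask);    # prints "gap at 7, width 5"
```

A surviving run that starts at offset 7 and is 5 wide puts the second column at offset 12, which could then be fed to `unpack` or `substr`. Deciding which surviving runs are real gutters is still left to a minimum-width parameter, as in jdporter's version.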
Re: Re: making a single column out of a two-column text file by allolex (Curate) on Feb 27, 2003 at 01:50 UTC
Thanks for the advice and for the code. I was also thinking if I could find some way for the program to decide where the column breaks are, whether spaces or tabs like tachyon suggested (or whatever other possibilities exist), from there define that whatever as a "column separator" and go from there. I'll now try to figure out a way to make Perl decide if the mask has really found the column separation or not. Once again in your debt... -- Allolex
Re: Re: Re: making a single column out of a two-column text file by dbp (Pilgrim) on Feb 27, 2003 at 05:47 UTC
"I'll now try to figure out a way to make Perl decide if the mask has really found the column separation or not."

This task is going to be really input specific. BrowserUK and I have both shown you ways to calculate the probability that the column break falls at a certain column (although BrowserUK's method is cleaner, more robust, and more fluent perl than my own). I don't really see how you can "check" this result in a general fashion short of applying some machine learning technique that is likely to be less reliable than the probabilistic approach. That said, knowing something about your input, such as the size of the column break, and how many breaks of that size will be found in a line (I'm thinking of the numbers that fall to the right of the rhc here) will let you apply the mask to various inputs with a high likelihood of success.
Re: making a single column out of a two-column text file by tachyon (Chancellor) on Feb 26, 2003 at 13:34 UTC
while ( <INFILE> ) {
    chomp;
    s/^\s+//g;
    my ($first, $second) = split;
    # now if first 'column' was blank then the second will be
    # in $first and $second will be undef so just swap them
    ($first, $second) = ($second, $first) unless $second;
    push @first, $first;
    push @second, $second;
}

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Re: making a single column out of a two-column text file by allolex (Curate) on Feb 26, 2003 at 14:20 UTC
Thanks for the help. I really like the idea, but this one looks like it will swap legitimate instances of text in the first column and not in the second. Either column has to be able to be 'blank' without affecting the other one. Or did I miss something important? (a likelihood) Hooroo -- Allolex
Re: Re: Re: making a single column out of a two-column text file by tachyon (Chancellor) on Feb 26, 2003 at 15:19 UTC
Yes, you have missed something vital. If you require that either column may randomly be blank, and are using spaces and not "\t" tabs as the separator, you have invalid and unparsable data. Unless you have either fixed column widths or some defined separator structure you are up the proverbial. Consider this: `A B C D E`

You are chopping off leading spaces which will move both C and E into col 1 but there is no way to assign either to a column unless you have a fixed width or say a tab separator. If the data is really this: `A\tB \tC D\t E\t` which is what it should be you are fine. Just split on the "\t". Did you generate the data yourself? If not, virtually any programmer with half a brain would do column data like:

# first remove tabs from data and sub in 4 spaces
s/\t/    /g for @cols;
my $row = join "\t", @cols;
print SOMEFILE $row, "\n";

This gives you a file you can parse unambiguously as each and every tab represents a column break. Thus if `@cols = ( '', '', 'foo', 'bar', '' )` the resulting record will be `"\t\tfoo\tbar\t"`. A split "\t" on this record will give back the original col fields unambiguously regardless of the contents of @cols - the price you pay is that you can't allow tabs in your data. If you have to have tabs you would generally substitute in some token (must be very improbable in data) on the way in and remove it on the way out.

@cols = ( "foo", "\t", "bar" );
print "original '@cols' ", scalar @cols, "\n";
s/\t/<%tab%>/g for @cols;
$row = join "\t", @cols;
print "row '$row'\n";
@ret_cols = split "\t", $row;
s/<%tab%>/\t/g for @ret_cols;
print "retreive '@ret_cols' ", scalar @ret_cols, "\n";

__DATA__
original 'foo bar' 3
row 'foo <%tab%> bar'
retreive 'foo bar' 3

I suspect that you do not realise that the original programmer used "\t" as the col separator. When you use "\s" in a split it will split on tabs, spaces and newlines. I would try a straight split "\t" and don't do s/^\s+// which may well produce the results you want.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
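One detail worth checking when round-tripping records like `"\t\tfoo\tbar\t"`: by default Perl's `split` drops trailing empty fields, so a plain `split "\t"` on that record gives back four fields, not five. Passing a negative limit keeps them. A minimal sketch, with `join_record` and `split_record` as hypothetical helper names:

```perl
use strict;
use warnings;

# Hypothetical helpers for a tab-separated record format.
sub join_record  { return join "\t", @_ }

# The -1 limit tells split to keep trailing empty fields.
sub split_record { return split /\t/, $_[0], -1 }

my @cols = ('', '', 'foo', 'bar', '');
my $row  = join_record(@cols);

my @default = split /\t/, $row;      # trailing empty field is dropped
my @back    = split_record($row);    # all five fields survive

printf "%d fields in, %d with default split, %d with limit -1\n",
    scalar @cols, scalar @default, scalar @back;
# prints "5 fields in, 4 with default split, 5 with limit -1"
```

Leading empty fields survive either way, so a blank first column is safe with a plain split; it is the blank last column that needs the limit. One remaining edge case: splitting an empty string always yields zero fields, so a record consisting of a single empty column cannot be recovered this way.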
Re: Re: Re: Re: making a single column out of a two-column text file by allolex (Curate) on Feb 27, 2003 at 01:40 UTC