allolex has asked for the wisdom of the Perl Monks concerning the following question:

Dear Brethren,

I have been working on a small script that is supposed to take a text that has been formatted into two columns (with spaces in between the columns) and convert this text into single-column format.

So far, the script does what it is supposed to do... mostly...(ahem) What it does not do properly is handle non-existent left-hand columns--ones that are just spaces. When it processes those lines, the actual right-hand text lands in my @left array.

So, can someone point me in the right direction?

Oh, yes. And a bit of conceptual help...no, no smoove music.

LHC                            RHC
111111111111                   222222222222
                               444444444444
555555555555                   666666666666
777777777777                   888888888888

In other words, it puts "4" into the slot for "3" (which is a bunch of spaces). I tried prefiltering with a substitution regexp, something like s/(\s{31,})/x$1x/;, but it didn't work.

'Nuff said

#!/usr/bin/perl
use strict;
use warnings;

my $input = shift;
my @first;
my @second;
my $counter = 0;

if (!$input) {die "\nUsage: $0 inputfile\n"};

print STDERR "\nConverting file $input.\n";

open (INFILE, "< $input") or die "\nThe input file cannot be opened: $!\n";

while ( <INFILE> ) {
    chomp;
    s/^\s+//g; # get rid of leading spaces
    s/\n//g;
    ($first[$counter],$second[$counter]) = split(/[ ]{3,30}/,$_); # pretty much arbitrarily defined min and max
    $counter++;
}
close(INFILE);

print "First column output:\n\n@first\n\n";
print "Second column output:\n\n@second\n\n";

Any other suggestions for the improvement of my code are welcome, too.

--
Allolex

Replies are listed 'Best First'.
Re: making a single column out of a two-column text file
by Enlil (Parson) on Feb 26, 2003 at 01:33 UTC
    use strict;
    my $pattern = "A29 A*";
    my (@lefthand,@righthand);
    while ( <DATA> ) {
        chomp;
        my ($lhs,$rhs) = unpack ($pattern,$_);
        push (@lefthand,$lhs);
        push (@righthand,$rhs);
    }
    print "\nLEFTHAND STUFF:\n";
    print join "\n", @lefthand;
    print "\nRIGHTHAND STUFF:\n";
    print join "\n", @righthand;
    __DATA__
    LHC                          RHC
    By deleting the initial      You have removed the
    whitespace you are going     whitespace that was
    to move what was in the      in front of the left
    right hand column all        hand column. This
    the way to the left.         approach might be better
                                 than what you were trying.
    This is because              If nothing else it is a
                                 new direction.

    -enlil

      Well, thanks! This one works well, but the disadvantage is you have to provide a column offset as a constant. I was hoping to find something that found the gap by itself, but you were definitely right about me needing to think in different ways and try a different approach. And this is a very practical approach, quite in the spirit of what I am learning Perl for.

      --
      Allolex

        This solution fixes the above problem. It actually should work for pretty much any input so long as it is close to natural English. To be more specific, it will work where the start of the right-hand column is the offset into each line of text that is most likely to have a space in front of it. I tried it out on some arbitrary input. I can't guarantee anything, but it seems likely to work in most cases. If there were more than two columns, it would mess things up. It requires two passes through the text, one to figure out the column index of the right-hand column, the second to split the text into two arrays.
        #!/usr/bin/perl -w
        use strict;

        my %colcnt;
        my (@text, @lhs, @rhs);
        my $max;
        my $maxcnt = 0;

        while (<DATA>) {
            chomp;
            push @text, $_; # save text for later
            my @chars = split //, $_;
            for (my $i = 1; $i < @chars; ++$i) { #skip first char (no pred)
                $colcnt{$i}++ if $chars[$i] ne ' ' && $chars[$i - 1] eq ' ';
                $colcnt{$i} |= 0; # make sure it is init for warnings
                $max = $i and $maxcnt = $colcnt{$i} if $colcnt{$i} > $maxcnt;
            }
        }

        foreach (@text) {
            my ($lhs, $rhs) = unpack("A$max A*", $_);
            push @lhs, $lhs;
            push @rhs, $rhs;
        }

        print "LHS:\n";
        print join "\n", @lhs;
        print "\n\nRHS:\n";
        print join "\n", @rhs;
        __DATA__
        This script handles               about how
        arbitrary spacing                 much
        for column skip as                white space is
        long as there are                 in each column. I
        enough lines of text              can't figure out
        and a "normal" dist of spaces.    how to
        It also doesn't make              do it in one pass though.
        any assumptions
        Update: As a result of jdporter's approach below, I was thinking that you could make this better by looking for more than one space before the column start (not just $i - 1). If you can set a definite minimum width on the size of the column gap, you will improve the probability of this script working.
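        The minimum-gap idea from the update could look roughly like this. It is only a sketch, not the script above: find_column_start and split_at are hypothetical names, the sample rows are made up, and the gap width of 3 is an assumption you would tune to your input.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based offset at which a run of at
# least $min_gap spaces most often ends (i.e. where a column starts).
sub find_column_start {
    my ($min_gap, @lines) = @_;
    my %count;
    for my $line (@lines) {
        while ($line =~ / {$min_gap,}(?=\S)/g) {
            $count{ pos($line) }++;    # pos() is the offset just past the run
        }
    }
    # most frequent offset wins; ties go to the leftmost offset
    my ($best) = sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;
    return $best;                      # undef if no gap was found at all
}

# Hypothetical helper: cut every line at the given offset.
sub split_at {
    my ($offset, @lines) = @_;
    return map { [ unpack "A$offset A*", $_ ] } @lines;
}

# Made-up sample: right-hand column starts at offset 19 on every row.
my @lines = map { sprintf '%-19s%s', @$_ } (
    [ 'one two three', 'alpha beta'    ],
    [ '',              'gamma'         ],
    [ 'four',          'delta epsilon' ],
    [ 'five six',      ''              ],
);

my $offset = find_column_start(3, @lines);
print join(' | ', @$_), "\n" for split_at($offset, @lines);
```

        Because single spaces between words never count as a gap here, ordinary word breaks in the left-hand column cannot outvote the real column start. A line whose left-hand column is empty still votes for the right offset, since its leading run of spaces ends where the right-hand column begins.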
Re: making a single column out of a two-column text file
by diotalevi (Canon) on Feb 26, 2003 at 01:41 UTC

    As a one-liner:

    perl -ne '/(\S*)\s+(\S*)/; push @a,$1 if length $1; push @b,$2 if length $2; END{ print "First column output:\n\n@a\n\nSecond column output:\n\n@b\n\n" }'

    And more readably:

    while (<>) {
        # Added a conditional per merlyn's advice
        /(\S*)\s+(\S*)/ or next;
        # /(\S*)\s+(\S*)/;
        push @a, $1 if length $1;
        push @b, $2 if length $2;
    }
    print "First column output:\n\n@a\n\n",
          "Second column output:\n\n@b\n\n"

    Seeking Green geeks in Minnesota

      while (<>) {
          /(\S*)\s+(\S*)/;
          push @a, $1 if length $1;
          push @b, $2 if length $2;
      }
      Broken if the regex ever fails to match. Please don't use $1 except in the conditional testing the regex that you think might match.

      -- Randal L. Schwartz, Perl hacker
      Be sure to read my standard disclaimer if this is a reply.

        Better as next unless /(\S*)\s+(\S*)/?

              /(\S*)\s+(\S*)/ or next;

      The OP was specifically asking about the problem when one of the columns is empty. It seems this solution will just skip any such line, which I don't think was the OP's intention.

      Update: as jasonk points out, this is rubbish. I'd misread the * as +.

      Hugo

        No it won't, it matches 0 or more non-white-spaces, followed by one or more white-spaces, followed by 0 or more non-white-spaces. If the first column is empty, that counts as 0 non-white-spaces and $1 will contain the empty string. If the second column is empty that counts as 0 non-white-spaces and $2 will contain the empty string. The only lines that will get skipped are lines that don't contain at least one white-space character.
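        A quick way to check this for yourself, as a sketch only: first_words is a made-up wrapper around the same regex, and the sample strings are invented (note they have no trailing newline, unlike lines read with <>).

```perl
use strict;
use warnings;

# Hypothetical wrapper: return ($1, $2) on a match, or an empty list.
sub first_words {
    my ($line) = @_;
    return $line =~ /(\S*)\s+(\S*)/ ? ($1, $2) : ();
}

for my $line ('left right', '     right-only', 'one two   three', 'nospace') {
    my @got = first_words($line);
    if (@got) {
        print "'$line' => \$1='$got[0]' \$2='$got[1]'\n";
    }
    else {
        print "'$line' => no match\n";
    }
}
```

        The line with an empty left-hand column matches with $1 as the empty string, and only 'nospace' fails to match. The third string also shows why multi-word columns come out truncated: $1 and $2 only ever hold one whitespace-free word each.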

      Hi Diotalevi,

      Unfortunately, this one didn't quite do the trick. It matches the first word of each column on the same line and prints out the first word/unit in each line, like this:

      Indice
         1. KETER
                                                 4. HESED                               131
      1. Quando la luce dell'infinito
      2. Abbiamo diversi e curiosi orologi       23. L'analogia dei contrari            133
                                                 24. Sauvez la faible Aischa            136
         2. HOKMAH                               25. Questi misteriosi iniziati         139
                                                 26. Tutte le tradizioni della terra    141
      3. In hanc utilitatem clementes angeli

      Output

      First column output:

      Indice 1. 2. 3.

      Second column output:

      1. 4. Quando Abbiamo 24. 2. 26. In

      The output reminds me of one of William Burroughs' ideas. Really cool, actually, but not what I had in mind. Obviously better than I could come up with, though. Plus you didn't have the input file to test it. And of course, the big AND... you came up with your code in about five minutes, well 16 minutes, but you were answering other questions, too.

      --
      Allolex

        Oh... yeah, that data is completely different than I expected. Ah well. That unpack solution someone else posted was nice for fixed length fields.

Re: making a single column out of a two-column text file
by jdporter (Paladin) on Feb 26, 2003 at 05:01 UTC
    Well, the sample data you showed presents a bit of a challenge, if we're to make a general solution. It would be easy, for example, to mistakenly interpret the column of page numbers on the right as a separate column of text. My solution below requires one parameter - the minimum width (in spaces) between legitimate text columns.
    sub multicolumn_to_single_column {
        my $min_gutter_width = shift;
        my @lines = @_;

        # the first task is to make a regex pattern from the input data,
        # so that it knows where all the space and non-space columns are.
        my $mask = '';
        $mask |= $_ for @lines;
        # this is a hack: it only works because the ascii space character
        # has only one bit set. Note that this solution probably won't
        # handle input well that uses tabs for spacing.

        my @pattern = do {
            my $p;
            map { $p++ % 2 ? ".{$_}" : "(.{$_})" }
            map { length }
            split /( {$min_gutter_width,})/, $mask
        };
        # you could dump @pattern here to see what's really going on.

        my $ncols = 1+@pattern >> 1;
        my $pattern = join '', @pattern;

        # Now that we have the pattern, use it to parse all the input lines
        # into rows of columns of text. The separating spaces are ignored.
        my @rows = map {
            $_ .= ' ' x (length($mask) - length($_));
            [ /$pattern/ ] # oooooo!
        } @lines;

        # Finally, we're left with the simple matter of inverting the matrix
        # for output.
        map {
            my $c = $_;
            map { $rows[$_][$c] } 0 .. $#rows
        } 0 .. $#{$rows[0]}
    }

    # example.
    # Note that given sample data requires a gutter width of 5.
    # Any less, and the page number column on the right is seen
    # as a separate column; any more, and the two columns won't
    # be seen as distinct.
    my @lines = <DATA>;
    chomp @lines;
    for ( multicolumn_to_single_column( 5, @lines ) ) {
        print "$_\n";
    }
    __DATA__
    Indice
       1. KETER
                                               4. HESED                               131
    1. Quando la luce dell'infinito
    2. Abbiamo diversi e curiosi orologi       23. L'analogia dei contrari            133
                                               24. Sauvez la faible Aischa            136
       2. HOKMAH                               25. Questi misteriosi iniziati         139
                                               26. Tutte le tradizioni della terra    141
    3. In hanc utilitatem clementes angeli
    Output:
    Indice
    1. KETER
    1. Quando la luce dell'infinito
    2. Abbiamo diversi e curiosi orologi
    2. HOKMAH
    3. In hanc utilitatem clementes angeli
    4. HESED 131
    23. L'analogia dei contrari 133
    24. Sauvez la faible Aischa 136
    25. Questi misteriosi iniziati 139
    26. Tutte le tradizioni della terra 141

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.

      Lovely! And now all I need to do is figure out how to differentiate between a single-column formatted text (such as the title/author section) and the text itself (in two columns). I've learned so much here.

      I particularly like the modularity (clarity and reusability) of the code you wrote.

      --
      Allolex

Re: making a single column out of a two-column text file
by BrowserUk (Patriarch) on Feb 26, 2003 at 15:56 UTC

    If you're trying to come up with a generic algorithm for this, the biggest problem is deciding where the column breaks are and how many there are. This is especially problematic when the possibility for more than two columns or unevenly spaced columns exists.

    A possible approach to solving this would be to make a first pass over the data using Perl's bitwise string operators, ORing (|) each line against a mask of spaces. Once the pass is complete, any chars in your mask that remain as spaces are good candidates for column breaks. The longer the sample of data being processed, the more accurate the mask will be, with the non-column-break chars tending towards values of chr(255).

    This assumes that you have already ensured that any tabs in the input data have been expanded to spaces appropriately.

    In the code below I've shown the output as the first line of the input for comparison. Whether this is useful to you will depend upon your requirements and chosen algorithm.
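    A minimal sketch of that masking pass follows. It is not BrowserUk's original listing: column_mask is a hypothetical name, the three sample lines are invented, and it assumes tabs have already been expanded and that the default string semantics of | apply (no "use feature 'bitwise'"). Following the description above, it prints the mask first and the input under it for comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: OR every line into a mask that starts as all spaces.
# A position stays a space only if every line has a space (or nothing) there.
sub column_mask {
    my @lines = @_;
    my $width = 0;
    for my $line (@lines) {
        $width = length($line) if length($line) > $width;
    }
    my $mask = ' ' x $width;
    for my $line (@lines) {
        # pad short lines so they contribute spaces rather than NUL bytes
        my $padded = $line . ' ' x ($width - length($line));
        $mask |= $padded;              # string OR, character by character
    }
    return $mask;
}

# Made-up sample: right-hand column starts at offset 12.
my @lines = (
    sprintf('%-12s%s', 'alpha',    'one'),
    sprintf('%-12s%s', '',         'two'),
    sprintf('%-12s%s', 'beta gam', 'three'),
);

print column_mask(@lines), "\n";
print "$_\n" for @lines;
```

    The characters printed in the mask line are meaningless in themselves; what matters is which positions are still spaces. In this sample that is the four-character run just before the right-hand column.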


    ..and remember there are a lot of things monks are supposed to be but lazy is not one of them

    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.

      Thanks for the advice and for the code. I was also thinking if I could find some way for the program to decide where the column breaks are, whether spaces or tabs like tachyon suggested (or whatever other possibilities exist), from there define that whatever as a "column separator" and go from there. I'll now try to figure out a way to make Perl decide if the mask has really found the column separation or not.

      Once again in your debt...

      --
      Allolex

        I'll now try to figure out a way to make Perl decide if the mask has really found the column separation or not.

        This task is going to be really input specific. BrowserUk and I have both shown you ways to calculate the probability that the column break falls at a certain column (although BrowserUk's method is cleaner, more robust, and more fluent Perl than my own). I don't really see how you can "check" this result in a general fashion short of applying some machine learning technique that is likely to be less reliable than the probabilistic approach. That said, knowing something about your input, such as the size of the column break and how many breaks of that size will be found in a line (I'm thinking of the numbers that fall to the right of the rhc here), will let you apply the mask to various inputs with a high likelihood of success.
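        One way to use that kind of knowledge, sketched under assumptions: given a mask string in which only gutter positions are still spaces, list the runs of at least a minimum width and compare how many you get with how many you expect. The name gutters and the sample mask are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start offset, width] for every run of at
# least $min_width spaces that sits between two non-space characters.
sub gutters {
    my ($mask, $min_width) = @_;
    my @found;
    while ($mask =~ /(?<=\S)( {$min_width,})(?=\S)/g) {
        push @found, [ $-[1], length $1 ];
    }
    return @found;
}

# Made-up mask: 8 text columns, a 4-wide gutter, 5 text columns,
# a 2-wide gap (think page numbers), then 3 more text columns.
my $mask = ('x' x 8) . (' ' x 4) . ('x' x 5) . (' ' x 2) . ('x' x 3);

for my $min (2, 3) {
    my @g = gutters($mask, $min);
    printf "min width %d: %d gutter(s) at %s\n",
        $min, scalar @g, join(', ', map { "$_->[0] (width $_->[1])" } @g);
}
```

        If you expect exactly two columns, anything other than one gutter at your chosen minimum width is a signal to adjust the width or to look at the input by hand.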

Re: making a single column out of a two-column text file
by tachyon (Chancellor) on Feb 26, 2003 at 13:34 UTC
    while ( <INFILE> ) {
        chomp;
        s/^\s+//g;
        my ($first, $second) = split;
        # now if first 'column' was blank then the second will be
        # in $first and $second will be undef so just swap them
        ($first, $second) = ($second, $first) unless $second;
        push @first, $first;
        push @second, $second;
    }

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      Thanks for the help. I really like the idea, but this one looks like it will swap legitimate instances of text in the first column and not in the second. Either column has to be able to be 'blank' without affecting the other one. Or did I miss something important? (a likelihood)

      Hooroo

      --
      Allolex

        Yes, you have missed something vital. If you require that either column may randomly be blank, and are using spaces and not "\t" tabs as the separator, you have invalid and unparsable data. Unless you have either fixed column widths or some defined separator structure you are up the proverbial. Consider this:

        A    B
             C
        D
             E

        You are chopping off leading spaces which will move both C and E into col 1 but there is no way to assign either to a column unless you have a fixed width or say a tab separator. If the data is really this:

        A\tB \tC D\t E\t

        which is what it should be you are fine. Just split on the "\t".

        Did you generate the data yourself? If not, virtually any programmer with half a brain would do column data like:

        # first remove tabs from data and sub in 4 spaces
        s/\t/    /g for @cols;
        my $row = join "\t", @cols;
        print SOMEFILE $row, "\n";

        This gives you a file you can parse unambiguously as each and every tab represents a column break. Thus if @cols = ( '', '', 'foo', 'bar', '' ) the resulting record will be "\t\tfoo\tbar\t". A split /\t/, $row, -1 on this record will give back the original col fields unambiguously regardless of the contents of @cols (the -1 limit is needed because split otherwise discards trailing empty fields). The price you pay is that you can't allow tabs in your data. If you have to have tabs you would generally substitute in some token (must be very improbable in data) on the way in and remove it on the way out.
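        One detail worth checking when a record ends in an empty column: Perl's split drops trailing empty fields unless it is given a negative limit. A minimal sketch using the same example record:

```perl
use strict;
use warnings;

my @cols = ('', '', 'foo', 'bar', '');
my $row  = join "\t", @cols;          # "\t\tfoo\tbar\t"

my @default = split /\t/, $row;       # trailing empty field is dropped
my @all     = split /\t/, $row, -1;   # negative limit keeps every field

printf "default: %d fields, with -1: %d fields\n",
    scalar @default, scalar @all;
# prints "default: 4 fields, with -1: 5 fields"
```

        Leading empty fields survive either way, which is why an empty first column is not a problem as long as the leading whitespace is left alone.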

        @cols = ( "foo", "\t", "bar" );
        print "original '@cols' ", scalar @cols, "\n";
        s/\t/<%tab%>/g for @cols;
        $row = join "\t", @cols;
        print "row '$row'\n";
        @ret_cols = split "\t", $row;
        s/<%tab%>/\t/g for @ret_cols;
        print "retrieve '@ret_cols' ", scalar @ret_cols, "\n";
        __DATA__
        original 'foo bar' 3
        row 'foo <%tab%> bar'
        retrieve 'foo bar' 3

        I suspect that you do not realise that the original programmer used "\t" as the col separator. When you use "\s" in a split it will split on tabs, spaces and newlines. I would try a straight split on "\t" and drop the s/^\s+//, which may well produce the results you want.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print