Re: making a single column out of a two-column text file

If you're trying to come up with a generic algorithm for this, the biggest problem is deciding where the column breaks are and how many there are. This is especially difficult when there may be more than two columns, or when the columns are unevenly spaced.

One possible approach is to make a first pass over the data using Perl's bitwise string operators, ORing (|) each line into a mask of spaces. Once the pass is complete, any characters in the mask that are still spaces are good candidates for column breaks. The larger the sample of data processed, the more accurate the mask will be, with the non-column-break characters tending towards chr(127) for 7-bit ASCII input (or chr(255) if the data contains 8-bit characters).

This assumes that you have already ensured that any tabs in the input data have been expanded to spaces appropriately.
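For the tab expansion, one option is the core Text::Tabs module. This is only a minimal sketch: the 8-column tab stop and the two sample lines are assumptions for illustration, not part of the original data.

```perl
use strict;
use warnings;
use Text::Tabs;    # core module; exports expand() and $tabstop

$tabstop = 8;      # assumed tab width; adjust to match the source file

# Made-up sample lines that mix tabs and spaces.
my @raw = ( "one\ttwo", "three   four\tfive" );

# expand() replaces each tab with enough spaces to reach the next tab stop,
# so that character positions line up the way they do on screen.
my @expanded = expand(@raw);

print "$_\n" for @expanded;
```

Run the expansion before building the mask; otherwise a single tab character hides a gap that is several columns wide.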

In the code below I've shown the output as the first line of the input for comparison. Whether this is useful to you will depend upon your requirements and chosen algorithm.

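#! perl -slw
use strict;

# Start with a mask of spaces, wide enough for the longest expected line.
my $mask = ' ' x 100;

# Bitwise string OR: a mask position stays a space only if every line
# has a space (or nothing) at that position.
# Lines are not chomped (-l only chomps with -n or -p), so each "\n" is
# ORed in too; ' ' | "\n" gives '*' at the end-of-line positions.
while (<DATA>) {
    $mask |= $_;
}
print $mask;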

__DATA__
o  
+*
Environmental           Russia. In addition,    accordance with
considerations were     they claim to have      integral development
largely ignored in Cuba examined laws and       policies" of the Cuban
for almost 200 years.   regulations from        government, and
Only in the last        Colombia, Mexico,       "with the objective of
decade, with the        Sweden and Venezuela    the best utilization of
enactment of Law 33 on  together with materials the national productive
January 10, 1981, have  from the United Nations potential".
environmental laws and  Program for the
regulations begun to    Environment, even       Law 33 requires the
play a very small role  though it is known that application of these
in guiding the develop- some countries like     objectives to all
ment of natural         Mexico had practically  investment projects and
resources exploitation  no environmental laws   to regional planning.
and the ecolo- gy of    at that time. It is     Environmental
the island.             also common knowledge   assessment measures
                        that the ecological     carried out and
Law 33 is a very short  situation of Russia is  approved by
document of only 25     a complete disaster;    governmental
pages. It supposedly    therefore, that coun-   institutions must be
covers all the          try's environmental     included in all
regulations from the    laws were very lax or   projects.
"principles of the      were never applied.
Cuban Communist Party                           Law 33 is divided into
concerning the          The                     four main chapters.
environment," to the    "Comisi?n Nacional de   Chapter one covers the
protection and use of   Protecci?n del Medio    main concepts of the
Cuban national          Ambiente y Conservaci?n Law. Chapter two covers
resources. Law 33 has a de los Recursos         specific areas of the
good dosage of          Naturales (COMARNA)"    Law and the
political "garbage,"    was responsible for     fundamentals for the
including a section     developing Law 33 and   use, protection and
that compares the       its regulations, but    rehabilitation of
"wise use of natural    the Academy of Sciences water, soil, mineral
resources by communist  of Cuba was in charge   resources, etc. Chapter
countries versus the    of defining the         three covers the
indiscriminate use of   technical terminology   organization of the
natural resources by    included in the Law.    government entity
the capitalistic                                responsible for the
world."                 Thus, the Law on        Law: the Comisi?n
                        Environmental           Nacional de Protecci?n
BACKGROUND              Protection and the      del Medio Ambiente y
                        Rational Use of Natural Conservaci?n de los
As a guide for drafting Resources (Law 33) was  Recursos Naturales. The
Law 33, the Cuban       passed in order to      last chapter, chapter
Government claims they  "establish the basic    four, is an attempt to
relied on legislation   principles to conserve  legislate a system of
enacted by some former  protect, improve and    fines for violating the
socialist countries     transform the           Law including a
such as the German      environment and the     mechanism to insure
Democratic Republic,    rational use of natural that they are obeyed.
Bulgaria, Hungary and   resources, in
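Once the mask exists, the remaining step is to turn its runs of spaces into column boundaries and emit the columns one after another. The following is only a sketch of one way to do that: the helper names (build_mask, column_starts, to_single_column) and the minimum gap parameter are hypothetical, and the sample rows are four lines taken from the data above, laid out in 24-character columns.

```perl
use strict;
use warnings;

# Four rows from the sample data, rebuilt with fixed 24-character columns.
my @rows = (
    [ 'Environmental',           'Russia. In addition,', 'accordance with' ],
    [ 'considerations were',     'they claim to have',   'integral development' ],
    [ 'largely ignored in Cuba', 'examined laws and',    'policies" of the Cuban' ],
    [ 'for almost 200 years.',   'regulations from',     'government, and' ],
);
my @lines = map { sprintf '%-24s%-24s%s', @$_ } @rows;

# Hypothetical helper: OR every (already chomped) line into a mask of spaces
# that is exactly as wide as the longest line.
sub build_mask {
    my @lines = @_;
    my $width = 0;
    for my $line (@lines) {
        $width = length($line) if length($line) > $width;
    }
    my $mask = ' ' x $width;
    $mask |= $_ for @lines;    # bitwise string OR
    return $mask;
}

# Hypothetical helper: a column starts at 0 and after every run of at least
# $min_gap spaces that has non-space mask characters on both sides.
sub column_starts {
    my ( $mask, $min_gap ) = @_;
    my @starts = (0);
    while ( $mask =~ /(?<=[^ ])( {$min_gap,})(?=[^ ])/g ) {
        push @starts, $+[1];    # offset just past the end of the gap
    }
    return @starts;
}

# Hypothetical helper: cut each line at the column starts, then return all
# cells of the first column, followed by the second, and so on.
sub to_single_column {
    my ( $min_gap, @lines ) = @_;
    my @starts = column_starts( build_mask(@lines), $min_gap );
    my @columns;
    for my $line (@lines) {
        for my $i ( 0 .. $#starts ) {
            my $from = $starts[$i];
            my $to   = $i < $#starts ? $starts[ $i + 1 ] : length($line);
            my $cell = $from < length($line)
                ? substr( $line, $from, $to - $from )
                : '';           # line ends before this column begins
            $cell =~ s/ +$//;   # drop the padding between columns
            push @{ $columns[$i] }, $cell;
        }
    }
    return map { @$_ } @columns;
}

my @single = to_single_column( 1, @lines );
print "$_\n" for @single;
```

With these four rows the mask has gaps at offsets 23 and 44..47, so the sketch prints the twelve cells one per line, left column first. A minimum gap of 1 is needed here because "largely ignored in Cuba" leaves only a single space before the second column; on real data a larger value reduces false breaks but can also miss narrow gutters like that one.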

..and remember there are a lot of things monks are supposed to be but lazy is not one of them

Examine what is said, not who speaks.

1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.

Re: Re: making a single column out of a two-column text file by allolex (Curate) on Feb 27, 2003 at 01:50 UTC
Thanks for the advice and for the code. I was also wondering whether I could find some way for the program to decide where the column breaks are, whether spaces or tabs as tachyon suggested (or whatever other possibilities exist), then define that as the "column separator" and go from there. I'll now try to figure out a way to make Perl decide if the mask has really found the column separation or not. Once again in your debt... -- Allolex
Re: Re: Re: making a single column out of a two-column text file by dbp (Pilgrim) on Feb 27, 2003 at 05:47 UTC
"I'll now try to figure out a way to make Perl decide if the mask has really found the column separation or not."

This task is going to be really input specific. BrowserUK and I have both shown you ways to calculate the probability that the column break falls at a certain column (although BrowserUK's method is cleaner, more robust, and more fluent Perl than my own). I don't really see how you can "check" this result in a general fashion, short of applying some machine learning technique that is likely to be less reliable than the probabilistic approach. That said, knowing something about your input, such as the size of the column break and how many breaks of that size will be found in a line (I'm thinking of the numbers that fall to the right of the right-hand column here), will let you apply the mask to various inputs with a high likelihood of success.
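As a rough illustration of combining a per-column probability with a known gap size, here is a sketch that is neither BrowserUK's nor my earlier code. The helper names, the 0.8 threshold, the minimum gap of 2, and the overrunning sixth line are all assumptions made up for the example; the other rows come from the sample data.

```perl
use strict;
use warnings;

# Five two-column rows from the sample data, plus one made-up line that
# overruns the gutter (hypothetical, to show what the threshold tolerates).
my @lines = map { sprintf '%-24s%s', @$_ } (
    [ 'Environmental',         'Russia. In addition,' ],
    [ 'considerations were',   'they claim to have' ],
    [ 'for almost 200 years.', 'regulations from' ],
    [ 'Only in the last',      'Colombia, Mexico,' ],
    [ 'decade, with the',      'Sweden and Venezuela' ],
);
push @lines, 'a hypothetical heading that overruns';

# Hypothetical helper: for each character position, the fraction of
# non-blank lines that have a space there (short lines count as padded).
sub space_fraction {
    my @lines = grep { /\S/ } @_;
    return () unless @lines;
    my $width = 0;
    for my $line (@lines) {
        $width = length($line) if length($line) > $width;
    }
    my @count = (0) x $width;
    for my $line (@lines) {
        my $padded = $line . ' ' x ( $width - length($line) );
        for my $pos ( 0 .. $width - 1 ) {
            $count[$pos]++ if substr( $padded, $pos, 1 ) eq ' ';
        }
    }
    my $total = @lines;
    return map { $_ / $total } @count;
}

# Hypothetical helper: return [start, width] for every interior run of at
# least $min_gap positions whose space fraction reaches $threshold.
sub candidate_breaks {
    my ( $threshold, $min_gap, @fraction ) = @_;
    my ( @breaks, $start );
    for my $pos ( 0 .. $#fraction + 1 ) {    # one extra step closes an open run
        my $is_gap = $pos <= $#fraction && $fraction[$pos] >= $threshold;
        if ($is_gap) {
            $start = $pos unless defined $start;
        }
        elsif ( defined $start ) {
            my $width = $pos - $start;
            # Ignore leading indentation and the ragged right margin.
            push @breaks, [ $start, $width ]
                if $width >= $min_gap && $start > 0 && $pos <= $#fraction;
            undef $start;
        }
    }
    return @breaks;
}

my @fraction = space_fraction(@lines);
my @breaks   = candidate_breaks( 0.8, 2, @fraction );
printf "break at column %d, width %d\n", @$_ for @breaks;
# prints: break at column 21, width 3
```

With a threshold of 1.0 this reduces to the strict mask, and the overrunning line shrinks the gap to a single position, which a minimum gap of 2 then rejects. Lowering the threshold trades that brittleness for a risk of false breaks, which is why knowing the expected gap size for your input matters.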