in reply to making a single column out of a two-column text file

If you're trying to come up with a generic algorithm for this, the biggest problem is deciding where the column breaks are and how many there are. This is especially problematic when more than two columns or unevenly spaced columns are possible.

One possible approach is to make a first pass over the data using Perl's bitwise string operators, ORing (|) each line into a mask of spaces. Once the pass is complete, any characters in the mask that are still spaces are good candidates for column breaks. The longer the sample of data being processed, the more accurate the mask will be, with the non-column-break characters tending towards chr(127) for plain 7-bit ASCII text (or chr(255) if the data contains high-bit characters).

This assumes that you have already ensured that any tabs in the input data have been expanded to spaces appropriately.
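For the tab expansion itself, a minimal sketch using the core Text::Tabs module might look like the following. The 8-column tab stop, the helper name expand_lines and the sample strings are assumptions for illustration only; use whatever tab width the file was actually written with.

```perl
use strict;
use warnings;
use Text::Tabs;    # core module; exports expand() and $tabstop

# Assumption: tab stops every 8 columns. Adjust to match the source file.
$tabstop = 8;

# Hypothetical helper: return copies of the lines with tabs turned into
# the number of spaces needed to reach the next tab stop.
sub expand_lines {
    my @lines = @_;
    return expand(@lines);
}

my @sample   = ( "ab\tcd\n", "abcdefg\tx\n" );
my @expanded = expand_lines(@sample);
print for @expanded;
```

Only after this step do character offsets mean the same thing on every line, which is what the mask relies on.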

In the code below I've shown the output as the first line of the input for comparison. Whether this is useful to you will depend upon your requirements and chosen algorithm.

#! perl -slw
use strict;

my $mask = ' ' x 100;

while (<DATA>) {
    $mask |= $_;
}

print $mask;

__DATA__
o   +*
Environmental           Russia. In addition,    accordance with
considerations were     they claim to have      integral development
largely ignored in Cuba examined laws and       policies" of the Cuban
for almost 200 years.   regulations from        government, and
Only in the last        Colombia, Mexico,       "with the objective of
decade, with the        Sweden and Venezuela    the best utilization of
enactment of Law 33 on  together with materials the national productive
January 10, 1981, have  from the United Nations potential".
environmental laws and  Program for the
regulations begun to    Environment, even       Law 33 requires the
play a very small role  though it is known that application of these
in guiding the develop- some countries like     objectives to all
ment of natural         Mexico had practically  investment projects and
resources exploitation  no environmental laws   to regional planning.
and the ecolo- gy of    at that time. It is     Environmental
the island.             also common knowledge   assessment measures
                        that the ecological     carried out and
Law 33 is a very short  situation of Russia is  approved by
document of only 25     a complete disaster;    governmental
pages. It supposedly    therefore, that coun-   institutions must be
covers all the          try's environmental     included in all
regulations from the    laws were very lax or   projects.
"principles of the      were never applied.
Cuban Communist Party                           Law 33 is divided into
concerning the          The                     four main chapters.
environment," to the    "Comisi?n Nacional de   Chapter one covers the
protection and use of   Protecci?n del Medio    main concepts of the
Cuban national          Ambiente y Conservaci?n Law. Chapter two covers
resources. Law 33 has a de los Recursos         specific areas of the
good dosage of          Naturales (COMARNA)"    Law and the
political "garbage,"    was responsible for     fundamentals for the
including a section     developing Law 33 and   use, protection and
that compares the       its regulations, but    rehabilitation of
"wise use of natural    the Academy of Sciences water, soil, mineral
resources by communist  of Cuba was in charge   resources, etc. Chapter
countries versus the    of defining the         three covers the
indiscriminate use of   technical terminology   organization of the
natural resources by    included in the Law.    government entity
the capitalistic                                responsible for the
world."                 Thus, the Law on        Law: the Comisi?n
                        Environmental           Nacional de Protecci?n
BACKGROUND              Protection and the      del Medio Ambiente y
                        Rational Use of Natural Conservaci?n de los
As a guide for drafting Resources (Law 33) was  Recursos Naturales. The
Law 33, the Cuban       passed in order to      last chapter, chapter
Government claims they  "establish the basic    four, is an attempt to
relied on legislation   principles to conserve  legislate a system of
enacted by some former  protect, improve and    fines for violating the
socialist countries     transform the           Law including a
such as the German      environment and the     mechanism to insure
Democratic Republic,    rational use of natural that they are obeyed.
Bulgaria, Hungary and   resources, in
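Once a mask has been built, one way to use it (a sketch under assumptions, not a general solution) is to treat each interior run of spaces as a column break, cut every line at those offsets, and emit the columns one after another. The sub names build_mask, find_gaps and unstack_columns and the four sample lines are made up for illustration. The sketch also assumes the lines are already chomped and tab-free, and that the column gap is never bridged by stray text.

```perl
use strict;
use warnings;

# OR every line into a string of spaces; positions that stay ' ' are
# blank on every line.
sub build_mask {
    my ($width, @lines) = @_;
    my $mask = ' ' x $width;
    $mask |= $_ for @lines;
    return $mask;
}

# Return [start, width] for each interior run of spaces in the mask.
sub find_gaps {
    my ($mask) = @_;
    $mask =~ s/ +\z//;                  # padding past the longest line
    my @gaps;
    while ($mask =~ /( +)/g) {
        next if $-[1] == 0;             # leading indentation is not a break
        push @gaps, [ $-[1], length $1 ];
    }
    return @gaps;
}

# Cut each line at the gaps; return column 1 top to bottom, then column 2, ...
sub unstack_columns {
    my ($gaps, @lines) = @_;
    my @starts = ( 0, map { $_->[0] + $_->[1] } @$gaps );
    my @columns;
    for my $line (@lines) {
        for my $i (0 .. $#starts) {
            my $len  = $i < $#starts ? $starts[$i + 1] - $starts[$i] : length $line;
            my $cell = length($line) > $starts[$i] ? substr($line, $starts[$i], $len) : '';
            $cell =~ s/\s+\z//;
            push @{ $columns[$i] }, $cell;
        }
    }
    return map { @$_ } @columns;
}

# Made-up two-column sample: left column padded to 18 characters.
my @lines = map { sprintf '%-18s%s', @$_ } (
    [ 'The quick brown', 'jumps over the'  ],
    [ 'fox',             'lazy dog.'       ],
    [ 'Pack my box',     'with five dozen' ],
    [ 'with liquor',     'jugs.'           ],
);

my @gaps = find_gaps( build_mask( 40, @lines ) );
print "$_\n" for unstack_columns( \@gaps, @lines );
```

A heading that spans both columns, or a single long line that runs into the gap, would defeat this strict version, so any real use would need some tolerance on top of it.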

..and remember there are a lot of things monks are supposed to be but lazy is not one of them

Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.

Re: Re: making a single column out of a two-column text file
by allolex (Curate) on Feb 27, 2003 at 01:50 UTC

    Thanks for the advice and for the code. I was also wondering whether I could find some way for the program to decide where the column breaks are, whether spaces or tabs as tachyon suggested (or whatever other possibilities exist), define that as a "column separator", and go from there. I'll now try to figure out a way to make Perl decide if the mask has really found the column separation or not.

    Once again in your debt...

    --
    Allolex

      I'll now try to figure out a way to make Perl decide if the mask has really found the column separation or not.

      This task is going to be really input specific. BrowserUK and I have both shown you ways to calculate the probability that the column break falls at a certain column (although BrowserUK's method is cleaner, more robust, and more fluent Perl than my own). I don't really see how you can "check" this result in a general fashion short of applying some machine learning technique that is likely to be less reliable than the probabilistic approach. That said, knowing something about your input, such as the size of the column break and how many breaks of that size will be found in a line (I'm thinking of the numbers that fall to the right of the rhc here), will let you apply the mask to various inputs with a high likelihood of success.
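As a rough illustration of using that kind of prior knowledge, the sketch below filters the gaps found in a mask by a minimum width and keeps only the expected number of breaks, widest first. The sub name pick_breaks, the thresholds and the toy mask are assumptions invented for the example; they are not taken from either method posted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: from a mask (spaces = blank on every line), keep
# the gaps at least $min_width wide, then at most $max_breaks of the
# widest ones, returned in left-to-right order.
sub pick_breaks {
    my ($mask, $min_width, $max_breaks) = @_;
    $mask =~ s/ +\z//;                        # right margin is not a break
    my @gaps;
    while ($mask =~ /( +)/g) {
        next if $-[1] == 0;                   # skip leading indentation
        next if length($1) < $min_width;      # too narrow to be a column gap
        push @gaps, { start => $-[1], width => length $1 };
    }
    my @widest = sort { $b->{width} <=> $a->{width} or $a->{start} <=> $b->{start} } @gaps;
    splice @widest, $max_breaks if @widest > $max_breaks;
    return sort { $a->{start} <=> $b->{start} } @widest;
}

# Toy mask: a 1-space accidental alignment, the real 3-space column gap,
# and a 2-space gap in front of numbers at the right of the last column.
my $mask = ('x' x 4) . ' ' . ('x' x 4) . (' ' x 3) . ('x' x 8) . (' ' x 2) . ('x' x 2);

for my $gap ( pick_breaks($mask, 2, 1) ) {
    print "break at offset $gap->{start}, width $gap->{width}\n";
}
# prints: break at offset 9, width 3
```

Asking for two breaks instead of one would also return the narrower gap at offset 20, which is why knowing the expected number of columns matters.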