Re: making a single column out of a two-column text file
by Enlil (Parson) on Feb 26, 2003 at 01:33 UTC
|
use strict;
my $pattern = "A29 A*";    # first 29 characters => left column, the rest => right column
my (@lefthand,@righthand);
while ( <DATA> ) {
    chomp;
    my ($lhs,$rhs) = unpack ($pattern,$_);
    push (@lefthand,$lhs);
    push (@righthand,$rhs);
}
print "\nLEFTHAND STUFF:\n";
print join "\n", @lefthand;
print "\nRIGHTHAND STUFF:\n";
print join "\n", @righthand;
__DATA__
LHC                          RHC
By deleting the initial      You have removed the
whitespace you are going     whitespace that was
to move what was in the      in front of the left
right hand column all        hand column. This
the way to the left.         approach might be better
                             than what you were trying.
This is because              If nothing else it is a new
                             direction.
-enlil | [reply] [d/l] |
|
|
Well, thanks! This one works well, but the disadvantage is you have to provide a column offset as a constant. I was hoping to find something that found the gap by itself, but you were definitely right about me needing to think in different ways and try a different approach. And this is a very practical approach, quite in the spirit of what I am learning Perl for.
--
Allolex
| [reply] |
|
|
This solution fixes the above problem. It actually should work for pretty much any input so long as it is close to natural English. To be more specific, it will work where the start of the right-hand column is the offset into each line of text that is most likely to have a space in front of it. I tried it out on some arbitrary input. I can't guarantee anything, but it seems likely to work in most cases. If there were more than two columns, it would mess things up. It requires two passes through the text, one to figure out the column index of the right-hand column, the second to split the text into two arrays.
#!/usr/bin/perl -w
use strict;
my %colcnt;
my (@text, @lhs, @rhs);
my $max;
my $maxcnt = 0;
while (<DATA>) {
    chomp;
    push @text, $_;    # save text for later
    my @chars = split //, $_;
    for (my $i = 1; $i < @chars; ++$i) {    # skip first char (no predecessor)
        # count the offsets where a non-space character follows a space
        $colcnt{$i}++ if $chars[$i] ne ' ' && $chars[$i - 1] eq ' ';
        $colcnt{$i} ||= 0;    # make sure it is initialised, to avoid warnings
        # remember the offset seen most often so far
        ($max, $maxcnt) = ($i, $colcnt{$i}) if $colcnt{$i} > $maxcnt;
    }
}
foreach (@text) {
    my ($lhs, $rhs) = unpack("A$max A*", $_);
    push @lhs, $lhs;
    push @rhs, $rhs;
}
print "LHS:\n";
print join "\n", @lhs;
print "\n\nRHS:\n";
print join "\n", @rhs;
__DATA__
This script handles about how
arbitrary spacing much
for column skip as white space is
long as there are in each column. I
enough lines of text can't figure out
and a "normal" dist
of spaces.
how to
It also doesn't make do it in one pass though.
any assumpions
Update: As a result of jdporter's approach below, I was thinking that you could make this better by looking for more than one space before the column start (not just $i - 1). If you can set a definite minimum width on the size of the column gap, you will improve the probability of this script working. | [reply] [d/l] |
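The idea in the update can be sketched roughly as follows. This is only an illustration: the helper name guess_column_start, the $min_gap parameter and the sample lines are made up here, and the sample assumes a right-hand column starting at offset 13.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): guess the 0-based offset at
# which the right-hand column starts, counting only text that follows a
# run of at least $min_gap spaces.
sub guess_column_start {
    my ($min_gap, @lines) = @_;
    my %count;
    for my $line (@lines) {
        # each match is a whole run of spaces followed by a non-space;
        # pos() is then the offset of that non-space character
        while ($line =~ / {$min_gap,}(?=\S)/g) {
            $count{ pos($line) }++;
        }
    }
    # most frequent offset wins; ties go to the leftmost offset
    my ($best) = sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;
    return $best;    # undef if no gap was wide enough
}

# Made-up sample data: the right-hand column starts at offset 13.
my @sample = (
    sprintf('%-13s%s', 'one two', 'alpha'),
    sprintf('%-13s%s', 'three',   'beta gamma'),
    sprintf('%-13s%s', '',        'delta'),
    'four five',
);

my $start = guess_column_start(2, @sample);
for my $line (@sample) {
    my ($lhs, $rhs) = unpack "A$start A*", $line;
    print "[$lhs] [$rhs]\n";
}
```

With a minimum gap of 2, the single spaces between ordinary words no longer count as candidate column starts, which is the improvement the update is after. It still assumes exactly two columns.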
Re: making a single column out of a two-column text file
by diotalevi (Canon) on Feb 26, 2003 at 01:41 UTC
|
As a one-liner:
perl -ne '/(\S*)\s+(\S*)/; push @a,$1 if length $1; push @b,$2 if length $2; END{ print "First column output:\n\n@a\n\nSecond column output:\n\n@b\n\n" }'
And more readably:
while (<>) {
    # Added a conditional per merlyn's advice
    /(\S*)\s+(\S*)/ or next;
    # /(\S*)\s+(\S*)/;
    push @a, $1 if length $1;
    push @b, $2 if length $2;
}
print "First column output:\n\n@a\n\n",
      "Second column output:\n\n@b\n\n";
Seeking Green geeks in Minnesota | [reply] [d/l] [select] |
/(\S*)\s+(\S*)/ or next;
The OP was specifically asking about the problem when one of the columns is empty. It seems this solution will just skip any such line, which I don't think was the OP's intention.
Update: as jasonk points out, this is rubbish. I'd misread the * as +.
Hugo
| [reply] [d/l] [select] |
|
|
No it won't, it matches 0 or more non-white-spaces, followed by one or more white-spaces, followed by 0 or more non-white-spaces. If the first column is empty, that counts as 0 non-white-spaces and $1 will contain the empty string. If the second column is empty that counts as 0 non-white-spaces and $2 will contain the empty string. The only lines that will get skipped are lines that don't contain at least one white-space character.
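A quick way to check this behaviour is to run the pattern against a few lines. The helper name two_fields and the sample strings below are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the pattern from the post to one line and
# return the two captures, or an empty list if the line does not match.
sub two_fields {
    my ($line) = @_;
    return $line =~ /(\S*)\s+(\S*)/ ? ($1, $2) : ();
}

for my $line ("left right", "   right", "left   ", "nowhitespace") {
    my @f = two_fields($line);
    printf "%-16s => %s\n", "'$line'",
        @f ? "('$f[0]', '$f[1]')" : 'no match (skipped)';
}
```

Note that each capture is a single run of non-whitespace, so a column containing several words yields only one word per line.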
| [reply] |
|
|
Indice
1. KETER
4. HESED
+ 131
1. Quando la luce dell'infinito
2. Abbiamo diversi e curiosi orologi 23. L'analogia dei contrar
+i 133
24. Sauvez la faible Aisch
+a 136
2. HOKMAH 25. Questi misteriosi iniz
+iati 139
26. Tutte le tradizioni de
+lla terra 141
3. In hanc utilitatem clementes angeli
Output
First column output:
Indice 1. 2. 3.
Second column output:
1. 4. Quando Abbiamo 24. 2. 26. In
The output reminds me of one of William Burroughs' ideas. Really cool, actually, but not what I had in mind. Obviously better than I could come up with, though. Plus you didn't have the input file to test it. And of course, the big AND... you came up with your code in about five minutes, well 16 minutes, but you were answering other questions, too.
--
Allolex
| [reply] [d/l] [select] |
Re: making a single column out of a two-column text file
by jdporter (Paladin) on Feb 26, 2003 at 05:01 UTC
|
Well, the sample data you showed presents a bit of a challenge, if we're to make a general
solution. It would be easy, for example, to mistakenly interpret the column of page numbers
on the right as a separate column of text. My solution below requires one parameter - the
minimum width (in spaces) between legitimate text columns.
sub multicolumn_to_single_column
{
    my $min_gutter_width = shift;
    my @lines = @_;
    # the first task is to make a regex pattern from the input data,
    # so that it knows where all the space and non-space columns are.
    my $mask = '';
    $mask |= $_ for @lines;
    # this is a hack: it only works because the ascii space character
    # has only one bit set. Note that this solution probably won't
    # handle input well that uses tabs for spacing.
    my @pattern = do
    {
        my $p;
        map { $p++ % 2 ? ".{$_}" : "(.{$_})" }
        map { length }
        split /( {$min_gutter_width,})/, $mask
    };
    # you could dump @pattern here to see what's really going on.
    my $ncols = 1+@pattern >> 1;
    my $pattern = join '', @pattern;
    # Now that we have the pattern, use it to parse all the input lines
    # into rows of columns of text. The separating spaces are ignored.
    my @rows = map
    {
        $_ .= ' ' x (length($mask) - length($_));
        [ /$pattern/ ] # oooooo!
    }
    @lines;
    # Finally, we're left with the simple matter of inverting the matrix
    # for output.
    map
    {
        my $c = $_;
        map { $rows[$_][$c] } 0 .. $#rows
    }
    0 .. $#{$rows[0]}
}
# example.
# Note that given sample data requires a gutter width of 5.
# Any less, and the page number column on the right is seen
# as a separate column; any more, and the two columns won't
# be seen as distinct.
my @lines = <DATA>; chomp @lines;
for ( multicolumn_to_single_column( 5, @lines ) )
{
    print "$_\n";
}
__DATA__
Indice
1. KETER
4. HESED
+ 131
1. Quando la luce dell'infinito
2. Abbiamo diversi e curiosi orologi 23. L'analogia dei contrar
+i 133
24. Sauvez la faible Aisch
+a 136
2. HOKMAH 25. Questi misteriosi iniz
+iati 139
26. Tutte le tradizioni de
+lla terra 141
3. In hanc utilitatem clementes angeli
Output:
Indice
1. KETER
1. Quando la luce dell'infinito
2. Abbiamo diversi e curiosi orologi
2. HOKMAH
3. In hanc utilitatem clementes angeli
4. HESED 131
23. L'analogia dei contrari 133
24. Sauvez la faible Aischa 136
25. Questi misteriosi iniziati 139
26. Tutte le tradizioni della terra 141
jdporter The 6th Rule of Perl Club is -- There is no Rule #6. | [reply] [d/l] [select] |
|
|
Lovely! And now all I need to do is figure out how to differentiate between single-column formatted text (such as the title/author section) and the text itself (in two columns). I've learned so much here.
I particularly like the modularity (clarity and reusability) of the code you wrote.
--
Allolex
| [reply] |
Re: making a single column out of a two-column text file
by BrowserUk (Patriarch) on Feb 26, 2003 at 15:56 UTC
|
If you're trying to come up with a generic algorithm for this, the biggest problem is deciding where the column breaks are and how many there are. This is especially problematic when the possibility for more than two columns or unevenly spaced columns exists.
A possible approach to solving this would be to make a first pass over the data using Perl's bitwise string manipulation, ORing (|) each line against a mask of spaces.
Once a pass is complete, any chars in your mask that remain as spaces are good candidates for column breaks. The longer the sample of data being processed, the more accurate the mask will be, with the non column-break chars tending towards values of chr(255).
This assumes that you have already ensured that any tabs in the input data have been expanded to spaces appropriately.
In the code below I've shown the output as the first line of the input for comparison. Whether this is useful to you will depend upon your requirements and chosen algorithm.
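A minimal sketch of the masking idea described above. This is not BrowserUk's code: the helper names column_mask and gutters, the $min_gap parameter and the sample lines are all assumptions made for illustration, and the input is assumed to be plain ASCII with tabs already expanded.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: OR all lines together as strings. A position in
# the result is still a space only if no line has text at that offset.
sub column_mask {
    my @lines = @_;
    my $width = max(0, map { length } @lines);
    my $mask  = ' ' x $width;
    # pad each line with spaces to the full width, then string-OR it in
    $mask |= sprintf('%-*s', $width, $_) for @lines;
    return $mask;
}

# Hypothetical helper: return [offset, width] for every run of at least
# $min_gap spaces in the mask that has text on both sides.
sub gutters {
    my ($mask, $min_gap) = @_;
    my @runs;
    while ($mask =~ /(?<=[^ ])( {$min_gap,})(?=[^ ])/g) {
        push @runs, [ $-[1], length $1 ];
    }
    return @runs;
}

# Made-up sample data: the right-hand column starts at offset 20.
my @sample = (
    sprintf('%-20s%s', 'first column text', 'second column'),
    sprintf('%-20s%s', 'more words here',   'and here too'),
    sprintf('%-20s%s', '',                  'right side only'),
    'left side only',
);

my $mask = column_mask(@sample);
(my $shown = $mask) =~ tr/ /#/c;    # show every non-space mask char as '#'
print "$_\n" for @sample;
print "$shown\n";
printf "gutter at offset %d, width %d\n", @$_ for gutters($mask, 2);
```

The space character is 0x20, so ORing it with any other printable character gives a non-space result. Only offsets that are blank on every line survive as spaces in the mask.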
..and remember there are a lot of things monks are supposed to be but lazy is not one of them
Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.
| [reply] [d/l] |
|
|
Thanks for the advice and for the code. I was also thinking if I could find some way for the program to decide where the column breaks are, whether spaces or tabs like tachyon suggested (or whatever other possibilities exist), from there define that whatever as a "column separator" and go from there. I'll now try to figure out a way to make Perl decide if the mask has really found the column separation or not.
Once again in your debt...
--
Allolex
| [reply] |
|
|
I'll now try to figure out a way to make Perl decide if the mask has really found the column separation or not.
This task is going to be really input specific. BrowserUk and I have both shown you ways to calculate the probability that the column break falls at a certain column (although BrowserUk's method is cleaner, more robust, and more fluent Perl than my own). I don't really see how you can "check" this result in a general fashion short of applying some machine learning technique that is likely to be less reliable than the probabilistic approach. That said, knowing something about your input, such as the size of the column break and how many breaks of that size will be found in a line (I'm thinking of the numbers that fall to the right of the rhc here), will let you apply the mask to various inputs with a high likelihood of success.
| [reply] |
Re: making a single column out of a two-column text file
by tachyon (Chancellor) on Feb 26, 2003 at 13:34 UTC
|
my (@first, @second);
while ( <INFILE> ) {
    chomp;
    s/^\s+//;
    my ($first, $second) = split;
    # now if first 'column' was blank then the second will be
    # in $first and $second will be undef so just swap them
    ($first, $second) = ($second, $first) unless defined $second;
    push @first, $first;
    push @second, $second;
}
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
| [reply] [d/l] |
|
|
Yes, you have missed something vital. If either column may randomly be blank and you are using spaces rather than "\t" tabs as the separator, you have invalid and unparsable data. Unless you have either fixed column widths or some defined separator structure you are up the proverbial. Consider this:
A    B
     C
D
     E
You are chopping off leading spaces, which will move both C and E into col 1, but there is no way to assign either to a column unless you have a fixed width or, say, a tab separator. If the data is really this:
A\tB
\tC
D\t
\tE
which is what it should be, you are fine. Just split on the "\t".
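For tab-separated lines like these, with C and E in the second column, a rough sketch of the split could look like this. The helper name split_two_columns is made up, and the sketch assumes exactly two columns per line.

```perl
use strict;
use warnings;

# Hypothetical helper: split tab-separated lines into two column arrays,
# skipping empty cells.
sub split_two_columns {
    my (@lhs, @rhs);
    for my $line (@_) {
        # limit of 2: everything after the first tab belongs to column two
        my ($l, $r) = split /\t/, $line, 2;
        push @lhs, $l if defined $l && length $l;
        push @rhs, $r if defined $r && length $r;
    }
    return (\@lhs, \@rhs);
}

my ($lhs, $rhs) = split_two_columns("A\tB", "\tC", "D\t", "\tE");
print "column 1: @$lhs\n";    # column 1: A D
print "column 2: @$rhs\n";    # column 2: B C E
```

Because the tab marks the column break explicitly, an empty left-hand cell shows up as an empty first field instead of being lost.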
Did you generate the data yourself? If not, virtually any programmer with half a brain would do column data like:
# first remove tabs from data and sub in 4 spaces
s/\t/    /g for @cols;
my $row = join "\t", @cols;
print SOMEFILE $row, "\n";
This gives you a file you can parse unambiguously, as each and every tab represents a column break. Thus if @cols = ( '', '', 'foo', 'bar', '' ) the resulting record will be "\t\tfoo\tbar\t". A split on "\t" with a negative limit, split /\t/, $row, -1, will give back the original col fields unambiguously regardless of the contents of @cols (without the limit, split drops trailing empty fields). The price you pay is that you can't allow tabs in your data. If you have to have tabs you would generally substitute in some token (must be very improbable in data) on the way in and remove it on the way out.
@cols = ( "foo", "\t", "bar" );
print "original '@cols' ", scalar @cols, "\n";
s/\t/<%tab%>/g for @cols;
$row = join "\t", @cols;
print "row '$row'\n";
@ret_cols = split "\t", $row;
s/<%tab%>/\t/g for @ret_cols;
print "retrieved '@ret_cols' ", scalar @ret_cols, "\n";
__DATA__
original 'foo bar' 3
row 'foo <%tab%> bar'
retrieved 'foo bar' 3
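One detail worth checking when round-tripping a record such as "\t\tfoo\tbar\t": by default Perl's split discards trailing empty fields, so the last empty column only comes back if a negative limit is passed. A small check, using the example values from above:

```perl
use strict;
use warnings;

my @cols = ('', '', 'foo', 'bar', '');
my $row  = join "\t", @cols;            # "\t\tfoo\tbar\t"

my @default = split /\t/, $row;         # trailing empty field is dropped
my @all     = split /\t/, $row, -1;     # negative limit keeps it

printf "default:  %d fields\n", scalar @default;    # default:  4 fields
printf "limit -1: %d fields\n", scalar @all;        # limit -1: 5 fields
```

Leading empty fields are kept either way, so only records ending in one or more empty columns are affected.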
I suspect that you do not realise that the original programmer used "\t" as the col separator. When you use "\s" in a split it will split on tabs, spaces and newlines. I would try a straight split on "\t" and drop the s/^\s+//; that may well produce the results you want.
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
| [reply] [d/l] [select] |
|
|