in reply to misalined TABs using substr,LAST_MATCH_START/END,regex

G'day perl_boy,

[A couple of notes on presentation: It's good that you've put code within <code>...</code> tags; please also do the same for data and program output (e.g. error messages) so that we can see a verbatim copy of what you're seeing — HTML can modify what you write, e.g. by collapsing whitespace into a single space, which can make a huge difference in many cases. Please also "linkify" URLs; in this case, changing URL to [URL] would have sufficed; again, this helps us to help you (see "What shortcuts can I use for linking to other information?" for more details about that).]

You've shown your input data as having the same number of characters in each column (columns 1, 2 & 3 have 3 characters: foo, bar, baz; columns 4 & 5 have 4 characters: booz & qaaz; and so on). This could be a realistic representation of your data; for instance, order numbers, product codes, client IDs, and so on, are likely to have the same lengths. If this is the case, the following is a much simpler solution.

#!/usr/bin/env perl

use 5.014;
use warnings;
use autodie;

my $infile = 'pm_11140114_tab_align_even.dat';
my $outfile = 'pm_11140114_tab_align_even.out';

{
    open my $in_fh, '<', $infile;
    open my $out_fh, '>', $outfile;

    while (<$in_fh>) {
        print $out_fh $_ =~ y/\t/\t/rs;
    }
}

pm_11140114_tab_align_even.dat:

foo bar baz booz qaaz abc
foo bar baz booz qaaz abc 123
foo bar baz booz qaaz abc

pm_11140114_tab_align_even.out:

foo bar baz booz qaaz abc
foo bar baz booz qaaz abc 123
foo bar baz booz qaaz abc

Note that this uses the /r option which was introduced in Perl 5.14: "perl5140delta: Non-destructive substitution". If you're using an older version of Perl, change use 5.014; to use strict; and the print statement will need to be split into two statements:

y/\t/\t/s;
print $out_fh $_;

This gives exactly the same result.
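
As a minimal sketch of what the squeeze does: y/\t/\t/s maps each TAB to a TAB, and the /s option collapses every run of TABs into one. The helper name squeeze_tabs and the sample record are made up for illustration; this form avoids /r, so it should also work on Perls older than 5.14.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# squeeze_tabs() is a hypothetical helper name used only for this illustration.
# y/\t/\t/s maps TAB to TAB (the characters do not change) while /s squeezes
# each run of consecutive TABs down to a single TAB.
sub squeeze_tabs {
    my ($line) = @_;
    (my $copy = $line) =~ y/\t/\t/s;    # copy first, then modify the copy
    return $copy;
}

my $sample   = "foo\t\t\tbar\tbaz\t\tbooz";    # made-up sample record
my $squeezed = squeeze_tabs($sample);

# Show TABs visibly as "\t" so the result can be read on screen.
(my $visible = $squeezed) =~ s/\t/\\t/g;
print "$visible\n";    # prints: foo\tbar\tbaz\tbooz
```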

Your "SHOULD output to" shows two tabs between columns (except for "abc\t123" which I'm going to assume is just a typo). Because y///r and s///r can be chained, you can change

print $out_fh $_ =~ y/\t/\t/rs;

to

print $out_fh $_ =~ y/\t/\t/rs =~ s/\t/\t\t/gr;

Now, pm_11140114_tab_align_even.out will be:

foo bar baz booz qaaz abc
foo bar baz booz qaaz abc 123
foo bar baz booz qaaz abc

For older Perls, you'll need to split the print statement into three statements:

y/\t/\t/s;
s/\t/\t\t/g;
print $out_fh $_;

Again, this gives exactly the same result.
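
A small sketch of the chained form, with TABs made visible so the two steps can be checked by eye. The helper name double_tabs and the sample record are assumptions for illustration only.

```perl
#!/usr/bin/env perl
use 5.014;    # needed for the /r (non-destructive) option and for say
use warnings;

# double_tabs() is a hypothetical name for this sketch.
sub double_tabs {
    my ($line) = @_;
    # Step 1: y/\t/\t/rs returns a copy with each run of TABs squeezed to one.
    # Step 2: s/\t/\t\t/gr returns a copy of that with every TAB doubled.
    return $line =~ y/\t/\t/rs =~ s/\t/\t\t/gr;
}

my $sample = "foo\t\t\tbar\tbaz";    # made-up sample record
say double_tabs($sample) =~ s/\t/\\t/gr;    # prints: foo\t\tbar\t\tbaz
```

Because each step returns a new string, $sample itself is left unchanged.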

Please either advise whether the input data in your OP is representative or, if not, provide something more realistic so that we can provide better help.

It would also be useful to know what you intend to do with the output; e.g. print to screen, write to a plain text file, use for CSV, generate an HTML table, etc. With this information, we may be able to provide different (better) advice.

— Ken

Re^2: misalined TABs using substr,LAST_MATCH_START/END,regex
by tybalt89 (Monsignor) on Jan 04, 2022 at 15:57 UTC

    Or just:

    perl -pe 's/\t+/\t\t/g' infile >outfile
Re^2: misalined TABs using substr,LAST_MATCH_START/END,regex
by perl_boy (Novice) on Jan 11, 2022 at 20:56 UTC
    Thanks for the code. Although lines 1, 2 and 3 align well, there should not be a TAB at the beginning of line 0, and each group of words (text with spaces X) should be aligned according to the line containing the most TABs (max TAB). For example (ex 0), take this input (in 0):
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 0
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 1
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 2
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 3
    There are 4 TAB ranges on each of the 4 lines: 3 between the text with spaces groups (1 and 2, 2 and 3, 3 and 4) and 1 between text with spaces 4 and line X (X being 0, 1, 2, 3). Here is a 4 * 4 array (ar 0) (TAB_range * line), that is, the number of TABs in each of the 4 TAB ranges for each of the 4 lines:
    1 4 4 2
    1 3 3 4
    3 1 2 1
    4 4 1 3
    The maximum is 4 for each of the columns, so the max number of TABs for each of the columns is
    4 4 4 4
    There should therefore be 4 TABs between text with spaces 1 and text with spaces 2, between text with spaces 2 and text with spaces 3, between text with spaces 3 and text with spaces 4, and between text with spaces 4 and line X (X being 0, 1, 2, 3), and the output should be:
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 0
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 1
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 2
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 3
    The above example (ex 0) (in 0) is a bit unclear, as every TAB range ends up with 4 TABs. In the following example (ex 1) (in 1), temp9.txt, the TAB ranges (max TAB) have different numbers of TABs between the groups of words (text with spaces X):
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 0
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 1
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 2
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 3
    As in the above example (ex 0), there are 4 TAB ranges on each of the 4 lines: 3 between the text with spaces groups (1 and 2, 2 and 3, 3 and 4) and 1 between text with spaces 4 and line X (X being 0, 1, 2, 3). Here is a 4 * 4 array (ar 1) (TAB_range * line), that is, the number of TABs in each of the 4 TAB ranges for each of the 4 lines:
    1 3 3 2
    1 3 5 3
    2 4 2 1
    1 1 1 2
    The max number of TABs for each of the columns is 2 4 5 3, as opposed to 4 4 4 4 for the above array (ar 0), and thus the output should be:
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 0
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 1
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 2
    text with spaces 1 text with spaces 2 text with spaces 3 text with spaces 4 line 3
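    As a check on the arithmetic above, here is a minimal sketch that recomputes the per-column maxima from the (ar 1) array. The helper name column_maxima is made up for illustration and is not part of the script further down.

    ```perl
    use strict;
    use warnings;
    use List::Util qw(max);

    # TAB counts per line (rows) and per TAB range (columns), copied from (ar 1).
    my @tab_counts = (
        [1, 3, 3, 2],
        [1, 3, 5, 3],
        [2, 4, 2, 1],
        [1, 1, 1, 2],
    );

    # column_maxima() is a hypothetical helper name for this sketch.
    # It takes a list of array references (one per line) and returns,
    # for each column, the largest value found in that column.
    sub column_maxima {
        my @rows     = @_;
        my $last_col = max(map { $#$_ } @rows);
        return map {
            my $col = $_;
            max(map { $_->[$col] // 0 } @rows);    # missing cells count as 0
        } 0 .. $last_col;
    }

    print join(' ', column_maxima(@tab_counts)), "\n";    # prints: 2 4 5 3
    ```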
    I have printed the begin/end of each TAB range using print $-[0], ' ', $+[0], ' '; (line 20):
    0 1 19 20 38 41 59 62 80 82
    18 19 37 40 58 63 81 84
    18 20 38 42 60 62 80 81
    18 19 37 38 56 57 75 77
    I have also printed the maximum begin and end column numbers for each of the TAB ranges between the groups of words (text with spaces X):
    # printing max number of TABs for each of the columns begin ($max_b) and end ($max_e)
    print "max begin\t: " ;
    for ($x=0;$x<=$nbr_tab;$x++) {
        print $max_b[$x], ' ';
    }
    print "\nmax end\t\t: ";
    for ($x=0;$x<=$nbr_tab;$x++) {
        print $max_e[$x], ' ';
    }
    This is for debugging only, and outputs:
    max begin : 18 38 60 81 80
    max end : 20 42 63 84 82
    Here, text with spaces 1, 2, 3, 4 on all 4 lines 0, 1, 2, 3 have the same width, that is, 18, but in real life the text groups have variable lengths, as in the following, where lines 2, 3, 4 are taken from https://www.poetryfoundation.org/poems/55038/phrases (lack of inspiration after line 1):
    Bob, the rabbit jump above the fence Jack, the cat hid under the porch of the red house Rex, the dog ran after Jack the birds fly
    When the world is reduced to a single dark wood for our two pairs of dazzled eyes to a musical house for our clear understanding then I shall find you
    When we are very strong who draws back? very happy who collapses from ridicule When we are very bad what can they do to us.
    The taste of ashes in the air the smell of wood sweating in the hearth steeped flowers the devastation of paths drizzle over the canals in the fields why not already playthings and incense?
    Arousing a pleasant taste of Chinese ink a black powder gently rains on my night I lower the jets of the chandelier throw myself on the bed and turning toward thedark I see you O my daughters and queens!
    This should output as follows (temp11.txt, formatted using Ubuntu Mousepad):
    Bob, the rabbit jump above the fence Jack, the cat hid under the porch of the red house Rex, the dog ran after Jack the birds fly
    When the world is reduced to a single dark wood for our two pairs of dazzled eyes to a musical house for our clear understanding then I shall find you
    When we are very strong who draws back? very happy who collapses from ridicule When we are very bad what can they do to us.
    The taste of ashes in the air the smell of wood sweating in the hearth steeped flowers the devastation of paths drizzle over the canals in the fields why not already playthings and incense?
    Arousing a pleasant taste of Chinese ink a black powder gently rains on my night I lower the jets of the chandelier throw myself on the bed and turning toward thedark I see you O my daughters and queens!
    I would like the TABs to align at the rightmost/maximum column number (between $max_b[] and $max_e[]) for each TAB range between the groups of words (text with spaces X). Remember that the arrays @max_b and @max_e contain the beginning and ending column numbers of the TABs, found using
    $max_b[$max_tab] = $-[0] if $max_b[$max_tab] < $-[0] ;
    $max_e[$max_tab] = $+[0] if $max_e[$max_tab] < $+[0] ;
    and I would like to insert the missing TABs (to align with the longest line) using
    print $out_fh $_ =~ y/\t/\t/rs;
    I have tried
    print $out_fh $_ =~ y/\t/\t{3}/rs;
    just to test the use of {} for making the output 3 tabs wide, but got the same output. I have also tried
    print $out_fh $_ =~ tr/\t/\t/rs;
    I have read https://perldoc.perl.org/perlre on Perl regex but can't figure out where to insert $max_b[$tab_index] and $max_e[$tab_index] within y/\t/\t/rs or tr/\t/\t/rs
    I have changed the following lines in the code
    my $infile = 'pm_11140114_tab_align_even.dat';
    my $outfile = 'pm_11140114_tab_align_even.out';
    to
    my $infile = $ARGV[0];
    my $outfile = $ARGV[1];
    because I need to read the input and output filenames from the command-line arguments $ARGV[0] and $ARGV[1], and added
    close $in_fh;
    close $out_fh;
    I guess you forgot to close files
    my ($max_tab,$nbr_tab,$valid_line,@max_b,@max_e);
    for my arrays of TABs. I also added
    while(<$in_fh>) { ... }
    at the beginning, to read the file before writing to the output file in the second
    while(<$in_fh>) { ... }
    in order to get the positions of the begin/end of each TAB range inside the $max_b and $max_e arrays, before repositioning the file pointer at the beginning using
    seek $in_fh,0,0;
    I have commented
    # use 5.014;
    # use warnings;
    # use autodie;
    because I have useless warnings
    Parentheses missing around "my" list at 02-00.pl line 9.
    Global symbol "$nbr_tab" requires explicit package name (did you forget to declare "my $nbr_tab"?) at 02-00.pl line 9.
    Global symbol "$valid_line" requires explicit package name (did you forget to declare "my $valid_line"?) at 02-00.pl line 9.
    Global symbol "$valid_line" requires explicit package name (did you forget to declare "my $valid_line"?) at 02-00.pl line 13.
    Global symbol "@max" requires explicit package name (did you forget to declare "my @max"?) at 02-00.pl line 17.
    Global symbol "@max" requires explicit package name (did you forget to declare "my @max"?) at 02-00.pl line 17.
    Global symbol "$nbr_tab" requires explicit package name (did you forget to declare "my $nbr_tab"?) at 02-00.pl line 20.
    Global symbol "$nbr_tab" requires explicit package name (did you forget to declare "my $nbr_tab"?) at 02-00.pl line 20.
    Execution of 02-00.pl aborted due to compilation errors.
    I don't know where to put $max_e[] (the rightmost/maximum column number) within
    print $out_fh $_ =~ y/\t/\t/rs;
    I am sorry for this, but I'm quite new to Perl and regex and it's quite confusing. Thanks in advance. Here is the rewritten code:
    # use 5.014;
    # use warnings;
    # use autodie;
    $infile = $ARGV[0];
    $outfile = $ARGV[1];
    my ($max_tab,$nbr_tab,$valid_line,@max_b,@max_e);
    open my $in_fh, '<', $infile;
    open my $out_fh, '>', $outfile;
    while(<$in_fh>) {
        # print "$_\n";
        if (/[a-zA-Z0-9]/) {
            $valid_line++;
            $max_tab = 0;
            while (/\t+/g) {
                # print $-[0], ' ', $+[0], ' ';
                $max_b[$max_tab] = $-[0] if $max_b[$max_tab] < $-[0] ;
                $max_e[$max_tab] = $+[0] if $max_e[$max_tab] < $+[0] ;
                print $max_b[$max_tab], ' ', $max_e[$max_tab], ' ', $max_tab, ' ';
                $max_tab++;
            }
            # print "\n";
            $nbr_tab = $max_tab if $nbr_tab < $max_tab;
        }
    }
    # printing max number of TABs for each of the columns begin ($max_b) and end ($max_e) DEBUG
    print "max begin\t: " ;
    for ($x=0;$x<=$nbr_tab;$x++) {
        print $max_b[$x], ' ';
    }
    print "\nmax end\t\t: ";
    for ($x=0;$x<=$nbr_tab;$x++) {
        print $max_e[$x], ' ';
    }
    seek $in_fh,0,0;
    while (<$in_fh>) {
        print $out_fh $_ =~ y/\t/\t/rs;
    }
    close $in_fh;
    close $out_fh;

      Much of what you wrote in reply to my post seems to have no bearing whatsoever on my post. For instance, "there should not be a TAB in the beggining ...": none of the output I showed had a tab at the beginning of any line. I'm going to ignore all such content. Please ensure you're replying to the correct post; and, for clarity, add some indication showing to what your response refers (e.g. You wrote X; I think Y).

      Where and how you get your filenames is entirely up to you. I used hard-coded filenames for demo purposes only. I often use a prefix of pm_NODE-ID_ for demo files: it provides unique names as well as a reference back to the associated PM node. If you're reading from @ARGV, you should include some sanity checking; in this instance, check that @ARGV has two elements, with the first being a valid file. Also take a look at Getopt::Long.
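
      A minimal sketch of that kind of sanity check, assuming exactly two arguments (input file, then output file). The helper name check_args and the wording of the messages are made up for illustration.

      ```perl
      #!/usr/bin/env perl
      use strict;
      use warnings;

      # check_args() is a hypothetical helper; it returns an error message,
      # or an empty string when the arguments look usable.
      sub check_args {
          my @args = @_;
          return "Usage: $0 infile outfile" unless @args == 2;
          my ($infile, $outfile) = @args;
          # -f: plain file; -r _ reuses the stat result to test readability.
          return "'$infile' is not a readable plain file"
              unless -f $infile && -r _;
          return '';
      }

      # Typical use near the top of a script:
      #     my $error = check_args(@ARGV);
      #     die "$error\n" if $error;
      #     my ($infile, $outfile) = @ARGV;
      ```

      Returning a message rather than dying inside the helper keeps the check easy to test on its own; the caller decides how to report the problem.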

      "I have tried print $out_fh $_ =~ y/\t/\t{3}/rs;"

      That's not how transliteration works. See y/// and consider:

      $ perl -E 'my $x = "A\tB\t\tC"; say $x; say $x =~ y/\tABC/\t{3}/rs;'
      A B C
      { 3 }
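
      To make the difference concrete, here is a small sketch that sets the transliteration beside a substitution that does widen each run of TABs to a fixed count. The variable $width is an assumption for illustration; TABs are printed as a visible \t.

      ```perl
      use 5.014;
      use warnings;

      my $x = "A\tB\t\tC";

      # y/// maps characters one-to-one: TAB->TAB, A->'{', B->'3', C->'}'.
      # The braces are ordinary replacement characters, not a quantifier.
      my $transliterated = $x =~ y/\tABC/\t{3}/rs;

      # To turn every run of TABs into a fixed number of TABs, a substitution
      # can match the run with \t+ and build the replacement with the x operator.
      # $width is a hypothetical variable for this sketch.
      my $width   = 3;
      my $widened = $x =~ s/\t+/"\t" x $width/ger;

      say $transliterated =~ s/\t/\\t/gr;    # prints: {\t3\t}
      say $widened        =~ s/\t/\\t/gr;    # prints: A\t\t\tB\t\t\tC
      ```
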
      "I guess you forgot to close files"

      No, I certainly did not forget to do that. I declared, and used, lexical filehandles in the smallest scope possible (the anonymous block). Perl automatically closes files at the end of that scope.
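
      A short sketch of that behaviour, using a temporary file rather than the demo filenames: the data written inside the block is readable afterwards even though there is no explicit close for the output handle.

      ```perl
      use strict;
      use warnings;
      use autodie;
      use File::Temp qw(tempfile);

      # A throwaway file for the demonstration (generated name, not from the OP).
      my (undef, $path) = tempfile(UNLINK => 1);

      {
          open my $out_fh, '>', $path;
          print $out_fh "written inside the block\n";
          # No explicit close: $out_fh goes out of scope at the closing brace,
          # its last reference disappears, and Perl flushes and closes the file.
      }

      # Reading the file back shows the buffered data was written out.
      open my $in_fh, '<', $path;
      my $content = do { local $/; <$in_fh> };
      close $in_fh;

      print $content;    # prints: written inside the block
      ```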

      I also didn't forget to check for I/O exceptions. Again, Perl does this for me via the autodie pragma.

      I have commented

      # use 5.014;
      # use warnings;
      # use autodie;

      because I have useless warnings ...

      That's a very bad move and I strongly recommend that you do not do this. Parentheses missing around ... is the only warning; all the rest are errors (note the Execution of 02-00.pl aborted due to compilation errors. as the last line). Furthermore, none of those messages are "useless"!

      As you're not checking for I/O exceptions, you should definitely use the autodie pragma and let Perl do it for you.
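
      As a hedged illustration of what autodie provides: with the pragma in effect, a failed open throws an exception object instead of quietly returning false. The path below is assumed not to exist and is used only to provoke the failure.

      ```perl
      use strict;
      use warnings;
      use autodie;

      # A path assumed not to exist, used only to provoke a failure.
      my $missing = '/no/such/dir/pm_demo_input.dat';

      # With autodie in effect, a failed open throws an exception,
      # so no "or die ..." is needed after each I/O call.
      my $ok = eval {
          open my $in_fh, '<', $missing;
          1;
      };
      my $error = $@;    # an autodie::exception object on failure

      print "open failed and was reported by autodie\n"
          if !$ok && ref $error && $error->isa('autodie::exception');
      ```

      Without the eval, the same failure would stop the script with a message naming the file and the reason.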

      — Ken