BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of large files of numbers output by some program, that are in a very wide layout -- ie. there are many numbers listed per line separated by spaces -- which is convenient for the program but not for human inspection.

I wanted to 'wrap' the lines at a convenient place. Whilst the numbers are not sequential, they are ordered and so I chose to wrap them such that all the numbers within each 100 range (eg. nnn100 .. nnn199 ) are on a single line.

To clarify, the input (vastly cut down) looks like this:

105 106 107 108 109 110 111 112 113
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353

And I decided to wrap it to look like this:

105 106 107 108 109
110 111 112 113
1115 1116 1117 1118 1119
1120 1121 1122 1123
12345 12346 12347 12348 12349
12350 12351 12353

I thought to do this by replacing the space between (n1)\d\d and (n1+1)\d\d with a newline. Hence I tried the following regex:

#! perl -slw
use strict;

while( <DATA> ) {
    s[\s(\d+)\d\d\K\s(?=(\d+)\d\d\s)]{ $1 + 1 == $2 ? "\n" : ' ' }ge;
    print;
}
__DATA__
105 106 107 108 109 110 111 112 113
1115 1116 1117 1118 1119 1120 1121 1122 1123
12345 12346 12347 12348 12349 12350 12351 12353

But it does nothing. I've since done the job (with multiple passes, due to the fixed-length look-behind limitation) using:

perl -ple"s[(?<=\s(\d{3})\d\d)\s(?=(\d{3})\d\d\s)][$1+1==$2 ? qq[\n] : ' ' ]ge" in > out

But I still don't understand why the above regex fails?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re: Why doesn't this regex work?
by choroba (Cardinal) on Aug 15, 2013 at 09:45 UTC
    Have you tried
    use re 'debug';
    Its output is very long, but as you know what the regex should do, you might understand it.

    Update: It seems your example data is missing a digit at the end of each number. I tried it with different data:

    2120 2140 2180 2197 2200 2203 2205 2234 2238 2259 2280 2299 2401

    If I understood your specification, the following should do the work, being a bit more readable:

    s/\b(\d*)(\d\d) (?=\1\d\d\b)/$1$2_/g; s/ /\n/g; s/_/ /g;
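To see what the three substitutions above do to the sample line, here is a minimal sketch that wraps them in a sub. The name wrap_by_hundreds is made up for illustration, and it assumes the numbers are separated by single spaces and that the data contains no underscores.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the three substitutions quoted above.
# Assumes single-space separators and no underscores in the data.
sub wrap_by_hundreds {
    my ($line) = @_;
    # Pass 1: mark each space that joins two numbers sharing the same
    # leading digits (everything except the last two digits).
    $line =~ s/\b(\d*)(\d\d) (?=\1\d\d\b)/$1$2_/g;
    # Pass 2: every space still left separates two different groups.
    $line =~ s/ /\n/g;
    # Pass 3: turn the markers back into spaces.
    $line =~ s/_/ /g;
    return $line;
}

print wrap_by_hundreds(
    '2120 2140 2180 2197 2200 2203 2205 2234 2238 2259 2280 2299 2401'
), "\n";
```

With that input the 21xx, 22xx and 24xx numbers should end up on three separate lines.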
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Have you tried use re 'debug';

      Yes. But it didn't help. It shows that matches do occur; but doesn't explain why the newlines are never inserted.

      It seems your example data miss a digit at the end of each number.

      My mistake trying to simplify the data. Two corrections are possible:

      1. Remove one \d from each side of the regex:
        while( <DATA> ) {
            s[\s(\d+)\d\K\s(?=(\d+)\d\s)]{ print "$1:$2"; $1 + 1 == $2 ? "\n" : ' ' }ge;
            print;
        }
      2. Add a digit to the end of each number in the data:
        1051 1061 1071 1081 1091 1101 1111 1121 1131
        11151 11161 11171 11181 11191 11201 11211 11221 11231
        123451 123461 123471 123481 123491 123501 123511 123531
      as you know what the regex should do

      The idea of the regex is to match each pair of numbers in a line and capture the first d-2 digits of each number.

      Eg. Match the pair and capture the first two digits of each:  (10)91 (11)01.

      Then if $1+1 (10+1) == $2 (11) replace the space between them with a newline.

      The \K prevents the number before the replaced space being replaced. And the lookahead prevents the number after being replaced.


Re: Why doesn't this regex work? ( lookaround backtracking pos)
by Anonymous Monk on Aug 15, 2013 at 11:07 UTC

    I think I got it: there is one \s and it is outside of the lookaround, so it gets consumed (advances pos). The next match attempt can't go back before it, so that space is not available as the leading \s of the next pair, and the regex ends up checking only every other number.

    Look at the pos, in your version (\s outside (?=)) it jumps 7/15 but with \s inside (?=) it jumps 7/11/15

    #! perl -slw
    use strict;

    my $data = '105 106 107 108 109 110 111 112 113
    1115 1116 1117 1118 1119 1120 1121 1122 1123
    12345 12346 12347 12348 12349 12350 12351 12353
    ';
    {
        my $other = @ARGV ? qr{(?=\s(\d+)\d\s)} : qr{\s(?=(\d+)\d\s)};
        open my($DATA),'<',\$data;
        my $what = "";
        while( readline $DATA ) {
            s{ \s(\d+)\d \K $other }{
                warn "WHAT($1)($2)POS(@{[pos()]})\n";
                $1 + 1 == $2 ? ";\n" : '! '
            }gex;
            print;
        }
        print "\n$what";
    }
    __END__
    $ perl junk
    WHAT(10)(10)POS(7)
    WHAT(10)(10)POS(15)
    WHAT(11)(11)POS(23)
    WHAT(11)(11)POS(31)
    105 106! 107 108! 109 110! 111 112! 113
    WHAT(111)(111)POS(9)
    WHAT(111)(111)POS(19)
    WHAT(112)(112)POS(29)
    WHAT(112)(112)POS(39)
    1115 1116! 1117 1118! 1119 1120! 1121 1122! 1123
    WHAT(1234)(1234)POS(11)
    WHAT(1234)(1234)POS(23)
    WHAT(1235)(1235)POS(35)
    12345 12346! 12347 12348! 12349 12350! 12351 12353

    $ perl junk lookaround
    WHAT(10)(10)POS(7)
    WHAT(10)(10)POS(11)
    WHAT(10)(10)POS(15)
    WHAT(10)(11)POS(19)
    WHAT(11)(11)POS(23)
    WHAT(11)(11)POS(27)
    WHAT(11)(11)POS(31)
    105 106! 107! 108! 109;
    110! 111! 112! 113
    WHAT(111)(111)POS(9)
    WHAT(111)(111)POS(14)
    WHAT(111)(111)POS(19)
    WHAT(111)(112)POS(24)
    WHAT(112)(112)POS(29)
    WHAT(112)(112)POS(34)
    WHAT(112)(112)POS(39)
    1115 1116! 1117! 1118! 1119;
    1120! 1121! 1122! 1123
    WHAT(1234)(1234)POS(11)
    WHAT(1234)(1234)POS(17)
    WHAT(1234)(1234)POS(23)
    WHAT(1234)(1235)POS(29)
    WHAT(1235)(1235)POS(35)
    WHAT(1235)(1235)POS(41)
    12345 12346! 12347! 12348! 12349;
    12350! 12351! 12353

    rxrx gave me the idea to pos it

      there is one \s and it is outside of the lookaround ... so its checking every other number

      BINGO! Thank you.

      If I put the first space into a lookbehind: s[(?<=\s)(\d+)\d\K\s(?=(\d+)\d\s)]{

      Or substitute the zero length \b--which will serve the same purpose: s[\b(\d+)\d\K\s(?=(\d+)\d\s)]{

      The substitution works as I wanted it to.
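For anyone who wants to compare the two behaviours side by side, here is a small sketch. The sub name wrap_with and the variable names are made up for illustration; the two patterns are the consuming-\s version and the \b version quoted above, applied to the first sample line with its trailing newline left in place.

```perl
use strict;
use warnings;

# Same replacement logic for both patterns; only the leading anchor differs.
sub wrap_with {
    my ($re, $line) = @_;
    $line =~ s{$re}{ $1 + 1 == $2 ? "\n" : ' ' }ge;
    return $line;
}

# Leading \s is consumed, so the space replaced by one match cannot
# also start the next match: only every other pair gets tested.
my $consuming  = qr/\s(\d+)\d\K\s(?=(\d+)\d\s)/;

# Leading \b is zero-width, so every adjacent pair gets tested.
my $zero_width = qr/\b(\d+)\d\K\s(?=(\d+)\d\s)/;

my $line = "105 106 107 108 109 110 111 112 113\n";
print wrap_with($consuming,  $line);   # line comes back unchanged
print wrap_with($zero_width, $line);   # newline inserted between 109 and 110
```

The only difference between the two patterns is whether the leading anchor eats a character, which is the whole bug.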

      But damn I could not see that for looking.



        Took me a few dozen looks, I started to compose some funny(wrong) answers some 3-4 times

        The experimental (??{ code }) feature looks kinda neat

        perl -le " $_ = shift; s{(\d+)\d\s\K(?=(??{$1+1})\d\s)}{\n}g; print " "11 12 21 22 32 33 41 44"
        11 12
        21 22
        32 33
        41 44
Re: Why doesn't this regex work?
by mtmcc (Hermit) on Aug 15, 2013 at 10:40 UTC
    I've looked through it with Regexp::Debugger.

    Maybe (probably) I'm misunderstanding the question, but in your sample data, the hundreds groups in each line seem to be the same, i.e. $1 + 1 will never equal $2 in this data, for any line.

    I think the second point is about the spaces. When the regex matches, it skips a space, so misses out on matching the next number to the first part of the regex.

    For example, this inserts a newline before the 200s start:

    #!/usr/bin/perl
    use strict;
    use warnings;
    #use Regexp::Debugger;

    while( <DATA> ) {
        s[\s(\d+)\d\d\K\s(?=(\d+)\d\d)]{ $1 + 1 == $2 ? "\n" : ' ' }ge;
        print;
    }
    __DATA__
    105 106 107 108 109 110 211 212 213
    1115 1116 1117 1118 1119 1120 1121 1122 1123
    12345 12346 12347 12348 12349 12350 12351 12353

    But when you change the data to:

    105 106 107 108 109 210 211 212 213
    1115 1116 1117 1118 1119 1120 1121 1122 1123
    12345 12346 12347 12348 12349 12350 12351 12353

    the first part of the regex doesn't match the 109, because the space in front of it was already consumed by the previous match (108 109), so the pair 109 210 is never compared.
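A quick way to see which neighbours actually get compared is to collect them with a global match. This is only a sketch: pairs_examined is a made-up name, and the pattern is simplified to capture whole numbers rather than their leading digits, but it keeps the same consuming \s in front.

```perl
use strict;
use warnings;

# Hypothetical helper: list the (left-right) numbers each match compares.
# The pattern keeps the leading, consuming \s of the original regex but
# captures whole numbers to make the pairs easy to read.
sub pairs_examined {
    my ($line) = @_;
    my @pairs;
    while ($line =~ /\s(\d+)\K\s(?=(\d+))/g) {
        push @pairs, "$1-$2";
    }
    return @pairs;
}

# Transition between 110 and 211 falls on an examined pair.
print join(' ', pairs_examined('105 106 107 108 109 110 211 212 213')), "\n";
# Transition between 109 and 210 falls between two examined pairs.
print join(' ', pairs_examined('105 106 107 108 109 210 211 212 213')), "\n";
```

In the second call the pair 109-210 should be absent from the list, which is why no newline appears there.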

    Apologies if I'm way off!

Re: Why doesn't this regex work?
by Athanasius (Archbishop) on Aug 15, 2013 at 10:40 UTC

    Using printf within the substitution:

    #! perl
    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        s[\s(\d+)\d\K\s(?=(\d+)\d\s)]
         {
             printf("%s: (%s) (%s)\n", $_, $1, $2);
         }ge;
    }
    __DATA__
    105 106 107 108 109 110 111 112 113
    1115 1116 1117 1118 1119 1120 1121 1122 1123
    12345 12346 12347 12348 12349 12350 12351 12353

    gives:

    20:32 >perl 687_SoPW.pl
    105 106 107 108 109 110 111 112 113: (10) (10)
    105 106 107 108 109 110 111 112 113: (10) (10)
    105 106 107 108 109 110 111 112 113: (11) (11)
    1115 1116 1117 1118 1119 1120 1121 1122 1123: (111) (111)
    1115 1116 1117 1118 1119 1120 1121 1122 1123: (111) (111)
    1115 1116 1117 1118 1119 1120 1121 1122 1123: (112) (112)
    12345 12346 12347 12348 12349 12350 12351 12353: (1234) (1234)
    12345 12346 12347 12348 12349 12350 12351 12353: (1234) (1234)
    12345 12346 12347 12348 12349 12350 12351 12353: (1235) (1235)

    20:32 >

    which seems to show that each comparison is consuming three terms from the data line, instead of one. I don’t understand why this is, but I’m posting in the hope that it provides someone with a useful clue as to what is going on. :-)

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Why doesn't this regex work?
by boftx (Deacon) on Aug 15, 2013 at 10:47 UTC

    Please understand that I am new to the Monastery. That said, why wouldn't something like this work if the data always has spaces between the numbers?

    #!/usr/bin/perl
    use strict;

    while( <DATA> ) {
        my @numbers = split;
        my $half = int(@numbers/2);
        my @first = splice(@numbers,0,$half);
        print "@first\n";
        print " @numbers\n";
    }
    __DATA__
    105 106 107 108 109 110 111 112 113
    1115 1116 1117 1118 1119 1120 1121 1122 1123
    12345 12346 12347 12348 12349 12350 12351 12353

    The first snippet just cuts the line in half, but it could easily be adapted to print out lines of X number of values (with continuation rows indented.)

    #!/usr/bin/perl
    use strict;

    my $maxitems = 3;
    while( <DATA> ) {
        my @numbers = split;
        my $indent = '';
        while ( @numbers > 0) {
            my @rowdata = splice(@numbers,0,$maxitems);
            print "${indent}@rowdata\n";
            $indent = ' ';
        }
    }
    __DATA__
    105 106 107 108 109 110 111 112 113 116
    1115 1116 1117 1118 1119 1120 1121 1122 1123 1125
    12345 12346 12347 12348 12349 12350 12351 12353 12355

    UPDATE: I understand that this doesn't answer the question posed (which is interesting in itself) but I have to wonder if that is the right approach to the task at hand to begin with.
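Along the same lines, the split approach could be adapted to the grouping actually asked for. What follows is only a sketch under assumptions: group_by_hundreds is a made-up name, and it assumes the numbers are non-negative integers, already ordered, and whitespace-separated.

```perl
use strict;
use warnings;

# Hypothetical helper: start a new output row whenever int(n / 100) changes.
# Assumes ordered, non-negative integers separated by whitespace.
sub group_by_hundreds {
    my ($line) = @_;
    my (@rows, $current);
    for my $n (split ' ', $line) {
        my $bucket = int($n / 100);
        if (!defined $current || $bucket != $current) {
            push @rows, [];          # open a new row for this hundreds group
            $current = $bucket;
        }
        push @{ $rows[-1] }, $n;
    }
    return map { "@$_" } @rows;      # one space-joined string per row
}

print "$_\n" for group_by_hundreds('105 106 113 220 221 1199 1200 1201');
```

This avoids the regex entirely, at the cost of splitting every line into a list first.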

Re: Why doesn't this regex work?
by Laurent_R (Canon) on Aug 15, 2013 at 10:16 UTC

    In your data, the third (and fourth and fifth) digits from the right are the same. And this is what you are capturing. You will not get any match where the captured numbers are different.

    Second thing, I think you should probably make your spaces optional if you do not have spaces at the beginning or at the end of your lines (or perhaps use a word boundary anchor).

    With these two points in mind:

    DB<8> $_ = "105 106 107 108 109 110 111 112 113 213 214";
    DB<9> s[\s?(\d+)\d\d\K\s(?=(\d+)\d\d\s?)]{$1 + 1 == $2 ? "\n" : ' '}ge;
    DB<10> x $_
    0  '105 106 107 108 109 110 111 112 113
    213 214'
    DB<11>

    I also think that this:

    $1 + 1 == $2 ? "\n" : ' '

    is going to fail on sequence jumps, e.g. on data looking like this:

    $_ = "105 106 107 108 109 110 111 112 113 313 314";
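One possible way around the sequence-jump problem is to break whenever the leading digits differ at all, instead of testing for exactly +1. A minimal sketch, assuming every number has at least three digits; wrap_on_change is a made-up name and this is not the regex from the original post.

```perl
use strict;
use warnings;

# Hypothetical variant: insert a newline whenever the leading digits of two
# neighbours differ, not only when they grow by exactly one.
# Assumes every number has at least three digits.
sub wrap_on_change {
    my ($line) = @_;
    $line =~ s{\b(\d+)\d\d\K\s(?=(\d+)\d\d\b)}{ $1 == $2 ? ' ' : "\n" }ge;
    return $line;
}

# The jump from the 100s straight to the 300s still produces a line break.
print wrap_on_change('105 106 107 108 109 110 111 112 113 313 314'), "\n";
```

The \b anchors at both ends also take care of lines that have no leading or trailing space.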
Re: Why doesn't this regex work? (Solved!)
by AnomalousMonk (Archbishop) on Aug 15, 2013 at 14:34 UTC

    As your original question seems to have been answered, here, FWIW, is an approach that is neither multi-pass nor dependent on /e replacement code execution or embedded code blocks, as most of the others seem to be. I was a bit confused by the data output example in the OP, but I have set this up to insert newlines on transitions from one 100s group to the next.

    >perl -wMstrict -le
    "my @lines = (
       '105 106 107 108 109 110 111 112 113 220 221 223',
       '100 101 198 199 200 201 298 299 300 301 398 399 400 401',
       '1115 1116 1117 1118 1119 1120 1121 1122 1123 1200 1201 1202',
       '1100 1101 1102 1198 1199 1200 1201 1202 1298 1299 1300 1301',
       '12345 12346 12347 12348 12349 12450 12451 12453 12466 12467',
       '12300 12301 12398 12399 12400 12401 12498 12499 12500 12501',
       );
     print qq{'$_'} for @lines;
     print '';
     ;;
     s{ (\d) \d\d \K [^\n\S]+ (?! \d* \1 \d\d \b) }{\n}xmsg for @lines;
     print qq{'$_'} for @lines;
    "
    '105 106 107 108 109 110 111 112 113 220 221 223'
    '100 101 198 199 200 201 298 299 300 301 398 399 400 401'
    '1115 1116 1117 1118 1119 1120 1121 1122 1123 1200 1201 1202'
    '1100 1101 1102 1198 1199 1200 1201 1202 1298 1299 1300 1301'
    '12345 12346 12347 12348 12349 12450 12451 12453 12466 12467'
    '12300 12301 12398 12399 12400 12401 12498 12499 12500 12501'

    '105 106 107 108 109 110 111 112 113
    220 221 223'
    '100 101 198 199
    200 201 298 299
    300 301 398 399
    400 401'
    '1115 1116 1117 1118 1119 1120 1121 1122 1123
    1200 1201 1202'
    '1100 1101 1102 1198 1199
    1200 1201 1202 1298 1299
    1300 1301'
    '12345 12346 12347 12348 12349
    12450 12451 12453 12466 12467'
    '12300 12301 12398 12399
    12400 12401 12498 12499
    12500 12501'

      Nice, doing the least amount of work is the fastest

      5.016003
               Rate anoNa anoNb  bukA  bukB  anoM
      anoNa   461/s    --  -59%  -90%  -90%  -94%
      anoNb  1126/s  144%    --  -76%  -76%  -85%
      bukA   4662/s  910%  314%    --   -0%  -40%
      bukB   4675/s  913%  315%    0%    --  -40%
      anoM   7751/s 1580%  588%   66%   66%    --
      FWIW the outputs aren't identical but they're close enough :)

        Even the multi-pass solution I actually used to do the job probably only took 3 or 4 minutes, including the time it took to type the original and then retrieve, edit and re-run each of 5 passes.

        Conversely, I must have spent an hour or two trying to figure out why my first attempt didn't work.

