How to split a string based on the length of a sequence of characters within the string

thanos1983 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I need your wisdom once again. Although I have found an answer to my question, I need your help to understand why the output is formatted the way it is.

I want to split user input into pieces of a fixed number of characters.

Update

I modified the title to better describe my question and to avoid confusing future readers. The observation came from ww; thank you for the suggestion.

A working example is given below, with the output as expected:

#! /usr/local/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $input = "1234567890abcdefghij0987654321ABCDEFGHIJlmnop";

chomp $input;

# A  "A text (ASCII) string, will be space padded."

my @chunks = unpack("(A10)*", $input);

print Dumper(\@chunks);

__END__

$VAR1 = [
          '1234567890',
          'abcdefghij',
          '0987654321',
          'ABCDEFGHIJ',
          'lmnop'
        ];
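A side note on the "A" template: the comment quoted in the code describes what "A" does when packing. When unpacking, "A" strips trailing spaces and NULs from each field, whereas "a" returns the data untouched. The sketch below uses a made-up input string (not from the original post) to show the difference; whether it matters depends on whether your real input can contain trailing spaces.

```perl
use strict;
use warnings;

# Hypothetical input: "abc" followed by seven spaces, then ten letters.
my $padded = "abc       defghijklm";

# "A" strips trailing spaces (and NULs) from every extracted field.
my @stripped = unpack("(A10)*", $padded);

# "a" returns each field exactly as it appears in the string.
my @verbatim = unpack("(a10)*", $padded);

print "A: [$_]\n" for @stripped;
print "a: [$_]\n" for @verbatim;
```

If the chunks must keep their exact width, "a" is probably the safer template.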

While trying different approaches on the way to that solution, I thought about using split with a regex. The program seems to work, but I get blank entries in the output. Why? I cannot see where I am going wrong. I tried chomp, but it looks like there are no trailing newline characters. Does anyone understand why this is happening?

The code and its output are given below:

#! /usr/local/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $input = "1234567890abcdefghij0987654321ABCDEFGHIJlmnop";

chomp $input;


my @chunks = split(/(.{10})/,$input);

print Dumper(\@chunks);

__END__

$VAR1 = [
          '',
          '1234567890',
          '',
          'abcdefghij',
          '',
          '0987654321',
          '',
          'ABCDEFGHIJ',
          'lmnop'
        ];

Thank you all for your time and effort reading and replying to my question.

Seeking for Perl wisdom...on the process of learning...not there...yet!


Replies are listed 'Best First'.

Re: How to split a string based on character(s) length
by AppleFritter (Vicar) on Aug 13, 2014 at 16:19 UTC

The regex you provide to split specifies a separator, so Perl is splitting your string into chunks separated by sequences of ten characters each. What you get is the following:

$VAR1 = [
          <chunk1>,
          <separator>,
          <chunk2>,
          <separator>,
          <chunk3>,
          <separator>,
          <chunk4>,
          <separator>,
          <chunk5>
]

And all the chunks are empty except for the last one, which makes sense since there are no constraints on the separators (other than their length): Perl searches for the separator, finds it right at the beginning of the string, and splits it off along with a preceding empty chunk; then the whole process repeats, until you've only got five characters left, not enough to match the separator, so that's your last chunk.
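To see the chunk/separator picture in isolation, here is a small sketch using the same input as the original post. Without the capturing parentheses, the separators are discarded and only the fields between them remain. The second line keeps the capture and filters out the zero-length fields with grep; that filter is my own addition, and it would also discard any legitimately empty chunk, so treat it as an illustration rather than a recommendation.

```perl
use strict;
use warnings;

my $input = "1234567890abcdefghij0987654321ABCDEFGHIJlmnop";

# No capturing group: the ten-character separators are thrown away,
# leaving only the (mostly empty) fields between them.
my @fields = split /.{10}/, $input;

# Capturing group, then drop the zero-length fields.
my @chunks = grep { length } split /(.{10})/, $input;

print scalar(@fields), " fields, last one is '$fields[-1]'\n";
print join(",", @chunks), "\n";
```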

Here's how I'd split a string into ten-character chunks using a regex:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my $input = "1234567890abcdefghij0987654321ABCDEFGHIJlmnop";

my @chunks = ($input =~ /.{1,10}/g);

print Dumper \@chunks;

This produces:

$VAR1 = [
          '1234567890',
          'abcdefghij',
          '0987654321',
          'ABCDEFGHIJ',
          'lmnop'
        ];

Note that the quantifier needs to be {1,10} rather than just {10} here to accommodate the final chunk of less than ten characters. The regex engine's greediness will ensure that all chunks but the last will get the full ten characters.
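A quick way to check the point about the quantifier is to run both patterns side by side on the same input. This is only a sketch; the variable names are mine. One further caveat: "." does not match a newline unless the /s modifier is used, so input containing embedded newlines would need /s.

```perl
use strict;
use warnings;

my $input = "1234567890abcdefghij0987654321ABCDEFGHIJlmnop";

my @exact    = $input =~ /.{10}/g;      # only full ten-character matches
my @flexible = $input =~ /.{1,10}/g;    # also picks up the short tail

print scalar(@exact), " vs ", scalar(@flexible), " chunks\n";
print "tail: $flexible[-1]\n";
```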


Re^2: How to split a string based on character(s) length

by thanos1983 (Parson) on Aug 13, 2014 at 20:14 UTC

Hello AppleFritter,

It is starting to make sense. I was wondering how that was possible, but thanks to your explanation I finally got it. Thank you for your time and effort. :D


Re^3: How to split a string based on character(s) length

by AppleFritter (Vicar) on Aug 13, 2014 at 20:37 UTC

You're very welcome! *tips hat* Always happy to help, and I'm learning more than just a thing or two in the process as well.


Re: How to split a string based on character(s) length
by Athanasius (Archbishop) on Aug 13, 2014 at 16:31 UTC

To add to AppleFritter’s excellent explanation:

The only reason split is giving you the contents of the separators is that you have capturing parentheses in your regex. From split:

If the PATTERN contains capturing groups, then for each separator, an additional field is produced for each substring captured by a group...
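The "for each substring captured by a group" part is easiest to see with more than one group. The sketch below uses a small made-up string: a group that did not take part in the match contributes undef, and a non-capturing group (?:...) contributes nothing at all.

```perl
use strict;
use warnings;

# Two capturing groups in an alternation: each separator yields two
# extra fields, one per group; the group that did not match gives undef.
my @fields = split /(-)|(,)/, "1-10,20";

print join(" ", map { defined $_ ? "'$_'" : "undef" } @fields), "\n";

# A non-capturing group (?:...) groups without adding fields.
my @plain = split /(?:-|,)/, "1-10,20";
print join(" ", @plain), "\n";
```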

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: How to split a string based on character(s) length

by thanos1983 (Parson) on Aug 13, 2014 at 20:21 UTC

Hello Athanasius,

As always, your attention to detail sheds even more light on my questions. Thank you for your time and effort. It is really nice to learn all these small details in order to understand the behavior of the code.


Re: How to split a string based on character(s) length
by Laurent_R (Canon) on Aug 13, 2014 at 17:08 UTC

You could also use substr.

It is probably faster than a regex or a split, but unpack is likely to be the fastest solution (if speed matters in your case). This is a link to a post where I used several different methods to solve your exact problem and to a benchmark on those various methods I did some time ago: Re: Performance problems on splitting long strings.
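The linked post is not reproduced in this thread, so the following is only a guess at what a substr-based version might look like, not necessarily the code that was benchmarked there. The $size variable is a name introduced here for illustration.

```perl
use strict;
use warnings;

my $input = "1234567890abcdefghij0987654321ABCDEFGHIJlmnop";
my $size  = 10;    # chunk length, taken from the original question

my @chunks;
for (my $pos = 0; $pos < length($input); $pos += $size) {
    # substr simply returns the remaining characters when fewer
    # than $size are left at the end of the string.
    push @chunks, substr($input, $pos, $size);
}

print "$_\n" for @chunks;
```

No speed claim is attached to this sketch; the core Benchmark module would be the way to compare it against unpack and the regex on your own data.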


Re^2: How to split a string based on character(s) length

by thanos1983 (Parson) on Aug 13, 2014 at 20:26 UTC

Hello Laurent_R,

Thanks a lot for the link. In my case time is not an issue, but I guess that as the files start to grow, time will become an important factor. Again, thank you for your time and effort. It is really nice that people offer advice on such small issues in a way that is easy to understand, at least for beginners like me.


Re^3: How to split a string based on character(s) length

by Laurent_R (Canon) on Aug 13, 2014 at 21:18 UTC

thanos1983,

You are welcome; it is my pleasure to help if I can. Even if speed is not your concern (it was for me, having to deal with very large files of several GB), the post I linked to offers 8 different solutions to your problem, and you may find some interesting ideas among them. At least 3 of them, perhaps more, were suggested by other monks in earlier posts of the same thread (you can look at the thread history if you want to know). In brief, they are not "my" solutions, but solutions derived from a very fruitful discussion with many monks.


Re: How to split a string based on character(s) length
by ikegami (Patriarch) on Aug 13, 2014 at 16:31 UTC

The first argument to split should define what separates the items you want to extract. Wrong tool. You want a vanilla regex match.

my @chunks = $input =~ /(.{1,10})/g;
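For readers unfamiliar with this idiom: assigning to an array puts the match in list context, and a /g match in list context returns all captured substrings (or all whole matches, if the pattern has no capturing group). In scalar context the same match finds one occurrence per evaluation, which is why it can drive a while loop. A minimal sketch comparing the two forms, with @looped being a name made up for the comparison:

```perl
use strict;
use warnings;

my $input = "1234567890abcdefghij0987654321ABCDEFGHIJlmnop";

# List context: /g returns every match at once.
my @chunks = $input =~ /(.{1,10})/g;

# Scalar context: /g finds one match per evaluation, so it can drive a loop.
my @looped;
while ($input =~ /(.{1,10})/g) {
    push @looped, $1;
}

print "same result\n" if "@chunks" eq "@looped";
```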


Re^2: How to split a string based on character(s) length

by thanos1983 (Parson) on Aug 13, 2014 at 20:19 UTC

Hello ikegami,

Your code also works perfectly. By the way, I never thought to write the code like this: my @chunks = $input =~ /(.{1,10})/g;. I did not imagine that assigning the match straight to an array could work. Thanks for the tip and for taking the time to assist me with my question.


Re^3: How to split a string based on character(s) length

by ikegami (Patriarch) on Aug 15, 2014 at 03:55 UTC

$input =~ /(.{1,10})/g

unpack("(A10)*", $input)


Re: How to split a string based on character(s) length
by ww (Archbishop) on Aug 13, 2014 at 18:28 UTC

... and just to add another sprig of julep:

C:\>perl -E "my $input='#abcd#ef#ghi#jklmnopqrstuv'; my @chunks = split(/([a-z]{3})/,$input); use Data::Dumper; say Dumper @chunks;"
$VAR1 = '#';
$VAR2 = 'abc';
$VAR3 = 'd#ef#';
$VAR4 = 'ghi';
$VAR5 = '#';
$VAR6 = 'jkl';
$VAR7 = '';
$VAR8 = 'mno';
$VAR9 = '';
$VAR10 = 'pqr';
$VAR11 = '';
$VAR12 = 'stu';
$VAR13 = 'v';

Whereas without the capturing parens:

C:\>perl -E "my $input='#abcd#ef#ghi#jklmnopqrstuv'; my @chunks = split(/[a-z]{3}/,$input); use Data::Dumper; say Dumper @chunks;"
$VAR1 = '#';
$VAR2 = 'd#ef#';
$VAR3 = '#';
$VAR4 = '';
$VAR5 = '';
$VAR6 = '';
$VAR7 = 'v';

As suggested above, a simple regex might serve you better, and with fewer gotchas. While split with capturing parens might be a legitimate approach for certain problem cases, it is most certainly likely to be a HUGE! PITA for some successor dev.

Afterthought (i.e., UPDATE) re OP's title: You're not trying "to split a string based on character(s) length" (as your title says), because only when dealing with Unicode chars does the length of a character vary :-)

Rather, you are trying to split a string based on the length of a sequence of characters within the string (AKA: 'a substring length').

check Ln42!


Re: How to split a string based on character(s) length
by Anonymous Monk on Aug 13, 2014 at 16:36 UTC

See the documentation of split: if you include a capture group in the regular expression, split will include the separator in its output.

$ perl -wMstrict -le 'print join ",", split /[a-z]/, "1a2b3c4"'
1,2,3,4
$ perl -wMstrict -le 'print join ",", split /([a-z])/, "1a2b3c4"'
1,a,2,b,3,c,4


Re^2: How to split a string based on character(s) length

by thanos1983 (Parson) on Aug 13, 2014 at 20:35 UTC

Hello Anonymous Monk,

Hmmm, I guess I have to read the split documentation more closely; I missed this part. Thank you for your time and for pointing me in the right direction.