rectangularizing input to become array

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:


Re: rectangularizing input to become array
by haukex (Archbishop) on Feb 27, 2019 at 06:58 UTC

I'm attempting to present an SSCCE.

Note that an SSCCE is all about making things clear and easy for the people trying to help: no extra editing of the inputs, no guessing which input produces which output, no guessing whether commented-out code is relevant to the question, and so on. For me, an ideal question would contain three of PerlMonks' [download] links that I can right-click and do "Save As" on: the input file, the output file, and the source code, such that I can just download them and run perl script.pl input.txt | diff - output.txt and see what's going on. There are also a couple of other options, such as embedding the input in the source in the form of a __DATA__ section or just a variable, and embedding the output in the source and using e.g. Test::More to check if it matches.
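As a minimal sketch of that last option, here is a script that carries both its input and its expected output as variables and compares them with Test::More. The trim_lines function and the sample data are made up for this illustration; they only stand in for whatever code the question is actually about.

```perl
use warnings;
use strict;
use Test::More;

# Input embedded as a variable instead of a separate file.
my $input = <<'END_INPUT';
abcdef
abcd
ab
END_INPUT

# Expected output embedded the same way.
my $expected = <<'END_EXPECTED';
abc
abc
ab
END_EXPECTED

# Hypothetical stand-in for the code under discussion:
# cut every line down to at most $max characters.
sub trim_lines {
    my ( $text, $max ) = @_;
    return join '', map { substr( $_, 0, $max ) . "\n" } split /\n/, $text;
}

my $got = trim_lines( $input, 3 );
is $got, $expected, 'output matches the embedded expectation';
done_testing;
```

Anyone can then copy the script, run it, and see at a glance whether the test passes or fails, without needing any extra files.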

In your question, you've included a link to GitHub, which might go down sometime in the future (plus it's a couple more clicks), and also, in your <code> tags you've included the command-line invocations that I'd have to trim, and that aren't really necessary here to show others how to run the script or what's going on. Also, since the question is about whitespace, the dd invocation you commented out is actually more useful than a plain print here.

I read somewhere that a good name for a default directory in perl development is 'lib'.

Yes, but mostly for .pm files - the lib directory is usually what would get added to @INC such that use and require (and in some cases do) can find the files. Typically, the content of such directories is not what would get modified by a user or get modified during the run of a script.

specified by $width and $length.

Personally, I find those variable names a little confusing: You've got an array of strings, and $length sounds like it refers to the strings' length, but it seems like that's what $width is for. I might suggest $width/$height, $length/$height, $length/$rows, or $cols/$rows.

What I would like is for these inputs to get trimmed to an array of the size specified by $width and $length . Spaces added/substituted on right if necessary to pad each vector to the same size.

You could use substr for the trimming and sprintf for the output with padding.

use warnings;
use strict;
use Data::Dump;

my $input = <<'(END INPUT)';
abcdef
abcdefg
abcde
 bcdefgh
 bcd    
(END INPUT)

my @lines = split /\n/, $input;
dd \@lines;
my $out = make_rectangular( \@lines, 4, 6 );
dd $out;

sub make_rectangular {
    my ( $lines, $maxrows, $maxlength ) = @_;
    my @out;
    my $rowcount=1;
    for my $line (@$lines) {
        my $trimmed = substr $line, 0, $maxlength;
        push @out, sprintf "%-*s", $maxlength, $trimmed;
        last if ++$rowcount>$maxrows;
    }
    return \@out;
}

__END__

["abcdef", "abcdefg", "abcde", " bcdefgh", " bcd\t"]
["abcdef", "abcdef", "abcde ", " bcdef"]

(Note I've assumed you want to not modify the input array here.) Of course TIMTOWTDI, I could've mashed the code into a single map statement, but I hope this is a little more clear.
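For comparison, the single-map variant mentioned above might look something like the following sketch. The name make_rectangular_map is made up for this illustration; it assumes the same arguments as make_rectangular and likewise leaves the input array untouched.

```perl
use warnings;
use strict;

# Hypothetical one-statement variant of make_rectangular.
sub make_rectangular_map {
    my ( $lines, $maxrows, $maxlength ) = @_;

    # Index of the last row to keep: at most $maxrows rows,
    # fewer if the input is shorter.
    my $last = ( @$lines < $maxrows ? scalar @$lines : $maxrows ) - 1;

    # Trim each kept line with substr, then pad it on the right with sprintf.
    return [
        map { sprintf "%-*s", $maxlength, substr( $_, 0, $maxlength ) }
            @{$lines}[ 0 .. $last ]
    ];
}

my @lines = ( "abcdef", "abcdefg", "abcde", " bcdefgh", " bcd\t" );
my $out = make_rectangular_map( \@lines, 4, 6 );
print "[$_]\n" for @$out;
```

It is shorter, but the row limit and the trimming are harder to spot than in the explicit loop.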

BTW, I'm not sure what you want to do with the tab character in your input file? This code counts it as a single character.
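If the tab should instead count as however many columns it occupies on screen, one possibility is the core module Text::Tabs, whose expand function replaces each tab with the matching number of spaces. This is only a sketch, and it assumes tab stops every 8 columns:

```perl
use warnings;
use strict;
use Text::Tabs;    # exports expand() and $tabstop

$tabstop = 8;      # assumed tab width

my $line     = " bcd\t";
my $expanded = expand($line);

# The tab counts as a single character in the raw string ...
printf "raw length:      %d\n", length $line;

# ... but is padded out to the next tab stop once expanded.
printf "expanded length: %d\n", length $expanded;
```

Expanding before calling substr means the trimming works on visible columns rather than on raw characters.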

Also, what character is it that renders an entire space dark?

Do you mean the block drawing character U+2588, "█"? That would be "\N{U+2588}" or "\N{FULL BLOCK}". I might suggest U+2420, "␠" ("\N{U+2420}" or "\N{SYMBOL FOR SPACE}"). Note that for the \N{CHARNAME} variant, you may have to add use charnames ':full'; to your script, depending on your Perl version (newer versions load it automatically).
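A small sketch of how that could be used to make padding visible when printing. The show_spaces helper and the sample rows are made up for this illustration, and the output is assumed to go to a UTF-8 terminal:

```perl
use warnings;
use strict;
use charnames ':full';    # only needed on older Perls

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return a copy of the string with every
# space replaced by the visible U+2420 symbol.
sub show_spaces {
    my ($string) = @_;
    $string =~ s/ /\N{SYMBOL FOR SPACE}/g;
    return $string;
}

print show_spaces($_), "\n" for "abcde ", " bcdef";
```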


Re^2: rectangularizing input to become array

by Aldebaran (Curate) on Feb 28, 2019 at 22:25 UTC

BTW, I'm not sure what you want to do with the tab character in your input file? This code counts it as a single character.

I'm making every effort to quote haukex fairly, but I will re-order for thematic and write-up reasons. I threw the tab character in to be a possible problem. I think I deal with it with:

$input=~ s/\t/ /g;

I'm also trying to make the write-up as austere as it can be in terms of using vertical space, so I will continue in readmore tags. I think I get more eyes if people don't have to scroll down to continue finding good content, and the thread might read more about the solutions as opposed to the problem. I haven't even gotten to the third one yet.


using e.g. Test::More to check if it matches

What I seek to do is pass the first test...then others....

Many thanks and kind regards from America,


Re^3: rectangularizing input to become array

by haukex (Archbishop) on Feb 28, 2019 at 23:08 UTC

So there's the base directory of the script. I wouldn't want output there. ... would you rather put such a thing on our one and only subdirectory or split input and output into separate directories?

Here's an idea for how to handle a script with a library.pm file or two that goes with it:

Say /home/user is the base directory.
I put my script in e.g. /home/user/myscript.

Libraries (.pms) could go in the same directory, or in /home/user/myscript/lib. That doesn't really make a difference for small scripts, but if you've got a lot of .pm files then a lib dir is a good idea.
Ideally, /home/user/myscript is also a git working copy - in which case input and output data doesn't really belong in that directory anyway, as otherwise it'd have to be added to .gitignore.

The script can be made to not worry about which directory it is located in using code like this:
```
use FindBin;
use lib $FindBin::Bin;
```
Or, if there's a lib subdirectory, using the following (platform-independent) code:
```
use FindBin;
use File::Spec::Functions qw/catdir/;
use lib catdir($FindBin::Bin, 'lib');
```
You can put your input data in e.g. /home/user/mydata, cd to that directory, and run your script with e.g. perl ../myscript/script.pl input.txt, and it should generate its output in the current directory.
If it's a script you use a lot, and you don't want to type out its path all the time, you could add it to your PATH. For example, on a couple of my boxes, I have lines like this in my ~/.profile: test -d "$HOME/myscript" && PATH="$HOME/myscript:$PATH" (the script needs to be chmod u+x for this to work).

Should I go update that on the original post?

I think in this case you don't need to, it's just for future reference, thanks.

in your code tags you've included the command-line invocations that I'd have to trim
I tend to think that it provides context ... Might pre tags work here?

Yes you're right - I didn't mean to make it sound like it's not a good idea, context can certainly be useful in some cases - the main point was not to put it in the same <code> tag as the code, to make downloading easier. <pre> tags have the issue that HTML and PerlMonks special characters have to be escaped (as you can see your <pre> tag has been rendered with links in it), so two separate sets of <code> tags work. Or, here's how I might have written that post (note you can use <code> tags in paragraphs as well):

Here is the script 3.rm.pl, which I run via ./3.rm.pl:
#!/usr/bin/perl -w
use 5.011;
...
And here is the output:
["abcdef", "abcdefg", "abcde", " bcdefgh", " bcd "]
["abcdef", "abcdef", "abcde ", " bcdef"]
...

Also, command lines like cat or perl script.pl are simple enough that we usually don't need to see them, it only becomes important when there are additional arguments involved. (And for some questions, it can be relevant whether a script was invoked as ./script.pl or perl script.pl, but that's not too often.)

What I seek to do is pass the first test...then others....

Sometimes it can be very useful to write the tests first, as it forces one to think about the API and what the output should ideally look like.

Can't use string ("abcdef") as an ARRAY ref while "strict refs" in use at ./3.rm.pl line 59.

getsubset expects an array of arrays, but $out is just an array of strings. Assuming you want each character to be a "column", you could do $out = [ map { [split //] } @$out ]; after $out = make_rectangular(..., or integrate it directly into the push in your make_rectangular like so: push @out, [ split //, sprintf "%-*s", $maxlength, $trimmed ]; - either of those changes makes your test pass. (Note you should call done_testing; after your tests.)
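Put together, a sketch of the first variant with a small test might look like this. make_rectangular is repeated from earlier in the thread so that the example runs on its own; the to_cells helper name and the expected data are made up for illustration, and getsubset is left out because its code isn't shown here.

```perl
use warnings;
use strict;
use Test::More;

# Same function as earlier in the thread: trim and pad each line.
sub make_rectangular {
    my ( $lines, $maxrows, $maxlength ) = @_;
    my @out;
    my $rowcount = 1;
    for my $line (@$lines) {
        my $trimmed = substr $line, 0, $maxlength;
        push @out, sprintf "%-*s", $maxlength, $trimmed;
        last if ++$rowcount > $maxrows;
    }
    return \@out;
}

# Hypothetical helper: turn an array of strings into an array of arrays,
# one character per "column".
sub to_cells {
    my ($rows) = @_;
    return [ map { [ split // ] } @$rows ];
}

my $cells = to_cells( make_rectangular( [ "abcdef", "ab" ], 2, 3 ) );

is_deeply $cells, [ [ 'a', 'b', 'c' ], [ 'a', 'b', ' ' ] ],
    'each row is an array of single characters';

done_testing;
```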


Re^4: rectangularizing input to become array

by Aldebaran (Curate) on Mar 02, 2019 at 23:12 UTC

Re^5: rectangularizing input to become array

by haukex (Archbishop) on Mar 03, 2019 at 11:24 UTC

Re^4: rectangularizing input to become array

by Aldebaran (Curate) on Mar 01, 2019 at 00:27 UTC

Re: rectangularizing input to become array
by Athanasius (Archbishop) on Feb 27, 2019 at 07:31 UTC

Hello Aldebaran,

Just a couple of points in haukex’s excellent answer that I would like to emphasise:

The most important data missing from your SSCCE is your desired output. “A picture is worth a thousand words.”
This line:
```
my @lines = $path_to_file->slurp;
```
almost certainly doesn’t do what you think it does. The documentation for Path::Tiny::slurp says that it “Reads file contents into a scalar.” So after that line of code is executed, the array @lines contains a single entry, identical to the string previously assigned to $guts. To get an array of lines, you need to split the string, either on newlines as haukex showed:
```
my $guts  = $path_to_file->slurp;
my @lines = split /\n/, $guts;
```
or using the special multiline pattern documented in split:
```
my $guts  = $path_to_file->slurp;
my @lines = split /^/, $guts;
```
The latter preserves newlines in the input data, including blank lines at the end of the input file; the former does not.
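The difference is easy to see on a small string. The sample data here is made up for illustration: two lines of text followed by two blank lines.

```perl
use warnings;
use strict;

my $guts = "one\ntwo\n\n\n";    # two lines followed by two blank lines

# Newlines are removed and trailing empty fields are dropped.
my @by_newline = split /\n/, $guts;

# Newlines are kept, and so are the blank lines at the end.
my @by_line = split /^/, $guts;

printf "split /\\n/ gives %d elements\n", scalar @by_newline;
printf "split /^/  gives %d elements\n", scalar @by_line;
```

The first split yields 2 elements ("one" and "two"), the second yields 4, each still ending in its newline.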

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: rectangularizing input to become array

by Aldebaran (Curate) on Feb 28, 2019 at 19:58 UTC

my @lines = $path_to_file->slurp;

almost certainly doesn’t do what you think it does.

It did not. I've been whittling this SSCCE down from a river of mojibake and woe, my soaked and freezing body wondering where my skills to deal with such environments have been. Well, getting the logic from Path::Tiny wrong was one thing that had me beat. I was reading this as:

@lines = $file->lines;

which, I believe, would produce different results. It's very difficult to diagnose path and file input problems from the net, but you and haukex have done exactly that. Thank you.

What can the OP do about misapprehension? (Open question) I would like to introduce a little bit of code to test whether I have these data represented correctly. I frequently find that I'm off by a pair of square brackets or quotes and commas. I'll use readmore tags for output then new source for the caller.
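One possible sketch of such a check, using only the core module Data::Dumper: with $Data::Dumper::Useqq enabled, tabs and newlines are printed as \t and \n, and ref can confirm whether each row is a plain string or an array reference. The describe_rows helper and the sample data are made up for illustration.

```perl
use warnings;
use strict;
use Data::Dumper;

$Data::Dumper::Useqq  = 1;    # print tabs and newlines as \t and \n
$Data::Dumper::Indent = 1;    # compact, fixed indentation

# Hypothetical helper: report what kind of thing each row is.
sub describe_rows {
    my ($rows) = @_;
    return map {
        ref($_) eq 'ARRAY' ? 'array of ' . @$_ . ' cells' : 'plain string'
    } @$rows;
}

my $strings = [ "abc ", " bc\t" ];
my $cells   = [ map { [ split // ] } @$strings ];

print Dumper($strings);
print "$_\n" for describe_rows($strings), describe_rows($cells);
```

A missing pair of square brackets then shows up as "plain string" where "array of 4 cells" was expected.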