Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
So I've got this data:
Title Line
Author Line
Link Line
-blank-
... repeat ad nauseam
Only right now it's in an array, with the items separated by blanks:
[ 'Title', 'Author', 'Link', '', 'Title', ... ]
I really would like it to end up as an array of hashes of the form:
[
{ Title => 'Title Data',
Author => 'Author Data',
Link => 'Link Data'
},
etc.
]
Any ideas how I go through the array, putting things
where they should go?
--Jay
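[Editor's sketch: one minimal way to do this, not from any reply below, is to step through the flat array four at a time and assign through a hash slice. The sample values here are invented to match the shape described in the question.]

```perl
use strict;
use warnings;

# Invented sample data: Title, Author, Link, blank, repeated.
my @flat = ('Title 1', 'Author 1', 'Link 1', '',
            'Title 2', 'Author 2', 'Link 2', '');

my @records;
for (my $i = 0; $i < @flat; $i += 4) {
    my %rec;
    # Take the three data fields; the blank at $i + 3 is skipped.
    @rec{qw(Title Author Link)} = @flat[$i .. $i + 2];
    push @records, \%rec;
}
```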
Re: Sort this data (alternate method)
by extremely (Priest) on Nov 19, 2000 at 11:42 UTC
Or, if you just read that data into that array from disk,
you can likely do it all in one pass. The trick is that
you said the records are separated by blank lines: you can
rig Perl up to read paragraph-style "lines" terminated by a
double newline, and then split each record on the single
newline that divides your fields:
my @LoH;
{
    local $/ = "\n\n"; # may be "\r\n\r\n" under windows...
    while (<>) {
        my ($t, $a, $l) = split /\n/;
        push @LoH, { Title => $t, Author => $a, Link => $l };
    }
}
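[Editor's note: to try this trick without a file on disk, a perl recent enough (5.8+) to open a filehandle on a scalar reference lets you sketch it in memory; the sample data here is invented.]

```perl
use strict;
use warnings;

# Invented sample input: two records separated by blank lines.
my $data = "Title 1\nAuthor 1\nLink 1\n\nTitle 2\nAuthor 2\nLink 2\n\n";
open my $fh, '<', \$data or die $!;

my @LoH;
{
    local $/ = "\n\n";   # paragraph-style records
    while (<$fh>) {
        my ($t, $a, $l) = split /\n/;
        push @LoH, { Title => $t, Author => $a, Link => $l };
    }
}
```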
I should probably have added this to the other post but
I had just edited it like 5 times. =P
--
$you = new YOU;
honk() if $you->love(perl)
Re: Sort this data
by extremely (Priest) on Nov 19, 2000 at 11:31 UTC
Well, I'd probably use splice to rip 4 items
at a time off the list and then push hashes onto a new list.
Tested now =):
my @bigarray = ( ... );   # your data
my @LoH;
while (my ($t, $a, $l, $j) = splice(@bigarray, 0, 4)) {
    # $j soaks up the blank separator line
    push @LoH, { Title => $t, Author => $a, Link => $l };
}
--
$you = new YOU;
honk() if $you->love(perl)
I'd guard against calling splice() continually from the
beginning of an array; I'd much rather call it from the end
of the array. "But Jeff," you'd say, "what if the order of
the hash references matters? You couldn't do:
while (@data and my ($a, $b, $c) = splice(@data, -3)) {
    push @hashrefs, { a => $a, b => $b, c => $c };
    pop @data;  # discard the blank separator
}
because then @hashrefs would be in reverse!" Yes,
that's absolutely right. And it might be inefficient to
reverse() the array when we're done with it, and
it's also inefficient to keep unshift()ing to the
array. So what possible efficient solution could I come up
with to combine the splice() speed with the
insertion speed?
Pre-extend the array! (Dun, dun, DUN!)
my @hashrefs;
$#hashrefs = int(@data / 4);  # pre-extend (assumes no trailing blank line)
my $i = $#hashrefs;
while (@data and my ($a, $b, $c) = splice(@data, -3)) {
    $hashrefs[$i--] = { a => $a, b => $b, c => $c };
    pop @data;  # discard the blank separator
}
What a sneaky trick I've done.
Update
Oops, used -4 above, when I meant -3. Thanks, jcwren.
Update
splice() will wig out when the array is empty. The
while loop has been adjusted.
japhy --
Perl and Regex Hacker
I would have to check, but I thought that splice was only
slow if you have to move the array around. But I don't
know whether pulling from the beginning of the array has
been optimized. And (like jcwren) I am too lazy to
benchmark it at the moment. In any case shift is fast
and has the benefit of avoiding tracking indices. (Note
that you had a bug in your code, which jcwren caught.
Yes, I am talking about that.)
while (@big_array) {
    my $href;
    # the fourth shift discards the blank separator line
    @$href{'title', 'author', 'link'} = map shift(@big_array), 1 .. 4;
    push @structs, $href;
}
This might be faster than your sneaky trick. It might be
slower. It certainly has fewer indices.
Also the cost of reverse is overstated. You have just
walked through a list of n things in Perl. You then want
to reverse a list of n/4 things. What is the relative cost
of those two operations? Right.
Pick up good material on optimization, such as the sample
chapter from Code Complete, or RE: Efficient Perl
Programming. You will find that experienced people
understand that maintainable code with good algorithms can
yield better overall speed than trying to optimize every
line.
Now, about that splice: it matters. If it isn't
optimized, then that is an O(n) operation performed n times,
which is O(n^2) and therefore likely to be slow. But one
reverse at the end is an O(n) operation done once. Should
the body of the loop be slightly more efficient from doing
the splice rather than repeated manipulation of indices
(something I would have to benchmark to have a feeling for
either way), then your attempt to optimize would actually
lose.
To summarize, don't worry about slow operations, worry
about bad algorithms. A slow operation inside a loop may
matter. A slow operation outside a loop which speeds up
the loop can go either way. An O(n) (or worse) operation
inside a loop - that is the one case where you should care
up front about optimizing the structure of the code!
EDIT
I had messed up the final paragraph.
Nice solution! But a couple of questions:
It would be interesting to benchmark if using unshift() into the $hashrefs array would be more or less efficient than messing with $i. If nothing else, it would be two lines shorter, and more Perlish.
And perhaps this is too early in the morning, but why are you splicing by -4, and then popping @data?
It's Sunday morning where I am, and I'm too lazy to try to write a coherent benchmark...
--Chris
e-mail jcwren
What if there is a blank line at the END of the list? You
might want to sniff for that and pop off blank lines first.
Also, what if fields are allowed to be null? If so, you
HAVE to read from the front...
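[Editor's sketch of that sniffing step, assuming blank lines arrive as empty strings; the sample data is invented:]

```perl
use strict;
use warnings;

# Invented sample data with stray blank lines at the end.
my @data = ('T1', 'A1', 'L1', '', 'T2', 'A2', 'L2', '', '');

# Pop off any trailing blanks first, so splicing threes from
# the back stays aligned with the records.
pop @data while @data and $data[-1] eq '';
```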
--
$you = new YOU;
honk() if $you->love(perl)
Re: Sort this data
by japhy (Canon) on Nov 19, 2000 at 23:04 UTC
UPDATE
There was a huge error in this test (and I'm stupid for not
using strict in it). I was testing @c
where I should have been testing @a. I am now going
to replace the bad results with the GOOD results.
Here's a benchmark. Six methods were tested, each run for
5 seconds at three list sizes: 100 elements, 1000, and
10000.
- in_fr_pu: slices the array in chunks of 4, push()es
  the hash ref onto the new array
  140.9 Hz @ 100, 13.9 Hz @ 1000, 1.3 Hz @ 10000
- sh_fr_pu: shift()s the array, push()es
  the hash ref onto the new array
  156.7 Hz @ 100, 15.7 Hz @ 1000, 1.4 Hz @ 10000
- sp_bk_in: presizes the new array, splice()s from the
  back, inserts the hash ref via index into the new array
  149.4 Hz @ 100, 14.8 Hz @ 1000, 1.4 Hz @ 10000
- sp_bk_rv: splice()s from the back, push()es the
  hash ref onto the new array, then reverse()s
  151.5 Hz @ 100, 14.7 Hz @ 1000, 1.3 Hz @ 10000
- sp_bk_un: splice()s from the back, unshift()s
  the hash ref onto the new array
  151.6 Hz @ 100, 11.9 Hz @ 1000, 0.4 Hz @ 10000
- sp_fr_pu: splice()s from the front, push()es
  the hash ref onto the new array
  160.0 Hz @ 100, 15.6 Hz @ 1000, 1.4 Hz @ 10000
Notice how all of them appear rather linear in form, EXCEPT
for sp_bk_un. This is because unshift()
is rather inefficient. sp_fr_pu is the fastest
because splice()ing from the front just moves the
beginning-of-array marker forward in the array. Thus, the
next fastest is sh_fr_pu, which just uses 4 calls
to shift() instead of splice().
use strict;
use Benchmark;

for my $SIZE (100, 1000, 10000) {
    timethese(-5, {
        sp_fr_pu => sub {
            my @a = (1 .. $SIZE);
            my @b;
            while (my @c = splice(@a, 0, 4)) { push @b, { @c } }
        },
        sp_bk_un => sub {
            my @a = (1 .. $SIZE);
            my @b;
            while (@a and my @c = splice(@a, -4)) { unshift @b, { @c } }
        },
        sp_bk_in => sub {
            my @a = (1 .. $SIZE);
            my @b;
            $#b = int(@a / 4);
            my $i = $#b;
            while (@a and my @c = splice(@a, -4)) { $b[$i--] = { @c } }
        },
        in_fr_pu => sub {
            my @a = (1 .. $SIZE);
            my @b;
            my $i = 0;
            while ($i < @a) { push @b, { @a[$i .. $i + 3] }; $i += 4 }
        },
        sh_fr_pu => sub {
            my @a = (1 .. $SIZE);
            my @b;
            while (@a) { push @b, { map shift(@a), 1 .. 4 } }
        },
        sp_bk_rv => sub {
            my @a = (1 .. $SIZE);
            my @b;
            while (@a and my @c = splice(@a, -4)) { push @b, { @c } }
            @b = reverse @b;
        },
    });
}
japhy --
Perl and Regex Hacker
UPDATE
These results were wrong. I'll get correct ones later.
Those benchmarks were under 5.005_02. Here are the 5.6
results. They're for different sizes of the list, since the
test hung horribly at 10,000 elements.
Name     | Hz @ 100 | Hz @ 1000 | Hz @ 2000
---------+----------+-----------+----------
in_fr_pu |    466.1 |      45.2 |      22.5
sh_fr_pu |    509.7 |      49.5 |      24.6
sp_bk_in |      ??? |       ??? |       ???
sp_bk_rv |      ??? |       ??? |       ???
sp_bk_un |      ??? |       ??? |       ???
sp_fr_pu |    489.3 |      47.2 |      23.4
japhy --
Perl and Regex Hacker
Updated. I was stupid -- not using strict ended up
kicking me in the ass with my benchmark. It's now working
properly, and the results are far more... expected.
japhy --
Perl and Regex Hacker
japhy looks at av.c (not av.h)
by japhy (Canon) on Nov 20, 2000 at 00:46 UTC
UPDATE
As per my revelations in sort this data, here is a bit of
an adjusted report on the av.c source.
Here's the relevant portion from av.c in the 5.6
source:
/* this is Perl_av_unshift();
   it unshifts 'num' undef values onto an array */

/* determine how much unused space is left at the front
   of the memory allocated for this array */
i = AvARRAY(av) - AvALLOC(av);

/* if there's room left... */
if (i) {
    /* if there's more room than we need, just use 'num' */
    if (i > num) i = num;
    /* this will set 'num' to 0 if we had enough room;
       'num' is now how many new undef slots we still need */
    num -= i;
    AvMAX(av) += i;    /* raise the max allocated subscript */
    AvFILLp(av) += i;  /* raise the highest used subscript */
    SvPVX(av) = (char*)(AvARRAY(av) - i); /* move where Perl's array starts */
}
/* if there wasn't enough room already... */
if (num) {
    i = AvFILLp(av);         /* highest subscript */
    av_extend(av, i + num);  /* extend array to i+num elements */
    AvFILLp(av) += num;      /* raise the highest subscript */
    ary = AvARRAY(av);       /* get at the array */
    Move(ary, ary + num, i + 1, SV*); /* slide elements up */
    do {
        ary[--num] = &PL_sv_undef;    /* set new elements to undef */
    } while (num);
}
Ok, so unshift() has to do a lot of sliding the
elements up if there's not already some free space generated
by a call to shift().
What shift() does is simply slide the pointer up
one notch, which is very fast (which provides free space for
unshift() later). It stores the value in
*AvARRAY(av) into retval, and then sets
*AvARRAY(av) to the C equivalent of Perl's
undef value. Then it moves the pointer that marks
the beginning of the array to AvARRAY(av) + 1. It
also subtracts one from the max subscript and the length of
the array. Then it returns retval.
<revelation>I'm beginning to grok the source
code! Look out!</revelation>.
japhy --
Perl and Regex Hacker
Nice!
For those who are not following the code, the logic here
is based on first trying to allocate elements which there
is room for (this is the "if (i)" bit). If it cannot get
them all in it then makes sure it has enough space, does
some accounting, copies everything over, then inserts
some new stuff. Note that japhy's comment about unused
space is misleading: he means unused space at the beginning
of the array. There is also unused space at the other end,
but we cannot directly use that.
For full details you have to also look at av.h. The call
to AvMAX is just
a macro to set xav_max which is the largest element which
there is space for (adding to the front increases that),
and AvFILL is setting xav_fill which is how many you have
(adding obviously increases that as well).
The call to av_extend is where the array size increases.
What it does is extends the definition of what the array
is occupying. In fact it is in a buffer whose size is
allocated in powers of two. If the extend takes the array
beyond the size of the buffer, then a new buffer is
allocated and the array is moved. If it does not then the
array is left where it is.
Now it is clear how to make repeated calls to unshift
fairly efficient. Right now it moves stuff by the
minimum necessary. What we need is to have a new variable
for how much to move it. That variable should be the
maximum of num and the length of the array. This will cause
space wastage, but it will also mean that when you hit an
unshift and it has to move the array, it will not hit that
copying logic again right away.
Even so, building up an array using n calls to push will
still be faster than unshift, because there is less
accounting to do. But both would be order n, rather than
n calls to unshift being order n^2 as it is today.
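[Editor's sketch: the push/unshift difference can be felt from Perl level with the core Benchmark module. The element count is arbitrary, and exact numbers will vary by perl version.]

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Build the same 5_000-element array by pushing onto the end
# versus unshifting onto the front; unshift pays for the
# repeated copying described above.
timethese(-1, {
    push_n    => sub { my @a; push    @a, $_ for 1 .. 5_000 },
    unshift_n => sub { my @a; unshift @a, $_ for 1 .. 5_000 },
});
```

Note that both loops build valid arrays; push appends in order, while unshift ends up with the elements reversed.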
Re: Sort this data
by princepawn (Parson) on Nov 20, 2000 at 04:48 UTC
Boulder was made for this type of problem.