I'm not saying it is a Perl bug, but I am saying it is not a bug in my code or an inconsistency in the data. I'm completely satisfied with this explanation (and thank you for egging me on to that point!), since I do not see anything going wrong beyond the fact that @Notes simply remains unchanged at the end of the sort. If you want to explain further what exactly you mean about the use of Data::Dumper, beyond what I have already done (that's where the previous dump came from), I'm listening; the bit about getting "a string that is valid Perl code, so you can use Data::Dumper to get a canned version of the data that exhibits this behaviour" is not clear to me. For the curious, here's the package for the object:

package PNSearch;

use strict;
use warnings FATAL => qw(all);
use CGIutil;

our $Log = "PNSearch.log";
$SIG{__WARN__} = \&logwarn;

sub logwarn { CGIutil->logger($Log,shift) }

sub new {    # represents a note
    my $self = {};
    (my $fname, $self->{terms}, my $ln, my $dbh) = (pop,pop,pop,pop);
    my $cur = 0;
    while (<$dbh>) {
        next unless (++$cur == $ln);
        $_ =~ s/^([^>]+?)<\|>(.*?)\((.*?)\)\s*<\|>//;
        if (!defined $1) {
            CGIutil->logger($Log, "No href defined: $fname\n$_\n\n");
            return undef;
        }
        my $href = $1;
        if (!defined $2) {
            $self->{date} = "&mdash;";
            $self->{title} = "[no title]";
        } else {
            $self->{date} = $2;
            if (!defined $3) { $self->{title} = "[no title]" }
            else { $self->{title} = $3 };
        }
        $self->{href} = "<a class=\"ntitle\" href=\"/archives/$fname.html#$href\">";
        $self->{body} = $_;
        last;
    }
    bless($self);
}

sub hilight {
    (my $self, my $term) = (shift,shift);
    my @left = split /</,$self->{body};
    foreach (@left) {    # @right halves each elem of @left
        my @right = split />/,$_;
        next if ($#right < 1);    # no half = <tag><tag>
        $right[1] =~ s/($term)/<em class="hlite">$1<\/em>/g;
        $_ = join(">",@right);
    }
    $self->{body} = join("<",@left);
    $self->{title} =~ s/($term)/<em class="hlite">$1<\/em>/g;
}

1;
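To show what hilight does to a note body, here is a minimal standalone sketch of the same split-on-tags approach. The function name hilight_text, the sample HTML and the search term are made up for illustration; the splitting and substitution mirror the method above.

```perl
use strict;
use warnings;

# Standalone sketch of the technique in PNSearch::hilight (hypothetical name).
# Wraps matches of $term in <em class="hlite">, but only in the text that
# follows a tag, so attribute values inside the tags are left alone.
sub hilight_text {
    my ($body, $term) = @_;
    my @left = split /</, $body;            # each element starts just after a '<'
    foreach (@left) {
        my @right = split />/, $_;          # [0] = tag contents, [1] = text after the tag
        next if ($#right < 1);              # nothing after the tag: leave untouched
        $right[1] =~ s/($term)/<em class="hlite">$1<\/em>/g;
        $_ = join(">", @right);
    }
    return join("<", @left);
}

# Made-up sample: "bombs" appears once in the text and once inside an href.
my $sample = '<p>Cluster bombs in <a href="/bombs.html">Iraq</a> again</p>';
my $marked = hilight_text($sample, 'bombs');
print "$marked\n";
```

Only the occurrence in the paragraph text gets wrapped; the one in the href is inside the tag half and is not touched. Two things to keep in mind with this approach: text that comes before the first tag has no ">" in it and is skipped, and $term is interpolated as a regex, so a term with metacharacters would probably want quotemeta first.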

Here's the actual construction:

my @Notes;    # array of PNSearch objects
foreach my $file (@Files) {
    next unless (-f "$DBDir/$file" && !-z "$DBDir/$file");
    # scan text only database
    # each file represents one .html page, each line represents one whole note
    unless (open(DB, "<$DBDir/$file")) {
        CGIutil->logger($Log,"!!Could not open $DBDir/$file: $!");
        next;
    }
    my @lines = ();    # array of arrays, 0 = line number 1 = terms found: qv. checkline() below
    my $ln = 1;
    while (<DB>) {
        my @found = ($ln,checkline($_));
        push @lines, \@found if $found[1];
        $ln++;
    }
    close(DB);
    # pull selected notes from markup database
    my $cur = 0;    # last line in db
    my $MUH;
    unless (open($MUH, "<$DBDir/markup/$file")) {
        CGIutil->logger($Log,"!!Can't open /markup/$file: $!");
        next;
    }
    foreach my $l (@lines) {
        my $pns = PNSearch->new($MUH, $l->[0]-$cur, $l->[1], $file);
        push @Notes, $pns if ($pns);
        $cur = $l->[0];
    }
    close($MUH);
}

sub checkline {
    my $line = pop;
    my $c = 0;
    foreach (@Terms) {
        $c += 1 if ($line =~ /$Pfix<\|>.*?$_/);    # anchor name is before first <|> (don't search that)
    }
    # nb: return value is the number of terms found, not the number of individual hits
    # ie, if there is only one search term, this will be 0 or 1
    return $c;
}

The databases are not relational, they are flat files. Example of the plaintext source:

30 April 2008 (Possession of <|>30 April 2008 (Possession of "extreme pornography") <|>SNIP
29 April 2008 (Labor Department and whistleblower law)<|>29 April 2008 (Labor Department and whistleblower law) <|>SNIP
29 April 2008 (Dalit woman refused treatment and dies)<|>29 April 2008 (Dalit woman refused treatment and dies) <|>SNIP
29 April 2008 (Veterans and suicide)<|>29 April 2008 (Veterans and suicide) <|>SNIP
28 April 2008 (Cluster bombs in Iraq)<|>28 April 2008 (Cluster bombs in Iraq)<|>SNIP

The only difference between that one and the "markup" one is that the SNIPPED part contains HTML.
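For anyone who wants to see how one of those lines breaks down, here is a small sketch that applies the pattern from PNSearch::new to a single record from the sample. It is a variation, not the code I run: it uses a plain match with a fourth capture for the body instead of s///, and the variable names are only for illustration.

```perl
use strict;
use warnings;

# One record from the plaintext sample: anchor<|>date (title) <|>body
my $record = '29 April 2008 (Veterans and suicide)<|>29 April 2008 (Veterans and suicide) <|>SNIP';

# Same pattern as in PNSearch::new, plus a trailing capture for the body.
# The captures are only read inside the branch where the match succeeded.
my ($href, $date, $title, $body);
if ($record =~ /^([^>]+?)<\|>(.*?)\((.*?)\)\s*<\|>(.*)$/s) {
    ($href, $date, $title, $body) = ($1, $2, $3, $4);
    print "href:  [$href]\n";
    print "date:  [$date]\n";
    print "title: [$title]\n";
    print "body:  [$body]\n";
} else {
    print "record does not fit the expected layout\n";
}
```

The brackets in the output are there to make whitespace visible: the date capture keeps the space that sits before the opening parenthesis of the title.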

Nb. all the db files have already been verified line by line to ensure they are structured correctly, and as I've said, the final output demonstrates no mistakes in the data set. I can't force anyone to believe that, of course. Notice I'm using fatal warnings and logging inconsistencies (none are reported at this point, and the db has about 15000 notes in it).

ps. for the astute: yes, those hrefs are not uri_encoded; however, that was not my decision. I'm working to spec.
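On the Data::Dumper point: my best guess at what is being asked for is something like the sketch below, where a few objects are dumped to a string of Perl source and that string is then evaluated in a standalone script, so the sort can be tried without the database. The two notes and the sort key (terms, highest first) are made up for illustration and are not my real data or my real sort.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-ins for a few PNSearch objects; the real ones
# come from PNSearch->new and carry href and body as well.
my @Notes = map { bless $_, 'PNSearch' } (
    { title => 'Veterans and suicide',  date => '29 April 2008', terms => 1 },
    { title => 'Cluster bombs in Iraq', date => '28 April 2008', terms => 2 },
);

# Terse drops the "$VAR1 =" prefix so the dump is a bare expression;
# Sortkeys makes the key order stable between runs.
local $Data::Dumper::Terse    = 1;
local $Data::Dumper::Sortkeys = 1;
my $canned = Dumper(\@Notes);
print $canned;    # this text could be pasted into a separate test script

# Rebuild the objects from the canned string alone.
my $restored = eval $canned;
die "could not rebuild canned data: $@" if $@;

# Hypothetical sort key, result assigned to a new array.
my @sorted = sort { $b->{terms} <=> $a->{terms} } @$restored;
print join(", ", map { $_->{title} } @sorted), "\n";
```

If that is the intent, the printed dump is the "canned version": anyone could run it with the sort block in question and see whether the order changes, with no access to the flat files.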


In reply to Re^7: sort != sort by halfcountplus
in thread sort != sort by halfcountplus
