in reply to Search function for guestbook history?

As requested, this is just a starting point based on the information you've provided. Testing is limited to exactly what is shown in the output, so you'll probably want to tweak the regex to better match your data.

In terms of performance, I tested it against 10 years' worth of monthly files with 100 comments per file, which I think far exceeds your requirements, and each search takes under 7/100ths of a second. That's probably about how long it would take just to make a connection to a database.

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    sub slurp {
        local( $/, @ARGV ) = ( -s( $_[0] ), $_[0] );
        <>;
    }

    my $start = time;
    my $user = $ARGV[0] or die 'No user supplied';

    my @matches;
    my $files = 0;
    while( glob 'hist/*.htm' ) {
        $files++;
        my $file = slurp $_;
        push @matches, $_ if $file =~ m[
            <b>Comment\s+\d+</b><br>\n
            \Q$user\E
        ]mx;
    }

    printf "Searched $files files in %g seconds\n", time() - $start;

    die "No match found for user $user" unless @matches;
    print "User $user found in files:\n", join "\n", @matches;

    __END__
    P:\test>522029 Buk
    Searched 133 files in 0.0694289 seconds
    No match found for user Buk at P:\test\522029.pl line 26, <> line 133.

    P:\test>522029 Doug
    Searched 133 files in 0.069006 seconds
    User Doug found in files:
    hist/h0511.htm
    hist/h0601.htm

    P:\test>522029 "J H"
    Searched 133 files in 0.0696189 seconds
    User J H found in files:
    hist/h0511.htm
    hist/h0601.htm

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Search function for guestbook history?
by Anonymous Monk on Jan 10, 2006 at 07:15 UTC
    Now THAT is a reply that I can sort-of follow...and understand by doing some additional reading...you've added some functionality that was to be part 2, such as time taken.

    However, could you possibly provide a clue on how I'd:
    1) Restrict the search to just the text/code between the "--begin--" and "--end--" comment markers? There could be some HTML in the headers/footers of the web page that follows the pattern used for a signature. I think I know, but maybe there's a better way.

    2) In the sample I gave, I used "Comment #" as a placeholder for what could be there. In fact, at that point there could be anything from one word to 25 or so lines of user feedback/reply, so the word "Comment" can't be searched for. The start of a user comment block does always begin with an "HR" tag, followed by a "B" tag, but there could be one or more spaces/CR/LFs between them. So what might be appropriate for that test? I could rework the history to put the starting HR tag of each comment on a line by itself, if that would make it easier.

    3) Once a matching USER is found, what I need to do is dump that guestbook entry (delimited by "HR" tags) to a results web page that I'll build on-the-fly. That's why I believe I'll have to scan the hYYMM.htm files line by line, at least with the skills that I possess now. I presently scan the HTML guestbook file with the current month's entries looking for the "--begin--" tag to find where to add the newest comment, so I have some clue how that would be done.

    BTW, the code you gave me already saved me MANY hours of research time, so THANKS a LOT!

      The following should come close to what you want. It runs a bit more slowly now, at about 0.15 seconds per search, but given the small size of your user base, that should not be a problem.

      1. Having slurped the file into a scalar, you can quite easily extract the bit between your begin/end comments (provided your users will never be allowed to embed html in their posts?), using a greedy match and capture brackets:
        ## And extract the comments
        next unless $contents =~ m[
            <!--begin-->
            (.+)
            <!--end\s-->
        ]six;

        Note the explicit \s in the end delimiter. I'm using /x to make things a little easier to read, and /x makes the regex ignore literal whitespace in the pattern, so any whitespace that must be matched has to be spelled out as \s. The /s allows . to match newlines, and /i makes the match case-insensitive (though that may be unnecessary for this part).

      2. Okay. I misunderstood your original examples. See the comments in the code below for some explanation of what is going on, and ask questions about anything you don't understand.

        The best thing you can do is play with this script in conjunction with your data and test out the effect of adjusting the regex. Comment bits out; add bits; change the options used to see the effect it has upon the results you get.

        If you have questions, try to supply a script of 10 or so lines (not just my code posted back!) that demonstrates the problem you are having.

      3. You'll see that having extracted the block of comments, I then separate them into individual comment blocks in the inner while loop. Once you have that, you can then inspect each one for the presence of the user name, and if found, push that onto an array along with the filename. (You could add html markup as appropriate here!).

        Having built the array of matches, you are then in a position to combine them with the rest of the html or return a "Nothing found" page.
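
      To experiment with steps 1 and 3 without touching the real history files, something like the following sketch may help. It runs the same two regexes against an in-memory string. The sample markup and the comments_for() helper are made up for illustration and are not taken from your pages. Note that the lookahead only captures a comment when another <hr> follows it, so the sample data ends with a closing <hr> before the end marker.

```perl
#! perl
use strict;
use warnings;

## Hypothetical page contents; the real markup in the hYYMM.htm files may differ.
my $contents = join "\n",
    '<html><body>',
    '<hr>',
    '<b>Header that looks like a comment</b><br>',
    '<!--begin-->',
    '<hr>',
    '<b>Great site!</b><br>',
    'Doug<br>',
    'USA',
    '<HR>',
    '<b>Nice work</b><br>',
    'J H<br>',
    'Clearwater, FL USA',
    '<hr>',
    '<!--end -->',
    '<hr>',
    'Doug<br>',
    'footer text',
    '</body></html>',
    '';

## Hypothetical helper: return every <hr>-delimited comment between the
## begin/end markers that contains a line starting with $user.
sub comments_for {
    my( $contents, $user ) = @_;

    ## Step 1: keep only the text between the begin/end markers
    return unless $contents =~ m[ <!--begin--> (.+) <!--end\s--> ]six;
    my $comments = $1;

    my @found;
    ## Step 3: walk the <hr>-delimited blocks one at a time
    while( $comments =~ m[ ( \n \s* <hr> .+? ) (?= \n \s* <hr> ) ]gsix ) {
        my $comment = $1;

        ## Keep the block if a line starts with the user name and ends in <br>
        push @found, $comment
            if $comment =~ m[ \n \s* \Q$user\E [^\n]* <br> \s* \n ]mxi;
    }
    return @found;
}

my @doug = comments_for( $contents, 'Doug' );
print scalar( @doug ), " comment(s) found for Doug\n";
print "$_\n" for @doug;
```

      With this sample data, the "Doug<br>" line in the footer sits outside the markers, so only the one real comment should be reported. Changing the sample string is a quick way to see how each part of the regex behaves.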

      #! perl -slw
      use strict;
      use Time::HiRes qw[ time ];

      sub slurp {
          local( $/, @ARGV ) = ( -s( $_[0] ), $_[0] );
          <>;
      }

      my $start = time;
      my $user = $ARGV[0] or die 'No user supplied';

      my @matches;
      my $files = 0;

      ## For each matching file in the Hist directory
      while( my $file = glob 'hist/*.htm' ) {
          $files++;

          ## Slurp the contents
          my $contents = slurp $file;

          ## And extract the comments
          next unless $contents =~ m[
              <!--begin-->
              (.+)
              <!--end\s-->
          ]six;
          my $comments = $1;

          ## Break out each individual comment
          while( $comments =~ m[
              (                   ## Capture
                  \n \s* <hr>     ## from the <hr>
                  .+?             ## Everything (non-greedy)
              )
              (?= \n \s* <hr> )   ## Up to the next <hr>
          ]gsix ) {
              my $comment = $1;

              ## And save it if it contains the specified user name
              push @matches, "$file\n$comment" if $comment =~ m[
                  \n \s*          ## On a line, possible leading whitespace
                  \Q$user\E       ## The user name
                  [^\n]* <br>     ## maybe other (non-newline) stuff <br>
                  \s* \n          ## maybe whitespace and newline
              ]mxi;
          }
      }

      printf "Searched $files files in %g seconds\n", time() - $start;

      die "No match found for user $user" unless @matches;
      print "User $user found in files:\n-------\n", join "\n---------\n", @matches;

      __END__
      P:\test>522029 Buk
      Searched 133 files in 0.144615 seconds
      No match found for user Buk at P:\test\522029.pl line 49, <> line 133.

      P:\test>522029 Doug
      Searched 133 files in 0.147709 seconds
      User Doug found in files:
      -------
      hist/h0511.htm
      <HR>
      <b>Comment 1</b><br>
      Doug &lt;<a href="mailto:hun@tele.com">hun@tele.com</a>&gt;<br>
      USA - Thu 11/29/2005 - 22:05:51
      ---------
      hist/h0601.htm
      <HR>
      <b>Comment 1</b><br>
      Doug &lt;<a href="mailto:hun@tele.com">hun@tele.com</a>&gt;<br>
      USA - Thu 01/05/2006 - 22:05:51

      P:\test>522029 "J H"
      Searched 133 files in 0.149665 seconds
      User J H found in files:
      -------
      hist/h0511.htm
      <hr>
      <b>Comment 2</b><br>
      J H<br>
      Clearwater, FL USA - Wed 01/04/2006 - 02:05:12
      ---------
      hist/h0601.htm
      <hr>
      <b>Comment 2</b><br>
      J H<br>
      Clearwater, FL USA - Wed 01/04/2006 - 02:05:12
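
      For the last step, turning the array of matches into a results page or a "Nothing found" page, a rough sketch might look like this. The results_page() helper and the page layout are assumptions made for illustration only. It expects each entry in the same "$file\n$comment" form the script pushes onto @matches, and it passes the comment markup through unchanged on the assumption that it comes from your own history files.

```perl
#! perl
use strict;
use warnings;

## Hypothetical helper: build a complete HTML page from the list of
## "$file\n$comment" strings collected by the search loop.
sub results_page {
    my( $user, @matches ) = @_;

    ## Escape the user name before echoing it back into the page
    ( my $safe = $user ) =~ s/([&<>"])/sprintf '&#%d;', ord $1/ge;

    my $body;
    if( @matches ) {
        my @entries;
        for my $match ( @matches ) {
            ## Split each entry back into the filename and the comment markup
            my( $file, $comment ) = split /\n/, $match, 2;
            push @entries, "<p>Found in $file</p>\n$comment";
        }
        $body = join "\n<hr>\n", @entries;
    }
    else {
        $body = "<p>Nothing found for user $safe.</p>";
    }

    return <<"HTML";
<html>
<head><title>Guestbook search: $safe</title></head>
<body>
<h1>Entries for $safe</h1>
$body
</body>
</html>
HTML
}

## One hit, then no hits
print results_page( 'Doug', "hist/h0511.htm\n<HR>\n<b>Comment 1</b><br>\nDoug<br>\nUSA" );
print results_page( 'Buk' );
```

      Keeping the page building in its own sub means the search loop stays as it is, and the die for the no-match case can be replaced by a call with an empty list.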

Re^2: Search function for guestbook history?
by JCHallgren (Sexton) on Jan 10, 2006 at 22:21 UTC
    Ok, BrowserUK...I've looked...done some reading...and the part I still don't follow is the simple "slurp" subroutine. Could you/others translate it to English a bit more, please? The rest of the code I believe I can understand, at least for now. BTW, I'm still a bit puzzled by how this posting thread thing works, so I'm not sure if this shows twice or once, because I didn't see it.