JCHallgren has asked for the wisdom of the Perl Monks concerning the following question:

I'm a Perl newbie/novice who, with GREATLY appreciated help from this forum about a year ago (THANKS!), managed to rework/code a script for a site guestbook into exactly what I wanted. That was: a simple chat room for my town.
However, now I'd like to add a user search function to my site. The web pages I'd want to scan are all in one folder/sub-dir (/Hist) and are named in the style of "h0512.html" where it's YYMM after "h". There is a special comment that delimits the start and end of actual user data within the page, and each comment/post is delimited by "hr" or "HR" tags. There can be multiple lines in each post, using the "BR" tag, but the comment is always separated from the signature by a "/B" tag.
To better show what I have as data, here is a condensed page with some sample data:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><META http-equiv="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>Last Post </TITLE></HEAD>
<BODY>
<!--begin-->
<HR>
<b>Comment 1</b><br>
Doug &lt;<a href="mailto:hun@tele.com">hun@tele.com</a>&gt;<br>
USA - Thu 01/05/2006 - 22:05:51
<hr>
<b>Comment 2</b><br>
J H<br>
Clearwater, FL USA - Wed 01/04/2006 - 02:05:12
<hr>
<!--end -->
</BODY></HTML>
What I'd like to do is first provide a way to find all posts by a certain user.
Creating the input form is not a problem for me, but how to do the search is. I know that I'll have to have a loop to read each month's history file and then a loop to scan thru entries, so how to do that is my first task. Then locating the signature line while keeping the post data saved so I can output it to an on-the-fly results web page is another issue.
So...are there any pre-existing code blocks or routines that I can adapt to this? I'd just like some good pointers on where to look...not expecting anyone to do the code!

Replies are listed 'Best First'.
Re: Search function for guestbook history?
by BrowserUk (Patriarch) on Jan 09, 2006 at 23:29 UTC

    As requested, this is just a starting place based upon the information you've provided and testing is limited to exactly that shown in the output. You'll probably want to tweak the regex to better match your data.

    In terms of performance, I tested it against 10 years' worth of monthly files with 100 comments per file, which I think far exceeds your requirements, and it takes all of 6/100ths of a second. That's probably how long it would take to make a connection to a database.

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    sub slurp {
        local( $/, @ARGV ) = ( -s( $_[0] ), $_[0] );
        <>;
    }

    my $start = time;
    my $user = $ARGV[0] or die 'No user supplied';

    my @matches;
    my $files = 0;
    while( glob 'hist/*.htm' ) {
        $files++;
        my $file = slurp $_;
        push @matches, $_ if $file =~ m[
            <b>Comment\s+\d+</b><br>\n
            \Q$user\E
        ]mx;
    }

    printf "Searched $files files in %g seconds\n", time() - $start;

    die "No match found for user $user" unless @matches;

    print "User $user found in files:\n", join "\n", @matches;

    __END__
    P:\test>522029 Buk
    Searched 133 files in 0.0694289 seconds
    No match found for user Buk at P:\test\522029.pl line 26, <> line 133.

    P:\test>522029 Doug
    Searched 133 files in 0.069006 seconds
    User Doug found in files:
    hist/h0511.htm
    hist/h0601.htm

    P:\test>522029 "J H"
    Searched 133 files in 0.0696189 seconds
    User J H found in files:
    hist/h0511.htm
    hist/h0601.htm

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Now THAT is a reply that I can sort-of follow...and understand by doing some additional reading...you've added some functionality that was to be part 2, such as time taken.

      However, could you possibly provide a clue on how I'd:
      1) Restrict the search for just the text/code within the range between the "--begin--" and "--end--" comment markers? Because there could be some HTML that might follow the pattern used for a signature in the headers/footers of the web page. I think I know, but maybe there's a better way.

      2) In the sample I gave, I used the "Comment #" as placeholders for what could be there...in fact, at that point, there could be anything from one word to 25 or so lines of user feedback/reply, so the word "Comment" can't be searched for. The start of a user comment block does always begin with a "HR" tag, followed by a "B" tag, but there could be one or more spaces/CR/LF's between them. So what might be appropriate for that test? I could rework history to put the starting HR tag of each comment on a line by itself, if that would make it easier.

      3) Once a matching USER is found, what I need to do is dump that guestbook entry (delimited by "HR" tags) to a results web page that I'll build on-the-fly, so that's why I believe I'll have to scan the hYYMM.htm files on a line-by-line basis, at least with the skills that I possess now. I presently scan the HTML guestbook file with the current month's entries looking for the "--begin--" tag to find the location of where to add the newest comment, so I have some clue on how that would be done.

      BTW, the code you gave me already saved me MANY hours of research time, so THANKS a LOT!

        The following should come close to what you want. It runs a bit more slowly now, a couple of 10ths of a second, but given the small size of your user base, it should not be a problem.

        1. Having slurped the file into a scalar, you can quite easily extract the bit between your begin/end comments (provided your users will never be allowed to embed html in their posts?), using a greedy match and capture brackets:
          ## And extract the comments
          next unless $contents =~ m[ <!--begin--> (.+) <!--end\s--> ]six;

          Note the need to explicitly include the \s in the end delimiter, because I'm using /x to make things a little easier to read. Also /s to allow . to match newlines and /i for case insensitive (though that may be unnecessary for this part).

        2. Okay. I misunderstood your original examples. See the comments below for some explanation of what is going on and ask questions, for anything you don't understand.

          The best thing you can do is play with this script in conjunction with your data and test out the effect of adjusting the regex. Comment bits out; add bits; change the options used to see the effect it has upon the results you get.

          If you have questions, try and supply a 10 or so line script, (not just post back my code!), that demonstrates the problem you are having.

        3. You'll see that having extracted the block of comments, I then separate them into individual comment blocks in the inner while loop. Once you have that, you can then inspect each one for the presence of the user name, and if found, push that onto an array along with the filename. (You could add html markup as appropriate here!).

          Having built the array of matches, you are then in a position to combine them with the rest of the html or return a "Nothing found" page.

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        sub slurp {
            local( $/, @ARGV ) = ( -s( $_[0] ), $_[0] );
            <>;
        }

        my $start = time;
        my $user = $ARGV[0] or die 'No user supplied';

        my @matches;
        my $files = 0;

        ## For each matching file in the Hist directory
        while( my $file = glob 'hist/*.htm' ) {
            $files++;

            ## Slurp the contents
            my $contents = slurp $file;

            ## And extract the comments
            next unless $contents =~ m[ <!--begin--> (.+) <!--end\s--> ]six;
            my $comments = $1;

            ## Break out each individual comment
            while( $comments =~ m[
                (                   ## Capture
                    \n \s* <hr>     ## from the <hr>
                    .+?             ## Everything (non-greedy)
                )
                (?= \n \s* <hr> )   ## Up to the next <hr>
            ]gsix ) {
                my $comment = $1;

                ## And save it if it contains the specified user name
                push @matches, "$file\n$comment" if $comment =~ m[
                    \n \s*          ## On a line, possible leading whitespace
                    \Q$user\E       ## The user name
                    [^\n]* <br>     ## maybe other (non-newline) stuff <br>
                    \s* \n          ## maybe whitespace and newline
                ]mxi;
            }
        }

        printf "Searched $files files in %g seconds\n", time() - $start;

        die "No match found for user $user" unless @matches;

        print "User $user found in files:\n-------\n", join "\n---------\n", @matches;

        __END__
        P:\test>522029 Buk
        Searched 133 files in 0.144615 seconds
        No match found for user Buk at P:\test\522029.pl line 49, <> line 133.

        P:\test>522029 Doug
        Searched 133 files in 0.147709 seconds
        User Doug found in files:
        -------
        hist/h0511.htm
        <HR>
        <b>Comment 1</b><br>
        Doug &lt;<a href="mailto:hun@tele.com">hun@tele.com</a>&gt;<br>
        USA - Thu 11/29/2005 - 22:05:51
        ---------
        hist/h0601.htm
        <HR>
        <b>Comment 1</b><br>
        Doug &lt;<a href="mailto:hun@tele.com">hun@tele.com</a>&gt;<br>
        USA - Thu 01/05/2006 - 22:05:51

        P:\test>522029 "J H"
        Searched 133 files in 0.149665 seconds
        User J H found in files:
        -------
        hist/h0511.htm
        <hr>
        <b>Comment 2</b><br>
        J H<br>
        Clearwater, FL USA - Wed 01/04/2006 - 02:05:12
        ---------
        hist/h0601.htm
        <hr>
        <b>Comment 2</b><br>
        J H<br>
        Clearwater, FL USA - Wed 01/04/2006 - 02:05:12
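
The begin/end extraction from point 1 can be tried on its own before running the whole script. What follows is only a minimal sketch under stated assumptions: `extract_comments` is a hypothetical helper name (it is not in the script above), and the sample page is cut down from the data posted at the top of the thread. It uses the same regex and the same `/s`, `/i` and `/x` options.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return the text
# between the begin/end marker comments, or undef if they are not found.
sub extract_comments {
    my ($contents) = @_;
    # /s lets . match newlines, /i ignores case, /x ignores literal
    # whitespace in the pattern, hence the explicit \s in the end marker.
    return $contents =~ m[ <!--begin--> (.+) <!--end\s--> ]six ? $1 : undef;
}

# Cut-down sample page (assumed data, modelled on the original post).
my $page = join "\n",
    '<HTML><BODY>',
    '<p>header text</p>',
    '<!--begin-->',
    '<HR>',
    '<b>Comment 1</b><br>',
    'Doug<br>',
    'USA - Thu 01/05/2006 - 22:05:51',
    '<hr>',
    '<!--end -->',
    '</BODY></HTML>';

my $comments = extract_comments($page);
print defined $comments ? $comments : 'markers not found', "\n";
```

Anything outside the two markers, such as the header paragraph in the sample, is left out of the captured text, which is what keeps page headers and footers from being mistaken for signatures.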

      Ok, BrowserUK...I've looked...done some reading... and the part I still don't follow is the simple "slurp" subroutine. Could you/others translate it to English a bit more, please? The rest of the code I believe I can understand, at least for now. BTW, I'm still a bit puzzled by how this posting thread thing works, so I'm not sure if this shows twice or once, because I didn't see it.
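
As far as can be told from the code itself, the `slurp` sub relies on two special variables. `local` gives `$/` (the input record separator) and `@ARGV` temporary values that last only until the sub returns. With `@ARGV` holding the file name, the bare `<>` reads from that file, and with `$/` changed from its default newline, one read returns the whole file rather than a single line. The sketch below is a longer-hand way of getting the same result, not a line-for-line translation: `slurp_file` is a hypothetical name, and the temporary file exists only to make the example runnable.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical, more explicit equivalent of slurp: read a whole file
# into a single scalar.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;               # undef $/ for this sub only: no line splitting
    my $contents = <$fh>;   # so one read returns the entire file
    close $fh;
    return $contents;
}

# Demonstration on a throwaway file (assumed data).
my ($tmp_fh, $tmp_name) = tempfile( UNLINK => 1 );
print {$tmp_fh} "line one\nline two\n";
close $tmp_fh;

my $text = slurp_file($tmp_name);
print length($text), " characters read\n";
```

Because `$/` is localized, the rest of the program still reads line by line as usual once the sub has returned.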
Re: Search function for guestbook history?
by radiantmatrix (Parson) on Jan 09, 2006 at 21:24 UTC

    My suggestion is to parse this all once (perhaps with something like HTML::TokeParser) and place your records in a database. If you don't want to (or, for some reason, can't) run a full-on RDBMS, check out DBD::SQLite2 for a file-based RDBMS-in-a-Perl-module solution. For what you're doing, it should perform well enough.

    Once you have these in a database, you can revise your guestbook code to (a)insert records instead of editing files, (b)generate HTML dynamically using something like HTML::Template, and (c)add your search function.

    Your user table might look something like this:

    ID  UserName
    === ============
    1   Doug
    2   JH

    And your comments table would have a PostingUserID column that references the ID above. Your search, then, would look up the ID number of the user you want and then scan your comments table for posts by that ID. This is pretty standard DB stuff, so I apologize if I'm oversimplifying -- I don't know how familiar with DBs you are.
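
The two-step lookup described above (user name to ID, then ID to posts) can be shown without a database at all. This is only a sketch of the shape of the search: plain Perl hashes and arrays stand in for the two tables, the `Body` column and the sample rows are invented for illustration, and `posts_by_user` is a hypothetical name. With a real database the same two steps would be SQL queries issued through DBI.

```perl
use strict;
use warnings;

# Stand-in for the user table: ID => UserName.
my %users = ( 1 => 'Doug', 2 => 'JH' );

# Stand-in for the comments table. PostingUserID refers to a key of %users.
# The Body column and these rows are assumptions made for the example.
my @comments = (
    { PostingUserID => 1, Body => 'Comment 1' },
    { PostingUserID => 2, Body => 'Comment 2' },
    { PostingUserID => 1, Body => 'Another comment' },
);

# Hypothetical search: look up the user's ID, then collect that ID's posts.
sub posts_by_user {
    my ($name) = @_;
    my ($id) = grep { $users{$_} eq $name } keys %users;
    return () unless defined $id;
    return map { $_->{Body} } grep { $_->{PostingUserID} == $id } @comments;
}

print "$_\n" for posts_by_user('Doug');
```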

    End result, you do a bit more work up-front, but you have code that's much more scalable both in terms of adding new features and handling more users. For example, migrating to a more robust DB like MySQL in the future (in order to deal with tens of thousands of users, perhaps) becomes relatively simple.

    <-radiant.matrix->
    A collection of thoughts and links from the minds of geeks
    The Code that can be seen is not the true Code
    "In any sufficiently large group of people, most are idiots" - Kaa's Law
Re: Search function for guestbook history?
by wfsp (Abbot) on Jan 10, 2006 at 10:28 UTC
    BrowserUK suggested an elegant solution but you've hit two snags.

    Parsing HTML is tricky, and then there's feature creep!

    The following snippet uses an HTML parser to do the heavy lifting.

    #!/bin/perl5
    use strict;
    use warnings;

    use HTML::TokeParser::Simple;

    my $html;
    {
        local $/;
        $html = <DATA>;
    }

    my $p = HTML::TokeParser::Simple->new(\$html);

    my ($start, @data);
    while (my $t = $p->get_token) {
        $start++, next if $t->is_comment and $t->as_is eq '<!--begin-->';
        last if $t->is_comment and $t->as_is eq '<!--end -->';
        next unless $start;

        if ($t->is_start_tag('b')){
            my $comment = $p->get_trimmed_text('/b');
            my $sig = $p->get_trimmed_text('hr');
            $sig =~ s/\s+-.*//; # crudely strip the timestamp
            push @data, join '|', $sig, $comment;
        }
    }

    print "$_\n" for @data;

    __DATA__
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <HTML><HEAD><META http-equiv="Content-Type" CONTENT="text/html; charset=iso-8859-1">
    <TITLE>Last Post </TITLE></HEAD>
    <BODY>
    <!--begin-->
    <HR>
    <b>Comment 1</b><br>
    Doug &lt;<a href="mailto:hun@tele.com">hun@tele.com</a>&gt;<br>
    USA - Thu 01/05/2006 - 22:05:51
    <hr>
    <b>Comment 2</b><br>
    J H<br>
    Clearwater, FL USA - Wed 01/04/2006 - 02:05:12
    <hr>
    <!--end -->
    </BODY></HTML>
    output:
    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" _new.pl
    Doug <hun@tele.com> USA|Comment 1
    J H Clearwater, FL USA|Comment 2

    > Terminated with exit code 0.

    Hope this helps.
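
Once the records are in the "signature|comment" form printed above, picking out one user's posts is a small filtering step. The following is a hedged sketch rather than part of the snippet: `comments_by` is a hypothetical name, the two records are copied from the output shown, and it assumes the user name is the first thing in the signature and that the signature itself contains no "|".

```perl
use strict;
use warnings;

# Records in the "signature|comment" form produced by the parser snippet.
my @data = (
    'Doug <hun@tele.com> USA|Comment 1',
    'J H Clearwater, FL USA|Comment 2',
);

# Hypothetical filter: return the comments whose signature starts with
# the given user name (case-insensitive, whole name only).
sub comments_by {
    my ($user, @records) = @_;
    my @found;
    for my $record (@records) {
        # Split on the first "|" only, so a "|" inside the comment survives.
        my ($sig, $comment) = split /\|/, $record, 2;
        push @found, $comment if $sig =~ /^\Q$user\E(?!\w)/i;
    }
    return @found;
}

print "$_\n" for comments_by('J H', @data);
```

The `\Q...\E` quoting matters here for the same reason as in the regex-based script: a user name typed into a search form may contain characters that are special in a regex.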

      Your replies are ABSOLUTELY amazing! Code that is commented and clear enough for newbie/novice me to follow AND even tested a bit! What more could I ask for? NOT much! :)
      I have no idea of how long y'all worked on those replies, but let me say again: It's a slightly delayed Christmas gift! Shouting THANKS is not quite enough, but will have to do!

      We are now up to DAYS (maybe even a WEEK-10days) of work (and frustration) that you have saved me! The first year anniversary of my Chat-m-room is on Jan 27th, and I may be able to have a search function ready by then! :) :)

      Yes, it is true that users cannot use HTML (to block spammers) and thus the format of code is tightly controlled. The layout/sample I gave above shows all the possible embedded code. And I'm able to edit history (and chg new output for current month) to reformat slightly if needed.

      I presumed that by giving full details of my data, it made it easier to give a reply.

      The only thing I see that I really need to add to pgrm output is my common page header/footer code to wrap around the "dump" of comments, as that is how they show on history pages.
      So the output from the first but longer pgrm is a bit more appropriate for my situation, since it has the raw data output.
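
Wrapping the dumped comments in a common header and footer could look roughly like the sketch below. Everything named here is an assumption: `$header` and `$footer` are placeholders for the site's real page code, `results_page` and `escape_html` are hypothetical helpers, and the matches are assumed to be raw comment blocks (already HTML) with no file name in front of them.

```perl
use strict;
use warnings;

# Placeholder header/footer; the real ones would be the site's common code.
my $header = "<HTML><HEAD><TITLE>Search results</TITLE></HEAD>\n<BODY>\n";
my $footer = "</BODY></HTML>\n";

# Escape text that came from the search form before echoing it back.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build the whole results page as one string.
sub results_page {
    my ($user, @matches) = @_;
    my $safe_user = escape_html($user);
    my $body = @matches
        ? "<p>Posts by $safe_user:</p>\n" . join( "\n", @matches ) . "\n<hr>\n"
        : "<p>Nothing found for $safe_user.</p>\n";
    return $header . $body . $footer;
}

print results_page(
    'Doug',
    "<HR>\n<b>Comment 1</b><br>\nDoug<br>\nUSA - Thu 01/05/2006 - 22:05:51",
);
```

The matched comment blocks are passed through untouched because they are already HTML from the history pages; only the search term is escaped.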
        Addendum FYI: The history from the previous owner of the chatroom is 48 months, with the average size of each month's HTML file being about 80kb, with the largest about 220kb. So at present, I have 60 files to scan. And my users are patient, so a bit of delay to accomplish the search is ok. It certainly beats the current situation, which is browsing EACH month's history one-by-one and doing a search via "Find"!
Re: Search function for guestbook history?
by JCHallgren (Sexton) on Jan 09, 2006 at 22:16 UTC
    While I do understand the concept of using a DB, given the task I have to do, and the time I have to do it (almost none), and the lack of knowledge I have, I think converting it to a DB is overkill, given that I have a small nbr of users (maybe 40?) and nbr of posts per day is small (avg of 1 to 5).