in reply to Search function for guestbook history?

BrowserUK suggested an elegant solution but you've hit two snags.

Parsing HTML is tricky, and so is feature creep!

The following snippet uses an HTML parser to do the heavy lifting.

#!/bin/perl5
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $html;
{
    local $/;
    $html = <DATA>;
}

my $p = HTML::TokeParser::Simple->new(\$html);

my ($start, @data);
while (my $t = $p->get_token) {
    $start++, next if $t->is_comment and $t->as_is eq '<!--begin-->';
    last if $t->is_comment and $t->as_is eq '<!--end -->';
    next unless $start;
    if ($t->is_start_tag('b')){
        my $comment = $p->get_trimmed_text('/b');
        my $sig     = $p->get_trimmed_text('hr');
        $sig =~ s/\s+-.*//; # crudely strip the timestamp
        push @data, join '|', $sig, $comment;
    }
}
print "$_\n" for @data;

__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML><HEAD><META http-equiv="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>Last Post </TITLE></HEAD>
<BODY>
<!--begin-->
<HR>
<b>Comment 1</b><br>
Doug &lt;<a href="mailto:hun@tele.com">hun@tele.com</a>&gt;<br>
USA - Thu 01/05/2006 - 22:05:51
<hr>
<b>Comment 2</b><br>
J H<br>
Clearwater, FL USA - Wed 01/04/2006 - 02:05:12
<hr>
<!--end -->
</BODY></HTML>

output:

---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
Doug <hun@tele.com> USA|Comment 1
J H Clearwater, FL USA|Comment 2

> Terminated with exit code 0.

Hope this helps.

Re^2: Search function for guestbook history?
by JCHallgren (Sexton) on Jan 10, 2006 at 15:06 UTC
    Your replies are ABSOLUTELY amazing! Code that is commented and clear enough for newbie/novice me to follow AND even tested a bit! What more could I ask for? NOT much! :)
    I have no idea of how long y'all worked on those replies, but let me say again: It's a slightly delayed Christmas gift! Shouting THANKS is not quite enough, but will have to do!

    We are now up to DAYS (maybe even a WEEK-10days) of work (and frustration) that you have saved me! The first year anniversary of my Chat-m-room is on Jan 27th, and I may be able to have a search function ready by then! :) :)

    Yes, it is true that users cannot use HTML (to block spammers), so the format of the code is tightly controlled. The layout/sample I gave above shows all the possible embedded code. And I'm able to edit the history (and change the new output for the current month) to reformat slightly if needed.

    I presumed that by giving full details of my data, it made it easier to give a reply.

    The only thing I see that I really need to add to the program output is my common page header/footer code to wrap around the "dump" of comments, as that is how they show on the history pages.
    So the output from the first but longer program is a bit more appropriate for my situation, since it has the raw data output.
      Addendum FYI: The history from the previous owner of the chatroom is 48 months, with the average size of each month's HTML file being about 80kb, and the largest about 220kb. So at present, I have 60 files to scan. And my users are patient, so a bit of delay to accomplish the search is ok. It certainly beats the current situation, which is browsing EACH month's history one-by-one and doing a search via "Find"!
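
The search-and-wrap step described above might look roughly like the sketch below. It is not code from the thread: `search_records` and `wrap_page` are hypothetical helper names, the header and footer strings are placeholders for the real site markup, and the records are assumed to be in the "signature|comment" form that the parser script prints.

```perl
use strict;
use warnings;

# Hypothetical helper: return the records whose text contains $term,
# compared case-insensitively; \Q...\E keeps the term literal.
sub search_records {
    my ($term, @records) = @_;
    return grep { /\Q$term\E/i } @records;
}

# Hypothetical helper: wrap the matches in a page header and footer,
# separating entries with <hr> as the history pages do.
sub wrap_page {
    my ($header, $footer, @hits) = @_;
    my $body = join '', map { "$_\n<hr>\n" } @hits;
    return $header . $body . $footer;
}

# Placeholder markup (an assumption); the real header/footer comes from the site.
my $header = "<html><body>\n<hr>\n";
my $footer = "</body></html>\n";

# Records in the "signature|comment" form printed by the parser script.
my @records = (
    'Doug <hun@tele.com> USA|Comment 1',
    'J H Clearwater, FL USA|Comment 2',
);

my @hits = search_records('clearwater', @records);
my $page = wrap_page($header, $footer, @hits);
print $page;
```

With 60 monthly files, the records would be collected file by file before the `grep`; that loop is left out here because the file names and locations are not given in the thread.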

        From my quick test, wsfp's (very excellent) code using HTML::TokeParser::Simple takes around 7 seconds to run the same test that my rather crude regex solution completes in 2/10ths of a second.

        As you are processing your own, controlled data, you can choose either with a fair degree of safety. If you were processing html from another source where you didn't control the layout, the parser route would be preferable.
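
To make the trade-off concrete, here is a minimal sketch of what a regex-only extraction could look like for this tightly controlled layout. It is an illustration only, not the regex solution referred to above: `extract_entries` is a hypothetical name, and the patterns assume exactly the markers, tags and entities seen in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): pull "signature|comment"
# records out of one history page using plain regexes. It assumes the
# controlled layout of the sample: <!--begin--> ... <!--end --> markers,
# a <b>comment</b>, then the signature up to the next <hr>.
sub extract_entries {
    my ($html) = @_;

    # Only look between the two marker comments.
    my ($body) = $html =~ /<!--begin-->(.*?)<!--end -->/s
        or return;

    my @entries;
    while ($body =~ m{<b>(.*?)</b>(.*?)<hr>}gis) {
        my ($comment, $sig) = ($1, $2);
        $sig =~ s/<br\s*\/?>/ /gi;   # a line break becomes a space
        $sig =~ s/<[^>]+>//g;        # drop remaining tags (<a ...>, </a>)
        $sig =~ s/&lt;/</g;          # decode only the two entities the sample uses
        $sig =~ s/&gt;/>/g;
        $sig =~ s/\s+-.*//s;         # crudely strip the timestamp
        $sig =~ s/\s+/ /g;           # collapse runs of whitespace
        $sig =~ s/^\s+|\s+$//g;      # trim both ends
        push @entries, join '|', $sig, $comment;
    }
    return @entries;
}

my $sample = <<'HTML';
<BODY>
<!--begin-->
<HR>
<b>Comment 1</b><br>
Doug &lt;<a href="mailto:hun@tele.com">hun@tele.com</a>&gt;<br>
USA - Thu 01/05/2006 - 22:05:51
<hr>
<b>Comment 2</b><br>
J H<br>
Clearwater, FL USA - Wed 01/04/2006 - 02:05:12
<hr>
<!--end -->
</BODY>
HTML

my @entries = extract_entries($sample);
print "$_\n" for @entries;
```

On the sample this prints the same two records as the parser version. It would break as soon as the markup drifts (attributes on `<b>`, other entities, a hyphen inside a name), which is the reason the parser route is the safer one for HTML whose layout you do not control.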


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.