in reply to Re: Perl Array Question, combining HTML::HeadParser and regex
in thread Perl Array Question, combining HTML::HeadParser and regex

Try
#!perl
use strict;
use warnings;
use File::Find;
use HTTP::Headers;
use HTML::HeadParser;
use Text::CSV;

# config
my $dfile = 'all_tags.csv';
my $dir   = 'Test';
my @TAGS  = ('Content-Base', 'Title', 'X-Meta-author', 'X-Meta-description',
             'X-Meta-keywords', 'X-Meta-name',);

# match words
my @WORDS = qw(
  press founder professor Dr. Ph.D M.D called receives joins timing
  find two self bottom true amazing forget night next day
);
my $words = join '|', map { quotemeta } @WORDS;
my $regex = qr/.{0,25} (?:$words) .{0,25}/;

# output
my $csv = Text::CSV->new({eol => $/});
open my $fh1, ">:encoding(utf8)", $dfile
  or die "Error opening $dfile: $!";
$csv->print($fh1, ['Search Words', @WORDS]);               # header
$csv->print($fh1, ['Filename', @TAGS, 'Search Results']);  # header

# input
find({wanted => \&HTML_Files, no_chdir => 1}, $dir);
close $fh1 or die "Error closing $dfile: $!";
exit;

sub HTML_Files {
  parse_HTML_Header($File::Find::name) if /\.html?$/;
}

sub parse_HTML_Header {
  my $ifile = shift;
  print "parsing $ifile\n";
  open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n";
  my $text = do { local $/; <$fh0> };
  close $fh0;

  my @matches = ($text =~ /($regex)/gisx);
  #print join "\n",@matches;

  my $h = HTTP::Headers->new;
  my $p = HTML::HeadParser->new($h);
  $p->parse($text);

  my @cols = map { $h->header($_) || '' } @TAGS;
  $csv->print($fh1, [$ifile, @cols, @matches]);
}
poj

Replies are listed 'Best First'.
Re^3: Perl Array Question, combining HTML::HeadParser and regex
by Anonymous Monk on Feb 01, 2016 at 15:50 UTC

    OK, thanks again for the modified script. It's way better than what I had. I did some research and figured out that the lotame info returned is actually in the HTML files. Here's the section it was extracted from:

    </script> <script>
    /**
     * Trigger the backfill event with lotame params
     * @param lotameInfo
     * @returns no return
     */
    var logBackfillEvent = function (lotameInfo) {
        setTimeout(function () {
            if (typeof _Anemone === "object") {
                payload = {};
                payload.pl_ltmedata = {};
                payload.presentation = {"count": 0};
                payload.provider = {};
                payload.pl_bfj = JSON.stringify(payload);
                payload.pl_ltmedata = lotameInfo;
                payload.pl_ltmedata = JSON.stringify(lotameInfo);
                payload['anxi'] = JSUtil.defaultVal(_AnemoneParams2.eventId, '');
                _Anemone.logEvent('BackFill', payload);
            }
        }, 0);
    }
    /**
     * Read the lotame cookie
     * If lotame cookie not exist then make a request for the Lotame Audience Extraction async call
     * Drop the lotame cookie
     * Trigger the backfill with lotame params
     * @param e
     * @param t
     * @param id
     */

    I'm not sure why the above values were returned, as I don't see any of my keywords in there. Maybe something is close enough that the regex is pulling it in. I'll have to take a closer look; maybe I need to do an exact match. I'm also still having trouble searching for two words at a time. I'll keep trying. Any insight you have would be greatly appreciated.

Re^3: Perl Array Question, combining HTML::HeadParser and regex
by Anonymous Monk on Feb 01, 2016 at 14:36 UTC

    Thanks very much, poj, for the assistance. The script now runs without errors on my side and the formatting is perfect: one row per file with comma-separated values. But I don't seem to be getting the keywords I'm looking for from the regex search. Instead I get some code-like items that start with an "*" that I can't figure out:

    Test/Boulder_Personalized medicine.html.result.txt_parsed_for_news.txt.html Boulder%2520Personalized%2520medicine , News Search | Ask.com * Trigger the backfill event with lotam * Read the lotame cookie t then make a request for the Lotame Audience Extractio * Drop the lotame cookie * Trigger the backfill with lotame para //Setting the cookie.raw to true to avo id the encoding

    I tried a Google search on "* Trigger the backfill event with lotam" and the other phrases and I'm stumped. Also, how can I use multiple words here:

    # match words
    my @WORDS = qw(
      press founder professor Dr. Ph.D M.D called receives joins timing
      find two self bottom true amazing forget night next day
    );

    For example, what if I want to search for the phrase 'founding member'? I tried using single quotes with no luck. Thanks very much for getting me over the hump. I look forward to your reply and a little more guidance.

      Add phrases like this

      push @WORDS,'founding member','another phrase';
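A minimal sketch of why the push works where quoting inside qw() does not: qw() splits its contents on whitespace and keeps any quote characters as literal text, so a two-word phrase has to be added as an ordinary string. The shortened word list and the sample sentence are made up for illustration.

```perl
use strict;
use warnings;

# Single words can live in qw(); phrases with spaces cannot.
my @WORDS = qw( press founder professor );

# Add multi-word phrases as ordinary quoted strings.
push @WORDS, 'founding member', 'another phrase';

# quotemeta escapes the space (and any dots), so each phrase is matched literally.
my $words = join '|', map { quotemeta } @WORDS;
print "alternation: $words\n";

# Hypothetical sample text, not taken from the thread's HTML files.
my $text = 'She was a founding member of the society.';
my @hits = $text =~ /($words)/g;
print "matched: $_\n" for @hits;
```

Because quotemeta backslash-escapes the space, the phrase still matches a literal space even if the pattern is later compiled with /x.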

      I'm not sure what your regex is trying to do. Are you expecting to capture up to 25 characters either side of the match with this?

      /.{0,25} (?:$words) .{0,25}/
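One thing worth checking with this pattern, sketched below under the assumption that it is built with qr// without flags and then used as /($regex)/gisx, as in the script at the top of the thread: a qr// object keeps its own flags, so the /i, /s and /x on the outer match do not apply inside it. The spaces in the pattern are then literal and the words are matched case-sensitively. The two-item word list and the sample text are hypothetical.

```perl
use strict;
use warnings;

my $words = join '|', map { quotemeta } ('press', 'founding member');

# As in the script: no flags on qr//, flags only on the outer match.
my $loose = qr/.{0,25} (?:$words) .{0,25}/;

# Assumed alternative: the same pattern with the flags on qr// itself,
# so the spaces are ignored (/x) and matching is case-insensitive (/i).
my $flagged = qr/.{0,25} (?:$words) .{0,25}/isx;

# Hypothetical sample text.
my $text = 'Our PRESS release is out.';

my @a = ($text =~ /($loose)/gisx);   # outer flags do not reach inside $loose
my @b = ($text =~ /($flagged)/g);

print 'flags on outer match only: ', scalar(@a), " match(es)\n";
print 'flags on qr// itself:      ', scalar(@b), " match(es)\n";
print "  $_\n" for @b;
```

Note that the trailing .{0,25} consumes the text after each hit, so a second keyword that falls inside that window is not reported as a separate match.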

      Have you considered using an HTML parser?
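A rough sketch of what that could look like, assuming HTML::Parser (the base class of HTML::HeadParser) is installed. It collects only the text outside script and style elements, so JavaScript comments never reach the keyword regex. The helper name visible_text and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the text of a page without <script>/<style> content.
sub visible_text {
    my ($html) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # Append each decoded text chunk to the buffer.
        text_h => [ sub { $out .= shift }, 'dtext' ],
    );
    # Skip everything inside these elements, including JS comments.
    $p->ignore_elements(qw(script style));
    $p->parse($html);
    $p->eof;
    return $out;
}

# Hypothetical sample page.
my $html = '<html><head><script>/* Read the cookie */</script></head>'
         . '<body><p>She joins the press office.</p></body></html>';

my $text = visible_text($html);
print "$text\n";
```

In the script above, the keyword regex could then be run against the returned text instead of the raw file contents, while HTML::HeadParser still receives the full HTML.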

      poj