Re^2: Perl Array Question, combining HTML::HeadParser and regex

Try

#!perl
use strict;
use warnings;
use File::Find;
use HTTP::Headers;
use HTML::HeadParser;
use Text::CSV;

# config
my $dfile  = 'all_tags.csv';
my $dir    = 'Test';
my @TAGS = ('Content-Base', 'Title', 
            'X-Meta-author', 'X-Meta-description', 
            'X-Meta-keywords', 'X-Meta-name',);

# match words            
my @WORDS = qw( press founder professor Dr. Ph.D M.D
      called receives joins timing find two 
      self bottom true amazing forget night next day );
my $words =  join '|',map { quotemeta } @WORDS ;
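# Note: compiled without /x or /i, so the spaces are literal and the match is
# case-sensitive; flags on the outer match below do not reach into this qr//.
my $regex = qr/.{0,25} (?:$words) .{0,25}/;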
              
# output
my $csv = Text::CSV->new({binary => 1, eol => $/}); # binary: accept non-ASCII field data
open my $fh1, ">:encoding(utf8)", $dfile 
    or die "Error opening $dfile: $!";
$csv->print($fh1,['Search Words',@WORDS]); # header
$csv->print($fh1,['Filename',@TAGS,'Search Results']); # header

# input              
find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir);
close $fh1 or die "Error closing $dfile: $!";
exit;

sub HTML_Files {
  parse_HTML_Header($File::Find::name) if /\.html?$/;
}

sub parse_HTML_Header {

  my $ifile = shift;
  print "parsing $ifile\n";
  
  open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n";
  my $text = do{ local $/; <$fh0> };
  close $fh0;

  my @matches = ($text =~ /($regex)/gisx);
  #print join "\n",@matches;
  
  my $h = HTTP::Headers->new;
  my $p = HTML::HeadParser->new($h);
  $p->parse($text);
   
  my @cols = map{ $h->header($_) || '' }@TAGS;
  $csv->print($fh1, [$ifile,@cols,@matches]);

}

poj


Replies are listed 'Best First'.
Re^3: Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk on Feb 01, 2016 at 15:50 UTC
OK, thanks again for the modified script. It is way better than what I have. I did some research and figured out that the lotame text returned was actually in the HTML files. Here's the section it was extracted from:

</script> <script>
/**
 * Trigger the backfill event with lotame params
 * @param lotameInfo
 * @returns no return
 */
var logBackfillEvent = function (lotameInfo) {
    setTimeout(function () {
        if (typeof _Anemone === "object") {
            payload = {};
            payload.pl_ltmedata = {};
            payload.presentation = {"count": 0};
            payload.provider = {};
            payload.pl_bfj = JSON.stringify(payload);
            payload.pl_ltmedata = lotameInfo;
            payload.pl_ltmedata = JSON.stringify(lotameInfo);
            payload['anxi'] = JSUtil.defaultVal(_AnemoneParams2.eventId, '');
            _Anemone.logEvent('BackFill', payload);
        }
    }, 0);
}
/**
 * Read the lotame cookie
 * If lotame cookie not exist then make a request for the Lotame Audience Extraction async call
 * Drop the lotame cookie
 * Trigger the backfill with lotame params
 * @param e
 * @param t
 * @param id
 */

I'm not sure why the above text was returned, as I don't see any keywords in it. Maybe something is close enough that it's being picked up. I'll have to take a closer look; maybe I need to do an exact match.

I am also still having trouble searching for two words at a time. I'll keep trying. Any insight you have would be greatly appreciated.
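The script earlier in the thread runs its regex over the whole file (`$text`), so text inside `<script>` elements is searched along with the visible page text. A minimal sketch of one way to avoid that, assuming HTML::Parser is installed (it ships in the same distribution as HTML::HeadParser). The helper name `visible_text` and the sample HTML are made up for illustration and are not part of the original script.

```perl
#!perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the document text with the contents of
# <script> and <style> elements left out.
sub visible_text {
    my ($html) = @_;
    my @chunks;
    my $p = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the handler text with HTML entities decoded
        text_h => [ sub { push @chunks, shift }, 'dtext' ],
    );
    $p->unbroken_text(1);                    # report each text block in one piece
    $p->ignore_elements(qw(script style));   # skip these elements and their content
    $p->parse($html);
    $p->eof;
    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;                      # collapse whitespace, including newlines
    return $text;
}

# Made-up sample input: one keyword in a script comment, one in the body.
my $html = <<'HTML';
<html><head><title>Demo</title>
<script>// Setting the flag to true to avoid encoding</script>
</head><body><p>She joins the board next week.</p></body></html>
HTML

print visible_text($html), "\n";
```

In the thread's script this would mean running the word regex against `visible_text($text)` rather than `$text`, while still passing the raw `$text` to HTML::HeadParser for the header fields.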
Re^3: Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk on Feb 01, 2016 at 14:36 UTC
Thanks very much, poj, for the assistance. The script now runs without errors on my side and the formatting is perfect: one row per file with comma-separated values. But I don't seem to be getting the keywords from the regex search that I'm looking for; instead I get some code-like items that start with an "" that I can't figure out:

Test/Boulder_Personalized medicine.html.result.txt_parsed_for_news.txt.html Boulder%2520Personalized%2520medicine , News Search | Ask.com Trigger the backfill event with lotam * Read the lotame cookie t then make a request for the Lotame Audience Extractio * Drop the lotame cookie * Trigger the backfill with lotame para //Setting the cookie.raw to true to avo id the encoding

I tried a Google search on "* Trigger the backfill event with lotam" and the other phrases, and I'm stumped.

Also, how can I use multiple words here?

`# match words
my @WORDS = qw( press founder professor Dr. Ph.D M.D
      called receives joins timing find two
      self bottom true amazing forget night next day );`

For example, what if I want to search for the phrase 'founding member'? I tried using single quotes with no luck.

Thanks very much for getting me over the hump. I look forward to your reply and a little more guidance.
Re^4: Perl Array Question, combining HTML::HeadParser and regex by poj (Abbot) on Feb 01, 2016 at 17:13 UTC
Add phrases like this:

`push @WORDS,'founding member','another phrase';`

I'm not sure what your regex is trying to do. Are you expecting to capture up to 25 characters either side of the match with this?

`/.{0,25} (?:$words) .{0,25}/`

Have you considered using an HTML parser?

poj
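A minimal sketch of the two points in this reply, assuming the intent is to capture up to 25 characters of context on either side of a whole word or phrase. The shortened word list and the sample text are made up for illustration. The longest-first sort and the `(?<!\w)` / `(?!\w)` boundary checks are additions that the original script does not have.

```perl
#!perl
use strict;
use warnings;

# Search terms; a phrase is simply a list element that contains a space.
my @WORDS = qw( press founder professor );
push @WORDS, 'founding member', 'another phrase';

# Longest first, so a longer term is tried before a shorter one.
# quotemeta escapes the space too, so it stays literal under /x.
my $words = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } @WORDS;

# The flags go on the qr// itself; flags on an outer match are not
# applied to an interpolated qr// object.
#   /i  case-insensitive    /s  let . match newlines
#   /x  ignore whitespace and allow comments inside the pattern
my $regex = qr/
    ( .{0,25} )                  # $1: up to 25 characters before the term
    (?<!\w) ( $words ) (?!\w)    # $2: the term, as a whole word or phrase
    ( .{0,25} )                  # $3: up to 25 characters after the term
/isx;

# Made-up sample text.
my $text = 'In 2010 she became a Founding Member of the society '
         . 'and later its press officer.';

while ( $text =~ /$regex/g ) {
    my ( $before, $term, $after ) = ( $1, $2, $3 );
    print "[$term] ...$before$term$after...\n";
}
```

Matches found with `/g` do not overlap, so a second term that falls inside the trailing context of an earlier match is not reported separately. If every occurrence matters, a smaller context window or a lookahead-based pattern would be needed.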