Re: Perl Array Question, combining HTML::HeadParser and regex

Thanks for your response and your constructive criticism. I can definitely see your points. Here are some examples of my html input:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
  <meta name="generator" content=
  "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 15.15), see www.w3.org">

  <title>Aberdeen%20Genetic%20purity , News Search |
  Ask.com</title>
  <link rel="shortcut icon" href="http:/">
  <link rel="apple-touch-icon" href=
  "http://www.">
  <link rel="apple-touch-icon" sizes="76x76" href=
  "http://www.">
  <link rel="apple-touch-icon" sizes="120x120" href=
  "http://www">
  <link rel="apple-touch-icon" sizes="152x152" href=
  "http://www.">
  <link rel="apple-touch-icon" sizes="180x180" href=
  "http://www.">
  <meta name="robots" content="noindex, nofollow">
  <style type="text/css">
#midblock,#rightblock,#mobile-footer,.mobile-web-results,.hcsa,.footer {
  visibility:visible !important;
  }
  </style>
  <link type="text/css" rel="stylesheet" href=
  "/st/c/css/ask_news.min.1fbfe92a.css">
  <style type="text/css">
.hcsa {
  visibility:hidden;
  }
  .sprite {
  background-image: url(http://);
  }
  .news-top-video-arrow{
  background-image: url(http://);
  }
  @media
  (-webkit-min-device-pixel-ratio: 2),
  (min-resolution: 192dpi) {
  .sprite {
  background-image: url(http://);
  background-size: 111px 181px;
  }
  }
  </style>
</head>

Here's an example of my output:

Filename    Content-Base    Title    X-Meta-author    X-Meta-description    X-Meta-keywords    X-Meta-name
Test/1.html    Aberdeen%20Animal%20trait%20analysis , News Search | Ask.com

I'm trying to get the regex results to print on the same row, after the last entry, X-Meta-name. If I use the following code instead of the array that I'm using:

my $string = quotemeta 'CEO';
while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
    print $fh1 $1, ",";
}

I get the following:

Test/Ames_Animal trait analysis.html.result.txt_parsed_for_news.txt.html    Ames Animal trait analysis , News Search | Ask.com
Test/Ames_Biobank.html.result.txt_parsed_for_clinic.txt.html
both adults and infants. Dr. Kocher has requested who    hrough a separate study. Dr. Lazaridis' samples     alon    colon and rectal cancer. Dr. Nelson has requested sto    rointestinal microbiome. Dr. Nelson and her colleague    in a new research study. Dr. Ames is recruiting parti    ers.</p>
<p>In addition     Dr. Thibodeau has expanded t    sh; who have PKD.</p>
<p>Dr. Harris' goal is to bette    h another study.
</p>
<p>Dr. Heit has also asked for     ients who've had a clot. Dr. Heit's goal is to identi     To study microvesicles     Dr. Jayachandran is requesti    pice caregivers.
</p>
<p>Dr. Kaur is researching whet    18">Nilufer Taner     M.D.     Ph.D.</a>     is studying geneti    0027660">Janet E. Olson     Ph.D.</a>     and <a href="http:
Test/Ames_Biobank.html.result.txt_parsed_for_jobs.txt.html
ompany-overview/hamilton-awards">Awards</a></li>
    <l    Test/Ames_Biobank.html.result.txt_parsed_for_news.txt.html    Ames Biobank , News Search | Ask.com
Test/Ames_Biorepository.html.result.txt_parsed_for_news.txt.html    Ames%2520Biorepository , News Search | Ask.com

But as you can see, I have spacing and formatting issues: I'm trying to get one row per file, with every item from that file listed on that one line.

I can certainly copy this:

my $string = quotemeta 'CEO';
while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
    print $fh1 $1, ",";
}

20 times, replacing the $string value each time with the term I'm looking for, but that seems inelegant and wasteful. I should be able to do it with an array.
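One way to avoid the 20 copies is to loop over the list of terms and run the same pattern once per term, collecting the hits in an array. The following is only a minimal sketch: the sample text and the shortened `@terms` list are made up for illustration, and it prints to STDOUT rather than to `$fh1`.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for one file's contents.
my $text  = 'Jane Roe was appointed CEO of Example Corp, and the founder stays on as adviser.';
my @terms = ('CEO', 'founder', 'seed fund');

my @hits;
for my $term (@terms) {
    my $string = quotemeta $term;
    # Same pattern as the single-term version, run once per term.
    while ( $text =~ m/ ( .{0,25} $string .{0,25} ) /gisx ) {
        push @hits, $1;
    }
}

# One row per file: every hit joined onto a single line.
my $row = join ',', @hits;
print "$row\n";
```

Joining by hand like this breaks down as soon as a hit itself contains a comma or a newline, so for real output it is probably safer to pass `@hits` to the Text::CSV object the script already uses and let it handle the quoting.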

I've modified my code to the following:

#!perl
use strict;
use warnings;
use File::Find;
use HTTP::Headers;
use HTML::HeadParser;
use Text::CSV;

# config
my $dfile  = 'all_tags.csv';
my $dir    = 'Test';
my @TAGS = ('Content-Base', 'Title', 
            'X-Meta-author', 'X-Meta-description', 
            'X-Meta-keywords', 'X-Meta-name',);
            
my @TAGS2 = ('CEO', 'founder', 
            'professor', 'Dr.', 
            'Ph.D', 'M.D.',
            'company called', 'startup called',
            'joins', 'receives funding',
            'SBIR', 'receiving the grant',
            'seed investment', 'seed fund',
            'appointed', 'chosen',
            'secures', 'award',
            'seed investment', 'awarded',
            );    
            
            
              
# output
my $csv = Text::CSV->new({eol => $/});
open my $fh1, ">:encoding(utf8)", $dfile 
    or die "Error opening $dfile: $!";
$csv->print($fh1,['Filename',@TAGS]); # parser header

my $string = map {quotemeta} @TAGS2;
#my $text = 
while ( my $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {    
$string->print($fh1, ['Filename',@TAGS2]);# regex header
     }

# input              
find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir);
close $fh1 or die "Error closing $dfile: $!";
exit;

sub HTML_Files {
  parse_HTML_Header($File::Find::name) if /\.html?$/;
}

sub parse_HTML_Header {

  my $ifile = shift;
  print "parsing $ifile\n";
  
  open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n";
  my $text = do{ local $/; <$fh0> };
  close $fh0;

  my $h = HTTP::Headers->new;
  my $p = HTML::HeadParser->new($h);
  $p->parse($text);
   
  my @cols = map{ $h->header($_) }@TAGS;
  $csv->print($fh1, [$ifile,@cols]);
  my @cols2 = map{ $h->$string($_) }@TAGS2;
  $string->print($fh1, [$ifile,@cols2]);

  #my $string = quotemeta 'awarded';
  #while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {    
  #print $fh1 $1,"\n";
#     }
    
 }

and get the following errors:

Use of uninitialized value $text in pattern match (m//) at header_parser12.pl line 38.
parsing Test/1.html
Can't locate object method "20" via package "HTTP::Headers" at header_parser12.pl line 66.
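Both messages can be traced to the script above. Line 38 declares a brand-new `my $text` inside the `while` condition, so the match runs against an undefined value. On line 66, `$string` holds the number 20, because `map` assigned to a scalar returns the count of elements rather than the elements themselves, and `$h->$string(...)` then tries to call a method named "20". The sketch below uses a shortened, illustrative term list to show the difference between the two contexts.

```perl
use strict;
use warnings;

my @TAGS2 = ('CEO', 'founder', 'Dr.', 'seed fund');

# Scalar assignment: map is evaluated in scalar context and yields a count.
my $count = map { quotemeta } @TAGS2;

# List assignment: one escaped string per term.
my @quoted = map { quotemeta } @TAGS2;

# A single alternation pattern built from the escaped terms.
my $alternation = join '|', @quoted;

print "scalar context: $count\n";
print "list context:   @quoted\n";
print "alternation:    $alternation\n";
```

With the full 20-element `@TAGS2`, the scalar assignment yields 20, which matches the method name in the second error message.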

Replies are listed 'Best First'.
Re^2: Perl Array Question, combining HTML::HeadParser and regex by poj (Abbot) on Feb 01, 2016 at 08:43 UTC
Try

#!perl
use strict;
use warnings;
use File::Find;
use HTTP::Headers;
use HTML::HeadParser;
use Text::CSV;

# config
my $dfile  = 'all_tags.csv';
my $dir    = 'Test';
my @TAGS = ('Content-Base', 'Title',
            'X-Meta-author', 'X-Meta-description',
            'X-Meta-keywords', 'X-Meta-name',);

# match words
my @WORDS = qw(
  press founder professor Dr. Ph.D M.D
  called receives joins timing find two
  self bottom true amazing forget night next day
);
my $words = join '|',map { quotemeta } @WORDS ;
my $regex = qr/.{0,25} (?:$words) .{0,25}/;

# output
my $csv = Text::CSV->new({eol => $/});
open my $fh1, ">:encoding(utf8)", $dfile
    or die "Error opening $dfile: $!";
$csv->print($fh1,['Search Words',@WORDS]); # header
$csv->print($fh1,['Filename',@TAGS,'Search Results']); # header

# input
find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir);
close $fh1 or die "Error closing $dfile: $!";
exit;

sub HTML_Files {
  parse_HTML_Header($File::Find::name) if /\.html?$/;
}

sub parse_HTML_Header {

  my $ifile = shift;
  print "parsing $ifile\n";

  open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n";
  my $text = do{ local $/; <$fh0> };
  close $fh0;

  my @matches = ($text =~ /($regex)/gisx);
  #print join "\n",@matches;

  my $h = HTTP::Headers->new;
  my $p = HTML::HeadParser->new($h);
  $p->parse($text);

  my @cols = map{ $h->header($_) || '' }@TAGS;
  $csv->print($fh1, [$ifile,@cols,@matches]);
}

poj
Re^3: Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk on Feb 01, 2016 at 15:50 UTC
OK, thanks again for the modified script. It's way better than what I had. I did some research and figured out that the lotame text being returned is actually in the HTML files. Here's the section it was extracted from:

</script>
<script>
/**
 * Trigger the backfill event with lotame params
 * @param lotameInfo
 * @returns no return
 */
var logBackfillEvent = function (lotameInfo) {
    setTimeout(function () {
        if (typeof _Anemone === "object") {
            payload = {};
            payload.pl_ltmedata = {};
            payload.presentation = {"count": 0};
            payload.provider = {};
            payload.pl_bfj = JSON.stringify(payload);
            payload.pl_ltmedata = lotameInfo;
            payload.pl_ltmedata = JSON.stringify(lotameInfo);
            payload['anxi'] = JSUtil.defaultVal(_AnemoneParams2.eventId, '');
            _Anemone.logEvent('BackFill', payload);
        }
    }, 0);
}
/*
 * Read the lotame cookie
 * If lotame cookie not exist then make a request for the Lotame Audience Extraction async call
 * Drop the lotame cookie
 * Trigger the backfill with lotame params
 * @param e
 * @param t
 * @param id
 */

I'm not sure why the above values were returned, as I don't see any of my keywords in there. Maybe something is close enough that the regex is picking it up. I'll have to take a closer look; maybe I need to do an exact match. I'm also still having trouble searching for two words at a time. I'll keep trying. Any insight you have would be greatly appreciated.
Re^3: Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk on Feb 01, 2016 at 14:36 UTC
Thanks very much, poj, for the assistance. The script now runs without errors on my side and the formatting is perfect: one row per file with comma-separated values. But I don't seem to be getting the keywords I'm looking for from the regex search, and instead get some code-like items that start with an "*" that I can't figure out:

Test/Boulder_Personalized medicine.html.result.txt_parsed_for_news.txt.html Boulder%2520Personalized%2520medicine , News Search | Ask.com Trigger the backfill event with lotam * Read the lotame cookie t then make a request for the Lotame Audience Extractio * Drop the lotame cookie * Trigger the backfill with lotame para //Setting the cookie.raw to true to avoid the encoding

I tried a Google search on "* Trigger the backfill event with lotam" and the other phrases, and I'm stumped. Also, how can I use multiple words here:

# match words
my @WORDS = qw(
  press founder professor Dr. Ph.D M.D
  called receives joins timing find two
  self bottom true amazing forget night next day
);

For example, what if I want to search for the phrase 'founding member'? I tried using single quotes with no luck. Thanks very much for getting me over the hump. I look forward to your reply and a little more guidance.
Re^4: Perl Array Question, combining HTML::HeadParser and regex by poj (Abbot) on Feb 01, 2016 at 17:13 UTC
Add phrases like this

push @WORDS,'founding member','another phrase';

I'm not sure what your regex is trying to do. Are you expecting to capture up to 25 characters either side of the match with this?

/.{0,25} (?:$words) .{0,25}/

Have you considered using an HTML parser?

poj
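Two Perl details bear on the questions above. `qw()` splits on whitespace, so a phrase has to be added as an ordinary quoted string. A `qr//` object also carries its own modifiers: flags placed on the enclosing match do not reach inside it, so a `qr//` written without `/x` treats the spaces in its pattern as literal characters. The sketch below is illustrative only; the sample text, the two-entry phrase list, and the variable names are made up.

```perl
use strict;
use warnings;

# qw() splits on whitespace, so it cannot hold a two-word phrase.
my @split   = qw( founding member );        # two separate elements
my @phrases = ('founding member', 'Dr.');   # quoted strings keep the space

my $words = join '|', map { quotemeta } @phrases;

# Without /x the spaces in the pattern are literal characters.
my $literal_spaces = qr/.{0,25} (?:$words) .{0,25}/;
# With /x they are ignored; /i and /s are compiled into the object too.
my $with_flags     = qr/ .{0,25} (?:$words) .{0,25} /isx;

my $text = "Founding member of the lab";    # hypothetical sample text

# The /isx here applies only to the outer pattern, not inside the qr// object.
my @a = $text =~ /($literal_spaces)/gisx;
my @b = $text =~ /($with_flags)/g;

printf "literal spaces: %d match(es)\n", scalar @a;
printf "with flags:     %d match(es)\n", scalar @b;
```

The first pattern finds nothing in this text because it needs a literal space on both sides of a lowercase phrase; the second matches because its flags were compiled into the `qr//` itself.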