in reply to Re: Perl Array Question, combining HTML::HeadParser and regex
in thread Perl Array Question, combining HTML::HeadParser and regex

Thanks for your response and your constructive criticism. I can definitely see your points. Here is an example of my HTML input:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <meta name="generator" content="HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 15.15), see www.w3.org">
    <title>Aberdeen%20Genetic%20purity , News Search | Ask.com</title>
    <link rel="shortcut icon" href="http:/">
    <link rel="apple-touch-icon" href="http://www.">
    <link rel="apple-touch-icon" sizes="76x76" href="http://www.">
    <link rel="apple-touch-icon" sizes="120x120" href="http://www">
    <link rel="apple-touch-icon" sizes="152x152" href="http://www.">
    <link rel="apple-touch-icon" sizes="180x180" href="http://www.">
    <meta name="robots" content="noindex, nofollow">
    <style type="text/css">
    #midblock,#rightblock,#mobile-footer,.mobile-web-results,.hcsa,.footer { visibility:visible !important; }
    </style>
    <link type="text/css" rel="stylesheet" href="/st/c/css/ask_news.min.1fbfe92a.css">
    <style type="text/css">
    .hcsa { visibility:hidden; }
    .sprite { background-image: url(http://); }
    .news-top-video-arrow{ background-image: url(http://); }
    @media (-webkit-min-device-pixel-ratio: 2), (min-resolution: 192dpi) {
      .sprite { background-image: url(http://); background-size: 111px 181px; }
    }
    </style>
    </head>

Here's an example of my output:

    Filename Content-Base Title X-Meta-author X-Meta-description X-Meta-keywords X-Meta-name
    Test/1.html Aberdeen%20Animal%20trait%20analysis , News Search | Ask.com

I'm trying to get the regex results to print on the row above, after the last entry (X-Meta-name). If I use the following code instead of the array that I'm using:

    my $string = quotemeta 'CEO';
    while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
        print $fh1 $1, ",";
    }

I can get the following:

    Test/Ames_Animal trait analysis.html.result.txt_parsed_for_news.txt.html Ames Animal trait analysis , News Search | Ask.com
    Test/Ames_Biobank.html.result.txt_parsed_for_clinic.txt.html both adults and infants. Dr. Kocher has requested who hrough a separate study. Dr. Lazaridis' samples alon colon and rectal cancer. Dr. Nelson has requested sto rointestinal microbiome. Dr. Nelson and her colleague in a new research study. Dr. Ames is recruiting parti ers.</p> <p>In addition Dr. Thibodeau has expanded t sh; who have PKD.</p> <p>Dr. Harris' goal is to bette h another study. </p> <p>Dr. Heit has also asked for ients who've had a clot. Dr. Heit's goal is to identi To study microvesicles Dr. Jayachandran is requesti pice caregivers. </p> <p>Dr. Kaur is researching whet 18">Nilufer Taner M.D. Ph.D.</a> is studying geneti 0027660">Janet E. Olson Ph.D.</a> and <a href="http:
    Test/Ames_Biobank.html.result.txt_parsed_for_jobs.txt.html ompany-overview/hamilton-awards">Awards</a></li> <l
    Test/Ames_Biobank.html.result.txt_parsed_for_news.txt.html Ames Biobank , News Search | Ask.com
    Test/Ames_Biorepository.html.result.txt_parsed_for_news.txt.html Ames%2520Biorepository , News Search | Ask.com

but as you can see, I have spacing and formatting issues. I'm trying to get 1 row per file, with all items from that file listed on that 1 line.
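One possible way to keep everything on a single row is sketched below. It assumes the stray line breaks come from snippets that span newlines in the HTML (the /s flag lets . match a newline). The sample text and the @row array are made up for illustration; in the script, the collected snippets would go into the same row as the filename.

```perl
use strict;
use warnings;

# Hypothetical sample text with embedded newlines, standing in for a file's content.
my $text = "both adults and infants.\nDr. Kocher has requested\n  samples. Dr. Heit's\ngoal is to identify risks.";
my $string = quotemeta 'Dr.';

# Collect every snippet for this file instead of printing as we go.
my @row;
while ( $text =~ m/ ( .{0,25} $string .{0,25} ) /gisx ) {
    my $snippet = $1;
    $snippet =~ s/\s+/ /g;          # collapse newlines and runs of spaces
    $snippet =~ s/^\s+|\s+$//g;     # trim both ends
    push @row, $snippet;
}

# All snippets now fit on one line.
print join(' | ', @row), "\n";
```

With the snippets cleaned this way, the whole list can be written in one call per file rather than one print per match.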

I can certainly copy this:

    my $string = quotemeta 'CEO';
    while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
        print $fh1 $1, ",";
    }

20 times and replace the $string value with what I'm looking for, but that seems inelegant and wasteful; I should be able to do it with an array.

I've modified my code to the following:

    #!perl
    use strict;
    use warnings;
    use File::Find;
    use HTTP::Headers;
    use HTML::HeadParser;
    use Text::CSV;

    # config
    my $dfile = 'all_tags.csv';
    my $dir = 'Test';
    my @TAGS = ('Content-Base', 'Title', 'X-Meta-author', 'X-Meta-description',
                'X-Meta-keywords', 'X-Meta-name',);
    my @TAGS2 = ('CEO', 'founder', 'professor', 'Dr.', 'Ph.D', 'M.D.',
                 'company called', 'startup called', 'joins', 'receives funding',
                 'SBIR', 'receiving the grant', 'seed investment', 'seed fund',
                 'appointed', 'chosen', 'secures', 'award', 'seed investment',
                 'awarded', );

    # output
    my $csv = Text::CSV->new({eol => $/});
    open my $fh1, ">:encoding(utf8)", $dfile
      or die "Error opening $dfile: $!";
    $csv->print($fh1,['Filename',@TAGS]); # parser header

    my $string = map {quotemeta} @TAGS2;
    #my $text =
    while ( my $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
        $string->print($fh1, ['Filename',@TAGS2]);# regex header
    }

    # input
    find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir);
    close $fh1 or die "Error closing $dfile: $!";
    exit;

    sub HTML_Files {
        parse_HTML_Header($File::Find::name) if /\.html?$/;
    }

    sub parse_HTML_Header {
        my $ifile = shift;
        print "parsing $ifile\n";
        open my $fh0, '<', $ifile
          or die "Error opening $ifile: $!\n";
        my $text = do{ local $/; <$fh0> };
        close $fh0;

        my $h = HTTP::Headers->new;
        my $p = HTML::HeadParser->new($h);
        $p->parse($text);

        my @cols = map{ $h->header($_) }@TAGS;
        $csv->print($fh1, [$ifile,@cols]);

        my @cols2 = map{ $h->$string($_) }@TAGS2;
        $string->print($fh1, [$ifile,@cols2]);

        #my $string = quotemeta 'awarded';
        #while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
        #print $fh1 $1,"\n";
        # }
    }

and get the following errors:

    Use of uninitialized value $text in pattern match (m//) at header_parser12.pl line 38.
    parsing Test/1.html
    Can't locate object method "20" via package "HTTP::Headers" at header_parser12.pl line 66.
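A likely reading of the two errors, as a small sketch rather than a fix for the whole script: map evaluated in scalar context returns the number of elements it produced, so with the 20-entry @TAGS2 the variable $string holds 20, and $h->$string(...) then tries to call a method named "20". The first warning arises because `my $text` inside the while condition declares a new, undefined variable. The three-entry list below is a shortened stand-in for @TAGS2, and $count and $alternation are illustrative names.

```perl
use strict;
use warnings;

# Shortened stand-in for the @TAGS2 list in the script.
my @TAGS2 = ('CEO', 'founder', 'Dr.');

# map in scalar context yields a count, not the escaped strings.
my $count = map { quotemeta } @TAGS2;

# Joining the list in list context yields a usable alternation instead.
my $alternation = join '|', map { quotemeta } @TAGS2;

print "count: $count\n";
print "alternation: $alternation\n";
```

The joined string can then be interpolated into a single pattern, so one pass over the text covers every search term.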

Re^2: Perl Array Question, combining HTML::HeadParser and regex
by poj (Abbot) on Feb 01, 2016 at 08:43 UTC
    Try
        #!perl
        use strict;
        use warnings;
        use File::Find;
        use HTTP::Headers;
        use HTML::HeadParser;
        use Text::CSV;

        # config
        my $dfile = 'all_tags.csv';
        my $dir = 'Test';
        my @TAGS = ('Content-Base', 'Title', 'X-Meta-author', 'X-Meta-description',
                    'X-Meta-keywords', 'X-Meta-name',);

        # match words
        my @WORDS = qw(
          press founder professor Dr. Ph.D M.D called receives joins
          timing find two self bottom true amazing forget night next day
        );
        my $words = join '|',map { quotemeta } @WORDS ;
        my $regex = qr/.{0,25} (?:$words) .{0,25}/;

        # output
        my $csv = Text::CSV->new({eol => $/});
        open my $fh1, ">:encoding(utf8)", $dfile
          or die "Error opening $dfile: $!";
        $csv->print($fh1,['Search Words',@WORDS]); # header
        $csv->print($fh1,['Filename',@TAGS,'Search Results']); # header

        # input
        find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir);
        close $fh1 or die "Error closing $dfile: $!";
        exit;

        sub HTML_Files {
            parse_HTML_Header($File::Find::name) if /\.html?$/;
        }

        sub parse_HTML_Header {
            my $ifile = shift;
            print "parsing $ifile\n";
            open my $fh0, '<', $ifile
              or die "Error opening $ifile: $!\n";
            my $text = do{ local $/; <$fh0> };
            close $fh0;

            my @matches = ($text =~ /($regex)/gisx);
            #print join "\n",@matches;

            my $h = HTTP::Headers->new;
            my $p = HTML::HeadParser->new($h);
            $p->parse($text);

            my @cols = map{ $h->header($_) || '' }@TAGS;
            $csv->print($fh1, [$ifile,@cols,@matches]);
        }
    poj

      OK, thanks again for the modified script. It's way better than what I had. I did some research and figured out that the lotame info returned was actually in the HTML files. Here's the section it was extracted from:

          </script>
          <script>
          /**
           * Trigger the backfill event with lotame params
           * @param lotameInfo
           * @returns no return
           */
          var logBackfillEvent = function (lotameInfo) {
              setTimeout(function () {
                  if (typeof _Anemone === "object") {
                      payload = {};
                      payload.pl_ltmedata = {};
                      payload.presentation = {"count": 0};
                      payload.provider = {};
                      payload.pl_bfj = JSON.stringify(payload);
                      payload.pl_ltmedata = lotameInfo;
                      payload.pl_ltmedata = JSON.stringify(lotameInfo);
                      payload['anxi'] = JSUtil.defaultVal(_AnemoneParams2.eventId, '');
                      _Anemone.logEvent('BackFill', payload);
                  }
              }, 0);
          }
          /**
           * Read the lotame cookie
           * If lotame cookie not exist then make a request for the Lotame Audience Extraction async call
           * Drop the lotame cookie
           * Trigger the backfill with lotame params
           * @param e
           * @param t
           * @param id
           */

      I'm not sure why the above values were returned, as I don't see any keywords in there. Maybe something is close enough that it's being picked up. I'll have to take a closer look. Maybe I need to do an exact match. I'm also still having trouble searching for two words at a time. I'll keep trying. Any insight you have would be greatly appreciated.

      Thanks very much, poj, for the assistance. The script now runs without errors on my side and the formatting is perfect: 1 row per file with comma-separated values. But I don't seem to be getting the keywords I'm looking for from the regex search, and instead get some code-like items starting with "*" that I can't figure out:

          Test/Boulder_Personalized medicine.html.result.txt_parsed_for_news.txt.html Boulder%2520Personalized%2520medicine , News Search | Ask.com * Trigger the backfill event with lotam * Read the lotame cookie t then make a request for the Lotame Audience Extractio * Drop the lotame cookie * Trigger the backfill with lotame para //Setting the cookie.raw to true to avo id the encoding

      I tried a Google search on "* Trigger the backfill event with lotam" and the other phrases, and I'm stumped. Also, how can I use multiple words here:

          # match words
          my @WORDS = qw( press founder professor Dr. Ph.D M.D called receives joins timing find two self bottom true amazing forget night next day );

      For example, what if I want to search for the phrase 'founding member'? I tried using single quotes with no luck. Thanks very much for getting me over the hump. I look forward to your reply and a little more guidance.

        Add phrases like this:

        push @WORDS,'founding member','another phrase';
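A small sketch of why single quotes inside qw() do not help while push with an ordinary quoted string does. The word list is shortened for illustration, and @broken and the sample sentence are hypothetical.

```perl
use strict;
use warnings;

# qw() splits on whitespace, so a quoted phrase inside it becomes two
# separate words with the quote characters kept as part of them.
my @broken = qw( founder 'founding member' );
# @broken is ("founder", "'founding", "member'")

# An ordinary list element keeps the phrase in one piece.
my @WORDS = qw( founder professor );
push @WORDS, 'founding member';

# quotemeta escapes the space, so the phrase still matches a literal
# space even when the pattern is compiled with /x.
my $words = join '|', map { quotemeta } @WORDS;
my $regex = qr/ (?:$words) /ix;

print scalar(@broken), " elements from qw\n";
print "pattern: $words\n";
print "matched\n" if 'She was a Founding Member.' =~ $regex;
```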

        I'm not sure what your regex is trying to do. Are you expecting to capture up to 25 characters either side of the match with this?

        /.{0,25} (?:$words) .{0,25}/
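If the intent is a window of up to 25 characters on each side, one detail worth checking is where the flags sit. A pattern compiled with qr// keeps its own flags when interpolated into another match, so /isx on the outer match does not reach inside it. The sketch below compares the two forms; $spaced, $relaxed and the sample sentence are illustrative names and data, not taken from the script.

```perl
use strict;
use warnings;

my @WORDS = ('founder', 'Dr.', 'founding member');
my $words = join '|', map { quotemeta } @WORDS;

# No flags on the qr//: the spaces are literal and matching is case-sensitive.
my $spaced  = qr/.{0,25} (?:$words) .{0,25}/;
# Flags on the qr// itself: spaces ignored, case-insensitive, . matches newline.
my $relaxed = qr/ .{0,25} (?:$words) .{0,25} /isx;

my $text = "He met the co-founder, DR. Smith, today.";

my @strict  = $text =~ /($spaced)/gisx;   # outer flags do not apply inside $spaced
my @lenient = $text =~ /($relaxed)/g;

print scalar(@strict), " strict, ", scalar(@lenient), " lenient\n";
```

Note that the leading .{0,25} is greedy, so a keyword close to an earlier one can be swallowed into the earlier snippet instead of producing its own match.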

        Have you considered using an HTML parser?
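A minimal sketch of that idea, assuming HTML::Parser is available (HTML::HeadParser ships in the same distribution). It collects only the text outside script and style elements and searches that, so JavaScript comments are never seen by the regex. The sample HTML, $visible and the two-word pattern are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical page: a title, a script with a comment, and one paragraph.
my $html = <<'HTML';
<html><head><title>Ames Biobank</title>
<script>/* Trigger the backfill event, set raw to true */</script>
</head><body><p>Dr. Heit has also asked for samples.</p></body></html>
HTML

my $visible = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    # Append each chunk of decoded text to $visible.
    text_h => [ sub { $visible .= shift }, 'dtext' ],
);
# Skip script and style elements together with their content.
$parser->ignore_elements(qw(script style));
$parser->parse($html);
$parser->eof;

# Search only the text that survived the parser.
my @matches = $visible =~ /( .{0,25} (?:Dr\.|true) .{0,25} )/gisx;
print "$_\n" for @matches;
```

The title text is still included here; whether to keep or skip it is a design choice that depends on what the search results are meant to cover.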

        poj