Thanks for your response and your constructive criticism. I can definitely see your points. Here is an example of my HTML input:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 15.15), see www.w3.org">
<title>Aberdeen%20Genetic%20purity , News Search | Ask.com</title>
<link rel="shortcut icon" href="http:/">
<link rel="apple-touch-icon" href="http://www.">
<link rel="apple-touch-icon" sizes="76x76" href="http://www.">
<link rel="apple-touch-icon" sizes="120x120" href="http://www">
<link rel="apple-touch-icon" sizes="152x152" href="http://www.">
<link rel="apple-touch-icon" sizes="180x180" href="http://www.">
<meta name="robots" content="noindex, nofollow">
<style type="text/css">
#midblock,#rightblock,#mobile-footer,.mobile-web-results,.hcsa,.footer { visibility:visible !important; }
</style>
<link type="text/css" rel="stylesheet" href="/st/c/css/ask_news.min.1fbfe92a.css">
<style type="text/css">
.hcsa { visibility:hidden; }
.sprite { background-image: url(http://); }
.news-top-video-arrow{ background-image: url(http://); }
@media (-webkit-min-device-pixel-ratio: 2), (min-resolution: 192dpi) {
  .sprite { background-image: url(http://); background-size: 111px 181px; }
}
</style>
</head>

Here's an example of my output:

Filename Content-Base Title X-Meta-author X-Meta-description X-Meta-keywords X-Meta-name
Test/1.html Aberdeen%20Animal%20trait%20analysis , News Search | Ask.com

I'm trying to get the regex results to print in the output above, after the last entry, X-Meta-name. If I use the following code instead of the array that I'm using:

my $string = quotemeta 'CEO';
while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
    print $fh1 $1, ",";
}

I can get the following:

Test/Ames_Animal trait analysis.html.result.txt_parsed_for_news.txt.html Ames Animal trait analysis , News Search | Ask.com
Test/Ames_Biobank.html.result.txt_parsed_for_clinic.txt.html
both adults and infants. Dr. Kocher has requested who hrough a separate study. Dr. Lazaridis' samples alon colon and rectal cancer. Dr. Nelson has requested sto rointestinal microbiome. Dr. Nelson and her colleague in a new research study. Dr. Ames is recruiting parti ers.</p> <p>In addition Dr. Thibodeau has expanded t sh; who have PKD.</p> <p>Dr. Harris' goal is to bette h another study. </p> <p>Dr. Heit has also asked for ients who've had a clot. Dr. Heit's goal is to identi To study microvesicles Dr. Jayachandran is requesti pice caregivers. </p> <p>Dr. Kaur is researching whet 18">Nilufer Taner M.D. Ph.D.</a> is studying geneti 0027660">Janet E. Olson Ph.D.</a> and <a href="http:
Test/Ames_Biobank.html.result.txt_parsed_for_jobs.txt.html
ompany-overview/hamilton-awards">Awards</a></li> <l
Test/Ames_Biobank.html.result.txt_parsed_for_news.txt.html Ames Biobank , News Search | Ask.com
Test/Ames_Biorepository.html.result.txt_parsed_for_news.txt.html Ames%2520Biorepository , News Search | Ask.com

But as you can see, I have spacing and formatting issues: I'm trying to get one row per file, with all items from that file listed on that one line.

I can certainly copy this:

my $string = quotemeta 'CEO';
while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
    print $fh1 $1, ",";
}

20 times, replacing the $string value each time with the term I'm looking for, but that seems inelegant and wasteful. I should be able to do it with an array.
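What I have in mind is roughly the sketch below: quote each term, join the terms into a single alternation, and collect every match for one file into a list that could then become one row. The names @terms, $alternation, and snippets_for are placeholders of mine, the term list is shortened, and the sample text is made up. I'm assuming the same 25 characters of context and the same /gisx flags as in the snippet above.

```perl
use strict;
use warnings;

# Hypothetical, shortened list of the terms I'm searching for.
my @terms = ('CEO', 'Dr.', 'Ph.D', 'awarded');

# Quote each term, then join them into one alternation.
my $alternation = join '|', map { quotemeta } @terms;

# Return every match, with up to 25 characters of context on each side.
sub snippets_for {
    my ($text) = @_;
    my @found;
    while ( $text =~ m/ ( .{0,25} (?:$alternation) .{0,25} ) /gisx ) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample text, only to exercise the function.
my @row = snippets_for('Our CEO was awarded a grant.');
print join(',', @row), "\n";
```

One thing I'd expect from this approach: each match consumes its surrounding context, so two terms within 25 characters of each other come back as a single snippet rather than two.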

I've modified my code to the following:

#!perl
use strict;
use warnings;
use File::Find;
use HTTP::Headers;
use HTML::HeadParser;
use Text::CSV;

# config
my $dfile = 'all_tags.csv';
my $dir = 'Test';
my @TAGS = ('Content-Base', 'Title', 'X-Meta-author', 'X-Meta-description',
            'X-Meta-keywords', 'X-Meta-name',);
my @TAGS2 = ('CEO', 'founder', 'professor', 'Dr.', 'Ph.D', 'M.D.',
             'company called', 'startup called', 'joins', 'receives funding',
             'SBIR', 'receiving the grant', 'seed investment', 'seed fund',
             'appointed', 'chosen', 'secures', 'award', 'seed investment',
             'awarded', );

# output
my $csv = Text::CSV->new({eol => $/});
open my $fh1, ">:encoding(utf8)", $dfile or die "Error opening $dfile: $!";
$csv->print($fh1,['Filename',@TAGS]); # parser header

my $string = map {quotemeta} @TAGS2;
#my $text =
while ( my $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
    $string->print($fh1, ['Filename',@TAGS2]);# regex header
}

# input
find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir);
close $fh1 or die "Error closing $dfile: $!";
exit;

sub HTML_Files {
    parse_HTML_Header($File::Find::name) if /\.html?$/;
}

sub parse_HTML_Header {
    my $ifile = shift;
    print "parsing $ifile\n";
    open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n";
    my $text = do{ local $/; <$fh0> };
    close $fh0;

    my $h = HTTP::Headers->new;
    my $p = HTML::HeadParser->new($h);
    $p->parse($text);

    my @cols = map{ $h->header($_) }@TAGS;
    $csv->print($fh1, [$ifile,@cols]);
    my @cols2 = map{ $h->$string($_) }@TAGS2;
    $string->print($fh1, [$ifile,@cols2]);

    #my $string = quotemeta 'awarded';
    #while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {
    #print $fh1 $1,"\n";
    # }
}

and get the following errors:

Use of uninitialized value $text in pattern match (m//) at header_parser12.pl line 38.
parsing Test/1.html
Can't locate object method "20" via package "HTTP::Headers" at header_parser12.pl line 66.
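My guess at where the "20" comes from: I assign the result of map to a scalar, and in scalar context map returns the number of items it produced rather than the list itself. @TAGS2 has 20 entries, so $string would hold 20, and $h->$string(...) would then try to call a method named 20. A small check of that assumption, where @terms is a shortened stand-in for @TAGS2 and $count and @quoted are names I made up for the illustration:

```perl
use strict;
use warnings;

# Shortened stand-in for @TAGS2.
my @terms = ('CEO', 'founder', 'Dr.');

# Scalar assignment: map runs in scalar context and yields a count.
my $count = map { quotemeta } @terms;

# List assignment: map yields the quoted strings themselves.
my @quoted = map { quotemeta } @terms;

print "scalar context: $count\n";
print "list context:   @quoted\n";
```

If that is right, the scalar holds 3 here and the array holds the three quoted terms, which would explain why my version ends up with a number where I expected a pattern.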


In reply to Re: Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk
in thread Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk
