
Thanks for your response and your constructive criticism. I can definitely see your points. Here is an example of my HTML input:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
  <meta name="generator" content=
  "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 15.15), see www.w3.org">

  <title>Aberdeen%20Genetic%20purity , News Search |
  Ask.com</title>
  <link rel="shortcut icon" href="http:/">
  <link rel="apple-touch-icon" href=
  "http://www.">
  <link rel="apple-touch-icon" sizes="76x76" href=
  "http://www.">
  <link rel="apple-touch-icon" sizes="120x120" href=
  "http://www">
  <link rel="apple-touch-icon" sizes="152x152" href=
  "http://www.">
  <link rel="apple-touch-icon" sizes="180x180" href=
  "http://www.">
  <meta name="robots" content="noindex, nofollow">
  <style type="text/css">
#midblock,#rightblock,#mobile-footer,.mobile-web-results,.hcsa,.footer {
  visibility:visible !important;
  }
  </style>
  <link type="text/css" rel="stylesheet" href=
  "/st/c/css/ask_news.min.1fbfe92a.css">
  <style type="text/css">
.hcsa {
  visibility:hidden;
  }
  .sprite {
  background-image: url(http://);
  }
  .news-top-video-arrow{
  background-image: url(http://);
  }
  @media
  (-webkit-min-device-pixel-ratio: 2),
  (min-resolution: 192dpi) {
  .sprite {
  background-image: url(http://);
  background-size: 111px 181px;
  }
  }
  </style>
</head>

Here's an example of my output:

Filename    Content-Base    Title    X-Meta-author    X-Meta-description    X-Meta-keywords    X-Meta-name
Test/1.html    Aberdeen%20Animal%20trait%20analysis , News Search | Ask.com

I'm trying to get the regex results to print in the output above, after the last entry, X-Meta-name. If I use the following code instead of the array that I'm using:

my $string = quotemeta 'CEO';
  while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {    
    print $fh1 $1, ",";
     }

I can get the following:

Test/Ames_Animal trait analysis.html.result.txt_parsed_for_news.txt.html    Ames Animal trait analysis , News Search | Ask.com
Test/Ames_Biobank.html.result.txt_parsed_for_clinic.txt.html
both adults and infants. Dr. Kocher has requested who    hrough a separate study. Dr. Lazaridis' samples     alon    colon and rectal cancer. Dr. Nelson has requested sto    rointestinal microbiome. Dr. Nelson and her colleague    in a new research study. Dr. Ames is recruiting parti    ers.</p>
<p>In addition     Dr. Thibodeau has expanded t    sh; who have PKD.</p>
<p>Dr. Harris' goal is to bette    h another study.
</p>
<p>Dr. Heit has also asked for     ients who've had a clot. Dr. Heit's goal is to identi     To study microvesicles     Dr. Jayachandran is requesti    pice caregivers.
</p>
<p>Dr. Kaur is researching whet    18">Nilufer Taner     M.D.     Ph.D.</a>     is studying geneti    0027660">Janet E. Olson     Ph.D.</a>     and <a href="http:
Test/Ames_Biobank.html.result.txt_parsed_for_jobs.txt.html
ompany-overview/hamilton-awards">Awards</a></li>
    <l    Test/Ames_Biobank.html.result.txt_parsed_for_news.txt.html    Ames Biobank , News Search | Ask.com
Test/Ames_Biorepository.html.result.txt_parsed_for_news.txt.html    Ames%2520Biorepository , News Search | Ask.com

But as you can see, I have spacing and formatting issues: I'm trying to get one row per file, with all items from that file listed on that one line.
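One likely source of the broken rows is that the snippets themselves contain newlines (the match uses the /s flag, so `.` also matches line breaks). A minimal sketch of one way to keep each file on a single line, assuming the snippets have already been collected into a list; `one_line_row` is a hypothetical helper name, not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a filename plus its snippets into one
# tab-separated line.
sub one_line_row {
    my ($filename, @snippets) = @_;
    for (@snippets) {
        s/\s+/ /g;      # newlines, tabs and runs of spaces become one space
        s/^ | $//g;     # drop a leading or trailing space left over
    }
    return join "\t", $filename, @snippets;
}

print one_line_row('Test/1.html', "ers.</p>\n<p>In addition", "  Dr. Heit has\talso "), "\n";
```

With Text::CSV, the cleaned list could instead be passed as one array reference to `$csv->print`, so that quoting of commas inside snippets is handled by the module.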

I can certainly copy this:

my $string = quotemeta 'CEO';
  while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {    
    print $fh1 $1, ",";
     }

20 times and replace the $string value each time with the term I'm looking for, but that seems inelegant and wasteful, as I should be able to do it with an array.
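A minimal sketch of how the single-term loop might be generalised to a list of terms, by quoting each term and joining them into one alternation. `find_snippets`, the short term list, and the sample text are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return every snippet (up to 25 characters of
# context on each side) around any of the given terms.
sub find_snippets {
    my ($text, @terms) = @_;
    # quotemeta makes 'Dr.' match a literal dot and escapes spaces,
    # so multi-word terms still work under the /x flag.
    my $alternation = join '|', map { quotemeta } @terms;
    my @snippets;
    while ( $text =~ m/ ( .{0,25} (?:$alternation) .{0,25} ) /gisx ) {
        push @snippets, $1;
    }
    return @snippets;
}

my @terms  = ('CEO', 'founder', 'Dr.');
my $sample = "Jane Roe, CEO of Example Corp, met the founder yesterday.";
print "$_\n" for find_snippets($sample, @terms);
```

Because each match consumes its surrounding context, two terms that sit within 25 characters of each other may end up in the same snippet rather than in two separate ones.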

I've modified my code to the following:

#!perl
use strict;
use warnings;
use File::Find;
use HTTP::Headers;
use HTML::HeadParser;
use Text::CSV;

# config
my $dfile  = 'all_tags.csv';
my $dir    = 'Test';
my @TAGS = ('Content-Base', 'Title', 
            'X-Meta-author', 'X-Meta-description', 
            'X-Meta-keywords', 'X-Meta-name',);
            
my @TAGS2 = ('CEO', 'founder', 
            'professor', 'Dr.', 
            'Ph.D', 'M.D.',
            'company called', 'startup called',
            'joins', 'receives funding',
            'SBIR', 'receiving the grant',
            'seed investment', 'seed fund',
            'appointed', 'chosen',
            'secures', 'award',
            'seed investment', 'awarded',
            );    
            
            
              
# output
my $csv = Text::CSV->new({eol => $/});
open my $fh1, ">:encoding(utf8)", $dfile 
    or die "Error opening $dfile: $!";
$csv->print($fh1,['Filename',@TAGS]); # parser header

my $string = map {quotemeta} @TAGS2;
#my $text = 
while ( my $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {    
$string->print($fh1, ['Filename',@TAGS2]);# regex header
     }

# input              
find ({wanted =>\&HTML_Files, no_chdir => 1}, $dir);
close $fh1 or die "Error closing $dfile: $!";
exit;

sub HTML_Files {
  parse_HTML_Header($File::Find::name) if /\.html?$/;
}

sub parse_HTML_Header {

  my $ifile = shift;
  print "parsing $ifile\n";
  
  open my $fh0, '<', $ifile or die "Error opening $ifile: $!\n";
  my $text = do{ local $/; <$fh0> };
  close $fh0;

  my $h = HTTP::Headers->new;
  my $p = HTML::HeadParser->new($h);
  $p->parse($text);
   
  my @cols = map{ $h->header($_) }@TAGS;
  $csv->print($fh1, [$ifile,@cols]);
  my @cols2 = map{ $h->$string($_) }@TAGS2;
  $string->print($fh1, [$ifile,@cols2]);

  #my $string = quotemeta 'awarded';
  #while ( $text =~ m/ ( .{0,25} $string.{0,25} ) /gisx ) {    
  #print $fh1 $1,"\n";
#     }
    
 }

and get the following errors:

Use of uninitialized value $text in pattern match (m//) at header_parser12.pl line 38.
parsing Test/1.html
Can't locate object method "20" via package "HTTP::Headers" at header_parser12.pl line 66.
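Both messages appear to trace back to line 36 and line 38 of the script. Assigning `map` to a scalar yields the number of elements produced, so with the 20 entries in @TAGS2, $string holds 20, and `$h->$string($_)` then looks for a method named "20". Separately, `my $text =~ ...` declares a fresh, undefined $text at the moment of the match. A small sketch of the context difference, using a shortened term list for illustration:

```perl
use strict;
use warnings;

my @TAGS2 = ('CEO', 'Dr.', 'Ph.D');   # shortened list for illustration

# Scalar context: map returns how many elements it produced.
my $count = map { quotemeta } @TAGS2;

# List context: the quoted terms themselves are kept.
my @quoted = map { quotemeta } @TAGS2;

# Joining the quoted terms gives one pattern usable in a single match.
my $pattern = join '|', @quoted;

print "scalar context: $count\n";
print "list context:   @quoted\n";
print "pattern:        $pattern\n";
```

The regex search would also need to run inside parse_HTML_Header, where $text actually holds the file contents, rather than before any file has been read.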

In reply to Re: Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk
in thread Perl Array Question, combining HTML::HeadParser and regex by Anonymous Monk
