mcoblentz has asked for the wisdom of the Perl Monks concerning the following question:

Dear Brothers,

I have an odd situation and while I have tried to study the mysteries, enlightenment eludes me. I seek guidance and wisdom.

I am iterating through an HTML table, looking for the first occurrence of a text pattern. Once any pattern is detected, the code should jump out of the WHILE loop. For that, I used LAST. So far, so good.

On occasion, the next line in the table is not properly executed through the loop and I don't understand why. Using PRINT statements rather liberally, I thought I would get a repeating pattern of:

New Row: blah blah blah
pattern found/not found, set department to ...
<possible repeat: pattern found/not found ...>
final result: blah blah blah

Except sometimes it seems to skip the Pattern Found PRINT statement.

How did I manage to achieve this unexpected result? I thought this would be simple but I am just racking the ol' brain and coming up with zip.

My code follows:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use Data::Dumper;
use HTML::TableExtract;
use XML::FeedPP;
use UTF8;
use Encode qw(encode decode);

# initialize
my $cols;
my $url;
my $depth;
my $count;
my $data;
my $tracking_code;
my $location;
my $job_title;
my $date_posted;
my $out_fh;
my $key;
my $value;
my $department;
my $input;

# --------------------------------------------------------------------------------------
# Setup
# --------------------------------------------------------------------------------------
# get the data from the web. The URL for CommVault's job listings is:
# https://commvault.silkroad.com/epostings/index.cfm?fuseaction=app.jobsearch#
# Either pass this in as --url <page_url> when invoking or just set it.
$url = "https://commvault.silkroad.com/epostings/index.cfm?fuseaction=app.jobsearch";

# the headings on the table columns for the job listings are, in order:
$cols = 'tracking_code,job_title,location,date_posted';

# use a hash of patterns to look for within the job titles. If a pattern matches, assign the
# category to the department label and stop processing.
# the hash of jobs triggers to categories is below. Eventually this will just be a text
# file to make maintenance easier - for now, just hardcode a couple of sample patterns and
# category pairs.
my %job_categories = (
    'account manager' => "Sales",
    'systems engineer' => "PreSales"
);

# define the output areas
my $directory = "/Users/coblem/testing/";
my $outfile = "cvlt_jobs.csv";
open( $out_fh, '>', $ directory . $outfile) or die("Unable to create output file \"$out_fh\": $!\n");

print ("------ START -------------------- START -------------------- START -------------", "\n");

# --------------------------------------------------------------------------------------
# Processing
# --------------------------------------------------------------------------------------
# bring in the table from the URL, break everything up into the columns and iterate over the rows
# looking for patterns. if we find a match, set the department to the hash value.

# first, bring in the row and extract along the column fields.
my $m = WWW::Mechanize->new();
$m->get($url);
$input = $m->content;

my $te;
if ( defined ($cols)) {
    my @headers = split(/,/, $cols);
    $te = HTML::TableExtract->new( headers => [ 'Tracking Code', 'Job Title', 'Location', 'Date Posted' ] ) or die qq{$!};
}
# we shouldn't be in this section. the error handling below is not yet sophisticated enough.
# fortunately they are not using embedded tables or lots of XML to sort through so we
# shouldn't hit this.
else {
    $te = new HTML::TableExtract( depth => $depth, count=>$count);
};

# second, iterate over each row, looking for a pattern from the hash of categories.
# start with breaking up the row into the fields from the column headings
$te->parse($input);
foreach my $row ($te->rows) {
    $tracking_code = $ { $row }[0];
    $job_title = $ { $row }[1];
    $location = $ { $row }[2];
    $date_posted = $ { $row }[3];
    print ("new row\: $tracking_code $job_title $location $date_posted \n");

    # now look for a $key pattern to match inside $job_title.
    # If a key matches, set department to the category, $value
    # --------------------------------------------------------------------------------------
    # THE WHILE LOOP STARTS HERE, ALONG WITH THE PROBLEM
    # --------------------------------------------------------------------------------------
    while ( ($key, $value) = each %job_categories ) {
        print ( "looking in $job_title for\: $key \n");
        if ($job_title =~ /$key/i) {
            $department = $value;
            print ( "found a match. setting department to $value \n");
            last;
        }
        else {
            $department = 'Undefined';
            print ( "no match found. department is now $department \n");
        }
    }
    print (" final result\: $job_title \=\> $department \n \n");

# --------------------------------------------------------------------------------------
# Close out and clean-up
# close file handles, any other items
# not yet implemented
# --------------------------------------------------------------------------------------
}

Here is a sample of the output from the terminal:

------ START -------------------- START -------------------- START -------------
new row: 306145-636 Sales Account Manager - Enterprise Seattle, Washington, United States 10/31/2013
looking in Sales Account Manager - Enterprise for: systems engineer
no match found. department is now Undefined
looking in Sales Account Manager - Enterprise for: account manager
found a match. setting department to Sales
 final result: Sales Account Manager - Enterprise => Sales

new row: 306144-636 Inside Sales Administrator Madrid, Madrid, Spain 10/30/2013
 final result: Inside Sales Administrator => Sales

new row: 306143-636 Inside Sales Administrator Milano, Lombardia, Italy 10/30/2013
looking in Inside Sales Administrator for: systems engineer
no match found. department is now Undefined
looking in Inside Sales Administrator for: account manager
no match found. department is now Undefined
 final result: Inside Sales Administrator => Undefined

new row: 306134-636 Senior Technical Consultant / Enterprise Solutions Architect Reading, West Berkshire, United Kingdom 10/30/2013
looking in Senior Technical Consultant / Enterprise Solutions Architect for: systems engineer
no match found. department is now Undefined
looking in Senior Technical Consultant / Enterprise Solutions Architect for: account manager
no match found. department is now Undefined
 final result: Senior Technical Consultant / Enterprise Solutions Architect => Undefined

new row: 306142-636 Product Manager - Database Oceanport, New Jersey, United States 10/29/2013
looking in Product Manager - Database for: systems engineer
no match found. department is now Undefined
looking in Product Manager - Database for: account manager
no match found. department is now Undefined
 final result: Product Manager - Database => Undefined

new row: 306141-636 Professional Services Project Manager Pleasanton, California, United States 10/29/2013
looking in Professional Services Project Manager for: systems engineer
no match found. department is now Undefined
looking in Professional Services Project Manager for: account manager
no match found. department is now Undefined
 final result: Professional Services Project Manager => Undefined

new row: 306140-636 Customer Support Engineer - Contract-to-Hire Oceanport, New Jersey, United States 10/29/2013
looking in Customer Support Engineer - Contract-to-Hire for: systems engineer
no match found. department is now Undefined
looking in Customer Support Engineer - Contract-to-Hire for: account manager
no match found. department is now Undefined
 final result: Customer Support Engineer - Contract-to-Hire => Undefined

new row: 306139-636 Systems Engineer - Houston Houston, Texas, United States 10/29/2013
looking in Systems Engineer - Houston for: systems engineer
found a match. setting department to PreSales
 final result: Systems Engineer - Houston => PreSales

new row: 306137-636 Systems Engineer New York, New York, United States 10/29/2013
looking in Systems Engineer for: account manager
no match found. department is now Undefined
 final result: Systems Engineer => Undefined

The only thing I can see is that this seems to occur after a successful match with the SALES value/key pair. This value/key pair seems to be last in the hash. What difference does the position in the hash make? I don't get it.

Could the use of EACH be the problem? In that if I don't go all the way through the hash table, I somehow don't start back at the beginning? Because when I don't use it, I get a runaway loop (which I also don't understand).

Replies are listed 'Best First'.
Re: While loop and LAST
by BrowserUk (Patriarch) on Nov 02, 2013 at 05:51 UTC

    The problem is that each continues from where it last returned unless you reset the iterator using (say) keys %hash. Which means that you are not searching the complete hash each time around your outer loop.
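A minimal sketch of that iterator behaviour, assuming a two-entry hash like the one in the question. The helper name count_visited is hypothetical and exists only for this illustration; it counts how many keys a while/each loop sees, optionally leaving early with last.

```perl
use strict;
use warnings;

# Hypothetical two-entry hash mirroring the sample patterns in the question.
my %job_categories = (
    'account manager'  => 'Sales',
    'systems engineer' => 'PreSales',
);

# Count how many keys a while/each loop visits.
# If $stop_early is true, leave the loop after the first key via last.
sub count_visited {
    my ($stop_early) = @_;
    my $visited = 0;
    while ( my ( $key, $value ) = each %job_categories ) {
        $visited++;
        last if $stop_early;
    }
    return $visited;
}

keys %job_categories;             # start from a known iterator position
my $first  = count_visited(1);    # sees one key, then leaves with last
my $second = count_visited(0);    # resumes after that key, so sees only the other one
keys %job_categories;             # keys in void context resets the iterator
my $third  = count_visited(0);    # full pass over both keys

print "$first $second $third\n";  # prints "1 1 2"
```

The second call is the interesting one: it looks like a fresh loop, but it only visits whatever the earlier, interrupted loop left behind.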

    But, before you go fixing that; why are you searching a hash in the first place? That (as Mr.Wall would have it) is like "buying an Uzi and using it to club your enemy to death".

    You should be able to just do:

    foreach my $row ($te->rows) {
        $tracking_code = $ { $row }[0];
        $job_title = $ { $row }[1];
        $location = $ { $row }[2];
        $date_posted = $ { $row }[3];
        if( exists $job_categories{ $job_title } ) {
            $department = $job_categories{ $job_title };
        }
        else {
            print "no match";
        }
    }

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      The job titles from the parsed table are more verbose than the hash keys--which are key phrases that may be found in those titles, thus the OP's regexing in the while loop. For example, in one case the OP is looking for the phrase "account manager" (a hash key) in the job title "Territory / Field Sales Account Manager Mid-Enterprise GEO".

        Then resetting the hash iterator will probably fix his problem.


        Correct. I was (still am) thinking that this can become a more generic approach to the problem. Eventually the hash can be maintained outside the script, either as something a third party supplies or other managers just maintain.

        If this list gets to 500 key value pairs I'm not sure.

Re: While loop and LAST
by Kenosis (Priest) on Nov 02, 2013 at 06:47 UTC

    To avoid the each iterator issue that BrowserUk mentioned, use keys in a for loop instead (replacing the first six lines of your while loop with these):

    for $key ( keys %job_categories ) {
        print ( "looking in $job_title for\: $key \n");
        if ($job_title =~ /$key/i) {
            $department = $job_categories{$key};
            print ( "found a match. setting department to $job_categories{$key} \n");
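For readers who want something they can run on its own, here is a rough, self-contained sketch of the same keys-based search. The classify sub and the sample titles are hypothetical and not part of the OP's script. The sort is an addition of this sketch so the search order is the same on every run.

```perl
use strict;
use warnings;

# Sample pattern => category pairs taken from the question.
my %job_categories = (
    'account manager'  => 'Sales',
    'systems engineer' => 'PreSales',
);

# classify() is a hypothetical helper: it returns the category of the
# first key found in the title, or 'Undefined' when nothing matches.
sub classify {
    my ($job_title) = @_;
    # keys builds a fresh list on every call, so no iterator state leaks
    # between calls; sort is only here to make the order repeatable.
    for my $key ( sort keys %job_categories ) {
        return $job_categories{$key} if $job_title =~ /$key/i;
    }
    return 'Undefined';
}

print classify($_), "\n" for
    'Sales Account Manager - Enterprise',
    'Inside Sales Administrator',
    'Systems Engineer';
```

Returning from the sub takes the place of last, and the default value is supplied once at the end instead of being reassigned on every failed comparison.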

    Hope this helps!

      Hm. If the hash is of any size, that is a much slower and more memory hungry approach to solving the problem than simply using keys %job_categories; within the outer loop to reset the iterator:

      $h{ $_ } = $_ for 0 .. 1e6;;
      cmpthese 1,{
          a=>q[ while( ( $k, $v ) = each %h ){ my $x = "$k:$v"; } ],
          b=>q[ for my $k ( keys %h ){ my $x = "$k:$h{$k}"; } ],
      };;
        s/iter    b    a
      b   8.67   -- -83%
      a   1.52 472%   --
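As a rough illustration of the reset approach, assuming the same two sample patterns as the question: the sketch below keeps the while/each loop and calls keys in void context before each search. The department_for sub and the @titles list are hypothetical names used only here.

```perl
use strict;
use warnings;

# Sample pattern => category pairs taken from the question.
my %job_categories = (
    'account manager'  => 'Sales',
    'systems engineer' => 'PreSales',
);

# department_for() is a hypothetical helper wrapping the OP's inner loop.
sub department_for {
    my ($job_title) = @_;
    my $department = 'Undefined';    # default, set once before searching
    keys %job_categories;            # reset the each iterator for this search
    while ( my ( $key, $value ) = each %job_categories ) {
        if ( $job_title =~ /$key/i ) {
            $department = $value;
            last;                    # safe now: the next search resets first
        }
    }
    return $department;
}

# Titles borrowed from the sample output in the question.
my @titles = (
    'Sales Account Manager - Enterprise',
    'Inside Sales Administrator',
    'Systems Engineer - Houston',
    'Systems Engineer',
);
my @departments = map { department_for($_) } @titles;
print "$titles[$_] => $departments[$_]\n" for 0 .. $#titles;
```

With the reset in place, every title is compared against the whole hash, so the last title is no longer reported as Undefined.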

        The OP's hash contained only two sample keys. Given the task, I'd be surprised if keys > 500, but my guess isn't too meaningful.

        Thank you both for the help. I am thinking this is unlikely to become a large hash (looks like BrowserUk simulated a million key pairs). I don't quite understand why there is a big difference in memory and performance. I'm always looking for elegance and performance so I plan to play with BrowserUk's idea; of course code that I understand is pretty important also.