in reply to Parsing and searching HTML code

You show us some code, but you don't show us the (relevant) part of the data, nor do you show us what $html contains. As the crystal ball is still getting sanded down, I guess you have to supply the data yourself.

My recommendation for matching is to use the idiom of:

$html =~ /(foo)/
    or die "Invalid HTML, couldn't find 'foo' in '$html'";
print $1;

But even more, for scraping websites, I recommend one of the modules that use HTML::TreeBuilder::XPath, or something comparable, like Mojolicious, Web::Magic, Web::Scraper or App::scrape.

Re^2: Parsing and searching HTML code
by jayto (Acolyte) on Jul 26, 2012 at 16:47 UTC
    use strict;
    use warnings;
    use ID::Utilities;
    use constant ARCHIVE => "http://www.us-cert.gov/control_systems/ics-cert/archive.html";

    my $utils = new ID::Utilities;
    my $html = $utils->getHTML( ARCHIVE );
    my $vendor = "Rockwell Automation";
    my $search_start = "<b>$vendor</b>";
    my $search_end = "<p>&nbsp;</p>";
    $html =~ m/($search_start.+$search_end)/i;
    print $1;

    I looked at a bunch of modules, including the ones you mentioned, but I do not understand how to use them. The documentation for the TreeBuilder modules (and other HTML parser modules) is above my current Perl knowledge.

    Data looks like this:

    <p><b>RealFlex Technologies</b><br />Multiple Vulnerabilities in RealFlex RealWin, <a href="/control_systems/pdf/ICS-ALERT-11-080-04.pdf">ICS-ALERT-11-080-04</a> (March 21, 2011)</p>
    <p>RealFlex RealWin Multiple Vulnerabilities, <a href="/control_systems/pdf/ICSA-11-110-01.pdf">ICSA-11-110-01</a> (April 20, 2011)</p>
    <p>RealWin Buffer Overflow, <a href="/control_systems/pdf/ICSA-10-313-01.pdf">ICSA-10-313-01</a> (November 09, 2010)</p>
    <p>RealWin Buffer Overflows, <a href="/control_systems/pdf/ICS-ALERT-10-305-01.pdf">ICS-ALERT-10-305-01</a> (November 01, 2010)</p>
    <p>&nbsp;</p>
    <p><b>Rockwell Automation</b><br /> Rockwell Automation ControlLogix Multiple PLC Vulnerabilities (UPDATE), <a href="/control_systems/pdf/ICS-Alert-12-020-02A.pdf">ICS-ALERT-12-020-02A</a> (February 14, 2012)</p>
    <p>Rockwell Automation ControlLogix PLC Multiple Vulnerabilities, <a href="/control_systems/pdf/ICS-Alert-12-020-02.pdf">ICS-ALERT-12-020-02</a> (January 20, 2012)</p>
    <p>Rockwell Automation FactoryTalk RNADiagReceiver, <a href="/control_systems/pdf/ICSA-12-088-01.pdf">ICSA-12-088-01</a> (March 28, 2012)</p>
    <p>Rockwell Automation FactoryTalk RNADiagReceiver (UPDATE), <a href="/control_systems/pdf/ICSA-12-088-01A.pdf">ICSA-12-088-01A</a> (April 06, 2012)</p>
    <p>Rockwell Automation FactoryTalk RNADiagReceiver, <a href="/control_systems/pdf/ICS-ALERT-12-017-01.pdf">ICS-ALERT-12-017-01</a> (January 17, 2012)</p>
    <p>Rockwell FactoryTalk Diag Viewer Memory Corruption, <a href="/control_systems/pdf/ICSA-11-175-01.pdf">ICSA-11-175-01</a> (June 24, 2011)</p>
    <p>Rockwell-PLC5, <a href="/control_systems/pdf/ICSA-10-070-02.pdf">ICSA-10-070-02</a> (March 11, 2010)</p>
    <p>Rockwell RSLinx EDS, <a href="/control_systems/pdf/ICSA-11-161-01.pdf">ICSA-11-161-01</a> (June 10, 2011)</p>
    <p>Rockwell RSLogix, <a href="/control_systems/pdf/ICS-ALERT-11-256-05.pdf">ICS-ALERT-11-256-05</a> (September 13, 2011)</p>
    <p>Rockwell RSLogix (UPDATE), <a href="/control_systems/pdf/ICS-ALERT-11-256-05A.pdf">ICS-ALERT-11-256-05A</a>&nbsp; (September 19, 2011)</p>
    <p>Rockwell RSLogix Denial-of-Service Vulnerability, <a href="/control_systems/pdf/ICSA-11-273-03.pdf">ICSA-11-273-03</a> (September 30, 2011)</p>
    <p>Rockwell RSLogix Denial-of-Service Vulnerability (UPDATE), <a href="/control_systems/pdf/ICSA-11-273-03A.pdf">ICSA-11-273-03A</a> (October 06, 2011)</p>
    <p>RSLinx, <a href="/control_systems/pdf/ICSA-10-070-01.pdf">ICSA-10-070-01</a> (March 11, 2010)</p>
    <p>RSLinx (UPDATE), <a href="/control_systems/pdf/ICSA-10-070-01A.pdf">ICSA-10-070-01A</a> (May 03, 2010)</p>
    <p>&nbsp;</p>
    <p><b>RuggedCom</b><br /> RuggedCom Weak Cryptography for Password Vulnerability, <a href="/control_systems/pdf/ICSA-12-146-01A.pdf">ICSA-12-146-01A</a> (June 18, 2012)
    <p>RuggedCom Weak Cryptography for Password Vulnerability, <a href="/control_systems/pdf/ICSA-12-146-01.pdf">ICSA-12-146-01</a> (May 25, 2012) </p>
    <p>RuggedCom Weak Cryptography for Password Vulnerability, <a href="/control_systems/pdf/ICS-ALERT-12-116-01.pdf">ICS-ALERT-12-116-01</a> (April 25, 2012)</p>
    <p>RuggedCom Weak Cryptography for Password Vulnerability (UPDATE), <a href="/control_systems/pdf/ICS-ALERT-12-116-01A.pdf">ICS-ALERT-12-116-01A</a> (April 27, 2012) </p>
    <p>&nbsp;</p>

    I need to get all the advisories listed under Rockwell Automation. That is my goal, but I am having trouble accomplishing it. Any ideas on how to fix my regular expression?

      The dot metacharacter (".") does not match newlines. See perlre for the /s modifier.
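
To make the difference concrete, here is a minimal sketch using a made-up three-line string rather than the real page. The variable names ($start, $end, $section) are only for illustration. It also swaps in a non-greedy ".+?" and quotemeta, which are not in the original code: the non-greedy form stops at the first end marker instead of the last one, and quotemeta keeps characters in the search strings from being read as regex syntax.

```perl
use strict;
use warnings;

# Made-up sample with embedded newlines; the real page is much longer.
my $html = "<b>Rockwell Automation</b><br />\nRockwell RSLogix\n<p>&nbsp;</p>";

# quotemeta escapes anything that could be taken as a regex metacharacter.
my $start = quotemeta "<b>Rockwell Automation</b>";
my $end   = quotemeta "<p>&nbsp;</p>";

# Without /s, "." cannot cross the first newline, so this does not match.
my $without_s = $html =~ /$start.+?$end/i ? 1 : 0;

# With /s, "." matches newlines too; ".+?" stops at the first end marker.
my ($section) = $html =~ /($start.+?$end)/is;

print "without /s: ", ( $without_s       ? "matched" : "no match" ), "\n";
print "with /s:    ", ( defined $section ? "matched" : "no match" ), "\n";
```

On this sample the first line reports "no match" and the second reports "matched".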

      As an aside, your code is not really helpful as posted: it uses a module that I don't have, ID::Utilities, and calls a method in it. If you're certain that ->getHTML returns the correct HTML, why call it at all? Assign the HTML directly to $html in a test program, if only to rule out misbehaviour of ->getHTML or of the remote end.
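
Such a test program could look roughly like the sketch below. It uses only core Perl. The heredoc is a trimmed copy of the data posted above, and the line breaks inside it are my guess at how the real page is laid out. The second regex, which pulls out the link text of each advisory, is also an assumption about what "get all the advisories" should produce.

```perl
use strict;
use warnings;

# Trimmed copy of the posted data, assigned directly instead of being
# fetched through ID::Utilities. Line breaks are an assumption.
my $html = <<'HTML';
<p><b>RealFlex Technologies</b><br />Multiple Vulnerabilities in RealFlex RealWin, <a
href="/control_systems/pdf/ICS-ALERT-11-080-04.pdf">ICS-ALERT-11-080-04</a> (March 21, 2011)</p>
<p>&nbsp;</p>
<p><b>Rockwell Automation</b><br />
Rockwell Automation ControlLogix Multiple PLC Vulnerabilities (UPDATE), <a
href="/control_systems/pdf/ICS-Alert-12-020-02A.pdf">ICS-ALERT-12-020-02A</a> (February 14, 2012)</p>
<p>Rockwell RSLinx EDS, <a
href="/control_systems/pdf/ICSA-11-161-01.pdf">ICSA-11-161-01</a> (June 10, 2011)</p>
<p>&nbsp;</p>
<p><b>RuggedCom</b><br />
RuggedCom Weak Cryptography for Password Vulnerability, <a
href="/control_systems/pdf/ICSA-12-146-01A.pdf">ICSA-12-146-01A</a> (June 18, 2012)
<p>&nbsp;</p>
HTML

my $vendor = "Rockwell Automation";

# \Q...\E quotes regex metacharacters in the literal and interpolated text.
# /s lets "." cross newlines; ".+?" stops at the first end marker.
$html =~ m{\Q<b>$vendor</b>\E(.+?)\Q<p>&nbsp;</p>\E}is
    or die "Invalid HTML, couldn't find a section for '$vendor'";
my $section = $1;

# Collect the link text of every advisory inside that section.
my @advisories = $section =~ m{<a\s[^>]*>([^<]+)</a>}g;
print "$_\n" for @advisories;
```

With this trimmed data it prints ICS-ALERT-12-020-02A and ICSA-11-161-01, one per line. Once that works, swapping the heredoc for the fetched page tells you whether any remaining problem is in the regex or in the download.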