use strict;
use warnings;
use ID::Utilities;

use constant ARCHIVE => "http://www.us-cert.gov/control_systems/ics-cert/archive.html";

my $utils = new ID::Utilities;
my $html  = $utils->getHTML( ARCHIVE );

my $vendor       = "Rockwell Automation";
my $search_start = "<b>$vendor</b>";
my $search_end   = "<p>&nbsp;</p>";

$html =~ m/($search_start.+$search_end)/i;
print $1;

I looked at a bunch of modules, including the one you mentioned, but I do not understand how to use them. The documentation for the TreeBuilder modules (and other HTML parser modules) is above my current Perl knowledge.

The data looks like this:

<p><b>RealFlex Technologies</b><br />Multiple Vulnerabilities in RealFlex RealWin, <a href="/control_systems/pdf/ICS-ALERT-11-080-04.pdf">ICS-ALERT-11-080-04</a> (March 21, 2011)</p>
<p>RealFlex RealWin Multiple Vulnerabilities, <a href="/control_systems/pdf/ICSA-11-110-01.pdf">ICSA-11-110-01</a> (April 20, 2011)</p>
<p>RealWin Buffer Overflow, <a href="/control_systems/pdf/ICSA-10-313-01.pdf">ICSA-10-313-01</a> (November 09, 2010)</p>
<p>RealWin Buffer Overflows, <a href="/control_systems/pdf/ICS-ALERT-10-305-01.pdf">ICS-ALERT-10-305-01</a> (November 01, 2010)</p>
<p>&nbsp;</p>
<p><b>Rockwell Automation</b><br /> Rockwell Automation ControlLogix Multiple PLC Vulnerabilities (UPDATE), <a href="/control_systems/pdf/ICS-Alert-12-020-02A.pdf">ICS-ALERT-12-020-02A</a> (February 14, 2012)</p>
<p>Rockwell Automation ControlLogix PLC Multiple Vulnerabilities, <a href="/control_systems/pdf/ICS-Alert-12-020-02.pdf">ICS-ALERT-12-020-02</a> (January 20, 2012)</p>
<p>Rockwell Automation FactoryTalk RNADiagReceiver, <a href="/control_systems/pdf/ICSA-12-088-01.pdf">ICSA-12-088-01</a> (March 28, 2012)</p>
<p>Rockwell Automation FactoryTalk RNADiagReceiver (UPDATE), <a href="/control_systems/pdf/ICSA-12-088-01A.pdf">ICSA-12-088-01A</a> (April 06, 2012)</p>
<p>Rockwell Automation FactoryTalk RNADiagReceiver, <a href="/control_systems/pdf/ICS-ALERT-12-017-01.pdf">ICS-ALERT-12-017-01</a> (January 17, 2012)</p>
<p>Rockwell FactoryTalk Diag Viewer Memory Corruption, <a href="/control_systems/pdf/ICSA-11-175-01.pdf">ICSA-11-175-01</a> (June 24, 2011)</p>
<p>Rockwell-PLC5, <a href="/control_systems/pdf/ICSA-10-070-02.pdf">ICSA-10-070-02</a> (March 11, 2010)</p>
<p>Rockwell RSLinx EDS, <a href="/control_systems/pdf/ICSA-11-161-01.pdf">ICSA-11-161-01</a> (June 10, 2011)</p>
<p>Rockwell RSLogix, <a href="/control_systems/pdf/ICS-ALERT-11-256-05.pdf">ICS-ALERT-11-256-05</a> (September 13, 2011)</p>
<p>Rockwell RSLogix (UPDATE), <a href="/control_systems/pdf/ICS-ALERT-11-256-05A.pdf">ICS-ALERT-11-256-05A</a>&nbsp; (September 19, 2011)</p>
<p>Rockwell RSLogix Denial-of-Service Vulnerability, <a href="/control_systems/pdf/ICSA-11-273-03.pdf">ICSA-11-273-03</a> (September 30, 2011)</p>
<p>Rockwell RSLogix Denial-of-Service Vulnerability (UPDATE), <a href="/control_systems/pdf/ICSA-11-273-03A.pdf">ICSA-11-273-03A</a> (October 06, 2011)</p>
<p>RSLinx, <a href="/control_systems/pdf/ICSA-10-070-01.pdf">ICSA-10-070-01</a> (March 11, 2010)</p>
<p>RSLinx (UPDATE), <a href="/control_systems/pdf/ICSA-10-070-01A.pdf">ICSA-10-070-01A</a> (May 03, 2010)</p>
<p>&nbsp;</p>
<p><b>RuggedCom</b><br /> RuggedCom Weak Cryptography for Password Vulnerability, <a href="/control_systems/pdf/ICSA-12-146-01A.pdf">ICSA-12-146-01A</a> (June 18, 2012)
<p>RuggedCom Weak Cryptography for Password Vulnerability, <a href="/control_systems/pdf/ICSA-12-146-01.pdf">ICSA-12-146-01</a> (May 25, 2012) </p>
<p>RuggedCom Weak Cryptography for Password Vulnerability, <a href="/control_systems/pdf/ICS-ALERT-12-116-01.pdf">ICS-ALERT-12-116-01</a> (April 25, 2012)</p>
<p>RuggedCom Weak Cryptography for Password Vulnerability (UPDATE), <a href="/control_systems/pdf/ICS-ALERT-12-116-01A.pdf">ICS-ALERT-12-116-01A</a> (April 27, 2012) </p>
<p>&nbsp;</p>

My goal is to get all the advisories listed under Rockwell Automation, but my regular expression does not extract them. Any ideas on how to fix it?
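For reference, here is a minimal sketch of one way the regular expression might be adjusted; it is not a definitive fix. It assumes the page really is laid out like the sample above (a bold vendor name, then advisory paragraphs, then a `<p>&nbsp;</p>` spacer). The likely problems with the original pattern are that `.` does not match newlines without the `/s` modifier, and that a greedy `.+` runs to the last spacer on the page instead of the first. The sub name `advisories_for` is hypothetical, and the inline sample stands in for the fetched page. A real HTML parser would be more robust if the markup changes.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns a list of
# [title, advisory ID, date] array refs for one vendor's section.
sub advisories_for {
    my ($html, $vendor) = @_;

    # \Q...\E quotes any regex metacharacters in the vendor name.
    # /s lets "." match newlines; ".*?" stops at the FIRST spacer paragraph.
    my ($section) = $html =~ m{<b>\Q$vendor\E</b>(.*?)<p>&nbsp;</p>}is
        or return;

    my @found;
    # Each advisory looks like:  Title, <a href="...">ID</a> (Date)
    while ($section =~ m{([^<>]+?),\s*<a\s[^>]*>([^<]+)</a>(?:\s|&nbsp;)*\(([^)]+)\)}g) {
        my ($title, $id, $date) = ($1, $2, $3);
        $title =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @found, [ $title, $id, $date ];
    }
    return @found;
}

# Small sample taken from the data above, standing in for the real page.
my $sample = <<'HTML';
<p><b>RealFlex Technologies</b><br />RealWin Buffer Overflow, <a
href="/control_systems/pdf/ICSA-10-313-01.pdf">ICSA-10-313-01</a> (November 09, 2010)</p>
<p>&nbsp;</p>
<p><b>Rockwell Automation</b><br /> Rockwell-PLC5, <a
href="/control_systems/pdf/ICSA-10-070-02.pdf">ICSA-10-070-02</a> (March 11, 2010)</p>
<p>RSLinx (UPDATE), <a
href="/control_systems/pdf/ICSA-10-070-01A.pdf">ICSA-10-070-01A</a> (May 03, 2010)</p>
<p>&nbsp;</p>
HTML

# Print one advisory per line for the chosen vendor.
for my $adv ( advisories_for($sample, "Rockwell Automation") ) {
    print join(" | ", @$adv), "\n";
}
```

To try this against the live page, `$sample` would be replaced by the `$html` string fetched in the script above. It is also worth checking that the match succeeded before using the captures, since `$1` is not reset by a failed match.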


In reply to Re^2: Parsing and searching HTML code by jayto
in thread Parsing and searching HTML code by jayto
