Corion's suggestions are most helpful, especially the advice to use modules to grab the section you need.
You're almost there in coding it yourself, but be aware (I suspect you are, but it's worth mentioning) that the pattern match may fail to capture anything if the site's HTML structure changes.
Given that the matching may eventually break (although the page's structure is consistent), consider the following:
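    use Modern::Perl;
    use LWP::Simple 'get';

    my $vendor       = 'Rockwell Automation';
    my $search_start = "<p><b>$vendor</b><br />";
    my $search_end   = '<p> </p>';
    my $url = 'http://www.us-cert.gov/control_systems/ics-cert/archive.html';

    # get returns undef on failure, so stop here rather than match against nothing
    my $html = get($url) // die "Could not fetch $url\n";

    # \Q...\E quotes the markers so they match literally; /s lets . span newlines
    my ($section) = $html =~ /(\Q$search_start\E.*?\Q$search_end\E)/s
        or die "No section found for '$vendor' (has the page structure changed?)\n";

    print $section;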
Output:
    <p><b>Rockwell Automation</b><br />
    Rockwell Automation ControlLogix Multiple PLC Vulnerabilities (UPDATE), <a href="/control_systems/pdf/ICS-Alert-12-020-02A.pdf">ICS-ALERT-12-020-02A</a> (February 14, 2012)</p>
    <p>Rockwell Automation ControlLogix PLC Multiple Vulnerabilities, <a href="/control_systems/pdf/ICS-Alert-12-020-02.pdf">ICS-ALERT-12-020-02</a> (January 20, 2012)</p>
    <p>Rockwell Automation FactoryTalk RNADiagReceiver, <a href="/control_systems/pdf/ICSA-12-088-01.pdf">ICSA-12-088-01</a> (March 28, 2012)</p>
    <p>Rockwell Automation FactoryTalk RNADiagReceiver (UPDATE), <a href="/control_systems/pdf/ICSA-12-088-01A.pdf">ICSA-12-088-01A</a> (April 06, 2012)</p>
    <p>Rockwell Automation FactoryTalk RNADiagReceiver, <a href="/control_systems/pdf/ICS-ALERT-12-017-01.pdf">ICS-ALERT-12-017-01</a> (January 17, 2012)</p>
    <p>Rockwell FactoryTalk Diag Viewer Memory Corruption, <a href="/control_systems/pdf/ICSA-11-175-01.pdf">ICSA-11-175-01</a> (June 24, 2011)</p>
    <p>Rockwell-PLC5, <a href="/control_systems/pdf/ICSA-10-070-02.pdf">ICSA-10-070-02</a> (March 11, 2010)</p>
    <p>Rockwell RSLinx EDS, <a href="/control_systems/pdf/ICSA-11-161-01.pdf">ICSA-11-161-01</a> (June 10, 2011)</p>
    <p>Rockwell RSLogix, <a href="/control_systems/pdf/ICS-ALERT-11-256-05.pdf">ICS-ALERT-11-256-05</a> (September 13, 2011)</p>
    <p>Rockwell RSLogix (UPDATE), <a href="/control_systems/pdf/ICS-ALERT-11-256-05A.pdf">ICS-ALERT-11-256-05A</a> (September 19, 2011)</p>
    <p>Rockwell RSLogix Denial-of-Service Vulnerability, <a href="/control_systems/pdf/ICSA-11-273-03.pdf">ICSA-11-273-03</a> (September 30, 2011)</p>
    <p>Rockwell RSLogix Denial-of-Service Vulnerability (UPDATE), <a href="/control_systems/pdf/ICSA-11-273-03A.pdf">ICSA-11-273-03A</a> (October 06, 2011)</p>
    <p>RSLinx, <a href="/control_systems/pdf/ICSA-10-070-01.pdf">ICSA-10-070-01</a> (March 11, 2010)</p>
    <p>RSLinx (UPDATE), <a href="/control_systems/pdf/ICSA-10-070-01A.pdf">ICSA-10-070-01A</a> (May 03, 2010)</p>
    <p> </p>
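If you want the individual advisories rather than the raw HTML, the captured section can be broken up further. What follows is only a sketch, assuming every entry keeps the "title, <a href="...">ID</a> (date)" shape seen in the output above. The sub name parse_advisories and the hash keys are made up for illustration, and $section here is a shortened copy of the output so the snippet runs without a network connection. It is just as fragile as any other regex against HTML, so a parsing module remains the sturdier choice.

```perl
use strict;
use warnings;

# Shortened copy of the captured section (assumption: same shape as the real output).
my $section = <<'END_HTML';
<p><b>Rockwell Automation</b><br />
Rockwell Automation ControlLogix Multiple PLC Vulnerabilities (UPDATE), <a href="/control_systems/pdf/ICS-Alert-12-020-02A.pdf">ICS-ALERT-12-020-02A</a> (February 14, 2012)</p>
<p>Rockwell RSLinx EDS, <a href="/control_systems/pdf/ICSA-11-161-01.pdf">ICSA-11-161-01</a> (June 10, 2011)</p>
<p> </p>
END_HTML

# Hypothetical helper: returns one hash ref per advisory found in the section.
sub parse_advisories {
    my ($html) = @_;
    my @advisories;
    while (
        $html =~ m{
            (?: <p> | <br \s* /> ) \s*                # an entry follows a paragraph or line-break tag
            ([^<]+?) , \s*                            # title, up to the comma before the link
            <a \s+ href="([^"]+)"> ([^<]+) </a> \s*   # link target and advisory ID
            \( ([^)]+) \)                             # date in parentheses
        }gx
      )
    {
        push @advisories, { title => $1, href => $2, id => $3, date => $4 };
    }
    return @advisories;
}

for my $adv ( parse_advisories($section) ) {
    print "$adv->{id} | $adv->{date} | $adv->{title}\n";
}
```

The vendor heading and the closing <p> </p> are skipped because neither contains a link, so only real entries end up in the list.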
Hope this helps!
Update: The strings interpolated into the regex are now quoted with \Q...\E. Thanks, aitap, for the suggestion.
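To show why the quoting matters, here is a small sketch with made-up strings (the marker and HTML below are hypothetical, not taken from the page). Without \Q...\E, characters such as ( and ) in an interpolated string are treated as regex syntax rather than literal text.

```perl
use strict;
use warnings;

# Hypothetical marker containing regex metacharacters.
my $marker = '<b>Acme (UPDATE)</b>';
my $html   = '<p><b>Acme (UPDATE)</b><br />Some advisory</p>';

# Unquoted: the parentheses become a capture group, so the pattern looks for
# "Acme UPDATE" with no literal parentheses and fails to match.
my $unquoted = $html =~ /$marker/ ? 'match' : 'no match';

# Quoted: \Q...\E escapes the metacharacters, so the marker matches literally.
my $quoted = $html =~ /\Q$marker\E/ ? 'match' : 'no match';

print "unquoted: $unquoted\n";
print "quoted:   $quoted\n";
```

This prints "no match" for the unquoted pattern and "match" for the quoted one. The markers in the code above happen to work either way today, but quoting keeps them working if a vendor name ever contains such characters.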
In reply to "Re: Parsing and searching HTML code" by Kenosis, in thread "Parsing and searching HTML code" by jayto