Parsing and searching HTML code

jayto has asked for the wisdom of the Perl Monks concerning the following question:

Re: Parsing and searching HTML code by Corion (Patriarch) on Jul 26, 2012 at 16:39 UTC
You show us some code, but you don't show us the (relevant) part of the data, nor do you show us what `$html` contains. As the crystal ball is still getting sanded down, I guess you have to supply the data yourself. My recommendation for matching is to use the idiom of: `$html =~ /(foo)/ or die "Invalid HTML, couldn't find 'foo' in '$html'"; print $1;` But even more, for scraping websites, I recommend one of the modules that use HTML::TreeBuilder::XPath, or something comparable, like Mojolicious, Web::Magic, Web::Scraper or App::scrape.
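A minimal sketch of that match-or-die idiom, using only core Perl. The sample string and the `<b>` pattern are made up for illustration and are not taken from the real page:

```perl
use strict;
use warnings;

# Hypothetical sample input; in the thread this would be the fetched page.
my $html = "<p><b>Rockwell Automation</b><br />Some advisory</p>";

# Capture the text of the first <b> element, or stop with a message
# that shows exactly what was searched.
$html =~ m{<b>([^<]+)</b>}
    or die "Invalid HTML, couldn't find a <b> element in '$html'";

# $1 is only trusted here because the match above is known to have succeeded.
my $vendor = $1;
print "$vendor\n";
```

Checking the match before touching `$1` matters because a failed match leaves `$1` holding whatever an earlier successful match captured, or nothing at all.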
Re^2: Parsing and searching HTML code by jayto (Acolyte) on Jul 26, 2012 at 16:47 UTC
`use strict; use warnings; use ID::Utilities; use constant ARCHIVE => "http://www.us-cert.gov/control_systems/ics-cert/archive.html"; my $utils = new ID::Utilities; my $html = $utils->getHTML( ARCHIVE ); my $vendor = "Rockwell Automation"; my $search_start = "<b>$vendor</b>"; my $search_end = "<p> </p>"; $html =~ m/($search_start.+$search_end)/i; print $1;`
I looked at a bunch of modules, including the ones you mentioned, but I do not understand how to use them. The documentation for the TreeBuilder modules (and other HTML parser modules) is above my current Perl knowledge. Data looks like this:
`<p><b>RealFlex Technologies</b><br />Multiple Vulnerabilities in RealFlex RealWin, <a href="/control_systems/pdf/ICS-ALERT-11-080-04.pdf">ICS-ALERT-11-080-04</a> (March 21, 2011)</p> <p>RealFlex RealWin Multiple Vulnerabilities, <a href="/control_systems/pdf/ICSA-11-110-01.pdf">ICSA-11-110-01</a> (April 20, 2011)</p> <p>RealWin Buffer Overflow, <a href="/control_systems/pdf/ICSA-10-313-01.pdf">ICSA-10-313-01</a> (November 09, 2010)</p> <p>RealWin Buffer Overflows, <a href="/control_systems/pdf/ICS-ALERT-10-305-01.pdf">ICS-ALERT-10-305-01</a> (November 01, 2010)</p> <p> </p> <p><b>Rockwell Automation</b><br /> Rockwell Automation ControlLogix Multiple PLC Vulnerabilities (UPDATE), <a href="/control_systems/pdf/ICS-Alert-12-020-02A.pdf">ICS-ALERT-12-020-02A</a> (February 14, 2012)</p> <p>Rockwell Automation ControlLogix PLC Multiple Vulnerabilities, <a href="/control_systems/pdf/ICS-Alert-12-020-02.pdf">ICS-ALERT-12-020-02</a> (January 20, 2012)</p> <p>Rockwell Automation FactoryTalk RNADiagReceiver, <a href="/control_systems/pdf/ICSA-12-088-01.pdf">ICSA-12-088-01</a> (March 28, 2012)</p> <p>Rockwell Automation FactoryTalk RNADiagReceiver (UPDATE), <a href="/control_systems/pdf/ICSA-12-088-01A.pdf">ICSA-12-088-01A</a> (April 06, 2012)</p> <p>Rockwell Automation FactoryTalk RNADiagReceiver, <a href="/control_systems/pdf/ICS-ALERT-12-017-01.pdf">ICS-ALERT-12-017-01</a> (January 17, 2012)</p> <p>Rockwell FactoryTalk Diag Viewer Memory Corruption, <a href="/control_systems/pdf/ICSA-11-175-01.pdf">ICSA-11-175-01</a> (June 24, 2011)</p> <p>Rockwell-PLC5, <a href="/control_systems/pdf/ICSA-10-070-02.pdf">ICSA-10-070-02</a> (March 11, 2010)</p> <p>Rockwell RSLinx EDS, <a href="/control_systems/pdf/ICSA-11-161-01.pdf">ICSA-11-161-01</a> (June 10, 2011)</p> <p>Rockwell RSLogix, <a href="/control_systems/pdf/ICS-ALERT-11-256-05.pdf">ICS-ALERT-11-256-05</a> (September 13, 2011)</p> <p>Rockwell RSLogix (UPDATE), <a href="/control_systems/pdf/ICS-ALERT-11-256-05A.pdf">ICS-ALERT-11-256-05A</a> (September 19, 2011)</p> <p>Rockwell RSLogix Denial-of-Service Vulnerability, <a href="/control_systems/pdf/ICSA-11-273-03.pdf">ICSA-11-273-03</a> (September 30, 2011)</p> <p>Rockwell RSLogix Denial-of-Service Vulnerability (UPDATE), <a href="/control_systems/pdf/ICSA-11-273-03A.pdf">ICSA-11-273-03A</a> (October 06, 2011)</p> <p>RSLinx, <a href="/control_systems/pdf/ICSA-10-070-01.pdf">ICSA-10-070-01</a> (March 11, 2010)</p> <p>RSLinx (UPDATE), <a href="/control_systems/pdf/ICSA-10-070-01A.pdf">ICSA-10-070-01A</a> (May 03, 2010)</p> <p> </p> <p><b>RuggedCom</b><br /> RuggedCom Weak Cryptography for Password Vulnerability, <a href="/control_systems/pdf/ICSA-12-146-01A.pdf">ICSA-12-146-01A</a> (June 18, 2012) <p>RuggedCom Weak Cryptography for Password Vulnerability, <a href="/control_systems/pdf/ICSA-12-146-01.pdf">ICSA-12-146-01</a> (May 25, 2012) </p> <p>RuggedCom Weak Cryptography for Password Vulnerability, <a href="/control_systems/pdf/ICS-ALERT-12-116-01.pdf">ICS-ALERT-12-116-01</a> (April 25, 2012)</p> <p>RuggedCom Weak Cryptography for Password Vulnerability (UPDATE), <a href="/control_systems/pdf/ICS-ALERT-12-116-01A.pdf">ICS-ALERT-12-116-01A</a> (April 27, 2012) </p> <p> </p>`
I need to get all the advisories under Rockwell Automation. That is my goal, but I am having trouble accomplishing it. Any ideas on how to fix my regular expression?
Re^3: Parsing and searching HTML code by Corion (Patriarch) on Jul 26, 2012 at 17:01 UTC
The dot metacharacter (".") does not match newlines by default. See perlre for the `/s` modifier. As an aside, your code is not really helpful as it includes a reference to a module that I don't have, `ID::Utilities`, and calls some method in there. If you're certain that the `->getHTML` method returns the correct HTML, why reference that method at all? Simply assign the HTML value directly in a test program, if only to eliminate misbehaviour of the `->getHTML` method, or of the remote end.
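To see the effect of `/s` in isolation, here is a small sketch with the HTML assigned directly in the test program, as suggested above. The three-line fragment is a made-up stand-in for the real page:

```perl
use strict;
use warnings;

# Hypothetical multi-line fragment standing in for the fetched page.
my $html = "<b>Vendor</b>\n<p>advisory</p>\n<p> </p>";

# Without /s the dot stops at the first newline, so the end marker is never reached.
my $without_s = $html =~ m{<b>Vendor</b>.+<p> </p>}  ? 1 : 0;

# With /s the dot also matches newlines, so the pattern can span lines.
my $with_s    = $html =~ m{<b>Vendor</b>.+<p> </p>}s ? 1 : 0;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```

Only the second pattern matches, which is the behaviour the original poster was running into.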
Re: Parsing and searching HTML code by Your Mother (Archbishop) on Jul 26, 2012 at 18:57 UTC
Regular expressions are very fragile for HTML parsing. Here's a parser-based (XML::LibXML) example: `use strictures; use XML::LibXML; use open qw(:std :utf8); use YAML; my $dom = XML::LibXML->load_html( string => do { local $/; <DATA> }, keep_blanks => 0 ); my @advisories; # Only select <p>s that have a PDF link inside. for my $p ( map { $_->parentNode } $dom->findnodes(q{//p//a[contains(@href,'.pdf')]}) ) { my %tmp; for my $kid ( $p->childNodes ) { if ( $kid->nodeName eq "a" ) { $tmp{pdf} = { title => $kid->textContent, href => $kid->getAttribute("href") }; } elsif ( not $tmp{pdf} ) { # You'd have to do some shuffling to handle <br/>->\n here. $tmp{heading} .= $kid->textContent; } else { ( $tmp{date} = $kid->textContent ) =~ s/[)(\n\r]//g; } } s/[\s,]+\Z// for $tmp{heading}, $tmp{date}; push @advisories, \%tmp; } print YAML::Dump(\@advisories); exit 0; __DATA__ # YOUR HTML FRAGMENT HERE`
Snippet of output: `--- - date: 'March 21, 2011' heading: |- RealFlex TechnologiesMultiple Vulnerabilities in RealFlex RealWin pdf: href: /control_systems/pdf/ICS-ALERT-11-080-04.pdf title: ICS-ALERT-11-080-04 - date: 'April 20, 2011' heading: RealFlex RealWin Multiple Vulnerabilities pdf: href: /control_systems/pdf/ICSA-11-110-01.pdf title: ICSA-11-110-01 ...`
I do understand that regular expressions seem more accessible at first and can solve many specific, one-off problems, but putting in the time to get up the learning curve of any of the good HTML/XML parsers will repay you greatly over time.
Re^2: Parsing and searching HTML code by jayto (Acolyte) on Jul 26, 2012 at 20:32 UTC
Thanks for showing me that, I'm probably going to be using your post as a reference in the future, but I already finished my program and I moved on to the next part... Parsing the PDF file...
Re: Parsing and searching HTML code by Kenosis (Priest) on Jul 26, 2012 at 17:10 UTC
Corion's suggestions are most helpful, especially the encouragement to use modules to grab the section you need. You're almost there in coding it yourself, but be aware (and I suspect you are, but it's worth mentioning) that the pattern matching may break the capture if the site's HTML structure changes. Given that the matching may eventually break (although the page's structure is currently consistent), consider the following: `use Modern::Perl; use LWP::Simple 'get'; my $vendor = 'Rockwell Automation'; my $search_start = "<p><b>$vendor</b><br />"; my $search_end = '<p> </p>'; my $url = 'http://www.us-cert.gov/control_systems/ics-cert/archive.html'; my $html = get $url; my ($section) = $html =~ /(\Q$search_start\E.*?\Q$search_end\E)/s; print $section;`
Output:
`<p><b>Rockwell Automation</b><br /> Rockwell Automation ControlLogix Multiple PLC Vulnerabilities (UPDATE), <a href="/control_systems/pdf/ICS-Alert-12-020-02A.pdf">ICS-ALERT-12-020-02A</a> (February 14, 2012)</p> <p>Rockwell Automation ControlLogix PLC Multiple Vulnerabilities, <a href="/control_systems/pdf/ICS-Alert-12-020-02.pdf">ICS-ALERT-12-020-02</a> (January 20, 2012)</p> <p>Rockwell Automation FactoryTalk RNADiagReceiver, <a href="/control_systems/pdf/ICSA-12-088-01.pdf">ICSA-12-088-01</a> (March 28, 2012)</p> <p>Rockwell Automation FactoryTalk RNADiagReceiver (UPDATE), <a href="/control_systems/pdf/ICSA-12-088-01A.pdf">ICSA-12-088-01A</a> (April 06, 2012)</p> <p>Rockwell Automation FactoryTalk RNADiagReceiver, <a href="/control_systems/pdf/ICS-ALERT-12-017-01.pdf">ICS-ALERT-12-017-01</a> (January 17, 2012)</p> <p>Rockwell FactoryTalk Diag Viewer Memory Corruption, <a href="/control_systems/pdf/ICSA-11-175-01.pdf">ICSA-11-175-01</a> (June 24, 2011)</p> <p>Rockwell-PLC5, <a href="/control_systems/pdf/ICSA-10-070-02.pdf">ICSA-10-070-02</a> (March 11, 2010)</p> <p>Rockwell RSLinx EDS, <a href="/control_systems/pdf/ICSA-11-161-01.pdf">ICSA-11-161-01</a> (June 10, 2011)</p> <p>Rockwell RSLogix, <a href="/control_systems/pdf/ICS-ALERT-11-256-05.pdf">ICS-ALERT-11-256-05</a> (September 13, 2011)</p> <p>Rockwell RSLogix (UPDATE), <a href="/control_systems/pdf/ICS-ALERT-11-256-05A.pdf">ICS-ALERT-11-256-05A</a> (September 19, 2011)</p> <p>Rockwell RSLogix Denial-of-Service Vulnerability, <a href="/control_systems/pdf/ICSA-11-273-03.pdf">ICSA-11-273-03</a> (September 30, 2011)</p> <p>Rockwell RSLogix Denial-of-Service Vulnerability (UPDATE), <a href="/control_systems/pdf/ICSA-11-273-03A.pdf">ICSA-11-273-03A</a> (October 06, 2011)</p> <p>RSLinx, <a href="/control_systems/pdf/ICSA-10-070-01.pdf">ICSA-10-070-01</a> (March 11, 2010)</p> <p>RSLinx (UPDATE), <a href="/control_systems/pdf/ICSA-10-070-01A.pdf">ICSA-10-070-01A</a> (May 03, 2010)</p> <p> </p>`
Hope this helps! Update: Strings in regex now quoted. Thanks, aitap, for the suggestion.
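Once a vendor's section has been captured, each advisory's link, identifier, and date could be pulled out with one more pattern. This is only a rough sketch, assuming the `<a href="...pdf">ID</a> (date)` layout seen in the sample data; the shortened `$section` text and the `@advisories` structure are illustrative, and a real parser remains the sturdier choice if the markup varies:

```perl
use strict;
use warnings;

# Shortened, hypothetical copy of a captured section (two advisories only).
my $section = <<'HTML';
<p><b>Rockwell Automation</b><br />
Rockwell Automation ControlLogix PLC Multiple Vulnerabilities, <a
href="/control_systems/pdf/ICS-Alert-12-020-02.pdf">ICS-ALERT-12-020-02</a> (January 20, 2012)</p>
<p>Rockwell-PLC5, <a
href="/control_systems/pdf/ICSA-10-070-02.pdf">ICSA-10-070-02</a> (March 11, 2010)</p>
<p> </p>
HTML

# Walk the section with /g, collecting one hash per PDF link:
# $1 = href, $2 = link text (advisory ID), $3 = date inside the parentheses.
my @advisories;
while ( $section =~ m{<a\s+href="([^"]+\.pdf)">([^<]+)</a>\s*\(([^)]+)\)}g ) {
    push @advisories, { href => $1, id => $2, date => $3 };
}

printf "%s  %s  %s\n", @{$_}{qw(id date href)} for @advisories;
```

The `\s+` between `<a` and `href` is there because the sample data breaks the tag across lines.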
Re^2: Parsing and searching HTML code by jayto (Acolyte) on Jul 26, 2012 at 17:20 UTC
Thank you, that post was really helpful. The problem was that the dot wasn't matching newlines (thanks, Corion), and it was fixed by putting an `s` at the end of the expression. I would have used modules, but I could not understand the documentation for any of them. I would also guess that if this website changes its structure, any module-based solution would probably break too, since there are no tag IDs to track.
Re: Parsing and searching HTML code by aitap (Curate) on Jul 26, 2012 at 17:28 UTC
Your strings should also be "quoted" by `\Q...\E` in case they contain any special characters: `$html =~ m/(\Q$search_start\E.+\Q$search_end\E)/si;` I suggest using HTML::TreeBuilder and the `look_down` method of HTML::Element. Sorry if my advice was wrong.
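A short sketch of why `\Q...\E` matters even when the text lives in a variable: an interpolated variable is still read as a pattern, so any metacharacters in it keep their special meaning. The search string below is a hypothetical example chosen because it contains parentheses:

```perl
use strict;
use warnings;

# Hypothetical search string containing regex metacharacters.
my $needle   = "RSLogix (UPDATE)";
my $haystack = "<p>Rockwell RSLogix (UPDATE), ...</p>";

# Unquoted: the parentheses become a capture group, so the pattern
# actually looks for the literal text "RSLogix UPDATE" and fails.
my $raw    = $haystack =~ /$needle/     ? 1 : 0;

# Quoted: \Q...\E escapes the metacharacters, so the text is matched literally.
my $quoted = $haystack =~ /\Q$needle\E/ ? 1 : 0;

print "unquoted: $raw, quoted: $quoted\n";
```

The strings in the original question happen to contain nothing that misbehaves badly, which is why the unquoted version appeared to work.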
Re^2: Parsing and searching HTML code by Kenosis (Priest) on Jul 26, 2012 at 17:35 UTC
Good catch--thank you. Have updated the code.
Re^2: Parsing and searching HTML code by jayto (Acolyte) on Jul 26, 2012 at 17:47 UTC
I thought that if strings are contained in variables then they don't need to be quoted. Also, one question, Kenosis: in the statement `/(\Q$search_start\E.*?\Q$search_end\E)/s;`, I realized that the `?` after the `.*` is important. What does it do?
Re^3: Parsing and searching HTML code by Kenosis (Priest) on Jul 26, 2012 at 18:02 UTC
That prevents the regex from being greedy, else the match would end at the very last `$search_end`--way beyond what you wanted.
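The difference can be seen on a tiny input. In this sketch the two-section page and the marker strings are made up for illustration; a `?` placed after a quantifier such as `*` makes it match as little as possible:

```perl
use strict;
use warnings;

# Hypothetical page with two vendor sections, each closed by the same end marker.
my $html = "<b>A</b>\none\n<p> </p>\n<b>B</b>\ntwo\n<p> </p>\n";
my ( $start, $end ) = ( "<b>A</b>", "<p> </p>" );

# Greedy: .* grabs as much as it can, so the match runs to the LAST end marker.
my ($greedy) = $html =~ /(\Q$start\E.*\Q$end\E)/s;

# Non-greedy: .*? grabs as little as it can, so the match stops at the FIRST end marker.
my ($lazy)   = $html =~ /(\Q$start\E.*?\Q$end\E)/s;

print "greedy:\n$greedy\n\nnon-greedy:\n$lazy\n";
```

Here the greedy capture swallows vendor B's section as well, while the non-greedy one contains only vendor A's.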
