jayto has asked for the wisdom of the Perl Monks concerning the following question:
I have been given the task of parsing this page: http://www.us-cert.gov/control_systems/ics-cert/archive.html. The user gives me a company name, and I must search the page for that company's section and return all the information in it. So far I have tried this:
my $vendor = "Rockwell Automation";
my $search_start = "<b>$vendor</b>";
my $search_end = "<p> </p>";
$html =~ m/($search_start.+$search_end)/i;
print $1;
I do not understand why this doesn't work; Perl warns that $1 is uninitialized. What is wrong with my expression? Scroll down for more information...
Re: Parsing and searching HTML code
by Corion (Patriarch) on Jul 26, 2012 at 16:39 UTC
You show us some code, but you don't show us the (relevant) part of the data, nor do you show us what $html contains. As the crystal ball is still getting sanded down, I guess you have to supply the data yourself. My recommendation for matching is to use the idiom of:
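One idiom that fits this advice is to test whether the match succeeded before touching $1. This is only a minimal sketch and an assumption about what was meant, not necessarily the snippet Corion posted; the sample $html string is hypothetical, since the real page content is not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical single-line sample data (an assumption, not the real page).
my $html = "<b>Rockwell Automation</b> advisory text <p> </p>";

my $vendor       = "Rockwell Automation";
my $search_start = "<b>$vendor</b>";
my $search_end   = "<p> </p>";

# Only read $1 inside the branch where the match is known to have succeeded.
my $result;
if ( $html =~ m/($search_start.+$search_end)/i ) {
    $result = $1;
    print "Found: $result\n";
}
else {
    print "No match for '$vendor'\n";
}
```

Written this way, a failed match produces a clear message instead of an "uninitialized value" warning from a stale or undefined $1.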
But even more, for scraping websites, I recommend one of the modules that use HTML::TreeBuilder::XPath, or something comparable, like Mojolicious, Web::Magic, Web::Scraper or App::scrape.
by jayto (Acolyte) on Jul 26, 2012 at 16:47 UTC
I looked at a bunch of modules, including the ones you mentioned, but I do not understand how to use them. The documentation for the TreeBuilder modules (and other HTML parser modules) is above my current Perl knowledge. Data looks like this:
I need to get all the advisories under Rockwell Automation. That is my goal, but I am having trouble accomplishing it. Any ideas on how to fix my regular expression?
by Corion (Patriarch) on Jul 26, 2012 at 17:01 UTC
The dot metacharacter (".") does not match newlines by default. See perlre for the /s modifier. As an aside, your code is not really helpful as it includes a reference to a module that I don't have, ID::Utilities, and calls some method in there. If you're certain that the ->get method returns the correct HTML, why reference that method at all? Simply assign the HTML value directly in a test program, if only to eliminate misbehaviour of the ->get method, or of the remote end.
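To make the effect of /s concrete, here is a small self-contained sketch that follows the "assign the HTML directly" advice. The multi-line sample in $html is hypothetical and only assumes the section spans several lines; it is not the actual page.

```perl
use strict;
use warnings;

# Hypothetical multi-line sample, assigned directly instead of fetched.
my $html = <<'HTML';
<b>Rockwell Automation</b>
<ul><li>Advisory A</li></ul>
<p> </p>
HTML

my $search_start = "<b>Rockwell Automation</b>";
my $search_end   = "<p> </p>";

# Without /s, "." cannot cross the newline after </b>, so the match fails.
my $without_s = $html =~ m/($search_start.+$search_end)/i  ? 1 : 0;

# With /s, "." also matches newlines, so the whole section is matched.
my $with_s    = $html =~ m/($search_start.+$search_end)/si ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";
```

With this sample, only the second pattern matches, which is consistent with the "uninitialized $1" warning in the original post.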
Re: Parsing and searching HTML code
by Your Mother (Archbishop) on Jul 26, 2012 at 18:57 UTC
Regular expressions are very fragile for HTML parsing. Here's a parser-based (XML::LibXML) example:
Snippet of output
I do understand that regular expressions seem more accessible at first and can solve many specific/one-off problems, but putting in the time to get up the learning curve of any of the good HTML/XML parsers will repay greatly over time.
by jayto (Acolyte) on Jul 26, 2012 at 20:32 UTC
Re: Parsing and searching HTML code
by Kenosis (Priest) on Jul 26, 2012 at 17:10 UTC
Corion's suggestions are most helpful, especially the encouragement to use modules to grab the section you need. You're almost there in coding it yourself, but be aware (and I suspect you are, but it's worth mentioning) that the pattern matching may break the capture if the site's HTML structure changes. Given that the matching may eventually break (although the page's structure is currently consistent), consider the following:
Output:
Hope this helps! Update: Strings in regex now quoted. Thanks, aitap, for the suggestion.
by jayto (Acolyte) on Jul 26, 2012 at 17:20 UTC
Re: Parsing and searching HTML code
by aitap (Curate) on Jul 26, 2012 at 17:28 UTC
Your strings should also be "quoted" by \Q...\E in case they contain any special characters: $html =~ m/(\Q$search_start\E.+\Q$search_end\E)/si; I suggest using HTML::TreeBuilder and the look_down method of HTML::Element.
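As a rough illustration of why the quoting matters, the sketch below uses a made-up vendor name containing regex metacharacters. Both the vendor "Acme (Industrial) Co." and the sample $html are hypothetical and chosen only to trigger the problem.

```perl
use strict;
use warnings;

# Hypothetical vendor name with parentheses and a dot (regex metacharacters).
my $vendor = "Acme (Industrial) Co.";
my $html   = "<b>Acme (Industrial) Co.</b>\n<li>Advisory 1</li>\n<p> </p>\n";

my $search_start = "<b>$vendor</b>";
my $search_end   = "<p> </p>";

# Unquoted: "(Industrial)" is parsed as a capture group, so the literal
# parentheses in the HTML are never matched and the whole match fails.
my $unquoted = $html =~ m/$search_start.+$search_end/si ? 1 : 0;

# Quoted with \Q...\E: every character of the strings is matched literally.
my ($section) = $html =~ m/(\Q$search_start\E.+\Q$search_end\E)/si;

print "unquoted matched: $unquoted\n";
print "quoted capture:\n$section\n";
```

Since the vendor name comes from the user, quoting it also keeps stray characters in the input from changing what the pattern means.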
Sorry if my advice was wrong.
by Kenosis (Priest) on Jul 26, 2012 at 17:35 UTC
Good catch--thank you. Have updated the code.
by jayto (Acolyte) on Jul 26, 2012 at 17:47 UTC
by Kenosis (Priest) on Jul 26, 2012 at 18:02 UTC
That prevents the regex from being greedy, else the match would end at the very last $search_end--way beyond what you wanted.
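Assuming "that" refers to a ? placed after .+ (the question being answered is not shown), the difference can be sketched as follows. The two-section sample page and the name "Other Vendor" are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical page with two sections that end with the same marker.
my $html = join "\n",
    "<b>Rockwell Automation</b>", "<li>Advisory R1</li>", "<p> </p>",
    "<b>Other Vendor</b>",        "<li>Advisory O1</li>", "<p> </p>";

my $search_start = "<b>Rockwell Automation</b>";
my $search_end   = "<p> </p>";

# Greedy: .+ grabs as much as possible, so the match runs to the LAST end marker.
my ($greedy) = $html =~ m/(\Q$search_start\E.+\Q$search_end\E)/si;

# Non-greedy: .+? grabs as little as possible, so it stops at the FIRST end marker.
my ($lazy)   = $html =~ m/(\Q$search_start\E.+?\Q$search_end\E)/si;

print "greedy capture includes Other Vendor: ", ( $greedy =~ /Other Vendor/ ? "yes" : "no" ), "\n";
print "lazy capture includes Other Vendor:   ", ( $lazy   =~ /Other Vendor/ ? "yes" : "no" ), "\n";
```

With this sample, the greedy capture spills into the second section while the non-greedy one ends with the first "<p> </p>".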