This is an odd sort of post... there seems to be more detail in the title than there is in the text. What perl code have you tried so far? What are these "Markers"? Are they particular html tags? particular patterns of visible text when the html is displayed in a browser? chunks of javascript? Do you have lots of different html pages/files from which to extract stuff? If so, are the "Markers" different from one page/file to the next? Any or all of these things would affect one's choice of a solution.
On another topic:
My first thought is regular expressions, but is there a way to treat a string like an object, like in Java
This struck me as intriguing, because one of the problems I had on the few occasions when I've tried to do something using java (or python), was adjusting to the notion of applying a regex match or substitution on a "string object". It just seems bizarre and unnatural (maybe even inefficient or suboptimal in some way) that regex operations are methods built into string objects, rather than simply being operations on strings (the perlish way). I guess my primitive non-OO orientation is glaringly obvious here...
In any case, figuring out how to use HTML modules will be time well spent, assuming you have a lot of work to do on HTML data. In the meantime, if you have an immediate task that simply involves capturing whatever comes between "Marker 1" and "Marker 2" in an html stream, here are some reasonable first attempts to do what you want:
use strict;
use warnings;
my $html;
open( HTML, "<", "some.html" ) or die "can't open some.html: $!"; # let's suppose the data is in this file
{
local $/;
$html = <HTML>; #read all the html data into one string
}
# if you expect just one match (or only want the first one):
my ( $match ) = ( $html =~ /Marker 1(.*?)Marker 2/s );
# alternatively, if there are two or more and you want them all:
my @matches = ( $html =~ /Marker 1(.*?)Marker 2/gs );
(update: in both cases the "s" option following the regex can be important, so that the "." (wildcard) will match newlines as well as any other character).
In the first case, the parens around $match provide a "list context", which will cause the regex match to return whatever string was "captured" by the match (in this case, parens within the regex say what part will be captured).
In the second case, the "g" option on the regex says "find and return all captured matches"; the result is being assigned to an array, which again provides a list context for the operation.
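(In a scalar context, such as $found = ( /$pattern/ ), the returned value is simply true (1) or false (the empty string), i.e. "success" or "failure", whether or not the "g" option is used; with "g", each successive scalar-context match picks up where the previous one left off. To get the number of matches, assign the "g" match to an empty list first: $count = () = /$pattern/g.)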
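To see how the contexts and the "s" option behave, here is a small self-contained sketch. The sample string is made up for illustration, and "Marker 1" / "Marker 2" are just stand-ins for whatever your real markers are:

```perl
use strict;
use warnings;

# Made-up sample data; the first region spans a newline on purpose.
my $html = "x Marker 1 first\nchunk Marker 2 y Marker 1 second Marker 2 z";

# List context, no /g: returns the captured group of the first match.
my ($first) = $html =~ /Marker 1(.*?)Marker 2/s;

# List context with /g: returns every captured group.
my @all = $html =~ /Marker 1(.*?)Marker 2/gs;

# Without /s the "." cannot cross the newline, so the first region
# fails to match and the regex engine moves on to the second one.
my ($no_s) = $html =~ /Marker 1(.*?)Marker 2/;

# Scalar context: just true or false, not the captured text.
my $found = $html =~ /Marker 1(.*?)Marker 2/s;

# Counting idiom: assign the /g match to an empty list, then take
# that list assignment in scalar context to get the match count.
my $count = () = $html =~ /Marker 1(.*?)Marker 2/gs;

print "first: [$first]\n";
print "all:   [", join( "][", @all ), "]\n";
print "no /s: [$no_s]\n";
print "found: $found, count: $count\n";
```

With this sample, $first holds the two-line chunk, $no_s silently ends up with " second " instead, and $count is 2, which is a good illustration of why leaving off the "s" can give you a plausible-looking but wrong answer.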
So the main caveats with this approach (since I don't know what "Marker 1" and "Marker 2" represent) are:
- you might match something you didn't want, e.g. if "Marker 1" and/or "Marker 2" show up in places like html comments, html header or javascript, whereas you might just want the match to succeed on the displayable text part;
- when you capture a region that you do want, the text might contain stuff you can't use, e.g. incomplete pieces of nested tag structure or extra content you'd rather ignore, and "fixing" it could get dicey.
(updated wording of 2nd bullet for clarity).
Those are a few of the reasons why HTML parsing modules are the preferred tool in many cases -- but for a range of limited applications, simple regex matches can suffice.
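For comparison, here is a rough sketch of the module route, aimed at the first caveat (markers turning up in comments or javascript). It assumes HTML::Parser from CPAN is installed (it is not part of core Perl), and the sample page is made up; I'm not claiming this is the best module for your data, only that it shows the idea of collecting the displayable text first and matching on that:

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module (assumed installed), not core Perl

# Made-up sample: the markers also appear in a script and in a comment.
my $html = <<'END';
<html><head><script>var s = "Marker 1 js Marker 2";</script></head>
<body><!-- Marker 1 hidden Marker 2 -->
<p>Marker 1 <b>wanted</b> text Marker 2</p></body></html>
END

my $visible = '';
my $parser  = HTML::Parser->new(
    api_version => 3,
    # Handle only text events; "dtext" passes entity-decoded text.
    # No comment handler is installed, so comments are simply dropped.
    text_h => [ sub { $visible .= shift }, 'dtext' ],
);
$parser->ignore_elements(qw(script style));    # skip their content too
$parser->parse($html);
$parser->eof;

# Same regex as before, now applied to the displayable text only.
my @matches = $visible =~ /Marker 1(.*?)Marker 2/gs;
print "[$_]\n" for @matches;
```

Here only the region from the paragraph is matched; the copies in the script and the comment never reach the regex. Note that the tags inside the region are gone as well, so if you need the markup itself, something that builds a tree of the document would be a better fit than this.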