
This is an odd sort of post... there seems to be more detail in the title than there is in the text. What perl code have you tried so far? What are these "Markers"? Are they particular html tags? particular patterns of visible text when the html is displayed in a browser? chunks of javascript? Do you have lots of different html pages/files from which to extract stuff? If so, are the "Markers" different from one page/file to the next? Any or all of these things would affect one's choice of a solution.

On another topic:

My first though is reqular expression but is there a way to treat a string like an object in like in java

This struck me as intriguing, because one of the problems I had on the few occasions when I've tried to do something using java (or python), was adjusting to the notion of applying a regex match or substitution on a "string object". It just seems bizarre and unnatural (maybe even inefficient or suboptimal in some way) that regex operations are methods built into string objects, rather than simply being operations on strings (the perlish way). I guess my primitive non-OO orientation is glaringly obvious here...

In any case, figuring out how to use HTML modules will be time well spent, assuming you have a lot of work to do on HTML data. In the meantime, if you have an immediate task that simply involves capturing whatever comes between "Marker 1" and "Marker 2" in an html stream, here are some reasonable first attempts to do what you want:

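use strict;
use warnings;

my $html;

# let's suppose the data is in this file
open( my $fh, "<", "some.html" ) or die "Can't open some.html: $!";
{
    local $/;       # undefine the record separator so one read gets everything
    $html = <$fh>;  # read all the html data into one string
}
close $fh;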

# if you expect just one match (or only want the first one):
my ( $match ) = ( $html =~ /Marker 1(.*?)Marker 2/s );

# alternatively, if there are two or more and you want them all:
my @matches = ( $html =~ /Marker 1(.*?)Marker 2/gs );

(update: in both cases the "s" option following the regex can be important, so that the "." (wildcard) will match newlines as well as any other character).

In the first case, the parens around $match provide a "list context", which will cause the regex match to return whatever string was "captured" by the match (in this case, parens within the regex say what part will be captured).

In the second case, the "g" option on the regex says "find and return all captured matches"; the result is being assigned to an array, which again provides a list context for the operation.

(In a scalar context, such as $found = ( /$pattern/ ), the returned value is simply true (1) for success or false (an empty string, which counts as 0 in numeric context) for failure.)
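To make the context differences concrete, here is a small self-contained sketch. The sample string and the variable names are made up for illustration; the last assignment is a common idiom for counting matches, not something from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample data containing two marked regions.
my $html = "x Marker 1 first Marker 2 y Marker 1 second Marker 2 z";

# List context, no /g: returns the captures of the first match only.
my ($first) = $html =~ /Marker 1(.*?)Marker 2/s;

# List context with /g: returns the capture from every match.
my @all = $html =~ /Marker 1(.*?)Marker 2/gs;

# Scalar context: only tells you whether the match succeeded.
my $found = $html =~ /Marker 1(.*?)Marker 2/s;

# Counting idiom: the empty list forces list context, and the
# outer scalar assignment receives the number of items returned.
my $count = () = $html =~ /Marker 1(.*?)Marker 2/gs;

print "first: [$first]\n";
print "all:   [", join( "][", @all ), "]\n";
print "found: ", ( $found ? "yes" : "no" ), "\n";
print "count: $count\n";
```

Note that the captured strings keep the spaces next to the markers; trim them afterwards if that matters.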

So the main caveats with this approach (since I don't know what "Marker 1" and "Marker 2" represent) are:

* you might match something you didn't want, e.g. if "Marker 1" and/or "Marker 2" show up in places like html comments, html header or javascript, whereas you might just want the match to succeed on the displayable text part;
* when you capture a region that you do want, the text might contain stuff you can't use, e.g. incomplete pieces of nested tag structure or extra content you'd rather ignore, and "fixing" it could get dicey.

(updated wording of 2nd bullet for clarity).

Those are a few of the reasons why HTML parsing modules are the preferred tool in many cases -- but for a range of limited applications, simple regex matches can suffice.
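If you do stay with regexes, the first caveat can be reduced (not eliminated) by throwing away comments and script/style blocks before looking for the markers. This is only a rough sketch under the assumption that the page is reasonably well-formed; the sample html and the $visible variable are invented for the example, and a real parser will handle odd markup far more reliably.

```perl
use strict;
use warnings;

# Hypothetical page: the markers also occur in a script and in a comment.
my $html = <<'END';
<html><head><script>var s = "Marker 1 not this Marker 2";</script></head>
<body><!-- Marker 1 nor this Marker 2 -->
<p>Marker 1 the wanted text Marker 2</p></body></html>
END

# Work on a copy so the original markup stays untouched.
( my $visible = $html ) =~ s/<!--.*?-->//gs;         # drop html comments
$visible =~ s{<(script|style)\b.*?</\1\s*>}{}gis;    # drop script/style blocks

my @matches = $visible =~ /Marker 1(.*?)Marker 2/gs;
print "[$_]\n" for @matches;    # prints "[ the wanted text ]"
```

Without the two substitutions, the same match would also return the strings from the script and the comment.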

In reply to Re: Extracting a substring from HTML by graff
in thread Extracting a substring from HTML by richill
