richill has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I apologise if this is a simple query.

I want to search a string and extract the substring between Marker 1 and Marker 2. Marker 1 and Marker 2 are different.

What is the best way to do this?

My first thought is a regular expression, but is there a way to treat a string like an object, as in Java?

If there is a way, is it worth pursuing?

At the moment I believe regular expressions are the best way to perform these operations.

Replies are listed 'Best First'.
Re: Extracting a substring from HTML
by GrandFather (Saint) on Sep 10, 2006 at 09:49 UTC

    The right way of doing almost anything with HTML is to use the appropriate module. The appropriate module depends somewhat on the task. In this case I'd guess HTML::TreeBuilder is what you want.

    Life is too short to reinvent complicated wheels, and regexen for parsing HTML are complicated wheels indeed.

    If you need any help using TreeBuilder show us what you have tried with a very small (but complete) code sample showing the issue and a very small data sample as required to show the issue.
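
For readers who want a starting point, here is a minimal sketch of the TreeBuilder approach. The sample HTML and the assumption that the wanted text lives in a `<div id="content">` are made up for illustration, since the original question never says what the markers are.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, not part of core Perl

# Hypothetical input: assumes the wanted text sits in <div id="content">.
my $html = '<html><body><div id="nav">menu</div>'
         . '<div id="content"><p>Hello <b>world</b></p></div></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context, look_down returns the first element matching all criteria.
my $div  = $tree->look_down( _tag => 'div', id => 'content' );
my $text = defined $div ? $div->as_text : '';

print "extracted: [$text]\n";

$tree->delete;    # release the parse tree when done
```

The point of the sketch is that the "markers" become a description of an element (tag name plus attribute) rather than a pair of literal strings, so nested tags inside the element are handled by the parser.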


    DWIM is Perl's answer to Gödel
      Thank you. I'll look at the HTML::Treebuilder now.

      I know it was a basic question, but with so many ways of doing things in Perl, the benefit of experience found on here is high.

      I could spend days on a clumsy solution.

        The benefit of experience is actually in CPAN. Always look there first before coding anything yourself. Get the Perl Cookbook (ISBN 0-596-00313-7) to get _productive_ right away with Perl. If you are going to work with Perl, it will be your best-spent $50: it has many cool ideas, not only on the use of Perl itself but also on many modules for specific tasks.

        Then get what I call the trilogy: Learning Perl, Intermediate Perl, and Advanced Perl.

        And then, of course, the Camel Book. But that's just to say you have it and have read it.
        Please be careful. Package names are case-sensitive in Perl. That's HTML::TreeBuilder
Re: Extracting a substring from HTML
by graff (Chancellor) on Sep 10, 2006 at 19:20 UTC
    This is an odd sort of post... there seems to be more detail in the title than there is in the text. What perl code have you tried so far? What are these "Markers"? Are they particular html tags? particular patterns of visible text when the html is displayed in a browser? chunks of javascript? Do you have lots of different html pages/files from which to extract stuff? If so, are the "Markers" different from one page/file to the next? Any or all of these things would affect one's choice of a solution.

    On another topic:

    My first thought is a regular expression, but is there a way to treat a string like an object, as in Java?

    This struck me as intriguing, because one of the problems I had on the few occasions when I've tried to do something using java (or python), was adjusting to the notion of applying a regex match or substitution on a "string object". It just seems bizarre and unnatural (maybe even inefficient or suboptimal in some way) that regex operations are methods built into string objects, rather than simply being operations on strings (the perlish way). I guess my primitive non-OO orientation is glaringly obvious here...

    In any case, figuring out how to use HTML modules will be time well spent, assuming you have a lot of work to do on HTML data. In the meantime, if you have an immediate task that simply involves capturing whatever comes between "Marker 1" and "Marker 2" in an html stream, here are some reasonable first attempts to do what you want:

        use strict;
        my $html;
        open( HTML, "<", "some.html" )    # let's suppose the data is in this file
            or die "can't open some.html: $!";
        {
            local $/;           # undefine the record separator ("slurp" mode)
            $html = <HTML>;     # read all the html data into one string
        }
        close HTML;

        # if you expect just one match (or only want the first one):
        my ( $match ) = ( $html =~ /Marker 1(.*?)Marker 2/s );

        # alternatively, if there are two or more and you want them all:
        my @matches = ( $html =~ /Marker 1(.*?)Marker 2/gs );
    (update: in both cases the "s" option following the regex can be important, so that the "." (wildcard) will match newlines as well as any other character).

    In the first case, the parens around $match provide a "list context", which will cause the regex match to return whatever string was "captured" by the match (in this case, parens within the regex say what part will be captured).

    In the second case, the "g" option on the regex says "find and return all captured matches"; the result is being assigned to an array, which again provides a list context for the operation.

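    (In a scalar context, such as $found = ( /$pattern/ ), the returned value would simply be true or false, i.e. "true/success" or "false/failure". With the "g" option in scalar context, each evaluation finds the next match rather than returning a count; to count matches, assign the list-context result through an empty list, e.g. $count = () = /$pattern/g.)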
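
A minimal sketch of how context changes what the match returns. The sample string is made up for illustration, and "Marker 1" / "Marker 2" simply stand in for whatever the real markers are.

```perl
use strict;
use warnings;

# Hypothetical sample data with two marked regions, one spanning a newline.
my $html = "Marker 1 first Marker 2\nMarker 1 second\nchunk Marker 2";

# List context, no /g: the captured text of the first match only.
my ($first) = $html =~ /Marker 1(.*?)Marker 2/s;

# List context with /g: every captured chunk.
my @all = $html =~ /Marker 1(.*?)Marker 2/gs;

# Scalar context, no /g: just true or false.
my $found = $html =~ /Marker 1(.*?)Marker 2/s;

# Scalar context with /g: each evaluation finds the next match,
# so a while loop walks through the matches one at a time.
my @seen;
while ( $html =~ /Marker 1(.*?)Marker 2/gs ) {
    push @seen, $1;
}

# To count matches, force list context and take the size of the list.
my $count = () = $html =~ /Marker 1(.*?)Marker 2/gs;

print "first: [$first]\n";
print "all:   [$_]\n" for @all;
print "found: ", ( $found ? "yes" : "no" ), "\n";
print "count: $count\n";
```

Note that the second chunk is only captured because of the "s" option; without it, "." would stop at the newline and that region would not match.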

    So the main caveats with this approach (since I don't know what "Marker 1" and "Marker 2" represent) are:

    • you might match something you didn't want, e.g. if "Marker 1" and/or "Marker 2" show up in places like html comments, html header or javascript, whereas you might just want the match to succeed on the displayable text part;
    • when you capture a region that you do want, the text might contain stuff you can't use, e.g. incomplete pieces of nested tag structure or extra content you'd rather ignore, and "fixing" it could get dicey.
    (updated wording of 2nd bullet for clarity).
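
The first caveat can be illustrated with a small sketch. The page content here is invented, and the comment-stripping step is only a crude workaround shown for contrast, not a recommendation over a real parser.

```perl
use strict;
use warnings;

# Hypothetical page: the markers appear once in a comment and once in visible text.
my $html = <<'END_HTML';
<html><body>
<!-- Marker 1 old draft Marker 2 -->
<p>Marker 1 the text we want Marker 2</p>
</body></html>
END_HTML

# The plain regex knows nothing about HTML structure, so the first hit is the comment.
my ($naive) = $html =~ /Marker 1(.*?)Marker 2/s;

# Crude workaround: strip simple comments first. This does nothing about
# markers inside <script> or <head>, or about malformed comments.
( my $no_comments = $html ) =~ s/<!--.*?-->//gs;
my ($visible) = $no_comments =~ /Marker 1(.*?)Marker 2/s;

print "naive:   [$naive]\n";
print "visible: [$visible]\n";
```

Each extra pre-processing step like this is another special case to maintain, which is where the regex approach starts to get dicey.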

    Those are a few of the reasons why HTML parsing modules are the preferred tool in many cases -- but for a range of limited applications, simple regex matches can suffice.

Re: Extracting a substring from HTML
by mugwumpjism (Hermit) on Sep 11, 2006 at 04:13 UTC

    Check out XML::LibXML for the nice ways to do this, using standards such as XPath and DOM. Regular expressions are not a very good way to parse structured input like XML, unless you can limit the input to a known subset of XML forms.
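
As a rough sketch of the XPath route, assuming (hypothetically) that the wanted text is the content of a `<div id="content">` element; the sample markup and the XPath expression are illustrative only.

```perl
use strict;
use warnings;
use XML::LibXML;    # CPAN module wrapping libxml2, not part of core Perl

# Hypothetical input document.
my $html = '<html><body><div id="nav">menu</div>'
         . '<div id="content"><p>Hello <b>world</b></p></div></body></html>';

# load_html uses libxml2's HTML parser; recover => 1 lets it continue
# past minor markup errors instead of dying.
my $doc = XML::LibXML->load_html( string => $html, recover => 1 );

# findvalue returns the string value of the first node the XPath selects.
my $text = $doc->findvalue('//div[@id="content"]');

print "extracted: [$text]\n";
```

Swapping the XPath expression is usually all that is needed to target a different part of the document.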

    $h=$ENV{HOME};my@q=split/\n\n/,`cat $h/.quotes`;$s="$h/." ."signature";$t=`cat $s`;print$t,"\n",$q[rand($#q)],"\n";
Re: Extracting a substring from HTML
by Anonymous Monk on Sep 12, 2006 at 03:48 UTC