http://qs1969.pair.com?node_id=690711

Grey Fox has asked for the wisdom of the Perl Monks concerning the following question:

Hello Fellow Monks;

I'm having a difficult time getting a regex to stop after finding the first occurrence of a string. I am extracting title information from an SGML file, and the regex is only returning partial information.

I'm using the following regex.

m/\s(?:-\s)?([\w\s\d()-,]{1,75})<\/title>/

against the following data.

<title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>
<title> GRP-123-1 - Grinding And Cutting Solution (Quakeroat 2780 UTC - Synthetic Type)</title>
<title> GRP-124 - Alkaline Rust Remover Solution</title>
<title> GRP-124-1 - Alkaline Rust Remover Solution (Ardvark 185 - Rust Remover)</title>
<title> GRP-124-2 - Alkaline Rust Remover Solution (Ardvark 185L - Rust Remover)</title>
<title> GRP-124-3 - Alkaline Rust Remover Solution (Bee-Dee J84AL - Rust Remover)</title>
<title> GRP-124-4 - Alkaline Rust Remover Solution (Mag HD2-202 - Rust Remover)</title>
<title> GRP-124-5 - Alkaline Rust Remover Solution (Turk 4181L - Rust Remover)</title>
<title> GRP-124-6 - Alkaline Rust Remover Solution (Turk 4181 - Rust Remover)</title>
<title> GRP-124-7 - Alkaline Rust Remover Solution (Bee-Dee J84A - Rust Remover)</title>
<title> GRP-124-8 - Alkaline Rust Remover Solution (Cadilac HTP-1150 - Rust Remover)</title>
<title> GRP-124-9 - Alkaline Rust Remover Solution (Cadilac HTP-1150L - Rust Remover)</title>
<title> GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)";

Instead of getting "Grinding and Cutting Solution (ACME PR50 - Water Type)", I'm only getting "Water Type)", because of the second occurrence of the " - " in the data. I know there is a way to make the regex only see the first occurrence and then pass me all of the rest of the text up until </title>.

I've looked at perlre and http://www.regular-expressions.info/quickstart.html

Thanks.

Note: Added more examples
-- Grey Fox
"We are grey. We stand between the darkness and the light" B5

Replies are listed 'Best First'.
Re: Regex only returning partial data
by toolic (Bishop) on Jun 06, 2008 at 17:42 UTC
    If I escape the dash inside your character class, I get more:
    $_ = '<title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>';
    if (m/\s(?:-\s)?([\w\s\d()\-,]{1,75})<\/title>/) {
        print "$1\n";
    }

    prints:

    -134 - Grinding And Cutting Solution (ACME PR50 - Water Type)

    Update: Also, the \d is not necessary since you already use \w. You should also consider using one of the CPAN HTML parser modules to grab the contents of the <title> elements.
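The effect of the unescaped dash can be checked directly. The following is a minimal sketch, assuming the sample line from the question with the display wrap marker removed; the variable names are illustrative only. Inside the original class, `)-,` is read as a range covering `)`, `*`, `+` and `,`, so a literal `-` is not a member of the class at all.

```perl
use strict;
use warnings;

# Sample line from the question (wrap marker removed).
my $line = '<title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>';

# Original class: ")-," is a range, so "-" itself is not in the class.
my $unescaped = qr/\s(?:-\s)?([\w\s\d()-,]{1,75})<\/title>/;

# Escaped dash: "-" is now a literal member of the class.
my $escaped = qr/\s(?:-\s)?([\w\s\d()\-,]{1,75})<\/title>/;

my ($got_unescaped) = $line =~ $unescaped;
my ($got_escaped)   = $line =~ $escaped;

# Brackets make the trailing space in each capture visible.
print "unescaped: [$got_unescaped]\n";   # [Water Type) ]
print "escaped:   [$got_escaped]\n";     # [-134 - Grinding And Cutting Solution (ACME PR50 - Water Type) ]

# Membership checks for the range reading of ")-,".
my $dash_in_class = ('-' =~ /[()-,]/) ? 1 : 0;   # 0: dash is outside the range
my $plus_in_class = ('+' =~ /[()-,]/) ? 1 : 0;   # 1: plus falls inside ) .. ,
print "dash in class: $dash_in_class, plus in class: $plus_in_class\n";
```

With the dash excluded, the class can never span a " - ", so the only place the pattern can succeed is after the last one, which is why only "Water Type)" came back.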

Re: Regex only returning partial data
by throop (Chaplain) on Jun 06, 2008 at 17:54 UTC
    There are three dashes in your example, not two. And a previous poster's comments about escaping the dash apply. So I'm unsure just what you're trying to do. Still, you want to change the greedy behavior of [...]{1,75}. Check out "What does it mean that regexes are greedy?" in perlfaq6.

    throop

    update: you've added more examples, so let me give you some more comments –

    Are you ever going to have line breaks in your titles?
    Are all your titles going to start with GRP?
    I suggest:

    m| \s* GRP \- \d+ (?: \s? \-\d+)   # The GRP intro
       \s+ \- \s+                      # The dash
       [\w\s\d()\-,]{1,75}             # I have doubts about this
       <\/title>
     |xms

    If this looks unfamiliar, check out How can I hope to use regular expressions.. I have some doubts that you really want the spec [\w\s\d()\-,]{1,75}. That is, are you really confident that you're not going to see a line like

    <title> GRP-124-9 - Alkaline Rust Remover Solution (Yugo HTP-1150L - Rust &amp; Stain Remover)</title>

    and lose on the '&amp;'?
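The concern about '&amp;' can be illustrated with a small sketch. It assumes the hypothetical "Yugo" title from the reply above and compares a fixed character class with a lazy `.*?`; neither pattern is claimed to be the poster's final recommendation, and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical title containing an SGML entity, taken from the reply above.
my $line = '<title> GRP-124-9 - Alkaline Rust Remover Solution '
         . '(Yugo HTP-1150L - Rust &amp; Stain Remover)</title>';

# Character-class version: "&" and ";" are not in the class, so no run of
# class characters can reach </title> and the match fails.
my ($by_class) = $line =~ /\s-\s([\w\s()\-,]{1,75})<\/title>/;

# Lazy version: everything after the first " - " up to the closing tag.
my ($by_lazy) = $line =~ /\s-\s(.*?)\s*<\/title>/;

print "class: ", (defined $by_class ? $by_class : '(no match)'), "\n";
print "lazy:  ", (defined $by_lazy  ? $by_lazy  : '(no match)'), "\n";
```

The trade-off is that `.*?` accepts any character, so it no longer validates the title text; it only delimits it.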
Re: Regex only returning partial data
by johngg (Canon) on Jun 06, 2008 at 20:48 UTC
    You can use regex look-around assertions and negated character classes to achieve this.

    use strict;
    use warnings;

    my $rxTitle = qr
        {(?x)                # Use extended syntax
            (?<=-\s)         # If preceded by hyphen & space
            (                # Open capture
                [^(-]+       # One or more non- opening parentheses
                             # or hyphens
                (?:          # Non-capture group for quantifier
                    [^)]+    # One or more non- closing parentheses
                    \)       # Followed by a closing parenthesis
                )?           # Quantify, zero or one of
            )                # Close capture
            (?=\s?</title>)  # If followed by optional space and
                             # closing title tag
        };

    while ( <DATA> )
    {
        my ( $text ) = m{$rxTitle};
        print qq{$text\n};
    }

    __END__
    <title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>
    <title> GRP-123-1 - Grinding And Cutting Solution (Quakeroat 2780 UTC - Synthetic Type)</title>
    <title> GRP-124 - Alkaline Rust Remover Solution</title>
    <title> GRP-124-1 - Alkaline Rust Remover Solution (Ardvark 185 - Rust Remover)</title>
    <title> GRP-124-2 - Alkaline Rust Remover Solution (Ardvark 185L - Rust Remover)</title>
    <title> GRP-124-3 - Alkaline Rust Remover Solution (Bee-Dee J84AL - Rust Remover)</title>
    <title> GRP-124-4 - Alkaline Rust Remover Solution (Mag HD2-202 - Rust Remover)</title>
    <title> GRP-124-5 - Alkaline Rust Remover Solution (Turk 4181L - Rust Remover)</title>
    <title> GRP-124-6 - Alkaline Rust Remover Solution (Turk 4181 - Rust Remover)</title>
    <title> GRP-124-7 - Alkaline Rust Remover Solution (Bee-Dee J84A - Rust Remover)</title>
    <title> GRP-124-8 - Alkaline Rust Remover Solution (Cadilac HTP-1150 - Rust Remover)</title>
    <title> GRP-124-9 - Alkaline Rust Remover Solution (Cadilac HTP-1150L - Rust Remover)</title>
    <title> GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)</title>

    The output.

    Grinding And Cutting Solution (ACME PR50 - Water Type)
    Grinding And Cutting Solution (Quakeroat 2780 UTC - Synthetic Type)
    Alkaline Rust Remover Solution
    Alkaline Rust Remover Solution (Ardvark 185 - Rust Remover)
    Alkaline Rust Remover Solution (Ardvark 185L - Rust Remover)
    Alkaline Rust Remover Solution (Bee-Dee J84AL - Rust Remover)
    Alkaline Rust Remover Solution (Mag HD2-202 - Rust Remover)
    Alkaline Rust Remover Solution (Turk 4181L - Rust Remover)
    Alkaline Rust Remover Solution (Turk 4181 - Rust Remover)
    Alkaline Rust Remover Solution (Bee-Dee J84A - Rust Remover)
    Alkaline Rust Remover Solution (Cadilac HTP-1150 - Rust Remover)
    Alkaline Rust Remover Solution (Cadilac HTP-1150L - Rust Remover)
    Alkaline Rust Remover (Titanium Long Soak)

    I hope this is useful.

    Cheers,

    JohnGG

Re: Regex only returning partial data
by FunkyMonk (Chancellor) on Jun 06, 2008 at 17:40 UTC
    You didn't give us much data to test against, but would this be good enough?
    $_ = "<title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>";
    print "/", m{\d+\s*-\s*(.*?)\s*</title>}, "/";
    #/Grinding And Cutting Solution (ACME PR50 - Water Type)/


    Unless I state otherwise, all my code runs with strict and warnings
      Thanks FunkyMonk, I added more examples. Also I am just trying to pick up the Title description, not the GRP-134 -
      Thanks
      -- Grey Fox
      "We are grey. We stand between the darkness and the light" B5
        OK, with more data it looks like you want to capture everything after " - " up to "</title>". So...
        while (<DATA>) {
            print "/", m{ - (.*?)\s*</title>}, "/\n";
        }
        __DATA__
        <title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>
        <title> GRP-123-1 - Grinding And Cutting Solution (Quakeroat 2780 UTC - Synthetic Type)</title>
        <title> GRP-124 - Alkaline Rust Remover Solution</title>
        <title> GRP-124-1 - Alkaline Rust Remover Solution (Ardvark 185 - Rust Remover)</title>
        <title> GRP-124-2 - Alkaline Rust Remover Solution (Ardvark 185L - Rust Remover)</title>
        <title> GRP-124-3 - Alkaline Rust Remover Solution (Bee-Dee J84AL - Rust Remover)</title>
        <title> GRP-124-4 - Alkaline Rust Remover Solution (Mag HD2-202 - Rust Remover)</title>
        <title> GRP-124-5 - Alkaline Rust Remover Solution (Turk 4181L - Rust Remover)</title>
        <title> GRP-124-6 - Alkaline Rust Remover Solution (Turk 4181 - Rust Remover)</title>
        <title> GRP-124-7 - Alkaline Rust Remover Solution (Bee-Dee J84A - Rust Remover)</title>
        <title> GRP-124-8 - Alkaline Rust Remover Solution (Cadilac HTP-1150 - Rust Remover)</title>
        <title> GRP-124-9 - Alkaline Rust Remover Solution (Cadilac HTP-1150L - Rust Remover)</title>
        <title> GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)";

        Output:

        /usr/bin/perl -w /home/bri/git/cvsid/pm
        /Grinding And Cutting Solution (ACME PR50 - Water Type)/
        /Grinding And Cutting Solution (Quakeroat 2780 UTC - Synthetic Type)/
        /Alkaline Rust Remover Solution/
        /Alkaline Rust Remover Solution (Ardvark 185 - Rust Remover)/
        /Alkaline Rust Remover Solution (Ardvark 185L - Rust Remover)/
        /Alkaline Rust Remover Solution (Bee-Dee J84AL - Rust Remover)/
        /Alkaline Rust Remover Solution (Mag HD2-202 - Rust Remover)/
        /Alkaline Rust Remover Solution (Turk 4181L - Rust Remover)/
        /Alkaline Rust Remover Solution (Turk 4181 - Rust Remover)/
        /Alkaline Rust Remover Solution (Bee-Dee J84A - Rust Remover)/
        /Alkaline Rust Remover Solution (Cadilac HTP-1150 - Rust Remover)/
        /Alkaline Rust Remover Solution (Cadilac HTP-1150L - Rust Remover)/
        //

        The empty last match is due to what I hope is a copy-paste error in the data you posted: the last line has no closing </title>.


        Unless I state otherwise, all my code runs with strict and warnings
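Why the last line prints as "//" can be shown in isolation. This is a minimal sketch, assuming the last data line exactly as posted (ending in `";` with no closing tag) and a hypothetical corrected copy of it: in list context a failed match returns an empty list, so `print` receives nothing between the two slashes.

```perl
use strict;
use warnings;

# Last data line as posted: the closing tag is missing.
my $broken = '<title> GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)";';

# Hypothetical corrected version of the same line.
my $fixed  = '<title> GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)</title>';

# In list context a failed match yields an empty list,
# a successful one yields the captured groups.
my @from_broken = $broken =~ m{ - (.*?)\s*</title>};
my @from_fixed  = $fixed  =~ m{ - (.*?)\s*</title>};

print "/", @from_broken, "/\n";   # prints //
print "/", @from_fixed,  "/\n";   # prints /Alkaline Rust Remover (Titanium Long Soak)/
```

Checking `if (my ($title) = m{...})` before printing would skip such lines instead of printing empty slashes.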
Re: Regex only returning partial data
by Grey Fox (Chaplain) on Jun 06, 2008 at 20:18 UTC

    Thanks all

    I used all of your advice and came up with the following regex that does the trick.

    m|<title>GRP\-\d{1,3}(?:\-\d{1,3})?(?:\s)?(?:\-\s)?(.*?)</title>|

    Using (.*?) instead of the character class with {1,75} makes the match a little less greedy.
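One caveat is worth checking against real input. As displayed in this thread, several sample lines have a space after <title> or between GRP and the first dash, and the posted pattern does not allow for either. Whether those spaces exist in the actual file is not clear from the thread. The sketch below compares the posted regex with a hypothetical whitespace-tolerant variant (`$tolerant` is not from the thread); the third test line, without the space after <title>, is also an assumption.

```perl
use strict;
use warnings;

# The regex as posted in the final reply.
my $posted = qr|<title>GRP\-\d{1,3}(?:\-\d{1,3})?(?:\s)?(?:\-\s)?(.*?)</title>|;

# Hypothetical variant that tolerates optional whitespace (an assumption,
# not something proposed in the thread).
my $tolerant = qr{<title>\s*GRP\s*-\d+(?:-\d+)?\s+-\s+(.*?)\s*</title>};

my @lines = (
    '<title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>',
    '<title> GRP-124 - Alkaline Rust Remover Solution</title>',
    # Hypothetical line with no space after <title>:
    '<title>GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)</title>',
);

my (@posted_hits, @tolerant_hits);
for my $line (@lines) {
    my ($p) = $line =~ $posted;     # undef when the posted regex fails
    my ($t) = $line =~ $tolerant;   # undef when the variant fails
    push @posted_hits,   $p;
    push @tolerant_hits, $t;
    printf "posted: %-8s tolerant: %s\n",
        (defined $p ? 'match' : 'no match'),
        (defined $t ? $t      : '(no match)');
}
```

If the real file never has those extra spaces, the posted regex is sufficient as it stands.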

    Thanks again for all your help.

    -- Grey Fox
    "We are grey. We stand between the darkness and the light" B5