PerlMonks  

Regex only returning partial data

by Grey Fox (Chaplain)
on Jun 06, 2008 at 17:29 UTC ( [id://690711]=perlquestion )

Grey Fox has asked for the wisdom of the Perl Monks concerning the following question:

Hello Fellow Monks;

I'm having a difficult time getting a regex to stop after finding the first occurrence of a string. I am extracting title information from an SGML file, and the regex is only returning partial information.

I'm using the following regex.

m/\s(?:-\s)?([\w\s\d()-,]{1,75})<\/title>/
against the following data.
<title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>
<title> GRP-123-1 - Grinding And Cutting Solution (Quakeroat 2780 UTC - Synthetic Type)</title>
<title> GRP-124 - Alkaline Rust Remover Solution</title>
<title> GRP-124-1 - Alkaline Rust Remover Solution (Ardvark 185 - Rust Remover)</title>
<title> GRP-124-2 - Alkaline Rust Remover Solution (Ardvark 185L - Rust Remover)</title>
<title> GRP-124-3 - Alkaline Rust Remover Solution (Bee-Dee J84AL - Rust Remover)</title>
<title> GRP-124-4 - Alkaline Rust Remover Solution (Mag HD2-202 - Rust Remover)</title>
<title> GRP-124-5 - Alkaline Rust Remover Solution (Turk 4181L - Rust Remover)</title>
<title> GRP-124-6 - Alkaline Rust Remover Solution (Turk 4181 - Rust Remover)</title>
<title> GRP-124-7 - Alkaline Rust Remover Solution (Bee-Dee J84A - Rust Remover)</title>
<title> GRP-124-8 - Alkaline Rust Remover Solution (Cadilac HTP-1150 - Rust Remover)</title>
<title> GRP-124-9 - Alkaline Rust Remover Solution (Cadilac HTP-1150L - Rust Remover)</title>
<title> GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)";

Instead of getting "Grinding And Cutting Solution (ACME PR50 - Water Type)", I only get "Water Type)", because of the second occurrence of " - " in the data. I know there is a way to make the regex match only the first occurrence and then return all the rest of the text up to </title>.

I've looked at perlre and http://www.regular-expressions.info/quickstart.html

Thanks.

Note: Added more examples
-- Grey Fox
"We are grey. We stand between the darkness and the light" B5

Replies are listed 'Best First'.
Re: Regex only returning partial data
by toolic (Bishop) on Jun 06, 2008 at 17:42 UTC
    If I escape the dash inside your character class, I get more:
    $_ = '<title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>';
    if (m/\s(?:-\s)?([\w\s\d()\-,]{1,75})<\/title>/) {
        print "$1\n";
    }

    prints:

    -134 - Grinding And Cutting Solution (ACME PR50 - Water Type)

    Update: Also, the \d is not necessary since you already use \w. You should also consider using one of the CPAN HTML parser modules to grab the contents of the <title> elements.
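
A minimal sketch of why the unescaped dash matters, using the first sample line from the question. The variable names are illustrative and not from the thread. Inside a character class, `)-,` is read as the range from `)` to `,` (that is, `)`, `*`, `+` and `,`), so a literal `-` is not part of the class and the match cannot cross " - ".

```perl
use strict;
use warnings;

# Illustrative names; only the two patterns come from the thread.
my $line = '<title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>';

# Unescaped: ")-," is the range ) * + , so "-" itself is excluded.
my ($unescaped) = $line =~ m/\s(?:-\s)?([\w\s\d()-,]{1,75})<\/title>/;

# Escaped: "\-" is a literal dash, so the class can span " - ".
my ($escaped) = $line =~ m/\s(?:-\s)?([\w\s\d()\-,]{1,75})<\/title>/;

print "unescaped: [$unescaped]\n";
print "escaped:   [$escaped]\n";

# The range also admits characters that were probably never intended.
print "'*' matches [()-,]\n" if '*' =~ /^[()-,]$/;
```

The brackets in the output make the trailing space in each capture visible. Putting the dash first or last in the class (for example `[\w\s(),-]`) should have the same effect as escaping it.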

Re: Regex only returning partial data
by throop (Chaplain) on Jun 06, 2008 at 17:54 UTC
    There are three dashes in your example, not two. And a previous poster's comments about escaping the dash apply. So I'm unsure just what you're trying to do. Still, you're wanting to change the greedy behavior of [...]{1,75}. Check out "What does it mean that regexes are greedy?" in perlfaq6.
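
For readers unfamiliar with the term, here is a small sketch of greedy versus non-greedy matching. The input string is made up for illustration: two title elements on one line.

```perl
use strict;
use warnings;

# Made-up input: two <title> elements on a single line.
my $text = '<title>First</title> <title>Second</title>';

# Greedy: .* takes as much as it can, then backs off to the LAST </title>.
my ($greedy) = $text =~ m{<title>(.*)</title>};

# Non-greedy: .*? takes as little as it can, stopping at the FIRST </title>.
my ($lazy) = $text =~ m{<title>(.*?)</title>};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Note that both forms still start at the leftmost position where a match is possible. Greediness only affects how far the quantified part extends from there.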

    throop

    update: you've added more examples, so let me give you some more comments –

    Are you ever going to have line breaks in your titles?
    Are all your titles going to start with GRP?
    I suggest:

    m| \s* GRP \- \d+ (?: \s? \-\d+)   # The GRP intro
       \s+ \- \s+                      # The dash
       [\w\s\d()\-,]{1,75}             # I have doubts about this
       <\/title>
     |xms

    If this looks unfamiliar, check out How can I hope to use regular expressions.. I have some doubts that you really want the spec [\w\s\d()\-,]{1,75}. That is, are you really confident that you're not going to see a line like

    <title> GRP-124-9 - Alkaline Rust Remover Solution (Yugo HTP-1150L - Rust &amp; Stain Remover)</title>

    and lose on the '&amp;'?
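
The concern about '&amp;' can be checked with a short sketch. The two patterns below are simplified stand-ins written for this illustration, not the exact patterns posted in the thread: one uses the restrictive character class, the other a non-greedy `.*?`.

```perl
use strict;
use warnings;

# The hypothetical line from the reply above, with the entity in the title.
my $line = '<title> GRP-124-9 - Alkaline Rust Remover Solution '
         . '(Yugo HTP-1150L - Rust &amp; Stain Remover)</title>';

# Restrictive class: '&' and ';' are not listed, so no run of class
# characters can reach </title> and the match fails.
my ($by_class) = $line =~ m{ \s - \s ([\w\s()\-,]{1,75}) </title> }x;

# Permissive: .*? accepts any character up to the closing tag.
my ($by_dot) = $line =~ m{ \s - \s (.*?) \s* </title> }x;

print "class: ", ( defined $by_class ? $by_class : '(no match)' ), "\n";
print "dot:   ", ( defined $by_dot   ? $by_dot   : '(no match)' ), "\n";
```

Under these assumptions the class-based pattern finds nothing, while the `.*?` version captures the whole description, entity included.
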
Re: Regex only returning partial data
by johngg (Canon) on Jun 06, 2008 at 20:48 UTC
    You can use regex look-around assertions and negated character classes to achieve this.

    use strict;
    use warnings;

    my $rxTitle = qr
        {(?x)                # Use extended syntax
            (?<=-\s)         # If preceded by hyphen & space
            (                # Open capture
                [^(-]+       # One or more non- opening parentheses
                             # or hyphens
                (?:          # Non-capture group for quantifier
                    [^)]+    # One or more non- closing parentheses
                    \)       # Followed by a closing parenthesis
                )?           # Quantify, zero or one of
            )                # Close capture
            (?=\s?</title>)  # If followed by optional space and
                             # closing title tag
        };

    while ( <DATA> )
    {
        my ( $text ) = m{$rxTitle};
        print qq{$text\n};
    }

    __END__
    <title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>
    <title> GRP-123-1 - Grinding And Cutting Solution (Quakeroat 2780 UTC - Synthetic Type)</title>
    <title> GRP-124 - Alkaline Rust Remover Solution</title>
    <title> GRP-124-1 - Alkaline Rust Remover Solution (Ardvark 185 - Rust Remover)</title>
    <title> GRP-124-2 - Alkaline Rust Remover Solution (Ardvark 185L - Rust Remover)</title>
    <title> GRP-124-3 - Alkaline Rust Remover Solution (Bee-Dee J84AL - Rust Remover)</title>
    <title> GRP-124-4 - Alkaline Rust Remover Solution (Mag HD2-202 - Rust Remover)</title>
    <title> GRP-124-5 - Alkaline Rust Remover Solution (Turk 4181L - Rust Remover)</title>
    <title> GRP-124-6 - Alkaline Rust Remover Solution (Turk 4181 - Rust Remover)</title>
    <title> GRP-124-7 - Alkaline Rust Remover Solution (Bee-Dee J84A - Rust Remover)</title>
    <title> GRP-124-8 - Alkaline Rust Remover Solution (Cadilac HTP-1150 - Rust Remover)</title>
    <title> GRP-124-9 - Alkaline Rust Remover Solution (Cadilac HTP-1150L - Rust Remover)</title>
    <title> GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)</title>

    The output.

    Grinding And Cutting Solution (ACME PR50 - Water Type)
    Grinding And Cutting Solution (Quakeroat 2780 UTC - Synthetic Type)
    Alkaline Rust Remover Solution
    Alkaline Rust Remover Solution (Ardvark 185 - Rust Remover)
    Alkaline Rust Remover Solution (Ardvark 185L - Rust Remover)
    Alkaline Rust Remover Solution (Bee-Dee J84AL - Rust Remover)
    Alkaline Rust Remover Solution (Mag HD2-202 - Rust Remover)
    Alkaline Rust Remover Solution (Turk 4181L - Rust Remover)
    Alkaline Rust Remover Solution (Turk 4181 - Rust Remover)
    Alkaline Rust Remover Solution (Bee-Dee J84A - Rust Remover)
    Alkaline Rust Remover Solution (Cadilac HTP-1150 - Rust Remover)
    Alkaline Rust Remover Solution (Cadilac HTP-1150L - Rust Remover)
    Alkaline Rust Remover (Titanium Long Soak)

    I hope this is useful.

    Cheers,

    JohnGG

Re: Regex only returning partial data
by FunkyMonk (Chancellor) on Jun 06, 2008 at 17:40 UTC
    You didn't give us much data to test against, but would this be good enough?
    $_ = "<title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>";
    print "/", m{\d+\s*-\s*(.*?)\s*</title>}, "/";
    #/Grinding And Cutting Solution (ACME PR50 - Water Type)/


    Unless I state otherwise, all my code runs with strict and warnings
      Thanks FunkyMonk, I added more examples. Also I am just trying to pick up the Title description, not the GRP-134 -
      Thanks
      -- Grey Fox
      "We are grey. We stand between the darkness and the light" B5
        OK, with more data it looks like you want to capture everything after " - " up to "</title>". So...
        while (<DATA>) {
            print "/", m{ - (.*?)\s*</title>}, "/\n";
        }
        __DATA__
        <title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>
        <title> GRP-123-1 - Grinding And Cutting Solution (Quakeroat 2780 UTC - Synthetic Type)</title>
        <title> GRP-124 - Alkaline Rust Remover Solution</title>
        <title> GRP-124-1 - Alkaline Rust Remover Solution (Ardvark 185 - Rust Remover)</title>
        <title> GRP-124-2 - Alkaline Rust Remover Solution (Ardvark 185L - Rust Remover)</title>
        <title> GRP-124-3 - Alkaline Rust Remover Solution (Bee-Dee J84AL - Rust Remover)</title>
        <title> GRP-124-4 - Alkaline Rust Remover Solution (Mag HD2-202 - Rust Remover)</title>
        <title> GRP-124-5 - Alkaline Rust Remover Solution (Turk 4181L - Rust Remover)</title>
        <title> GRP-124-6 - Alkaline Rust Remover Solution (Turk 4181 - Rust Remover)</title>
        <title> GRP-124-7 - Alkaline Rust Remover Solution (Bee-Dee J84A - Rust Remover)</title>
        <title> GRP-124-8 - Alkaline Rust Remover Solution (Cadilac HTP-1150 - Rust Remover)</title>
        <title> GRP-124-9 - Alkaline Rust Remover Solution (Cadilac HTP-1150L - Rust Remover)</title>
        <title> GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)";

        Output:

        /usr/bin/perl -w /home/bri/git/cvsid/pm
        /Grinding And Cutting Solution (ACME PR50 - Water Type)/
        /Grinding And Cutting Solution (Quakeroat 2780 UTC - Synthetic Type)/
        /Alkaline Rust Remover Solution/
        /Alkaline Rust Remover Solution (Ardvark 185 - Rust Remover)/
        /Alkaline Rust Remover Solution (Ardvark 185L - Rust Remover)/
        /Alkaline Rust Remover Solution (Bee-Dee J84AL - Rust Remover)/
        /Alkaline Rust Remover Solution (Mag HD2-202 - Rust Remover)/
        /Alkaline Rust Remover Solution (Turk 4181L - Rust Remover)/
        /Alkaline Rust Remover Solution (Turk 4181 - Rust Remover)/
        /Alkaline Rust Remover Solution (Bee-Dee J84A - Rust Remover)/
        /Alkaline Rust Remover Solution (Cadilac HTP-1150 - Rust Remover)/
        /Alkaline Rust Remover Solution (Cadilac HTP-1150L - Rust Remover)/
        //

        The empty last match is due to what I hope is a copy-paste error in the data you posted.
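
If lines without a closing tag can turn up in the real file, one option is to test whether the match succeeded before printing. This is only a sketch: the arrays `@found` and `@skipped` are names invented here, and the two input lines are copied from the sample data.

```perl
use strict;
use warnings;

my @lines = (
    '<title> GRP-124 - Alkaline Rust Remover Solution</title>',
    '<title> GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)";',
);

my ( @found, @skipped );    # hypothetical names, for illustration only
for my $line (@lines) {
    # A list assignment in a condition is true only when the match succeeded.
    if ( my ($desc) = $line =~ m{ - (.*?)\s*</title>} ) {
        push @found, $desc;
    }
    else {
        push @skipped, $line;
    }
}

print "found:   $_\n" for @found;
print "skipped: $_\n" for @skipped;
```

This keeps malformed lines out of the results while still letting you see which ones were rejected.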


        Unless I state otherwise, all my code runs with strict and warnings
Re: Regex only returning partial data
by Grey Fox (Chaplain) on Jun 06, 2008 at 20:18 UTC

    Thanks all

    I used all of your advice and came up with the following regex that does the trick.

    m|<title>GRP\-\d{1,3}(?:\-\d{1,3})?(?:\s)?(?:\-\s)?(.*?)</title>|

    Using (.*?) instead of the character class with {1,75} makes it a little less greedy.
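
One observation for later readers: as posted, the pattern expects `<title>GRP-` with no spaces, while some lines in the sample data read `<title> GRP-124` or `<title>GRP -134`. The sketch below is a hypothetical variant, not the poster's code. It assumes the real file has the same optional spaces as the posted sample and that the description always follows " - ".

```perl
use strict;
use warnings;

# Hypothetical variant: \s* tolerates the optional spaces seen in the
# sample data; \s+-\s+ requires the " - " separator before the description.
my $rx = qr{<title>\s*GRP\s*-\d{1,3}(?:-\d{1,3})?\s+-\s+(.*?)\s*</title>};

my @lines = (
    '<title>GRP -134 - Grinding And Cutting Solution (ACME PR50 - Water Type) </title>',
    '<title> GRP-124 - Alkaline Rust Remover Solution</title>',
    '<title> GRP-124-10 - Alkaline Rust Remover (Titanium Long Soak)</title>',
);

my @titles;
for my $line (@lines) {
    push @titles, $1 if $line =~ $rx;
}

print "$_\n" for @titles;
```

The trailing `\s*` before `</title>` also trims the space that the first sample line has before its closing tag.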

    Thanks again for all your help.

    -- Grey Fox
    "We are grey. We stand between the darkness and the light" B5
