mariuspopovici has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I'm learning Perl and, while working on a small project, I came upon this issue: I need to extract the substring marked as XXX in the following text using a regex:

"Label1 : XXX Label2 : aaaa Label3 : bbbb Label5 : ccc and so on...."

The regular expression should probably look like this:

$txt =~ /Label1 : (.*) Label2 :/i

The catch is that the second label is arbitrary: it's not always "Label2" and could be some other text. Basically, I want to extract the value after the first label. The length of this value is unknown.

I hope I'm making some sense here :) Thanks for any help.

Marius

Re: RegEx question
by McDarren (Abbot) on Oct 01, 2006 at 04:05 UTC
    If your "XXX" isn't expected to contain any whitespace, then you could grab everything after "Label1 :" until you hit some whitespace, e.g.:
    $txt =~ /^Label1 : (.*?)\s+/

    Note the anchoring "^" in the above, and the use of ".*?" to make it non-greedy (if that's actually what is required).
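
To make the greedy versus non-greedy difference concrete, here is a minimal sketch run against the sample string from the question. The variable names are only for illustration, and the greedy variant is added purely for comparison; it is not something suggested in the thread.

```perl
use strict;
use warnings;

# Sample text taken from the original question.
my $txt = "Label1 : XXX Label2 : aaaa Label3 : bbbb Label5 : ccc";

# Non-greedy: .*? stops at the first run of whitespace after the value.
my ($lazy) = $txt =~ /^Label1 : (.*?)\s+/;

# Greedy (for comparison only): .* grabs as much as it can and backs off
# just far enough to leave one whitespace character for \s+.
my ($greedy) = $txt =~ /^Label1 : (.*)\s+/;

print "non-greedy: $lazy\n";
print "greedy:     $greedy\n";
```

The non-greedy match yields just "XXX", while the greedy one runs on to the last whitespace in the string.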

    However.... it seems to me that split would be a better tool for this task:

    my ($wanted) = (split /\s?Label\d\s:\s/, $txt)[1];
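
As a rough illustration of why index 1 is the one to take, the sketch below keeps all the fields from the same split. It assumes the labels really are "Label" followed by a single digit, as in the first example; the @fields array is introduced here only for illustration.

```perl
use strict;
use warnings;

# Sample text taken from the original question.
my $txt = "Label1 : XXX Label2 : aaaa Label3 : bbbb Label5 : ccc";

# Splitting on the labels leaves only the values. The string starts with
# a label, so the first field is the empty string before "Label1 : ",
# which puts the wanted value at index 1.
my @fields = split /\s?Label\d\s:\s/, $txt;
my $wanted = $fields[1];

print "wanted: $wanted\n";
print "fields: ", scalar(@fields), "\n";
```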

    Cheers,
    Darren :)

      Thanks for your answers,

      Unfortunately, the text can contain whitespace, several words, sometimes comma separated.

      The second example would not work because I chose a misleading example. The labels are not necessarily named Label1, Label2, ... LabelN. They can be different words. A better example would be:

      "some text .... Programming Languages: C++, Java Author: John Date Created: 20004-01-05 10:23 ....."

      In this case, I would need to extract the string: "C++, Java".

      Again, thanks for taking the time to look at this.

      Marius

        Well then, you've got a problem, but I'm going to propose a solution. First, I'll try to explain the problem.

        If one label is "Programming Languages:", another label is "Author:", and another is "Date Created:", you clearly cannot count on the labels not having whitespace. And if your data fields are "C++, Java", "John", "2004-01-05 10:23...", you clearly cannot count on your data fields not containing whitespace. Your fields aren't of fixed width either. And your delimiter (the colon) appears mid-record, so it's more of an anchor than a delimiter, which doesn't help tremendously. That leaves you with no good way of determining where a data field ends and where a new label starts... unless, of course, you're lucky enough to know all the possible labels.

        Maybe you could instead skim for known labels. That would be helpful. For example, if you know that the only labels in the text are "Programming Languages", "Author", and "Date Created", you could compose your regular expression like this:

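        my $labels = qr/Programming Languages|Author|Date Created/;
        my $re     = qr/($labels):(.+?)(?=$labels|$)/;

        # Match in scalar context so that /g advances one match per iteration.
        # (A list assignment in the while condition would restart from the
        # beginning of the string every time and never terminate.)
        while ( $text =~ m/$re/g ) {
            my ( $label, $data ) = ( $1, $2 );
            print "Label: $label\tData: $data\n";
        }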

        This will capture the known label into $label on each iteration, and then the field following the label into $data. Each match stops as soon as the lookahead assertion finds the next known label, or the end of the string.
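
For a self-contained version to experiment with, here is a sketch of the same known-labels idea that collects the fields into a hash. The \s* trimming, the %field hash, and the use of \z for end of string are additions for illustration, and the sample line is adapted from the example earlier in the thread; treat the label list as an assumption about the real data.

```perl
use strict;
use warnings;

# Sample line adapted from the thread; the label list is assumed complete.
my $text = "some text .... Programming Languages: C++, Java "
         . "Author: John Date Created: 2004-01-05 10:23";

my $labels = qr/Programming Languages|Author|Date Created/;

# Capture a known label, then lazily capture its value up to the next
# known label (followed by a colon) or the end of the string. The \s*
# on either side keeps surrounding spaces out of the captured value.
my $re = qr/($labels):\s*(.+?)\s*(?=(?:$labels):|\z)/;

my %field;
while ( $text =~ /$re/g ) {
    $field{$1} = $2;
}

print "$_ => $field{$_}\n" for sort keys %field;
```

With this sample, $field{'Programming Languages'} comes out as "C++, Java". Any label missing from the list would simply be absorbed into the preceding value, which is the limitation described above.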


        Dave

        Are you getting the original data one entry at a time, or are several entries munged together?

        The solution is simple if you get the data one entry per line and there is no more than one word for the second label ('Author' in your example). A slight modification of McDarren's sample is what you are after:

        use strict;
        use warnings;

        my $line = "some text Programming Languages: C++, Java Author: John Date Created: 20004-01-05 10:23";

        my ($text) = $line =~ /
            ^[^:]*          # Skip everything from the start of line until the first :
            :\s+            # Skip the : and any trailing white space
            (               # Capture
                (?:         # Group, but don't capture
                    (?!     # Look ahead and fail to match if given pattern found
                        \s+\w+: # Pattern to fail on - space word :
                    ).      # Capture a character if the look ahead didn't fail
                )*          # Do it as many times as possible
            )               # Close the capture
            /x;             # x flag ignores most white space and allows comments

        print $text;

        Prints:

        C++, Java

        DWIM is Perl's answer to Gödel

        I'm thinking that a good approach to this may be to split your data into a hash, but it's difficult to be sure about that given the data that you've shown.

        Could you post 3-4 full lines of the actual data that you are working with? (edit anything that may be sensitive, of course).

        Update: Just looked at this again. Will the word "Author" always follow the text that you need to extract?

        If yes, then you could probably just do:

        /:\s(.*?)Author:/

        But again, difficult to say for certain without seeing a few more lines of data and having the requirements clarified a bit.
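
A quick sketch of that pattern against the sample line, assuming "Author:" always follows the wanted text. The second pattern, with \s* before "Author:", is a variant added here for illustration and is not from the thread.

```perl
use strict;
use warnings;

# Sample line adapted from the thread.
my $line = "some text Programming Languages: C++, Java "
         . "Author: John Date Created: 2004-01-05 10:23";

# Pattern as posted: the space before "Author:" ends up in the capture.
my ($raw) = $line =~ /:\s(.*?)Author:/;

# Variant (an assumption): \s* absorbs that trailing space instead.
my ($trimmed) = $line =~ /:\s(.*?)\s*Author:/;

print "[$raw]\n";
print "[$trimmed]\n";
```

Both patterns start at the first colon in the line, so any colon in the leading "some text" portion would shift the capture.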

        A dataset, please. Provide a complete dataset that shows your problem. And a piece of code from your keyboard to see where you're stuck.

        It's annoying to find out that a given solution doesn't fit because the problem wasn't explained clearly in the first place.

        --shmem

        _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                      /\_¯/(q    /
        ----------------------------  \__(m.====·.(_("always off the crowd"))."·
        ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: RegEx question
by Hue-Bond (Priest) on Oct 01, 2006 at 04:03 UTC

    Simple: just use ^ to anchor the regex to the start of the string:

    $txt =~ /^Label1 : (\S+)/i;

    Update: This only works if the value doesn't contain spaces, as McDarren correctly notes below.
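
To show the limitation mentioned in the update, here is a small sketch using both sample strings from the thread. The second pattern swaps in the "Programming Languages" label purely for illustration.

```perl
use strict;
use warnings;

# First sample from the question: the value has no whitespace.
my $txt = "Label1 : XXX Label2 : aaaa Label3 : bbbb";
my ($simple) = $txt =~ /^Label1 : (\S+)/i;

# Second sample: the value contains a space, so \S+ stops too early.
my $txt2 = "Programming Languages: C++, Java Author: John";
my ($cut) = $txt2 =~ /^Programming Languages: (\S+)/i;

print "simple: $simple\n";
print "cut:    $cut\n";
```

The first capture is "XXX" as intended, but the second stops at "C++," and loses "Java".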

    --
    David Serrano

Re: RegEx question
by lyklev (Pilgrim) on Oct 01, 2006 at 19:22 UTC

    Work out how you, as a human, would extract the things you want. Describe it in words, expressions, whatever, as long as it is unambiguous. Try to define a set of rules that will work, then try them in your head on your sample data.

    With those rules posted here, it can't be too difficult.