redhotpenguin has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks,

While my regex tuits have been getting better, I'm faced with a challenge which is throwing me for a loop. I'm trying to extract values from a string (maybe a regex isn't the right way to do this).

The data I'm trying to parse:
my $string = <<STRING;
A Title
What: A description of what it is
Date added: January 16th, 2006 ( an optional source )
Data: some data

Another Title
What: Another description
Date added: April 20th, 2005
Data: some other data
STRING

I can parse most of this, but what I need to do is assign a default value for ( an optional source ) if the optional source is missing (the default value would be ''). Here's what I have so far:

my @fizzbin = ( $string =~ m{
    (\w+)\n                       # Grab the title
    What:\s+([^\n]+)\n            # Grab the what
    Date\sadded:\s+([^\(^\n]+)    # Date added
    (\([^\n]+\))\n                # Optional source
                                  # I want to use a default
                                  # value if nothing is
                                  # captured
    Data:\s+([^\n]+)\n            # Grab the data
}xmgs);

This regex works fine if I don't try to grab the optional source. It returns a flat array, which is not ideal but provides the data I'm looking for ( $title1, $what1, $date1, $data1, $title2, $what2, ... ). I don't know how to assign a default value to the source capture. Any advice on that is appreciated, including alternate methods of parsing.

Thanks in advance.

Re: Default value for capture in regular expression
by ikegami (Patriarch) on Jan 16, 2006 at 20:55 UTC
    Consider
    'ab' =~ /a(z?)b/    # Returns ''.
    'ab' =~ /a(z)?b/    # Returns undef.
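A small sketch of how one might check this difference. In list context the match returns its captures, so each result can be tested with defined. The variable names $inside and $outside are made up for illustration.

```perl
use strict;
use warnings;

# Quantifier inside the group: the group always participates and can match ''.
my ($inside)  = 'ab' =~ /a(z?)b/;

# Quantifier on the group: the group is skipped, so its capture is undef.
my ($outside) = 'ab' =~ /a(z)?b/;

print defined $inside  ? "inside:  defined, '$inside'\n" : "inside:  undef\n";
print defined $outside ? "outside: defined\n"            : "outside: undef\n";
```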
    So (\([^\n]+\)) is optional? Then let's add a ? as follows:
    my @fizzbin = ( $string =~ m{
        (\w+)\n                       # Grab the title
        What:\s+([^\n]+)\n            # Grab the what
        Date\sadded:\s+([^\(^\n]+)    # Date added
        (\([^\n]+\))?\n               # Optional source
        Data:\s+([^\n]+)\n            # Grab the data
    }xmgs);

    After looking at my first snippet, we determine that the source will be undef if the source is omitted. Let's rearrange your code to use named variables like you wanted, and let's integrate the default. We get the following:

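    while ( $string =~ m{
        (\w+)\n                       # Grab the title
        What:\s+([^\n]+)\n            # Grab the what
        Date\sadded:\s+([^\(^\n]+)    # Date added
        (\([^\n]+\))?\n               # Optional source
        Data:\s+([^\n]+)\n            # Grab the data
    }xmgs ) {
        # Match in scalar context so that //g advances one record per
        # iteration; a list assignment in the condition would return every
        # match at once and never terminate.
        my ($title, $what, $date, $source, $data) = ($1, $2, $3, $4, $5);
        $source = '' if not defined $source;    # Use default ''.
        ...
    }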

    By the way,

    • the m modifier is useless, since you don't use ^ or $;
    • the s modifier is also useless, since you don't use .; and
    • Date\sadded could be replaced with Date[ ]added to improve readability.
Re: Default value for capture in regular expression
by ww (Archbishop) on Jan 16, 2006 at 20:47 UTC
    You might capture the optional source together with the date in a first pass and then, inside a compound existence test, s/// such source data as does exist to $data1, using the else clause to insert "No source indicated" or whatever as the new value for $data1.

    Inelegant, and maybe not efficient, but sometimes 2 regexen are better than one.
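One possible reading of this two-pass idea, as a rough sketch: the first regex grabs everything after "Date added:", and a second s/// peels off a trailing parenthesised source if there is one. The @parsed array and the exact patterns are assumptions for illustration, not the poster's code.

```perl
use strict;
use warnings;

my $string = <<'STRING';
A Title
What: A description of what it is
Date added: January 16th, 2006 ( an optional source )
Data: some data

Another Title
What: Another description
Date added: April 20th, 2005
Data: some other data
STRING

my @parsed;
# First pass: capture the date together with any source that follows it.
while ( $string =~ /^Date added:\s+(.+)$/mg ) {
    my $date = $1;
    my $source;
    # Second pass: remove a trailing "( ... )" from the date, if present.
    if ( $date =~ s/\s*(\([^)]*\))\s*\z// ) {
        $source = $1;
    }
    else {
        $source = 'No source indicated';
    }
    push @parsed, [ $date, $source ];
}

print "$_->[0] => $_->[1]\n" for @parsed;
```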

Re: Default value for capture in regular expression
by bmcnett (Novice) on Jan 17, 2006 at 04:19 UTC
    In cases like these I forego regexps for some kind of data language, such as Perl itself or JSON. I'd set up my defaults as a hash, then read a JSON object from a string (or a file, or wherever), then do a "hash slice" to override the defaults.
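    my %stuff = (
        Title        => 'some default',
        What         => 'some default',
        'Date added' => 'some default',
        Data         => 'some default',
    );

    use JSON;
    # jsonToObj is the JSON 1.x name; JSON 2.x and later provide
    # from_json / decode_json instead.
    my $newstuff = jsonToObj( <<STRING );
    {
        "Title": "Gone With The Wind",
        "Date added": "January 16th, 2006 ( an optional source )"
    }
    STRING

    # Hash slice: override the defaults with whatever keys were supplied.
    @stuff{keys %$newstuff} = values %$newstuff;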
Re: Default value for capture in regular expression
by blazar (Canon) on Jan 17, 2006 at 08:45 UTC

    I wouldn't do that with a (single) regex at all. There are tons of other ways: in particular in this case your data seems fairly regular. Thus one possibility (and just one amongst the many WTDI) may be:

    1. read in by paragraphs;
    2. split on \n;
    3. parse the single lines to get, say, ($data, $value);
    4. use a hash:
      my %default = (what => 'wtf?', date => 'no date', data => 'no data');
      # ...
      unless (exists $default{$data}) {
          warn "unexpected field: $data";
          # ...
      }
      $value ||= $default{$data};
    Of course //= would be better suited for this, but since we don't have it yet unless we install a patch, ||= should be enough. If it is not, or in case of doubt, use your own defined test instead.
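The steps above could be sketched roughly as follows. This assumes records are separated by blank lines (so paragraph mode applies), that the first line of each paragraph is the title, and that field names are lower-cased before lookup. The %default keys, the source field, and the @records array are hypothetical choices for the example, and an explicit defined/length test stands in for ||=.

```perl
use strict;
use warnings;

my $string = <<'STRING';
A Title
What: A description of what it is
Date added: January 16th, 2006 ( an optional source )
Data: some data

Another Title
What: Another description
Date added: April 20th, 2005
Data: some other data
STRING

# Hypothetical defaults, keyed by lower-cased field name.
my %default = (
    what         => 'wtf?',
    'date added' => 'no date',
    data         => 'no data',
    source       => '',
);

my @records;
open my $fh, '<', \$string or die "cannot open in-memory handle: $!";
{
    local $/ = '';                                  # 1. paragraph mode
    while ( my $para = <$fh> ) {
        my ($title, @lines) = split /\n/, $para;    # 2. split on newlines
        my %rec = ( %default, title => $title );    # 4. start from defaults
        for my $line (@lines) {
            # 3. parse "field: value"; skip lines that do not fit
            my ($field, $value) = $line =~ /^([^:]+):\s*(.*)$/ or next;
            $field = lc $field;
            unless ( exists $default{$field} ) {
                warn "unexpected field: $field";
                next;
            }
            $rec{$field} = $value if defined $value && length $value;
        }
        # The optional "( source )" shares a line with the date; peel it off.
        $rec{source} = $1 if $rec{'date added'} =~ s/\s*(\([^)]*\))\s*\z//;
        push @records, \%rec;
    }
}
close $fh;

print "$_->{title}: source='$_->{source}'\n" for @records;
```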