maha has asked for the wisdom of the Perl Monks concerning the following question:

Hello!

I want to split a string into multiple strings, as in the example below.

(e.g)

<P_contrib-author>Allan Wigfield, Susan L. Klauda, and Jenna Cambria</P_contrib-author>

output should be

<contrib-group>
<name><surname>Wigfield</surname><given-names>Allan</given-names></name>
<name><surname>Klauda</surname><given-names>Susan L.</given-names></name>
<name><surname>Cambria</surname><given-names>Jenna</given-names></name>
</contrib-group>

I used following code

@arr = split(/(,|and)/,$cont)

but it could't show the full content

Replies are listed 'Best First'.
Re: split function using multiple delimiters
by ambrus (Abbot) on Dec 27, 2011 at 12:35 UTC

    This task has two parts.

    The first part is to split the string into a list of individual names. They're separated by commas and the word "and". This is possible, and you almost got it right. Try something like

    split /(?:\s*,\s*and\s+|\s+and\s+|\s*,\s*)/
    There are two important points here: don't use capturing parentheses in the regex (if you must use grouping, use (?: ) non-capturing groups), and make sure to match "and" only when it's a separate word, not when it's inside a name. The order of the alternatives also matters: the ", and" case has to come first, otherwise the plain comma alternative wins and leaves "and" stuck to the front of the last name.
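
    To make that concrete, here is a minimal sketch run against the author list from the question (with the surrounding tags already removed, which is an assumption of this example). The variable names are illustrative only.

```perl
use strict;
use warnings;

my $authors = 'Allan Wigfield, Susan L. Klauda, and Jenna Cambria';

# Longest alternative first, so ", and " is consumed as a single delimiter.
# \s+and\s+ requires whitespace on both sides, so an "and" inside a name
# (for example "Sandra") is never treated as a separator.
my @names = split /(?:\s*,\s*and\s+|\s+and\s+|\s*,\s*)/, $authors;

print "$_\n" for @names;
```

    Because the group is non-capturing, the list holds only the three names and none of the delimiters.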

    The second part of the task is to take each name and break it into a surname and given names. That is, sadly, impossible in the general case.
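
    For names as regular as the ones in the question, a naive heuristic can still produce the requested output. The following is only a sketch: it assumes the last whitespace-separated token is the surname, which is wrong for many real names (multi-word surnames, suffixes, family-name-first conventions). name_to_xml is a hypothetical helper, and no XML escaping is done.

```perl
use strict;
use warnings;

# Hypothetical helper: treats the last whitespace-separated token as the
# surname and everything before it as the given names.
sub name_to_xml {
    my ($name) = @_;
    $name =~ s/^\s+|\s+$//g;    # trim surrounding whitespace

    # Greedy (.*) pushes the split point to the last run of whitespace.
    my ($given, $surname) = $name =~ /^(.*)\s+(\S+)$/
        or return "<name><surname>$name</surname></name>";    # single-word name

    return "<name><surname>$surname</surname>"
         . "<given-names>$given</given-names></name>";
}

my @names = ('Allan Wigfield', 'Susan L. Klauda', 'Jenna Cambria');

my $xml = join ' ',
    '<contrib-group>',
    (map { name_to_xml($_) } @names),
    '</contrib-group>';

print "$xml\n";
```

    Feeding it a name such as 'Frank de Leo' yields the surname 'Leo', which shows why this cannot be trusted on uncontrolled data.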

Re: split function using multiple delimiters
by ww (Archbishop) on Dec 27, 2011 at 11:44 UTC
    "it could't show the full content"
    Neither does your question provide "full content" when the phrase is used to mean "a clear description" of your problem.

    The split you show, executed against the source string (presumptively $cont is the string you showed initially), produces an array as you appear to expect. The content of that array, via Data::Dumper, is:

    $VAR1 = '<P_contrib-author>Allan Wigfield';
    $VAR2 = ',';
    $VAR3 = ' Susan L. Klauda';
    $VAR4 = ',';
    $VAR5 = ' ';
    $VAR6 = 'and';
    $VAR7 = ' Jenna Cambria</P_contrib-author>';
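
    The extra ',' and 'and' elements are there because split returns whatever the capturing parentheses in the pattern matched as additional fields. A small sketch of the difference, using the author list without its tags (an assumption made to keep the example short):

```perl
use strict;
use warnings;

my $cont = 'Allan Wigfield, Susan L. Klauda, and Jenna Cambria';

# Capturing group: each matched delimiter is returned as its own field.
my @with_capture = split /(,|and)/, $cont;

# Non-capturing group: delimiters are discarded.
my @without_capture = split /(?:,|and)/, $cont;

print scalar(@with_capture),    " fields with capturing parentheses\n";
print scalar(@without_capture), " fields with a non-capturing group\n";
```

    Even without the captures, the fields still carry leading spaces and there is a lone ' ' between the last comma and "and", so the pattern also needs to absorb the surrounding whitespace.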

    So... is your question

    • how to clean up the <P_contrib-author> tags
    • how to convert the output of your split (which follows the procedure outlined in perldoc -f split) to XML (in which case, you'll be well served by searching this site for "XML")
    • or something else?

    WAG: You might find producing the desired output easier if you deal with the tags first and the substantive content second. If your data is uniform -- contains (identical) <P_contrib-author> tags each time you need to extract authors -- identify, modify, and strip them out for your later use before dealing with the authors list.
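
    A minimal sketch of that first step, assuming every record wraps its author list in exactly one <P_contrib-author> element as in the question. For anything less uniform, a real XML parser is the safer choice.

```perl
use strict;
use warnings;

my $cont = '<P_contrib-author>Allan Wigfield, Susan L. Klauda, and Jenna Cambria</P_contrib-author>';

# Capture only the text between the opening and closing tags,
# dropping any whitespace just inside them.
my ($authors) = $cont =~ m{<P_contrib-author>\s*(.*?)\s*</P_contrib-author>}s
    or die "no <P_contrib-author> element found\n";

print "$authors\n";
```

    With the tags out of the way, $authors holds only the list of names and can be passed to split.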

    But, I suspect the style of your data will vary from item to item... omitting the Harvard comma, using an ampersand instead of "and" and so on... so the overall solution won't be a single PATTERN for use in split, anyway.
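
    For the two variations named above, one pattern can be stretched a little further. This is a sketch only, assuming the only differences are an optional comma before the conjunction and "&" in place of "and"; it does nothing for "Surname, Given" ordering, "et al.", semicolons, or any other citation style.

```perl
use strict;
use warnings;

# A comma, optionally followed by "and" or "&", or a bare "and"/"&"
# surrounded by whitespace. No capturing groups, so split drops the delimiters.
my $sep = qr/\s*,\s*(?:(?:and|&)\s+)?|\s+(?:and|&)\s+/;

my @samples = (
    'Allan Wigfield, Susan L. Klauda, and Jenna Cambria',
    'Allan Wigfield, Susan L. Klauda and Jenna Cambria',
    'Allan Wigfield, Susan L. Klauda & Jenna Cambria',
);

for my $s (@samples) {
    my @names = split $sep, $s;
    print join(' | ', @names), "\n";
}
```

    All three samples come out as the same three names, but each new style that turns up in the data means another alternative in the pattern, which is the point being made here.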

      You're sure in the running for the "unhelpful reply of the day" award (OMG! He didn't "use strict" either!), but at least you said one useful thing:
      But, I suspect the style of your data will vary from item to item... omitting the Harvard comma, using an ampersand instead of "and" and so on... so the overall solution won't be a single PATTERN for use in split, anyway.
      Parsing citations is pretty hard unless they are machine-generated. There are a bunch of formats, and people regularly screw them up.

        I thought the results of the code from the OP were helpful. They would have been nice to see in the OP.

        You're sure in the running for the "unhelpful reply of the day" award
        I think ww's response is the best post in this thread.
Re: split function using multiple delimiters
by zwon (Abbot) on Dec 27, 2011 at 11:23 UTC
    my @arr = split /,\s*(?:and\s*)?/, $str;
Re: split function using multiple delimiters
by jhourcle (Prior) on Dec 29, 2011 at 18:56 UTC

    As someone who's had to parse author lists before, let me just say that unless they're clean coming in, you can run into a *lot* of problems if you just try to split. You also have the 'Susan L.' example of a given name, but you might also run into a 'Frank de Leo', where the last name would be 'de Leo', not 'Leo'.

    It looks like Biblio::Citation::Parser hasn't seen an update in 7 years, but it's likely that it's a solved problem and isn't in need of updates (unless someone wants to add DOI or other ID handling). It's intended to take a full citation, so you might have to look at how they're parsing the author string -- look for sub find_authors in Biblio::Citation::Parser::Citebase.

    As there are quite a few people in the libraries using Perl, you could ask whether there are any better parsers out there on the code4lib mailing list, which has lots of Perl folks on it, or on perl4lib, which is lower volume (but more focused in scope).

Re: split function using multiple delimiters
by cavac (Prior) on Dec 27, 2011 at 19:03 UTC

    Update: Ignore this post, I was reading the requirement the wrong way around.

    While this does not really answer the question asked: If the example data matches your real-life data, you might try XML::Simple, since your data looks very much like XML.

    BREW /very/strong/coffee HTTP/1.1
    Host: goodmorning.example.com
    
    418 I'm a teapot