Extracting multiple-asterisk-delimited substring with Text::Balanced

kba has asked for the wisdom of the Perl Monks concerning the following question:

greetings, perl people,

I'm writing a parser for generic lightweight markup that is agnostic to the different flavours (Textile, reST, Wikimedia, Creole etc.). Therefore it relies heavily on plugins. The Block Parser works fine by now, but the Inline Parser (for typography, links, footnotes etc.) is giving me headaches because of Text::Balanced's animosity towards regexp metacharacters.

How can I match a string delimited by double asterisk ('**') with Text::Balanced?

Suppose I have this string

$str = '**bold words**';

and I want to use extract_tagged to extract the bold words part.

My first try:

warn Dumper extract_tagged( 
    $str,
    '**',
    '**',
);

Now Perl complains about nested quantifiers because Text::Balanced creates a regex that starts with /\G**. So I tried escaping the '*'s:

warn Dumper extract_tagged( 
    $str,
    '\*\*',
    '\*\*',
);

That leads to a 'quantifier follows nothing' regexp error. Okay, double-escape it, methinks, but that gives the same 'quantifier follows nothing' error in the regexp Text::Balanced creates.

When I try it with three or more '\'s, there are neither errors nor results, and I am thoroughly confused by all the escaping of escape characters. I played around with quotemeta, but that didn't work either, so I turn to you:

Is there a way to match a string delimited by multiple regexp metacharacters like '*', and if so, what would be the least confusing way to implement it?
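
To see the attempts side by side, here is a minimal sketch that runs each delimiter spelling from above through extract_tagged and traps any regexp error with eval. The helper name try_delim is made up for illustration and is not part of Text::Balanced. Each attempt works on a fresh copy of the string, so a leftover pos() from one call cannot influence the next.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $str = '**bold words**';

# Hypothetical helper (not part of Text::Balanced): try one delimiter
# spelling and report either the trapped error or the inner text.
sub try_delim {
    my ($delim) = @_;
    my $text = $str;    # fresh copy, so no pos() carries over between calls
    my @parts;
    my $ok = eval { @parts = extract_tagged($text, $delim, $delim); 1 };
    unless ($ok) {
        (my $err = $@) =~ s/\s+\z//;    # trim the trailing newline
        return { delim => $delim, error => $err };
    }
    # Element 4 of the returned list is the text between the tags.
    return { delim => $delim, inner => $parts[4] };
}

# The spellings from the question: bare, escaped, and three backslashes.
my @results = map { try_delim($_) } ('**', '\*\*', '\\\*\\\*');

for my $r (@results) {
    if (defined $r->{error}) {
        print "$r->{delim} => error: $r->{error}\n";
    }
    elsif (defined $r->{inner}) {
        print "$r->{delim} => inner text: '$r->{inner}'\n";
    }
    else {
        print "$r->{delim} => no match\n";
    }
}
```

Checking the true value returned by the eval block, rather than testing $@ afterwards, keeps the error detection reliable even though Text::Balanced itself writes to $@ on a failed match.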

Thanks in advance and merry kwanzaa
kba

Re: Extracting multiple-asterisk-delimited substring with Text::Balanced by kba (Sexton) on Dec 25, 2008 at 23:05 UTC
Ah, I just found a way! Wrapping the metacharacters in a character class prevents the metacharacter confusion:

warn Dumper extract_tagged(
    $str,
    '[*][*]',
    '[*][*]',
);

This spits out

$VAR1 = '**bold words**';
$VAR2 = '';
$VAR3 = '';
$VAR4 = '**';
$VAR5 = 'bold words';
$VAR6 = '**';

as it's supposed to. Still, if someone has got a better way to do it, I'd be happy to hear it.
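
The character-class trick from this reply can be written as a small runnable sketch. The helper delim_pattern is hypothetical (it is not part of Text::Balanced) and deliberately accepts only a few punctuation characters. It places each character of a literal delimiter in its own character class, so '**' becomes '[*][*]'. The sample string with trailing text is also just an assumption for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Hypothetical helper (not part of Text::Balanced): turn a literal
# delimiter into a pattern by putting each character in its own
# character class, e.g. '**' becomes '[*][*]'. Only a few punctuation
# characters are accepted to keep the sketch safe.
sub delim_pattern {
    my ($literal) = @_;
    die "unsupported delimiter: $literal\n"
        unless $literal =~ /\A[*+~_]+\z/;
    return join '', map { "[$_]" } split //, $literal;
}

my $str     = '**bold words** and the rest';
my $pattern = delim_pattern('**');

# In list context extract_tagged returns the tagged substring, the
# remainder, the skipped prefix, the opening tag, the text between
# the tags, and the closing tag. The input string is not modified.
my ($tagged, $remainder, $prefix, $open, $inner, $close)
    = extract_tagged($str, $pattern, $pattern);

print "tagged:    $tagged\n";
print "inner:     $inner\n";
print "remainder: $remainder\n";
```

A plausible explanation for why this works where backslashes do not, offered as an assumption about the module's internals rather than documented behaviour: the closing tag seems to go through one extra round of double-quote-style interpolation before it is used as a pattern, which strips a level of backslashes. A character class contains no backslashes, so it survives that step unchanged.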
Re: Extracting multiple-asterisk-delimited substring with Text::Balanced by bruno (Friar) on Dec 26, 2008 at 04:07 UTC
I don't want to rain on your parade, but have you checked txt2tags? It's a lightweight and extensible markup language; it's been around for nearly a decade, so it's rock solid. I use it often and I highly recommend it.
Re^2: Extracting multiple-asterisk-delimited substring with Text::Balanced by kba (Sexton) on Dec 26, 2008 at 10:45 UTC
Nah, doesn't really bother me :) I'm more interested in a parser whose grammar can be changed from the ground up by implementing all syntax elements as plugins and letting the user decide specifically which plugins to use (e.g. Textile's heading block style and CSS inline style, AsciiDoc's text box style, Creole's typography markup etc.). Also, I want to combine it with an indentation-aware parser to process tab-delimited outlines (like vimOutliner). Besides, I finally found a reason to play with OpenOffice::OODoc and XSLT to make lightweight markup documents importable into OpenOffice and ODT exportable to lightweight markup plain-text documents.

That being said, I like txt2tags. Lots. I don't like some of its conventions though, as with any other lightweight markup language. From the parser perspective, I love DokuWiki: very extensible, very well written, but in PHP and embedded in a CMS-like environment. Thanks for the hint though.
Re: Extracting multiple-asterisk-delimited substring with Text::Balanced by Anonymous Monk on Dec 25, 2008 at 23:08 UTC
Text::Balanced extract_tagged extracts and segments text between (balanced) specified tags....

extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} );
Re^2: Extracting multiple-asterisk-delimited substring with Text::Balanced by kba (Sexton) on Dec 25, 2008 at 23:27 UTC
What do you want to tell me?

- I should have rtfm (I did)
- I should use the 'reject' option (doesn't change the problem)
- Text::Balanced's extract_tagged works only with xml-like tags (no, it doesn't)

I'd appreciate it if you could be more verbose about what you mean. Anyway, thanks for the reply.

