bigup401 has asked for the wisdom of the Perl Monks concerning the following question:
|
Re: regex in perl
by GrandFather (Saint) on Dec 15, 2020 at 20:02 UTC
Don't do that! Use any of the many modules designed for making a good job of parsing and manipulating markup like HTML and XML. HTML::Tree may be a good starting point.
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
|
Re: regex in perl
by marto (Cardinal) on Dec 15, 2020 at 20:08 UTC
Is there a reason you don't want to use a proper parser, specifically designed for this task, and the other tasks you're trying to achieve?
|
Re: regex in perl (nested HTML tags)
by LanX (Saint) on Dec 15, 2020 at 20:11 UTC
Cheers Rolf
|
Re: regex in perl
by davido (Cardinal) on Dec 16, 2020 at 20:28 UTC
The fact is that it's deceptively simple in appearance. What could be so bad about using regex for this? I really like tchrist's explanation on StackOverflow: Oh Yes You Can Use Regexes to Parse HTML!. Here are a few of the many useful statements in that post:
Let's open our eyes, then. Where's your sample input? I don't see any. So now I have to contrive some. I grabbed this from http://geeksforgeeks.org/span-tag-html:
And now let's open our minds to what a proper DOM class can achieve:
This produces:
The beauty here is that you don't have to worry about what happens in the case of nested spans, which the regex you're producing doesn't look like it would deal with gracefully. And you don't have to worry about a whole bunch of other nuances, such as the fact that <span and < span are equivalent (but also not handled by the regex you were crafting).
And what's the cost? In terms of non-core Perl modules, you've added:
All of those are part of the Mojolicious distribution, which is distributed as a 776kb tarball, and installable with only core Perl tools. Additionally, this distribution provides you with a nice User Agent, and a great test framework. It takes under a minute to install, and has no non-core dependencies.
My own take: It may be that I could take a stab at writing an HTML parser that would remove span tags, and also the other things you were asking about the day before, and the things you will ask about tomorrow, for the subset of HTML you deal with in your specific use case. And I might get it right. But I will have wasted a lot of time to implement a more fragile solution to a very specific problem, and it would be a tool that couldn't grow as my problem evolves.
I don't know what your job is, but my job is not to spend more time than necessary to create a less robust, more buggy solution to an already solved problem if I'm aware that a shorter-time-to-production, more robust, less buggy, easier to understand approach exists. Now I may sometimes manage to do that unintentionally anyway. But when shown the light, I realize that part of my job is to learn, and evolve to adopt the better approach.
Dave
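For anyone who wants to try the DOM approach described above, here is a minimal sketch. It is not the code from the post: it assumes the "proper DOM class" is Mojo::DOM from the Mojolicious distribution, and the helper name strip_spans and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;    # non-core; ships with the Mojolicious distribution

# Hypothetical helper: remove every <span> wrapper but keep its contents.
sub strip_spans {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);

    # find() returns a collection of all matching elements, nested ones
    # included; strip() replaces an element with its own child nodes.
    $dom->find('span')->each(sub { $_->strip });

    return "$dom";    # Mojo::DOM objects stringify back to markup
}

# Made-up sample input with a nested span.
my $sample = '<p>My <span style="color:red">red <span class="b">bold</span></span> text</p>';
print strip_spans($sample), "\n";
```

With that sample, both span wrappers should disappear while the text inside them stays in place, and the nested case needs no special handling.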
|
Re: regex in perl (bigup401 faq)
by Anonymous Monk on Dec 16, 2020 at 10:04 UTC
|
Re: regex in perl
by Anonymous Monk on Dec 16, 2020 at 02:16 UTC
Definitely use "a proper HTML parser." You'll be glad you did. But also, be aware of CPAN modules such as Regexp::Common. (In fact, there are quite a few modules in the "Regexp" family. Many common patterns have already been worked out thoroughly.)