I am writing a web-based database application in Perl 5.6. All its control, input and output are via the web-browser...
When creating a new record, I am currently trying to parse the "document" field, which captures a large body of text (press releases, bulletins, notices, etc.), for various elements: filtering obscenities, stripping HTML, converting URLs and e-mail addresses to links, hard-coding carriage returns (why do we still call them that?) and spaces, converting emoticons to icons, and so on.
I am doing all of the above with regexes at the moment and, while these are okay most of the time, there are many conditions that cause them to fail. I would like to implement the following rules:
- If HTML is not permitted in the database (a user-configurable option), then all HTML input by the creator of the record (i.e. not generated by the application itself) is to be removed. This should work even if the HTML is "broken" or runs over multiple lines.
- Convert URLs to links. This should work for e-mail addresses as well as "http", "https" and "ftp" URLs, URLs without the "http://" prefix (www.mydomain.com), URLs without "www" (http://sub-domain.mydomain.com/), etc.
- URLs that appear inside HTML tag attributes should not be converted to links: i.e. don't convert this URL -> <Img Src="http://www.mydomain.com/gfx/sexygirl.jpg"> as that will break the HTML (assuming that the user is permitting HTML).
- URLs that are enclosed by tags should be converted to links: i.e. do convert this URL -> <Center>http://www.mydomain.com/</Center>
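To make the rules above concrete, here is a rough sketch of the behaviour I am after, using only core Perl. The names filter_document, linkify_text and make_link are made up for illustration, and the patterns are deliberately simplistic: the sketch assumes no attribute value contains a literal ">", it leaves an unclosed "<" in place (so it does not yet cope with "broken" HTML), and it would link a URL a second time if the user had already wrapped it in their own <a> tag.

```perl
use strict;
use warnings;

# Hypothetical helper: build the anchor for one match.
# Exactly one of $url / $mail is defined.
sub make_link {
    my ($url, $mail) = @_;
    return qq{<a href="mailto:$mail">$mail</a>} if defined $mail;
    my $href = $url =~ /^www\./i ? "http://$url" : $url;
    return qq{<a href="$href">$url</a>};
}

# Hypothetical helper: link bare URLs and e-mail addresses in plain text.
sub linkify_text {
    my ($text) = @_;
    $text =~ s{
        \b(                                  # first group: URL candidate
            (?:(?:https?|ftp)://|www\.)      # scheme, or a bare www. prefix
            [^\s<>"]+[^\s<>".,;:!?)]         # body, not ending in punctuation
        )
      |
        \b([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)    # second group: e-mail address
    }{ make_link($1, $2) }gex;
    return $text;
}

# Split the document into text and tag chunks. Because the split pattern
# has one capture group, even indexes are text and odd indexes are tags.
sub filter_document {
    my ($doc, $allow_html) = @_;
    my @chunks = split /(<[^>]*>)/, $doc;    # [^>] also matches newlines
    my $out = '';
    for my $i (0 .. $#chunks) {
        if ($i % 2) {
            $out .= $chunks[$i] if $allow_html;   # keep or strip the tag
        } else {
            $out .= linkify_text($chunks[$i]);    # only text gets linked
        }
    }
    return $out;
}

print filter_document('<Center>http://www.mydomain.com/</Center>', 1), "\n";
```

Because URLs inside a tag never reach linkify_text, the <Img Src="..."> case is left alone, while the <Center> case is linked. What I cannot get right with this approach is the general case, which is why I am looking at modules.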
Now I know that with these demands, I've stepped well outside of the quick and easy regex (certainly with my knowledge of regex anyway) and that CPAN is my only solution.
I would like to use only those modules that are part of the standard Perl distribution (if at all possible), as I want to distribute this application when it's completed and want installation to be as simple as possible, both for the novice user and for those users who haven't got physical access to a server and are therefore relying on uncooperative ISPs for their CGI programs. However, I will use a "non-standard" module if that's the only way to achieve what I want.
Searching CPAN, I think the modules I need are URI::URL and libwww-perl (LWP).
Is this correct? Can these modules, between them, provide the mechanism to perform the tasks outlined above?
If so, could anyone post or e-mail me with example code (bonus points for anyone supplying code relating to the tasks above) as I have become more and more confused with the documentation for both of the above modules?
Huge thanks in advance.
In theory, there is no difference between theory and practice. But in practice, there is.
Jonathan M. Hollin
Digital-Word.com