http://qs1969.pair.com?node_id=74059

abultm74 has asked for the wisdom of the Perl Monks concerning the following question:

Help! I'm working on a class project. We are caching copies of HTML files. The problem: the 'href' and 'src' attributes in the HTML need to be changed. Relative links need to be turned into hard (absolute) links, so that all our database has to hold is the HTML text, not the images, etc. Anyway, there are a myriad of ways of writing href and src attributes: no quotes, quotes, relative, relative with '..'s, leading slashes, trailing slashes, ones with 'http://', ones with only 'www', etc. I need to find all 'href' and 'src' links in the HTML and make them hard links. Any ideas? Is there a module that does this, or do I have to write a million regexps? I need some help... 'Mad Props' to anyone who can shed some light... Adam

Re: HTML tag search/replace.
by Maclir (Curate) on Apr 20, 2001 at 07:11 UTC

    HTML::Parser is your friend. A subset - HTML::TokeParser - may be sufficient. You are correct, Adam, when you identify the source of the difficulty - there are a variety of ways to code those HTML tags. (As a side issue - XHTML, with a much more rigid syntax, will make this easier - sort of like "use strict;" for HTML.)

    Now, if this is part of a class project, maybe they are wanting to see how you would tackle the problem, as an exercise in analysis and program design. At least HTML::Parser should be a good source of inspiration.
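
For anyone who wants a starting point, here is a minimal sketch of the HTML::TokeParser approach. It is not code from the original posters. It assumes the HTML::Parser distribution (which provides HTML::TokeParser and HTML::Entities) and the URI module are installed, and it assumes you know the URL each page was fetched from. The name absolutize_links is hypothetical. Start tags are rebuilt with double-quoted attributes, so the output is not byte-identical to the input, and a link written with only 'www' and no scheme is still treated as a relative path.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(encode_entities);
use URI;

# Hypothetical helper: return $html with every href/src made absolute
# relative to $base (the URL the page was fetched from).
sub absolutize_links {
    my ($html, $base) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse document";
    my $out = '';

    while (my $t = $p->get_token) {
        my $type = $t->[0];
        if ($type eq 'S') {                       # start tag
            my ($tag, $attr, $attrseq) = @{$t}[1, 2, 3];

            # The parser has already dealt with quoting, so the value is clean.
            for my $name (qw(href src)) {
                next unless defined $attr->{$name};
                $attr->{$name} = URI->new_abs($attr->{$name}, $base)->as_string;
            }

            # Rebuild the tag, keeping the original attribute order.
            $out .= "<$tag";
            for my $name (@$attrseq) {
                next if $name eq '/';             # a trailing "/" can show up as an attribute
                $out .= qq{ $name="} . encode_entities($attr->{$name}, '<>&"') . '"';
            }
            $out .= '>';
        }
        elsif ($type eq 'E' || $type eq 'PI') {   # end tag / processing instruction
            $out .= $t->[2];                      # original source text
        }
        else {                                    # T (text), C (comment), D (declaration)
            $out .= $t->[1];                      # original source text
        }
    }
    return $out;
}

# Usage with a hypothetical base URL:
print absolutize_links(
    q{<a href=../a.html>x</a><img src='pic.gif'>},
    'http://example.com/dir/page.html',
), "\n";
```

The point of the design is that no regexp ever has to match a tag: the parser copes with missing quotes, single quotes and double quotes, and URI does the path arithmetic.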

          
      XHTML ... sort of like "use strict;" for HTML

      You know this is brilliant! It makes perfect sense both for Perl coders trying to understand XML and XHTML and for XMLers learning Perl.

      Sadly it still allows both ' and " to be used around attribute values and has no rule concerning URLs, so they will still be just as hard to parse ;--(
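
On the URL side specifically, the URI module can take over most of the work once the attribute value has been extracted. The following is only a sketch, assuming URI is installed; the base URL, the sample links and the name resolve_link are made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical address of the page the links were found in.
my $base = 'http://example.com/dir/page.html';

# Hypothetical helper: resolve one link against the page's URL.
sub resolve_link {
    my ($link, $base_url) = @_;
    return URI->new_abs($link, $base_url)->as_string;
}

# Sample links in the styles the question lists.
for my $link ('pic.gif', '../a.html', '/top/',
              'http://other.example/x', 'www.example.org/') {
    printf "%-24s => %s\n", $link, resolve_link($link, $base);
}
```

Note the last sample: a link with only 'www' and no scheme is, by the URL rules, a relative path, so it resolves underneath the page's own directory. If the pages being cached really use that style, it needs a separate check before resolving.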

Re: HTML tag search/replace.
by merlyn (Sage) on Apr 20, 2001 at 16:25 UTC