aforonda has asked for the wisdom of the Perl Monks concerning the following question:

OK, this is going to sound weird, but I have lots of HTML pages that each contain two sets of title tags. The first set is empty, but the second has the actual title I need to use. Due to the app, I cannot remove the first title set; it's mandatory.

What I need to do is pull the "some content" portion out of <title>some content</title> in the second title set and put it in the first title of the page.

So the source will no longer look like
Old:
<title></title>
<title>some content</title>

Should look like
Need:
<title>some content</title>
<title>some content</title>

Thanks.

Re: pull the title content
by sauoq (Abbot) on Jun 11, 2003 at 00:21 UTC

    I'm guessing this is a one-time fix kind of a thing, right? Normally, I'd suggest properly parsing any HTML but you might get by on the cheap.

    perl -i.bak -0ple 's!<title>\s*</title>(.*?)(<title>.*?</title>)!$2$1$2!is' file.html

    This comes with caveats, YMMV, etc... the regex is brittle... but it might do what you need. If this is something that you'll have to do over and over, though, I'd really suggest you take a more robust approach. See HTML::Parser for starters.
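
For anyone who would rather run this as a script over many files, here is a minimal sketch of the same substitution using only core Perl. The function name fill_empty_title and the .bak backup naming are illustrative assumptions, not something from this thread, and it is still a regex hack rather than a real parse: it assumes plain <title> tags with no attributes and only fills a first title that is empty.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): copy the text of the next
# <title> into an empty <title> that precedes it.
sub fill_empty_title {
    my ($html) = @_;

    # Same idea as the one-liner: empty title, anything, then a title.
    # /i ignores tag case, /s lets "." match across newlines.
    $html =~ s{<title>\s*</title>(.*?)<title>(.*?)</title>}
              {<title>$2</title>$1<title>$2</title>}is;

    return $html;
}

# Process each file named on the command line, keeping a .bak copy.
for my $file (@ARGV) {
    open my $in, '<', $file or die "Cannot read $file: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    my $fixed = fill_empty_title($html);
    next if $fixed eq $html;              # nothing to change

    rename $file, "$file.bak" or die "Cannot back up $file: $!";
    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} $fixed;
    close $out or die "Cannot close $file: $!";
}
```

Saved under any name you like, it takes the HTML files as arguments. Files that do not match the pattern are left untouched and get no backup.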

    -sauoq
    "My two cents aren't worth a dime.";
    
•Re: pull the title content
by merlyn (Sage) on Jun 11, 2003 at 02:28 UTC
    What is the purpose of more than one title? I believe it's an error. What you end up with is that some browsers will believe the first, and others will believe the second. Bad HTML. Bad. It'd be better to reduce it to a single title element.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Randal is correct, of course.

      Instead of adding in the second title, tags and all (which I believe you must be doing somewhere), why not just sub the title into the existing tags (which are already there)? Sounds easier and better.

      Jasper
        OK, some history: I got stuck with a site that was working great. Due to the politics of company x, they decided to justify a portal app that's been sitting on the shelf for some time. So now it breaks the site (note: not a portal) and I hack away to make the site work, not only with the portal app but with our existing CMS.

        OK, now that that is out of the way: essentially, what the app does is wrap my HTML pages in an elaborate wrapper that throws in another top-level set of HTML, that is, <html><title><head><body>. Thus, when it renders the gatewayed page, I get two sets of that HTML. Yes, it's bad HTML, but it's been working in all targeted browsers. The problem is that the wrapper obviously does not show the correct title, that is, the 2nd title (eyes crossed yet?). So what I need is a quick solution (granted, it's an admitted hack) so that when the page renders, it knows to pull the value of the 2nd title and place it in the first (wrapper file) title.

        Your help is appreciated.