naphelge has asked for the wisdom of the Perl Monks concerning the following question:

hey gang,

I want to search for strings of text in an HTML file and, on a regex match, capture the first letter of each word in the matched text (discarding the rest), so that I can substitute those letters back into the line. For example, given something like:

</p><p><b><font face="Garamond" size="5">Part II: Nietzsche's Project, An Overall Review</b></center>

I would like to end up with:

<p><font size="-1><a href="#PINPAORToc" name="PINPAORTxt" style=text-decoration:none">Part II: Nietzsche's Project, An Overall Review</a></font><br>

I need to re-format webpages to be as small (in size) as possible so I can upload them to my phone for easy reading while I am on the bus each day. I also need logical links in the document for easy navigation. So anyways, I have been making do with sed, but I have hit a wall between what I would like to do and what sed wants to do. For the example above I can hack it in sed, but sed only has 9 buffers and no kewl functions to make life easier.

So I hope the above example makes sense, as I am sure there is an easier perl one-liner than sed one-liner that can do it.

cheers, nap

Re: search and using first letter of words on a line
by aaron_baugher (Curate) on Sep 22, 2011 at 06:34 UTC

    That's some pretty ugly HTML (both versions), so I'm going to hope there are some typos in it. You may want to use a module to parse that, or if the tags will always be the same, strip it out with a regex. But once you've got your title, you can get your acronym with something like this:

    use feature 'say';

    my $title   = "Part II: Nietzsche's Project, An Overall Review";
    my $acronym = join '', ($title =~ /(?:^|\s+)(\w)/g);
    say $acronym;    # -> PINPAOR

    That'll give you a string containing each word character that follows the start of the string or a group of whitespace characters. You could narrow \w further to just take capital letters or whatever you like.
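    As a rough sketch of that narrowing, assuming only ASCII capitals should count (the helper name capital_acronym is made up for illustration), swapping \w for [A-Z] skips words that start with a lowercase letter or a digit:

```perl
use strict;
use warnings;

# Hypothetical helper: first character of each word, kept only when
# that character is an ASCII capital letter.
sub capital_acronym {
    my ($title) = @_;
    return join '', $title =~ /(?:^|\s+)([A-Z])/g;
}

print capital_acronym("Part II: Nietzsche's Project, An Overall Review"), "\n";  # PINPAOR
print capital_acronym("The Birth of Tragedy"), "\n";                             # TBT
```

    Whether dropping small words like "of" is what you want depends on how the ids should read; the \w version keeps them.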

      <quote>That's some pretty ugly HTML (both versions), so I'm going to hope there are some typos in it.</quote>

      Yep. Not as pretty as it could be, but ideal for my situation. I need to make pages for the phone (almost) as lean as possible while still navigable, because the phone is old enough to struggle opening files larger than 50K.

      <quote>or if the tags will always be the same, strip it out with a regex.</quote>

      Exactly. The tags in any given document will be the same, but the heading title words from which I want to take the first letter to make the link href and name are different, so it needs to find the first letter of each word and combine them for link ids. Then I use sed to create my TOC at the top of the page. I'd appreciate a kick in the right direction. nap

        Does your phone understand CSS? That could make the resulting code a lot leaner, if you could specify your smaller font size once instead of with a font tag on every paragraph.

        <!-- in the <head> of the page -->
        <style type='text/css'>
        p {
            font-size: .7em; /* or whatever works */
        }
        </style>
Re: search and using first letter of words on a line
by ww (Archbishop) on Sep 22, 2011 at 12:52 UTC
    I'm having trouble grokking how this helps:

    <p><font size="-1><a href="#PINPAORToc" name="PINPAORTxt" style=text-decoration:none">...

    The font size leaves me puzzled on two counts:
    1. since you're using css && your goal is reducing file size, why not make and use a one- or two-letter class name (saving at least two chars update: note 1)? ( sm(all) comes to mind). and...
    2. Why would you choose a font-size of -1 on a tiny-screen device?

    Also:

    Why are you bothering with name="..." when you're trying to reduce the char count? Yes, it's often good form for public HTML, but on your fone?

    If the original HTML is borked ( </center> ? ) and with other close tags out-of-order ( </b> before </font> (which is never seen) ), why not clean it up, first? I've often found that HTML that's as bad as what you display is also outrageously bloated. Think TIDY or maybe one of the HTML::... modules

    Note 1 (an update): TANSTAAFL applies: I should say 'save two chars per instance, after paying back the overhead of the style def in the <head>' of course. But a style...</style> section in the head could also provide abbreviations for the likes of style=text-decoration:none

      <quote>I'm having trouble grokking how this helps:

      <font size="-1><a href="#PINPAORToc" name="PINPAORTxt" style=text-decoration:none">... The font size leaves me puzzled on two counts:</quote>

      I really appreciate the help, and I now have some things to look at to help make things leaner for sure. But my question, the only thing I really need answered ATM, is:
      How can I extract the first letter from each word in a string and join them together?
      I have been searching and it looks like perhaps using perl I need to use the substr, split, and join functions, so if anyone might be able to provide some sort of example I would be grateful.
      It seems like it might be a similar request to extracting the first letter of names in a generated list or file?
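      For reference, a minimal sketch of the split/substr/join route mentioned above might look like this (the helper name initials is made up for illustration). It takes the first character of every whitespace-separated word, whatever that character is:

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace, take the first character
# of each word with substr, then join the pieces back together.
sub initials {
    my ($text) = @_;
    my @words = split ' ', $text;    # ' ' also skips leading whitespace
    return join '', map { substr($_, 0, 1) } @words;
}

print initials("Part II: Nietzsche's Project, An Overall Review"), "\n";  # PINPAOR
```

      One difference from a \w-based regex: substr happily returns punctuation, so a word like "(draft)" would contribute "(" to the result.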

      I am uncertain if a sed example might help but something I have tried in sed is:

      sed -ri 's/<b>([A-Za-z])([A-Za-z:,]{0,20})[ ]{0,1}([A-Za-z]{0,1})[ ]{0,1}([A-Za-z:,]{0,20})[ ]{0,1}([A-Za-z]{0,1})[ ]{0,1}([A-Za-z:,]{0,20})[ ]{0,1}([A-Za-z]{0,1})(.*)<\/b>/<font size="-1"><a href="#\1\3\5\7Toc" name="\1\3\5\7Txt" style="text-decoration:none">\1 \2 \3 \4 \5 \6 \7 \8<\/a><\/font><br>/' 1.html

      which is getting a little unruly, not to mention that I am limited to capturing only the first letters of the first four words in the string because sed only has nine memory buffers to recall from, so I think by turning to perl I will give myself a lot more options to do what I want.
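      A rough Perl counterpart to that sed command, assuming the heading always sits between <b> and </b> on one line (the helper name link_heading is made up for illustration), could use s///e so the replacement is computed by code and the word count is no longer limited:

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite <b>Title</b> into the anchor markup from
# the sed attempt, building the id from the first letter of every word.
sub link_heading {
    my ($line) = @_;
    $line =~ s{<b>(.*?)</b>}{
        my $title = $1;                                   # save before the inner match
        my $id    = join '', $title =~ /(?:^|\s+)(\w)/g;  # e.g. PINPAOR
        qq{<font size="-1"><a href="#${id}Toc" name="${id}Txt" style="text-decoration:none">$title</a></font><br>};
    }ge;
    return $line;
}

print link_heading(q{<b>Part II: Nietzsche's Project, An Overall Review</b>}), "\n";
```

      The /e flag makes Perl evaluate the replacement as code, and /g handles more than one heading per line. Wrapped in a read loop over the file, this would do the same job as the sed -ri call.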

      cheers nap

        For what you're defining as your "need ...ATM," is there some problem with the solution offered in Re: search and using first letter of words on a line? It works:
        #!/usr/bin/perl
        use strict;
        use warnings;
        # 927275
        use 5.012;

        my @title = (
            "Part II: Nietzsche's Project, An Overall Review",
            "Learning Perl, 5th Ed.",
            "Mastering Regular Expressions",
        );

        for my $title (@title) {
            my $acronym = join '', ($title =~ /(?:^|\s+)(\w)/g);
            say $acronym;
        }

        =head1 OUTPUT

        PINPAOR
        LP5E
        MRE

        =cut

        Of course, your spec leaves a little to be desired: how will you distinguish among Parts I, II and III of your first title?
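        One possible way around such collisions, purely as a sketch (the helper unique_id and the numeric-suffix scheme are assumptions, not part of the original spec), is to remember which acronyms have already been used and append a counter to repeats:

```perl
use strict;
use warnings;

my %seen;    # how many times each acronym has been produced so far

# Hypothetical helper: acronym, plus a numeric suffix on repeats.
sub unique_id {
    my ($title) = @_;
    my $id = join '', $title =~ /(?:^|\s+)(\w)/g;
    my $n  = $seen{$id}++;
    return $n ? "$id$n" : $id;
}

print unique_id($_), "\n" for
    "Part I: Nietzsche's Project",      # PINP
    "Part II: Nietzsche's Project",     # PINP1
    "Part III: Nietzsche's Project";    # PINP2
```

        A suffix like this can still clash with a title whose own acronym happens to end in a digit, so treat it as a starting point rather than a guarantee.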

        If your problem is in applying that answer to <a href="file:///foobar baz">Bazing with foo</a>, then you need to consider something on the order of HTML::Parser or another of the HTML::... modules.

        If you have some other stumbling block in your way, pray detail it, but without teaching Grandmother how to cook eggs with sed.