Popcorn Dave has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

I'm writing a program to look up sales tax rates for California. Unfortunately there is more than one rate for the state: it's determined by county. That's not a problem insofar as there's a PDF file with all the information, and Adobe is kind enough to provide a conversion service that turns a PDF file into HTML. So far so good.

I'm running into trouble when I try to parse the file looking for a city. The HTML source is 3243 lines long, but when I read the file in, I only get 231 lines before it quits.

My thinking is that when the file was converted, there is some character that is acting as an EOF. I'm not sure about that though, because the file displays fine in a browser.

So, I'm wondering, is there a simple way to strip out anything below ASCII 32? I've tried using things like

next if $line =~ /\W/;
next if $line != /^[A-Z]/;
next if $line =~ /[^A-Z]/;

to no avail.

The information I'm actually after starts probably halfway into the file and is laid out as: city, tax rate, county. So I thought that looking for the capital letter at the beginning of a line would work.

Thanks in advance!

There is no emoticon for what I'm feeling now.


Re: Stripping non alphanumeric characters and leaving punctuation characters from a file
by BrowserUk (Patriarch) on Jun 06, 2003 at 19:27 UTC

    Whilst $line =~ tr[\x00-\x1f][]d; will strip ASCII characters below space from your input (tabs and newlines included), I don't think that will do you any good if your input is terminating before the end.
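    A minimal sketch of that tr approach, in case it is useful once the whole file can be read. The helper name strip_control_chars and the sample string are made up for illustration. Unlike the one-liner above, this variant deliberately keeps tab, line feed and carriage return, which is an assumption about what you would want to preserve.

```perl
use strict;
use warnings;

# Hypothetical helper: delete ASCII control characters but keep
# tab (\x09), line feed (\x0a) and carriage return (\x0d).
sub strip_control_chars {
    my ($line) = @_;
    (my $clean = $line) =~ tr/\x00-\x08\x0b\x0c\x0e-\x1f//d;
    return $clean;
}

# Made-up sample line containing a Ctrl-Z and a NUL byte.
my $sample = "Acampo\x1a 7.75%\x00 San Joaquin\n";
print strip_control_chars($sample);
```

    The ranges skip \x09, \x0a and \x0d, so the line structure of the file survives the cleanup.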

    If you can't read it, how can you fix it?

    Have you tried using binmode on the file?

    Does it make any difference?

    Am I misreading your post?
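    One way to test the "some character is acting as EOF" theory is to open the file with binmode and look for a Ctrl-Z (0x1A) byte. Perl's binmode documentation notes that on Microsoft-family systems a \cZ in the data is treated as end of file unless binmode is used. Whether that applies here is an assumption, since the platform isn't stated. The helper name find_ctrl_z and the temporary test file below are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the byte offset of every Ctrl-Z (0x1A) in a file.
sub find_ctrl_z {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                          # raw bytes, no text-mode translation
    my $data = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    my @offsets;
    # pos() is the position just after the match, so subtract one.
    push @offsets, pos($data) - 1 while $data =~ /\x1a/g;
    return @offsets;
}

# Build a small made-up file with an embedded Ctrl-Z to try it on.
my ($tmp, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "line one\nline two\x1aline three\n";
close $tmp;

my @hits = find_ctrl_z($tmp_path);
print "Ctrl-Z at byte offset(s): @hits\n";
```

    If this reports an offset near where the text-mode read stops, that would support the theory. An empty list would point elsewhere.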


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      Well I could read it, but only to a certain point. I could read the first 231 lines in text mode.

      I thought that since it was a text file (or at least I thought it was), I shouldn't use binmode.

      However, since then svsingh helped me to get this working. I've modified his code to this:

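      #!/usr/bin/perl -w
      use strict;

      open FH, 'pub71.html' or die "Cannot open pub71.html: $!";
      binmode FH;    # read raw bytes, with no text-mode translation
      $\ = "<BR>";   # output record separator: appended to every print

      while (<FH>) {
          next if $_ =~ /\,/;       # skip lines that contain a comma
          next if $_ =~ /^[a-z]/;   # skip lines that start with a lowercase letter
          # capture city, rate and county from "city rate% county<BR>"
          if ( m/(.+)\s(\d{1,2}\.\d{2})%\s(.+)\s*<BR>/ ) {
              print "City:$1\tCounty:$3\tTax$2\n";
          }
      }
      close FH;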

      and it now reads all the information that I want.
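      For anyone following along, here is a small sketch of what that regex captures. The sample lines are made up (the real file isn't shown in this thread), and parse_line is a hypothetical wrapper added only so the three captures can be returned together.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex from the script above.
# Returns (city, rate, county), or an empty list if the line does not match.
sub parse_line {
    my ($line) = @_;
    if ( $line =~ m/(.+)\s(\d{1,2}\.\d{2})%\s(.+)\s*<BR>/ ) {
        return ($1, $2, $3);
    }
    return;
}

# Made-up sample lines in the "city rate% county<BR>" layout.
for my $line ("Acampo 7.75% San Joaquin<BR>", "Agoura Hills 8.25% Los Angeles<BR>") {
    my ($city, $rate, $county) = parse_line($line);
    print "City:$city\tCounty:$county\tTax:$rate\n";
}
```

      The first (.+) is greedy, so it backs off only as far as the last whitespace that is followed by a number and a percent sign. That is why a two-word city name still lands in one capture.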


Re: Stripping non alphanumeric characters and leaving punctuation characters from a file
by Zaxo (Archbishop) on Jun 06, 2003 at 19:29 UTC

    Answer to the stated problem: Perl has character classes for control characters. Using Unicode properties: s/\p{IsC}//g; or s/\p{IsCntrl}//g;. Using POSIX classes: s/[[:cntrl:]]//g;.
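    A quick sketch comparing the two styles on a made-up string. It uses \p{Cntrl}, which I understand to be the same property as \p{IsCntrl} without the prefix. Note that both classes also match tab and newline, so they remove more than just the "junk" bytes.

```perl
use strict;
use warnings;

# Made-up sample with a NUL, a Ctrl-Z, a tab and a trailing newline.
my $text = "Acampo\x00 7.75%\x1a\tSan Joaquin\n";

(my $posix   = $text) =~ s/[[:cntrl:]]//g;   # POSIX character class
(my $unicode = $text) =~ s/\p{Cntrl}//g;     # Unicode property

print "both styles agree\n" if $posix eq $unicode;
print "[$posix]\n";
```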

    As for the real problem, it sounds very clunky to parse that information out of a PDF file on each run. Why not extract it once and place it in a small database or flat file? [Update] The same objection holds for parsing it from HTML each time. Scribble the tax rate data, and only the tax rate data, somewhere you can get at it easily.
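    A rough sketch of the flat-file idea, assuming a tab-separated layout of city, rate and county. The layout, the helper names save_rates and load_rates, and the sample data are all made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write { city => [rate, county] } as tab-separated lines.
sub save_rates {
    my ($path, $rates) = @_;
    open my $out, '>', $path or die "Cannot write $path: $!";
    for my $city (sort keys %$rates) {
        print {$out} join("\t", $city, @{ $rates->{$city} }), "\n";
    }
    close $out or die "Cannot close $path: $!";
}

# Hypothetical helper: read the flat file back into the same structure.
sub load_rates {
    my ($path) = @_;
    my %rates;
    open my $in, '<', $path or die "Cannot read $path: $!";
    while (my $row = <$in>) {
        chomp $row;
        my ($city, $rate, $county) = split /\t/, $row;
        $rates{$city} = [$rate, $county];
    }
    close $in;
    return \%rates;
}

# Try it with a temporary file and made-up data.
my ($tmp_fh, $cache) = tempfile(UNLINK => 1);
close $tmp_fh;
save_rates($cache, { 'Acampo' => ['7.75', 'San Joaquin'] });
my $rates = load_rates($cache);
print "Acampo: $rates->{Acampo}[0]% ($rates->{Acampo}[1] County)\n";
```

    After the one-time extraction, a lookup is just a hash access, and the HTML never needs to be parsed again until the rates change.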

    Perl's binmode instruction may help with your file reading problem.

    After Compline,
    Zaxo

      Sorry I forgot to mention that.

      I'm checking to see if the file exists on the local machine, and only downloading and processing it if it's not there. I don't want to be hitting Adobe's server every time this thing is run.
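      A sketch of that check, for what it's worth. The helper ensure_local_copy is hypothetical, and the download itself is passed in as a code reference so the example runs without touching the network. In a real script that code reference might wrap something like LWP::Simple's getstore, with whatever URL the conversion service provides.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: call $fetch->($path) only when there is no usable local copy.
# Returns 1 if a download happened, 0 if the existing file was used.
sub ensure_local_copy {
    my ($path, $fetch) = @_;
    return 0 if -e $path && -s _;    # file exists and is not empty
    $fetch->($path);
    return 1;
}

# Stand-in for the real download: writes a placeholder file and counts calls.
my $downloads  = 0;
my $fake_fetch = sub {
    my ($dest) = @_;
    open my $out, '>', $dest or die "Cannot write $dest: $!";
    print {$out} "placeholder\n";
    close $out;
    $downloads++;
};

my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, 'pub71.html');

ensure_local_copy($file, $fake_fetch) for 1 .. 3;
print "downloads: $downloads\n";
```

      Checking for a non-empty file as well as an existing one guards against a previous run that died halfway through the download.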
