DukeLeto has asked for the wisdom of the Perl Monks concerning the following question:

Yea holy and sublime monks of the Perline order, I beseech thee, aidest me!

I am but a poor and simple coder, late wandered out of the dark wood of error where I was blinded by my service to the Whore of Redmond. I have come to a time and place where the powers that be command that I forge for them from the living Perl a script that might subdivide their bloated SGML files and render up a set of named subfiles that may be used to populate a set of TEXT fields in a great library of SQL.

Though I have labored a full fifth-score hours on this task, poring over many a camel-embossed tome of distracted wisdom, I have come to realize that my work is in vain and I must look to the almighty for my salvation, for my childish reason can not comprehend the sublimity that is the way of the Perl. Therefore, I do remit my opus to thee, in the hope that my crass idiocy might reveal itself to thine holy eyne and that the goal of my project might be completed without a further waste of time.

Herein lies the code, built upon the backbone of a program that thou hast helped one of my co-workers build:

#!/usr/bin/perl -w
#Purpose: To Take a DOS file wildcard and thus take all the matching custom SGML files in the working directory and subdivide them into new files whose names are the id's of the divs in those original files.

print "What file(s) do you want to run this program on?\n";
$TheFile=<STDIN>;
chomp ($TheFile);

our $lines = "";
our @InFileNames;
our @OutFileNames;
our @OutFileContent;

#open file and get all text
sub OpenFile {
    open(FILE, @_) or $lines = "";
    local $/ = undef;
    $lines = <FILE>;
    #remove blank lines
    $lines =~ s/\n{2}/\n/gms;
    close(FILE);
}

#add ¥ to closing div tags
sub MarkClose {
    $lines =~ s/(<\/div>)/\$1¥/gms;
}

#open output.txt for appending and write results to it
sub FileAppend {
    my $Outfile = ">>" . $_[1] . ".bsd";
    my $Content = $_[2];
    open(FILE, $Outfile) or die "Can't open $Outfile.\n";
    print FILE $Content;
    print FILE "\n";
    close FILE;
}

#Create an array containing all file in the directory matching the glob.
sub GetInFileList {
    my $FileDef = @_;
    @InFileNames = glob($FileDef);
}

#Populate an array with the contents of the id attribute of every <div> tag in the input file.
sub GetOutFilesList {
    @OutFileName = $lines =~ m/<div[^>]*>/gms;
    foreach $OutFile (@OutFileName){
        $OutFile =~ s/<div type=[^>]* id="([^>]*)">/$1/gms;
        $OutFile =~ s/\./_/gms;
    }
}

#Subdivides the File into the subfiles.
sub GetOutFileContent {
    my $LinesString = @_;
    @OutFileContent = split /¥/, $LinesString;
}

### Does the job
&GetInFileList($TheFile);
if (@InFileNames > 0){
    for ($i = 0, $i < @InFileNames, $i++) {
        &OpenFile($InFileNames[$i]);
        &MarkClose;
        &GetOutFilesList;
        &GetOutFileContent($lines);
        for ($j = 0, $j < $OutFileNames, $j++){
            &FileAppend($OutFileNames[$j], $OutFileContent[$j]);
        }
    }
}

#be nice and say it's done
print "Program Finished\n";

edited: Sun Mar 9 14:47:58 2003 by jeffa - title change (orig is now first para)

Replies are listed 'Best First'.
Re: Need help with subdividing SGML files
by BrowserUk (Patriarch) on Mar 07, 2003 at 20:46 UTC

    Thoust elegant beseeching falleth not upon ears deaf, but, pray tell, in what way wouldest thou havest aid thee?

    (Er...like er.. you know man, like ...What's the problem?)


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
      As near as I can tell the first array is never populated. There's probably a considerable number of other fudge-ups.

        Sorry, but "the first array" means nothing. Is that the first array mentioned in the program reading from top to bottom? The first array that is used in the runtime order?

        For anyone who doesn't have a set of SGML files with the particular set of <div> tags that your program is looking for, the work involved in divining the format of those files and mocking up data in order to run your program is considerable.

        I think that you should consider putting as much effort into describing the problem as you did into your flowery request for help. You might then give us enough information upon which to begin to advise you.

        Try adding a few print statements to your program and work out what it is or is not doing. Come back with a clear description of the problem and you might get some more help.


Re: Need help with subdividing SGML files
by hawtin (Prior) on Mar 07, 2003 at 20:56 UTC

    Immediate thoughts:

    • use strict; Always
    • &MarkClose; should become &MarkClose();
    • The if (@InFileNames > 0){ is redundant. Just have the foreach execute 0 times
    • s/<div type=[^>]* id="([^>]*)">/$1/gms since a space is not a > the [^>]* will gobble a lot more string than I think you expect. Try [^>\s]*
    • Use the debugger (perl -d) and try things one line at a time
    • K&R indentation :-(

    Update: OK once more but this time with real square brackets
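The point about [^>]* can be seen on a small example. This is only a sketch: the sample tag, including its extra n attribute, is made up, since the thread never shows a real input file.

```perl
use strict;
use warnings;

# Hypothetical sample tag; the real files' attributes are not shown in the thread.
my $tag = '<div type="chapter" id="ch.1" n="1">';

# [^>]* in the capture runs on to the last quote before the closing '>'.
my ($loose) = $tag =~ /<div type=[^>]* id="([^>]*)">/;

# Stopping at whitespace, and at the closing quote, keeps each piece to one attribute.
my ($tight) = $tag =~ /<div type=[^>\s]* id="([^"]*)"/;

print "loose: $loose\n";    # loose: ch.1" n="1
print "tight: $tight\n";    # tight: ch.1
```

With only type and id in the tag, both patterns capture the same text; the difference shows up as soon as anything follows the id.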

      Hmmm... Oh yeah, unlike TextPad, this animal won't stop for line breaks.
Re: Need help with subdividing SGML files
by tall_man (Parson) on Mar 07, 2003 at 21:04 UTC
    This be not a role-playing game, Sir DukeLeto. Flowery language and especially flowery but meaningless article titles doth not contribute to our desire to help thee.

    Firstly, thou shouldst "use strict". It will find for thee many problems.

    Secondly, thy argument-passing skills want improvement. Instead of:

    my $LinesString = @_;
    thou shouldst have
    my $LinesString = shift;
    or
    my ($LinesString) = @_;
    When using $_[1] notation, remember that the first argument startest at $_[0]
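A minimal sketch of the difference between the three forms above. The sub names and file names are invented for illustration.

```perl
use strict;
use warnings;

# Scalar assignment puts @_ in scalar context: the result is the argument count.
sub count_args { my $n = @_; return $n; }

# List assignment copies the first argument into $first.
sub first_arg { my ($first) = @_; return $first; }

# shift removes and returns $_[0].
sub shifted { my $first = shift; return $first; }

print count_args('a.sgm', 'b.sgm'), "\n";    # 2
print first_arg('a.sgm', 'b.sgm'), "\n";     # a.sgm
print shifted('a.sgm', 'b.sgm'), "\n";       # a.sgm
```

So in the posted script, my $FileDef = @_; hands glob the number 1 rather than the wildcard the user typed.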
      My apologies for the flowery language. I wanted to establish my non-moronicity whilst proclaiming it. Seems to have backfired. I wanted to evoke an image of a penitent former MS programmer pleading at the gates of the monastery. I forgot that my favorite screen name can evoke the image of some D&D Paladin. (I thought the fifth-score hour was good, BTW.)
        Speaking as a former MS programmer, the standard tone for those of us who have left the Borg is somewhere around "Those Bastards! Those Evil Sons of..." you get the idea. I always thought of them as more of a demon horde (managers and above) fleshed out with the soul-less (employees, sacrificing everything in their lives to keep the stock vesting at all costs) and the damned (permatemps). But

        Dude, don't worry too much about tone. The Monk metaphor permeates the site, but one reference per post is sufficient. As for your screen name, I thought it was a Dune reference. Duke Leto Atreides, right?

        -Logan
        "What do I want? I'm an American. I want more."

      With respect to your suggestion of inadequate understanding of Perl variable passing, I plead guilty. I have to admit though, based on my work with the synoptic languages, this format is VERY odd.
Re: Need help with subdividing SGML files
by allolex (Curate) on Mar 07, 2003 at 21:44 UTC

    This is slightly OT, so please forgive me that.

    I think the advice about use strict; is truly excellent. And perhaps use warnings; for good measure. But what I'm really writing about is a point of grammar. Kids today just have horrible grammar, using a second person singular form to address a group, hmph! Here are the correct forms/functions:

    • Thou forgivest me mine abhorrent code. (Talking to one person - subject)
    • You forgive me my bad breath. (Talking to a group - still subject)
    • I shall strike thee over the head with the Camel. (object of 'strike')

    That Camel an animal/mine animal, a beast/my beast. Get it?

    BTW, these are just example sentences, I do not have chronic halitosis, I have no plans of violence towards you, and I do not own anything but a paperback camel (as opposed to the kind that would flatten you), so it's unlikely to be fatal anyway.

    Update: A nifty link I found addressing this off-topic. :)

    --
    Allolex

      This is completely off topic, and your grammar lesson is wrong. "Thou" is second person familiar, not second person singular. And "you" is second person formal, NOT second person plural. Neither has a plural form. Shakespearean characters often berate crowds in the thou form. I'll admit, if I would have thought about it... addressing a crowd in the familiar is condescending and abusive, especially if your screenname has a patent of nobility. On the other hand, the King James Bible addresses the deity in the familiar all the time, and the religious inferiority was the relationship I was shooting for.
Re: Need help with subdividing SGML files
by DukeLeto (Novice) on Mar 17, 2003 at 17:28 UTC
    Well, once I figured out how to work the debug mode, it was all downhill from there. I assume you all had a good laugh about the "I just keep getting a 'DB(n)' Prompt that won't take my input." Oh well, you're only ignorant once. The program is now giving a reasonable approximation of its intended functionality. I'm attaching the code below so everyone can point out the silly looking, rough-hewn parts. Of course, in the words of the immortal Beeblebrox, "Hey! Don't knock it, it worked."
    #!/usr/bin/perl -w
    #Purpose: To Take a DOS file wildcard and thus take all the matching custom SGML files in the working directory and subdivide them into new files whose names are the id's of the divs in those original files.

    use strict;

    print "Enter the name of a file containing the list of files you want to work on.\n";

    our $lines = "";
    our @InFileNames;
    our @OutFileNames;
    our @OutFileExtensions;
    our @OutFileContent;
    my $i = 0;
    my $j = 0;
    my $k =0;
    my $TheFile = <STDIN>;
    chomp ($TheFile);

    #open file and get all text
    sub OpenFile {
        open(FILE, $_[0]) or $lines = "";
        local $/ = undef;
        $lines = <FILE>;
        #remove blank lines
        $lines =~ s/\n{2}/\n/gms;
        close(FILE);
    }

    #add ¥ to closing div tags
    sub MarkClose {
        $lines =~ s/(<div type)/¥$1/gms;
        $lines =~ s/\A¥//gms;
    }

    #open output.txt for appending and write results to it
    sub FileAppend {
        my $Outfile = ">>" . $_[0] . "." . $_[1];
        my $Content = $_[2];
        open(FILE, $Outfile) or die "Can't open $Outfile.\n";
        print FILE $Content;
        print FILE "\n";
        close FILE;
    }

    #Create an array containing all file in the directory matching the glob.
    sub GetInFileList {
        my $FileDef = $_[0];
        open (FILE, $FileDef) or die "That isn't a valid file, Wesley!";
        local $/ = undef;
        $lines = <FILE>;
        #remove blank lines
        $lines =~ s/\n{2}/\n/gms;
        close(FILE);
        @InFileNames = split /\n/, $lines;
        #If the program can't give a list to Muhammed, than Muhammed will give a list to the program.
    }

    #Populate an array with the contents of the id attribute of every <div> tag in the input file.
    sub GetOutFilesList {
        $k = 0;
        @OutFileNames = $lines =~ m/<div type[^>]*>/gms;
        while ($k < (scalar(@OutFileNames))){
            $OutFileNames[$k] =~ s/<div type="[^"]*" id="([^"]*)"[^>]*>/$1/gms;
            $OutFileNames[$k] =~ s/\./_/gms;
            $k = $k + 1;
        }
        $k = 0;
        @OutFileExtensions = $lines =~ m/<div type[^>]*>/gms;
        while ($k < (scalar(@OutFileExtensions))){
            $OutFileExtensions[$k] =~ s/<div[1-9]? type="([^"]*)" id="[^"]*"[^>]*>/$1/gms;
            $OutFileExtensions[$k] =~ s/\./_/gms;
            $k = $k + 1;
        }
    }

    #Subdivides the File into the subfiles.
    sub GetOutFileContent {
        my $LinesString = $_[0];
        @OutFileContent = split /¥/, $LinesString;
    }

    ### Does the job
    &GetInFileList($TheFile);
    $i =0;
    while ($i < (scalar(@InFileNames))) {
        &OpenFile($InFileNames[$i]);
        &MarkClose();
        &GetOutFilesList;
        &GetOutFileContent($lines);
        $j=0;
        while ($j < (scalar(@OutFileNames))){
            &FileAppend($OutFileNames[$j], $OutFileExtensions[$j], $OutFileContent[$j]);
            $j = $j + 1;
        }
        $i = $i + 1;
    }

    #be nice and say it's done
    print "Program Finished\n";
    (Edited to reflect the code that actually DID work.)
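For comparison, the same "cut before every <div type" idea can be sketched without inserting a ¥ marker, by splitting on a zero-width lookahead. This is an alternative technique, not the code that was posted. The sample input is invented, and the sketch only collects the chunks in a hash where a real script would write each one to a file.

```perl
use strict;
use warnings;

# Hypothetical input; the thread never shows a real file.
my $sgml = <<'END';
<div type="chapter" id="ch.1">
First chapter text.
</div>
<div type="appendix" id="app.a">
Appendix text.
</div>
END

# Split before every '<div type' without consuming it (zero-width lookahead).
my @chunks = split /(?=<div type)/, $sgml;

my %out;
for my $chunk (@chunks) {
    next unless $chunk =~ /<div type="([^"]*)" id="([^"]*)"/;
    my ($ext, $name) = ($1, $2);
    $name =~ tr/./_/;               # same dot-to-underscore rule as the posted script
    $out{"$name.$ext"} = $chunk;    # a real script would write $chunk to this file
}

print "$_\n" for sort keys %out;
```

Because the name, the extension and the content all come from the same chunk, the three parallel arrays and their index counters are not needed.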
Re: Need help with subdividing SGML files
by DukeLeto (Novice) on Mar 12, 2003 at 18:49 UTC

    UPDATE:

    I've implemented most of the suggestions I received and the little program now looks like this:

    #!/usr/bin/perl -w
    #Purpose: To Take a DOS file wildcard and thus take all the matching custom SGML files in the working directory and subdivide them into new files whose names are the id's of the divs in those original files.

    use strict;

    print "What file(s) do you want to run this program on?\n";

    our $lines = "";
    our @InFileNames;
    our @OutFileNames;
    our @OutFileContent;
    my $i = 0;
    my $j = 0;
    my $TheFile = <STDIN>;
    chomp ($TheFile);

    #open file and get all text
    sub OpenFile {
        open(FILE, $_[0]) or $lines = "";
        local $/ = undef;
        $lines = <FILE>;
        #remove blank lines
        $lines =~ s/\n{2}/\n/gms;
        close(FILE);
    }

    #add ¥ to closing div tags
    sub MarkClose {
        $lines =~ s/(<\/div>)/$1¥/gms;
    }

    #open output.txt for appending and write results to it
    sub FileAppend {
        my $Outfile = ">>" . $_[0] . ".bsd";
        my $Content = $_[1];
        open(FILE, $Outfile) or die "Can't open $Outfile.\n";
        print FILE $Content;
        print FILE "\n";
        close FILE;
    }

    #Create an array containing all file in the directory matching the glob.
    sub GetInFileList {
        my $FileDef = $_[0];
        @InFileNames = glob($FileDef);
    }

    #Populate an array with the contents of the id attribute of every <div> tag in the input file.
    sub GetOutFilesList {
        my $OutFile;
        @OutFileNames = $lines =~ m/<div[^>]*>/gms;
        foreach $OutFile (@OutFileNames){
            $OutFile =~ s/<div type=[^\s]* id="([^>]*)">/$1/gms;
            $OutFile =~ s/\./_/gms;
        }
    }

    #Subdivides the File into the subfiles.
    sub GetOutFileContent {
        my $LinesString = $_[0];
        @OutFileContent = split /¥/, $LinesString;
    }

    ### Does the job
    &GetInFileList($TheFile);
    for ($i = 0, $i < @InFileNames, $i++) {
        &OpenFile($InFileNames[$i]);
        &MarkClose();
        &GetOutFilesList;
        &GetOutFileContent($lines);
        for ($j = 0, $j < @OutFileNames, $j++){
            &FileAppend($OutFileNames[$j], $OutFileContent[$j]);
        }
    }

    #be nice and say it's done
    print "Program Finished\n";

    The script now dies at a very specific point: On line 13 or possibly 18, where it gives the following error message:

    Use of uninitialized value in open at ##program name censored## line 18, <STDIN> line 1.

    I take this to mean that the array entitled @InFileNames has no contents, because the glob function used to fill it on line 45 didn't behave as I thought it would.
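One way to test that suspicion is to call glob on its own and print what comes back. The sketch below assumes nothing about the real directory: it builds a scratch directory with made-up file names, and it does not show that glob was actually the culprit here.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

# Hypothetical setup: a scratch directory holding two .sgm files and one other file.
my $start = getcwd();
my $dir   = tempdir(CLEANUP => 1);
chdir $dir or die "Can't chdir to $dir: $!";

for my $name (qw(one.sgm two.sgm notes.txt)) {
    open(my $fh, '>', $name) or die "Can't create $name: $!";
    close $fh;
}

# glob returns the matching names, or an empty list when a wildcard matches nothing.
my @matches = glob('*.sgm');
my @none    = glob('*.nomatch');

print "matched: @matches\n";
print "nothing matched *.nomatch\n" unless @none;

chdir $start or die "Can't chdir back to $start: $!";
```

Printing the list right after the glob call separates "glob found nothing" from "the loop indexed past the end of the list".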

    Also, the debug mode behaved oddly when it reached the <STDIN> line: it prompted me for input with the line DB(1), then prompted me again with DB(2) once I had given it its input, and so on ad infinitum.

      Why do you bother wasting your time using strict, if all you are going to do is name every undeclared variable at the top of your program? It's pointless.
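One reading of this objection is that strict pays off when variables are declared with my in the smallest scope that needs them, and when subs take arguments and return results instead of sharing file-level globals. A minimal sketch under that assumption; the sub name, the sample text and the | marker are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: the text comes in as an argument and the pieces go out
# as the return value, so nothing has to be declared with 'our' at file scope.
sub split_on_marker {
    my ($text, $marker) = @_;
    my @pieces = split /\Q$marker\E/, $text;
    return @pieces;
}

my @content = split_on_marker('first|second|third', '|');
print scalar(@content), " pieces\n";    # 3 pieces
```

With this shape, a misspelled variable name inside a sub is a compile-time error under strict instead of a silently empty global.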

      There are still two lines (at least) in your updated program that contain simple syntax errors that will prevent your program from doing anything like what you want it to do.

      Look up the syntax of perl's for statements.


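For readers following along, the for statements in the updated script separate their three parts with commas. Perl accepts that, but it parses it as a foreach over a three-element list rather than as a counting loop. A small sketch with made-up file names:

```perl
use strict;
use warnings;

my @names = ('a.sgm', 'b.sgm');
my $i;

# Commas make this a foreach over a three-element list, so the body runs
# three times no matter how many names there are.
my $comma_runs = 0;
for ($i = 0, $i < @names, $i++) { $comma_runs++ }

# Semicolons give the C-style init; test; step loop.
my $semi_runs = 0;
for ($i = 0; $i < @names; $i++) { $semi_runs++ }

# Idiomatic alternative: iterate over the elements directly.
my $each_runs = 0;
for my $name (@names) { $each_runs++ }

print "$comma_runs $semi_runs $each_runs\n";    # 3 2 2
```

In the comma form $i never advances inside the body, which could be one way an undefined array element ends up being passed to open.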

        1) Because I'm coming from a Visual Basic background where the strict definition of Variables at the top of the procedure/package is considered good practice.

        2) The debug program seemed to be prompting me to explicitly define all of the variables, so that's what I did.

        I take it that the syntax errors are so simple and that I've offended your sensibilities so severely that you can't be bothered to point them out.

        You seem to be implying that I'm using for and foreach incorrectly. I can't see how that would be. Perhaps you think the answer is so embarrassing that I would rather figure it out for myself than live in the shame of having it explained to me. Also, you've twice declaimed that you don't know what I want, so it seems odd that you're so sure now.

        To put it bluntly, I find your tone to be insulting and if you don't want to give constructive criticism, I can do without your help.

Re: Need help with subdividing SGML files
by DukeLeto (Novice) on Mar 07, 2003 at 23:16 UTC
    Thank you to everyone who has contributed. We won't have the slightest idea whether the problem is solved until Wednesday at the earliest ("Weekend" . "Jury Duty"). How would I go about activating the debug mode from the (Windows) command line?

      How would I go about activating the debug mode

      perl -d yourscript.pl
      h h

      See perldebug for more details.


        Under unix, thou canst enable yon debugger by making the first line of thy script

        #!/usr/bin/perl -d

        I've found that keeping the -w flag when in debug mode is hugely helpful.

        -Logan
        "What do I want? I'm an American. I want more."