mistamutt has asked for the wisdom of the Perl Monks concerning the following question:

Hi again. So far I've been able to take the source code from a website and get the entire thing onto one line. The line contains ordered lists of questions and answers, separated by <li> and </li> tags. How would I go about saving the questions into one array and the answers into a separate one? I tried:

    $line =~ s/\n+/ /g; #puts line on one line
    my @questions;
    while ($line =~ s/<ol> <li> (.*?) <\/ol>//s) {
        push @questions, $line;
    }
    my @answers;
    while ($line =~ s/<li>\s*(.*?)\s*<\/li>//s) {
        push @answers, $line;
    }

I'm really new to the language and to scripting in general, so I apologize in advance for being so terrible at it. I've tried to read books and online resources, but I tend to learn best by examples to "see how it works." If this is the correct way to do it, how do I print out the questions array? I tried a foreach loop containing print "$_"; but it didn't work. Any help is greatly appreciated, thanks!

UPDATE!! Thanks to you monks, I've been able to successfully save an entire quiz into an array! Now, I'm trying to divide them up into questions and answers. The pattern is basically like this:

Question? Answer.
Some of them are Question as a statement. Answer
And the last type of question is either a Question or Statement, but with the answer in <pre> tags.

I'm hoping that updating my thread bumps it up; I don't want to flood the site with similar questions.

Re: Help with the push function
by Sherm (Sexton) on Feb 16, 2011 at 03:08 UTC

    The basic problem with this approach is that regular expressions, as their name implies, can only express a regular grammar. But HTML, like all SGML and XML applications, is a context-free grammar, and thus can't always be parsed with regular expressions.

    What you need to do is use HTML::Parser, or a similar CPAN module that's designed for the task at hand.

Re: Help with the push function
by graff (Chancellor) on Feb 16, 2011 at 02:52 UTC
    Sorry for not looking up your previous posts (if they're relevant to this question, it might help to include links to those threads). But just looking at this OP, I wonder:

    Usually, an <ol>...</ol> element contains multiple <li>...</li> elements, but your initial regex assumes there is never more than one "li" within a given "ol". But if you're sure that "answers" are always marked up that way, then that should be fine.

    After each regex, you are pushing the whole (remaining) content of $line onto your answers array, but I think you really want to push @answers, $1;
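To illustrate the point about $1, here is a minimal sketch. The sample $line is hypothetical (the real input was not posted) and assumes every <li> has a closing </li>, as the second regex in the original post expects:

```perl
use strict;
use warnings;

# Hypothetical sample input; assumes each <li> is closed with </li>.
my $line = '<li> first item </li> <li> second item </li>';

my @answers;
while ($line =~ s/<li>\s*(.*?)\s*<\/li>//s) {
    # $1 holds only the text captured by (.*?); $line would be
    # whatever is left of the whole string after the substitution.
    push @answers, $1;
}

# Print one element per line.
print "$_\n" for @answers;
```

Each pass through the loop removes one <li>...</li> element from $line and stores just its captured text, so the loop ends once no elements remain.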

    Apart from that, it might help to provide some sample input, and a sample of what you would want as a result (intended contents of @answers and @questions).

      The HTML document I'm trying to write this for is rather simple. It's an ordered list of quiz questions, followed by answers like this:

      <ol> <li> Question? Answer. <li> Question? Answer. <li> Statement. Response to statement. <li> Question? <pre>Answer</pre> </ol>

      What I'm trying to do is get all of the questions into an array, and then put the quiz into a hash of questions and corresponding answers. So far I've removed the code and span tags, the extra whitespace, and the extra newline characters. The questions always follow an <li> tag. The answers usually come after a question mark, but sometimes they sit inside <pre> tags and end at the closing pre tag, and some "questions" aren't questions at all but statements that you respond to.

      I tried to push @questions, $1 but it still isn't printing anything when I

      foreach (@questions) {
          print "$_\n";
      }

        I tried to push @questions, $1 but it still isn't printing anything when I...

        There's no mention of "push" in the code snippet following that statement, so we don't know what "tried to push @questions, $1" really means in this context. To recap the discussion so far: first you present some code, but no data, then some data (with a different problem) but no code. Dude, we cannot see the things in your head, or the things you see on your terminal but don't show us. Please try harder to be coherent.

        The HTML document I'm trying to write this for is rather simple. It's an ordered list of quiz questions, followed by answers like this:

        "Rather simple" seems apt for describing the markup (though it actually looks like it'll be prone to unpredictable variability). But the information content -- esp. the "Statement. Response to statement." case -- is only simple for human readers who know the language well enough that they can easily figure out which periods mark sentence boundaries (and which ones don't), and can tell the difference between a "statement" and a "response to a statement." (And BTW, have you noticed that sometimes an "answer" to a "question" contains another question?)

        You don't say how much data of this sort you have to deal with, but if it gets to more than several dozen "statement. response." type cases, you can expect some edge cases that may need to be resolved by a human editor.

        If you get acquainted with something like HTML::Parser or HTML::TokeParser, you'll have an easier time dealing with variable mark-up, so you can focus on the harder problem of parsing the information content (where the essential division you need to find might not be marked in any consistent or explicit way).

        In pseudo-code terms, you probably want something like:

        # setup the parser to capture the contents of all <li> chunks into @array, then
        foreach ( @array ) {
            if ( m{(.*?)<pre>(.*)</pre>}s             # NB: "?" makes ".*" non-greedy
                 or /(.*\?)\s+(\S.*)/s                # NB: "\?" matches a literal qmark
                 or /^([^.]+)\.\s+([^.]+\.)\s*$/s )
            {
                push @questions, $1;
                push @answers, $2;
            }
            else {
                # you have a harder case to solve (might need a human)
            }
        }

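The pseudo-code above can be turned into a small runnable sketch. It assumes the <li> chunks have already been extracted into an array (hard-coded here as the hypothetical @chunks, based on the sample HTML in this thread), and it tightens the three patterns slightly with anchors and whitespace trimming. The @unsolved array is also an illustrative addition for collecting cases that need a human:

```perl
use strict;
use warnings;

# Assumption: a parser has already split the list into <li> chunks.
my @chunks = (
    'Question? Answer.',
    'Statement. Response to statement.',
    'Question? <pre>Answer</pre>',
);

my (@questions, @answers, @unsolved);

for my $chunk (@chunks) {
    if (   $chunk =~ m{\A\s*(.*?)\s*<pre>(.*)</pre>}s           # answer inside <pre>
        or $chunk =~ /\A\s*(.*\?)\s+(\S.*?)\s*\z/s               # split after the last "?"
        or $chunk =~ /\A\s*([^.]+\.)\s+([^.]+\.)\s*\z/s )        # exactly two sentences
    {
        # $1 and $2 come from whichever pattern matched.
        push @questions, $1;
        push @answers,   $2;
    }
    else {
        # Harder case: set it aside for manual review.
        push @unsolved, $chunk;
    }
}

print "Q: $questions[$_]\nA: $answers[$_]\n" for 0 .. $#questions;
print "Unsolved: $_\n" for @unsolved;
```

The order of the alternatives matters: the <pre> case is tried first so that a question mark inside the question does not cause the markup to be swallowed into the answer.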
        If you have trouble with that, it'll be okay to start a new thread -- it'll be something other than "help with the push function"...

        (updated last code snippet to correct a mistaken comment about "?")

Re: Help with the push function
by 7stud (Deacon) on Feb 16, 2011 at 03:08 UTC
    my $html =<<'HTML';
    <ul>
    <li>question 1</li>
    <li>question 2</li>
    </ul>
    <ul>
    <li>answer 1</li>
    <li>answer 2</li>
    </ul>
    <ul>
    <li>question 3</li>
    <li>question 4</li>
    </ul>
    <ul>
    <li>answer 3</li>
    <li>answer 4</li>
    </ul>
    HTML

    my (@questions, @answers);
    my @arr = \(@questions, @answers);   # array refs; $arr[0] is the list being filled

    while ($html =~ m{<ul> (.*?) </ul>}xmsg ) {
        my $list = $1;

        while ($list =~ m{<li> (.*?) </li>}xmsg ) {
            my $question_or_answer = $1;
            push @{$arr[0]}, $question_or_answer;
        }

        # lists alternate between questions and answers, so swap the targets
        @arr[0, 1] = @arr[1, 0];
    }

    print "@{$arr[0]} \n";
    print "@{$arr[1]} \n";

    --output:--
    question 1 question 2 question 3 question 4
    answer 1 answer 2 answer 3 answer 4

    However, there are modules on CPAN (some of which use XPath expressions, for instance) that let you pick out the elements of an HTML page without having to write your own regexes.

    Also, removing newlines(\n) before you search the html page does not accomplish anything useful.
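A short sketch of why stripping newlines is unnecessary: with the /s modifier, "." also matches a newline, so a pattern can span lines as they are. The sample $html below is made up for illustration:

```perl
use strict;
use warnings;

# Made-up multi-line input, newlines left in place.
my $html = "<li>\nfirst line\nsecond line\n</li>";

# Without /s, "." stops at a newline, so this pattern cannot span lines.
my $without_s = $html =~ m{<li>(.*?)</li>} ? 1 : 0;

# With /s, "." matches newlines too, so the same idea works unchanged.
my ($item) = $html =~ m{<li>\s*(.*?)\s*</li>}s;

print "matched without /s: $without_s\n";
print "captured with /s: [$item]\n";
```

The captured text keeps its internal newline, which can be collapsed afterwards if single-line output is wanted.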

    Also, when asking a question about HTML, it's much easier and clearer to post some sample HTML than to describe it; a description confuses the issue and leaves everyone guessing at the exact structure.

    I've tried to read books and online resources, but I tend to learn best by examples to "see how it works."

    When you buy a computer book (yes, you have to buy a physical book so that you can scribble notes in the margins), you should always buy one with questions and answers at the end of each chapter, e.g. "Learning Perl 5th" (although if you are completely new to programming, that may be too hard a book). In addition, as you read a computer programming book, you should type out some 2-5 line examples to test whether things really work the way the book says. You'd be surprised how much you can learn about the nuances of the concepts by doing that.

    I think if you don't buy a beginning book, then your studies will be too haphazard. It takes work to learn a language. It's not all fun and games.

Re: Help with the push function
by 7stud (Deacon) on Feb 17, 2011 at 03:12 UTC
    my $html =<<'HTML';
    <ol>
    <li>
    Question1? Answer1.
    <li>
    Question2? Answer2.
    <li>
    Statement. Response to statement.
    <li>
    Question3? <pre>Answer3.</pre>
    </ol>
    HTML

    my (@questions, @answers);

    while ($html =~ m{ <li> \n ( [^<]+ ) }xmsg ) {
        my $text = $1;

        if ($text =~ / [?] /xms ) {
            my @pieces = split / (?<=[?]) \s* /xms, $text;
            push @questions, $pieces[0];

            if (@pieces == 2) {
                chomp $pieces[1];
                push @answers, $pieces[1];
            }
            else {
                $html =~ m{ \G <pre> (.+?) </pre> }xms;
                push @answers, $1;
            }
        }
    }

    print "@questions \n";
    print "@answers \n";

    --output:--
    Question1? Question2? Question3?
    Answer1. Answer2. Answer3.