HTTP filtering and Threads...

danett has asked for the wisdom of the Perl Monks concerning the following question:

1) I have Perl code that makes an HTTP request, saves the response in a variable, and then needs to extract the value of one specific field. My code is more or less like this:

<code>
next unless /^<input name/i;
my ($name, $value) = $_ =~ /input name="(.*)Name" type=.* value="(.*)">/i;
if ((length($value)) > 1) {
    $MiddleName = $value;
    #Some Stuff Code...
    print "$MiddleName";
}
</code>

However, the HTTP request returns HTML that is more or less like this:

<code>
#Some non relevant HTML stuff...
<input name="$mdName" type="hidden" value="Silva">
#Some non relevant HTML stuff...
<input name="Name" type="hidden" value="Silva">
<input name="mdName" type="hidden" value="Daniel">
#Some non relevant HTML stuff...
</code>

The problem is that my code gets the value of "mdName", which is "Daniel". I want it to get the value of "$mdName", which is "Silva", and if that is missing (blank), the value of "Name", which in the example is also "Silva". I never want the value of "mdName" ("Daniel"), but that is what always happens. :( Can someone give me a snippet of code showing how to fix it? :)

2) In the same program I have a piece of code that lists all users and loops over them, calling the function that gets detailed information for each user (the code in question 1 is part of this function). The snippet is like this:

<code>
# Some irrelevant code stuff...
(my $ruid, @userIDs) = &GetUserList($start, $end);
if ($userIDs[0] == -1) {
    exit(0);
}
foreach $userID (@userIDs) {
    &GetUserData($name, $middlename, $lname, $bdate);
    print "$userID\t: $name, $middlename, $lname, $bdate";
    # Some irrelevant code stuff...
}
# Some irrelevant code stuff...
</code>

The function GetUserData() is really slow: it does an HTTP request and parses some HTML, and the number of users is large. So I would like to add thread support to it, so that I could have, for example, 8 instances of this code running in parallel. :)

I have looked at http://perldoc.perl.org/threads.html, but it didn't help much. I believe I should add the thread support so that it works directly with the foreach loop and GetUserData(), right?

However, I want to take care not to overwrite data (in C, when we deal with threads, some unsafe functions can overwrite values, which is not good)... and also that each print comes out in the correct sequence...

Can someone give me a snippet of code based on mine for that? I know that reading the documentation is better, but the documentation didn't help much; I appreciate practical examples...

3) Is Perl2exe (http://www.indigostar.com/perl2exe.htm) the best option for converting Perl code to executables? Does it really work well, even with complicated and sophisticated code (using threads, raw sockets, Windows registry access, etc.)?

Well, this is my first code in Perl, so sorry for the ugly/bad code (also, I'm not a programmer, just curious :). hehe

Thank you, and sorry for the number of (dumb and off-topic) questions.

Cheers,

------------------- End of Main Message -------------------

Hi Joost,

First of all, thank you for your fast reply.

I understand I should not re-invent the wheel; however, I BELIEVE the problem in my code is not exactly with HTTP but with my parsing, maybe because I don't understand regexps very well... any idea how to write this rule correctly?

I'm not sure if my second question is clear (maybe because English is not my native language), but I don't want to share variables and such between threads if possible; my main goal is just to speed up the program in a safe way. My code doesn't need to exchange any data between threads.

As my English is not very good, if possible I prefer help based on snippets of code...
Thank you
Cheers,


Replies are listed 'Best First'.
Re: HTTP filtering and Threads...
by Joost (Canon) on Sep 13, 2007 at 14:56 UTC

I would suggest you don't re-invent the wheel and use WWW::Mechanize to get and parse the HTML.

As for threads, you may or may not find them useful. Perl threads do have issues (for instance, you cannot share most objects, and it may or may not be possible to use a shared socket, depending on your code).

A relatively simple and robust way of making something like this threaded is to use Thread::Queue, shove the starting urls or user names in the queue and have a few worker threads - each with their own WWW::Mechanize object - that pop the urls from the queue, parse the information and push the results on another Thread::Queue that can then be read by the "main" thread.
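
The two-queue arrangement described above could look roughly like the following. This is only a minimal sketch: `get_user_data` is a hypothetical stand-in for the slow GetUserData(), the user IDs are made-up sample values, and it assumes a perl built with thread support and a Thread::Queue recent enough to have `end()` (version 3.01 or later).

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-in for the slow GetUserData(); a real worker would do
# the HTTP request and parsing here, using its own WWW::Mechanize object.
sub get_user_data {
    my ($user_id) = @_;
    return "user$user_id, data for $user_id";
}

my @user_ids    = (1 .. 20);    # assumed sample IDs
my $num_workers = 8;

my $work    = Thread::Queue->new;    # main thread -> workers
my $results = Thread::Queue->new;    # workers -> main thread

my @workers;
for (1 .. $num_workers) {
    push @workers, threads->create(sub {
        # dequeue() blocks until an item is available and returns undef
        # once the queue has been ended and emptied.
        while (defined(my $id = $work->dequeue)) {
            $results->enqueue("$id\t" . get_user_data($id));
        }
    });
}

$work->enqueue(@user_ids);
$work->end;                 # tell the workers that no more work is coming
$_->join for @workers;
$results->end;

# Results arrive in completion order, so collect them first and then
# print in the order of the original list.
my %data_for;
while (defined(my $line = $results->dequeue)) {
    my ($id, $data) = split /\t/, $line, 2;
    $data_for{$id} = $data;
}
print "$_\t: $data_for{$_}\n" for @user_ids;
```

Only plain strings cross the queues here, which sidesteps the object-sharing issues mentioned above, and only the main thread prints, so the output order is under its control.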

update: now that you've erased your original question, it's kind of hard to discuss it.

1. The big advantage with WWW::Mechanize is that it abstracts away all the cruft you don't want to think about when building web crawlers (like, how to robustly match HTML links, fill in forms, find images, etc). Most of those things are not too hard, but chances are very high you'll miss corner cases (for instance, HTML attributes may be single quoted, double quoted or unquoted, and may contain unescaped < and even > characters).

In any case, using WWW::Mechanize's forms() method gives you a much nicer interface to query the form(s) on a page.
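
As a rough illustration of that interface: forms() returns HTML::Form objects, so the sketch below parses a sample page with HTML::Form directly and needs no network access. The sample HTML is modelled on the question; the surrounding `<form>` element, the base URL, and the helper `first_filled` are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::Form;    # the class of the objects that WWW::Mechanize's forms() returns

# Sample page modelled on the question. The <form> wrapper and the base
# URL are assumptions so the example can run offline.
my $html = <<'HTML';
<form action="/user" method="post">
<input name="$mdName" type="hidden" value="Silva">
<input name="Name" type="hidden" value="Silva">
<input name="mdName" type="hidden" value="Daniel">
</form>
HTML

my ($form) = HTML::Form->parse($html, 'http://example.com/');

# Hypothetical helper: return the value of the first listed field that
# exists and is not blank.
sub first_filled {
    my ($form, @names) = @_;
    for my $name (@names) {
        my $input = $form->find_input($name) or next;
        my $value = $input->value;
        return $value if defined $value && length $value;
    }
    return;
}

# Only the two names listed are looked up, so "mdName" is never considered.
my $middle_name = first_filled($form, '$mdName', 'Name');
print "$middle_name\n";
```

With a live page, something like `my ($form) = $mech->forms;` after `$mech->get($url)` would take the place of the `HTML::Form->parse` call.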

2. If your code really doesn't need any sharing of information, you might as well use fork(). For a simpler interface you may want to check out Parallel::ForkManager.
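
A minimal fork() sketch using only core Perl might look like this; Parallel::ForkManager wraps the same idea in a friendlier interface. `get_user_data` is again a hypothetical stand-in for GetUserData(), the IDs are sample values, and the code assumes a Unix-like system where fork() and pipe() behave natively.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the slow GetUserData().
sub get_user_data {
    my ($user_id) = @_;
    return "name$user_id, middle$user_id";
}

my @user_ids    = (1 .. 20);    # assumed sample IDs
my $num_workers = 8;

# Deal the IDs out round-robin, one slice per child process.
my @slices;
push @{ $slices[ $_ % $num_workers ] }, $user_ids[$_] for 0 .. $#user_ids;

my @readers;
for my $slice (grep { defined } @slices) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: works on its own copy of every variable, so nothing
        # in the parent can be overwritten.
        close $reader;
        print {$writer} "$_\t", get_user_data($_), "\n" for @$slice;
        close $writer;
        exit 0;
    }
    close $writer;              # the parent keeps only the read end
    push @readers, $reader;
}

# Parent: gather every child's lines, then print in the original order.
my %data_for;
for my $reader (@readers) {
    while (my $line = <$reader>) {
        chomp $line;
        my ($id, $data) = split /\t/, $line, 2;
        $data_for{$id} = $data;
    }
    close $reader;
}
1 while wait() != -1;           # reap the finished children

print "$_\t: $data_for{$_}\n" for @user_ids;
```

Because only the parent prints, the lines come out in the same sequence as the original user list regardless of which child finishes first.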

"What should it profit a man, if he should win a flame war, yet lose his cool?"
