PerlMonks  


My friends and I all like to keep online journals. I would've liked to try something other than Xanga, but all my friends recommended it. The problem is that the "archive" feature is a paid service, so whenever I felt like going back to read some old entries, I had to keep clicking the "Next 5" link. I'm still relatively new to Perl (about 3 weeks or so into it), but I came up with something I find genuinely useful! Thanks Monks! Hopefully someone else can use this...

#!/usr/bin/perl
# Usage: archive.pl USERNAME
#
# Description: Saves all entries of USERNAME's xanga to "archive.html"
# in the working directory
use strict;
use warnings;
use LWP::UserAgent;

my $end = 'http://www.xanga.com/';

my $uid = shift;
unless (defined $uid) {
    print "What is your username? ";
    $uid = <STDIN>;
    chomp $uid;    # chomp removes only the trailing newline
}

my $first_page = 'http://www.xanga.com/home.aspx?user=' . $uid;
print "Connecting to $uid's Xanga...\n";

grab($first_page);
my $next_page = save();    # save() returns the url to Next 5
print "\$next_page is $next_page\n";

# save() returns the bare site url when a page has no "Next 5" link
while ($next_page ne $end) {
    grab($next_page);
    $next_page = save();
    print "\$next_page is $next_page\n";
}
print "\n\n\nCompleted Archiving\n\n\n";

# Usage: grab(url)
#
# Description: fetches url and writes the html to "tmp.html"
sub grab {
    my $url = shift;

    # Truncate tmp.html up front, so a failed fetch leaves an empty file
    # and save() reports that there is no next page
    open my $tmp, '>', 'tmp.html' or die "Cannot write tmp.html: $!";
    print "grabbing $url\n";

    my $ua = LWP::UserAgent->new;
    $ua->agent("MyApp/0.1 ");

    # Be nice to Xanga servers ;-)
    sleep 5;

    # A plain GET needs no content type or request body
    my $res = $ua->get($url);

    # Check the outcome of the response
    if ($res->is_success) {
        print $tmp $res->content;
        print "Successfully grabbed html...\n";
    }
    else {
        print $res->status_line, "\n";
    }
    close $tmp;
}

# Usage: save()
#
# Description: parses "tmp.html" and appends all found entries of that
# page to "archive.html". It also finds the url of the next page to grab
sub save {
    open my $in,  '<',  'tmp.html'     or die "Cannot read tmp.html: $!";
    open my $out, '>>', 'archive.html' or die "Cannot append to archive.html: $!";
    print "Saving...\n";

    # Skip everything before the first entry
    my $line;
    while ($line = <$in>) {
        last if $line =~ /<div class="blogheader">/;
    }
    if (defined $line) {
        print $out $line;
        print "Wrote out \$line\n";
    }

    # Copy entries until the line holding the "Next 5" link
    my $href = '';
    while ($line = <$in>) {
        print $out $line;
        if ($line =~ /Next 5 &gt;&gt;/) {
            # The link is the last quoted string on the line; reversing
            # the line lets a non-greedy match find it first
            my $reversed = reverse $line;
            $href = reverse $1 if $reversed =~ /"(.*?)"/;
            last;
        }
    }
    print "Saved\n";
    close $in;
    close $out;

    return 'http://www.xanga.com/' . $href;    # home.aspx?user=....
}

I know it's a bit crude, but it works! ;-) For now I'm too lazy to clean it up properly, but suggestions would be great! When I feel like it, I think I'll add incremental archiving (instead of going through the entire Xanga each time), a GUI, saving images and comments to the hard drive, etc...
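
One possible cleanup, offered only as a sketch: the next-page lookup inside save() could be pulled out into a small function that works on a single line of text, so it can be tried without hitting Xanga at all. The name next_page_url is hypothetical, and the sample line is a made-up stand-in for Xanga's markup; only the "Next 5 &gt;&gt;" text and the home.aspx?user= form come from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: given one line of
# html, return the absolute url of the next page, or an empty string when
# the line does not hold the "Next 5" link.
sub next_page_url {
    my ($line) = @_;
    return '' unless defined $line && $line =~ /Next 5 &gt;&gt;/;

    # The greedy .* pushes the match to the last quoted string on the
    # line, the same string the reverse() trick in save() picks out.
    my ($href) = $line =~ /.*"(.*?)"/;
    return defined $href ? 'http://www.xanga.com/' . $href : '';
}

# Made-up sample line; the real Xanga markup may differ.
my $sample = '<a href="home.aspx?user=someone">Next 5 &gt;&gt;</a>';
print next_page_url($sample), "\n";
print "no next page\n" if next_page_url('<p>no link here</p>') eq '';
```

With the lookup isolated like this, the main loop only has to check for an empty return value to know it has reached the last page.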

janitored by ybiC: Balanced <readmore> tags around longish codeblock, to reduce scrolling


In reply to Xanga Archive by MistaMuShu
