coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

Alas, I've hit another bump in the road. This script is turning out to be more trouble than I planned for, and more than it's worth. I got the last problem solved (catching timeouts), but now I've found out that not ALL my links are being used.

Below is a sample of what my array of links looks like before I beautify it. This part is fine; all links are pulled. It's when I build a new array with my edited version that I lose a lot of URLs.

What I need to do is take anything that isn't a full URL (such as the first 2 images) and add the $url to the beginning of it. I don't want ANY mailto: links in the result, and I probably want everything that begins with a # gone.

Below is my attempt at this, but of the ~300 good links (yes, I've counted to be sure), I only get about 170 of them back.

@search is my list of unedited URLs.
$url and $base are form params, i.e. www.page.com/test/index.html and www.page.com/test/.
@search_ready should include all the qualified links.

    foreach(@search) {
        if ($_ !~ /^http:\/\//gi) {
            if ($_ !~ /^#/g) {
                if ($_ !~ /mailto:/gi) {
                    my $force_url = "$base$_";
                    push(@search_ready, "$force_url");
                }
            }
        }
        else {
            if ($_ =~ /^\#/g) {
                my $force_url = join("", $url, $_);
                #print "$force_url<br>";
                push(@search_ready, "$force_url");
            }
            else {
                #print "$_<br>";
                push(@search_ready, "$_");
            }
        }
    }

wallcanwest.gif
richpageban.gif
http://www.developingwebs.net
devwebs.jpg
http://www.developingwebs.net/chatroom/
mailto:webmaster@developingwebs.net
http://www.intelinfo.com/newly_researched_free_training/Free_Photoshop_Training_and_Tutorials.html
http://www.icehousedesigns.com/tutorials/photoshop/huge_photoshop_tutorial_list.php
http://www.photoshopcafe.com/
http://www.planetphotoshop.com
http://www.teamphotoshop.com/index.php
http://tiemdesign.com/HOWTO/Photoshop.htm
http://www.aqa-d.se/ny/pstips/fwf_all.htm
http://misery.subnet.at/
http://www.dwphotoshop.com/photoshop/
http://www.eyesondesign.net
http://www.good-tutorials.com/
http://www.pegaweb.com/
http://www.phong.com/tutorials/
http://www.snecx.com
http://www.spyroteknik.com/
http://div.dyndns.org/EK/tutorial/
bgtable.jpg
#Animation
#Bullets
#Certified
#Comics
#Downloads
#Effects
#Basics
#Future
#Horror
#Anatomy
#Industrial
#Logos
#Nature
#Navigate
#Objects
#Aliens
#Patriotic
#Photos
#Pixel
#Buy
#Love
#Space
#Templates
#Text
#Textures
http://www.insidegraphics.com/image_ready/filter_animation_effects.asp
http://www.insidegraphics.com/image_ready/gif_animation_tutorials.asp
http://www.insidegraphics.com/image_ready/color_animation_tips.asp
http://www.insidegraphics.com/image_ready/gif_animation_lessons.asp
http://www.insidegraphics.com/image_ready/gif_animation_tips.asp
http://www.insidegraphics.com/image_ready/gif_animation_effects.asp
http://www.insidegraphics.com/image_ready/digital_animation_tips.asp
http://www.insidegraphics.com/image_ready/digital_gif_animations.asp
http://www.insidegraphics.com/image_ready/animated_web_graphics.asp
#Menu
http://twh.telefragged.com/2d/jewel.htm
http://home.kellishaver.com/tutorials/view.php?id=009
http://www.mccannas.com/pshop/pshop2.htm
#Menu
http://brainbench.com/xml/bb/common/testcenter/taketest.xml?testId=1202
#Menu
http://www.escrappers.com/cartoon.html
http://www.stanleythemole.com/cag/cagtutorialsRK.html
http://www.n-sane.net/tutorials/comic_text/index.php
http://www.crunchball.com/guest1.php
http://www.rpi.edu/~shielb/wwww/tutorial/tutorialmain.html
http://www.worth1000.com/tutorial.asp?sid=160998
http://www.bluesfear.com/tutorials/coloring.html
http://www.wildlife-fantasy.com/artwork/tutorial/tutorial.html
http://www.opticnurve.com/tutorial_view.aspx?control=usercontrols/tutorials/tutorial_scanlines2.ascx&tutorialID=71
http://www.escrappers.com/spider.html
http://www.photoshopcafe.com/tutorials/extrude/extrude.htm
#Menu
http://www.actionfx.com/
http://www.adobe.com/education/curriculum/classroominabook.html
http://www.peachpit.com/lessonfiles/ai7cib.asp
http://share.studio.adobe.com/Default.asp
http://www.designsbymark.com/index.php
http://www.philipp-spoeth.de/
http://www.psptips.com/filters.html
http://photoshopgurus.info/
#Menu
http://www.webdesignhelper.co.uk/photoshop_tutorials/photoshop_tutorials/photoshop_tutorial17/photoshop_tutorial17.shtml
http://www.handson.nu/HTML/glass/
http://www.bluesfear.com/tutorials/dots.htm
http://www.n-sane.net/tutorials/trippywave/index.php
#Menu
http://graphicssoft.about.com/library/course/bllps5out.htm
http://www.tiemdesign.com/HOWTO/2003/April/PSCustomFilter/default.asp
http://www.arraich.com/ps_intro.htm
http://www.teamphotoshop.com/photoshop/tutorials/techniques/blendmode_2/blendmodes_2.php
http://www.teamphotoshop.com/photoshop/tutorials/tools/layerstyle_ex/layerstyle_ex.php
http://www.itts.ttu.edu/documentation/ps/psintro.html
http://www.teamphotoshop.com/ps7sc.php#shortcut_viewing
http://www.webdesignhelper.co.uk/photoshop_tutorials/photoshop_tutorials/photoshop_tutorial29/photoshop_tutorial29.shtml
http://www.perfectpixels.com/index.cfm?method=photoshop.detail&ID=10
http://www.trainingtools.com/online/photoshop6/index.htm
http://biorust.com/index.php?page=tutorial_detail&tutid=10
#Menu
http://www.opticnurve.com/tutorial_view.aspx?control=usercontrols/tutorials/tutorial_3DSphereArray.ascx&tutorialID=72
http://www.n-sane.net/tutorials/spikeyball/index.php
http://www.poidesign.com/tutorials/explosion/index.htm
http://www.absolutecross.com/tutorials/photoshop/textures/neon/
http://www.phong.com/tutorials/sphere/
http://www.poidesign.com/tutorials/wirestexture/index.htm
#Menu
http://www.insidegraphics.com/photoshop/photoshop_online_plugins.asp
http://twh.telefragged.com/2d/blood-spat.html
http://www.mccannas.com/pshop2/tip18.htm
http://www.icehousedesigns.com/tutorials/photoshop/spooky_text.php
http://www.onusart.org/Version8/Tutorials/Dark_Art_Part1/index.htm
http://skinsntemplates.com/view.php?id=45&title=Dark%20Layout
http://dwphotoshop.com/photoshop/animatingeyeballs.php
http://dwphotoshop.com/photoshop/hauntedhouse.php
http://www.onusart.org/Version8/Tutorials/Evil_Art/index.htm
http://w1.243.telia.com/~u24308054/designstudios/tutorial/fog.html
http://www.photoshoptoday.com/HOWTO/2003/October/PSPumpkin/default.asp
http://www.fiftylab.be/demoneyes/index.htm
http://www.n-sane.net/tutorials/organicrawflesh/index.php
http://www.webdesignhelper.co.uk/photoshop_tutorials/photoshop_tutorials/photoshop_tutorial11/photoshop_tutorial11.shtml
http://www.bluesfear.com/tutorials/Blood.htm
http://www.escrappers.com/specialfxhand.htm
http://www.stridingstudio.com/tutorials/photomanip.html
http://screaming-art.com/t_blood.php
http://www.worth1000.com/tutorial.asp?sid=160999
http://www.unhappyteens.com/main/content.php?content.25
#Menu
http://www.dwphotoshop.com/photoshop/greateyeballs.php
http://geda-online.com/tutorials/eyeball/
http://my.en.com/~kimo/eyemap-tut.html
http://www.teamphotoshop.com/photoshop/tutorials/techniques/eyeballs/eyeballs.php
http://robouk.mchost.com/tuts/tutorial.php?tutorial=lips
http://www.aovs02.dsl.pipex.com/xray.htm
#Menu
http://www.aqa-d.se/ny/pstips/interfx/screw.htm
http://www.annesdesign.net/index.php?side=ps_dots
http://www.deaddreamer.com/v10/tuto/chains.html
http://www.aovs02.dsl.pipex.com/circuit.htm
http://twh.telefragged.com/2d/corrode.html
http://www.cybortech.com/tutorials/Screw02.htm
http://www.stewartstudio.com/tips/dentandcorrode.htm
http://www.n-sane.net/tutorials/interface_cracked_metal/index.php
http://www.n-sane.net/tutorials/interface_hole/index.php
http://www.brusberg.net/1/tutorials/screw_nut/screw_nut.htm
http://www.deaddreamer.com/v10/tuto/metal.html
http://www.absolutecross.com/tutorials/photoshop/effects/metal/
http://www.phong.com/tutorials/wire/
http://www.distortion.co.uk/freebies/mstamp.html
http://www.snecx.com/core.php?sect=tutor/tube&view=1
http://www.the-internet-eye.com/HOWTO/2000/TexturesL1/
http://www.eyesondesign.net/pshop/screw/screw1.htm
http://www.snecx.com/core.php?sect=tutor/spider&view=1
http://biorust.com/index.php?page=tutorial_detail&tutid=28
http://biorust.com/index.php?page=tutorial_detail&tutid=15
#Menu
http://www.voidix.com/3dlogo.html
http://www.purephotoshop.com/article/88
http://www.dwphotoshop.com/photoshop/cat_logo.php
http://www.dwphotoshop.com/photoshop/3dlogo.php
http://julni0.tripod.com/temp.html
http://www.pegaweb.com/tutorials/company-logo-design/company-logo-design.htm
http://www.n-sane.net/tutorials/LightBlastText/index.php
http://www.fiftylab.be/oldskoolart/index.htm
http://www.frostfactor.com/v3/tutorials/photoshop/pentool.html
http://www.bluesfear.com/tutorials/glass.htm
http://www.fiftylab.be/triballogo/logotutorial.htm
#Menu
http://www.developingwebs.net/photoshop/cloud.php
http://www.dwphotoshop.com/photoshop/woodcookie.php
http://www.dwphotoshop.com/photoshop/lightning.php
http://www.dwphotoshop.com/photoshop/firetext.php
http://www.dwphotoshop.com/photoshop/rain_drops.php
http://www.dwphotoshop.com/photoshop/snakeskin.php
http://www.dwmdesigns.com/watertutorial.html
http://www.n-sane.net/tutorials/fluffy_realistic_clouds/index.php
http://w1.243.telia.com/~u24308054/designstudios/tutorial/footstep.html
http://www.photoshop-stuff.com/Photoshop-Tutorials/FrozenText/FrozenTutorial.html
http://www.essencefx.net/tutorials/psd/field/fieldtut.php
http://www.gurusnetwork.com/tutorials/photoshop/lightning.html
http://www.tiemdesign.com/HOWTO/2003/July/PSClouds/default.asp
http://www.myjanee.com/tuts/rainbow/rainbow5.htm
http://www.bluesfear.com/tutorials/fire2.htm
http://www.bluesfear.com/tutorials/ice.htm
http://www.visualxtreme.com/tutorial1.html
http://www.2ginc.com/tutorials/pstips04.html
http://www.2ginc.com/tutorials/pstips07.html
http://www.2ginc.com/tutorials/pstips08.html
http://www.2ginc.com/tutorials/pstips19.html
http://www.idigitalemotion.com/tutorials/guest/clouds/clouds.html
http://www.purephotoshop.com/article/20
http://www.photoshopcafe.com/tutorials/rock/rock.htm
http://www.sjci.com/VisualBasic/Photoshop/Water%20Texture/seamless_water.htm
http://www.somethingleet.com/forum/articles.php?action=viewarticle&artid=54
http://www.insidegraphics.com/photoshop/photoshop_online_textures.asp
http://www.theukwebdesigncompany.com/articles/tiger-photoshop-tutorial.php
http://www.gurusnetwork.com/tutorials/photoshop/droplets.html
#Menu
http://home.kellishaver.com/tutorials/view.php?id=002
http://www.newtutorials.com/aqua-button-tutorial.htm
http://www.wdvl.com/Authoring/Graphics/Tools/Photoshop/rx5_arrows_pg1.html
http://robouk.mchost.com/tuts/tutorial.php?tutorial=audio1
http://www.eyesondesign.net/pshop/buttonbar/buttonbar1.htm
http://www.adobe.com/tips/phs8navbar/main.html
http://www.poidesign.com/tutorials/interface/index.html
http://www.dwphotoshop.com/photoshop/glass_buttons.php
http://www.dwphotoshop.com/photoshop/movealignment.php
http://geda-online.com/tutorials/inlettbutton/
http://www.cyberinkdesign.com/photoshop/buttonbar.htm
http://biorust.com/index.php?page=tutorial_detail&tutid=23
http://www.photoshopcafe.com/tutorials/i-face/iface.htm
http://www.essencefx.net/tutorials/psd/navbar/navbar.php
http://www.cbtcafe.com/photoshop/beveledbutton/beveledbutton.html
http://www.2ginc.com/tutorials/pstips02.html
http://www.spoono.com/photoshop/tutorials/tutorial.php?id=55
http://www.absolutecross.com/tutorials/photoshop/interfaces/remote-control/
http://www.absolutecross.com/tutorials/photoshop/interfaces/round-edges/
http://www.insidegraphics.com/photoshop/photoshop_texture_buttons.asp
http://www.insidegraphics.com/photoshop/photoshop_buttons_tips.asp
http://biorust.com/index.php?page=tutorial_detail&tutid=31
#Menu
http://www.photoshoptoday.com/HOWTO/2003/October/PSClock/default.asp
http://www.heathrowe.com/tuts/3dring.asp
http://www.proeffect.com/tutorials/tutorials.php?id=ps81
http://www.webdesignhelper.co.uk/photoshop_tutorials/photoshop_tutorials/photoshop_tutorial8/photoshop_tutorial8.shtml
http://www.brusberg.net/1/tutorials/camera/cameratut.htm
http://www.annesdesign.net/index.php?side=ps_cd
http://geda-online.com/tutorials/cd/
http://www.snecx.com/core.php?sect=tutor/disc&view=1
http://www.thinkdan.com/tutorials/photoshop/compactdisc/index.html
http://www.developingwebs.net/photoshop/dicey.php
http://www.developingwebs.net/photoshop/oriental_fan.php
http://www.bit-snc.it/tut_disk_en.php
http://www.somethingleet.com/forum/articles.php?action=viewarticle&artid=53
http://www.myjanee.com/tuts/shapes6/shapes62.htm
http://www.photoshopcafe.com/tutorials/Golf_Ball/Golf_ball.htm
http://www.designsbymark.com/pstips/pdf/hammer.pdf
http://users.kaski-net.net/~sector7/tutorials/knife/index.htm
http://www.photoshoptoday.com/HOWTO/2003/October/PSDoll/default.asp
http://www.planetphotoshop.com/smith8.html
http://home.kellishaver.com/tutorials/view.php?id=007
http://www.snecx.com/core.php?sect=tutor/mouse
http://www.eyesondesign.net/pshop/mouse/mouse1.htm
http://www.fiftylab.be/macicon/index.htm
http://www.escrappers.com/necklace.htm
http://www.photoshoptoday.com/HOWTO/2002/December/PSPattVase/default.htm
http://home.iprimus.com.au/s_tong/pdfs/melon.pdf
http://www.photoshopcafe.com/tutorials/mic/microphone.htm
http://www.photoshopcafe.com/tutorials/rope/rope.htm
http://www.spoono.com/photoshop/tutorials/tutorial.php?id=60
http://geda-online.com/tutorials/tape/
http://www.effectlab.com/tutsmoke.php
http://www.opticnurve.com/tutorial_view.aspx?control=usercontrols/tutorials/tutorial_soccerball.ascx&tutorialID=59
http://www.somethingleet.com/forum/articles.php?action=viewarticle&artid=57
http://www.tmedsdesign.com/?section=tutorials&view=speaker
http://www.newtutorials.com/creating-stamps-in-photoshop.htm
bitty.gif
http://infinite-fire.net/index.php?tutid=26
http://www.annesdesign.net/index.php?side=ps_teabag
http://www.wfate.com/cl/main.php?cl=web/pst/tt.php
http://www.newtutorials.com/creating-a-pounding-subwoofer.htm
http://biorust.com/index.php?page=tutorial_detail&tutid=40
http://www.ghostgraphics.net/waxtext.htm
http://www.brusberg.net/1/tutorials/watch/watchtut.htm
#Menu
http://www.absolutecross.com/tutorials/photoshop/text/alien-text/
http://www.purephotoshop.com/article/2
http://www.dwphotoshop.com/photoshop/tentacles.php
http://www.dwphotoshop.com/photoshop/animated_ufo.php
#Menu
http://www.dwphotoshop.com/photoshop/waveflag.php
bitty.gif
http://www.grafx-design.com/14photo.html
bitty.gif
#Menu
http://www.dwphotoshop.com/photoshop/adjusting_lighting.php
http://www.dwphotoshop.com/photoshop/antiquingphotographs.php
http://www.dwphotoshop.com/photoshop/colorizing.php
http://www.dwphotoshop.com/photoshop/gradient_to_reality.php
http://www.dwphotoshop.com/photoshop/lighthouse_lighting.php
http://www.dwphotoshop.com/photoshop/vinette.php
http://www.dwphotoshop.com/photoshop/shadydogtutorial.php
http://www.dwphotoshop.com/photoshop/dogportrait.php
http://www.adobe.com/print/tips/phsdigitalhair/main.html
http://www.80four.co.uk/tutorials/dreaming.html
http://www.fiftylab.be/fantasize/index.htm
http://www.planetphotoshop.com/alward69.html
http://www.cyberinkdesign.com/photo_repair.htm
http://www.worth1000.com/tutorial.asp?sid=161006
http://www.edigitalphoto.com/eUniversity/0311edp_nine/
http://www.digitalmediadesigner.com/2003/08_aug/tutorials/pshair030825.htm
http://www.adobe.com/tips/phs8retouch/main.html
http://www.opticnurve.com/tutorial_view.aspx?control=usercontrols/tutorials/tutorial_retroart1.ascx&tutorialID=82
http://www.wowwebdesigns.com/power_guides/shatter/
http://www.worth1000.com/tutorial.asp?sid=161012
#Menu
http://www.computerarts.co.uk/tutorials/default.asp?pagetypeid=2&articleid=18023&subsectionid=847&subsubsectionid=762
http://marengo.neopages.net/tutorials/tutpixel/pix1.htm
#Menu
http://www.amazon.com/exec/obidos/ASIN/B0000DBOAX/thereferetableof/103-3846908-9947014
http://www.digi-element.com/site/index.htm
http://www.absolutecross.com/tutorials/photoshop/basics/getting-ps/
http://www.photoshoproadmap.com/photoshop-plugins.html
#Menu
http://www.brusberg.net/1/tutorials/rose/rose.htm
http://www.annesdesign.net/index.php?side=ps_heart
http://www.designsbymark.com/pstips/pdf/valen_candyhearts.pdf
http://www.annesdesign.net/inc_filer/ps_card.html
http://www.webdesignhelper.co.uk/photoshop_tutorials/photoshop_tutorials/photoshop_tutorial12/photoshop_tutorial12.shtml
http://www.unhappyteens.com/main/content.php?content.28
http://www.myjanee.com/tuts/hearts/heart1.htm
http://hem.bredband.net/stoffe/heart/
http://www.tlady2.clara.co.uk/choc/chocpage2.html
#Menu
http://www.brusberg.net/1/tutorials/globe/globe.htm
http://www.netcorridor.yellowpipe.com/ps_burst_fx.php
http://www.netcorridor.yellowpipe.com/ps_3D_well.php
http://www.maddesigns.com/planet_tutorial/index.html
http://www.dwphotoshop.com/photoshop/building_planets.php
http://www.dwphotoshop.com/photoshop/nebula.php
http://www.dwphotoshop.com/photoshop/SunTutorial.php
http://www.dwphotoshop.com/photoshop/wireframe_sphere.php
http://www.spoono.com/photoshop/tutorials/tutorial.php?id=54
http://www.solinari.net/tutorials/sun/sun.php
http://www.eyeinthesky.com.au/photoshop/saturn.html
http://www.absolutecross.com/tutorials/photoshop/textures/starscape/
http://twh.telefragged.com/2d/planets.htm
http://twh.telefragged.com/2d/cannon.html
http://gallery.artofgregmartin.com/tutorials.html
http://www.embrazer.com/tutorial/fireball.php
#Menu
http://www.pegaweb.com/tutorials/touch-of-class/touch-of-class-web-design-tutorial-1.htm
http://www.photoshopcafe.com/tutorials/backgrounds/backgrounds.htm
http://www.photoshopcafe.com/tutorials/super%20tutorial%202/website.htm
http://www.pegaweb.com/tutorials/fiveminutewebsitetutorial/five-minute-website-tutorial.htm
http://www.pegaweb.com/tutorials/web-page-header/web-page-header-1.htm
http://www.handson.nu/HTML/insetbar.shtml
http://www.pegaweb.com/tutorials/modern-web-design/modern-web-design.htm
http://www.bluesfear.com/tutorials/interface2.php
http://www.webdesignhelper.co.uk/photoshop_tutorials/photoshop_tutorials/photoshop_tutorial2/photoshop_tutorial2.shtml
http://www.webdesignhelper.co.uk/photoshop_tutorials/photoshop_tutorials/photoshop_tutorial3/photoshop_tutorial3.shtml
http://www.webdesignhelper.co.uk/photoshop_tutorials/photoshop_tutorials/photoshop_tutorial4/photoshop_tutorial4.shtml
#Menu
http://www.eyesondesign.net/pshop/yummy_text/tutorial.htm
http://www.mccannas.com/pshop2/tip20.htm
http://www.dwphotoshop.com/photoshop/bevel.php
http://www.dwphotoshop.com/photoshop/cardboard_text.php
http://www.dwphotoshop.com/photoshop/chrometext.php
http://www.dwphotoshop.com/photoshop/embroidery.php
http://www.dwphotoshop.com/photoshop/matrixtext.php
http://www.dwphotoshop.com/photoshop/sketchtext.php
http://www.dwphotoshop.com/photoshop/razzle_dazzle.php
http://www.dwphotoshop.com/photoshop/rusty_text.php
http://www.thinkdan.com/tutorials/photoshop/gooptext/index.html
http://www.webdesignhelper.co.uk/photoshop_tutorials/photoshop_tutorials/photoshop_tutorial27/photoshop_tutorial27.shtml
http://www.photoshopcafe.com/tutorials/engraved/engraved.htm
#Menu
http://www.opticnurve.com/tutorial_view.aspx?control=usercontrols/tutorials/tutorial_camouflage.ascx&tutorialID=56
http://graphicssoft.about.com/library/tuts/bltut29cowspots.htm
http://www.photoshoptoday.com/HOWTO/2003/January/PSDenim/default.htm
http://www.dwphotoshop.com/photoshop/jigsawpuzzle.php
http://www.dwphotoshop.com/photoshop/brick.php
http://www.dwphotoshop.com/photoshop/sandstone.php
http://www.dwphotoshop.com/photoshop/matrixtexture.php
http://www.unhappyteens.com/main/content.php?content.13
http://www.2ginc.com/tutorials/pstips14.html
http://www.eyesondesign.net/pshop/rust/texture.htm
http://biorust.com/index.php?page=tutorial_detail&tutid=42
#Menu
bgtable.jpg
http://www.sulfericacid.com/cgi-bin/scripts/linkcheck/check.pl
mailto:leannericher@mail.com
http://www.sulfericacid.com/donate.shtml
Links22.html
Links11.html
prevtub.gif
http://geocities.com/thericher
home.gif
http://www.developingwebs.net/chatroom/
http://us.i1.yimg.com/us.yimg.com/i/mc/mc.js
http://geocities.com/js_source/geov2.js
http://visit.webhosting.yahoo.com/visit.gif?us1080635375
http://geo.yahoo.com/serv?s=76001068&t=1080635375

Replies are listed 'Best First'.
Re: link parsing
by Tomte (Priest) on Mar 30, 2004 at 09:27 UTC

    I strongly recommend implementing this using URI!

    Something along the lines of (untested and incomplete!):

        use URI;

        foreach my $url (@search) {
            my $uri    = URI->new($url);
            my $scheme = $uri->scheme();
            # Relative links have no scheme, so check for undef first.
            if (!defined($scheme) || $scheme =~ /http/) {
                # process $uri
            }
        }

    regards,
    tomte


    Hlade's Law:

    If you have a difficult task, give it to a lazy person --
    they will find an easier way to do it.

          use LWP::Simple qw(!head);
          use LWP::UserAgent;
          use HTML::LinkExtor;
          use URI::URL;

          my $ua = LWP::UserAgent->new;
          my $p = HTML::LinkExtor->new;
          $ua->timeout(3);
          my $res = $ua->request(HTTP::Request->new(GET => $url), sub {$p->parse($_[0])});

          ##################
          # Retrieve information from our anony array
          ##################
          for ($p->links) {
              if (defined $_->[2]) {
                  push(@search, $_->[2]);
              }
          }

          #################
          # Take known URL-types and rebuild them
          #################
          foreach(@search) {
              if ($_ !~ /^http:\/\//gi) {
                  if ($_ !~ /^#/g) {
                      if ($_ !~ /mailto:/gi) {
                          my $force_url = "$base$_";
                          push(@search_ready, "$force_url");
                      }
                  }
              }
              else {
                  if ($_ =~ /^\#/g) {
                      my $force_url = join("", $url, $_);
                      #print "$force_url<br>";
                      push(@search_ready, "$force_url");
                  }
                  else {
                      #print "$_<br>";
                      push(@search_ready, "$_");
                  }
              }
          }

      Is what I have so far, actually. I was using URI! :)

        You load URI::URL with "use", but you never actually call it anywhere in your code.


        He who asks will be a fool for five minutes, but he who doesn't ask will remain a fool for life.

        Chady | http://chady.net/
Re: link parsing
by TilRMan (Friar) on Mar 30, 2004 at 10:13 UTC

    First, please, instead of including your entire data set, whittle it down to an interesting subset.

    Second, time for some Perl cleanup. Perhaps along the way we'll find your bug.

        foreach (@search)
        {
            # We'll put the cleaned-up URL here temporarily.
            # We could do a push directly, but this will make
            # debugging easier.
            #
            my $ready;

            # You said you wanted to throw out mailtos
            #
            /^mailto:/ and next;

            # "if ($_ !~ /^http:\/\//gi)"
            #
            # /g is meaningless here; the caret can only match once.
            # "$_ !~ /.../" is a long way of saying "!/.../".
            # Pick your quotes to avoid Leaning Toothpick Syndrome.
            # And are you sure you want to start your if-else with
            # a negative test?
            #
            if (!m[^http://]i)
            {
                # "if ($_ !~ /^#/g)"
                #
                # Same as above
                #
                if (!/^#/)
                {
                    $ready = "$base$_";
                }
            }
            else # m[^http://]
            {
                # "if ($_ =~ /^\#/g)"
                #
                # But the first character can't be "#" because
                # at this point we know it's "h"!
                # Shorten; /g useless
                #
                if (/^#/)
                {
                    $ready = "$url$_";
                }
                else
                {
                    $ready = $_;
                }
            }

            if (not defined $ready)
            {
                print "$_ was lost!<br>\n";
                next;
            }
            print "$_ becomes $ready<br>\n";  # For debugging
            push @search_ready, $ready;
        }

    Looks like we found the bug. I'm not quite sure what you're going for with the #'s, so I haven't tried to fix it. I highly suggest using positive tests in your if-elses. It tends to be less confusing.
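
    For illustration, here is a minimal sketch of the same loop body written with positive tests only. The helper name normalize_link is made up for this example, and attaching "#..." links to $url is only a guess at what the unreachable branch was meant to do; replace that line with a bare "return undef" if fragments should simply be dropped.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the
# cleaned-up link, or undef when the link should be thrown away.
sub normalize_link {
    my ($link, $url, $base) = @_;

    return undef       if $link =~ /^mailto:/i;    # mail links are dropped
    return "$url$link" if $link =~ /^#/;           # fragment: attach to the page URL (assumed intent)
    return $link       if $link =~ m[^http://]i;   # already absolute, keep as-is
    return "$base$link";                           # anything else is treated as relative
}

# Example values; the scheme on $url and $base is an assumption.
my $url  = 'http://www.page.com/test/index.html';
my $base = 'http://www.page.com/test/';

my @search = (
    'wallcanwest.gif',
    '#Menu',
    'mailto:webmaster@developingwebs.net',
    'http://www.photoshopcafe.com/',
);

# map calls the helper in list context, so "return undef" yields one
# undef element that grep then filters out.
my @search_ready = grep { defined } map { normalize_link($_, $url, $base) } @search;
print "$_\n" for @search_ready;
```

    Each kind of link gets exactly one test, checked in a fixed order, so no branch can end up hidden behind a condition that rules it out.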

Re: link parsing
by xdg (Monsignor) on Mar 30, 2004 at 09:57 UTC

    Well, first, that huge post really needed a "readmore" wrapping it all. You stuck a "readmore" in the middle of the URLs, but never closed it. IMHO, the code as well as the URL list should have been wrapped.

    Second, are you sure it isn't working? I downloaded your list and got the following

        $ grep "mailto" pm.txt | wc
              2       2      65
        $ grep "^#" pm.txt | wc
             50      50     367
        $ wc out1.txt
            333     333   18647 out1.txt
        $ wc pm.txt
            385     385   18779 pm.txt

    where "pm.txt" is your list (with the "readmore" removed) and "out1.txt" is the result of running your code in the following form:

    Listing: clean1.pl

        #!/usr/bin/perl -w

        $url="www.page.com/test/index.html";
        $base="http://www.page.com/test/";

        open(INFILE,"<pm.txt") or die;
        @search = <INFILE>;
        chomp for @search;

        foreach(@search) {
            if ($_ !~ /^http:\/\//gi) {
                if ($_ !~ /^#/g) {
                    if ($_ !~ /mailto:/gi) {
                        my $force_url = "$base$_";
                        push(@search_ready, "$force_url");
                    }
                }
            }
            else {
                if ($_ =~ /^\#/g) {
                    my $force_url = join("", $url, $_);
                    #print "$force_url<br>";
                    push(@search_ready, "$force_url");
                }
                else {
                    #print "$_<br>";
                    push(@search_ready, "$_");
                }
            }
        }

        print "$_\n" for @search_ready;

    So by my count, it looks like it's doing the right thing. However, I've rewritten it here in a clearer form.

    Listing: clean2.pl

        #!/usr/bin/perl -w

        $url="www.page.com/test/index.html";
        $base="http://www.page.com/test/";

        open(INFILE,"<pm.txt") or die;
        @search = <INFILE>;
        chomp for @search;

        foreach(@search) {
            next if /^#/;
            next if /^mailto/i;
            if ( $_ =~ m|^http://|i ) {
                push(@search_ready, "$_");
            }
            else {
                push(@search_ready, "$base$_");
            }
        }

        print "$_\n" for @search_ready;

    Note that I've removed the "/^\#/" logic entirely, since that branch could never run: it sits in the else that is only reached when the link already starts with "http://". A diff of the two outputs shows no differences, but the second version makes it much clearer what's happening.
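
    The arithmetic behind the count check is 385 - 50 - 2 = 333: every line of pm.txt, minus the "#" fragments, minus the mailto links. A minimal sketch of the same check in Perl follows; expected_kept is a made-up helper name, and it anchors the mailto test at the start of the link, which is slightly stricter than the grep above.

```perl
use strict;
use warnings;

# Hypothetical helper: how many links should survive the filter, i.e.
# everything that is neither a "#fragment" nor a mailto: link.
sub expected_kept {
    my @links = @_;
    my $fragments = grep { /^#/ } @links;          # grep in scalar context returns a count
    my $mailtos   = grep { /^mailto:/i } @links;
    return @links - $fragments - $mailtos;         # @links in numeric context is its length
}

# A five-line sample taken from the posted list.
my @sample = (
    'wallcanwest.gif',
    '#Menu',
    'mailto:leannericher@mail.com',
    'http://www.photoshopcafe.com/',
    '#Text',
);
print expected_kept(@sample), "\n";   # 5 - 2 - 1 = 2
```

    Comparing that number with scalar(@search_ready) after the loop gives a quick sanity check on whether any links are really being lost.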

    If you still think it isn't working for you, more details are needed.

    -xdg

    Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.

Re: link parsing
by BUU (Prior) on Mar 31, 2004 at 00:44 UTC
    Assuming you're using something smart to get all the links (HTML::Parser or HTML::LinkExtor), why not just use URI's new_abs method to build each absolute URL from a base URL? That seems to be what you're trying to do.

        use URI;

        my $base_uri = 'http://page.tld/im/spidering';
        my @links = get_links($base_uri);   # get_links() stands in for your own link extractor
        for( @links ) {
            # absolute form of the link, resolved against $base_uri
            my $new_base = URI->new_abs($_,$base_uri);
        }
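
    A slightly fuller sketch of that idea, assuming the URI module is installed. The helper name absolute_links is invented for this example, and the page URL has to include its scheme ("http://...") for the resolution to work, which the form params quoted in the question do not.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve raw links against the page they came from,
# skipping in-page fragments and anything that is not http or https.
sub absolute_links {
    my ($page_url, @links) = @_;
    my @ready;
    for my $link (@links) {
        next if $link =~ /^#/;                        # in-page anchor, nothing to fetch
        my $abs    = URI->new_abs($link, $page_url);  # relative links become absolute
        my $scheme = $abs->scheme;
        next unless defined $scheme && $scheme =~ /^https?$/;   # this drops mailto: links
        push @ready, $abs->as_string;
    }
    return @ready;
}

my @search_ready = absolute_links(
    'http://www.page.com/test/index.html',
    'wallcanwest.gif',
    '#Menu',
    'mailto:webmaster@developingwebs.net',
    'http://www.photoshopcafe.com/',
);
print "$_\n" for @search_ready;
```

    Letting URI do the resolution also copes with links such as "../other.html" or "/top.html", which plain string concatenation with a base would get wrong.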