Re^2: REGEX for url

by wrkrbeee (Scribe)
on Apr 25, 2016 at 20:52 UTC ( [id://1161479]=note )


in reply to Re: REGEX for url
in thread REGEX for url

Thank you for your help! That expression does not seem to match anything for me; perhaps I'm doing something else wrong? Below is a small amount of the code. Thanks again!

    $/="</html>";
    while (my $line = <$FH_IN>)
    {
    chomp $line; #removes line break or new line;
    my $url_sub = "";
    my $data="";
    $url_sub =~ s/.*a href="(.*)".*/$1/;
    print $url_sub;

Replies are listed 'Best First'.
Re^3: REGEX for url
by james28909 (Deacon) on Apr 25, 2016 at 20:57 UTC
    This works for me:
    use strict;
    use warnings;
    for(<DATA>){
        print if s/.*a href="(.*)".*/$1/;
    }
    __DATA__
    <td scope="row">9</td>
    <td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
    <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
    <td scope="row">EX-21.1</td>

    Output:

    C:\Users\James\Desktop\perlmonks>perlmonks.pl
    /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt

    EDIT: It seems that $/ = "</html>"; changes the input record separator in such a way that it completely breaks the simple regex. Do you have any links to documentation on $/ = "</html>"; ?
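For reference, $/ is the input record separator described in perlvar. With $/ = "</html>", each read returns everything up to and including "</html>", so one record spans many lines. Since "." does not match a newline without /s, the substitution rewrites only the first line that contains a href=" and leaves every other line of the record untouched, so printing the record looks as if nothing matched. A minimal sketch of one way around that, assuming a global match with a bounded capture is acceptable; extract_hrefs and the two-line sample are made up for illustration and are not from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every href value
# found in one record, however many lines the record spans.
sub extract_hrefs {
    my ($record) = @_;
    # [^"]* cannot run past the closing quote and also matches newlines,
    # so no /s is needed; /g in list context returns every capture.
    return $record =~ /<a\s+href="([^"]*)"/g;
}

# A made-up sample standing in for one "</html>"-terminated record.
my $html = qq{<td><a href="/one.txt">1</a></td>\n}
         . qq{<td><a href="/two.txt">2</a></td>\n</html>\n};

open my $fh, '<', \$html or die "cannot open in-memory file: $!";
{
    local $/ = "</html>";              # one record per document
    while (my $record = <$fh>) {
        chomp $record;                 # strips a trailing "</html>", not "\n"
        print "$_\n" for extract_hrefs($record);
    }
}
close $fh;
```

Note that chomp removes whatever $/ currently holds, so in the original snippet it strips "</html>" rather than a line break.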

      Not sure if this helps, but the full text block, from <html> through </html> appears below. Just using $/ as a way to indicate the end of a record. I apologize for wasting your time.

      <html>
      <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      <title>EDGAR Filing Documents for 0000927356-01-000365</title>
      <link rel="stylesheet" type="text/css" href="/include/interactive.css" />
      </head>
      <body style="margin: 0">
      <noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
      <!-- BEGIN BANNER -->
      <div id="headerTop">
         <div id="Nav"><a href="http://www.sec.gov/index.htm">Home</a> | <a href="/cgi-bin/browse-edgar?action=getcurrent">Latest Filings</a> | <a href="javascript:history.back()">Previous Page</a></div>
         <div id="seal"><a href="http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0" /></a></div>
         <div id="secWordGraphic"><img src="/images/bannerTitle.gif" alt="SEC Banner" /></div>
      </div>
      <div id="headerBottom">
         <div id="searchHome"><a href="/edgar/searchedgar/webusers.htm">Search the Next-Generation EDGAR System</a></div>
         <div id="PageTitle">Filing Detail</div>
      </div>
      <!-- END BANNER -->
      <!-- BEGIN BREADCRUMBS -->
      <div id="breadCrumbs">
         <ul>
            <li><a href="http://www.sec.gov/">SEC Home</a> &#187;</li>
            <li><a href="/edgar/searchedgar/webusers.htm">Search the Next-Generation EDGAR System</a> &#187;</li>
            <li><a href="/edgar/searchedgar/companysearch.html">Company Search</a> &#187;</li>
            <li class="last">Current Page</li>
         </ul>
      </div>
      <!-- END BREADCRUMBS -->
      <div id="contentDiv">
      <div id="formDiv">
      <!-- START FILING DIV -->
         <div id="formHeader">
            <div id="formName">
               <strong>Form 10-K</strong> - Annual report [Section 13 and 15(d), not S-K Item 405]
            </div>
            <div id="secNum">
               <strong><acronym title="Securities and Exchange Commission">SEC</acronym> Accession <acronym title="Number">No.</acronym></strong> 0000927356-01-000365
            </div>
         </div>
         <div class="formContent">
            <div class="formGrouping">
               <div class="infoHead">Filing Date</div>
               <div class="info">2001-03-30</div>
               <div class="infoHead">Accepted</div>
               <div class="info">1995-09-28 00:00:00</div>
               <div class="infoHead">Documents</div>
               <div class="info">10</div>
            </div>
            <div class="formGrouping">
               <div class="infoHead">Period of Report</div>
               <div class="info">2000-12-30</div>
            </div>
            <div style="clear:both"></div>
         </div>
      <!-- END FILING DIV -->
      <!-- START DOCUMENT DIV -->
         <div style="padding: 0px 0px 4px 0px; font-size: 12px; margin: 0px 2px 0px 5px; width: 100%; overflow:hidden">
            <p>Document Format Files</p>
            <table class="tableFile" summary="Document Format Files">
               <tr>
                  <th scope="col" style="width: 5%;"><acronym title="Sequence Number">Seq</acronym></th>
                  <th scope="col" style="width: 40%;">Description</th>
                  <th scope="col" style="width: 20%;">Document</th>
                  <th scope="col" style="width: 10%;">Type</th>
                  <th scope="col">Size</th>
               </tr>
               <tr>
                  <td scope="row">1</td>
                  <td scope="row">ANNUAL REPORT</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt">0001.txt</a></td>
                  <td scope="row">10-K</td>
                  <td scope="row">194594</td>
               </tr>
               <tr class="blueRow">
                  <td scope="row">2</td>
                  <td scope="row">EMPLOYMENT AGREEMENT</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt">0002.txt</a></td>
                  <td scope="row">EX-10.6</td>
                  <td scope="row">18708</td>
               </tr>
               <tr>
                  <td scope="row">3</td>
                  <td scope="row">CHANGE IN TERMS AGREEMENT</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt">0003.txt</a></td>
                  <td scope="row">EX-10.9</td>
                  <td scope="row">24380</td>
               </tr>
               <tr class="blueRow">
                  <td scope="row">4</td>
                  <td scope="row">FIRST AMENDMENT TO LEASE AGREEMENT</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt">0004.txt</a></td>
                  <td scope="row">EX-10.12</td>
                  <td scope="row">15945</td>
               </tr>
               <tr>
                  <td scope="row">5</td>
                  <td scope="row">THIRD AMENDMENT TO LEASE</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt">0005.txt</a></td>
                  <td scope="row">EX-10.19</td>
                  <td scope="row">3127</td>
               </tr>
               <tr class="blueRow">
                  <td scope="row">6</td>
                  <td scope="row">FOURTH AMENDMENT TO LEASE</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt">0006.txt</a></td>
                  <td scope="row">EX-10.20</td>
                  <td scope="row">3887</td>
               </tr>
               <tr>
                  <td scope="row">7</td>
                  <td scope="row">FIFTH AMENDMENT TO LEASE</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt">0007.txt</a></td>
                  <td scope="row">EX-10.21</td>
                  <td scope="row">3980</td>
               </tr>
               <tr class="blueRow">
                  <td scope="row">8</td>
                  <td scope="row">SIXTH AMENDMENT TO LEASE</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt">0008.txt</a></td>
                  <td scope="row">EX-10.22</td>
                  <td scope="row">4017</td>
               </tr>
               <tr>
                  <td scope="row">9</td>
                  <td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
                  <td scope="row">EX-21.1</td>
                  <td scope="row">700</td>
               </tr>
               <tr class="blueRow">
                  <td scope="row">10</td>
                  <td scope="row">CONSENT OF INDEPENDENT PUBLIC ACCOUNTANTS</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt">0010.txt</a></td>
                  <td scope="row">EX-23.1</td>
                  <td scope="row">346</td>
               </tr>
               <tr>
                  <td scope="row">&nbsp;</td>
                  <td scope="row">Complete submission text file</td>
                  <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt">0000927356-01-000365.txt</a></td>
                  <td scope="row">&nbsp;</td>
                  <td scope="row">272254</td>
               </tr>
            </table>
         </div>
      <!-- END DOCUMENT DIV -->
      </div>
      <!-- START FILER DIV -->
      <div id="filerDiv">
         <div class="mailer">Mailing Address
            <span class="mailerAddress">13751 S WADSWORTH PARK DR SUITE D-140</span>
            <span class="mailerAddress"> DRAPER UT 84020 </span>
         </div>
         <div class="mailer">Business Address
            <span class="mailerAddress">13751 S WADSWORTH PARK DR SUITE D-140</span>
            <span class="mailerAddress"> DRAPER UT 84020 </span>
            <span class="mailerAddress">8015728225</span>
         </div>
         <div class="companyInfo">
            <span class="companyName">1 800 CONTACTS INC (Filer)
             <acronym title="Central Index Key">CIK</acronym>: <a href="/cgi-bin/browse-edgar?CIK=0001050122&amp;action=getcompany">0001050122 (see all company filings)</a></span>
            <p class="identInfo"><acronym title="Internal Revenue Service Number">IRS No.</acronym>: <strong>870571643</strong> | State of Incorp.: <strong>DE</strong> | Fiscal Year End: <strong>1231</strong><br />Type: <strong>10-K</strong> | Act: <strong>34</strong> | File No.: <a href="/cgi-bin/browse-edgar?filenum=000-23633&amp;action=getcompany"><strong>000-23633</strong></a> | Film No.: <strong>1587687</strong><br /><acronym title="Standard Industrial Code">SIC</acronym>: <b><a href="/cgi-bin/browse-edgar?action=getcompany&amp;SIC=3827&amp;owner=include">3827</a></b> Optical Instruments &amp; Lenses<br />Assistant Director 10</p>
         </div>
         <div class="clear"></div>
      </div>
      <!-- END FILER DIV -->
      </div>
      </body>
      </html>
        I ran this code, essentially as suggested by james28909, against your data set. This approach has obvious flaws in terms of HTML structure, because there are hrefs that you don't care about. A module to parse this would be better.

        #!/usr/bin/perl
        use warnings;
        use strict;

        my $line;
        while (my $line = <DATA>)
        {
            (my $url) = $line =~ m/.*a href="(.*)".*/;
            next unless $url;
            print "$url\n";
        }

        =Prints
        javascript:history.back()
        http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0
        /edgar/searchedgar/webusers.htm
        http://www.sec.gov/
        /edgar/searchedgar/webusers.htm
        /edgar/searchedgar/companysearch.html
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0002.txt
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0003.txt
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0004.txt
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0005.txt
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0006.txt
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0007.txt
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0008.txt
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0010.txt
        /Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365.txt
        /cgi-bin/browse-edgar?CIK=0001050122&amp;action=getcompany
        /cgi-bin/browse-edgar?action=getcompany&amp;SIC=3827&amp;owner=include
        Process completed successfully
        =cut

        __DATA__
        <html>
        <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>EDGAR Filing Documents for 0000927356-01-000365</title>
        <link rel="stylesheet" type="text/css" href="/include/interactive.css" />
        </head>
        <body style="margin: 0">
        <noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
        <!-- BEGIN BANNER -->
        .... abbreviated to reduce space.....
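The stray entry in that output (the one that continues into the img tag) comes from the greedy (.*), which runs to the last double quote on the line, and the greedy leading .* means a line holding several links yields only its last one. A hedged sketch of a tighter regex-only variant follows; document_links is a hypothetical helper, and the /Archives/ filter is an assumption about which links are wanted. A real HTML parser module, for example HTML::LinkExtor, would likely remain the sturdier choice.

```perl
use strict;
use warnings;

# Two lines taken from the posted page: the seal line that produced the
# stray match, and one document row.
my @lines = (
    '<div id="seal"><a href="http://www.sec.gov/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0" /></a></div>',
    '<td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0001.txt">0001.txt</a></td>',
);

# Hypothetical helper (not from the thread): collect every href on every
# line, then keep only the ones that look like filing documents.
sub document_links {
    my @found;
    for my $line (@_) {
        # [^"]* stops at the first closing quote; /g picks up every link
        # on the line instead of only the last one.
        push @found, $line =~ /<a\s+href="([^"]*)"/g;
    }
    # Assumption: only links under /Archives/ are of interest.
    return grep { m{^/Archives/} } @found;
}

print "$_\n" for document_links(@lines);
```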
      Any possibilities for why that would not work on my end? Maybe something that a rookie would do that an expert would not, or vice versa? Thank you for your time!
        This version is a little more robust - it works in both cases - with or without setting "$/".

        It can also handle multiple URLs.

        use strict;
        use warnings;
        $/="</html>";
        for(<DATA>){
            print"$1\n" while /a href="(.*)"/g;
        }
        __DATA__
        <td scope="row">9</td>
        <td scope="row">SUBSIDIARIES OF THE REGISTRANT</td>
        <td scope="row"><a href="/Archives/edgar/data/1050122/000092735601000365/0000927356-01-000365-0009.txt">0009.txt</a></td>
        <td scope="row">EX-21.1</td>
        <td scope="row"><a href="/Another/URL/here.html">0009.txt</a></td>

                This is not an optical illusion, it just looks like one.

        Not able to check it on my machine, but wouldn't a /s be helpful here to be able to pass over the newlines?

        print if s/.*a href="(.*)".*/$1/s;
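On the /s question: with /s the dot also matches newlines, so both greedy .* parts can cross line boundaries. On a multi-line record the capture then runs from the last a href=" to the last double quote anywhere in the record, which is usually not the URL. A small sketch to compare the two behaviours, using a made-up three-line record (the sample data and variable names are illustrative assumptions):

```perl
use strict;
use warnings;

# Made-up record with two links and a later, unrelated quoted attribute.
my $record = qq{<a href="/first.txt">1</a>\n}
           . qq{<a href="/second.txt">2</a>\n}
           . qq{<div class="clear"></div>\n};

# Greedy pattern with /s, as proposed above, applied to a copy of the record.
(my $greedy = $record) =~ s/.*a href="(.*)".*/$1/s;

# Bounded capture: stops at the first closing quote, needs no /s,
# and /g in list context returns every link.
my @bounded = $record =~ /a href="([^"]*)"/g;

print "greedy result:  $greedy\n";
print "bounded result: $_\n" for @bounded;
```

Running it shows the greedy result swallowing markup after the second link, while the bounded capture returns each URL on its own.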
