Extract list of URLs in a web page : Web Page « Network « Python






Extract list of URLs in a web page

Extract list of URLs in a web page

"""Extract list of URLs in a web page

This program is part of "Dive Into Python", a free Python book for
experienced programmers.  Visit http://diveintopython.org/ for the
latest version.
"""

__author__ = "Mark Pilgrim (mark@diveintopython.org)"
__version__ = "$Revision: 1.2 $"
__date__ = "$Date: 2004/05/05 21:57:19 $"
__copyright__ = "Copyright (c) 2001 Mark Pilgrim"
__license__ = "Python"

from sgmllib import SGMLParser

class URLLister(SGMLParser):
  def reset(self):
    SGMLParser.reset(self)
    self.urls = []

  def start_a(self, attrs):
    href = [v for k, v in attrs if k=='href']
    if href:
      self.urls.extend(href)

if __name__ == "__main__":
  import urllib
  usock = urllib.urlopen("http://diveintopython.org/")
  parser = URLLister()
  parser.feed(usock.read())
  parser.close()
  usock.close()
  for url in parser.urls: print url

           
       








Related examples in the same category

1.Obtain Web Page Information
2.Download from a websiteDownload from a website