Extract list of URLs in a web page






"""
This program is part of "Dive Into Python", a free Python book for
experienced programmers.  Visit http://diveintopython.org/ for the
latest version.
"""

__author__ = "Mark Pilgrim (mark@diveintopython.org)"
__version__ = "$Revision: 1.2 $"
__date__ = "$Date: 2004/05/05 21:57:19 $"
__copyright__ = "Copyright (c) 2001 Mark Pilgrim"
__license__ = "Python"

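# Note: sgmllib and urllib.urlopen exist only in Python 2; this program
# does not run unmodified on Python 3.
from sgmllib import SGMLParser

class URLLister(SGMLParser):
  """Collect the href value of every <a> tag fed to the parser."""

  def reset(self):
    # SGMLParser.__init__ calls reset(), so the list is created here
    SGMLParser.reset(self)
    self.urls = []

  def start_a(self, attrs):
    # Called for each <a> start tag; attrs is a list of (name, value) pairs
    href = [v for k, v in attrs if k=='href']
    if href:
      self.urls.extend(href)

import urllib

# Fetch the page, feed it to the parser, then release both resources
usock = urllib.urlopen("http://diveintopython.org/")
parser = URLLister()
parser.feed(usock.read())
parser.close()
usock.close()

# Print every link target that was found
for url in parser.urls:
  print url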
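
The listing above depends on sgmllib, which was removed in Python 3. As a rough sketch of the same technique on Python 3, the standard-library html.parser module can stand in for SGMLParser. This is a swapped-in approach, not the original author's code. The helper name extract_urls and the sample HTML string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def reset(self):
        # HTMLParser.__init__ calls reset(), so the list is created here,
        # mirroring the SGMLParser-based version.
        super().reset()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # html.parser has no per-tag start_a hook; every start tag arrives
        # here with its name and attribute names already lowercased.
        if tag == "a":
            self.urls.extend(
                v for k, v in attrs if k == "href" and v is not None
            )


def extract_urls(html):
    """Hypothetical helper: return all link targets found in an HTML string."""
    parser = URLLister()
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    # Illustrative input only; not taken from the original page.
    sample = (
        '<p><a href="/toc/">Contents</a>, '
        '<a name="top">an anchor without href</a>, '
        '<a href="http://example.com/">an absolute link</a></p>'
    )
    for url in extract_urls(sample):
        print(url)
```

To run this against a live page rather than a string, the HTML could be fetched with urllib.request.urlopen(...).read() and decoded to text before being passed to extract_urls.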







