Beautiful Soup is a Python HTML/XML parser designed for quick turnaround
projects like screen-scraping. Three features make it powerful:

1. Beautiful Soup won't choke if you give it bad markup. It yields a
parse tree that makes approximately as much sense as your original
document. This is usually good enough to collect the data you need
and run away.

2. Beautiful Soup provides a few simple methods and Pythonic idioms for
navigating, searching, and modifying a parse tree: a toolkit for
dissecting a document and extracting what you need. You don't have to
create a custom parser for each application.

3. Beautiful Soup automatically converts incoming documents to Unicode and
outgoing documents to UTF-8. You don't have to think about encodings,
unless the document doesn't specify an encoding and Beautiful Soup
can't autodetect one. Then you just have to specify the original
encoding.

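As a rough illustration of the three features above, here is a minimal
sketch. It assumes the `bs4` package (Beautiful Soup 4) and Python's built-in
`html.parser`; older releases use different import and argument names. The
sample markup and the Latin-1 byte string are made up for the example.

```python
from bs4 import BeautifulSoup

# 1. Bad markup: <b> and <p> are never closed, yet parsing still succeeds.
broken = "<p>Hello <b>world"
soup = BeautifulSoup(broken, "html.parser")
bold_text = soup.b.get_text()

# 2. Navigating and modifying: reach a tag by name, then replace its text.
soup.b.string = "there"
modified_text = soup.p.get_text()

# 3. Encodings: these bytes carry no encoding declaration, so we state the
#    original encoding ourselves (from_encoding is the bs4 argument name).
raw = "<p>caf\u00e9</p>".encode("latin-1")
latin_soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
decoded = latin_soup.p.get_text()        # text is Unicode inside the tree
utf8_bytes = latin_soup.encode("utf-8")  # serialized output as UTF-8 bytes
```

Passing the encoding explicitly is only needed in the case described in item
3, when the document doesn't declare one and autodetection fails.
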
Beautiful Soup parses anything you give it, and does the tree traversal
stuff for you. You can tell it "Find all the links", or "Find all the links
of class externalLink", or "Find all the links whose URLs match 'foo.com'",
or "Find the table heading that's got bold text, then give me that text."

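The four requests quoted above might look roughly like this in code. This is
a sketch, not the only way to phrase them: it assumes the `bs4` package with
the `find_all` / `find` method names, and the HTML snippet, URLs, and variable
names are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Plain</th><th><b>Price</b></th></tr>
</table>
<a href="http://foo.com/a" class="externalLink">A</a>
<a href="/local">B</a>
<a href="http://example.org/c" class="externalLink">C</a>
"""
soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = soup.find_all("a")

# "Find all the links of class externalLink"
external = soup.find_all("a", class_="externalLink")

# "Find all the links whose URLs match 'foo.com'"
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table heading that's got bold text, then give me that text"
bold_heading = soup.find(lambda tag: tag.name == "th" and tag.find("b") is not None)
heading_text = bold_heading.get_text()
```
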
Valuable data that was once locked up in poorly-designed websites is now
within your reach. Projects that would have taken hours take only minutes
with Beautiful Soup.