mirror of
https://github.com/Ponce/slackbuilds
synced 2024-11-24 10:02:29 +01:00
28 lines
1.3 KiB
Text
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround
projects like screen-scraping. Three features make it powerful:

1. Beautiful Soup won't choke if you give it bad markup. It yields a
parse tree that makes approximately as much sense as your original
document. This is usually good enough to collect the data you need
and run away.

2. Beautiful Soup provides a few simple methods and Pythonic idioms for
navigating, searching, and modifying a parse tree: a toolkit for
dissecting a document and extracting what you need. You don't have to
create a custom parser for each application.

3. Beautiful Soup automatically converts incoming documents to Unicode and
outgoing documents to UTF-8. You don't have to think about encodings,
unless the document doesn't specify an encoding and Beautiful Soup
can't autodetect one. Then you just have to specify the original
encoding.
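
The three features above can be sketched in a few lines. This is only an
illustration under assumptions: it uses Beautiful Soup 4 (the `bs4` package)
with Python's built-in `html.parser` backend, which may not be the release
this README packages, and the sample markup is invented.

```python
from bs4 import BeautifulSoup

# 1. Bad markup (hypothetical sample): <b> is never closed and </i> has
#    no matching opening tag. A usable parse tree is still produced.
broken = "<p>Price: <b>42</p></i>"
soup = BeautifulSoup(broken, "html.parser")
price = soup.b.get_text()

# 2. Navigate and modify the tree with attribute access and simple methods.
parent_name = soup.b.parent.name      # name of the tag that contains <b>
soup.b.string.replace_with("43")      # swap the text inside <b>
updated = soup.p.get_text()

# 3. Encodings: the input is Latin-1 bytes. The source encoding is named
#    explicitly here, which is the fallback when it cannot be detected.
raw = "<p>caf\u00e9</p>".encode("latin-1")
doc = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
word = doc.p.get_text()               # a Unicode str
utf8_bytes = doc.p.encode("utf-8")    # serialized back out as UTF-8 bytes

print(price, parent_name, updated, word, utf8_bytes)
```

Other parser backends may repair the same broken input differently, so the
exact tree shape should be treated as backend-dependent.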

Beautiful Soup parses anything you give it, and does the tree traversal
stuff for you. You can tell it "Find all the links", or "Find all the links
of class externalLink", or "Find all the links whose URLs match 'foo.com'",
or "Find the table heading that's got bold text, then give me that text."
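
Each of the requests quoted above maps onto a short query. The sketch below
assumes Beautiful Soup 4 (the `bs4` package) and the built-in `html.parser`
backend. The sample page and the variable names are hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page, invented for illustration.
html = """
<html><body>
<a href="http://foo.com/a" class="externalLink">Foo</a>
<a href="/local">Local</a>
<a href="http://bar.org/b" class="externalLink">Bar</a>
<table><tr><th>Plain</th><th><b>Bold heading</b></th></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = soup.find_all("a")

# "Find all the links of class externalLink"
external_links = soup.find_all("a", class_="externalLink")

# "Find all the links whose URLs match 'foo.com'"
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table heading that's got bold text, then give me that text"
bold_heading_text = next(
    th.b.get_text() for th in soup.find_all("th") if th.b is not None
)

print(len(all_links), len(external_links), len(foo_links), bold_heading_text)
```

Every query returns ordinary Python objects (lists of tags, strings), so the
results can be passed straight to the rest of a script.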

Valuable data that was once locked up in poorly designed websites is now
within your reach. Projects that would have taken hours take only minutes
with Beautiful Soup.