python-beautifulsoup - HTML/XML Parser for Quick-Turnaround Applications Like Screen-Scraping

Distribution: openSUSE 42.2
Repository: openSUSE Network Utilities all
Package name: python-beautifulsoup
Package version: 3.2.1
Package release: 23.2
Package architecture: noarch
Package type: rpm
Installed size: 223.82 KB
Download size: 57.00 KB
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  * Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  * Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
  * Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "", or "Find the table heading that's got bold text, then give me that text."

Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.
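
The queries quoted above might look roughly like the following. This is a minimal sketch, not taken from the package documentation: the sample markup, the example.com/example.org URLs and the variable names are invented for illustration, and the fallback import of Beautiful Soup 4 is an assumption added so the snippet also runs where only the newer `bs4` module is installed.

```python
import re

try:
    # Beautiful Soup 3.x, the module this package installs (Python 2).
    from BeautifulSoup import BeautifulSoup
except ImportError:
    # Assumption: fall back to Beautiful Soup 4, which still accepts findAll().
    from bs4 import BeautifulSoup

# Hypothetical sample page; the URLs are placeholders.
HTML = """
<html><body>
<table><tr><th>Plain</th><th><b>Name</b></th></tr></table>
<a href="http://example.com/a" class="externalLink">A</a>
<a href="/local/page">Local</a>
<a href="http://example.org/b" class="externalLink">B</a>
</body></html>
"""

soup = BeautifulSoup(HTML)

# "Find all the links"
all_links = [a["href"] for a in soup.findAll("a")]

# "Find all the links of class externalLink"
external = [a["href"] for a in soup.findAll("a", {"class": "externalLink"})]

# "Find all the links whose urls match ..." (the pattern is illustrative)
org_links = [a["href"] for a in soup.findAll("a", href=re.compile(r"example\.org"))]

# "Find the table heading that's got bold text, then give me that text"
bold_headings = [th.b.string for th in soup.findAll("th") if th.b]

print(all_links)
print(external)
print(org_links)
print(bold_headings)
```

Each query returns ordinary Python lists, so the results can be filtered or joined with the usual list tools.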



  • python-beautifulsoup = 3.2.1-23.2

    Install Howto

    1. Add the openSUSE Network Utilities repository:
      # zypper addrepo opensuse-network-utilities
    2. Install python-beautifulsoup rpm package:
      # zypper install python-beautifulsoup


    • /usr/lib/python2.7/site-packages/BeautifulSoup-3.2.1-py2.7.egg-info
    • /usr/lib/python2.7/site-packages/
    • /usr/lib/python2.7/site-packages/BeautifulSoup.pyc
    • /usr/lib/python2.7/site-packages/
    • /usr/lib/python2.7/site-packages/BeautifulSoupTests.pyc


    2013-07-15
    - Use upstream URL
    - Run testsuite

    2013-02-11
    - Spec file cleanup, should fix 12.1 build

    2012-02-21
    - Update to 3.2.1
      * Substitute XML entities for angle brackets and bare ampersands within strings, not just within attribute values. This prevents a possible cross-site scripting attack when Beautiful Soup is used to sanitize HTML. (

    2011-12-09
    - fix license to be in format

    2011-11-25
    - Update to 3.2.0
      - Gave the stable series a higher version number than the unstable series, to make it very clear which series most people should be using.
      - When creating a Tag object, you can specify its attributes as a dict rather than as a list of 2-tuples.

    2010-07-06
    - fix dates in changelog

    2010-04-10
    - Update to;
    - Spec file cleaned with spec-cleaner.

    2010-01-08
    - Update to 3.0.8;
    - Building as noarch for openSUSE >= 11.2.

    2008-12-09
    - Update to 3.0.7a
      - Release 3.0.7a (2008/07/03)
        - Added an import that makes BS work in Python 2.3.
      - Release 3.0.7 (2008/06/22)
        - Fixed a UnicodeDecodeError when unpickling documents that contain non-ASCII characters.
        - Fixed a TypeError that occurred in some circumstances when a tag contained no text.
        - Jump through hoops to avoid the use of chardet, which can be slow in some circumstances. UTF-8 documents should never trigger the use of chardet.
        - Whitespace is preserved inside <pre> and <textarea> tags that contain nothing but whitespace.
        - Beautiful Soup can now parse a doctype that's scoped to an XML namespace.
    - Update to 3.0.6
      - Release 3.0.6 (2008/04/26)
        - Added a Tag.decompose() method to disconnect a tree or subset, breaking it into bite-sized pieces for the garbage collector to collect.
        - Got rid of a very old debug line that prevented chardet from working.
        - Tag.extract() now returns the tag that was extracted.
        - Tag.findNext() now does something with the keyword arguments you pass it instead of dropping them on the floor.
        - Fixed a Unicode conversion bug.
        - Fixed a bug that garbled some tags when rewriting them.

    2007-12-18
    - Update to 3.0.5:
      - Beautiful Soup is now licensed under a BSD-style license
      - Soup objects can now be pickled, and copied with copy.deepcopy
      - Tag.append now works properly on existing BS objects. (It wasn't originally intended for outside use, but it can be now.) (Giles Radford)
      - Passing in a nonexistent encoding will no longer crash the parser on Python 2.4 (John Nagle)
      - Fixed an underlying bug in SGMLParser that thinks ASCII has 255 characters instead of 127 (John Nagle)
      - Entities are converted more consistently to Unicode characters
      - Entity references in attribute values are now converted to Unicode characters when appropriate. Numeric entities are always converted, because SGMLParser always converts them outside of attribute values
      - ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to XHTML_ENTITIES
      - The regular expression for bare ampersands was too loose. In some cases ampersands were not being escaped. (Sam Ruby?)
      - Non-breaking spaces and other special Unicode space characters are no longer folded to ASCII spaces. (Robert Leftwich)
      - Information inside a TEXTAREA tag is now parsed literally, not as HTML tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)
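
The 3.2.1 and 3.0.5 entries above both concern escaping angle brackets and bare ampersands on output. The idea can be sketched with the standard library alone. This is an illustration of the technique under stated assumptions, not the package's own code: the names `BARE_AMPERSAND` and `escape_text` are hypothetical, and the pattern treats an ampersand as "bare" when it is not followed by something shaped like a named, decimal, or hexadecimal entity reference.

```python
import re

# An ampersand is "bare" if it does not already start an entity reference
# such as &amp; (named), &#38; (decimal) or &#x26; (hexadecimal).
BARE_AMPERSAND = re.compile(r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|\w+;)")


def escape_text(text):
    """Return text with bare ampersands and angle brackets replaced by entities."""
    # Escape ampersands first so existing entity references are left untouched.
    text = BARE_AMPERSAND.sub("&amp;", text)
    # Then neutralise angle brackets so the text cannot open or close a tag.
    return text.replace("<", "&lt;").replace(">", "&gt;")


print(escape_text("<script>alert(1)</script> & more"))
print(escape_text("AT&amp;T &#38; &#x26;"))
```

Escaping only attribute values, as the releases before 3.2.1 did according to the changelog, would leave the first input above intact as markup, which is why the fix matters when output is used to sanitize HTML.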