Sudeep's Blog

Disorganized Thoughts in Organized Manner

BeautifulSoup, HTML Parsing and Google App Engine
HTML parsing is always a pain, especially when the document is badly formed. The solution: BeautifulSoup. BS can traverse badly formed HTML, find all child tags of a DOM element, and filter them by id or class. It took me a long time to get BS working on Google App Engine. I finally ended up adding the single BeautifulSoup.py file from bs3 (the current version, however, is bs4). It has worked correctly so far, apart from a few differences from the current version; e.g. bs4 uses find_all(tagName) to search the DOM, whereas bs3 uses findAll(tagName).
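To keep the same code working with either version, one option is a small compatibility shim. This is only a rough sketch, not something from the BS docs: the helper names load_soup_class and find_all_compat are made up for illustration, and it assumes that either the bs4 package or the single-file bs3 module BeautifulSoup.py is importable.

```python
def load_soup_class():
    """Return the BeautifulSoup class from bs4 if installed, else from bs3.

    Hypothetical helper: the import is done lazily so that this module
    can be loaded even when neither version is available.
    """
    try:
        from bs4 import BeautifulSoup            # bs4 package
    except ImportError:
        from BeautifulSoup import BeautifulSoup  # bs3 single-file module
    return BeautifulSoup


def find_all_compat(node, *args, **kwargs):
    """Search below `node` with find_all (bs4) or findAll (bs3).

    Hypothetical helper. The callable() check matters: as far as I can
    tell, bs3 treats an unknown attribute as a child-tag lookup, so
    node.find_all may come back as None instead of raising an error.
    """
    finder = getattr(node, "find_all", None)
    if not callable(finder):
        finder = node.findAll  # bs3 spelling
    return finder(*args, **kwargs)
```

Usage would look something like soup = load_soup_class()(html) followed by find_all_compat(soup, "a"), where html is a string holding the page source.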