lxml.html: hidden gems
- cssselect
- uses CSS selector syntax to find and pick out HTML elements.
- iterlinks
- returns a generator over every linky attribute on the page (href, src, action and friends).
Remember: ads have lots of links.
- sourceline
- gives the line number in the source where your element starts.
Exists in both lxml.html and lxml.etree.
- find, findall
- can locate html elements within another node or a page.
Exists in both lxml.html and lxml.etree.
- descendants/children/siblings/ancestors
- all elements have iterchildren, itersiblings, iterancestors and iterdescendants.
- forms
- can find all (normal) forms on a page.
Beware of CAPTCHAs and the like.
- text, text_content, and itertext
- ways to get content without tags.
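
A minimal sketch of the gems above, run against a made-up page. The PAGE markup and all variable names are assumptions for illustration only. Note that cssselect needs the separate cssselect package installed, so the sketch falls back to an equivalent XPath query if it is missing.

```python
import lxml.html

# Hypothetical page, invented for this example.
PAGE = """<html>
<body>
  <div id="main">
    <h1 class="title">Hello</h1>
    <p>Read the <a href="/docs">docs</a> now.</p>
  </div>
  <form action="/search" method="get">
    <input type="text" name="q" value="lxml">
  </form>
</body>
</html>"""

doc = lxml.html.fromstring(PAGE)

# cssselect: CSS selectors instead of XPath.
try:
    headline = doc.cssselect("div#main h1.title")[0]
except ImportError:
    # The cssselect package is not installed; same query in XPath.
    headline = doc.xpath("//div[@id='main']//h1[@class='title']")[0]

# iterlinks: yields (element, attribute, link, pos) tuples.
links = [link for _element, _attribute, link, _pos in doc.iterlinks()]

# sourceline: line number of the element in the parsed source.
headline_line = headline.sourceline

# find / findall: first match and all matches below a node.
first_h1 = doc.find(".//h1")
anchors = doc.findall(".//a")

# Traversal iterators.
main = doc.find(".//div")
children = [el.tag for el in main.iterchildren()]
descendants = [el.tag for el in main.iterdescendants()]
siblings = [el.tag for el in headline.itersiblings()]
ancestors = [el.tag for el in headline.iterancestors()]

# forms: lxml.html only.
form = doc.forms[0]
query_value = form.fields["q"]

# text vs. text_content() vs. itertext().
para = doc.find(".//p")
leading_text = para.text            # text before the first child tag
all_text = para.text_content()      # all text with tags stripped (lxml.html only)
text_pieces = list(para.itertext()) # each text chunk in document order
```

Here iterlinks also picks up the form's action attribute, which is one reason link counts on real pages run higher than the visible anchors suggest.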
If you have to parse in real time, lxml is sometimes too much.
- re
- html == strings == parseable.
- feedparser
- feeds (RSS, Atom) are XML with rules; feedparser knows them.
- HTMLParser (html.parser in Python 3)
- good base class for your own HTML parser.
Good for “I have an idea about how I want to handle embed tags”.
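
A rough sketch of the two standard-library options above, assuming Python 3 (where the module is html.parser). The SNIPPET markup, the HREF_RE pattern and the EmbedCollector class are hypothetical names made up for this example. The regex is deliberately narrow: it only handles double-quoted href values and would miss plenty of real-world HTML.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment, invented for this example.
SNIPPET = (
    '<p>Watch: <embed src="/clip.swf" width="400"> '
    'or <a href="/about">read</a>.</p>'
)

# re: treat the page as a plain string and grab double-quoted href values.
HREF_RE = re.compile(r'href="([^"]+)"')
hrefs = HREF_RE.findall(SNIPPET)


class EmbedCollector(HTMLParser):
    """Collects the attributes of every <embed> tag it sees."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "embed":
            self.embeds.append(dict(attrs))


collector = EmbedCollector()
collector.feed(SNIPPET)
collector.close()
embeds = collector.embeds
```

The subclass only overrides the one hook it cares about, which is the appeal here: no tree is built, so there is little overhead per page.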
Content is 1/2 of the equation.
I’m tired of ugly pages with badass content.
Note
Text = Content = Boss