===================================================================================================
Y'all Wanna Scrape with Us? Content Ain't a Thing : Web Scraping With Our Favorite Python Libraries
===================================================================================================

Presented by Katharine Jarmul
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I got 99 problems but content ain't one
---------------------------------------

* Everyone needs good content.
* Good content exists all over the web.
* Scrape it 'til you make it.

LXML: Diving in
---------------

``lxml.etree`` vs. ``lxml.html``
________________________________

* ``etree``: best for properly formatted XML/XHTML
* ``etree``: powerful and fast for SOAP or other XML-formatted content
* ``html``: best for web sites & irregular content

``lxml.html``: hidden gems
__________________________

``cssselect``
    Uses CSS selector syntax to find HTML elements.

``iterlinks``
    Creates a generator of all **linky** elements on the page.
    Remember: ads have lots of links.

``sourceline``
    Gives the source line where your element sits on the page.
    Exists in both ``lxml.html`` and ``lxml.etree``.

``find``, ``findall``
    Locate HTML elements within another node or a page.
    Exists in both ``lxml.html`` and ``lxml.etree``.

descendants/children/siblings/ancestors
    All elements have ``iterchildren``, ``itersiblings``,
    ``iterancestors`` and ``iterdescendants``.

forms
    Finds all (normal) forms on a page. Beware of CAPTCHAs and the like.

``text``, ``text_content``, and ``itertext``
    Ways to get content without tags.

If you have to parse in realtime, LXML is sometimes too much.

``re``
    HTML == strings == parseable.

``feedparser``
    Standard XML has rules; feedparser knows them.

``HTMLParser``
    Good base class for your own HTML parser. Good for "I have an idea
    about how I want to handle ``embed`` tags".

Content is 1/2 of the equation. I'm tired of ugly pages with badass content.

.. note:: Text = Content = Boss
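
The "good base class for your own HTML parser" idea can be sketched with the standard library alone. This is a minimal illustration, assuming Python 3 (where the class lives in ``html.parser``); the ``EmbedCollector`` class, the ``scrape`` helper, and the sample markup are hypothetical names made up for this example, not part of the talk.

```python
from html.parser import HTMLParser


class EmbedCollector(HTMLParser):
    """Hypothetical parser: keep the text, remember every embed tag's src."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.embeds = []      # src attribute of each <embed> encountered
        self.text_parts = []  # page text with the tags stripped away

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "embed":
            self.embeds.append(dict(attrs).get("src"))

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        if data.strip():
            self.text_parts.append(data.strip())


def scrape(html):
    """Return (list of embed sources, tag-free text) for an HTML string."""
    parser = EmbedCollector()
    parser.feed(html)
    parser.close()
    return parser.embeds, " ".join(parser.text_parts)


# Made-up sample page for illustration only.
SAMPLE = (
    "<div><p>Badass <b>content</b></p>"
    '<embed src="http://example.com/video.swf"></div>'
)
embeds, text = scrape(SAMPLE)
print(embeds)
print(text)
```

The design choice here is to decide what happens to ``embed`` tags in one small method while everything else falls through untouched, which is about as light as a parser gets when LXML is more than you need.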