Y’all Wanna Scrape with Us? Content Ain’t a Thing: Web Scraping With Our Favorite Python Libraries

Presented by Katharine Jarmul

I got 99 problems but content ain’t one

  • Everyone needs good content.
  • Good content exists all over the web.
  • Scrape it ‘til you make it.

LXML: Diving in

lxml.etree vs. lxml.html

  • etree: best for properly formatted xml/xhtml
  • etree: powerful and fast for SOAP or other xml-formatted content
  • html: best for web sites & irregular content
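
A rough sketch of the difference, assuming lxml is installed. The sample strings and variable names are made up for illustration: etree handles the well-formed XML but rejects the sloppy markup, while lxml.html repairs it into a usable tree.

```python
from lxml import etree, html

# Well-formed XML: etree parses it strictly and quickly.
xml_doc = b"<feed><entry id='1'>First</entry><entry id='2'>Second</entry></feed>"
root = etree.fromstring(xml_doc)
entry_ids = [entry.get("id") for entry in root.findall("entry")]

# Irregular HTML (unclosed <p> and <b> tags): the strict XML parser refuses it...
messy = "<div><p>Unclosed paragraph<p>Another <b>one</div>"
try:
    etree.fromstring(messy)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# ...while lxml.html closes the dangling tags for us.
fragment = html.fromstring(messy)
paragraphs = [p.text_content() for p in fragment.findall(".//p")]
```

If the source might be messy, and web pages usually are, lxml.html is the safer default.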

lxml.html: hidden gems

cssselect
uses css selector syntax to find and select html elements.
iterlinks
creates a generator of all linky elements on the page. Remember: ads have lots of links.
sourceline
can identify the location of your element on the page. Exists in both lxml.html and lxml.etree.
find, findall
can locate html elements within another node or a page. Exists in both lxml.html and lxml.etree.
descendants/children/siblings/ancestors
all elements have iterchildren, itersiblings, iterancestors and iterdescendants.
forms
can find all (normal) forms on a page. beware of CAPTCHAs and the like.
text, text_content, and itertext
ways to get content without tags.
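
The gems above, tried out on a tiny page. This is a sketch, not a recipe: the PAGE markup, URLs, and variable names are invented for illustration, and cssselect is wrapped in a try/except on the assumption that the separate cssselect package may not be installed.

```python
from lxml import html

PAGE = """<html>
<body>
<div id="main">
<h1 class="title">Good <em>content</em></h1>
<p>Read <a href="/more">more</a> or <a href="http://example.com/ad">an ad</a>.</p>
<form action="/search"><input type="text" name="q" value="lxml"></form>
</div>
</body>
</html>"""

doc = html.fromstring(PAGE)

# find / findall: locate elements below any node
h1 = doc.find(".//h1")
anchors = doc.findall(".//a")

# iterlinks: yields (element, attribute, link, position) for every link-like attribute
links = [link for el, attr, link, pos in doc.iterlinks() if el.tag == "a"]

# sourceline: line number of the element in the parsed source
h1_line = h1.sourceline

# iterchildren / iterancestors: walk the tree in either direction
main = doc.get_element_by_id("main")
child_tags = [child.tag for child in main.iterchildren()]
ancestor_tags = [anc.tag for anc in h1.iterancestors()]

# forms: every <form> on the page, with its fields
search_form = doc.forms[0]
query_value = search_form.fields["q"]

# text vs. text_content() vs. itertext(): content without tags
own_text = h1.text                 # text before the first child only
all_text = h1.text_content()       # all text, children included
text_parts = list(h1.itertext())   # the same text, piece by piece

# cssselect: CSS selectors instead of XPath (needs the cssselect package)
try:
    titles = [el.text_content() for el in doc.cssselect("h1.title")]
except ImportError:
    titles = None
```

Note that iterlinks also reports the form's action attribute, which is why the example filters on a tags.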

If you have to parse in realtime, LXML is sometimes too much.

re
html == strings == parseable.
feedparser
standard XML has rules, feedparser knows them.
HTMLParser
good base class for your own HTML parser. good for “I have an idea about how I want to handle embed tags”.
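
A minimal standard-library sketch of the re and HTMLParser options, assuming Python 3 (where the class lives in html.parser). The SNIPPET string and the EmbedCollector class are hypothetical, made up to show the "I have an idea about how I want to handle embed tags" case.

```python
import re
from html.parser import HTMLParser

SNIPPET = '<p>Watch: <embed src="/video.swf" width="400"> then <a href="/next">next</a></p>'

# re: html is just a string, so a narrow pattern works for quick jobs
hrefs = re.findall(r'href="([^"]+)"', SNIPPET)


class EmbedCollector(HTMLParser):
    """Hypothetical parser that records the src of every <embed> tag."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "embed":
            self.embeds.append(dict(attrs).get("src"))


collector = EmbedCollector()
collector.feed(SNIPPET)
collector.close()
embeds = collector.embeds
```

The regex is fine for one known pattern; once you care about tag structure, subclassing the parser tends to hold up better.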

Content is 1/2 of the equation.

I’m tired of ugly pages with badass content.

Note

Text = Content = Boss