Y’all Wanna Scrape with Us? Content Ain’t a Thing: Web Scraping With Our Favorite Python Libraries

Presented by Katharine Jarmul

I got 99 problems but content ain’t one

Everyone needs good content.
Good content exists all over the web.
Scrape it ‘til you make it.

LXML: Diving in

`lxml.etree` vs. `lxml.html`

etree: best for properly formatted xml/xhtml
etree: powerful and fast for SOAP or other xml-formatted content
html: best for web sites & irregular content
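
To make the split concrete, here is a minimal sketch (the sample markup is made up for illustration): `lxml.etree` handles well-formed XML happily but rejects sloppy markup, while `lxml.html` repairs the same sloppy markup into a usable tree.

```python
from lxml import etree, html

# Well-formed XML: lxml.etree parses it strictly and quickly.
xml_doc = "<feed><entry id='1'>First</entry><entry id='2'>Second</entry></feed>"
root = etree.fromstring(xml_doc)
entry_ids = [entry.get("id") for entry in root.findall("entry")]

# Irregular markup (two unclosed <p> tags), the kind real sites serve.
messy = "<div><p>First<p>Second</div>"

# The strict XML parser refuses it...
try:
    etree.fromstring(messy)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# ...while lxml.html closes the dangling tags for us.
fragment = html.fromstring(messy)
paragraphs = [p.text for p in fragment.findall(".//p")]
```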

`lxml.html`: hidden gems

cssselect

uses CSS selector syntax to find matching html elements.

iterlinks

creates a generator over every link on the page, yielding (element, attribute, link, pos) tuples. Remember: ads have lots of links.
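
A minimal sketch of `iterlinks` on an invented fragment. The "ads." substring filter is only an illustrative assumption, not a real ad-detection rule.

```python
from lxml import html

page = html.fromstring("""
<div>
  <a href="/articles/1">Article</a>
  <img src="/static/logo.png">
  <a href="http://ads.example.com/click?id=42">Ad</a>
</div>
""")

# iterlinks() yields (element, attribute, link, pos) for anything link-like,
# not just <a href>: images, scripts, stylesheets and so on.
links = [(el.tag, attr, url) for el, attr, url, pos in page.iterlinks()]

# Keep only anchors that do not point at the (hypothetical) ad host.
article_links = [url for tag, attr, url in links
                 if tag == "a" and "ads." not in url]
```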

sourceline

gives the line number of your element in the page source. Exists in both lxml.html and lxml.etree.
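
A quick sketch of `sourceline` with both parsers, using made-up input. Handy for logging where in the raw source a scraped element came from.

```python
from lxml import etree, html

# XML: one element per line, so sourceline is easy to follow.
xml_root = etree.fromstring("<root>\n  <item>a</item>\n  <item>b</item>\n</root>")
xml_lines = [(el.tag, el.sourceline) for el in xml_root.iter()]
# -> [('root', 1), ('item', 2), ('item', 3)]

# HTML: the same attribute is available on lxml.html elements.
page = html.fromstring("<html>\n<body>\n<p>one</p>\n<p>two</p>\n</body>\n</html>")
html_lines = [(p.text, p.sourceline) for p in page.iter("p")]
```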

find, findall

can locate html elements within another node or a page. Exists in both lxml.html and lxml.etree.
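
A short sketch of `find` and `findall` on an invented fragment. `find` returns the first match (or `None`), `findall` returns a list, and a bare tag name only looks at direct children unless you prefix it with `.//`.

```python
from lxml import html

page = html.fromstring(
    "<div id='main'><ul><li>a</li><li>b</li></ul><p>intro</p><p>more</p></div>"
)

first_p = page.find("p")          # first direct child <p>
items = page.findall(".//li")     # every <li> anywhere below the div
missing = page.find("table")      # no match: None, not an exception

item_texts = [li.text for li in items]
```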

descendants/children/siblings/ancestors

all elements have iterchildren, itersiblings, iterancestors and iterdescendants.
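
A minimal sketch of the four iterators, walking around a made-up list item. Note that `itersiblings()` goes forward by default and needs `preceding=True` to go backward.

```python
from lxml import html

page = html.fromstring(
    "<html><body><div><ul><li>a</li><li id='mid'>b</li><li>c</li></ul>"
    "<p>tail</p></div></body></html>"
)
div = page.find(".//div")
mid = page.find(".//li[@id='mid']")

children = [el.tag for el in div.iterchildren()]          # direct children only
descendants = [el.tag for el in div.iterdescendants()]    # whole subtree, document order
next_siblings = [el.text for el in mid.itersiblings()]    # siblings after 'mid'
prev_siblings = [el.text for el in mid.itersiblings(preceding=True)]
ancestors = [el.tag for el in mid.iterancestors()]        # nearest parent first
```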

forms

can find all (normal) forms on a page. Beware of CAPTCHAs and the like.
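
A small sketch of the `forms` helper on a made-up search form. It only reads and fills in the form locally; actually submitting it (and dealing with CAPTCHAs) is out of scope here.

```python
from lxml import html

page = html.fromstring("""
<html><body>
<form action="/search" method="get">
  <input type="text" name="q" value="python">
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Go">
</form>
</body></html>
""")

form = page.forms[0]            # list of every <form> on the page
action, method = form.action, form.method

# fields behaves like a dict of input name -> value.
form.fields["q"] = "lxml"

# form_values() lists the (name, value) pairs a browser would send.
values = dict(form.form_values())
```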

text, text_content, and itertext

ways to get content without tags.
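
A quick sketch of the three, on a made-up paragraph. `text` stops at the first child tag, `text_content()` flattens everything into one string, and `itertext()` hands you the pieces one at a time.

```python
from lxml import html

p = html.fromstring("<p>Scrape <b>it</b> 'til you <i>make</i> it.</p>")

leading = p.text                 # only the text before the first child element
flat = p.text_content()          # all text in the subtree, tags dropped
pieces = list(p.itertext())      # the same text, chunk by chunk
```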

If you have to parse in realtime, LXML is sometimes too much.

re: html == strings == parseable.
feedparser: standard XML has rules, feedparser knows them.
HTMLParser: good base class for your own HTML parser. Good for “I have an idea about how I want to handle embed tags”.
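
A minimal standard-library sketch of the lighter options, assuming Python 3 (where the base class lives in `html.parser`). The snippet and the `EmbedCollector` class are hypothetical names for illustration; feedparser is left out because it is a third-party package.

```python
import re
from html.parser import HTMLParser

snippet = ('<p>Watch: <embed src="/media/talk.swf" width="400"> '
           'and <a href="/slides">slides</a></p>')

# re: fine for one narrow pattern in markup you already understand.
hrefs = re.findall(r'href="([^"]+)"', snippet)


class EmbedCollector(HTMLParser):
    """Collects the src of every <embed> tag it sees."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "embed":
            self.embeds.append(dict(attrs).get("src"))


collector = EmbedCollector()
collector.feed(snippet)
collector.close()
embeds = collector.embeds
```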

Content is 1/2 of the equation.

I’m tired of ugly pages with badass content.

Note

Text = Content = Boss
