class: center, middle, inverse, title-slide

# Web Scraping
## Statistical Programming
### Fall 2021
### Dr. Colin Rundel

---
exclude: true

```r
library(magrittr)
library(rvest)
```

---
class: middle, center

<img src="imgs/hex-rvest.png" width="45%" style="display: block; margin: auto;" />

---

## Hypertext Markup Language

Most of the data on the web is still largely available as HTML - while it is structured (hierarchical), it is often not available in a form useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br/>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>
```

---

## rvest

`rvest` is a package from the tidyverse that makes basic processing and manipulation of HTML data straightforward. It provides high level functions for interacting with html via the xml2 library.

<br/>

Core functions:

* `read_html()` - read HTML data from a url or character string.

* `html_element()` / `html_elements()` - select the first matching element / all matching elements from the HTML document using CSS selectors.

* `html_table()` - parse an HTML table into a data frame.

* `html_text()` / `html_text2()` - extract tags' text content.

* `html_name()` - extract tags' names.

* `html_attrs()` - extract all of each tag's attributes.

* `html_attr()` - extract tags' attribute value by name.

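---

## `html_element()` vs `html_elements()`

The two selection functions listed above differ in how they handle missing matches, which matters when building a data frame. The following is a minimal sketch, assuming a small made-up document (the names and roles are illustrative only and do not come from any real site).

```r
library(magrittr)
library(rvest)

# Toy document invented for illustration: the second item has no <i> tag
doc = read_html('
  <ul>
    <li><b>Alice</b> <i>analyst</i></li>
    <li><b>Bob</b></li>
    <li><b>Carol</b> <i>engineer</i></li>
  </ul>
')

items = doc %>% html_elements("li")

# html_elements() returns every match across all inputs,
# so the missing <i> for Bob simply shortens the result
roles_all = items %>% html_elements("i") %>% html_text()

# html_element() returns exactly one result per input node,
# with NA where nothing matches, so lengths stay aligned
roles_each = items %>% html_element("i") %>% html_text()

name = items %>% html_element("b") %>% html_text()

data.frame(name = name, role = roles_each)
```

Here `roles_all` has two values while `roles_each` has three (with `NA` for Bob), so only the latter lines up with `name` row by row.
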
---

## html, rvest, & xml2

```r
html = 
'<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br/>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>'

read_html(html)
```

```
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <p align="center">Hello world!</p>\n <br><div class="name" ...
```

---

## css selectors

We will be using a tool called selector gadget to help us identify the html elements of interest - it does this by constructing a css selector which can be used to subset the html document.

<br/>

.small[
Selector          | Example          | Description
:-----------------|:-----------------|:--------------------------------------------------
element           | `p`              | Select all &lt;p&gt; elements
element element   | `div p`          | Select all &lt;p&gt; elements inside a &lt;div&gt; element
element>element   | `div > p`        | Select all &lt;p&gt; elements with &lt;div&gt; as a parent
.class            | `.title`         | Select all elements with class="title"
#id               | `#name`          | Select all elements with id="name"
[attribute]       | `[class]`        | Select all elements with a class attribute
[attribute=value] | `[class=title]`  | Select all elements with class="title"
]

.footnote[There are also a number of additional combinators and pseudo-classes that improve flexibility, see examples [here](https://www.w3schools.com/cssref/css_selectors.asp)]

---

## Selecting tags

```r
read_html(html) %>% html_elements("p")
```

```
## {xml_nodeset (1)}
## [1] <p align="center">Hello world!</p>
```

--

```r
read_html(html) %>% html_elements("p") %>% html_text()
```

```
## [1] "Hello world!"
```

--

```r
read_html(html) %>% html_elements("p") %>% html_attrs()
```

```
## [[1]]
##    align 
## "center"
```

--

```r
read_html(html) %>% html_elements("p") %>% html_attr("align")
```

```
## [1] "center"
```

---

## More selecting tags

```r
read_html(html) %>% html_elements("div")
```

```
## {xml_nodeset (7)}
## [1] <div class="name" id="first">John</div>
## [2] <div class="name" id="last">Doe</div>
## [3] <div class="contact">\n <div class="home">555-555-1234</div>\n ...
## [4] <div class="home">555-555-1234</div>
## [5] <div class="home">555-555-2345</div>
## [6] <div class="work">555-555-9999</div>
## [7] <div class="fax">555-555-8888</div>
```

--

```r
read_html(html) %>% html_elements("div") %>% html_text()
```

```
## [1] "John"
## [2] "Doe"
## [3] "\n 555-555-1234\n 555-555-2345\n 555-555-9999\n 555-555-8888\n "
## [4] "555-555-1234"
## [5] "555-555-2345"
## [6] "555-555-9999"
## [7] "555-555-8888"
```

---

## Nesting tags

```r
read_html(html) %>% html_elements("body div")
```

```
## {xml_nodeset (7)}
## [1] <div class="name" id="first">John</div>
## [2] <div class="name" id="last">Doe</div>
## [3] <div class="contact">\n <div class="home">555-555-1234</div>\n ...
## [4] <div class="home">555-555-1234</div>
## [5] <div class="home">555-555-2345</div>
## [6] <div class="work">555-555-9999</div>
## [7] <div class="fax">555-555-8888</div>
```

--

```r
read_html(html) %>% html_elements("body>div")
```

```
## {xml_nodeset (3)}
## [1] <div class="name" id="first">John</div>
## [2] <div class="name" id="last">Doe</div>
## [3] <div class="contact">\n <div class="home">555-555-1234</div>\n ...
```

--

```r
read_html(html) %>% html_elements("body div div")
```

```
## {xml_nodeset (4)}
## [1] <div class="home">555-555-1234</div>
## [2] <div class="home">555-555-2345</div>
## [3] <div class="work">555-555-9999</div>
## [4] <div class="fax">555-555-8888</div>
```

---

## CSS classes and ids

```r
read_html(html) %>% html_elements(".name")
```

```
## {xml_nodeset (2)}
## [1] <div class="name" id="first">John</div>
## [2] <div class="name" id="last">Doe</div>
```

--

```r
read_html(html) %>% html_elements("div.name")
```

```
## {xml_nodeset (2)}
## [1] <div class="name" id="first">John</div>
## [2] <div class="name" id="last">Doe</div>
```

--

```r
read_html(html) %>% html_elements("#first")
```

```
## {xml_nodeset (1)}
## [1] <div class="name" id="first">John</div>
```

---

## Mixing it up

```r
read_html(html) %>% html_elements("[align]")
```

```
## {xml_nodeset (1)}
## [1] <p align="center">Hello world!</p>
```

```r
read_html(html) %>% html_elements(".contact div")
```

```
## {xml_nodeset (4)}
## [1] <div class="home">555-555-1234</div>
## [2] <div class="home">555-555-2345</div>
## [3] <div class="work">555-555-9999</div>
## [4] <div class="fax">555-555-8888</div>
```

---

## `html_text()` vs `html_text2()`

```r
html = read_html(
  "<p> 
    This is the first sentence in the paragraph.
    This is the second sentence that should be on the same line as the first sentence.<br>This third sentence should start on a new line.
  </p>"
)
```

--

<br/>

```r
html %>% html_text() %>% cat(sep="\n")
```

```
## 
##   This is the first sentence in the paragraph.
##   This is the second sentence that should be on the same line as the first sentence.This third sentence should start on a new line.
## 
```

```r
html %>% html_text2() %>% cat(sep="\n")
```

```
## This is the first sentence in the paragraph. This is the second sentence that should be on the same line as the first sentence.
## This third sentence should start on a new line.
```

---

## html tables

```r
html_table = 
'<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <table>
      <tr> <th>a</th> <th>b</th> <th>c</th> </tr>
      <tr> <td>1</td> <td>2</td> <td>3</td> </tr>
      <tr> <td>2</td> <td>3</td> <td>4</td> </tr>
      <tr> <td>3</td> <td>4</td> <td>5</td> </tr>
    </table>
  </body>
</html>'
```

--

```r
read_html(html_table) %>%
  html_elements("table") %>%
  html_table()
```

```
## [[1]]
## # A tibble: 3 × 3
##       a     b     c
##   <int> <int> <int>
## 1     1     2     3
## 2     2     3     4
## 3     3     4     5
```

---

## SelectorGadget

This is a JavaScript-based tool that helps you interactively build an appropriate CSS selector for the content you are interested in.

.center[
<img src="imgs/selectorgadget.png" width="45%" style="display: block; margin: auto;" />

[selectorgadget.com](http://selectorgadget.com)
]

---
class: middle

# Web scraping considerations

---

## "Can you?" vs "Should you?"

<img src="imgs/ok-cupid-1.png" width="60%" style="display: block; margin: auto;" />

.footnote[.small[
Source: Brian Resnick, [Researchers just released profile data on 70,000 OkCupid users without permission](https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release), Vox.
]]

---

## "Can you?" vs "Should you?"

<img src="imgs/ok-cupid-2.png" width="70%" style="display: block; margin: auto;" />

---

## Scraping permission & `robots.txt`

There is a standard for communicating whether it is acceptable to automatically scrape a website: the [robots exclusion standard], or `robots.txt`.

You can find examples at all of your favorite websites: [google](https://www.google.com/robots.txt), [facebook](https://facebook.com/robots.txt), etc.

--

These files are meant to be machine readable, but the `polite` package can handle this for us.

```r
polite::bow("http://google.com")
```

```
## <polite session> http://google.com
##     User-agent: polite R package - https://github.com/dmi3kno/polite
##     robots.txt: 282 rules are defined for 4 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent
```

```r
polite::bow("http://facebook.com")
```

```
## <polite session> http://facebook.com
##     User-agent: polite R package - https://github.com/dmi3kno/polite
##     robots.txt: 479 rules are defined for 20 bots
##    Crawl delay: 5 sec
##   The path is not scrapable for this user-agent
```

---

## Example - Rotten Tomatoes

For the movies listed in the **Popular Streaming Movies** list on `rottentomatoes.com`, create a data frame with each movie's title, its tomatometer score, whether the movie is fresh or rotten, and the movie's url.

---

## Exercise

Using the url for each movie, now go out and grab the number of reviews, the runtime, and the number of user ratings.

If you finish that, you can then try to scrape the MPAA rating and the audience score.