Google News Scraping in R

Google News is a popular news aggregator that can be used to search for news from diverse sources. You may have heard of Google News alert e-mails, but you can also use the service to scrape news stories for the purposes of systematic text analysis.

This post provides an example of how to use tidyRSS and rvest to identify and then scrape stories from Google News, using the following steps:

  1. Generate a URL to obtain a Google News RSS feed,
  2. Read the RSS feed using tidyRSS,
  3. Use rvest to visit the links of each news story and scrape the article content.
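
If you do not already have the two packages installed, you can get them from CRAN first (a one-off step, assuming a standard CRAN setup):

# Install the packages used in this post (only needed once)
install.packages(c("tidyRSS", "rvest"))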

Before you proceed, please note that the process described here may violate the terms of service of Google News and of the websites you scrape, and the articles themselves are subject to copyright. It is therefore recommended to use this approach sparingly and in a way that does not overload Google News or the pages you are scraping; otherwise your IP address will likely get blocked.

The Google News RSS Feed

Google News has a fantastic feature that lets you use search terms to generate an RSS feed. An RSS feed is a summary of information (e.g. news articles or blog posts) that can easily be read by a machine (e.g. imported into a spreadsheet).

Google News search is very systematic, so you can specify precisely what you are looking for. For example, you can search for the name of a politician and their comments on certain topics, as reported by a particular news website. Your search could be:

“Boris Johnson” AND (“Ukraine” OR “Putin” OR “Russia”) site:bbc.co.uk

This would give you all coverage from the BBC concerning what BoJo has said about Ukraine, Putin and Russia. To see what kind of articles appear, do a trial search on the regular Google News website.

To generate a Google News RSS feed URL (= website address), you can use the following code to inject your search term:

search_term <- '"Boris Johnson" AND ("Ukraine" OR "Putin" OR "Russia") site:bbc.co.uk'
url <- paste0('https://news.google.com/rss/search?q=', URLencode(search_term, reserved = TRUE), '&hl=en-MY&gl=MY&ceid=MY:en')

As you may notice, the search_term is encoded into a URL-safe format via the URLencode function. You may also notice that the URL ends with &hl=en-MY&gl=MY&ceid=MY:en; this specifies your language and country preference (in this case, English for Malaysia). If you do not provide this last part, Google will generate it automatically based on your IP address and the language it thinks you use. For the sake of reproducibility, it is wise to specify it explicitly, as Google is known to personalize search results.
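
For example, if you wanted results tailored to the United Kingdom instead, you could swap the locale parameters. This is a minimal sketch; the en-GB/GB values are just an illustrative assumption about the locale you want:

# Same query, but with a UK language/country preference instead of Malaysia
url_uk <- paste0('https://news.google.com/rss/search?q=',
                 URLencode(search_term, reserved = TRUE),
                 '&hl=en-GB&gl=GB&ceid=GB:en')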

tidyRSS

In the next step, tidyRSS is used to read the Google News RSS feed URL that was just generated. Note that Google News gives a maximum of 100 results via the RSS feed. The following code should work:

library(tidyRSS)
articles <- tidyfeed(url)

The articles output is a data frame that you can write to a CSV file or spreadsheet, or analyze further.

The data frame contains the title of the article, a short description and a link to the original article (the item_link column).
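
To get a quick feel for what came back, you can inspect those columns directly. A minimal sketch, using only the item_title and item_link columns mentioned above:

# How many articles did the feed return, and what do the first few look like?
nrow(articles)
head(articles[, c('item_title', 'item_link')])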

The next step presents several options. When doing a systematic text analysis, it is possible to look only at article titles (the item_title column), but in this case we want to scrape the main article, so we use the links (item_link column).

Scraping News Articles with rvest

Web scraping is a powerful tool for building a database of information. What we are doing here is writing a program (a bot) that goes through a specific list of links and reads and saves their content.

In the next bit of code we use a loop (the for… part) to visit each link. For each page we read the HTML (read_html), extract all text in paragraph tags marked by ‘p’ (html_nodes, html_text), and add those paragraphs to a vector (basically a long list of paragraphs) named all_para. We then tell the program to take a 10-second break before moving to the next link (Sys.sleep), because if we don’t, our IP will likely be blocked. This is particularly important if you are scraping several articles from the same website.

library(rvest)
all_para <- c()
for(n in 1:nrow(articles)){
  # Download the page behind the n-th link from the RSS feed
  html <- read_html(articles$item_link[n])
  # Extract the text of every <p> (paragraph) element on the page
  para <- html_text(html_nodes(html, 'p'))
  # Add these paragraphs to the running vector
  all_para <- c(all_para, para)
  # Pause for 10 seconds so we do not overload the servers
  Sys.sleep(10)
}

One point to note is that we are now set up to do the text analysis at the level of a paragraph, not at the level of an article. You can also work at the article level, but you would need to adjust the code, as sketched below.
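
A minimal sketch of an article-level version, assuming you simply want to paste all paragraphs of each page into one text per article (this is just one way to adjust the loop, not the only one):

# Article-level variant: build one text per article instead of a vector of paragraphs
all_text <- character(nrow(articles))
for(n in 1:nrow(articles)){
  html <- read_html(articles$item_link[n])
  para <- html_text(html_nodes(html, 'p'))
  all_text[n] <- paste(para, collapse = ' ')  # collapse the paragraphs into one string
  Sys.sleep(10)
}
articles$text <- all_text  # keep the full text alongside the title and link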

Secondly, not every piece of text that has been scraped is useful. Some of it may be irrelevant, such as a copyright notice, or the description of an ad or another news story.

One way of automatically filtering the data is to drop any paragraph that occurs more than once, since repeated text is usually boilerplate rather than article content, but a manual check should also be done. To do this, you can use the following code:

# Count how often each paragraph occurs, then keep only those that occur once
para_tbl <- as.data.frame(table(all_para))
para_tbl <- subset(para_tbl, Freq < 2)
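
If you want to see what the filter removed (and check that it really was boilerplate), you can look at the paragraphs that occurred more than once. This is just a quick sanity check, not part of the corpus itself:

# Paragraphs that appeared more than once - usually menus, notices and other boilerplate
removed <- subset(as.data.frame(table(all_para)), Freq >= 2)
head(removed[order(-removed$Freq), ])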

And there you have it! You have now scraped your very own “corpus” of news. You may wish to save your corpus as a CSV file for later use, or as a backup. Especially for a smaller corpus, it can be wise to look through it and remove any erroneously included paragraphs, as automatic web scraping is rarely an exact science. To save your data, you can use the following code:

write.csv(para_tbl, 'mycorpus.csv', row.names = FALSE)

Next: Corpus Loading and Text Cleaning with Quanteda in R