This session provides a basic introduction to media analysis using R. It covers the full workflow: acquiring the data, cleaning and processing it, and some simple analysis and visualization.
We will acquire data from two sources: Google News RSS feeds and Twitter.
Before joining the session, please make sure that you have:
- An internet connection.
- R and RStudio installed on your laptop.
- The packages/libraries tidyRSS, rvest, twitteR, quanteda, quanteda.textplots and ggplot2 (optional) installed.
- A Twitter account (to load Twitter data).
You can install the packages with the following code. If you have already installed the packages, this code will re-install the latest versions. Note: depending on your setup and the R packages you previously installed, the installation can take some time.
install.packages(c('tidyRSS', 'rvest', 'quanteda', 'quanteda.textplots', 'ggplot2'))
Now that you are ready, please go ahead and follow these steps (mini tutorials).
- Google News Scraping in R
- Loading Twitter Data in R (no longer feasible for a demo due to updated API use rules)
- Corpus Loading and Text Cleaning with Quanteda in R
- Basic Text Analysis and Visualization in R
Below are some more code examples as used during the actual demo… (please note, the code is not very well edited or cleaned up).
#TUTORIAL: https://pstek.nl/2022/asb-research-center-reading-group-media-analysis-with-r/
setwd('~/textmining/') #change to your own working directory
#Scrape Google News RSS
name <- 'asbdemo' #prefix for the output files
query <- c('"Asia School of Business"') #search terms; the inner quotes force an exact phrase match
library(tidyRSS)
for(n in seq_along(query)){
  #build the Google News RSS search URL for this query
  url <- paste0('https://news.google.com/rss/search?q=', URLencode(query[n], reserved = T), '&hl=en-MY&gl=MY&ceid=MY:en')
  #the first query starts the data frame, later queries are appended
  if(n == 1){ articles <- tidyfeed(url) } else { articles <- rbind(articles, tidyfeed(url)) }
  Sys.sleep(5) #pause between requests
}
write.csv(articles[,1:13], paste0(name,'-articles.csv'), row.names= F)
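#Optional sanity check (not part of the original demo): see how many articles came back and which columns the feed provides
nrow(articles)
names(articles)
head(articles$item_title)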
#Scrape and clean
library(rvest)
all_para <- c()
for(n in 1:min(20, nrow(articles))){ #scrape at most the first 20 article pages
  html <- read_html(articles$item_link[n])
  para <- html_text(html_nodes(html, 'p')) #extract all paragraph text from the page
  all_para <- c(all_para, para)
  Sys.sleep(5) #pause between requests
}
para_tbl <- as.data.frame(table(all_para))
para_tbl <- subset(para_tbl, Freq < 2) #keep only paragraphs that appear once; repeats across pages are usually boilerplate
write.csv(para_tbl, paste0(name,'-corpus.csv'), row.names= F)
#manually remove junk lines at the top of the CSV (containing dates, key words, etc.)
para_tbl2 <- read.csv(paste0(name,'-corpus.csv'))
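#Optional alternative to the manual CSV clean-up: drop very short paragraphs (menus, bylines, cookie notices);
#the 50-character cut-off is an arbitrary guess, adjust it after inspecting your own data
para_tbl2$all_para <- as.character(para_tbl2$all_para)
para_tbl2 <- subset(para_tbl2, nchar(all_para) >= 50)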
#Make corpus
library(quanteda)
mycorpus <- corpus(sub(" - .*", "", articles$item_title)) #use the article titles, stripping the " - publisher" suffix that Google News appends
##para_tbl2$all_para <- as.character(para_tbl2$all_para)
##mycorpus <- corpus(para_tbl2$all_para)
#summary(mycorpus)
samesame <- c('asia', 'school', 'business') #terms from the search query itself; they appear in (nearly) every article, so remove them
#Clean corpus
mytokens <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE)
mytokens <- tokens_select(mytokens, stopwords('en', source='stopwords-iso'), selection='remove')
mytokens <- tokens_select(mytokens, samesame, selection='remove')
mytokens <- tokens_wordstem(mytokens)
mytokens <- tokens_tolower(mytokens)
mydfm <- dfm(mytokens)
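#Optional: trim very rare terms before the analysis (min_termfreq = 2 is an arbitrary choice);
#swap mydfm for mydfm_trimmed below if you use it
mydfm_trimmed <- dfm_trim(mydfm, min_termfreq = 2)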
#Analyse
#top 10 terms as a data frame for plotting
wfreq <- as.data.frame(topfeatures(mydfm, 10))
names(wfreq) <- 'n'
wfreq$word <- row.names(wfreq)
library(ggplot2)
ggplot(wfreq, aes(x = reorder(word, -n), y = n)) + geom_bar(stat = "identity") + xlab('')
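#Optional: save the bar chart to a file (the file name is just an example)
ggsave('asbdemo-topwords.png', width = 6, height = 4)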
library(quanteda.textplots)
set.seed(123); textplot_wordcloud(mydfm, max_words = 100)
myfcm <- fcm(mydfm) #feature co-occurrence matrix
dim(myfcm)
feat <- names(topfeatures(myfcm, 50)) #keep the 50 most frequent features
myselect <- fcm_select(myfcm, pattern = feat, selection = "keep")
dim(myselect)
size <- log(colSums(dfm_select(mydfm, feat, selection = "keep"))) #scale vertex size by (log) term frequency
set.seed(112)
textplot_network(myselect, min_freq = 0.8, vertex_size = size / max(size) * 3)
textplot_network(myselect)
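#Optional: inspect a term in its original context with keyword-in-context; 'student*' is just an example pattern, use any term from the plots above
kwic(tokens(mycorpus), pattern = 'student*', window = 5)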