Basic Text Analysis and Visualization in R

At its most basic level, text analysis is about counting words. If words are frequently used, we assume that they are important. If words occur together, we assume that they are related. Obviously, that is not always the case, but a discerning researcher like yourself will be able to filter this information, provide context and meaning, and draw the appropriate conclusions.

Text analysis often provides you with an opportunity to gather more solid evidence of the importance of certain words or concepts and of the relationships between them, and it can sometimes lead to the discovery of hidden patterns that you, as a human observer, may have missed.

In this short tutorial, using the quanteda, ggplot2 and quanteda.textplots packages, the following analysis methods are covered:

  • Word frequency analysis and visualization in a frequency chart and word cloud.
  • Co-word analysis and visualization in a co-word network.

Before starting, it is assumed that you have cleaned your corpus and created a Document-Feature Matrix (DFM) called mydfm (see the earlier tutorial).
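
If you do not have such a DFM at hand, below is a minimal sketch of how one might be built. The character vector mytexts and the cleaning choices are hypothetical placeholders; substitute your own documents and preprocessing steps.

library(quanteda)
# Hypothetical input: a character vector with one document per element
mytexts <- c("Sustainable investment is growing in the Philippines.",
             "Investors weigh sustainability against short-term returns.")
# Tokenize, clean, and build the document-feature matrix
mytoks <- tokens(mytexts, remove_punct = TRUE, remove_numbers = TRUE)
mytoks <- tokens_remove(mytoks, stopwords("english"))
mydfm <- dfm(mytoks)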

Word Frequency Analysis

Word frequency analysis is about counting words, and it is useful to look at, say, the 100 most frequently used words (topfeatures), which can be extracted as follows:

library(quanteda)
# Extract the 100 most frequent words and their counts into a data frame
wfreq <- data.frame(topfeatures(mydfm, 100))

Aside from inspecting the words in the data frame, you can also visualize them.

Below is an example of the 10 most frequently occurring words in a bar chart (based on a news corpus about sustainable investment in the Philippines). Note that we first reshape the wfreq data frame so that it can be read properly by ggplot.

library(ggplot2)
wfreq <- as.data.frame(topfeatures(mydfm, 10))
names(wfreq) <- "n"             # give the frequency column a readable name
wfreq$word <- row.names(wfreq)  # move the words from the row names into a column
ggplot(wfreq, aes(x = reorder(word, -n), y = n)) + geom_bar(stat = "identity") + xlab('')

And next, the 100 most frequently occurring words in a word cloud!

library(quanteda.textplots)
set.seed(123)  # fix the random layout so the word cloud is reproducible
textplot_wordcloud(mydfm, max_words = 100)

Word frequency analysis is a useful first step to explore your corpus. Based on it, you may wish to remove some additional high-frequency words that “overshadow” the rest of the corpus, as sketched below. But these are always decisions that require thought and justification.
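
A minimal sketch of such a removal, using quanteda's dfm_remove. The two words here are hypothetical examples of dominant terms in this corpus, not a recommendation:

# Hypothetical example: drop dominant corpus-defining terms before re-plotting
mydfm <- dfm_remove(mydfm, pattern = c("sustainable", "investment"))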

Co-Word Analysis

For co-word analysis, we convert the DFM into a Feature Co-occurrence Matrix (FCM) using fcm. For the purpose of this tutorial, think of the FCM as a “box” filled with the connections between words.

Because the FCM can be very large, and thus difficult to analyze visually, we find the 50 most frequently occurring words (topfeatures) and then keep only those in a new FCM called myselect using fcm_select.

myfcm <- fcm(mydfm)                    # feature co-occurrence matrix
feat <- names(topfeatures(myfcm, 50))  # the 50 most frequent words
myselect <- fcm_select(myfcm, pattern = feat, selection = "keep")

Having cut the FCM to size, we can now generate a co-occurrence network. To keep the network readable, it can be useful to impose a minimum frequency (min_freq), removing very rarely occurring links. You may also want to scale the size of the vertices (vertex_size) so that more frequent words appear larger.

# Vertex size: log of each word's frequency, scaled to a maximum of 3
size <- log(colSums(dfm_select(mydfm, feat, selection = "keep")))
set.seed(112)  # fix the random layout so the network is reproducible
textplot_network(myselect, min_freq = 0.8, vertex_size = size / max(size) * 3)

This network can be analyzed further using network analysis techniques (one possible starting point is sketched below), and its visualization can be refined in many ways. The present tutorial gives you a basis to start with.
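
As one such next step, here is a minimal sketch of handing the trimmed FCM to the igraph package for further analysis. This assumes igraph is installed and that the FCM is in quanteda's default upper-triangular form; the statistics shown are just examples.

library(igraph)
# Convert the upper-triangular FCM into a weighted, undirected graph,
# dropping self-loops on the diagonal
g <- graph_from_adjacency_matrix(as.matrix(myselect), mode = "upper",
                                 weighted = TRUE, diag = FALSE)
# Example statistics: connections per word, and total co-occurrence weight
sort(degree(g), decreasing = TRUE)[1:10]
sort(strength(g), decreasing = TRUE)[1:10]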