Corpus Loading and Text Cleaning with Quanteda in R

Assuming you have a file with text data (perhaps a spreadsheet that you have exported as a CSV file, or data scraped from Google News), you can now start to build and clean your corpus. Fortunately, this is made very easy by functions in the Quanteda package.
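If you have not used quanteda before, you will need to install it first (this only needs to be done once); a minimal sketch, assuming you are installing from CRAN:

# Install quanteda from CRAN (only needed once)
install.packages('quanteda')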

First, we load quanteda and the corpus (mycorpus.csv, which has a text column called all_para), and we make sure it is readable by quanteda’s corpus() function by converting the loaded data to character with as.character().

library(quanteda)
# Read the CSV file; the text lives in the all_para column
mycorpus <- read.csv('mycorpus.csv')
# Make sure the text column is character (older R versions may read it as factors)
mycorpus$all_para <- as.character(mycorpus$all_para)
# Build a quanteda corpus from the character vector
mycorpus <- corpus(mycorpus$all_para)
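
To check that the corpus was created correctly, you can take a quick look at it; a small sketch (the output will of course depend on your own data):

# Inspect the corpus: number of documents, and a summary of the first five
ndoc(mycorpus)
summary(mycorpus, n = 5)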

Second, we use quanteda’s built-in functions to convert the corpus into tokens (for simplicity’s sake, think of tokens as individual words). We then carry out several “cleaning” steps, namely:

  1. Remove punctuation (remove_punct = TRUE) and numbers (remove_numbers = TRUE).
  2. Remove stop words such as “the”, “a”, “them”, “he”, “she”, etc. There are several repositories of stopwords for different languages, but do test them carefully as not all may be equally good (e.g. issues with Korean). In this case we use English stopwords (en) and the stopwords-iso source, which seems pretty good.
  3. Next, depending on the topic you are researching, it can be useful to remove some additional words, for instance the words used in your search terms: these will likely be very frequent and can “overshadow” other words in your analysis, because almost every document contains them. Should this be the case, you can put them in a vector, called samesame in the example (see the sketch after this list). In your first round of analysis you may want to leave this step out, hence the # that comments out the corresponding line below.
  4. Shorten words (tokens) to their word stem, e.g. collapsing singular and plural forms into one. While this is a useful function, it sometimes truncates words in an odd way.
  5. Convert all words to lowercase, so that “Happy” and “happy” are not treated as two different words; this is usually a good idea.
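
Two of the steps above can be previewed before running the full pipeline. The sketch below is illustrative: the stopword peek shows what will be removed, the terms in samesame are placeholders you would replace with your own search terms, and char_wordstem demonstrates the occasionally odd truncation mentioned in step 4.

# Peek at the stopword list to check it suits your material
head(stopwords('en', source = 'stopwords-iso'))

# Hypothetical example: overly frequent words (e.g. your search terms) to remove
samesame <- c('climate', 'change')

# Preview stemming: note the odd truncation, e.g. "happiness" becomes "happi"
char_wordstem(c('happiness', 'happy', 'cities'))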

In the final step, the tokens are converted into a “Document Feature Matrix” (DFM): a matrix in which each row is a document and each column records how often a feature (here, a cleaned word) occurs. For the purposes of this tutorial, just imagine that it’s your box of cleaned words.

# Tokenize, dropping punctuation and numbers
mytokens <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE)
# Remove English stopwords from the stopwords-iso source
mytokens <- tokens_select(mytokens, stopwords('en', source='stopwords-iso'), selection='remove')
# Optionally remove your own list of overly frequent words (e.g. search terms)
#mytokens <- tokens_select(mytokens, samesame, selection='remove')
# Shorten tokens to their word stems
mytokens <- tokens_wordstem(mytokens)
# Convert everything to lowercase
mytokens <- tokens_tolower(mytokens)

# Build the Document Feature Matrix from the cleaned tokens
mydfm <- dfm(mytokens)
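
As a quick sanity check, you can look at the dimensions of the DFM and its most frequent features; a short sketch (the output depends on your corpus):

# Number of documents and features in the DFM
dim(mydfm)
# The ten most frequent features after cleaning
topfeatures(mydfm, 10)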

With your bright, shiny and clean DFM loaded, it’s time to do some text analysis!

Next: Basic Text Analysis and Visualization in R