(Korean) Text Analysis in R and Pajek (incomplete)

R and its almost endless library of packages and plug-ins (CRAN) mean that you can do almost anything in R, including text analysis and network analysis. While you could do everything in R, that doesn’t mean you should. Specialized network analysis software can also be very useful when interpreting, analyzing or visualizing a network, as opposed to trying to automate everything with an R script. You don’t have to be monogamous: you can love R and you can love other software too.

The following tutorial explains how R can be used for text analysis (including creating word clouds) and then how your network can be exported so you can analyze it in Pajek.

  • Loading the corpus (text), processing it and doing basic analysis (word counts) is done using the quanteda package (detailed guide here)
  • Making a Word Cloud is done using ggplot2 (detailed guide here)
  • We then show you how to export the network to Pajek, a popular open-source network analysis and visualization program (official page here)

You can install the aforementioned R packages by typing:

install.packages('quanteda')
install.packages('ggplot2')

Even if you installed the packages earlier, running the install command again will simply reinstall them, updating to the latest version if one is available.

Loading Text Corpus

First you need to load the text you want to analyze into R. In this particular example the text is comments scraped from a website and stored in a TXT file (which you can open in Windows with Notepad). Every line is a new comment. The filename is ‘comments.txt‘.

In the first code chunk below, we start by loading the package quanteda using the library command.

Then we import the text from comments.txt into a data frame using the read.csv command. This command is used to load CSV files ("comma separated values", a kind of spreadsheet). Because commas are the default separator in a CSV file, and our comments might contain commas, we specify a different separator so that commas inside comments don't break the columns: basically anything we are sure won't appear in the text. In this example we set sep = "|". We also set header = FALSE, because the file has no header row and we don't want the first comment to be mistaken for a column name.

For easier processing we name the column containing the text ‘text’ using the names command.

library(quanteda)
textfile <- read.csv('comments.txt', sep = "|", header = FALSE) # no header row: every line is a comment
names(textfile) <- c('text') # call the text column 'text'

Next we add a comment_ label to each piece of text; when we build the corpus, quanteda stores this as a document variable (quanteda can also do much more complex text analysis than what is shown in this example). We then view a summary of the text corpus. You can also click on text.corpus in RStudio to see what's inside.

textfile$label <- paste0('comment_', row.names(textfile))
text.corpus <- corpus(textfile)
summary(text.corpus)

Tadaaa! We have imported our text corpus. Now it’s time to process!

Text Processing

The next bit of code lets you process the text, basically cleaning it up.

We begin by removing punctuation and numbers, because they are not important in this particular situation.

text.tokens <- tokens(text.corpus, remove_punct = TRUE, remove_numbers = TRUE)

Then we remove stop words like “the”, “a”, etc. which we do not want to analyze. quanteda is awesome in that it has libraries of commonly used stop words for multiple languages. See here. In this case we use Korean stopwords (language ko) from the marimo repository.

text.tokens <- tokens_select(text.tokens, stopwords('ko', source='marimo'), selection='remove')

The list of stopwords should be critically assessed. The scraped website comments that we used to try this out were filled with slang, for example, so many additional stopwords were added. An example of the kind of Korean-language stopwords that might need to be added can be found in this example of Korean text analysis with quanteda on GitHub. You can also identify stopwords from the word frequency analysis (see below).
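If you want to remove extra terms on top of the built-in list, one option is to append them to the stopword vector before calling tokens_select. This is only a minimal sketch: the slang terms below are placeholders, not taken from the original analysis.

custom.stopwords <- c(stopwords('ko', source = 'marimo'), 'ㅋㅋ', 'ㅎㅎ') # placeholder slang terms, replace with whatever pollutes your corpus
text.tokens <- tokens_select(text.tokens, custom.stopwords, selection = 'remove') # remove the extended stopword list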

You can also do some other processing, such as reducing words to the word stem or harmonizing all words to lower case. These aren’t relevant when processing a Korean language text, but they may be relevant for other languages, such as English.

text.tokens <- tokens_wordstem(text.tokens)
text.tokens <- tokens_tolower(text.tokens)

When you’re done, you can compile all the beautifully clean processed text into a document feature matrix.

text.dfm.final <- dfm(text.tokens)

Word Frequency and Word Cloud

Finally, we can start to do the fun stuff: text analysis! As a first step it's worthwhile to look at the word frequency analysis to see if there are any frequently used words "polluting" your analysis. For example, in an analysis about a movie, you may want to remove the title of the movie. The code for producing a word frequency data frame named wfreq, containing the 100 most frequently occurring words, is below:

wfreq <- as.data.frame(topfeatures(text.dfm.final, 100)) # data frame of the 100 most frequent words
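If the frequency table does surface a term you want to exclude (a movie title, say), one way to handle it is to drop the term from the document feature matrix and recompute the table. A minimal sketch, where 'movietitle' is a placeholder rather than anything from the original data:

text.dfm.final <- dfm_remove(text.dfm.final, pattern = 'movietitle') # drop the unwanted term
wfreq <- as.data.frame(topfeatures(text.dfm.final, 100)) # recompute the frequency table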

The word frequency data can also be converted into a word cloud, whereby more frequently occurring words appear larger and in the center of the cloud.

set.seed(132) # fix the random layout so the word cloud is reproducible
textplot_wordcloud(text.dfm.final, max_words = 100) # with quanteda >= 3, this function lives in the quanteda.textplots package, so install and load that too if needed

Depending on the text used, this enables you to generate word clouds that will look something like this…

Export to Pajek

(to be added)

co.matrix <- fcm(text.tokens, context = 'document', tri = FALSE) # generate co-word matrix (within the same comment)
feat <- names(topfeatures(co.matrix, 30)) # select the top-30 words
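The rest of this section still has to be written, but as a rough idea of where it is heading, here is a minimal sketch of one way to turn the co-word matrix into a Pajek .net file. It assumes the igraph package is available; the helper calls and the file name comments_network.net are illustrative choices, not part of the original tutorial.

library(igraph) # assumed extra dependency: install.packages('igraph') if needed
co.matrix.small <- fcm_select(co.matrix, pattern = feat, selection = 'keep') # keep only the top-30 words
co.graph <- graph_from_adjacency_matrix(as.matrix(co.matrix.small), mode = 'undirected', weighted = TRUE, diag = FALSE) # weighted co-occurrence network
V(co.graph)$id <- V(co.graph)$name # igraph's Pajek writer uses the 'id' attribute for vertex labels
write_graph(co.graph, file = 'comments_network.net', format = 'pajek') # write the network in Pajek's .net format

The resulting .net file can then be read into Pajek for further analysis and visualization.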