(Korean) Text Analysis in R and Pajek [INCOMPLETE]

R and its almost endless library of packages and plug-ins on CRAN mean that you can do almost anything in R, including text analysis and network analysis. But while you could do everything in R, that doesn't mean you should. Specialized network analysis software can be very useful for interpreting, analyzing or visualizing a network, as opposed to trying to automate everything with an R script. You don't have to be monogamous: you can love R and you can love other software too.

The following tutorial explains how R can be used for text analysis (including creating word clouds), and then how your network can be exported so you can analyze it in Pajek.

  • Loading the corpus (text), processing it and doing basic analysis (word counts) is done using the quanteda package (detailed guide here)
  • Making a Word Cloud is done using ggplot2 (detailed guide here)
  • We then show you how to export the network to Pajek, a popular open source network analysis and visualization program (official page here)

You can install the aforementioned R packages by typing:

install.packages('quanteda')
install.packages('ggplot2')

Even if you installed the packages earlier, running the install command again will simply reinstall the latest version from CRAN, which also serves to update the package.

Loading Text Corpus

First you need to load the text you want to analyze into R. In this particular example the text is comments scraped from a website and stored in a TXT file (which you can open in Windows with Notepad). Every line is a new comment. The filename is 'comments.txt'.

In the first code chunk below, we start by loading the package quanteda using the library command.

Then we import the text from comments.txt into a data frame using the read.csv command. This command is used to load CSV files ("comma separated values", a kind of spreadsheet). Because commas are the default separator in a CSV file, and our comments might contain commas, we need to choose a different separator so nothing gets mangled: basically anything that we are sure won't appear in the text. In this example we set sep = "|". Because the file has no header row (every line is a comment), we also set header = FALSE.

For easier processing we name the column containing the text ‘text’ using the names command.

library(quanteda)
# header = FALSE because the file has no header row; otherwise the first comment would be read as column names
textfile <- read.csv('comments.txt', sep = "|", header = FALSE)
names(textfile) <- c('text')

Next we add a comment_ label to each piece of text; when the data frame is turned into a corpus, quanteda keeps this extra column as a document variable (quanteda can also do more complex text analysis than what is shown in this example). We then view a summary of the text corpus. You can also click on text.corpus in RStudio to see what's inside.

textfile$label <- paste0('comment_', row.names(textfile)) # label each comment by its row number
text.corpus <- corpus(textfile) # quanteda expects the text in a column named 'text'
summary(text.corpus)

Tadaaa! We have imported our text corpus. Now it’s time to process!

Text Processing

The next bit of code lets you process the text, basically cleaning it up.

We begin by removing punctuation and numbers, because they are not important in this particular situation.

text.tokens <- tokens(text.corpus, remove_punct = TRUE, remove_numbers = TRUE)

Then we remove stop words like “the”, “a”, etc. which we do not want to analyze. quanteda is awesome in that it has libraries of commonly used stop words for multiple languages. See here. In this case we use Korean stopwords (language ko) from the marimo repository.

text.tokens <- tokens_select(text.tokens, stopwords('ko', source='marimo'), selection='remove')

The list of stopwords should be critically assessed. The scraped website comments that we used to try this out were filled with slang, for example, so many additional stopwords had to be added. An example of the kind of Korean-language stopwords that might need to be added can be found in this example of Korean text analysis with quanteda on Github. You can also identify stopwords from the word frequency analysis (see below) and remove them as sketched here.
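For illustration, a minimal sketch of removing hand-picked stopwords on top of the marimo list; the slang tokens below are hypothetical placeholders, not the ones from the original analysis:

custom.stopwords <- c('ㅋㅋ', 'ㅋㅋㅋ', 'ㅎㅎ') # hypothetical slang tokens identified by inspecting the data
text.tokens <- tokens_select(text.tokens, custom.stopwords, selection = 'remove')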

You can also do some other processing, such as reducing words to their word stem or harmonizing all words to lower case. These steps aren't relevant when processing a Korean-language text, but they may be relevant for other languages, such as English.

text.tokens <- tokens_wordstem(text.tokens)
text.tokens <- tokens_tolower(text.tokens)

When you’re done, you can compile all the beautifully clean processed text into a document feature matrix.

text.dfm.final <- dfm(text.tokens)

Word Frequency and Word Cloud

Finally, we can start to do the fun stuff: text analysis! As a first step it's worthwhile to look at the word frequency analysis to see if there are any frequently used words "polluting" your analysis. For example, in an analysis about a movie, you may want to remove the title of the movie. The code below produces a word frequency data frame named wfreq with the 100 most frequently occurring words:

wfreq <- as.data.frame(topfeatures(text.dfm.final, 100)) # base-R version of the original pipe, so no extra pipe package is needed
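If a frequent but uninformative word does turn up (the movie-title scenario above), it can be dropped from the document feature matrix. A quick sketch, where the word itself is a hypothetical placeholder:

head(wfreq, 10) # inspect the ten most frequent words
text.dfm.final <- dfm_remove(text.dfm.final, pattern = '영화') # hypothetical polluting word ('movie')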

The word frequency data can also be converted into a word cloud, whereby more frequently occurring words appear larger and in the center of the cloud.

library(quanteda.textplots) # in quanteda v3+ the plotting functions live in this separate package
set.seed(132); textplot_wordcloud(text.dfm.final, max_words = 100)

Depending on the text used, this enables you to generate word clouds that will look something like this…

Export to Pajek

(to be added)

co.matrix <- fcm(text.tokens, context = 'document', tri = FALSE) # generate co-word matrix (co-occurrence within the same comment)
feat <- names(topfeatures(co.matrix, 30)) # select top-30 words
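The export itself is still to be written up, but here is a minimal sketch of one possible route, assuming the igraph package (the filename is a placeholder): build a graph from the co-occurrences of the top-30 words and write it in Pajek's .net format, which Pajek can open directly.

library(igraph)
co.small <- fcm_select(co.matrix, pattern = feat) # keep only the top-30 words
g <- graph_from_adjacency_matrix(as.matrix(co.small), mode = 'undirected', weighted = TRUE, diag = FALSE)
write_graph(g, 'comments_network.net', format = 'pajek') # placeholder filename; open this file in Pajek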

limer code examples

limer is an R package that enables R users to connect directly to a Lime Survey installation via its API (for details, see earlier post), essentially giving you remote control and the possibility of automating certain procedures.

Because the documentation of limer and the Lime Survey API is a bit minimal and therefore quite confusing for a first-time user, I give some simple coding examples below to get you started.

First, we connect to our Lime Survey instance, which is installed at LIMESURVEY.URL and can be accessed with LIME.USERNAME and LIME.PASSWORD. Obviously you should replace these with your own installation's details in the code below. The get_session_key() command gets you a unique key through which you can securely access the Lime Survey installation; it is automatically used in all subsequent limer calls.

library(limer)

#LimeSurvey Server Info
options(lime_api = 'https://LIMESURVEY.URL/index.php/admin/remotecontrol')
options(lime_username = 'LIME.USERNAME')
options(lime_password = 'LIME.PASSWORD')

get_session_key()

The first example is a simple one that uses a built-in function from limer, get_responses. This simply allows you to download all the data (or only completed data) from a particular survey, #10001 in this case.

responses <- get_responses(10001, sCompletionStatus = 'all')
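If you only want the completed responses, the same call can be made with a different completion status (per the LimeSurvey API documentation):

responses.complete <- get_responses(10001, sCompletionStatus = 'complete') # completed responses only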

The second example requires greater knowledge of the LimeSurvey API because the limer package does not have a neat wrapper for these functions. Instead, the generic call_limer function is used, into which calls from the original API can be passed. The full guide to these API functions is available here.

The example below involves listing all the surveys on the Lime Survey installation (server) and then getting the number of completed responses. Note that the method inserted into the call_limer() function is the same method that is listed in the API documentation, and the params are the arguments of that respective method. So in this sense, it's actually quite straightforward.

call_limer(method = "list_surveys") #list surveys on server

call_limer(method = "get_summary", #get number of completed responses
           params = list(iSurveyID = 10001,
                         sStatname = "completed_responses"))

The third and last example showcases some of the more sophisticated automation options. We aim to copy survey 123456, set up a participant table with two extra attributes (Institution and File), add one participant, activate the survey, compose the survey link and then delete the survey.

When the initial survey is copied and participants are created, the returned details are stored in tmp and tmp2, because we want to use these outputs as inputs for later functions.

The fromJSON function (from the jsonlite package) is also used to feed arrays with multiple pieces of data into the call_limer function. There might be other ways to do this, but the example below works.

library(jsonlite) # provides fromJSON()

tmp <- call_limer(method = "copy_survey", #copy a survey
                  params = list(iSurveyID_org = 123456,
                                sNewname = 'The Copied Survey'))

call_limer(method = "activate_tokens", #setup participant table
           params = list(iSurveyID = tmp$newsid,
                         aAttributeFields = fromJSON('{"attribute_1":"Institution","attribute_2":"File"}')))

tmp2 <- call_limer(method = "add_participants", #add participant
                   params = list(iSurveyID = tmp$newsid,
                                 aParticipantsData = fromJSON('[{"email":"[email protected]","lastname":"Bond","firstname":"James","attribute_1":"Secret Service","attribute_2":"mi5","usesleft":999999}]'),
                                 bCreateToken = TRUE))

call_limer(method = "activate_survey", #activate survey
           params = list(iSurveyID = tmp$newsid))

paste0('https://LIMESURVEY.URL/index.php/', tmp$newsid, '?token=', tmp2$token, '&newtest=Y') #generate survey link

call_limer(method = "delete_survey", #delete the survey
           params = list(iSurveyID = tmp$newsid))

As is hopefully clear now, the limer package offers some powerful options for automating the setting up of surveys as well as importing the data into R.

Finally, it's good practice to close off the session with the following call.

release_session_key()

Lime Survey and R: limer

Lime Survey is (probably) the world’s most popular open source survey package and R is the world’s most popular open source statistical and data analysis software (probably). So it seems only natural that there should be a bridge between the two: where the limitations of Lime Survey begin, R can take over and vice-versa.

Thankfully the bridge has been laid by an R package called limer. limer uses the API functionality that is built into Lime Survey to make calls to a Lime Survey instance. The most useful of these is probably the ability to get survey responses. But there is a wealth of other options too, including copying and deleting surveys, getting survey statistics, etc.

The github page of limer provides instructions on how to install it and how to make some basic calls. You also need to enable the API in your Lime Survey installation, which you can do under Settings -> Global Settings -> Interfaces. Be sure to set it to "JSON-RPC"; there you will also see the URL for accessing the API (see image below).

Details on the Lime Survey API are available, although admittedly the documentation is a bit thin.

Assuming you have configured your Lime Survey instance with an SSL certificate (giving the https:// URL), the connection between R and Lime Survey is also encrypted, and therefore any personal or sensitive data exchanged between Lime Survey and R is secured.

limer coding examples are to follow in a later post.