(Korean) Text Analysis in R and Pajek [INCOMPLETE]

R and its almost endless library of packages and plug-ins (CRAN) mean that you can do almost anything in R, including text analysis and network analysis. While you could do everything in R, that doesn’t mean you should. Specialized network analysis software can also be very useful when interpreting, analyzing or visualizing a network, as opposed to trying to automate everything with an R script. You don’t have to be monogamous: you can love R and you can love other software too.

The following tutorial explains how R can be used for text analysis (including creating word clouds) and then how your network can be exported, so you can analyze it in Pajek.

  • Loading the corpus (text), processing it and doing basic analysis (word counts) is done using the quanteda package (detailed guide here)
  • Making a word cloud is done using ggplot2 (detailed guide here)
  • We then show you how to export the network to Pajek, a popular and freely available network analysis and visualization program (official page here)

You can install the aforementioned R packages by typing:

install.packages('quanteda')
install.packages('ggplot2')

Even if you installed the packages earlier, running the install command again does no harm: it simply downloads and reinstalls the latest version available on CRAN.

Loading Text Corpus

First you need to load the text you want to analyze into R. In this particular example the text consists of comments scraped from a website and stored in a TXT file (which you can open in Windows with Notepad). Every line is a new comment. The filename is ‘comments.txt’.

In the first code chunk below, we start by loading the package quanteda using the library command.

Then we import the text from comments.txt into a data frame using the read.csv command. This command is meant for loading CSV files (“comma separated values”, a kind of spreadsheet). Because commas are the default separator in a CSV file, and our comments might contain commas, we need to pick a different separator: basically any character that we are sure won’t appear in the text. In this example we set sep = "|". We also set header = FALSE, because otherwise read.csv treats the first comment as a column heading, and quote = "" so that quotation marks inside comments are read literally.

For easier processing we name the column containing the text ‘text’ using the names command.

library(quanteda)
# one comment per line; "|" is assumed never to appear in the text
textfile <- read.csv('comments.txt', sep = "|", header = FALSE, quote = "", stringsAsFactors = FALSE)
names(textfile) <- c('text')
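If you cannot be sure that any one separator character never shows up in the comments, base R's readLines is a possible alternative, because it reads the file line by line without looking for separators at all. The sketch below is only an illustration: sample.file and its contents are made up to stand in for comments.txt.

```r
# Hypothetical sample file standing in for comments.txt
sample.file <- tempfile(fileext = ".txt")
writeLines(c("Great movie, loved it!",
             "Too long | but fun",
             "",
             "Would watch again"), sample.file)

# Read one comment per line; no separator is involved
comments <- readLines(sample.file, encoding = "UTF-8")

# Drop lines that are empty or contain only whitespace
comments <- comments[nzchar(trimws(comments))]

# Same one-column layout as the read.csv approach
textfile <- data.frame(text = comments, stringsAsFactors = FALSE)
nrow(textfile)
```

The resulting data frame has the same single ‘text’ column, so the rest of the tutorial should work unchanged.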

Next we add a comment_ label to each piece of text, so that every comment gets an identifier such as comment_1 or comment_2 (labelling documents is a feature of quanteda, which can also do more complex text analysis than what is shown in this example). Then we view a summary of the text corpus. You can also click on text.corpus in RStudio to see what’s inside.

textfile$label <- paste0('comment_', row.names(textfile))
text.corpus <- corpus(textfile)
summary(text.corpus)

Tadaaa! We have imported our text corpus. Now it’s time to process!

Text Processing

The next bit of code lets you process the text, basically cleaning it up.

We begin by removing punctuation and numbers, because they are not important in this particular situation.

text.tokens <- tokens(text.corpus, remove_punct = TRUE, remove_numbers = TRUE)

Then we remove stop words like “the”, “a”, etc. which we do not want to analyze. quanteda is awesome in that it has libraries of commonly used stop words for multiple languages. See here. In this case we use Korean stopwords (language ko) from the marimo repository.

text.tokens <- tokens_select(text.tokens, stopwords('ko', source='marimo'), selection='remove')

The list of stopwords should be critically assessed. The scraped website comments that we used to try this out were filled with slang, for example, so many additional stopwords had to be added. An example of the kind of Korean-language stopwords that might need to be added can be found in this example of Korean text analysis with quanteda on Github. You can also identify stopwords from the word frequency analysis (see below).
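Adding your own stopwords could look roughly like the sketch below. It uses quanteda's tokens_remove function on a toy example. The slang terms and the two comments are hypothetical English placeholders, not taken from the actual data.

```r
library(quanteda)

# Hypothetical slang terms, e.g. spotted in the word frequency list
custom.stopwords <- c("lol", "omg")

# Toy comments standing in for the real corpus
toy.tokens <- tokens(c(comment_1 = "lol this movie was great",
                       comment_2 = "omg the ending lol"))

# Remove the custom stopwords from every document
toy.tokens <- tokens_remove(toy.tokens, custom.stopwords)
as.list(toy.tokens)
```

On the real data the same call would be applied to text.tokens, with custom.stopwords filled in by hand after inspecting the comments.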

You can also do some other processing, such as reducing words to the word stem or harmonizing all words to lower case. These aren’t relevant when processing a Korean language text, but they may be relevant for other languages, such as English.

text.tokens <- tokens_wordstem(text.tokens)
text.tokens <- tokens_tolower(text.tokens)

When you’re done, you can compile all the beautifully clean processed text into a document feature matrix.

text.dfm.final <- dfm(text.tokens)

Word Frequency and Word Cloud

Finally, we can start to do the fun stuff: text analysis! As a first step it’s worthwhile to look at the word frequency analysis to see if there are any frequently used words “polluting” your analysis. For example, in an analysis about a movie, you may want to remove the title of the movie. The code for producing a word frequency data frame named wfreq is below, with the 100 most frequently occurring words:

wfreq <- as.data.frame(topfeatures(text.dfm.final, 100))  # base R call, so no pipe package is needed
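Because topfeatures returns a named numeric vector (names are the words, values are the counts), the data frame above ends up with the words as row names. If you prefer a tidier table with explicit columns, something like the following sketch may help. The counts in top.counts are invented for illustration and stand in for the real topfeatures output.

```r
# Hypothetical counts standing in for topfeatures(text.dfm.final, 100)
top.counts <- c(movie = 120, actor = 45, story = 30)

# Put the words and their counts in two explicit columns
wfreq <- data.frame(word = names(top.counts),
                    count = as.numeric(top.counts),
                    stringsAsFactors = FALSE)

# The most frequent words are the first candidates for extra stopwords
head(wfreq, 10)
```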

The word frequency data can also be converted into a word cloud, whereby more frequently occurring words appear larger and in the center of the cloud.

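set.seed(132)  # fix the random layout so the cloud looks the same every time
textplot_wordcloud(text.dfm.final, max_words = 100)  # in quanteda 3.0 and later, load the quanteda.textplots package first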

Depending on the text used, this enables you to generate word clouds that will look something like this…

Export to Pajek

(to be added)

co.matrix <- fcm(text.tokens, context = 'document', tri = FALSE)  # generate co-word matrix (within same review)
feat <- names(topfeatures(co.matrix, 30))  # select top-30 words
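The export step itself is not yet written up here. One possible way to do it, offered only as a sketch, is to write the co-occurrence counts to a Pajek .net file by hand: a *Vertices section that numbers the words, followed by an *Edges section with one line per pair of words and their weight. The helper write_pajek below is hypothetical (it is not part of quanteda or Pajek), and it assumes the co-occurrence matrix has already been turned into an ordinary symmetric R matrix with the words as row and column names.

```r
# Hypothetical helper: write a symmetric co-occurrence matrix as a Pajek .net file
write_pajek <- function(m, file) {
  stopifnot(is.matrix(m), nrow(m) == ncol(m), !is.null(rownames(m)))
  labels <- rownames(m)
  n <- length(labels)

  # One numbered, quoted label per vertex
  vertex.lines <- sprintf('%d "%s"', seq_len(n), labels)

  # Keep each undirected pair once (upper triangle) and skip zero counts;
  # the diagonal (a word paired with itself) is ignored
  idx <- which(upper.tri(m) & m > 0, arr.ind = TRUE)
  edge.lines <- sprintf("%d %d %s", idx[, "row"], idx[, "col"],
                        format(m[idx], trim = TRUE))

  writeLines(c(sprintf("*Vertices %d", n), vertex.lines,
               "*Edges", edge.lines), file)
  invisible(file)
}

# Toy matrix standing in for the real co-occurrence data
words <- c("movie", "actor", "story")
toy <- matrix(c(0, 3, 1,
                3, 0, 0,
                1, 0, 0),
              nrow = 3, byrow = TRUE, dimnames = list(words, words))

out <- write_pajek(toy, tempfile(fileext = ".net"))
pajek.lines <- readLines(out)
pajek.lines
```

On the real data, something like write_pajek(as.matrix(fcm_select(co.matrix, feat)), 'comments.net') might do the job, but treat that as an untested assumption. Labels containing double quotes would need escaping, and Korean labels may need the file to be saved in an encoding that Pajek accepts.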

Dutch Pillarisation, Malaysian Rojak

Pillarisation (or verzuiling in Dutch) is the state of a society that is divided into groups that self-segregate. Until the 1960s and 1970s, the Netherlands was a country whose population was divided along sectarian lines. There was a Catholic pillar, a Protestant pillar, a Socialist pillar and a Liberal pillar. These groups had their own schools, broadcasters, newspapers, political parties, labor unions, employer federations, universities, hospitals, shops and sports clubs. Marriage and friendships between families from different pillars were either discouraged or simply not allowed. The Catholic school kids would always fight with the Protestant school kids. A good Catholic would only buy from Catholic shops. The priest or minister would make house visits to ensure everything was being done “correctly”. The Netherlands was a segregated society with extensive social control within the respective pillars.

To a Malaysian this system may seem strangely familiar, as Peninsular Malaysia has its own racial-linguistic pillars: Malay, Tamil, Chinese and English. Each has its own media outlets, political parties, educational institutions, neighborhoods, popular shopping malls, cuisine, places of worship, social clubs, chambers of commerce, etc. And while Malaysians of different races do mix regularly, especially in the workplace, the number of Malaysians who marry or maintain deep friendships across racial lines is relatively limited. Within many groups there is still a strong sense of social control, and opinion polls show that a large segment of Malaysian society is still very conservative regarding social issues.

The big difference between Malaysia and the Netherlands is racial and linguistic: the Netherlands during pillarisation was, for all intents and purposes, a mono-lingual and mono-ethnic country. Malaysia, of course, is multi-racial and multi-lingual. Malaysians often refer to their society as ‘Rojak’, a salad of fruit, vegetables and sometimes egg and tofu, all mixed together and covered in a sauce. The point is that each of the items in the salad retains its individual characteristics; they do not melt or assimilate into one uniform Malaysian soup or porridge. There is only a Malaysian sauce that unites them.

In the Netherlands more progressively minded individuals from within each pillar tried to break down barriers between them. The Netherlands became a much less religious and more individualistic society during the later part of the 20th century, and this weakened the social control from within the pillars. This loosening eventually led to various mergers between labor unions, political parties, broadcasters, etc. Schools began accepting students from diverse backgrounds, and inter-religious marriages and friendships lost their stigma. Today the process of de-pillarisation in the Netherlands that took place in the 1960s and 1970s is primarily seen as a social process which brought about institutional and political change.

Since the 1990s the Netherlands has been widely seen as one of the most liberal countries, having legalized prostitution, soft drugs, gay marriage and euthanasia, all abhorred by conservatives. However, this is not to say that all religion or conservative values have disappeared. In fact, the Netherlands is also home to a substantial ‘bible belt’ of mainly conservative Protestants, who have maintained their pillars. The stereotype is of large conservative families who attend church regularly, strictly observe Sunday as a day of rest and, in some cases, oppose modern technologies such as television and vaccinations. This diversity in views is represented in the Dutch parliament: there is a conservative political party that would like to deny women the right to vote, a party for animals, a party that would like to deport all Muslims and, recently, a party that fights (only) for the rights of Muslims.

So does the Dutch experience suggest that Malaysia will inevitably de-pillarise? That the ‘Rojak’ will become ‘Laksa’ or ‘Bubur’? If anything, modern Malaysia seems to have pillarised more since independence. Many Malaysians growing up during the 1950s and 1960s in Malaysia remember a more multi-ethnic society in areas such as education or the civil service. Yet this may also have been an illusion of the elite: Malaysian society at the level of the working class was perhaps always more deeply divided along racial and religious lines. Government policy since independence has largely aimed to maintain or reinforce those divisions, perhaps primarily as a tool to maintain social control, and not dissimilar to the ‘neat’ political divisions in the Netherlands after World War II.

What should be remembered is that Dutch de-pillarisation was accompanied by a phenomenal economic transformation of the country after 1945. In Malaysia, arguably a greater degree of de-pillarisation has occurred in more prosperous urban areas, such as the Klang Valley. In those areas more multi-ethnic parties tend to perform well in elections, presumably reflecting different social values of the local population. Ethnic-based parties tend to perform better in less prosperous rural areas of Malaysia.

While Malaysia will not experience de-pillarisation in the same way that the Netherlands has, the comparison with the Netherlands suggests two things that might be relevant in a Malaysian context. First, that socio-economic changes are the main drivers of the cultural and political changes that brought about de-pillarisation. Second, that the pillars (the institutions, social bonds and ways of life) will survive, although they will lose influence.