{"id":189,"date":"2021-06-06T09:18:01","date_gmt":"2021-06-06T09:18:01","guid":{"rendered":"https:\/\/ap.pstek.nl\/pstek_wp\/blog\/?p=189"},"modified":"2022-06-21T02:57:55","modified_gmt":"2022-06-21T02:57:55","slug":"doing-text-analysis-in-r-work-in-progress","status":"publish","type":"post","link":"https:\/\/ap.pstek.nl\/pstek_wp\/2021\/doing-text-analysis-in-r-work-in-progress\/","title":{"rendered":"(Korean) Text Analysis in R and Pajek (incomplete)"},"content":{"rendered":"\n
<p>R and its almost endless library of packages and plug-ins (CRAN) means that you can do almost <em>anything<\/em> in R, including text analysis and network analysis. But while you <em>could<\/em> do everything in R, that doesn’t mean you <em>should<\/em>. Specialized network analysis software can be very useful when interpreting, analyzing or visualizing a network, as opposed to trying to automate everything with an R script. And you don’t have to be monogamous: you can love R and love other software too.<\/p>\n\n\n\n
<p>The following tutorial explains how R can be used for text analysis (including creating word clouds) and how your network can then be exported, so you can analyze it in Pajek.<\/p>\n\n\n\n
<p>You can install the required R packages (quanteda and ggplot2) by typing:<\/p>\n\n\n\n
<pre>install.packages('quanteda')\ninstall.packages('ggplot2')<\/pre>\n\n\n\n
<p>Even if you installed the packages earlier, typing the install command again will simply re-check and, if needed, update the packages.<\/p>\n\n\n\n
<h2>Loading Text Corpus<\/h2>\n\n\n\n
<p>First you need to load the text you want to analyze into R. In this particular example the text consists of comments scraped from a website and stored in a TXT file (which you can open in Windows with Notepad). Every line is a new comment. The filename is ‘<strong>comments.txt<\/strong>’.<\/p>\n\n\n\n
<p>In the first code chunk below, we start by loading the quanteda package using the <strong>library<\/strong> command.<\/p>\n\n\n\n
<p>Then we import the text from comments.txt into a data frame using the <strong>read.csv<\/strong> command. This command is used to load CSV files (“comma-separated values”, a kind of spreadsheet). Because commas are the default separator in a CSV file, and our comments might themselves contain commas, we need to choose a different separator so the text doesn’t get split incorrectly: basically anything we are sure won’t appear in the text. In this example we set <strong>sep = “|”<\/strong>.<\/p>\n\n\n\n
<p>For easier processing we name the column containing the text ‘text’ using the <strong>names<\/strong> command.<\/p>\n\n\n\n
<pre>library(quanteda)\n\n# header = FALSE: the file has no header row, so the first comment\n# should not be mistaken for column names\ntextfile <- read.csv('comments.txt', sep = \"|\", header = FALSE)\nnames(textfile) <- c('text')<\/pre>\n\n\n\n
<p>Next we add the <strong>comment_<\/strong> label to each piece of text (labels are a feature of quanteda, which can also do more complex text analysis than what is shown in this example). Then we view a summary of the text corpus. You can also click on <strong>text.corpus<\/strong> in RStudio to see what’s inside.<\/p>\n\n\n\n
<pre>textfile$label <- paste0('comment_', row.names(textfile))\ntext.corpus <- corpus(textfile)\nsummary(text.corpus)<\/pre>\n\n\n\n
<p>Tadaaa! We have imported our text corpus. Now it’s time to process!<\/p>\n\n\n\n
<h2>Text Processing<\/h2>\n\n\n\n
<p>The next bit of code lets you process the text, basically cleaning it up.<\/p>\n\n\n\n
<p>We begin by removing <strong>punctuation<\/strong> and <strong>numbers<\/strong>, because they are not important in this particular situation.<\/p>\n\n\n\n
<pre>text.tokens <- tokens(text.corpus, remove_punct = TRUE, remove_numbers = TRUE)<\/pre>\n\n\n\n
<p>Then we <strong>remove stop words<\/strong> like “the”, “a”, etc., which we do not want to analyze. quanteda is awesome in that it has lists of commonly used stop words for multiple languages (see the stopwords documentation). In this case we use Korean stop words (language <strong>ko<\/strong>) from the <strong>marimo<\/strong> source.<\/p>\n\n\n\n
<pre>text.tokens <- tokens_select(text.tokens, stopwords('ko', source = 'marimo'), selection = 'remove')<\/pre>\n\n\n\n
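<p>With the cleaned-up tokens we can already make the word cloud promised in the introduction: count word frequencies in a document-feature matrix, then plot them. A minimal sketch, assuming the <strong>quanteda.textplots<\/strong> package is installed (in recent quanteda versions the plotting functions live there); the limit of 100 words is just an illustrative choice:<\/p>\n\n\n\n
<pre># install.packages('quanteda.textplots')  # if not yet installed\nlibrary(quanteda.textplots)\n\n# Count word frequencies in a document-feature matrix\ntext.dfm <- dfm(text.tokens)\n\n# Draw a word cloud of the 100 most frequent words\ntextplot_wordcloud(text.dfm, max_words = 100)<\/pre>\n\n\n\n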
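<p>To get from text to a network you can open in Pajek, one option is a word co-occurrence network: quanteda’s <strong>fcm<\/strong> command counts how often words appear together in the same comment, and the igraph package can write the result as a Pajek .net file. A sketch, assuming the <strong>igraph<\/strong> package is installed; the filename comments.net and the choice of co-occurrence within whole comments are illustrative:<\/p>\n\n\n\n
<pre>library(igraph)\n\n# Word co-occurrence counts within each comment\n# (fcm stores the counts in the upper triangle of the matrix)\ntext.fcm <- fcm(text.tokens, context = 'document')\n\n# Build an undirected, weighted graph from the upper triangle\ng <- graph_from_adjacency_matrix(as.matrix(text.fcm), mode = 'upper', weighted = TRUE)\n\n# igraph’s Pajek writer uses the 'id' vertex attribute for vertex labels\nV(g)$id <- V(g)$name\nwrite_graph(g, 'comments.net', format = 'pajek')<\/pre>\n\n\n\n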