{"id":189,"date":"2021-06-06T09:18:01","date_gmt":"2021-06-06T09:18:01","guid":{"rendered":"https:\/\/ap.pstek.nl\/pstek_wp\/blog\/?p=189"},"modified":"2022-06-21T02:57:55","modified_gmt":"2022-06-21T02:57:55","slug":"doing-text-analysis-in-r-work-in-progress","status":"publish","type":"post","link":"https:\/\/ap.pstek.nl\/pstek_wp\/2021\/doing-text-analysis-in-r-work-in-progress\/","title":{"rendered":"(Korean) Text Analysis in R and Pajek (incomplete)"},"content":{"rendered":"\n
<p>R and its almost endless library of packages and plug-ins (CRAN) means that you can do almost <em>anything<\/em> in R, including text analysis and network analysis. But while you <em>could<\/em> do everything in R, that doesn’t mean you <em>should<\/em>. Specialized network analysis software can be very useful when interpreting, analyzing or visualizing a network, as opposed to trying to automate everything with an R script. And you don’t have to be monogamous: you can love R and love other software too.<\/p>\n\n\n\n
<p>The following tutorial explains how R can be used for text analysis (including creating word clouds) and how your network can then be exported, so you can analyze it in Pajek.<\/p>\n\n\n\n
<p>You can install the required R packages (quanteda and ggplot2) by typing:<\/p>\n\n\n\n
<pre>install.packages('quanteda')\ninstall.packages('ggplot2')<\/pre>\n\n\n\n
<p>Even if you installed the packages earlier, typing the install command again will simply re-check and, if needed, update the packages.<\/p>\n\n\n\n
<h2>Loading Text Corpus<\/h2>\n\n\n\n
<p>First you need to load the text you want to analyze into R. In this particular example the text consists of comments scraped from a website and stored in a TXT file (which you can open in Windows with Notepad). Every line is a new comment. The filename is ‘<strong>comments.txt<\/strong>’.<\/p>\n\n\n\n
<p>In the first code chunk below, we start by loading the quanteda package using the <strong>library<\/strong> command.<\/p>\n\n\n\n
<p>Then we import the text from comments.txt into a data frame using the <strong>read.csv<\/strong> command. This command is used to load CSV files (“comma-separated values”, a kind of spreadsheet). Because commas are the default separator in a CSV file, and our comments might themselves contain commas, we need to choose a different separator so the text doesn’t get split incorrectly: basically anything we are sure won’t appear in the text. In this example we set <strong>sep = “|”<\/strong>.<\/p>\n\n\n\n
<p>For easier processing we name the column containing the text ‘text’ using the <strong>names<\/strong> command.<\/p>\n\n\n\n
<pre>library(quanteda)\n\n# header = FALSE: the file has no header row, so the first comment\n# should not be mistaken for column names\ntextfile <- read.csv('comments.txt', sep = \"|\", header = FALSE)\nnames(textfile) <- c('text')<\/pre>\n\n\n\n
<p>Next we add the <strong>comment_<\/strong> label to each piece of text (labels are a feature of quanteda, which can also do more complex text analysis than what is shown in this example). Then we view a summary of the text corpus. You can also click on <strong>text.corpus<\/strong> in RStudio to see what’s inside.<\/p>\n\n\n\n
<pre>textfile$label <- paste0('comment_', row.names(textfile))\ntext.corpus <- corpus(textfile)\nsummary(text.corpus)<\/pre>\n\n\n\n
<p>Tadaaa! We have imported our text corpus. Now it’s time to process!<\/p>\n\n\n\n
<h2>Text Processing<\/h2>\n\n\n\n
<p>The next bit of code lets you process the text, basically cleaning it up.<\/p>\n\n\n\n
<p>We begin by removing <strong>punctuation<\/strong> and <strong>numbers<\/strong>, because they are not important in this particular situation.<\/p>\n\n\n\n
<pre>text.tokens <- tokens(text.corpus, remove_punct = TRUE, remove_numbers = TRUE)<\/pre>\n\n\n\n
<p>Then we <strong>remove stop words<\/strong> like “the”, “a”, etc., which we do not want to analyze. quanteda is awesome in that it has lists of commonly used stop words for multiple languages (see the stopwords documentation). In this case we use Korean stop words (language <strong>ko<\/strong>) from the <strong>marimo<\/strong> source.<\/p>\n\n\n\n
<pre>text.tokens <- tokens_select(text.tokens, stopwords('ko', source = 'marimo'), selection = 'remove')<\/pre>\n\n\n\n
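<p>With the cleaned-up tokens we can already make the word cloud promised in the introduction: count word frequencies in a document-feature matrix, then plot them. A minimal sketch, assuming the <strong>quanteda.textplots<\/strong> package is installed (in recent quanteda versions the plotting functions live there); the limit of 100 words is just an illustrative choice:<\/p>\n\n\n\n
<pre># install.packages('quanteda.textplots')  # if not yet installed\nlibrary(quanteda.textplots)\n\n# Count word frequencies in a document-feature matrix\ntext.dfm <- dfm(text.tokens)\n\n# Draw a word cloud of the 100 most frequent words\ntextplot_wordcloud(text.dfm, max_words = 100)<\/pre>\n\n\n\n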
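<p>To get from text to a network you can open in Pajek, one option is a word co-occurrence network: quanteda’s <strong>fcm<\/strong> command counts how often words appear together in the same comment, and the igraph package can write the result as a Pajek .net file. A sketch, assuming the <strong>igraph<\/strong> package is installed; the filename comments.net and the choice of co-occurrence within whole comments are illustrative:<\/p>\n\n\n\n
<pre>library(igraph)\n\n# Word co-occurrence counts within each comment\n# (fcm stores the counts in the upper triangle of the matrix)\ntext.fcm <- fcm(text.tokens, context = 'document')\n\n# Build an undirected, weighted graph from the upper triangle\ng <- graph_from_adjacency_matrix(as.matrix(text.fcm), mode = 'upper', weighted = TRUE)\n\n# igraph’s Pajek writer uses the 'id' vertex attribute for vertex labels\nV(g)$id <- V(g)$name\nwrite_graph(g, 'comments.net', format = 'pajek')<\/pre>\n\n\n\n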