{"id":351,"date":"2022-10-31T04:21:27","date_gmt":"2022-10-31T04:21:27","guid":{"rendered":"https:\/\/ap.pstek.nl\/pstek_wp\/?p=351"},"modified":"2022-10-31T04:21:27","modified_gmt":"2022-10-31T04:21:27","slug":"basic-text-analysis-and-visualization-in-r","status":"publish","type":"post","link":"https:\/\/ap.pstek.nl\/pstek_wp\/2022\/basic-text-analysis-and-visualization-in-r\/","title":{"rendered":"Basic Text Analysis and Visualization in R"},"content":{"rendered":"\n
At its most basic level, text analysis is about counting words. If words are frequently used, we assume that they are important. If words occur together, we assume that they are related. Obviously, that is not always the case, but a discerning researcher like yourself will be able to filter this information, provide context and meaning, and draw the appropriate conclusions.<\/p>\n\n\n\n
Text analysis can give you more solid evidence of the importance of certain words or concepts and of the relationships between them, and it can sometimes reveal hidden patterns that you as a human observer may have missed.<\/p>\n\n\n\n
In this short tutorial, using the quanteda<\/strong>, ggplot2<\/strong> and quanteda.textplot<\/strong> packages, the following analysis methods are covered:<\/p>\n\n\n\n
Before starting, it is assumed that you have created and cleaned a Document Feature Matrix (DFM) called mydfm<\/strong> (see earlier tutorial<\/a>).<\/p>\n\n\n\n
Word Frequency Analysis<\/h3>\n\n\n\n
Word frequency analysis is about counting words. A useful starting point is to look at, say, the 100 most frequently used words, which topfeatures<\/strong> extracts as follows:<\/p>\n\n\n\n
library(quanteda)\n# Top 100 features as a data frame; the words become the row names\nwfreq <- data.frame(n = topfeatures(mydfm, 100))<\/code><\/pre>\n\n\n\n
Aside from inspecting the words in the data frame, you can also visualize them.<\/p>\n\n\n\n
Below is an example of a bar chart showing the 10 most frequently occurring words (based on a news corpus about sustainable investment in the Philippines). Note that we reshape the wfreq data frame into one column of words and one column of counts so that ggplot<\/strong> can read it properly.<\/p>\n\n\n\n\n
library(ggplot2)\ntop10 <- topfeatures(mydfm, 10)\n# One column for the words and one for their counts\nwfreq <- data.frame(word = names(top10), n = as.numeric(top10))\n# reorder(word, -n) sorts the bars from most to least frequent\nggplot(wfreq, aes(x = reorder(word, -n), y = n)) + geom_bar(stat = \"identity\") + xlab('')<\/code><\/pre>\n\n\n\n
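If you do not have mydfm at hand yet, the word-frequency step can be tried on a toy example first. The sketch below is only an illustration: the three sentences and the names txt, toks, toy_dfm, top and toy_freq are invented here and are not part of the tutorial's corpus. It assumes quanteda version 3 or later, where a DFM is built from a tokens object.

```r
library(quanteda)

# Toy documents, invented purely for illustration
txt <- c(d1 = 'Green bonds fund solar projects.',
         d2 = 'Solar projects attract green investment.',
         d3 = 'Investment in solar energy keeps growing.')

# Tokenise, drop punctuation and English stopwords, then build the DFM
toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords('en'))
toy_dfm <- dfm(toks)  # dfm() lowercases the tokens by default

# topfeatures() returns a named numeric vector sorted by frequency
top <- topfeatures(toy_dfm, 3)

# Convert it to a plain two-column data frame of words and counts
toy_freq <- data.frame(word = names(top), n = as.numeric(top))
print(toy_freq)
```

Because topfeatures() returns a named vector, taking names() and the values separately gives predictable column names (word and n), which is the shape ggplot expects in the bar chart above.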