{"id":345,"date":"2022-10-31T03:10:57","date_gmt":"2022-10-31T03:10:57","guid":{"rendered":"https:\/\/ap.pstek.nl\/pstek_wp\/?p=345"},"modified":"2022-10-31T04:21:45","modified_gmt":"2022-10-31T04:21:45","slug":"corpus-loading-and-text-cleaning-with-quanteda-in-r","status":"publish","type":"post","link":"https:\/\/ap.pstek.nl\/pstek_wp\/2022\/corpus-loading-and-text-cleaning-with-quanteda-in-r\/","title":{"rendered":"Corpus Loading and Text Cleaning with Quanteda in R"},"content":{"rendered":"\n
Assuming you have a file with text data (perhaps a spreadsheet that you have exported as a CSV file, or data scraped from Google News<\/a>), you can now start to build and clean your corpus. Fortunately, functions in the Quanteda<\/a> package make this very easy.<\/p>\n\n\n\n First, we load quanteda and the corpus (mycorpus.csv<\/strong>, with a text column called all_para<\/strong>). To make sure quanteda’s corpus<\/strong> function can read it, we convert the text column to character strings using as.character<\/strong>.<\/p>\n\n\n\n Second, we use quanteda’s built-in functions to convert the corpus into tokens<\/strong> (for simplicity’s sake, think of tokens as individual words). We then carry out several “cleaning” steps, namely removing punctuation and numbers, removing English stopwords, stemming the words and converting them to lowercase.<\/p>\n\n\n\n In the final step, the tokens are converted into a “Document Feature Matrix” (DFM<\/strong>). For the purposes of this tutorial, think of it as a table that counts how often each cleaned word occurs in each document.<\/p>\n\n\n\n With your bright, shiny and clean DFM loaded, it’s time to do some text analysis!<\/p>\n\n\n\nlibrary(quanteda)\nmycorpus <- read.csv('mycorpus.csv')\nmycorpus$all_para <- as.character(mycorpus$all_para)\nmycorpus <- corpus(mycorpus$all_para)<\/code><\/pre>\n\n\n\n
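If you also want to keep the other columns of your CSV file, one possible alternative is to hand the whole data frame to corpus and tell it which column holds the text. The sketch below is only an illustration: toy_data and its source column are made up to stand in for the contents of mycorpus.csv, and only the all_para column name comes from the example in this post.

```r
library(quanteda)

# Hypothetical stand-in for the contents of mycorpus.csv
toy_data <- data.frame(
  all_para = c('The cat sat on the mat.', 'Dogs were barking at 3 cats!'),
  source = c('blog', 'news'),
  stringsAsFactors = FALSE
)

# text_field names the column that holds the text;
# the remaining columns are kept as document variables
toy_corpus <- corpus(toy_data, text_field = 'all_para')

ndoc(toy_corpus)     # number of documents in the corpus
docvars(toy_corpus)  # the extra columns, one row per document
```

Keeping the extra columns as document variables can be handy later, for example if you want to compare texts by source.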
\n
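# Tokenize, dropping punctuation and numbers\nmytokens <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE)\n# Remove English stopwords (stopwords-iso list)\nmytokens <- tokens_select(mytokens, stopwords('en', source='stopwords-iso'), selection='remove')\n# Optional: also remove a custom word list (samesame is not defined in this post)\n#mytokens <- tokens_select(mytokens, samesame, selection='remove')\n# Reduce words to their stems, then convert them to lowercase\nmytokens <- tokens_wordstem(mytokens)\nmytokens <- tokens_tolower(mytokens)\n\n# Build the document-feature matrix\nmydfm <- dfm(mytokens)<\/code><\/pre>\n\n\n\n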
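To see what the cleaning steps actually do, it can help to run them on a couple of short sentences and look at the result. This is only a sketch under a few assumptions: the two sentences and the toy_ names are invented for illustration, and it uses quanteda's default English stopword list rather than the stopwords-iso list used in the code of this post.

```r
library(quanteda)

# Two made-up documents, just for illustration
toy_corpus <- corpus(c('The cats were sitting on 2 mats.',
                       'A dog is barking at the cats!'))

# Same cleaning steps as in the post, applied to the toy corpus
toy_tokens <- tokens(toy_corpus, remove_punct = TRUE, remove_numbers = TRUE)
toy_tokens <- tokens_select(toy_tokens, stopwords('en'), selection = 'remove')
toy_tokens <- tokens_wordstem(toy_tokens)
toy_tokens <- tokens_tolower(toy_tokens)

# One row per document, one column per cleaned word
toy_dfm <- dfm(toy_tokens)

print(toy_tokens)        # the cleaned tokens for each document
featnames(toy_dfm)       # the words that survived cleaning
topfeatures(toy_dfm, 5)  # the most frequent words across both documents
```

Looking at featnames is a quick way to spot leftovers (such as domain-specific filler words) that you may want to remove with an extra tokens_select step.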