{"id":345,"date":"2022-10-31T03:10:57","date_gmt":"2022-10-31T03:10:57","guid":{"rendered":"https:\/\/ap.pstek.nl\/pstek_wp\/?p=345"},"modified":"2022-10-31T04:21:45","modified_gmt":"2022-10-31T04:21:45","slug":"corpus-loading-and-text-cleaning-with-quanteda-in-r","status":"publish","type":"post","link":"https:\/\/ap.pstek.nl\/pstek_wp\/2022\/corpus-loading-and-text-cleaning-with-quanteda-in-r\/","title":{"rendered":"Corpus Loading and Text Cleaning with Quanteda in R"},"content":{"rendered":"\n

Assuming you have a file with text data (perhaps a spreadsheet that you have exported as a CSV file, or data scraped from Google News), you can now start to build and clean your corpus. Fortunately, this is made very easy by functions in the Quanteda package.

First, we load quanteda and the corpus (**mycorpus.csv**, with a text column called **all_para**), and we make sure it is readable by quanteda's **corpus** function by transforming the loaded data into characters using **as.character**.

```r
# Load quanteda and read in the CSV file
library(quanteda)
mycorpus <- read.csv('mycorpus.csv')

# Ensure the text column is character, then build the corpus from it
mycorpus$all_para <- as.character(mycorpus$all_para)
mycorpus <- corpus(mycorpus$all_para)
```
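If you want to confirm that the corpus loaded as expected, quanteda's built-in inspection functions give a quick overview. A minimal sketch, assuming the **mycorpus** object built above:

```r
# Quick sanity checks on the newly built corpus
ndoc(mycorpus)            # number of documents in the corpus
summary(mycorpus, n = 5)  # types, tokens and sentences for the first 5 documents
```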

Second, we use quanteda's built-in functions to convert the corpus into **tokens** (for simplicity's sake, think of tokens as individual words). We then carry out several "cleaning" steps, namely:

1. Remove punctuation (**remove_punct = TRUE**) and remove numbers (**remove_numbers = TRUE**).
2. Remove stop words such as "the", "a", "them", "he", "she", etc. There are several repositories of stopwords for different languages, but do test them carefully, as not all are equally good (e.g. there are known issues with Korean). In this case we use English stopwords (**en**) from the **stopwords-iso** source, which seems pretty good.
3. Next, depending on the topic you are researching, it can be useful to remove some additional words: for instance, the words used in your search terms, which will likely be very frequent and can "overshadow" other words in your analysis, because almost everything is related to them. Should this be the case, you can create a vector of such words, called **samesame** in the example (see the sketch after this list). In your first round of analysis you may want to leave this step out, hence the #.
4. Shorten words (tokens) to their **wordstem**, e.g. reducing plurals to the singular. While this is a useful function, it sometimes truncates words in an odd way.
5. It is usually a good idea to also convert all words to lowercase, so that "Happy" and "happy" are not counted as two different words.
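For step 3, the **samesame** vector is simply a character vector of the words you want to drop. A minimal sketch (the terms shown are placeholders, not from the original analysis), which also peeks at the stopword list from step 2:

```r
# Hypothetical example: words from your search query that would otherwise
# dominate the analysis ("climate" and "change" are placeholder terms)
samesame <- c("climate", "change")

# Optional: inspect the first few entries of the stopword list before using it
head(stopwords('en', source = 'stopwords-iso'), 10)
```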

In the final step, the tokens are converted into a "Document Feature Matrix" (**DFM**). For the purposes of this tutorial, just imagine that it's your box of cleaned words.

```r
# Tokenize, dropping punctuation and numbers (step 1)
mytokens <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE)

# Remove English stopwords from the stopwords-iso source (step 2)
mytokens <- tokens_select(mytokens, stopwords('en', source = 'stopwords-iso'), selection = 'remove')

# Optionally remove your own vector of words, e.g. search terms (step 3)
#mytokens <- tokens_select(mytokens, samesame, selection = 'remove')

# Stem the tokens (step 4) and convert them to lowercase (step 5)
mytokens <- tokens_wordstem(mytokens)
mytokens <- tokens_tolower(mytokens)

# Convert the cleaned tokens into a Document Feature Matrix
mydfm <- dfm(mytokens)
```
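A quick way to check the result before moving on, using quanteda's built-in inspection functions (a minimal sketch, assuming the **mydfm** object above):

```r
# Inspect the cleaned DFM: size and most frequent features
nfeat(mydfm)            # number of unique features (cleaned words)
topfeatures(mydfm, 20)  # the 20 most frequent features
```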

With your bright, shiny and clean DFM loaded, it's time to do some text analysis!

Next: Basic Text Analysis and Visualization in R
