-
Coconuts Falling from the Sky: Believing in Hydropower RECs with Zero Emissions
Renewable Energy Certificate (REC) trading is due to commence on the Bursa Carbon Exchange (BCX) on 25 June this year. While Bursa Malaysia claims that the hydropower RECs being traded have zero carbon emissions “on paper”, the reality is somewhat different.
Businesses in Malaysia can use RECs to offset their “Scope 2” emissions: indirect greenhouse gas (GHG) emissions produced when electricity is used. This measure is especially relevant for businesses in Peninsular Malaysia, where electricity generation capacity is primarily from coal power plants. By purchasing RECs, their Scope 2 emissions can effectively be “reduced” from a relatively high 0.8 kg CO2e/kWh to zero.
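To make the offsetting arithmetic concrete, here is a minimal sketch in R for a hypothetical factory, using the approximate grid emission factor mentioned above (all figures are illustrative assumptions, not actual company data):

```r
# Hypothetical factory consuming 10 GWh of grid electricity per year
consumption_kwh <- 10e6
grid_factor <- 0.8                                      # kg CO2e per kWh (approximate grid factor)
scope2_tonnes <- consumption_kwh * grid_factor / 1000   # reported Scope 2: 8,000 tonnes CO2e
recs_needed <- consumption_kwh / 1000                   # RECs are typically issued per MWh
# With RECs covering all consumption, the market-based Scope 2 figure becomes zero
scope2_after <- 0
```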
On paper, this system also works for the new hydropower RECs issued by Sarawak Energy. The hydropower RECs are certified according to the International Tracking Standard Foundation (I-TRACK) I-REC standard. At a recent stakeholder engagement webinar, BCX and Sarawak Energy noted that the hydropower electricity that backs the RECs is certified as having zero GHG emissions.
In reality, these zero-emissions claims do not hold up to scientific scrutiny. A recent study by Universiti Tenaga Nasional suggests that man-made hydropower reservoirs are a significant source of methane, a powerful greenhouse gas. The amount of greenhouse gases emitted by a reservoir depends on the volume of organic material (vegetation, peaty topsoil) that has been flooded. If the reservoir is considered part of the hydropower generation plant, then reservoir GHG emissions fall under hydropower’s so-called “Scope 1” emissions, and they should be counted as part of the electricity generation process.
Rather than having zero emissions, hydropower likely has an emissions intensity of around 0.2 kg CO2e/kWh. While this impact is still much lower than the emissions from the Peninsular Malaysia electricity grid, it is not zero, and is in fact similar to the emissions of electricity generated from natural gas. In Malaysia, emissions from hydropower reservoirs alone may account for around 3% of national emissions. Thus, ignoring the emissions from these reservoirs is like saying that coconuts don’t grow on trees, but simply fall from the sky.
While Bursa Malaysia and Sarawak Energy continue to point at I-TRACK for their zero emissions claim, businesses that buy RECs need to decide if these zero-emission claims are really credible. Especially if businesses use the RECs to offset their Scope 2 emissions for compliance reasons or to make public claims about having net-zero emissions, there are significant risks due to the “hidden” methane emissions from hydropower reservoirs. Regulators or customers could declare the Scope 2 emissions offsets invalid, while businesses could also face claims of greenwashing.
Fortunately, potential REC buyers do have alternatives: they can buy RECs based on solar and biogas, which have a more solid claim to zero emissions (and which the BCX plans to trade in future). Businesses can also choose to increase their investment in energy savings or their own renewable energy generation.
-
Malaysia’s Voluntary Carbon Market
The first auction on the Bursa Carbon Exchange (BCX), Malaysia’s new voluntary carbon market, took place on March 16 with around RM7.7 million in carbon credits sold.
Although the auction was described in the official press release as evidence of “strong interest and [a] healthy price signal by the domestic corporate sector”, it must be noted that all credits were sold at the minimum reserve price set by Bursa Malaysia, and that Bursa itself appears to have been one of the 15 buyers. Viewed from this perspective, the claim by Bursa Malaysia chief executive officer Datuk Muhamad Umar Swift that “we now have a proven market mechanism which provides price discovery” seems somewhat questionable.
The BCX has been launched at a time where there is broad agreement that Malaysia needs some kind of carbon pricing scheme to realise its international climate change commitments. However, no carbon tax or other clear driver of a Malaysian carbon market has emerged as yet. Although the government has hinted at the introduction of a carbon tax for several years, it continues to subsidise carbon emissions through its petrol subsidies.
Read the full article on The Edge, The Malaysian Insight or Eco-Business, or listen to the interview on BFM Radio.
-
Korea Should Lead on International Carbon Trade Policies
Korea has always responded quickly to emerging economic trends, and the European Union’s recent introduction of a carbon import tax (CBAM) is no exception.
The Korean government has started negotiations and announced improvements to the domestic carbon trading system (K-ETS). Korean businesses have announced plans for green technological upgrading.
Meanwhile, the United States is planning a similar tax under its Clean Competition Act.
While a “rapid response” to these measures is important, Korean policymakers should not lose sight of the bigger picture. Korea is well-positioned to shape ― and benefit from ― the emerging “green” economic order, and Korea can change its role from “rapid responder” to international policy leader.
Read the complete article at The Korea Times.
-
In EU-Malaysia Trade Relations, Urgent Need to Address Carbon Pricing
On 23 December, Deputy Prime Minister Fadillah Yusof complained about the European Union’s (EU) new regulations banning the import of commodities linked to deforestation. He called the restrictions “unfair”, based on “unsound reasoning” and said they were “offensive to Malaysia”. Fadillah’s comments are only the latest salvo by a Malaysian minister against the EU, with palm oil imports often entering the crossfire, and they will certainly not be the last. The EU is also planning to introduce a carbon import tax (CBAM), which will hit Malaysian exports of iron, steel and aluminum to the EU, and which will be expanded to other manufactured goods in the future. CBAM will likely have a far larger impact on the EU-Malaysia trade relationship than the EU’s palm oil restrictions.
“So why is the EU doing this to Malaysia?” one might ask.
To put it simply, the EU (and especially the European Parliament) is very focused on climate change action, an area in which Malaysia appears to be lagging. An illustrative example is the fact that Malaysia only recently launched its voluntary Bursa Carbon Exchange, with the first auction of carbon credits expected in 2023, a very cautious first step towards carbon pricing. By comparison, the EU has had a full-fledged compulsory Emissions Trading System (ETS) since 2008! Therefore, although the EU’s “green” trade policies can seem unfair to Malaysia, they are consistent with the EU’s internal environmental-economic policies.
While it can be argued that Malaysia is an emerging economy, and therefore deserves certain exemptions, Malaysia’s per capita income is similar to that of EU-member states such as Bulgaria, Croatia and Greece. In this sense, Malaysia can also be viewed as a developed economy. As a result, Malaysia’s slow response to climate change is increasingly becoming a point of friction in EU-Malaysia economic relations, and one that requires action and accommodation from both sides.
Understanding CBAM, the EU’s Carbon Import Tax
The policy that will likely define the EU-Malaysia trade relationship during the coming decade is the EU’s Carbon Border Adjustment Mechanism (CBAM). Under CBAM, all non-EU goods sold in the EU would be subject to a “top up” carbon tax based on the difference between the foreign carbon price and the EU’s own domestic carbon price (ETS). With a metric ton of CO2 priced at approximately US$ 100 in the EU today, and expected to rise further, and most voluntary carbon markets typically pricing around US$ 4 per ton, this could mean a significant tax on Malaysian goods entering the EU.
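The “top up” logic can be sketched in a few lines of R. The carbon prices are the approximate figures quoted above; the embedded-carbon figure for steel is an assumption added purely for illustration:

```r
# Illustrative CBAM top-up calculation for one tonne of exported steel
eu_price <- 100       # US$ per tonne CO2e, approximate EU ETS level
foreign_price <- 4    # US$ per tonne, typical voluntary-market price
embedded_co2 <- 1.9   # assumed tonnes CO2e embedded per tonne of steel (illustrative)
top_up_per_tonne <- (eu_price - foreign_price) * embedded_co2
top_up_per_tonne      # roughly US$182 of tax per tonne of steel
```

Even with rough numbers, the gap between the two carbon prices is large enough that the top-up tax becomes a material cost for exporters.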
While CBAM has been criticized, and some legal scholars have suggested that it violates WTO rules, the EU needs to impose some kind of trade barrier to avoid “carbon leakage” to other economies. So even if CBAM is modified, some kind of carbon import tax is highly likely to come into force. The United States is already planning an import tax under the Clean Competition Act (CCA) to avoid carbon leakage.
Developing a credible domestic carbon market framework with relatively high carbon prices may be an effective policy response by Malaysia in the face of these emerging “green” trade barriers. In addition to avoiding or limiting CBAM-type taxation, the policy could have wider benefits for the Malaysian economy.
Opportunities for Malaysia
Malaysia is in an interesting position, because it is a fairly large emitter of greenhouse gases (per capita emissions are similar to Germany and China, according to World Bank data), but it also has large potential to develop carbon sinks through reforestation, conservation and other nature-based solutions. McKinsey & Co estimates that Malaysian nature-based projects could produce 40 million tonnes of CO2-equivalent offsets annually, which could amount to billions of ringgit in revenue. Nature-based carbon credits would mainly be generated by restoring forests and mangroves, which are often located on state land. As a result, state governments, such as Sabah and Sarawak, and rural communities, could be major beneficiaries of higher carbon prices.
Higher carbon prices also create a strong incentive for lowering carbon emissions by encouraging energy saving and the adoption of emission-reducing technologies. In Malaysia, a carbon pricing policy is supported by diverse stakeholders, including environmental groups (WWF), think-tanks (Penang Institute) and energy companies (Shell). Aside from reducing carbon emissions, higher carbon prices also encourage innovation. As finance minister Lawrence Wong explained in February, when introducing Singapore’s carbon tax: “We aim to move Singapore into the forefront of green technologies, where new innovations are developed, trialed, scaled up, and eventually exported to the rest of the world. We will work hard to grab the first-mover advantage.” Malaysia, which is already the world’s second-largest solar panel exporter, should take note. Singapore is raising carbon prices as part of a broader environmental-economic strategy, like the EU. One way to raise domestic carbon prices is by taxing greenhouse gas emissions. This would also raise government revenue, which could then be used to subsidize green innovation and emission reduction projects.
Regulatory Innovation: Internationalizing Carbon Credit Trading
An alternative and more innovative approach to raising carbon prices would be for Malaysia to internationalize its carbon market and attract international capital to finance its energy transition. Although the regulatory mechanisms for international carbon trading are still under development, article 6 of the Paris Agreement allows such trade if the carbon offsets are transferred from one country to another (under so-called NDC accounting). Within the context of EU-Malaysia trade relations, Malaysia could aim for a mechanism to link its carbon market to the EU ETS.
In practical terms, how would such a carbon credit trading mechanism work, and what would be the benefits for Malaysia?
As an example, let us take the early retirement of coal power plants. The International Energy Agency (IEA) has noted that replacing coal power with renewable energy alternatives pays for itself, because renewable energy is now less expensive than coal power over a plant’s lifetime. Coal power is a large source of greenhouse gas emissions, and in Malaysia around 40% of electricity is generated from coal. However, closing a relatively new coal power plant also means that the plant’s owners would have to write off a large investment. This means that the Malaysian treasury would probably have to finance the phasing-out of such plants for Malaysia to meet its emission reduction targets.
However, what if the closing of a power plant could be certified as an emission reduction and sold on a foreign carbon market like the EU ETS? A rough calculation based on data from the Energy Information Administration and Our World in Data suggests that Malaysia’s coal power plants represent an emission reduction opportunity of 150 million tonnes of CO2 annually, which at current EU ETS carbon prices, equates to RM60 billion in carbon credits.
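The rough calculation can be reproduced in R. The ETS price and exchange rate below are assumptions at the time of writing, not official figures:

```r
# Back-of-the-envelope value of retiring Malaysia's coal fleet as carbon credits
coal_emissions <- 150e6   # tonnes CO2 per year from coal power (rough estimate above)
ets_price_eur <- 85       # assumed EU ETS carbon price, euro per tonne
eur_to_myr <- 4.7         # assumed exchange rate
credit_value_myr <- coal_emissions * ets_price_eur * eur_to_myr
credit_value_myr / 1e9    # roughly RM60 billion per year
```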
From a Malaysian perspective, such a strategy of selling carbon credits would mean that the emission reductions would not be recognized in Malaysia’s national greenhouse gas accounts (NDC) for several years. Carbon credits are typically issued for a fixed period, so if Malaysia were to export 10 years’ worth of carbon credits to cover a project, these credits cannot be included in Malaysia’s NDC until 10 years later, say in 2033. After this period, the emissions reductions can be re-included in Malaysia’s NDC and count towards Malaysia’s net-zero emission goals.
For this transaction to work, the EU would have to accept Malaysian carbon credits on its ETS, something that it may be reluctant to do. There are many reasons for this, including the large regulatory gap, pricing gap and the fact that Malaysia’s carbon market is still voluntary. While Malaysia’s carbon market would have to quickly develop, Malaysia should also argue that the EU’s unilateral imposition of CBAM, which imposes a tax based on the EU ETS, justifies that EU trading partners be allowed a degree of access to the EU ETS.
Linking Malaysia to the EU ETS need not be done as a “free market”, as is the case with the EU and Switzerland. The EU and Malaysia could agree on a quota of credits linked to CBAM. Malaysia could also impose a carbon credit export tax (similar to its palm oil export tax) to ensure higher prices abroad, and lower prices domestically, for Malaysian carbon credits.
Linking Malaysia’s nascent carbon market to that of major trading partners could be part of a broader climate change-centric trade and investment policy. Carbon prices in other large trading partners such as China, Japan and South Korea are much lower than in the EU, but that might also make negotiating access to their carbon markets easier.
Need for Environmental-Economic Cooperation
Whatever direction Malaysia chooses, as a highly trade-dependent economy, it needs to factor climate change into its international trade and investment policies. If climate change is not addressed, it will increasingly emerge as a trade barrier.
On the EU’s part, it needs to make CBAM acceptable to its trading partners, including Malaysia. If climate change policy is seen as a kind of “green imperialism” or “green protectionism”, then both the EU and the broader push for climate change action will be harmed. Instead, CBAM should be part of a broader “Green New Deal” on trade and investment between the EU and its international partners, which could also include access to the EU ETS, among other areas of environmental-economic cooperation.
-
R-Conference 2022
Presentation materials for “A Math-Fearing Social Scientist’s Basic R Toolkit: Scraping, Content and Network Analysis” presented at the R-Conference 2022 organized by the Malaysian R-User Group (MyRUG), 26-27 Nov 2022.
Slides (PDF)
R code for Google News example
-
Basic Text Analysis and Visualization in R
At its most basic level, text analysis is about counting words. If words are frequently used, we assume that they are important. If words occur together, we assume that they are related. Obviously, that is not always the case, but a discerning researcher like yourself will be able to filter this information, provide context and meaning, and draw the appropriate conclusions.
Text analysis often provides you with an opportunity to gather more solid evidence of the importance of certain words or concepts, and the relationship between them, and sometimes can lead to the discovery of hidden patterns, that you as a human observer may have missed.
In this short tutorial, using the quanteda, ggplot2 and quanteda.textplot packages, the following analysis methods are covered:
- Word frequency analysis and visualization in a frequency chart and word cloud.
- Co-word analysis and visualization in a co-word network.
Before starting, it is assumed that you have cleaned and created a Document Feature Matrix (DFM) called mydfm (see earlier tutorial).
Word Frequency Analysis
Word frequency analysis is about counting words, and it is useful to look at, say, the 100 most frequently used words (topfeatures), which can be extracted as follows:
library(quanteda)
wfreq <- data.frame(topfeatures(mydfm, 100))
Aside from looking at the words in the data frame, they can also be visualized.
Below an example of the 10 most frequently occurring words in a bar chart (based on a news corpus about sustainable investment in the Philippines). Note that we manipulate the wfreq data frame for it to be read properly by ggplot.
library(ggplot2)
wfreq <- as.data.frame(topfeatures(mydfm, 10))
wfreq$n <- as.numeric(wfreq$.)
wfreq$word <- row.names(wfreq)
ggplot(wfreq, aes(x = reorder(word, n, function(n) -n), y = n)) + geom_bar(stat = "identity") + xlab('')
And next, the 100 most frequently occurring words in a word cloud!
library(quanteda.textplots)
set.seed(123)
textplot_wordcloud(mydfm, max_words = 100)
Word frequency analysis is a useful first step to explore your corpus. Based on it, you may wish to remove some additional highly frequent words which “overshadow” the rest of the corpus. But, these are always decisions that require thought and justification.
Co-Word Analysis
For co-word analysis, we convert the DFM into a Frequency Co-occurrence Matrix (FCM) using fcm. For the purpose of this tutorial, think of the FCM as a “box” filled with the connections between words.
Because the FCM can be very large, and thus difficult to analyze visually, we find the 50 most frequently occurring words (topfeatures), and then select them in a new FCM called myselect using the fcm_select function.
myfcm <- fcm(mydfm)
feat <- names(topfeatures(myfcm, 50))
myselect <- fcm_select(myfcm, pattern = feat, selection = "keep")
Having cut the FCM to size, we can now generate a co-occurrence network. To visualize the network well, it can be useful to impose a minimum frequency (min_freq), removing very rarely occurring links. You may also want to adjust the size of the dots (vertex_size) to reflect their importance.
size <- log(colSums(dfm_select(mydfm, feat, selection = "keep")))
set.seed(112)
textplot_network(myselect, min_freq = 0.8, vertex_size = size / max(size) * 3)
The present network can be analyzed further using network analysis techniques, its visualization can be further refined, etc. The present tutorial gives you a basis to start with.
-
Corpus Loading and Text Cleaning with Quanteda in R
Assuming you have a file with text data (perhaps a spreadsheet that you have exported as a CSV file, or data scraped from Google News), you can now start to build and clean your corpus. Fortunately, this is made very easy by functions in the Quanteda package.
First, we load quanteda and the corpus (mycorpus.csv, with a text column called all_para), and we make sure it is readable by quanteda’s corpus function by transforming the loaded data into characters using as.character.
library(quanteda)
mycorpus <- read.csv('mycorpus.csv')
mycorpus$all_para <- as.character(mycorpus$all_para)
mycorpus <- corpus(mycorpus$all_para)
Second, we use quanteda’s built-in functions to convert the corpus into tokens (for simplicity’s sake, think of tokens as individual words). We then carry out several “cleaning” steps, namely:
- Remove punctuation (remove_punct = TRUE) and remove numbers (remove_numbers = TRUE),
- Remove stop words such as “the”, “a”, “them”, “he”, “she”, etc. There are several repositories of stopwords for different languages, but do test them carefully as not all may be equally good (e.g. issues with Korean). In this case we use English stopwords (en) and the stopwords-iso source, which seems pretty good.
- Next, depending on the topic you are researching, it can be useful to remove some additional words. For instance, the words used in your search terms, which will likely be very frequent, and can “overshadow” other words used in your analysis, because almost everything is related to them. Should this be the case, you can create a vector, which is called samesame in the example. In your first round of analysis you may want to leave this out, hence the #.
- Then there are two more functions. The first shortens words (tokens) to their word stem, e.g. reducing plural forms to the singular. While this is a useful function, it sometimes truncates words in an odd way.
- It is usually a good idea to also convert all words to lowercase, so that “Happy” and “happy” are not seen as two different words.
In the final step, the tokens are converted into a “Document Feature Matrix” (DFM). For the purposes of this tutorial, just imagine that it’s your box of cleaned words.
mytokens <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE)
mytokens <- tokens_select(mytokens, stopwords('en', source='stopwords-iso'), selection='remove')
#mytokens <- tokens_select(mytokens, samesame, selection='remove')
mytokens <- tokens_wordstem(mytokens)
mytokens <- tokens_tolower(mytokens)
mydfm <- dfm(mytokens)
With your bright, shiny and clean DFM loaded, it’s time to do some text analysis!
-
[ASB Research Center Reading Group] Media Analysis with R
This session provides a basic introduction to media analysis using R. It addresses the full workflow of acquiring the data, cleaning and processing, and then some simple analysis and visualization.
We will acquire data from two sources: Google News RSS feeds and Twitter. Before joining the session, please make sure that you have:
- An internet connection.
- Installed R and RStudio on your laptop.
- Installed the packages/libraries tidyRSS, rvest, twitteR, quanteda, quanteda.textplots and ggplot2 (optional).
- A Twitter account (to load Twitter data).
You can install the packages with the following code. If you already installed the packages previously, this code will re-install the latest version of the package. Note: depending on your setup, and the R packages you previously installed, the installation can take some time.
install.packages(c('tidyRSS', 'rvest', 'quanteda', 'quanteda.textplots', 'ggplot2'))
Now that you are ready, please go ahead and follow these steps (mini tutorials).
- Google News Scraping in R
- Loading Twitter Data in R (no longer feasible for a demo due to updated API use rules)
- Corpus Loading and Text Cleaning with Quanteda in R
- Basic Text Analysis and Visualization in R
Below, some more code examples as used during the actual demo… (please note, it’s not very well edited, nor clean)
#TUTORIAL: https://pstek.nl/2022/asb-research-center-reading-group-media-analysis-with-r/
setwd('~/textmining/')

#Scrape Google News RSS
name <- 'asbdemo'
query <- c('"Asia School of Business"')
library(tidyRSS)
for(n in 1:length(query)){
  url <- paste0('https://news.google.com/rss/search?q=', URLencode(query[n], reserved= T), '&hl=en-MY&gl=MY&ceid=MY:en')
  if(n == 1){
    articles <- tidyfeed(url)
  } else {
    articles <- rbind(articles, tidyfeed(url))
  }
  Sys.sleep(5)
}
write.csv(articles[,1:13], paste0(name,'-articles.csv'), row.names= F)

#Scrape and clean
library(rvest)
all_para <- c()
for(n in 1:20){
  html <- read_html(articles$item_link[n])
  para <- html_text(html_nodes(html, 'p'))
  all_para <- c(all_para, para)
  Sys.sleep(5)
}
para_tbl <- as.data.frame(table(all_para))
para_tbl <- subset(para_tbl, Freq < 2) #remove repeats (trash)
write.csv(para_tbl, paste0(name,'-corpus.csv'), row.names= F)
#manually remove lines at the top in CSV (containing dates, key words, etc.)
para_tbl2 <- read.csv(paste0(name,'-corpus.csv'))

#Make corpus
library(quanteda)
mycorpus <- corpus(sub(" - .*", "", articles$item_title))
##para_tbl2$all_para <- as.character(para_tbl2$all_para)
##mycorpus <- corpus(para_tbl2$all_para)
#summary(mycorpus)
samesame <- c('asia', 'school', 'business')

#Clean corpus
mytokens <- tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE)
mytokens <- tokens_select(mytokens, stopwords('en', source='stopwords-iso'), selection='remove')
mytokens <- tokens_select(mytokens, samesame, selection='remove')
mytokens <- tokens_wordstem(mytokens)
mytokens <- tokens_tolower(mytokens)
mydfm <- dfm(mytokens)

#Analyse
library(ggplot2)
wfreq <- as.data.frame(topfeatures(mydfm, 10))
wfreq$n <- as.numeric(wfreq$.)
wfreq$word <- row.names(wfreq)
ggplot(wfreq, aes(x = reorder(word, n, function(n) -n), y = n)) + geom_bar(stat = "identity") + xlab('')

library(quanteda.textplots)
set.seed(123)
textplot_wordcloud(mydfm, max_words = 100)

myfcm <- fcm(mydfm)
dim(myfcm)
feat <- names(topfeatures(myfcm, 50))
myselect <- fcm_select(myfcm, pattern = feat, selection = "keep")
dim(myselect)
size <- log(colSums(dfm_select(mydfm, feat, selection = "keep")))
set.seed(112)
textplot_network(myselect, min_freq = 0.8, vertex_size = size / max(size) * 3)
textplot_network(myselect)
-
Google News Scraping in R
Google News is a popular news aggregator that can be used to search for news from diverse sources. You may have heard of Google News alert e-mails, but you can also use the service to scrape news stories for the purposes of systematic text analysis.
This post provides an example of how to use tidyRSS and rvest to identify and then scrape stories from Google News, using the following steps:
- Generate a URL to obtain a Google News RSS feed,
- Read the RSS feed using tidyRSS,
- Use rvest to visit the links of each news story and scrape the article content.
Before you proceed, please note that the process described here may violate website terms of service and copyright laws, as web scraping is often not permitted. It is therefore recommended to use this system in a way that does not overload Google News or the pages you are scraping; otherwise your IP will likely get blocked.
The Google News RSS Feed
Google News has a fantastic feature that lets you use search terms to generate an RSS feed. An RSS feed is a summary of information (e.g. news articles or blog posts) that can easily be read by a machine (e.g. imported into a spreadsheet).
Google News Search is very systematic, so you can clearly specify precisely what you are looking for. For example, you can search for the name of a politician, and his comments on some topics, as reported by a particular news website. Your search could be:
“Boris Johnson” AND (“Ukraine” OR “Putin” OR “Russia”) site:bbc.co.uk
This would give you all coverage from the BBC concerning what BoJo has said about Ukraine, Putin and Russia. To see what kind of articles appear, do a trial search on the regular Google News website.
To generate a Google News RSS feed URL (= website address), you can use the following code to inject your search term:
search_term <- '"Boris Johnson" AND ("Ukraine" OR "Putin" OR "Russia") site:bbc.co.uk'
url <- paste0('https://news.google.com/rss/search?q=', URLencode(search_term, reserved= T), '&hl=en-MY&gl=MY&ceid=MY:en')
As you may notice, the search_term is encoded into a URL-safe form via the URLencode function. You may also notice the end of the URL having &hl=en-MY&gl=MY&ceid=MY:en; this code specifies your language and country preference (in this case English language for Malaysia). If you do not provide this last bit of code, Google will automatically generate it based on your IP address and the language it thinks you use. For the sake of reproducibility, it is advisable to specify this last part, as Google is known to personalize search results.
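To see what URLencode actually produces, you can try it on a short search term (URLencode is part of base R’s utils package, so no extra installation is needed):

```r
# Percent-encode a search term: quotes, spaces and colons become %22, %20 and %3A
search_term <- '"Boris Johnson" site:bbc.co.uk'
URLencode(search_term, reserved = TRUE)
# [1] "%22Boris%20Johnson%22%20site%3Abbc.co.uk"
```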
TidyRSS
In the next step, TidyRSS is used to read the Google News RSS feed URL that was just generated. Note that Google News gives a maximum of 100 results via the RSS feed. The following code should work…
library(tidyRSS)
articles <- tidyfeed(url)
The articles output is a data frame that you can write to a CSV, spreadsheet, or analyze further.
The data frame contains the title of the article, a short description and a link to the original article (the item_link column).
The next step presents several options. When doing a systematic text analysis, it is possible to look only at article titles (the item_title column), but in this case we want to scrape the main article, so we use the links (item_link column).
Scraping news articles with rvest
Web scraping is a powerful tool to build a database of information, and what we are doing is writing a program (a bot), that goes to a specific list of links and reads and saves all the content.
In the next bit of code we use a loop (the for… part): we read each page’s HTML (read_html), extract all text in paragraphs marked by ‘p’ (html_text, html_nodes), and add those paragraphs to a vector (basically a long list of paragraphs) named all_para. Then we tell the program to take a 10-second break before going to the next link (Sys.sleep), because if we don’t, our IP will likely be blocked. This is particularly important if you are scraping many articles from the same website.
library(rvest)
all_para <- c()
for(n in 1:nrow(articles)){
  html <- read_html(articles$item_link[n])
  para <- html_text(html_nodes(html, 'p'))
  all_para <- c(all_para, para)
  Sys.sleep(10)
}
One of the points to note, is that we are now looking to do the text analysis on the level of a paragraph, not at that of an article. You can also do it at the level of an article, but you would adjust the code.
Secondly, not every piece of text that has been scraped is useful. Some of it may be irrelevant, such as a copyright notice, or the description of an ad or another news story.
One way of automatically filtering the data is to remove duplicate paragraphs, but a manual check should also be done. To remove duplicate paragraphs, you can use the following code:
para_tbl <- as.data.frame(table(all_para))
para_tbl <- subset(para_tbl, Freq < 2)
And there you have it! You have now scraped your very own “corpus” of news. You may wish to save your corpus as a CSV file for later use, or as a backup. Especially for a smaller corpus, it can be wise to look it through and remove any erroneously included paragraphs, as automatic web scraping is rarely an exact science. To save your data, you can use the following code:
write.csv(para_tbl, 'mycorpus.csv', row.names= F)