Task 1

Fisrt, load packages that will be used in the following analysis

library(tidyverse)
library(ggplot2)
library(ggthemes)
library(lubridate)
library(ggrepel)
library(scales)
library(viridis)

Download the exercise data from sg.data. In the zip file, there are resale flats data ranging from 1990 to 2021. I wish to frame the study scope to recent six years, therefore the first step is to load and merge resale flats dataset of 2015-2017 and 2017-now.

Prices2017 <- read_csv("data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")
Prices2015 <- read.csv("data/resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv")

I try to bind the two datasets using “bind_rows(Prices2015,Prices2017)”,however R shows a warning, because the “remaining_lease” factor in 2015 is integer data whereas in 2017 is character data. Observing the two datasets, the “remaining_lease” in 2017 shows as “x year y month”. To match with the 2015 data, I need to extract the first two characters and convert it to interger. Google tells me to use the substr command.

Prices2017$remaining_lease <- substr(Prices2017$remaining_lease,1,2)
Prices2017$remaining_lease <- as.integer(Prices2017$remaining_lease)
# Merge the processed data again
Prices2015_2021 <- bind_rows(Prices2015,Prices2017)
str(Prices2015_2021)
## 'data.frame':    127276 obs. of  11 variables:
##  $ month              : chr  "2015-01" "2015-01" "2015-01" "2015-01" ...
##  $ town               : chr  "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" ...
##  $ flat_type          : chr  "3 ROOM" "3 ROOM" "3 ROOM" "3 ROOM" ...
##  $ block              : chr  "174" "541" "163" "446" ...
##  $ street_name        : chr  "ANG MO KIO AVE 4" "ANG MO KIO AVE 10" "ANG MO KIO AVE 4" "ANG MO KIO AVE 10" ...
##  $ storey_range       : chr  "07 TO 09" "01 TO 03" "01 TO 03" "01 TO 03" ...
##  $ floor_area_sqm     : num  60 68 69 68 68 67 68 68 67 68 ...
##  $ flat_model         : chr  "Improved" "New Generation" "New Generation" "New Generation" ...
##  $ lease_commence_date: num  1986 1981 1980 1979 1980 ...
##  $ remaining_lease    : int  70 65 64 63 64 64 64 65 62 69 ...
##  $ resale_price       : num  255000 275000 285000 290000 290000 ...

This time it worked. This new dataset is the basis for all subsequent analysis. After looking at the factors in the dataset, as well as reviewing articles on the Singapore HDB resale market, I found five interesting questions.

Question 1: What is the general trend of HDB resale prices in recent six years?

The dataset records the resale month information, from which we need to extract the resale year and merge it into the dataset as a new column.

Prices2015_2021 <- 
  Prices2015_2021 %>% 
  mutate(year = year(ym(month))) %>% 
  select(year, everything())

Next, calculate the mean value of resale flat prices for each year and plot it using ggplot2.

Prices2015_2021 %>% 
  group_by(year) %>% 
  summarise(average_price = mean(resale_price)) %>% 
  
  ggplot(mapping = aes(x=year, y=average_price)) +
  geom_smooth(alpha = 0.1, fill = "#6300a7ff", color = "#6300a7ff") + # The color is from the viridis color palettes
  geom_point(size = 2) +
  # Adjust styles, lables and add title
  theme_minimal() +
  scale_x_continuous(breaks = 2015:2021, minor_breaks = NULL) +
  labs(title = "Trend of HDB resale price from 2015 to 2021", y = "Resale price (SGD)", 
       x = "Year", caption = "Source of data: Singapore Housing and Development Board, 2021.")

It can be seen from this figure that resale flat prices were relatively stable between 2015-2019, fluctuating within a range of 15000, with a slight increase in 2015-2017 and decrease in 2018 and 2019. In 2020, however, the resale price began to rise rapidly and by early 2021 the average price had exceeded 485,000. such a dramatic change was reminiscent of an important social change in 2020 - the COVID-19 pandemic. Therefore in the latter question I will look more closely at the changes in the resale market during the pandemic.

Question 3: How have prices fluctuated in the last year for different flat types?

Here I play with boxplot to create fun visualisation as well as reduce data transpormation steps. As “month” contains 12 variables, it’s a bit messy to plot with. Using lubridate package, month can be transformed to quarters.

Prices2015_2021 %>% 
  filter(year == 2020) %>% 
  group_by(flat_type) %>% 
  # Extract quarter and price per sqm
  mutate(quarter = quarter(ym(month)),
         price_per_sqm = resale_price/floor_area_sqm) %>%
  # Filter out some outliers to make the figure looks nicer
  filter(price_per_sqm <= 7500) %>% 
  
  ggplot(mapping = aes(x = flat_type, y = price_per_sqm)) +
  # Make grouped boxplot
  geom_boxplot(aes(fill = as.factor(quarter)), color = "grey") +
  theme_minimal() +
  theme(legend.position = "top") +
  scale_fill_viridis_d(option = "C") +
  # Adjust lables and add title
  labs(title = "HDB resale prices in 2020 by flat type", y="Price per square metre (SGD)", fill = "Quarter",
       x = "Flat type", caption = "Source of data: Singapore Housing and Development Board, 2021.")