Task 1

Fisrt, load packages that will be used in the following analysis

library(tidyverse)
library(ggplot2)
library(ggthemes)
library(lubridate)
library(ggrepel)
library(scales)
library(viridis)

Download the exercise data from sg.data. In the zip file, there are resale flats data ranging from 1990 to 2021. I wish to frame the study scope to recent six years, therefore the first step is to load and merge resale flats dataset of 2015-2017 and 2017-now.

Prices2017 <- read_csv("data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")
Prices2015 <- read.csv("data/resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv")

I try to bind the two datasets using “bind_rows(Prices2015,Prices2017)”,however R shows a warning, because the “remaining_lease” factor in 2015 is integer data whereas in 2017 is character data. Observing the two datasets, the “remaining_lease” in 2017 shows as “x year y month”. To match with the 2015 data, I need to extract the first two characters and convert it to interger. Google tells me to use the substr command.

Prices2017$remaining_lease <- substr(Prices2017$remaining_lease,1,2)
Prices2017$remaining_lease <- as.integer(Prices2017$remaining_lease)
# Merge the processed data again
Prices2015_2021 <- bind_rows(Prices2015,Prices2017)
str(Prices2015_2021)
## 'data.frame':    127276 obs. of  11 variables:
##  $ month              : chr  "2015-01" "2015-01" "2015-01" "2015-01" ...
##  $ town               : chr  "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" "ANG MO KIO" ...
##  $ flat_type          : chr  "3 ROOM" "3 ROOM" "3 ROOM" "3 ROOM" ...
##  $ block              : chr  "174" "541" "163" "446" ...
##  $ street_name        : chr  "ANG MO KIO AVE 4" "ANG MO KIO AVE 10" "ANG MO KIO AVE 4" "ANG MO KIO AVE 10" ...
##  $ storey_range       : chr  "07 TO 09" "01 TO 03" "01 TO 03" "01 TO 03" ...
##  $ floor_area_sqm     : num  60 68 69 68 68 67 68 68 67 68 ...
##  $ flat_model         : chr  "Improved" "New Generation" "New Generation" "New Generation" ...
##  $ lease_commence_date: num  1986 1981 1980 1979 1980 ...
##  $ remaining_lease    : int  70 65 64 63 64 64 64 65 62 69 ...
##  $ resale_price       : num  255000 275000 285000 290000 290000 ...

This time it worked. This new dataset is the basis for all subsequent analysis. After looking at the factors in the dataset, as well as reviewing articles on the Singapore HDB resale market, I found five interesting questions.

Question 1: What is the general trend of HDB resale prices in recent six years?

The dataset records the resale month information, from which we need to extract the resale year and merge it into the dataset as a new column.

Prices2015_2021 <- 
  Prices2015_2021 %>% 
  mutate(year = year(ym(month))) %>% 
  select(year, everything())

Next, calculate the mean value of resale flat prices for each year and plot it using ggplot2.

Prices2015_2021 %>% 
  group_by(year) %>% 
  summarise(average_price = mean(resale_price)) %>% 
  
  ggplot(mapping = aes(x=year, y=average_price)) +
  geom_smooth(alpha = 0.1, fill = "#6300a7ff", color = "#6300a7ff") + # The color is from the viridis color palettes
  geom_point(size = 2) +
  # Adjust styles, lables and add title
  theme_minimal() +
  scale_x_continuous(breaks = 2015:2021, minor_breaks = NULL) +
  labs(title = "Trend of HDB resale price from 2015 to 2021", y = "Resale price (SGD)", 
       x = "Year", caption = "Source of data: Singapore Housing and Development Board, 2021.")

It can be seen from this figure that resale flat prices were relatively stable between 2015-2019, fluctuating within a range of 15000, with a slight increase in 2015-2017 and decrease in 2018 and 2019. In 2020, however, the resale price began to rise rapidly and by early 2021 the average price had exceeded 485,000. such a dramatic change was reminiscent of an important social change in 2020 - the COVID-19 pandemic. Therefore in the latter question I will look more closely at the changes in the resale market during the pandemic.

Question 3: How have prices fluctuated in the last year for different flat types?

Here I play with boxplot to create fun visualisation as well as reduce data transpormation steps. As “month” contains 12 variables, it’s a bit messy to plot with. Using lubridate package, month can be transformed to quarters.

Prices2015_2021 %>% 
  filter(year == 2020) %>% 
  group_by(flat_type) %>% 
  # Extract quarter and price per sqm
  mutate(quarter = quarter(ym(month)),
         price_per_sqm = resale_price/floor_area_sqm) %>%
  # Filter out some outliers to make the figure looks nicer
  filter(price_per_sqm <= 7500) %>% 
  
  ggplot(mapping = aes(x = flat_type, y = price_per_sqm)) +
  # Make grouped boxplot
  geom_boxplot(aes(fill = as.factor(quarter)), color = "grey") +
  theme_minimal() +
  theme(legend.position = "top") +
  scale_fill_viridis_d(option = "C") +
  # Adjust lables and add title
  labs(title = "HDB resale prices in 2020 by flat type", y="Price per square metre (SGD)", fill = "Quarter",
       x = "Flat type", caption = "Source of data: Singapore Housing and Development Board, 2021.")

As the number of rooms rises, the average price per sqm decreases. 3,4,5 ROOM have many outliers with extremely high prices per sqm. The flat type with the highest price fluctuations in the last year was 1 ROOM, which increased by about 1000 SGD/sqm from Q1 to Q4, I guess it’s because 1 ROOM is traded in smaller volumes, while the other flat types are traded in larger volumes so their market is more stable. In Q2, the average price fluctuated in different directions, but in Q3 and Q4, they were all higher than in Q1.

Questiong 4: Are similar flats sold across Singapore at the same time?

ggplot(data = Prices2015_2021, mapping = aes(y=town, fill=flat_type)) +
  theme_minimal() +
  geom_bar(position = "fill") +
  scale_fill_viridis_d(option = "C") +
  labs(title = "Transaction propotion of different flat types by neighborhood", fill = "Flat type", 
       x = "Count", y = "", caption = "Source of data: Singapore Housing and Development Board, 2021.")

Neighborhoods exhibit differences in the flat type structure, especially in the propotion of 3 ROOM flat. In Sembawang, Sengkang, Pasir ris, there are very few 3 ROOM resale flats with large propotion of 4,5 ROOM flats. However in Ang mo kio, Queenstown and Clementi, the propotion of 3 ROOM flats reach nearly 50%.

Question 5: How does the feature of a neighborhood different from others’?

The dataset contains many informative factors that could give us hints of the neighborhood identity: resale price could indicate the neighborhood’s target group; lease commence date could show the neighborhood age; resale volume can picture the size of a neighborhood. Plotting the three factors in one figure, we can compare the different identities of neighborhoods.

town_resale_volume <- as.data.frame(table(Prices2015_2021$town))

Prices2015_2021 %>% 
  group_by(town) %>% 
  # Calculate housing age
  mutate(housing_age = 2021 - lease_commence_date) %>% 
  summarise(average_price = mean(resale_price), average_housing_age = mean(housing_age)) %>% 
  bind_cols(volume = town_resale_volume$Freq) %>% 
  ggplot(mapping = aes(x=average_housing_age, y=average_price)) +
  geom_count(aes(size = volume, color = volume), alpha = 0.6) +
  # Change size of count points
  scale_size_area(max_size = 18) +
  # Add lables next to count points
  geom_text_repel(aes(label = town),size = 2.7) +
  scale_y_continuous(labels = comma, limits = c(330000, 730000)) +
  theme_minimal() +
  theme(legend.position = "top") +
  scale_color_viridis_c(option = "C") +
  labs(title = "HDB Resale flat features by neighborhood", x = "Housing age", y = "Resale price (SGD)", 
       color = "Resale volume", size = "Resale volume", caption = "Source of data: Singapore Housing and Development Board, 2021.")

In Punggol and Sengkang, the housing age is young while of large volume and low price, indicating they are large new residential neighborhood. This finding matching with their identity as “new town”. Contrary to this is Bukit Timah that is of old, small vomue of housing and high price. According to wikipedia, owing to its prime location, Bukit Timah has some of the densest clusters of luxury condominiums and landed property in the city, with very few public housing.

Task 2

Load the package for rigdeline plot

library(ggridges)
Prices2015_2021 %>% 
  filter(year == c(2017:2019)) %>%
  
  # Referring to www.datanovia.com/en/blog/elegant-visualization-of-density-distribution-in-r-using-ridgeline to add fill color
  ggplot(mapping = aes(x = resale_price, y = as.factor(year), fill = stat(x))) +
  geom_density_ridges_gradient(color = "white") +
  scale_fill_viridis_c(option = "C") +
  theme_minimal() +
  # Remove legend and add title
  theme(legend.position = "none") +
  labs(title = "HDB resale prices in the last three years", x = "Resale price (SGD)", y = "Year", 
       caption = "Data: Housing and Development Board, Singapore, 2020.")

The plot shows the resale price and resale volume from 2017 to 2019. In 2018, there was an increase in the number of flats with a resale price over 1,000,000 and an uplift in the maximum price. In 2018 and 2019, the number of flats with prices in the 500,000-750,000 range was larger relative to 2017, but the median resale price was basically static. Overall the HDB resale price in the last three years remained generally stable.

Prices2015_2021 %>% 
  filter(year == 2019) %>%
  group_by(town) %>% 
  mutate(price_per_sqm = resale_price/floor_area_sqm) %>% 
  
  # Plot the neighborhoods in an ascending order of price per sqm
  ggplot(mapping = aes(x = price_per_sqm, y = reorder(as.factor(town),-price_per_sqm),
                       fill = stat(x)
                       )) +
  geom_density_ridges_gradient(color = "white") +
  scale_fill_viridis_c(option = "C") +
  theme_minimal() +
  
  # Remove legend, grid line add title
  theme(legend.position = "none") +
  theme(panel.grid = element_blank()) +
  labs(title = "HDB resale prices in 2019 by neighborhood", x = "Price per square metre (SGD)", y = "",
       subtitle = "Neighborhood exhibit large differences",
       caption = "Data: Housing and Development Board, Singapore, 2020.")

The plot shows distribution of price per sqm across neighborhoods. Central area looks interesting as it has the highest housing price whereas the price distribution is scattered. This match with the diversity of housing in central area that it not only has modern skyrise apartments but also has old communities. Looking at the neighborhood with high concentration of price like Choa chu kang, Woodlands, Sembawang, it’s easy to tell that they are all residential neighborhoods with little mix use. From the figure’s top to bottom, as average price per sqm increase, the diversity also increase, this reveals that the concentration of residential housing does improve the affordability.