install.packages('tidyverse')
Peak Bloom Prediction Demo
Instructions
In this analysis we demonstrate a simple way of predicting the peak bloom date over the next decade for all five locations required by the competition. The models are simple in that they only use the past bloom dates observed at each location—no other covariates or additional information are considered. At the end of this document (Appendix A), we demonstrate a simple way to get historic temperature data for the four locations via the rnoaa
package.
For this demo analysis we are using methods from the tidyverse of R packages. They can be installed via
and then loaded via
library(tidyverse)
Loading the data
The data for four sites is provided as a simple text file in CSV format. Each file contains the dates of the peak bloom of the cherry trees at the respective site, alongside the geographical location of the site.
The six columns in each data file are
- location a human-readable location identifier (
string
). - lat (approximate) latitude of the cherry trees (
double
). - long (approximate) longitude of the cherry trees (
double
). - alt (approximate) altitude of the cherry trees (
double
). - year year of the observation (
integer
). - bloom_date date of peak bloom of the cherry trees (ISO 8601 date
string
). The “peak bloom date” may be defined differently for different sites - bloom_doy days since January 1st of the year until peak bloom (
integer
). January 1st corresponds to1
.
In R, the data files can be read with read.csv()
and concatenated with the bind_rows()
function:
<- read.csv("data/washingtondc.csv") %>%
cherry bind_rows(read.csv("data/liestal.csv")) %>%
bind_rows(read.csv("data/kyoto.csv")) %>%
bind_rows(read.csv("data/vancouver.csv"))
For example, the latest 3 observations for each location can be extracted with:
%>%
cherry group_by(location) %>%
slice_tail(n = 3)
# A tibble: 11 × 7
# Groups: location [4]
location lat long alt year bloom_date bloom_doy
<chr> <dbl> <dbl> <int> <int> <chr> <int>
1 kyoto 35.0 136. 44 2021 2021-03-26 85
2 kyoto 35.0 136. 44 2022 2022-04-01 91
3 kyoto 35.0 136. 44 2023 2023-03-25 84
4 liestal 47.5 7.73 350 2021 2021-03-28 87
5 liestal 47.5 7.73 350 2022 2022-03-26 85
6 liestal 47.5 7.73 350 2023 2023-03-29 88
7 vancouver 49.2 -123. 24 2022 2022-03-27 86
8 vancouver 49.2 -123. 24 2023 2023-04-07 97
9 washingtondc 38.9 -77.0 0 2021 2021-03-28 87
10 washingtondc 38.9 -77.0 0 2022 2022-03-21 80
11 washingtondc 38.9 -77.0 0 2023 2023-03-23 82
Visualizing the time series
%>%
cherry filter(year >= 1880) %>%
ggplot(aes(x = year, y = bloom_doy)) +
geom_point() +
geom_step(linetype = 'dotted', color = 'gray50') +
scale_x_continuous(breaks = seq(1880, 2020, by = 20)) +
facet_grid(cols = vars(str_to_title(location))) +
labs(x = "Year", y = "Peak bloom (days since Jan 1st)")
Predicting the peak bloom
A simple method to predict peak bloom date in the future is to fit a least-squares line through the observed dates and extrapolate the regression function. We want to have a separate line for each location, hence we tell R to estimate interaction effects. We only use data from 1880 to fit the trends, as prior data may not be as reliable/relevant.
# Fit simple least-squares lines for all sites.
<- lm(bloom_doy ~ location * year, data = cherry, subset = year >= 1880) ls_fit
This simple linear regression functions suggest a trend toward earlier peak bloom at all 3 sites. We can compute the actual predictions using the predict()
function and
# Compute the predictions for all 3 sites
<- expand_grid(location = unique(cherry$location),
predictions year = 1880:2024) %>%
bind_cols(predict(ls_fit, newdata = ., interval = 'prediction', level = 0.9)) %>%
rename(prediction = fit, lower = lwr, upper = upr)
# Plot the predictions alongside the actual observations for 2015 up to 2023.
%>%
cherry right_join(predictions, by = c('year', 'location')) %>%
filter((location == 'vancouver' & year >= 2021) | (location != 'vancouver' & year >= 2015)) %>%
ggplot(aes(x = year, y = prediction, ymin = lower, ymax = upper)) +
geom_line(linewidth = 1) +
geom_ribbon(color = 'black', linetype = '22', linewidth = 0.8, fill = NA) +
geom_point(aes(y = bloom_doy)) +
scale_x_continuous(breaks = c(2015, 2018, 2021, 2024)) +
facet_grid(cols = vars(str_to_title(location))) +
labs(x = "Year", y = "Peak bloom (days since Jan 1st)")
Based on this very simple model, the peak bloom dates at the four sites are:
#' Small helper function to convert the day of year to
#' the actual date.
#'
#' @param year year as an integer
#' @param doy day of the year as integer (1 means January 1st)
#' @return date string
<- function (year, doy) {
doy_to_date strptime(paste(year, doy, sep = '-'), '%Y-%j') %>% # create date object
strftime('%Y-%m-%d') # translate back to date string in ISO 8601 format
}
%>%
predictions filter(year == 2024) %>%
mutate(prediction = round(prediction),
lower = floor(lower),
upper = ceiling(upper),
prediction_date = doy_to_date(year, prediction))
# A tibble: 4 × 6
location year prediction lower upper prediction_date
<chr> <int> <dbl> <dbl> <dbl> <chr>
1 washingtondc 2024 90 76 103 2024-03-30
2 liestal 2024 94 80 107 2024-04-03
3 kyoto 2024 94 81 107 2024-04-03
4 vancouver 2024 108 77 139 2024-04-17
Extrapolating to Vancouver, BC
For the cherry trees in Vancouver, BC, few historical observations are available. This shows in the simple analysis above in the very wide prediction interval. The trees are located approximately at 49.2236916°N (latitude), -123.1636251°E (longitude), 24 meters above sea levels (altitude). Casual observations have been recorded in the way of photos posted to the VCBF Neighbourhood Blog for Kerrisdale. You can search the forum for the keywords “Akebono” (i.e., the name of the cultivar) and “Maple Grove Park” (i.e., the location of the trees).
We need to extrapolate from what we have learned about the peak bloom dates in the other locations to Vancouver. The simple model we have fitted above, however, does not allow us to transfer any knowledge from the other sites to Vancouver – we have only used the history trend at the respective sites.
Although the climate in Vancouver is different from the other locations, the simplest way to borrow information from the other locations is to average across these three sites. Hence, we want to fit a straight line through the peak bloom dates, ignoring the actual site:
# Fit simple least-squares lines for all sites.
# We use larger weights for observations from vancouver than for the other sites
<- lm(bloom_doy ~ year, data = cherry, subset = year >= 1880,
ls_fit_for_van weights = (location == 'vancouver') +
0.2 * (location != 'vancouver'))
<- tibble(location = 'vancouver', year = 2022:2024) %>%
predictions_vancouver bind_cols(predict(ls_fit_for_van, newdata = ., interval = 'prediction', level = 0.9)) %>%
rename(prediction = fit, lower = lwr, upper = upr)
Not surprisingly, the predicted peak bloom date for Vancouver is now very similar to the other 3 sites:
predictions_vancouver
# A tibble: 3 × 5
location year prediction lower upper
<chr> <int> <dbl> <dbl> <dbl>
1 vancouver 2022 92.0 85.9 98.0
2 vancouver 2023 91.8 85.8 97.9
3 vancouver 2024 91.7 85.7 97.8
We can check the predictions against the data from previous competition years.
# Plot the predictions alongside the actual observations for 2015 up to 2023.
%>%
cherry right_join(predictions_vancouver, by = c('year', 'location')) %>%
ggplot(aes(x = year, y = prediction, ymin = lower, ymax = upper)) +
geom_line(linewidth = 1) +
geom_ribbon(color = 'black', linetype = '22', linewidth = 0.8, fill = NA) +
geom_point(aes(y = bloom_doy)) +
scale_x_continuous(breaks = 2022:2024) +
facet_grid(cols = vars(str_to_title(location))) +
labs(x = "Year", y = "Peak bloom (days since Jan 1st)")
If satisfied with the predictions, we can use them instead of the predictions from before.
<- predictions %>%
predictions filter(location != 'vancouver') %>%
bind_rows(predictions_vancouver)
Extrapolating to New York City, NY using USA-NPN data
Similar to Vancouver, BC, only few historical observations are available for our location in New York City, NY. There are some historical observations dating back to 2019 in the data provided by USA-NPN. The Washington Square Park has site id 32789 and the Yoshino cherry you should predict has species id 228.
<- read_csv("data/USA-NPN_status_intensity_observations_data.csv") %>%
nyc_data_npn filter(Site_ID == 32789,
== 228) %>%
Species_ID mutate(Observation_Date = as_date(Observation_Date, format = '%m/%d/%y'))
This data, however, needs to be transformed as it only contains individual observations of the phenophase and not the actual peak bloom date. For simplicity, we take the first day someone observed the flowers to be open as the peak bloom day. This could be done in a more sophisticated way by also looking at the reported intensity value.
<- nyc_data_npn %>%
nyc_data arrange(Observation_Date) %>%
mutate(year = year(Observation_Date)) %>%
group_by(year) %>%
summarize(first_flower_index = min(which(Phenophase_Status == 1)),
bloom_date = strftime(Observation_Date[first_flower_index], format = '%Y-%m-%d'),
bloom_doy = Day_of_Year[first_flower_index],
.groups = 'drop') %>%
filter(!is.na(bloom_doy)) %>%
select(-first_flower_index) %>%
mutate(location = 'newyorkcity')
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `first_flower_index = min(which(Phenophase_Status == 1))`.
ℹ In group 2: `year = 2020`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
<- cherry %>%
cherry_with_nyc bind_rows(nyc_data)
For 2020, no bloom was reported, hence the warning.
Using the same steps as for Vancouver, BC, a simple linear model can be fitted.
# Fit simple least-squares lines for all sites.
# We use larger weights for observations from NYC than for the other sites
<- lm(bloom_doy ~ year, data = cherry_with_nyc, subset = year >= 1880,
ls_fit_for_nyc weights = (location == 'newyorkcity') +
0.2 * (location != 'newyorkcity'))
<- tibble(location = 'newyorkcity', year = 2019:2024) %>%
predictions_nyc bind_cols(predict(ls_fit_for_nyc, newdata = ., interval = 'prediction', level = 0.9)) %>%
rename(prediction = fit, lower = lwr, upper = upr)
predictions_nyc
# A tibble: 6 × 5
location year prediction lower upper
<chr> <int> <dbl> <dbl> <dbl>
1 newyorkcity 2019 92.5 86.5 98.5
2 newyorkcity 2020 92.4 86.4 98.4
3 newyorkcity 2021 92.3 86.3 98.3
4 newyorkcity 2022 92.2 86.2 98.2
5 newyorkcity 2023 92.1 86.1 98.1
6 newyorkcity 2024 92.0 86.0 98.0
We can check the predictions against the data from previous competition years.
# Plot the predictions alongside the actual observations for 2015 up to 2023.
%>%
cherry_with_nyc right_join(predictions_nyc, by = c('year', 'location')) %>%
ggplot(aes(x = year, y = prediction, ymin = lower, ymax = upper)) +
geom_line(linewidth = 1) +
geom_ribbon(color = 'black', linetype = '22', linewidth = 0.8, fill = NA) +
geom_point(aes(y = bloom_doy)) +
scale_x_continuous(breaks = 2019:2024) +
facet_grid(cols = vars(str_to_title(location))) +
labs(x = "Year", y = "Peak bloom (days since Jan 1st)")
If satisfied with the predictions, we can use them instead of the predictions from before.
<- predictions %>%
predictions filter(location != 'newyorkcity') %>%
bind_rows(predictions_nyc)
Preparing the submission file
Once we have the predictions for all four sites, we have to save them in the correct format for the competition.
<- predictions %>%
submission_predictions filter(year == 2024) %>%
mutate(prediction = round(prediction),
lower = floor(lower),
upper = ceiling(upper)) %>%
select(-year)
submission_predictions
# A tibble: 5 × 4
location prediction lower upper
<chr> <dbl> <dbl> <dbl>
1 washingtondc 90 76 103
2 liestal 94 80 107
3 kyoto 94 81 107
4 vancouver 92 85 98
5 newyorkcity 92 85 98
For submission, these predictions must be saved as a CSV file. Important: the CSV file must not have row names, which R adds by default. Specify row.names=FALSE
to suppress them:
write.csv(submission_predictions, file = "cherry-predictions.csv",
row.names = FALSE)
Appendix: Adding Covariates
We encourage you to find additional publicly-available data that will improve your predictions. For example, one source of global meteorological data comes from the Global Historical Climatology Network (GHCN), available in the rnoaa
package. The package can also be installed via
install.packages("rnoaa")
and the loaded via
library(rnoaa)
The list of stations can be retrieved using the ghcnd_stations()
function. Note that the closest weather station to each city with continuously collected maximum temperatures are USC00186350 (Washington D.C.), GME00127786 (Liestal), JA000047759 (Kyoto), and CA001108395 (Vancouver).
<- ghcnd_stations() stations
As a simple demonstration, we retrieve the average seasonal maximum daily temperature (in 1/10 °C) from these stations using our own get_temperature()
function, which wraps the ghcnd_search()
function in the rnoaa
package. (N.b. ghcnd_search()
returns a list. Each element of the list corresponds to an element of the var
argument.)
#' Get the annual average maximum temperature at the given station,
#' separated into the 4 meteorological seasons (Winter, Spring, Summer, Fall).
#'
#' The seasons are span 3 months each.
#' Winter is from December to February, Spring from March to May,
#' Summer from June to August, and Fall from September to November.
#' Note that December is counted towards the Winter of the next year, i.e.,
#' temperatures in December 2020 are accounted for in Winter 2021.
#'
#' @param stationid the `rnoaa` station id (see [ghcnd_stations()])
#' @return a data frame with columns
#' - `year` ... the year of the observations
#' - `season` ... the season (Winter, Spring, Summer, Fall)
#' - `tmax_avg` ... average maximum temperate in tenth degree Celsius
<- function (stationid) {
get_temperature ghcnd_search(stationid = stationid, var = c("tmax"),
date_min = "1950-01-01", date_max = "2023-01-31")[[1]] %>%
mutate(year = as.integer(format(date, "%Y")),
month = as.integer(strftime(date, '%m')) %% 12, # make December "0"
season = cut(month, breaks = c(0, 2, 5, 8, 11),
include.lowest = TRUE,
labels = c("Winter", "Spring", "Summer", "Fall")),
year = if_else(month == 0, year + 1L, year)) %>%
group_by(year, season) %>%
summarize(tmax_avg = mean(tmax, na.rm = TRUE))
}
<-
historic_temperatures tibble(location = "washingtondc", get_temperature("USC00186350")) %>%
bind_rows(tibble(location = "liestal", get_temperature("GME00127786"))) %>%
bind_rows(tibble(location = "kyoto", get_temperature("JA000047759"))) %>%
bind_rows(tibble(location = "vancouver", get_temperature("CA001108395")))
%>%
historic_temperatures ggplot() +
aes(year, tmax_avg) +
geom_line() +
xlim(1950, 2024) +
labs(x = "Year", y = "Average maximum temperature (1/10 °C)") +
facet_grid(factor(season) ~ str_to_title(location))
We then predict the peak bloom date in two steps. Step 1 extrapolates average winter and spring temperature until 2032 using simple linear regression. Step 2 predicts the peak bloom date given the extrapolated temperatures, again using simple linear regression.
# Step 1. extrapolate average seasonal maximum temperature
<- lm(tmax_avg ~ year * season + location,
ls_fit_temperature data = historic_temperatures)
<-
temperature_predictions expand_grid(location = c("washingtondc", "liestal", "kyoto", "vancouver" ),
season = c("Winter", "Spring", "Summer", "Fall"),
year = 1950:2024) %>%
bind_cols(predicted_temperature =
predict(ls_fit_temperature, newdata = .)) %>%
filter(season %in% c("Winter", "Spring")) %>%
pivot_wider(names_from = season, values_from = predicted_temperature)
# Step 2. predict bloom day from extrapolated temperatures
<-
predictions_temperature %>%
temperature_predictions left_join(cherry,
by = c("location", "year")) %>%
lm(bloom_doy ~ Spring * Winter, data = .) %>%
predict(newdata = temperature_predictions) %>%
round() %>%
bind_cols(predicted_doy_temperature = ., temperature_predictions)
The following plot shows a comparison of predictions for Vancouver using the two methods described in this demo.
%>%
predictions_vancouver left_join(predictions_temperature,
by = c("location", "year")) %>%
pivot_longer(cols = c("prediction", "predicted_doy_temperature")) %>%
mutate(name = ifelse(name == "prediction",
"Method 1: location-based model",
"Method 2: temperature-based model")) %>%
ggplot() +
aes(x = year, y = value, linetype = name) +
geom_line() +
scale_x_continuous(breaks = 2022:2024) +
labs(x = "Year", linetype = "",
y = "Predicted peak bloom (days since Jan 1st) for Vancouver") +
theme(legend.position = "bottom")