Peak Bloom Prediction Demo

Author

Eager Learner

Published

January 5, 2024

Instructions

In this analysis we demonstrate a simple way of predicting the peak bloom date over the next decade for all five locations required by the competition. The models are simple in that they only use the past bloom dates observed at each location—no other covariates or additional information are considered. At the end of this document (Appendix A), we demonstrate a simple way to get historic temperature data for the four locations via the rnoaa package.

For this demo analysis we are using methods from the tidyverse of R packages. They can be installed via

install.packages('tidyverse')

and then loaded via

library(tidyverse)

Loading the data

The data for four sites is provided as a simple text file in CSV format. Each file contains the dates of the peak bloom of the cherry trees at the respective site, alongside the geographical location of the site.

The six columns in each data file are

  • location a human-readable location identifier (string).
  • lat (approximate) latitude of the cherry trees (double).
  • long (approximate) longitude of the cherry trees (double).
  • alt (approximate) altitude of the cherry trees (double).
  • year year of the observation (integer).
  • bloom_date date of peak bloom of the cherry trees (ISO 8601 date string). The “peak bloom date” may be defined differently for different sites
  • bloom_doy days since January 1st of the year until peak bloom (integer). January 1st corresponds to 1.

In R, the data files can be read with read.csv() and concatenated with the bind_rows() function:

cherry <- read.csv("data/washingtondc.csv") %>% 
  bind_rows(read.csv("data/liestal.csv")) %>% 
  bind_rows(read.csv("data/kyoto.csv")) %>% 
  bind_rows(read.csv("data/vancouver.csv"))

For example, the latest 3 observations for each location can be extracted with:

cherry %>% 
  group_by(location) %>% 
  slice_tail(n = 3)
# A tibble: 11 × 7
# Groups:   location [4]
   location       lat    long   alt  year bloom_date bloom_doy
   <chr>        <dbl>   <dbl> <int> <int> <chr>          <int>
 1 kyoto         35.0  136.      44  2021 2021-03-26        85
 2 kyoto         35.0  136.      44  2022 2022-04-01        91
 3 kyoto         35.0  136.      44  2023 2023-03-25        84
 4 liestal       47.5    7.73   350  2021 2021-03-28        87
 5 liestal       47.5    7.73   350  2022 2022-03-26        85
 6 liestal       47.5    7.73   350  2023 2023-03-29        88
 7 vancouver     49.2 -123.      24  2022 2022-03-27        86
 8 vancouver     49.2 -123.      24  2023 2023-04-07        97
 9 washingtondc  38.9  -77.0      0  2021 2021-03-28        87
10 washingtondc  38.9  -77.0      0  2022 2022-03-21        80
11 washingtondc  38.9  -77.0      0  2023 2023-03-23        82

Visualizing the time series

cherry %>% 
  filter(year >= 1880) %>%
  ggplot(aes(x = year, y = bloom_doy)) +
  geom_point() +
  geom_step(linetype = 'dotted', color = 'gray50') +
  scale_x_continuous(breaks = seq(1880, 2020, by = 20)) +
  facet_grid(cols = vars(str_to_title(location))) +
  labs(x = "Year", y = "Peak bloom (days since Jan 1st)")

Time series of peak bloom of cherry trees since 1880 at four different sites.

Predicting the peak bloom

A simple method to predict peak bloom date in the future is to fit a least-squares line through the observed dates and extrapolate the regression function. We want to have a separate line for each location, hence we tell R to estimate interaction effects. We only use data from 1880 to fit the trends, as prior data may not be as reliable/relevant.

# Fit simple least-squares lines for all sites.
ls_fit <- lm(bloom_doy ~ location * year, data = cherry, subset = year >= 1880)

This simple linear regression functions suggest a trend toward earlier peak bloom at all 3 sites. We can compute the actual predictions using the predict() function and

# Compute the predictions for all 3 sites
predictions <- expand_grid(location = unique(cherry$location),
                           year = 1880:2024) %>% 
  bind_cols(predict(ls_fit, newdata = ., interval = 'prediction', level = 0.9)) %>% 
  rename(prediction = fit, lower = lwr, upper = upr)

# Plot the predictions alongside the actual observations for 2015 up to 2023.
cherry %>% 
  right_join(predictions, by = c('year', 'location')) %>%
  filter((location == 'vancouver' & year >= 2021) | (location != 'vancouver' & year >= 2015)) %>% 
  ggplot(aes(x = year, y = prediction, ymin = lower, ymax = upper)) +
  geom_line(linewidth = 1) +
  geom_ribbon(color = 'black', linetype = '22', linewidth = 0.8, fill = NA) +
  geom_point(aes(y = bloom_doy)) +
  scale_x_continuous(breaks = c(2015, 2018, 2021, 2024)) +
  facet_grid(cols = vars(str_to_title(location))) +
  labs(x = "Year", y = "Peak bloom (days since Jan 1st)")

Predictions and 90% prediction intervals from simple linear regression models fitted to four sites.

Based on this very simple model, the peak bloom dates at the four sites are:

#' Small helper function to convert the day of year to
#' the actual date.
#' 
#' @param year year as an integer
#' @param doy day of the year as integer (1 means January 1st)
#' @return date string
doy_to_date <- function (year, doy) {
  strptime(paste(year, doy, sep = '-'), '%Y-%j') %>% # create date object
    strftime('%Y-%m-%d') # translate back to date string in ISO 8601 format
}

predictions %>% 
  filter(year == 2024) %>% 
  mutate(prediction = round(prediction),
         lower = floor(lower),
         upper = ceiling(upper),
         prediction_date = doy_to_date(year, prediction))
# A tibble: 4 × 6
  location      year prediction lower upper prediction_date
  <chr>        <int>      <dbl> <dbl> <dbl> <chr>          
1 washingtondc  2024         90    76   103 2024-03-30     
2 liestal       2024         94    80   107 2024-04-03     
3 kyoto         2024         94    81   107 2024-04-03     
4 vancouver     2024        108    77   139 2024-04-17     

Extrapolating to Vancouver, BC

For the cherry trees in Vancouver, BC, few historical observations are available. This shows in the simple analysis above in the very wide prediction interval. The trees are located approximately at 49.2236916°N (latitude), -123.1636251°E (longitude), 24 meters above sea levels (altitude). Casual observations have been recorded in the way of photos posted to the VCBF Neighbourhood Blog for Kerrisdale. You can search the forum for the keywords “Akebono” (i.e., the name of the cultivar) and “Maple Grove Park” (i.e., the location of the trees).

We need to extrapolate from what we have learned about the peak bloom dates in the other locations to Vancouver. The simple model we have fitted above, however, does not allow us to transfer any knowledge from the other sites to Vancouver – we have only used the history trend at the respective sites.

Although the climate in Vancouver is different from the other locations, the simplest way to borrow information from the other locations is to average across these three sites. Hence, we want to fit a straight line through the peak bloom dates, ignoring the actual site:

# Fit simple least-squares lines for all sites.
# We use larger weights for observations from vancouver than for the other sites
ls_fit_for_van <- lm(bloom_doy ~ year, data = cherry, subset = year >= 1880,
                     weights = (location == 'vancouver') +
                       0.2 * (location != 'vancouver'))

predictions_vancouver <- tibble(location = 'vancouver', year = 2022:2024) %>% 
  bind_cols(predict(ls_fit_for_van, newdata = ., interval = 'prediction', level = 0.9)) %>% 
  rename(prediction = fit, lower = lwr, upper = upr)

Not surprisingly, the predicted peak bloom date for Vancouver is now very similar to the other 3 sites:

predictions_vancouver
# A tibble: 3 × 5
  location   year prediction lower upper
  <chr>     <int>      <dbl> <dbl> <dbl>
1 vancouver  2022       92.0  85.9  98.0
2 vancouver  2023       91.8  85.8  97.9
3 vancouver  2024       91.7  85.7  97.8

We can check the predictions against the data from previous competition years.

# Plot the predictions alongside the actual observations for 2015 up to 2023.
cherry %>% 
  right_join(predictions_vancouver, by = c('year', 'location')) %>%
  ggplot(aes(x = year, y = prediction, ymin = lower, ymax = upper)) +
  geom_line(linewidth = 1) +
  geom_ribbon(color = 'black', linetype = '22', linewidth = 0.8, fill = NA) +
  geom_point(aes(y = bloom_doy)) +
  scale_x_continuous(breaks = 2022:2024) +
  facet_grid(cols = vars(str_to_title(location))) +
  labs(x = "Year", y = "Peak bloom (days since Jan 1st)")

Predictions and 90% prediction intervals from a simple linear regression model for Vancouver using data from all four sites.

If satisfied with the predictions, we can use them instead of the predictions from before.

predictions <- predictions %>% 
  filter(location != 'vancouver') %>% 
  bind_rows(predictions_vancouver)

Extrapolating to New York City, NY using USA-NPN data

Similar to Vancouver, BC, only few historical observations are available for our location in New York City, NY. There are some historical observations dating back to 2019 in the data provided by USA-NPN. The Washington Square Park has site id 32789 and the Yoshino cherry you should predict has species id 228.

nyc_data_npn <- read_csv("data/USA-NPN_status_intensity_observations_data.csv") %>% 
  filter(Site_ID == 32789,
         Species_ID == 228) %>% 
  mutate(Observation_Date = as_date(Observation_Date, format = '%m/%d/%y'))

This data, however, needs to be transformed as it only contains individual observations of the phenophase and not the actual peak bloom date. For simplicity, we take the first day someone observed the flowers to be open as the peak bloom day. This could be done in a more sophisticated way by also looking at the reported intensity value.

nyc_data <- nyc_data_npn %>% 
  arrange(Observation_Date) %>% 
  mutate(year = year(Observation_Date)) %>% 
  group_by(year) %>% 
  summarize(first_flower_index = min(which(Phenophase_Status == 1)),
            bloom_date = strftime(Observation_Date[first_flower_index], format = '%Y-%m-%d'),
            bloom_doy = Day_of_Year[first_flower_index],
            .groups = 'drop') %>% 
  filter(!is.na(bloom_doy)) %>% 
  select(-first_flower_index) %>% 
  mutate(location = 'newyorkcity')
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `first_flower_index = min(which(Phenophase_Status == 1))`.
ℹ In group 2: `year = 2020`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
cherry_with_nyc <- cherry %>% 
  bind_rows(nyc_data)

For 2020, no bloom was reported, hence the warning.

Using the same steps as for Vancouver, BC, a simple linear model can be fitted.

# Fit simple least-squares lines for all sites.
# We use larger weights for observations from NYC than for the other sites
ls_fit_for_nyc <- lm(bloom_doy ~ year, data = cherry_with_nyc, subset = year >= 1880,
                     weights = (location == 'newyorkcity') +
                       0.2 * (location != 'newyorkcity'))

predictions_nyc <- tibble(location = 'newyorkcity', year = 2019:2024) %>% 
  bind_cols(predict(ls_fit_for_nyc, newdata = ., interval = 'prediction', level = 0.9)) %>% 
  rename(prediction = fit, lower = lwr, upper = upr)

predictions_nyc
# A tibble: 6 × 5
  location     year prediction lower upper
  <chr>       <int>      <dbl> <dbl> <dbl>
1 newyorkcity  2019       92.5  86.5  98.5
2 newyorkcity  2020       92.4  86.4  98.4
3 newyorkcity  2021       92.3  86.3  98.3
4 newyorkcity  2022       92.2  86.2  98.2
5 newyorkcity  2023       92.1  86.1  98.1
6 newyorkcity  2024       92.0  86.0  98.0

We can check the predictions against the data from previous competition years.

# Plot the predictions alongside the actual observations for 2015 up to 2023.
cherry_with_nyc %>% 
  right_join(predictions_nyc, by = c('year', 'location')) %>%
  ggplot(aes(x = year, y = prediction, ymin = lower, ymax = upper)) +
  geom_line(linewidth = 1) +
  geom_ribbon(color = 'black', linetype = '22', linewidth = 0.8, fill = NA) +
  geom_point(aes(y = bloom_doy)) +
  scale_x_continuous(breaks = 2019:2024) +
  facet_grid(cols = vars(str_to_title(location))) +
  labs(x = "Year", y = "Peak bloom (days since Jan 1st)")

Predictions and 90% prediction intervals from a simple linear regression model for Washington Square Park in NYC using data from all five sites.

If satisfied with the predictions, we can use them instead of the predictions from before.

predictions <- predictions %>% 
  filter(location != 'newyorkcity') %>% 
  bind_rows(predictions_nyc)

Preparing the submission file

Once we have the predictions for all four sites, we have to save them in the correct format for the competition.

submission_predictions <- predictions %>% 
  filter(year == 2024) %>% 
  mutate(prediction = round(prediction),
         lower = floor(lower),
         upper = ceiling(upper)) %>% 
  select(-year)

submission_predictions
# A tibble: 5 × 4
  location     prediction lower upper
  <chr>             <dbl> <dbl> <dbl>
1 washingtondc         90    76   103
2 liestal              94    80   107
3 kyoto                94    81   107
4 vancouver            92    85    98
5 newyorkcity          92    85    98

For submission, these predictions must be saved as a CSV file. Important: the CSV file must not have row names, which R adds by default. Specify row.names=FALSE to suppress them:

write.csv(submission_predictions, file = "cherry-predictions.csv",
          row.names = FALSE)

Appendix: Adding Covariates

We encourage you to find additional publicly-available data that will improve your predictions. For example, one source of global meteorological data comes from the Global Historical Climatology Network (GHCN), available in the rnoaa package. The package can also be installed via

install.packages("rnoaa")

and the loaded via

library(rnoaa)

The list of stations can be retrieved using the ghcnd_stations() function. Note that the closest weather station to each city with continuously collected maximum temperatures are USC00186350 (Washington D.C.), GME00127786 (Liestal), JA000047759 (Kyoto), and CA001108395 (Vancouver).

stations <- ghcnd_stations()

As a simple demonstration, we retrieve the average seasonal maximum daily temperature (in 1/10 °C) from these stations using our own get_temperature() function, which wraps the ghcnd_search() function in the rnoaa package. (N.b. ghcnd_search() returns a list. Each element of the list corresponds to an element of the var argument.)

#' Get the annual average maximum temperature at the given station,
#' separated into the 4 meteorological seasons (Winter, Spring, Summer, Fall).
#' 
#' The seasons are span 3 months each.
#' Winter is from December to February, Spring from March to May,
#' Summer from June to August, and Fall from September to November.
#' Note that December is counted towards the Winter of the next year, i.e.,
#' temperatures in December 2020 are accounted for in Winter 2021.
#' 
#' @param stationid the `rnoaa` station id (see [ghcnd_stations()])
#' @return a data frame with columns
#'   - `year` ... the year of the observations
#'   - `season` ... the season (Winter, Spring, Summer, Fall)
#'   - `tmax_avg` ... average maximum temperate in tenth degree Celsius
get_temperature <- function (stationid) {
  ghcnd_search(stationid = stationid, var = c("tmax"), 
               date_min = "1950-01-01", date_max = "2023-01-31")[[1]] %>%
  mutate(year = as.integer(format(date, "%Y")),
         month = as.integer(strftime(date, '%m')) %% 12, # make December "0"
         season = cut(month, breaks = c(0, 2, 5, 8, 11),
                      include.lowest = TRUE,
                      labels = c("Winter", "Spring", "Summer", "Fall")),
         year = if_else(month == 0, year + 1L, year)) %>%
  group_by(year, season) %>%
  summarize(tmax_avg = mean(tmax, na.rm = TRUE))
}

historic_temperatures <-
  tibble(location = "washingtondc", get_temperature("USC00186350")) %>%
  bind_rows(tibble(location = "liestal", get_temperature("GME00127786"))) %>%
  bind_rows(tibble(location = "kyoto", get_temperature("JA000047759"))) %>%
  bind_rows(tibble(location = "vancouver", get_temperature("CA001108395")))

historic_temperatures %>%
  ggplot() + 
  aes(year, tmax_avg) + 
  geom_line() +
  xlim(1950, 2024) +
  labs(x = "Year", y = "Average maximum temperature (1/10 °C)") +
  facet_grid(factor(season) ~ str_to_title(location))

We then predict the peak bloom date in two steps. Step 1 extrapolates average winter and spring temperature until 2032 using simple linear regression. Step 2 predicts the peak bloom date given the extrapolated temperatures, again using simple linear regression.

# Step 1. extrapolate average seasonal maximum temperature
ls_fit_temperature <- lm(tmax_avg ~ year * season + location, 
                         data = historic_temperatures)

temperature_predictions <-
  expand_grid(location = c("washingtondc", "liestal", "kyoto", "vancouver" ),
              season = c("Winter", "Spring", "Summer", "Fall"),
              year = 1950:2024) %>%
  bind_cols(predicted_temperature = 
              predict(ls_fit_temperature, newdata = .)) %>%
  filter(season %in% c("Winter", "Spring")) %>%
  pivot_wider(names_from = season, values_from = predicted_temperature)

# Step 2. predict bloom day from extrapolated temperatures
predictions_temperature <-
  temperature_predictions %>%
  left_join(cherry,
            by = c("location", "year")) %>%
  lm(bloom_doy ~ Spring * Winter, data = .) %>%
  predict(newdata = temperature_predictions) %>%
  round() %>%
  bind_cols(predicted_doy_temperature = ., temperature_predictions)

The following plot shows a comparison of predictions for Vancouver using the two methods described in this demo.

predictions_vancouver %>%
  left_join(predictions_temperature,
            by = c("location", "year")) %>%
  pivot_longer(cols = c("prediction", "predicted_doy_temperature")) %>%
  mutate(name = ifelse(name == "prediction", 
                       "Method 1: location-based model", 
                       "Method 2: temperature-based model")) %>%
  ggplot() +
  aes(x = year, y = value, linetype = name) +
  geom_line() +
  scale_x_continuous(breaks = 2022:2024) +
  labs(x = "Year", linetype = "",
       y = "Predicted peak bloom (days since Jan 1st) for Vancouver") +
  theme(legend.position = "bottom")