Peak Bloom Prediction Demo
Instructions
In this analysis we demonstrate a simple way of predicting the peak bloom date in 2025 for all five locations required by the competition. The models are simple in that they only use the past bloom dates observed at each location—no other covariates or additional information are considered. At the end of this document (in the appendix), we demonstrate a simple way to get historic temperature data for the five locations via the NOAA web API.
For this demo analysis we are using methods from the tidyverse of R packages. They can be installed via
install.packages('tidyverse')
and then loaded via
library(tidyverse)
Loading the data
The data for the five sites is provided as simple text files in CSV format. Each file contains the dates of the peak bloom of the cherry trees at the respective site, alongside the geographical location of the site.
The seven columns in each data file are

- location: a human-readable location identifier (string).
- lat: (approximate) latitude of the cherry trees (double).
- long: (approximate) longitude of the cherry trees (double).
- alt: (approximate) altitude of the cherry trees (double).
- year: year of the observation (integer).
- bloom_date: date of peak bloom of the cherry trees (ISO 8601 date string). The "peak bloom date" may be defined differently for different sites.
- bloom_doy: days since January 1st of the year until peak bloom (integer). January 1st corresponds to 1.
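The bloom_doy convention can be checked directly in base R using the "%j" day-of-year format (here with Kyoto's 2024 bloom date from the table further below):

```r
# bloom_doy check: "%j" gives the day of the year, with January 1st = 1.
doy <- as.integer(strftime(as.Date("2024-04-04"), format = "%j"))
doy  # 95, matching Kyoto's 2024 bloom_doy
```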
In R, the data files can be read with read.csv() and concatenated with the bind_rows() function:
cherry <- read.csv("data/washingtondc.csv") |>
  bind_rows(read.csv("data/liestal.csv")) |>
  bind_rows(read.csv("data/kyoto.csv")) |>
  bind_rows(read.csv("data/vancouver.csv")) |>
  bind_rows(read.csv("data/nyc.csv"))
For example, the latest 3 observations for each location can be extracted with:
cherry |>
  group_by(location) |>
  slice_tail(n = 3)
# A tibble: 13 × 7
# Groups: location [5]
location lat long alt year bloom_date bloom_doy
<chr> <dbl> <dbl> <dbl> <int> <chr> <int>
1 kyoto 35.0 136. 44 2022 2022-04-01 91
2 kyoto 35.0 136. 44 2023 2023-03-25 84
3 kyoto 35.0 136. 44 2024 2024-04-04 95
4 liestal 47.5 7.73 350 2022 2022-03-26 85
5 liestal 47.5 7.73 350 2023 2023-03-29 88
6 liestal 47.5 7.73 350 2024 2024-03-20 80
7 newyorkcity 40.7 -74.0 8.5 2024 2024-03-28 88
8 vancouver 49.2 -123. 24 2022 2022-03-27 86
9 vancouver 49.2 -123. 24 2023 2023-04-07 96
10 vancouver 49.2 -123. 24 2024 2024-03-23 83
11 washingtondc 38.9 -77.0 0 2022 2022-03-21 80
12 washingtondc 38.9 -77.0 0 2023 2023-03-23 82
13 washingtondc 38.9 -77.0 0 2024 2024-03-17 77
Visualizing the time series
cherry |>
  filter(year >= 1880) |>
ggplot(aes(x = year, y = bloom_doy)) +
geom_point() +
geom_step(linetype = 'dotted', color = 'gray50') +
scale_x_continuous(breaks = seq(1880, 2020, by = 20)) +
facet_grid(cols = vars(str_to_title(location))) +
labs(x = "Year", y = "Peak bloom (days since Jan 1st)")
Predicting the peak bloom
A simple method to predict peak bloom dates in the future is to fit a least-squares line through the observed dates at each location and extrapolate the regression function. We want a separate line for each location, hence we tell R to estimate interaction effects between location and year. We only use data from 1880 onward to fit the trends, as earlier data may be less reliable and less relevant. Since New York City has only one observed bloom date, which is not enough to fit a linear regression model, we omit that site from this simple analysis.
# Fit simple least-squares lines for all sites.
cherry_no_nyc <- cherry |>
  filter(location != 'newyorkcity')

ls_fit <- lm(bloom_doy ~ 0 + location + location:year,
             data = cherry_no_nyc, subset = year >= 1880)
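To see what the 0 + location + location:year parameterization does, here is a toy check on hypothetical data (base R only) that this formula fits one intercept and one slope per group, equivalent to fitting a separate least-squares line within each group:

```r
# Toy data with two groups; the joint fit with 0 + g + g:x should
# reproduce the per-group least-squares fits exactly.
toy <- data.frame(g = factor(rep(c("a", "b"), each = 5)),
                  x = rep(1:5, times = 2),
                  y = c(2, 4, 5, 8, 11, 1, 1, 2, 2, 3))
joint <- lm(y ~ 0 + g + g:x, data = toy)
separate_a <- lm(y ~ x, data = subset(toy, g == "a"))
all.equal(unname(coef(joint)[c("ga", "ga:x")]),
          unname(coef(separate_a)))  # TRUE
```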
These simple linear regression fits suggest a trend toward earlier peak bloom at all sites. We can compute the actual predictions using the predict() function:
# Compute the predictions for all 4 sites
cherry_no_nyc_grid <- expand_grid(location = unique(cherry_no_nyc$location),
                                  year = 1880:2025)

predictions <- cherry_no_nyc_grid |>
  bind_cols(predict(ls_fit, newdata = cherry_no_nyc_grid,
                    interval = 'prediction', level = 0.9)) |>
  rename(prediction = fit, lower = lwr, upper = upr)
# Plot the predictions alongside the actual observations for 2015 up to 2023.
cherry |>
  right_join(predictions, by = c('year', 'location')) |>
  filter((location == 'vancouver' & year >= 2021) |
         (location != 'vancouver' & year >= 2015)) |>
  ggplot(aes(x = year, y = prediction, ymin = lower, ymax = upper)) +
geom_line(linewidth = 1) +
geom_ribbon(color = 'black', linetype = '22', linewidth = 0.8, fill = NA) +
geom_point(aes(y = bloom_doy)) +
scale_x_continuous(breaks = c(2015, 2018, 2021, 2024)) +
facet_grid(cols = vars(str_to_title(location))) +
labs(x = "Year", y = "Peak bloom (days since Jan 1st)")
Based on this very simple model, the peak bloom dates at these sites are:
#' Small helper function to convert the day of year to
#' the actual date.
#'
#' @param year year as an integer
#' @param doy day of the year as integer (1 means January 1st)
#' @return date string
doy_to_date <- function (year, doy) {
  strptime(paste(year, doy, sep = '-'), '%Y-%j') |>  # create date object
    strftime('%Y-%m-%d')  # translate back to date string in ISO 8601 format
}
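For example, day 95 of 2024 (a leap year) maps back to April 4th; the same round trip works standalone in base R:

```r
# Parse "year-doy" with "%Y-%j", then format back to an ISO 8601 date string.
strptime(paste(2024, 95, sep = "-"), "%Y-%j") |>
  strftime("%Y-%m-%d")  # "2024-04-04"
```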
predictions |>
  filter(year == 2025) |>
mutate(prediction = round(prediction),
lower = floor(lower),
upper = ceiling(upper),
prediction_date = doy_to_date(year, prediction))
# A tibble: 4 × 6
location year prediction lower upper prediction_date
<chr> <int> <dbl> <dbl> <dbl> <chr>
1 washingtondc 2025 89 76 102 2025-03-30
2 liestal 2025 93 80 106 2025-04-03
3 kyoto 2025 94 81 107 2025-04-04
4 vancouver 2025 85 62 109 2025-03-26
Extrapolating to Vancouver, BC and New York City
For the cherry trees in Vancouver, BC and New York City, few historical observations are available. This shows in the simple analysis above as very wide prediction intervals. The trees in Vancouver, BC are located at approximately 49.2236916°N (latitude), 123.1636251°W (longitude), 24 meters above sea level (altitude). Casual observations for Vancouver, BC have been recorded in the form of photos posted to the VCBF Neighbourhood Blog for Kerrisdale. You can search the forum for the keywords "Akebono" (i.e., the name of the cultivar) and "Maple Grove Park" (i.e., the location of the trees).
We need to extrapolate from what we have learned about the peak bloom dates at the other locations to Vancouver and NYC. The simple model we have fitted above, however, does not allow us to transfer any knowledge from the other sites – we have only used the historical trend at each respective site.
Although the climate in Vancouver and NYC is different from the other locations, the simplest way to borrow information from the other locations is to average across these three sites. Hence, we want to fit a straight line through the peak bloom dates, ignoring the actual site:
# Fit simple least-squares lines for all sites.
# We use larger weights for observations from Vancouver than for the other sites
ls_fit_for_van <- lm(bloom_doy ~ year, data = cherry, subset = year >= 1880,
                     weights = (location == 'vancouver') +
                       0.2 * (location != 'vancouver'))
vancouver_grid <- tibble(location = 'vancouver', year = 2023:2025)

predictions_vancouver <- vancouver_grid |>
  bind_cols(predict(ls_fit_for_van, newdata = vancouver_grid,
                    interval = 'prediction', level = 0.9)) |>
  rename(prediction = fit, lower = lwr, upper = upr)
Not surprisingly, the predicted peak bloom dates for Vancouver are now very similar to those at the other three sites:
predictions_vancouver
# A tibble: 3 × 5
location year prediction lower upper
<chr> <int> <dbl> <dbl> <dbl>
1 vancouver 2023 91.2 85.1 97.3
2 vancouver 2024 91.1 85.0 97.2
3 vancouver 2025 91.0 84.9 97.0
We can check the predictions against the data from previous competition years.
# Plot the predictions alongside the actual observations from past competition years.
cherry |>
  right_join(predictions_vancouver, by = c('year', 'location')) |>
ggplot(aes(x = year, y = prediction, ymin = lower, ymax = upper)) +
geom_line(linewidth = 1) +
geom_ribbon(color = 'black', linetype = '22', linewidth = 0.8, fill = NA) +
geom_point(aes(y = bloom_doy)) +
scale_x_continuous(breaks = 2022:2024) +
facet_grid(cols = vars(str_to_title(location))) +
labs(x = "Year", y = "Peak bloom (days since Jan 1st)")
If satisfied with the predictions, we can use them instead of the predictions from before.
predictions <- predictions |>
  filter(location != 'vancouver') |>
  bind_rows(predictions_vancouver)
Extrapolating to New York City, NY using USA-NPN data
Similar to Vancouver, BC, only a few historical observations are available for our location in New York City, NY. There are some historical observations dating back to 2019 in the data provided by the USA National Phenology Network (USA-NPN). Washington Square Park has site ID 32789, and the Yoshino cherry you should predict has species ID 228.
nyc_data_npn <- read_csv("data/USA-NPN_status_intensity_observations_data.csv") |>
  filter(Site_ID == 32789,
         Species_ID == 228) |>
  mutate(Observation_Date = as_date(Observation_Date, format = '%m/%d/%y'))
This data, however, needs to be transformed as it only contains individual observations of the phenophase and not the actual peak bloom date. For simplicity, we take the first day someone observed the flowers to be open as the peak bloom day. This could be done in a more sophisticated way by also looking at the reported intensity value.
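The "first open flower per year" rule can be sketched in base R on hypothetical status observations:

```r
# For each year, take the earliest day of year on which an open
# flower (phenophase status 1) was reported.
obs <- data.frame(year   = c(2023, 2023, 2023, 2024, 2024),
                  doy    = c(80, 86, 92, 75, 83),
                  status = c(0, 1, 1, 0, 1))
first_bloom <- sapply(split(obs, obs$year),
                      function(d) min(d$doy[d$status == 1]))
first_bloom  # 2023: 86, 2024: 83
```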
nyc_data <- nyc_data_npn |>
  arrange(Observation_Date) |>
  mutate(year = year(Observation_Date)) |>
  group_by(year) |>
  summarize(first_flower_index = min(which(Phenophase_Status == 1)),
            bloom_date = strftime(Observation_Date[first_flower_index], format = '%Y-%m-%d'),
            bloom_doy = Day_of_Year[first_flower_index],
            .groups = 'drop') |>
  filter(!is.na(bloom_doy)) |>
  select(-first_flower_index) |>
  mutate(location = 'newyorkcity')

cherry_with_nyc <- cherry |>
  bind_rows(nyc_data)
For 2020, no bloom was reported, hence the warning.
Using the same steps as for Vancouver, BC, a simple linear model can be fitted.
# Fit simple least-squares lines for all sites.
# We use larger weights for observations from NYC than for the other sites
ls_fit_for_nyc <- lm(bloom_doy ~ year, data = cherry_with_nyc, subset = year >= 1880,
                     weights = (location == 'newyorkcity') +
                       0.2 * (location != 'newyorkcity'))
nyc_grid <- tibble(location = 'newyorkcity', year = 2022:2025)

predictions_nyc <- nyc_grid |>
  bind_cols(predict(ls_fit_for_nyc, newdata = nyc_grid,
                    interval = 'prediction', level = 0.9)) |>
  rename(prediction = fit, lower = lwr, upper = upr)
predictions_nyc
# A tibble: 4 × 5
location year prediction lower upper
<chr> <int> <dbl> <dbl> <dbl>
1 newyorkcity 2022 91.5 85.5 97.5
2 newyorkcity 2023 91.4 85.4 97.4
3 newyorkcity 2024 91.3 85.3 97.3
4 newyorkcity 2025 91.2 85.1 97.2
We can check the predictions against the data from previous competition years.
# Plot the predictions alongside the actual observations from past competition years.
cherry_with_nyc |>
  right_join(predictions_nyc, by = c('year', 'location')) |>
ggplot(aes(x = year, y = prediction, ymin = lower, ymax = upper)) +
geom_line(linewidth = 1) +
geom_ribbon(color = 'black', linetype = '22', linewidth = 0.8, fill = NA) +
geom_point(aes(y = bloom_doy)) +
scale_x_continuous(breaks = 2022:2025) +
facet_grid(cols = vars(str_to_title(location))) +
labs(x = "Year", y = "Peak bloom (days since Jan 1st)")
If satisfied with the predictions, we can use them instead of the predictions from before.
predictions <- predictions |>
  filter(location != 'newyorkcity') |>
  bind_rows(predictions_nyc)
Submission
To submit your entries, enter the predicted bloom dates from your models in the submission form.
predictions |>
  filter(year == 2025) |>
  mutate(predicted_date = strptime(paste(year, prediction), "%Y %j") |>
           as_date())
# A tibble: 5 × 6
location year prediction lower upper predicted_date
<chr> <int> <dbl> <dbl> <dbl> <date>
1 washingtondc 2025 89.0 76.2 102. 2025-03-29
2 liestal 2025 93.2 80.4 106. 2025-04-03
3 kyoto 2025 93.9 81.2 107. 2025-04-03
4 vancouver 2025 91.0 84.9 97.0 2025-03-31
5 newyorkcity 2025 91.2 85.1 97.2 2025-04-01
Appendix: Adding Covariates
We encourage you to find additional publicly available data that will improve your predictions. For example, one source of global meteorological data is the Global Historical Climatology Network (GHCN), available through the NOAA web API.
To use the web API, you first need a web service token. You can request this token (free of charge) via https://www.ncdc.noaa.gov/cdo-web/token. Once you have been issued the token, note it somewhere in your code (or make it available through an environment variable):
NOAA_WEB_API_TOKEN <- '...'
# or
NOAA_WEB_API_TOKEN <- Sys.getenv("NOAA_WEB_API_TOKEN")
To connect to and use the web API you may use the following R packages:
install.packages("httr2")
install.packages("jsonlite")
and then loaded via
library(httr2)
library(jsonlite)
The stations closest to the sites for the competition with continuously collected maximum temperatures are USW00013743 (Washington D.C.), SZ000001940 (Liestal), JA000047759 (Kyoto), CA001108395 (Vancouver), and USW00014732 (New York City):
NOAA_API_BASE_URL <- "https://www.ncei.noaa.gov/cdo-web/api/v2/data"
# Define the station IDs for the specified locations
stations <- c(
  "washingtondc" = "GHCND:USW00013743",
  "vancouver" = "GHCND:CA001108395",
  "newyorkcity" = "GHCND:USW00014732",
  "liestal" = "GHCND:SZ000001940",
  "kyoto" = "GHCND:JA000047759")
As a simple demonstration, we retrieve the daily average and maximum temperatures (datatypes TAVG and TMAX, in °C) from these stations using our own get_daily_avg_temp() function, which queries the NOAA web API via the httr2 package. (N.b. the API returns the matching records as a nested list; the helper function nested_to_tibble() defined below reshapes such a list into a table.)
nested_to_tibble <- function (x) {
  # Determine the variable names in the response
  variable_names <- map(x, names) |>
    unlist(use.names = FALSE) |>
    unique()
  names(variable_names) <- variable_names

  # Reshape the response from a nested list into a table
  map(variable_names, \(i) {
    map(x, \(y) {
      if (is.null(y[[i]])) {
        NA_character_
      } else {
        y[[i]]
      }
    }) |>
      unlist(use.names = FALSE)
  }) |>
    as_tibble()
}
get_daily_avg_temp <- function(station_id, start_date, end_date,
                               api_key, base_url, window_size = 300) {
  windows <- seq(as_date(start_date),
                 as_date(end_date) + days(window_size + 1),
                 by = sprintf("%d days", window_size))

  batches <- map2(windows[-length(windows)], windows[-1] - days(1), \(from, to) {
    if (from > Sys.Date()) {
      return(NULL)
    }
    response <- tryCatch(
      request(base_url) |>
        req_headers(token = api_key) |>
        req_url_query(
          datasetid = "GHCND",
          stationid = station_id,
          datatypeid = "TAVG,TMAX",
          startdate = from,
          enddate = min(as_date(to), Sys.Date()),
          units = "metric",
          limit = 1000
        ) |>
        req_retry(max_tries = 10) |>
        req_perform() |>
        resp_body_json(),
      httr2_http = \(cnd) {
        rlang::warn(sprintf("Failed to retrieve data for station %s in time window %s--%s",
                            station_id, from, to),
                    parent = cnd)
        NULL
      })
  })

  map(batches, \(x) nested_to_tibble(x$results)) |>
    list_rbind() |>
    mutate(date = as_date(date))
}
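The windowing scheme above can be illustrated on a hypothetical date range (base R only, no lubridate): the range is split into consecutive window_size-day windows so each request stays within the API's 1000-record limit.

```r
# Split a date range into consecutive 300-day request windows,
# mirroring the `windows` computation in get_daily_avg_temp().
window_size <- 300
windows <- seq(as.Date("2020-01-01"),
               as.Date("2021-01-01") + window_size + 1,
               by = sprintf("%d days", window_size))
batches <- data.frame(from = windows[-length(windows)],
                      to   = windows[-1] - 1)
batches  # two windows: 2020-01-01--2020-10-26 and 2020-10-27--2021-08-22
```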
historic_temperatures <- cherry |>
  group_by(location) |>
  summarize(start_date = sprintf('%d-01-01', pmax(1970, min(year)) - 1)) |>
  left_join(tibble(location = names(stations),
                   station_id = stations),
            by = 'location') |>
  group_by(location) |>
  group_modify(\(x, gr) {
    get_daily_avg_temp(station_id = x$station_id,
                       start_date = x$start_date,
                       end_date = Sys.Date(),
                       api_key = NOAA_WEB_API_TOKEN,
                       base_url = NOAA_API_BASE_URL)
  })
historic_temperatures |>
  filter(datatype == 'TMAX') |>
ggplot(aes(x = date, y = value)) +
geom_line() +
labs(x = "Year", y = "Average maximum temperature (°C)") +
facet_grid(rows = vars(location))
A simple model may take the average maximum winter temperature (Dec. 1st of the previous year until the end of February) into account:
avg_winter_temp <- historic_temperatures |>
  filter(datatype == 'TMAX') |>
  mutate(year = case_when(
    month(date) < 3 ~ year(date),
    month(date) == 12 ~ year(date) + 1,
    TRUE ~ NA_integer_
  )) |>
  filter(!is.na(year), year >= 1970) |>
group_by(location, year) |>
summarize(avg_tmax = mean(value),
.groups = 'drop')
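The December-to-next-year mapping in the case_when() above can be double-checked with a small base-R equivalent (winter_year() is a hypothetical helper, not part of the analysis):

```r
# A winter peaking in early 2000 spans Dec 1999 through Feb 2000, so
# December observations are assigned to the following calendar year.
winter_year <- function(date) {
  m <- as.integer(strftime(date, "%m"))
  y <- as.integer(strftime(date, "%Y"))
  ifelse(m < 3, y, ifelse(m == 12, y + 1, NA_integer_))
}
winter_year(as.Date(c("1999-12-15", "2000-01-10", "2000-06-01")))  # 2000 2000 NA
```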
avg_winter_temp |>
  ggplot(aes(x = year, y = avg_tmax)) +
geom_line() +
labs(x = "Year", y = "Average maximum temperature (°C)") +
facet_grid(rows = vars(location))
Using these average temperatures, we can again predict the peak bloom date using linear regression with location-specific temporal trends and a global effect of the average winter temperature.
ls_fit_with_temp <- cherry |>
  inner_join(avg_winter_temp,
             by = c("location", "year")) |>
  lm(formula = bloom_doy ~ year * location + avg_tmax)

cherry_grid <- expand_grid(location = unique(cherry$location),
                           year = 1990:2025) |>
  inner_join(avg_winter_temp,
             by = c("location", "year"))

predictions_from_temp <- cherry_grid |>
  mutate(pred_bloom = predict(ls_fit_with_temp, newdata = cherry_grid))
predictions_from_temp |>
  left_join(cherry,
            by = c("location", "year")) |>
ggplot(aes(x = year)) +
geom_point(aes(y = bloom_doy)) +
geom_line(aes(y = pred_bloom)) +
facet_grid(rows = vars(location))
The following plot shows a comparison of predictions for Vancouver using the two methods described in this demo.
predictions_vancouver |>
  left_join(predictions_from_temp,
by = c("location", "year")) |>
select(year, pred_temporal = prediction, pred_temp = pred_bloom) |>
pivot_longer(cols = -year) |>
mutate(name = if_else(name == "pred_temporal",
"Method 1: location-based model",
"Method 2: temperature-based model")) |>
ggplot() +
aes(x = year, y = value, linetype = name) +
geom_line() +
scale_x_continuous(breaks = 2022:2024) +
labs(x = "Year", linetype = "",
y = "Predicted peak bloom (days since Jan 1st) for Vancouver") +
theme(legend.position = "bottom")