Technical introduction
PHIDO fits a set of generalized additive models (GAMs) to the input case counts and outputs an estimate of the expected count (the "fitted counts"). To reduce the effect of outliers on these estimates, PHIDO uses an iterative down-weighting scheme.
Example of input data
Figure 1: sample input data (date vs. count)
Sample code snippet
Relating the code below to Figure 1:
- input$x is the vector of dates in Figure 1
- input$y is the vector of counts in Figure 1
# Install dependencies from CRAN
install.packages( c("gam", "robustbase") )

# Install PHIDO from the shared network drive
install.packages( "//phsabc.ehcnet.ca/root/BCCDC/Groups/Analytics_Resources/Coding/R/Library/PHIDO_0.2.0.tar.gz", repos = NULL )

# renv users: install into the project's renv library instead
install.packages( pkgs = "//phsabc.ehcnet.ca/root/BCCDC/Groups/Analytics_Resources/Coding/R/Library/PHIDO_0.2.0.tar.gz",
                  lib = "//phsabc.ehcnet.ca/root/BCCDC/Groups/<project_folder>/renv/library/<which_R_ver>/x86_64-w64-mingw32",
                  INSTALL_opts = "--no-lock",
                  repos = NULL )
library("PHIDO")
library("dplyr")
library("ggplot2")
data(package = "PHIDO")
input <- PHIDO::high_count_varied
output <- mainPHIDO(x = input$x,
y = input$y,
HighP = 0.001,
MidP = 0.01,
LowP = 0.05)
tail( select(output, x, y, AlertPeriodStart, AlertPeriodEnd, AlertScorePvalue) )
Source: User Guide
Sample outputs & visualization
After plotting the outputs (fitted_counts), one would get a trendline like this:
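For instance, a minimal ggplot2 sketch of such a plot, assuming the output columns x, y, and fitted_counts (verify the column names against your PHIDO version):

library(ggplot2)

# Plot observed counts with the PHIDO trendline overlaid
ggplot(output, aes(x = x)) +
  geom_col(aes(y = y), fill = "grey70") +               # observed counts
  geom_line(aes(y = fitted_counts), colour = "red") +   # fitted trendline
  labs(x = "Date", y = "Count",
       title = "Observed counts and PHIDO fitted counts")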
Mathematics behind PHIDO
PHIDO expresses the mean function $\lambda(t)$ as: $\log( \lambda(t) ) = a + \beta t + f(t) + c(t)$
- The last two terms are non-parametric and are fitted using locally estimated scatterplot smoothing (LOESS)
- $f(t)$ captures the long-term trend; it is excluded if there are fewer than 10 non-zero counts
- $c(t)$ captures the seasonal trend
- The span for $c(t)$ is always 1, i.e. all data points
- The span for $f(t)$ is the smallest section with at least 10 non-zero counts that spans more than 2 years of data
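As a rough illustration of this model form (a sketch, not PHIDO's actual internals), the gam package's lo() loess smoothers can express the two non-parametric terms; the day-of-year seasonal covariate and the fixed f_span value below are assumptions made for this example:

library(gam)

# Hypothetical sketch of the model form above; not PHIDO's implementation
df <- data.frame(
  y   = input$y,
  t   = seq_along(input$y),                 # linear time index for a + beta*t
  doy = as.numeric(format(input$x, "%j"))   # day of year (assumed seasonal covariate)
)
f_span <- 0.5                               # placeholder; PHIDO chooses this adaptively

fit <- gam( y ~ t + lo(t, span = f_span) + lo(doy, span = 1),   # f(t) and c(t)
            family = quasi(link = "log", variance = "mu"),
            data   = df )
fitted_counts <- fitted(fit)                # estimated expected counts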
Iterative down-weighting scheme (IDWS)
By default, PHIDO runs a maximum of 20 iterations of the IDWS. Zero counts are all given an initial weight of 1.0.
Next, the counts are sorted in ascending order^ and the weight for each case count is adjusted one-by-one in this order. For instance, if a jump occurs, e.g. $Y_k > 2 Y_{k-1}$ (or $Y_k > 2$ and $Y_{k-1} = 0$), then the weight for count $k$ is computed as $w_k = 2 Y_{k-1} / Y_k$. More details on the heuristics are presented in Lee's thesis.
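As a hypothetical illustration of the jump heuristic alone (the $Y_{k-1} = 0$ case and the other heuristics from the thesis are omitted), the function name below is invented for this sketch:

# Hypothetical sketch of the initial down-weighting heuristic; not PHIDO's code
calc_init_weights_sketch <- function(y) {
  ord <- order(y)                     # ascending order (see footnote below)
  ys  <- y[ord]
  w   <- rep(1.0, length(ys))         # zeros (and everything else) start at 1.0
  for (k in seq_along(ys)[-1]) {
    if (ys[k - 1] > 0 && ys[k] > 2 * ys[k - 1]) {
      w[k] <- 2 * ys[k - 1] / ys[k]   # w_k = 2 * Y_{k-1} / Y_k
    }
  }
  w[order(ord)]                       # map weights back to the original order
}

calc_init_weights_sketch( c(0, 0, 22, 0, 0, 0, 0, 1, 0) )   # the 22 gets weight 2/22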
The pseudo-algorithm looks like:
w <- calc_init_weights( input$y )                # zeros start at weight 1.0; jumps are down-weighted
for (itn in 1:20) {                              # default maximum of 20 iterations
  m          <- gam( model, family = quasi(link = "log", variance = "mu"), weights = w )
  sigma_sqr  <- dispersion_fn( ... )             # dispersion estimate
  residuals  <- calc_residuals( sigma_sqr, m )
  residuals  <- residuals / sqrt( sigma_sqr + 1e-20 )   # standardize; guard against division by zero
  cond_distr <- calc_cond_distr( input$y, sigma_sqr )
  P1 <- 1 - probBinomialThruBeta_fn( ... )
  P2 <- 1 - pnorm( residuals )                   # upper-tail probability of the standardized residual
  P3 <- pmax( 0.9, P1 * P2 )                     # floor at 0.9 ...
  w  <- w * (1 - P3) * 10                        # ... so P3 == 0.9 leaves w unchanged
}
^ Ascending order would sort [0,0,22,0,0,0,0,1,0] to [0,0,0,0,0,0,0,1,22]
Recap
- Trends from the non-parametric estimates will be disabled under any of the following conditions:
  - Less than 2 years of training data
  - Fewer than 10 non-zero counts per year
  - Fewer than 10 non-zero counts in the entire dataset
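A hypothetical pre-check mirroring these conditions (not part of PHIDO's API; assumes input$x is a Date vector):

# Hypothetical pre-check of the recap conditions; not PHIDO's implementation
years_of_data    <- as.numeric(diff(range(input$x))) / 365.25
nonzero_total    <- sum(input$y > 0)
nonzero_per_year <- nonzero_total / years_of_data
trend_disabled   <- years_of_data < 2 ||
                    nonzero_per_year < 10 ||
                    nonzero_total < 10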