Vignette 2 of 4 for the pep725 R package. The most current version is available on GitHub and CRAN.

Introduction

Data quality assessment is a critical but often overlooked step in phenological analysis. Before calculating trends, normals, or climate sensitivities, you need to understand what’s in your data and make informed decisions about unusual observations.

The Data Quality Challenge

Long-term phenological datasets such as PEP725 are compiled from many observers over decades. This richness enables large-scale analyses, but it also introduces data quality challenges that need to be assessed:

Issue Type	Examples	Consequence if Ignored
Recording errors	Typographical errors (DOY 150 instead of 105), incorrect year	Outliers bias statistics
Identification errors	Confusion between morphologically similar species	Inconsistent or mixed time series
Protocol changes	Changes in the definition of phenological phases over time	Artificial shifts or trends
Biological anomalies	Irregular phenological events as a result of climate extremes	May represent real biological signals

The following sections introduce practical tools in the pep725 package for diagnosing and handling these issues in a transparent and reproducible way.

What You’ll Learn

Section	Topic	Key Functions
Part 1	Detecting statistical outliers	`pep_flag_outliers()`, `pep_plot_outliers()`
Part 2	Temporal coverage and completeness	`pep_completeness()`
Part 3	Phase presence validation	`pep_check_phases()`
Part 4	Integrated workflow	Combining all approaches

Prerequisites

This vignette assumes you are familiar with:

Basic phenological concepts (from “Getting Started” vignette)
Day of Year (DOY) as a timing metric
BBCH phenological phase codes

Setup

library(pep725)
library(data.table)

# Download the synthetic dataset
pep <- pep_download()

# For this vignette, we'll focus on flowering phases
flowering <- pep[phase_id %in% c(60, 65)]
cat("Flowering observations:", nrow(flowering), "\n")
#> Flowering observations: 1310094
cat("Species:", length(unique(flowering$species)), "\n")
#> Species: 36
cat("Year range:", min(flowering$year), "-", max(flowering$year), "\n")
#> Year range: 1868 - 2025

Part 1: Outlier Detection

Why Detect Outliers?

An outlier is an observation that differs substantially from the expected pattern. In phenological data, outliers may indicate:

- Recording errors that should be excluded from analysis
  - A typo: DOY 250 instead of DOY 150 (roughly 3 months difference)
  - Wrong year entered: observation assigned to incorrect season
- Unusual weather events worth investigating
  - Very warm winter causing early flowering
  - Late frost delaying spring phenology
- Second flowering or other abnormal phenological events
  - Plants flowering again in autumn/winter after spring flowering
  - Increasingly observed with climate change

The challenge: You can’t simply delete all outliers. Some are errors, but others are scientifically valuable observations of unusual events.

Statistical Methods for Outlier Detection

The pep_flag_outliers() function provides four statistical methods for identifying potential outliers. Each method has different assumptions and strengths:

Method 1: 30-Day Rule (Simple Threshold)

The simplest approach: flag any observation more than 30 days from the median for that species/phase combination.

# Flag outliers using the default 30-day rule
outliers <- pep_flag_outliers(
  pep = flowering,
  method = "30day",
  by = c("species", "phase_id")
)

print(outliers)
#> Phenological Outlier Detection Results
#> ============================================= 
#> Method: 30day 
#> Threshold: 30 days 
#> Grouping: species, phase_id 
#> 
#> Total observations: 1,310,094
#> Outliers flagged: 26573 (2.03%)
#>   - Early outliers: 15011
#>   - Late outliers: 11562
#> 
#> Deviation summary for outliers:
#>   Min: -167.0 days
#>   Max: 223.0 days
#>   Mean |deviation|: 38.4 days

How it works:

                     Distribution of flowering dates

        ← Early                                    Late →

   ----|----[=======|=======]----|-----
        ^           ^           ^
     -30 days    median      +30 days

   Anything outside the brackets is flagged as an outlier
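To make the rule concrete, here is a minimal base R sketch of the same logic for a single species/phase group. It is an illustration only, not the package's internal code: the flowering dates are invented, and the strict `>` comparison is an assumption based on the wording "more than 30 days".

```r
# Toy flowering dates (day of year) for ONE species/phase group; values are made up
doy <- c(118, 121, 123, 125, 126, 128, 130, 131, 134, 162, 88)

expected_doy <- median(doy)           # reference value for the group
deviation    <- doy - expected_doy    # signed: negative = early, positive = late
is_outlier   <- abs(deviation) > 30   # assumed form of the 30-day rule

# Show only the flagged observations
data.frame(doy, expected_doy, deviation, is_outlier)[is_outlier, ]
```

In real data the same three steps would be repeated within every group defined by `by`, which is why the reference value differs between species and phases.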

Understanding the Output

The function adds new columns to your original data:

Column	What It Contains	How to Use It
`is_outlier`	TRUE/FALSE flag	Filter data: `data[is_outlier == FALSE]`
`deviation`	Days from expected	Prioritize investigation: larger = more extreme
`expected_doy`	Reference DOY (median)	Understand what “normal” looks like

Method 2: MAD (Median Absolute Deviation) - Recommended

The MAD method is robust to the presence of multiple outliers. This is important because if your data contains many outliers, they can distort mean-based methods.

# MAD method: flag if > 3 MAD from median (robust)
outliers_mad <- pep_flag_outliers(flowering, method = "mad", threshold = 3)

Why MAD is recommended:

Scenario	Mean-based method	MAD method
1-2 outliers	Works reasonably	Works well
Many outliers	Outliers inflate SD, may miss more	Stays robust
Asymmetric data	Assumes symmetry	Less distorted by skew, although cutoffs are still symmetric around the median
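The following sketch shows what "3 MAD from the median" means numerically, using base R's `mad()`. Two things are assumptions here rather than documented behaviour of `pep_flag_outliers()`: the invented dates, and the use of R's default scaling constant (1.4826), which makes the MAD comparable to a standard deviation for normally distributed data.

```r
# Toy flowering dates for one species/phase group; values are made up
doy <- c(118, 121, 123, 125, 126, 128, 130, 131, 134, 162, 88)

center <- median(doy)   # robust location
spread <- mad(doy)      # robust scale: 1.4826 * median(|doy - median|)

robust_score <- abs(doy - center) / spread   # distance in "MAD units"
is_outlier   <- robust_score > 3             # threshold = 3

data.frame(doy, robust_score = round(robust_score, 2), is_outlier)
```

Because both the centre and the spread are medians, the two extreme dates barely influence the cutoff that is used to judge them.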

Method 3: IQR (Interquartile Range) - Also Robust

The IQR method uses quartiles and is familiar from boxplot “whiskers”:

# IQR method: flag if outside 1.5 * IQR (standard boxplot rule)
outliers_iqr <- pep_flag_outliers(flowering, method = "iqr", threshold = 1.5)

How it works:

- Q1 = 25th percentile, Q3 = 75th percentile
- IQR = Q3 - Q1
- Outlier if: value < Q1 - 1.5×IQR OR value > Q3 + 1.5×IQR
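The boxplot fences can be computed by hand in a few lines of base R. This is a sketch on invented data; it uses `quantile()` with its default interpolation (type 7), which may or may not match the quantile definition used inside the package.

```r
# Toy flowering dates for one species/phase group; values are made up
doy <- c(118, 121, 123, 125, 126, 128, 130, 131, 134, 162, 88)

q   <- quantile(doy, probs = c(0.25, 0.75), names = FALSE)  # Q1 and Q3
iqr <- q[2] - q[1]

lower <- q[1] - 1.5 * iqr   # lower fence
upper <- q[2] + 1.5 * iqr   # upper fence

is_outlier <- doy < lower | doy > upper
c(Q1 = q[1], Q3 = q[2], lower = lower, upper = upper)
doy[is_outlier]
```

Raising the multiplier (the `threshold` argument in the call above) widens the fences and flags fewer observations.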

Method 4: Z-Score - Sensitive but Less Robust

The z-score method assumes normally distributed data:

# Z-score method: flag if |z| > 3 (assumes normal distribution)
outliers_zscore <- pep_flag_outliers(flowering, method = "zscore", threshold = 3)

Caution: The z-score method is sensitive to existing outliers. If your data has many outliers, they will inflate the standard deviation, making the z-score threshold less effective at detecting them.
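This masking effect is easy to reproduce. The sketch below uses an invented series in which three dates were recorded roughly 100 days too late, and compares a z-score rule with a MAD rule computed in base R. It illustrates the statistical point only and makes no claim about how the package computes either score.

```r
# Ten plausible dates plus three gross errors; all values are made up
clean  <- c(115, 118, 120, 122, 124, 125, 126, 128, 130, 132)
errors <- c(230, 235, 240)
doy    <- c(clean, errors)

# Z-score rule: mean and SD are both pulled towards the errors
z      <- (doy - mean(doy)) / sd(doy)
z_flag <- abs(z) > 3

# MAD rule: median and MAD are hardly affected by the errors
robust   <- abs(doy - median(doy)) / mad(doy)
mad_flag <- robust > 3

c(sd = sd(doy), mad = mad(doy))   # the SD is several times larger than the MAD
data.frame(doy, z = round(z, 2), z_flag, mad_flag)
```

In this example the inflated standard deviation keeps every z-score below 3, so the z-score rule flags nothing, while the MAD rule flags all three erroneous dates.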

Choosing a Method

Method	Best For	Threshold Meaning
`30day`	Simple, interpretable threshold	Absolute days from median
`mad`	Most situations (recommended)	Number of MADs from median
`iqr`	Familiar boxplot-style detection	IQR multiplier
`zscore`	Clean data, normal distribution	Standard deviations

Summary Statistics

summary(outliers)
#> Phenological Outlier Summary
#> ============================================= 
#> 
#> Total observations: 1,310,094
#> Total outliers: 26573 (2.03%)
#> 
#> Groups with most outliers:
#>               species phase_id  n_obs n_outliers pct_outliers mean_dev
#>                <fctr>    <int>  <int>      <int>        <num>    <num>
#>  1: Solanum tuberosum       60 114245       6420         5.62     12.0
#>  2:  Prunus domestica       60 114641       2958         2.58     10.5
#>  3:  Prunus domestica       65  77393       2146         2.77     10.8
#>  4:   Malus domestica       60 205403       1957         0.95      9.1
#>  5:   Malus domestica       65 175706       1770         1.01      9.1
#>  6:    Prunus persica       60  48074       1515         3.15     11.0
#>  7:    Prunus cerasus       60 136588       1227         0.90      9.0
#>  8:    Prunus cerasus       65 112597       1042         0.93      9.0
#>  9:  Prunus armeniaca       60  22692        943         4.16     12.1
#> 10:    Pyrus communis       60  50335        940         1.87      9.5
#> 
#> Deviation distribution (all data):
#>  1%  5% 25% 50% 75% 95% 99% 
#> -31 -21  -8   0   8  20  30

Comparing Outlier Rates Across Species

Different species may have different outlier rates due to observation difficulty or data quality:

# Check outliers for grapevine
vine_flowering <- pep[species == "Vitis vinifera" & phase_id %in% c(60, 65)]

if (nrow(vine_flowering) > 0) {
  outliers_vine <- pep_flag_outliers(
    pep = vine_flowering,
    method = "30day",
    by = c("species", "phase_id")
  )

  cat("Outlier comparison:\n")
  cat("All flowering species: ", round(100 * mean(outliers$is_outlier), 2), "%\n")
  cat("Grapevine only:        ", round(100 * mean(outliers_vine$is_outlier), 2), "%\n")
} else {
  cat("No grapevine flowering data available for comparison.\n")
}
#> Outlier comparison:
#> All flowering species:  2.03 %
#> Grapevine only:         3.68 %
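The comparison above sets one species against the pooled rate. If you prefer to see every species on the same footing, a per-group rate can be tabulated directly from the `species` and `is_outlier` columns. The sketch below uses a small invented table in place of the real flagged data, so the percentages are not PEP725 results.

```r
# Stand-in for a flagged result table; rows and flags are made up
flagged <- data.frame(
  species    = c(rep("Malus domestica", 5), rep("Vitis vinifera", 4)),
  is_outlier = c(FALSE, FALSE, FALSE, FALSE, TRUE,
                 FALSE, TRUE, FALSE, TRUE)
)

# Share of flagged observations per species, in percent
pct <- round(100 * tapply(flagged$is_outlier, flagged$species, mean), 1)
sort(pct, decreasing = TRUE)
```

Species with few observations can show extreme percentages by chance, so it helps to look at the group size alongside the rate.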

Visualizing Outliers

Visual inspection is essential for understanding outlier patterns. The pep_plot_outliers() function provides four visualization types:

1. Overview Plot

Shows the big picture: how many outliers, which species/phases are affected:

# Overview of outlier patterns
pep_plot_outliers(outliers, type = "overview")
#> Warning: Removed 16 rows containing non-finite outside the scale range
#> (`stat_bin()`).

What to look for:

- Which species have the most outliers? (Might indicate data quality issues)
- Are outliers concentrated in certain phases? (Might indicate definition problems)
- What’s the overall outlier rate? (Typical: 1-5% for well-curated data)

2. Seasonal Distribution

When in the year do outliers occur?

# When do outliers occur in the year?
pep_plot_outliers(outliers, type = "seasonal")

Interpretation guide:

Outlier Timing	Likely Explanation
Very early (DOY < 50)	Possible data errors (flowering in January/February unlikely for most species)
Somewhat early	Warm winters or Mediterranean locations
Somewhat late	Cool springs or high-altitude stations
Very late (DOY > 250)	Possible second flowering events - investigate!
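The two numeric cutoffs in this guide can be turned into a simple triage label for flagged observations. The sketch below is a convenience for sorting your review queue, not a package feature. The table gives no numeric boundary for "somewhat early" and "somewhat late", so those cases are lumped into one middle category here, and the DOY values are invented.

```r
# DOY of some flagged observations; values are made up
outlier_doy <- c(12, 45, 95, 160, 262, 301)

# Intervals are (0,49], (49,250], (250,366], i.e. DOY < 50 and DOY > 250 at the ends
triage <- cut(
  outlier_doy,
  breaks = c(0, 49, 250, 366),
  labels = c("very early: check for data error",
             "in season: review in context",
             "very late: possible second flowering")
)

table(triage)
```

The labels only suggest where to look first; the decision to keep or drop an observation still needs the context discussed above.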

3. Detailed Context

See outliers alongside normal observations:

# See outliers in context of all observations
pep_plot_outliers(outliers, type = "detail", n_top = 15)

This plot shows:

- Full distribution of observations (gray)
- Flagged outliers (highlighted)

The n_top parameter controls how many species/phases to show.

4. Geographic Distribution

Where are outliers located?

# Where are outliers located?
pep_plot_outliers(outliers, type = "map")

What to look for:

- Clustered outliers in one region? Might indicate local data quality issues
- Outliers at network edges? Might be at environmental limits
- Random scatter? Suggests individual observation errors

Part 2: Data Completeness

Why Check Completeness?

Data completeness refers to how many years of observations exist for each station/phase combination. This matters because:

Trend analysis requires continuous data: Gaps can bias trend estimates
Normals require representative coverage: WMO recommends 24+ years in a 30-year period
Missing data isn’t random: Stations often drop out or are added systematically

Visualizing Completeness Issues

Station A: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●  <- 100% complete
           1990                      2020

Station B: ●●●●●●●●●●○○○○○○○○○○●●●●●●●●●●  <- Gap in middle
           1990                      2020

Station C: ○○○○○○○○○○○○○○○○○○○○●●●●●●●●●●  <- Only recent data
           1990                      2020

Station D: ●●●●●●●●●●●●●●●●●●●●○○○○○○○○○○  <- Discontinued
           1990                      2020

Different completeness patterns have different implications:

Pattern	Implications for Analysis
Full coverage	Ideal - reliable trends and normals
Gap in middle	Careful - may miss important changes
Only recent	Cannot compare to historical period
Discontinued	Cannot assess recent changes
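The four patterns can be told apart with two numbers per station: how many years have data, and how long the longest run of missing years is. The base R sketch below encodes stations A to D from the diagram, assuming each dot stands for one year in 1991-2020. The helper `describe()` is hypothetical and only serves this illustration.

```r
period <- 1991:2020   # 30-year reference period (assumed)

# Years with observations for the four schematic stations
stations <- list(
  A = 1991:2020,                  # full coverage
  B = c(1991:2000, 2011:2020),    # gap in the middle
  C = 2011:2020,                  # only recent data
  D = 1991:2010                   # discontinued
)

# Hypothetical helper: coverage and longest run of missing years
describe <- function(years, period) {
  present <- period %in% years
  runs    <- rle(present)
  missing_runs <- runs$lengths[!runs$values]
  data.frame(
    n_years             = sum(present),
    pct_of_period       = round(100 * mean(present), 1),
    longest_missing_run = if (length(missing_runs) > 0) max(missing_runs) else 0L
  )
}

patterns <- do.call(rbind, lapply(stations, describe, period = period))
patterns
```

Stations B and D have the same number of years, but the position of the missing run differs, which is exactly what a single percentage cannot show.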

Assessing Completeness

# Check completeness by station and phase
# Use year_range to focus on a specific period
completeness <- pep_completeness(
  pep = flowering,
  by = c("s_id", "phase_id"),
  year_range = c(1990, 2020)
)

print(completeness)
#> Phenological Data Completeness Assessment
#> -------------------------------------------------- 
#> Grouping by: s_id, phase_id
#> Year range: 1990-2020
#> Total groups: 8390
#> Mean completeness: 95.1%
#> Groups with >80% completeness: 7686 (91.6%)
#> -------------------------------------------------- 
#> 
#>       s_id phase_id n_obs n_stations n_years year_min year_max year_span
#>     <fctr>    <int> <int>      <int>   <int>    <int>    <int>     <int>
#>  1:      1       60    33          1      10     1990     1999        10
#>  2:      1       65    25          1       8     1990     1998         9
#>  3:      3       60    30          1      17     1990     2006        17
#>  4:      3       65    27          1      16     1991     2006        16
#>  5:      4       60    62          1      26     1990     2015        26
#>  6:      4       65    55          1      26     1990     2015        26
#>  7:      5       60    10          1       6     1990     1995         6
#>  8:      5       65     7          1       4     1991     1994         4
#>  9:      9       60     6          1       3     1990     1992         3
#> 10:      9       65     4          1       3     1990     1992         3
#> 11:     11       60    16          1       7     1990     1996         7
#> 12:     11       65    15          1       7     1990     1996         7
#> 13:     13       60     3          1       3     1990     1995         6
#> 14:     13       65     4          1       4     1993     1998         6
#> 15:     15       60    27          1      10     1994     2012        19
#>     completeness_pct median_doy iqr_doy
#>                <num>      <num>   <num>
#>  1:            100.0      129.0    13.0
#>  2:             88.9      137.0    12.0
#>  3:            100.0      116.0    13.8
#>  4:            100.0      125.0     8.0
#>  5:            100.0      119.0    17.0
#>  6:            100.0      123.0    12.5
#>  7:            100.0      116.0    13.5
#>  8:            100.0      129.0     6.5
#>  9:            100.0      121.5    21.0
#> 10:            100.0      126.0    22.5
#> 11:            100.0      121.5    10.0
#> 12:            100.0      134.0     7.0
#> 13:             50.0      134.0    11.0
#> 14:             66.7      133.0     2.8
#> 15:             52.6      123.0    11.0
#> 
#> ... and 8375 more rows

Understanding Completeness Metrics

The function calculates several metrics:

Metric	What It Measures	Interpretation
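`n_years`	Number of distinct years with data	More = better; read together with `completeness_pct`
`completeness_pct`	% of the group's own year span that has data	50%+ as a minimum for trends (70%+ preferred), 80%+ for normals; a short series without gaps still scores 100%
`year_span`	Total span (year_max - year_min + 1)	Context for completeness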
`year_min`	Earliest observation	Needed for historical comparison
`year_max`	Most recent observation	Needed for current status
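The printed values are consistent with `completeness_pct` being `n_years` divided by `year_span`, that is, relative to each group's own first and last year rather than to the requested `year_range`. The sketch below recomputes three rows copied from the output above (stations 1, 13 and 15, phase 60) and adds a coverage figure relative to the full 1990-2020 window. The column `pct_of_window` is an illustrative addition, not a package output.

```r
# Values copied from the printed completeness table (phase 60)
rows <- data.frame(
  s_id     = c(1, 13, 15),
  n_years  = c(10, 3, 10),
  year_min = c(1990, 1990, 1994),
  year_max = c(1999, 1995, 2012)
)

rows$year_span        <- rows$year_max - rows$year_min + 1
rows$completeness_pct <- round(100 * rows$n_years / rows$year_span, 1)

# Illustrative addition: coverage relative to the requested 31-year window
rows$pct_of_window <- round(100 * rows$n_years / length(1990:2020), 1)

rows
```

Station 1 reaches 100% completeness with only ten years of data, so a filter on `completeness_pct` is usually best combined with a minimum `n_years`.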

Summary Statistics

summary(completeness)
#> Phenological Data Completeness Summary
#> ================================================== 
#> 
#> Total groups: 8390
#> Total observations: 441,295
#> Total unique stations: 8,390
#> 
#> Completeness Distribution:
#>   0-20%:   33 (  0.4%) 
#>   20-40%:   99 (  1.2%) 
#>   40-60%:  224 (  2.7%) *
#>   60-80%:  413 (  4.9%) *
#>   80-100%: 7621 ( 90.8%) ******************
#> 
#> Year Coverage Statistics:
#>   Mean year span: 14.1 years
#>   Mean years with data: 13.2
#>   Earliest data: 1990
#>   Latest data: 2020
#> 
#> Observation Statistics:
#>   Mean observations per group: 52.6
#>   Median observations per group: 38
#>   Max observations: 346
#> 
#> Phenology Statistics:
#>   DOY range: 6 - 304

Filtering by Completeness

Use completeness information to select stations appropriate for your analysis:

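# Get station/phase groups with good coverage (>= 70%)
good_coverage <- completeness[completeness_pct >= 70]
# Note: each row is one station/phase group, so nrow() counts groups, not unique stations
cat("Stations with >= 70% coverage:", nrow(good_coverage), "\n")
#> Stations with >= 70% coverage: 7893

# Use these for trend analysis
# Note: filtering on s_id alone keeps every phase and year recorded at these stations
good_stations <- unique(good_coverage$s_id)
flowering_complete <- flowering[s_id %in% good_stations]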
cat("Observations from complete stations:", nrow(flowering_complete), "\n")
#> Observations from complete stations: 935551
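Because completeness was assessed per station and phase, a station can pass for one phase and fail for another. Filtering on `s_id` alone then keeps both. The sketch below contrasts that with a filter on both keys, using base R `merge()` on small invented tables; the column names follow the output above.

```r
# Toy observations: station 1 and 2, phases 60 and 65; values are made up
obs <- data.frame(
  s_id     = c(1, 1, 1, 2, 2),
  phase_id = c(60, 65, 60, 60, 65),
  year     = c(2001, 2001, 2002, 2001, 2001),
  doy      = c(120, 127, 118, 131, 139)
)

# Groups that passed the completeness threshold (assumed for illustration)
good <- data.frame(s_id = c(1, 2), phase_id = c(60, 65))

by_station <- obs[obs$s_id %in% good$s_id, ]               # keeps all 5 rows
by_group   <- merge(obs, good, by = c("s_id", "phase_id")) # keeps the 3 matching rows

nrow(by_station)
nrow(by_group)
```

With data.table objects, an inner join such as `flowering[good_coverage, on = c("s_id", "phase_id"), nomatch = NULL]` achieves the same group-level filter, although it also carries over the completeness columns.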

Completeness Thresholds for Different Analyses

Analysis Type	Minimum Completeness	Reasoning
Trend detection	50% (15+ years)	Need enough points for trend fitting
Climate sensitivity	60%	Need variability across climate conditions
30-year normals	80% (24+ years)	WMO standard requirement
Station comparison	Same time period	Avoid bias from different eras

Visualizing Completeness

plot(completeness)

Part 3: Phase Presence Validation

Why Check Phase Presence?

Before starting an analysis, it’s important to verify that your data contains the phenological phases you need. The pep_check_phases() function validates that expected BBCH phase codes are present in your data.

Common questions this helps answer:

Does my dataset have flowering observations (phase 60)?
Are flowering and fruit maturity phases both available for apple?
Which phases are missing for my species of interest?

Checking Phase Presence

# Check if expected phases are present for apple
apple <- pep[species == "Malus domestica"]

phase_check <- pep_check_phases(
  pep = apple,
  expected = c(60, 65, 87)  # flowering, full flowering, fruit maturity
)

print(phase_check)
#> Phenological Phase Availability Check
#> --------------------------------------------- 
#> 
#> Expected phases: 60, 65, 87
#> Status: COMPLETE - all expected phases present
#> 
#> Phases found:
#>   Phase   0: 1 observations [extra]
#>   Phase   7: 33 observations [extra]
#>   Phase  10: 1,394 observations [extra]
#>   Phase  11: 700 observations [extra]
#>   Phase  15: 159 observations [extra]
#>   Phase  57: 26 observations [extra]
#>   Phase  59: 25 observations [extra]
#>   Phase  60: 205,403 observations
#>   Phase  61: 1,055 observations [extra]
#>   Phase  63: 80 observations [extra]
#>   Phase  65: 175,706 observations
#>   Phase  67: 29 observations [extra]
#>   Phase  69: 170,323 observations [extra]
#>   Phase  81: 151 observations [extra]
#>   Phase  85: 93 observations [extra]
#>   Phase  87: 253,169 observations
#>   Phase  89: 17 observations [extra]
#>   Phase  91: 80 observations [extra]
#>   Phase  93: 617 observations [extra]
#>   Phase  95: 76,453 observations [extra]
#>   Phase  97: 2,470 observations [extra]
#>   Phase 100: 225 observations [extra]
#>   Phase 200: 3 observations [extra]
#>   Phase 201: 517 observations [extra]
#>   Phase 203: 1 observations [extra]
#>   Phase 205: 7,912 observations [extra]
#>   Phase 209: 2,358 observations [extra]
#>   Phase 213: 1 observations [extra]
#>   Phase 380: 208 observations [extra]
#>   Phase 381: 7 observations [extra]
#>   Phase 385: 95 observations [extra]
#> 
#> Detailed coverage:
#>     phase_id  n_obs n_stations n_years year_min year_max
#>        <int>  <int>      <int>   <int>    <int>    <int>
#>  1:        0      1          1       1     2022     2022
#>  2:        7     33          5      11     2014     2025
#>  3:       10   1394        258      26     2000     2025
#>  4:       11    700        157      65     1961     2025
#>  5:       15    159         53      14     2009     2025
#>  6:       57     26          1       7     1942     1949
#>  7:       59     25          4      14     1949     2025
#>  8:       60 205403       7947     100     1926     2025
#>  9:       61   1055        192      69     1936     2025
#> 10:       63     80          5      39     1986     2025
#> 11:       65 175706       7272     125     1896     2025
#> 12:       67     29          5      10     2016     2025
#> 13:       69 170323       7192      86     1937     2025
#> 14:       81    151          5      54     1952     2021
#> 15:       85     93         19       8     2013     2020
#> 16:       87 253169       7504      86     1928     2024
#> 17:       89     17          5      10     2000     2024
#> 18:       91     80         26      13     2013     2025
#> 19:       93    617         76      55     1970     2024
#> 20:       95  76453       4120      78     1944     2024
#> 21:       97   2470        368      32     1966     2024
#> 22:      100    225         18      64     1951     2024
#> 23:      200      3          2       2     2016     2017
#> 24:      201    517         64      55     1970     2024
#> 25:      203      1          1       1     2016     2016
#> 26:      205   7912        779      79     1943     2025
#> 27:      209   2358        357      25     1966     2014
#> 28:      213      1          1       1     2012     2012
#> 29:      380    208         10      63     1961     2024
#> 30:      381      7          4       2     2023     2024
#> 31:      385     95         37      13     2013     2025
#>     phase_id  n_obs n_stations n_years year_min year_max
#>        <int>  <int>      <int>   <int>    <int>    <int>

Understanding the Output

The function returns information about phase coverage:

Output	What It Shows
`expected`	The phases you asked for
`present`	Which expected phases were found
`missing`	Which expected phases are absent
`complete`	TRUE if all expected phases are present
`n_obs`	Number of observations per phase
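Conceptually, these outputs are set operations between the phases you expect and the phases that occur in the data. The base R sketch below reproduces that idea on an invented vector of phase codes. It is a conceptual illustration and does not claim to mirror the internals of `pep_check_phases()`.

```r
# Phase codes found in a toy dataset; values are made up
observed_phases <- c(10, 11, 60, 60, 65, 69, 95)
expected        <- c(60, 65, 87)

present  <- intersect(expected, observed_phases)          # expected and found
missing  <- setdiff(expected, observed_phases)            # expected but absent
extra    <- setdiff(unique(observed_phases), expected)    # found but not asked for
complete <- length(missing) == 0

list(present = present, missing = missing, extra = extra, complete = complete)
```

Here phase 87 is missing, so an analysis that needs fruit maturity could not proceed with this toy dataset.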

Checking Multiple Species

To check phase presence across multiple species at once:

# Check phases for multiple species
multi_check <- pep_check_phases_multi(
  pep = pep,
  species_list = c("Malus domestica", "Vitis vinifera"),
  expected = c(60, 65, 87)
)

print(multi_check)
#> Multi-Species Phase Availability Check
#> ================================================== 
#> Expected phases: 60, 65, 87
#> 
#> Species with all phases: 2 / 2 (100%)
#> -------------------------------------------------- 
#> 
#>            species complete n_expected n_present n_missing missing_phases
#>             <char>   <lgcl>      <int>     <int>     <int>         <char>
#> 1: Malus domestica     TRUE          3         3         0           <NA>
#> 2:  Vitis vinifera     TRUE          3         3         0           <NA>
#>    total_obs
#>        <int>
#> 1:    899311
#> 2:    146667
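A table of this kind can also be built by hand, which is handy if you want a different layout or extra columns. The sketch below splits an invented observation table by species and lists the expected phases each species lacks. The result only loosely mimics the printed summary and is not the package's implementation.

```r
# Toy observations; species/phase combinations are made up
obs <- data.frame(
  species  = c("Malus domestica", "Malus domestica", "Malus domestica",
               "Vitis vinifera", "Vitis vinifera"),
  phase_id = c(60, 65, 87, 60, 65)
)
expected <- c(60, 65, 87)

# Expected phases that are absent, per species
missing_by_species <- lapply(
  split(obs$phase_id, obs$species),
  function(p) setdiff(expected, p)
)

summary_tbl <- data.frame(
  species        = names(missing_by_species),
  n_missing      = lengths(missing_by_species),
  missing_phases = vapply(
    missing_by_species,
    function(m) if (length(m) > 0) paste(m, collapse = ", ") else NA_character_,
    character(1)
  ),
  row.names = NULL
)
summary_tbl
```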

Common Phases to Check

Plant Type	Common Phases	BBCH Codes
Cereals	Beginning of flowering, Full flowering, Harvest	60, 65, 100
Fruit trees	Flowering, Full flowering, Fruit maturity	60, 65, 87
Deciduous trees	Leaf unfolding, Flowering, Leaf fall	11, 60, 95

Part 4: Integrated Quality Workflow

Putting It All Together

Here’s a recommended workflow for data quality assessment that combines all the tools covered in this vignette:

# ══════════════════════════════════════════════════════════════════════════════
# STEP 1: Assess temporal completeness
# ══════════════════════════════════════════════════════════════════════════════
# Why: Incomplete stations can bias trend estimates and normals

completeness <- pep_completeness(flowering, by = c("s_id", "phase_id"))
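# s_id appears once per station/phase group, so this vector can contain repeated
# IDs and length() below counts groups; wrap in unique() to count distinct stations
good_stations <- completeness[completeness_pct >= 50, s_id]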
fl_filtered <- flowering[s_id %in% good_stations]

cat("Kept", length(good_stations), "stations with >= 50% completeness\n")
#> Kept 15213 stations with >= 50% completeness

# ══════════════════════════════════════════════════════════════════════════════
# STEP 2: Flag statistical outliers
# ══════════════════════════════════════════════════════════════════════════════
# Why: Identify observations that deviate from expected patterns

outliers_wf <- pep_flag_outliers(fl_filtered, method = "mad", threshold = 3)

cat("Flagged", sum(outliers_wf$is_outlier), "outliers",
    "(", round(100 * mean(outliers_wf$is_outlier), 1), "% )\n")
#> Flagged 20449 outliers ( 1.6 % )

# ══════════════════════════════════════════════════════════════════════════════
# STEP 3: Make informed decisions about exclusion
# ══════════════════════════════════════════════════════════════════════════════
# Key principle: Document your decisions!

# Option A: Strict cleaning (for normals calculation)
fl_strict <- outliers_wf[is_outlier == FALSE]

# Option B: Moderate cleaning (keep moderate outliers)
fl_moderate <- outliers_wf[is_outlier == FALSE | abs(deviation) < 60]

cat("Strict cleaning keeps:", nrow(fl_strict), "obs\n")
#> Strict cleaning keeps: 1252610 obs
cat("Moderate cleaning keeps:", nrow(fl_moderate), "obs\n")
#> Moderate cleaning keeps: 1272698 obs

# ══════════════════════════════════════════════════════════════════════════════
# STEP 4: Proceed with analysis on cleaned data
# ══════════════════════════════════════════════════════════════════════════════

normals <- pheno_normals(fl_moderate, period = 1991:2020, min_years = 5)
#> Note: 44 group(s) have fewer than 5 years of data and return NA values.
cat("Normals calculated for", sum(!is.na(normals$mean_doy)), "groups\n")
#> Normals calculated for 202 groups

Documentation Template

When publishing research, document your quality control decisions:

Data Quality Control:
1. Completeness filtering: Excluded stations with < 50% temporal coverage
   (reduced dataset from N to M stations)

2. Outlier detection: Used MAD method with threshold = 3
   (flagged X observations as outliers, Y% of total)

3. Outlier disposition:
   - Excluded: Z observations with deviations > 60 days (likely errors)
   - Retained: W observations identified as abnormal events (e.g., second flowering)

4. Phase validation: Checked that the expected BBCH phases were present for [species list]
   (identified V species with missing phases)
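Filling in the placeholders is less error-prone if the counts are collected while the workflow runs and the text is generated from them. The sketch below shows one way to do that with `sprintf()`. All numbers are invented placeholders, and the `qc` list is a hypothetical structure, not something the package produces.

```r
# Counts you would collect during the workflow; these numbers are made up
qc <- list(
  n_stations_before = 120L,
  n_stations_after  = 95L,
  n_obs             = 4000L,
  n_flagged         = 64L,
  n_excluded        = 20L
)

report <- c(
  sprintf("1. Completeness filtering: kept %d of %d stations (>= 50%% coverage)",
          qc$n_stations_after, qc$n_stations_before),
  sprintf("2. Outlier detection: MAD, threshold = 3; flagged %d observations (%.1f%% of total)",
          qc$n_flagged, 100 * qc$n_flagged / qc$n_obs),
  sprintf("3. Outlier disposition: excluded %d, retained %d",
          qc$n_excluded, qc$n_flagged - qc$n_excluded)
)

writeLines(report)
```

Storing the `qc` list with your results keeps the reported numbers in step with the data that was actually analysed.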

Best Practices Summary

For Outlier Detection

Use robust methods: MAD or IQR methods handle multiple outliers better than z-scores. The MAD method is generally recommended.
Check seasonally: Late-season outliers (DOY > 250) may be biologically meaningful events (e.g. second flowering) rather than errors.
Verify extremes: For observations with very large deviations (> 60 days), try to check original data sources or contact data providers.
Document decisions: Record which outliers you exclude and why. This is essential for reproducibility and peer review.
Don’t delete automatically: Human judgment is needed to distinguish errors from genuine unusual events.

For Abnormal Event Detection

Consider species biology: Some species (e.g., certain Rosaceae) are more prone to second flowering than others.
Check geographic patterns: If multiple nearby stations report late events in the same year, this suggests a real regional phenomenon.
Look at weather context: Link abnormal events to weather extremes - drought is a common trigger.
Distinguish from errors: Isolated observations (single station, single year) need verification before they are treated as genuine events such as second flowering.
Keep for separate analysis: Abnormal phenological events are scientifically interesting (they may be indicators of climate change) - don’t just delete them!
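The geographic check can be approximated by counting how many distinct stations report a late event in the same region and year. The sketch below assumes you have a `region` column (hypothetical, for example derived from country or coordinates) and uses an arbitrary cutoff of three stations. Both are illustrative choices, and the data are invented.

```r
# Late-season outliers with a hypothetical region label; values are made up
late <- data.frame(
  region = c("north", "north", "north", "south", "south"),
  year   = c(2018, 2018, 2018, 2003, 2018),
  s_id   = c(101, 102, 103, 250, 311)
)

# Number of distinct stations reporting a late event per region and year
counts <- aggregate(s_id ~ region + year, data = late,
                    FUN = function(s) length(unique(s)))
names(counts)[names(counts) == "s_id"] <- "n_stations"

# Arbitrary rule of thumb: three or more stations support a regional event
counts$corroborated <- counts$n_stations >= 3
counts
```

Events reported by a single station are not necessarily wrong, but they are the ones worth verifying first.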

For Completeness Assessment

Set appropriate thresholds: Use 80%+ coverage for calculating normals, 50%+ for trend analysis.
Consider gap patterns: Many small gaps are usually better than one long gap that might coincide with important climate changes.
Check temporal bias: Ensure you’re not comparing stations with data only from early periods to stations with only recent data.
Document station selection: Report how many stations met your completeness criteria and how this affected your sample.

For Phase Presence Checking

Check early: Run pep_check_phases() at the start of your analysis to verify required phases are available in your data.
Know required phases: Different analyses need different phases - know which BBCH codes you need before starting.
Check across species: Use pep_check_phases_multi() to verify data availability for all species you plan to analyze.

Summary

This vignette covered data quality tools for phenological analysis:

Function	Purpose	Key Output
`pep_flag_outliers()`	Identify unusual observations	`is_outlier` flag, `deviation` in days
`pep_plot_outliers()`	Visualize outlier patterns	Four plot types: overview, seasonal, detail, map
`pep_completeness()`	Assess temporal coverage	Completeness %, gaps, year range
`pep_check_phases()`	Check phase presence	Missing phases, observation counts

Key Take-Home Messages

Data quality assessment is not optional - it’s a critical step that affects the validity of all downstream analyses.
Not all outliers are errors - some represent genuine biological phenomena like second flowering that deserve further study.
Document your decisions - quality control choices should be transparent and reproducible.
Use robust methods - the MAD method is recommended for phenological data because it handles multiple outliers well.
Visual inspection matters - plots often reveal patterns that summary statistics miss.

Next Steps

Explore the other vignettes for complementary analyses:

Getting Started: Data access, the pep class, and basic exploration

vignette("getting-started", package = "pep725")

Phenological Analysis: Normals, anomalies, quality grading, and trends

vignette("phenological-analysis", package = "pep725")

Spatial Phenological Patterns: Gradients, synchrony, and mapping

vignette("spatial-patterns", package = "pep725")

Session Info

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.0.1
#> 
#> Matrix products: default
#> BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Zurich
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pep725_1.0.0        ggplot2_4.0.1       data.table_1.18.2.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6        jsonlite_2.0.0      dplyr_1.2.0        
#>  [4] compiler_4.5.2      Rcpp_1.1.1          tidyselect_1.2.1   
#>  [7] tidyr_1.3.2         jquerylib_0.1.4     scales_1.4.0       
#> [10] yaml_2.3.12         fastmap_1.2.0       R6_2.6.1           
#> [13] labeling_0.4.3      generics_0.1.4      patchwork_1.3.2    
#> [16] classInt_0.4-11     robustbase_0.99-7   sf_1.0-24          
#> [19] knitr_1.51          tibble_3.3.1        units_1.0-0        
#> [22] DBI_1.2.3           bslib_0.10.0        pillar_1.11.1      
#> [25] RColorBrewer_1.1-3  rlang_1.1.7         cachem_1.1.0       
#> [28] xfun_0.56           sass_0.4.10         S7_0.2.1           
#> [31] otel_0.2.0          rnaturalearth_1.2.0 viridisLite_0.4.2  
#> [34] cli_3.6.5           withr_3.0.2         magrittr_2.0.4     
#> [37] class_7.3-23        digest_0.6.39       grid_4.5.2         
#> [40] rstudioapi_0.18.0   lifecycle_1.0.5     DEoptimR_1.1-4     
#> [43] vctrs_0.7.1         KernSmooth_2.23-26  proxy_0.4-29       
#> [46] evaluate_1.0.5      glue_1.8.0          farver_2.1.2       
#> [49] e1071_1.7-17        rmarkdown_2.30      purrr_1.2.1        
#> [52] tools_4.5.2         pkgconfig_2.0.3     htmltools_0.5.9

¹ University of Applied Sciences Northwestern Switzerland (FHNW)
² Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)

Data Quality Assessment

Matthias Templ¹

Barbara Templ²

2026-03-04
