Crop Modelling Guide

This book is the learning and design core of the cropmodelling repository.

It is not only about one wrapper, one crop, or one model family.

It is meant to grow into a broader crop-modeling guide that can support:

  • beginners learning the field from zero
  • interns learning to move from raw notes to simulation
  • researchers building reproducible workflows
  • contributors working across multiple crop-model ecosystems

What this repository is trying to become

The long-term aim is a crop-modeling environment that can eventually support:

  • DSSAT workflows
  • APSIM workflows
  • STICS workflows
  • multiple wrappers and runners
  • ensemble modeling
  • model comparison
  • crop-model-agnostic orchestration

That means the repository should not be framed as "the DSSAT-wrapper repo."

Instead, DSSAT-wrapper is one concrete lesson and implementation pathway inside a larger crop-modeling system.

What this book tries to do

This book combines several layers:

  • beginner agronomy and crop-modeling concepts
  • DSSAT-oriented file and workflow lessons
  • a hemp case-study pathway
  • practical onboarding exercises
  • a broader ecosystem vision for future models and runners

Where DSSAT-wrapper fits

The DSSAT-wrapper work is already hosted separately:

Inside this book, DSSAT-wrapper is treated as:

  • one practical implementation example
  • one lesson path for running a real model family
  • one stepping stone toward broader crop-modeling infrastructure

Suggested reading paths

If you are completely new

Start here:

  1. Agronomy Basics for Modelers
  2. Concepts for Beginners
  3. What Is Crop Modeling?
  4. How DSSAT Thinks Day by Day
  5. Growth Stages, GDD, and Photoperiod
  6. DSSAT Weather Files Explained
  7. DSSAT Soil Files Explained
  8. Worked Case Study: Weather, Soil, and Management Construction
  9. End-to-End Intern Exercise: From Raw Notes to a Finished Run

If you want the current DSSAT lesson path

Start here:

  1. DSSAT in This Ecosystem
  2. DSSAT File Anatomy
  3. Installed DSSAT Setup
  4. Using GitHub-Sourced Example Data
  5. Hopf Paper Case Study

If you care about the bigger platform direction

Start here:

  1. Model Ecosystem Vision
  2. Repository Tour
  3. How Calibration Actually Works
  4. Visualizing Results

What has changed in this reframed version

This guide is no longer presented as a wrapper manual with extra lessons around it.

It is now presented as a crop-modeling guide with:

  • a DSSAT section
  • a hemp example pathway
  • a training path for interns
  • a future-facing vision for multiple models and multiple runners

That change is intentional.

Model Ecosystem Vision

This chapter explains the strategic direction of the repository.

The core idea

This repository is not meant to stop at one wrapper around one model.

The broader goal is to support a crop-modeling ecosystem that can work across:

  • model families
  • crops
  • calibration workflows
  • data-preparation pipelines
  • analysis and visualization layers

Why this matters

In practice, crop-model work often becomes fragmented:

  • one model has one workflow
  • another model has a completely different runner
  • data preparation is repeated by hand
  • comparisons across models become difficult

That fragmentation makes teaching, reuse, and ensemble work harder than it needs to be.

The direction we want

Over time, this repository can grow toward a layered system:

  1. learning materials
  2. input-preparation workflows
  3. model-specific wrappers and runners
  4. shared evaluation and visualization tools
  5. ensemble and comparison workflows
  6. a crop-model-agnostic orchestration layer

What "crop-model agnostic" means here

It does not mean every model is identical.

It means we aim to build shared concepts and workflow layers for things like:

  • weather preparation
  • soil preparation
  • management encoding
  • observation handling
  • metric calculation
  • plotting
  • reproducibility checks

while still respecting that each model family has its own structure.

Why wrappers still matter

Model-agnostic design does not eliminate model-specific wrappers.

Wrappers remain valuable because they provide clean entry points into each system.

The point is to avoid stopping there.

We want to move from:

  • one wrapper per model

to:

  • one broader system that can work with several wrappers and models coherently

Near-term examples

Today, the strongest implemented pathway is DSSAT.

Tomorrow, this repository may also include:

  • APSIM-oriented lessons and runner logic
  • STICS-oriented lessons and runner logic
  • shared case studies expressed across multiple models

Ensemble modeling

Ensemble modeling matters because no single crop model captures every process or every uncertainty perfectly.

A stronger long-term system should eventually support:

  • running multiple models on the same scenario
  • comparing their outputs systematically
  • understanding agreement and disagreement
  • building ensemble summaries

That is one of the motivations for keeping the repo broader than one wrapper.

Agronomy Basics for Modelers

This chapter comes before file formats, wrappers, and calibration on purpose.

If you do not have a basic agronomy picture in your head, crop-model outputs can feel like disconnected columns instead of a field story.

The goal here is simple:

understand the agronomic logic that most crop models are trying to represent.

What agronomy is

Agronomy is the practical science of crop production and field management.

It asks questions like:

  • what was planted?
  • where was it planted?
  • when was it planted?
  • what soil was it planted into?
  • what weather did it experience?
  • how was water and fertility managed?
  • what stresses reduced growth?
  • what was finally harvested?

For a modeler, agronomy is not background decoration. It is the structure behind the inputs.

The basic agronomy triangle

At the simplest level, field outcomes come from the interaction of three big things:

  1. Crop The species, cultivar, and biological traits being grown.
  2. Environment Weather, soil, landscape position, and seasonal conditions.
  3. Management What people did to the field: planting, irrigation, fertilization, harvest, and related decisions.

Most crop models are trying to simulate what happens when those three meet through time.

A crop is a living population, not a single plant

Beginners sometimes picture one ideal plant.

Agronomy usually works at field scale, where we care about a crop stand or population.

That means questions such as:

  • how many plants emerged?
  • how evenly are they spaced?
  • how quickly did the canopy close?
  • how deep can roots explore?
  • how much competition exists among plants?

This matters because yield is usually the result of population performance, not one perfect individual.

Establishment comes first

Before a crop can produce biomass or yield, it has to establish successfully.

Establishment includes:

  • seed quality or planting material quality
  • planting depth
  • planting date
  • soil temperature
  • soil moisture near planting
  • emergence success
  • early survival

If establishment is poor, everything later in the season starts from a weaker base.

For modelers, this means early assumptions matter more than they may appear.

The crop needs resources every day

From an agronomy perspective, daily growth depends on access to key resources:

  • light
  • water
  • nutrients
  • temperature conditions within a usable range

A crop model turns those ideas into equations, but the agronomic story stays the same:

  • leaves intercept light
  • roots access water and nutrients
  • temperature regulates development speed
  • stress reduces what the crop could otherwise do

Soil is more than dirt

In crop modeling, soil is one of the most important agronomic controls.

Soil affects:

  • how much water can be stored
  • how quickly water drains
  • how deep roots can grow
  • how much nitrogen may become available
  • whether roots encounter physical restrictions

Two fields with the same weather and cultivar can behave very differently if one has deep, well-structured soil and the other has shallow or restrictive soil.

That is why soil files are never just administrative details.

Weather drives opportunity and stress

Agronomically, weather does two things at once:

  • it creates growth opportunity
  • it creates stress risk

Warmth and radiation can support faster growth. Too little rain, too much heat, or cold conditions can reduce it.

That is why crop models care so much about daily weather rather than seasonal averages.

The crop experiences the season one day at a time.

Management changes the season the crop experiences

Management is not separate from the environment in practice. It changes the environment that the crop actually feels.

Examples:

  • irrigation changes water availability
  • fertilizer changes nutrient supply
  • planting date changes the daylength and temperature pattern the crop sees
  • plant density changes competition and canopy structure
  • harvest timing changes what outcome is counted as success

This is one reason management files are so influential in crop modeling.

Growth is not the same as development

This distinction helps beginners a lot.

Growth usually refers to increase in size or biomass.

Development usually refers to progress through life stages such as:

  • emergence
  • vegetative growth
  • flowering
  • grain fill or reproductive filling
  • maturity

A crop can keep developing even when growth is stressed. It can also accumulate biomass rapidly during some stages and slowly during others.

Good agronomy thinking keeps both ideas in view.

Yield is built from components

Agronomists do not usually think of yield as magic that appears at harvest.

Yield is built gradually from intermediate processes and components such as:

  • plant population
  • tiller or branch number in some crops
  • canopy size
  • flowering success
  • seed or grain number
  • individual seed or grain weight
  • harvest index or partitioning pattern

If yield is wrong in a model, the most useful question is often:

which component went wrong first?

Stress is central to field reality

Real fields are almost never operating at perfect potential.

Common agronomic stresses include:

  • drought
  • nutrient deficiency
  • heat stress
  • cold stress
  • waterlogging
  • salinity
  • pests and diseases
  • weed competition

Not every model represents all of these equally well. But agronomically, they matter because they explain why actual fields often underperform relative to ideal conditions.

The same species can behave very differently

One of the easiest mistakes is to talk about a crop species as if it had one fixed behavior.

In reality, cultivars can differ in:

  • maturity timing
  • photoperiod sensitivity
  • biomass potential
  • partitioning
  • height
  • stress tolerance

That is why genotype information matters so much in model-based work.

Agronomic data are usually imperfect

Field agronomy is messy.

You may encounter:

  • missing weather days
  • incomplete fertilizer records
  • uncertain planting depth
  • borrowed soil profiles
  • observations collected on different dates across variables
  • trials with uneven stands or edge effects

A strong modeler does not hide that mess. They document assumptions and keep track of uncertainty.

What a modeler should ask about any field experiment

Before trusting a dataset, start with questions like:

  • What crop and cultivar were grown?
  • What was the target production goal: grain, fiber, seed, biomass?
  • When was the crop planted and harvested?
  • What soil was used, and how was it described?
  • What weather source represents the site?
  • What irrigation and fertilizer were applied, and when?
  • What observations were actually measured rather than inferred?
  • What obvious stress or management events might have shaped the season?

Those questions usually matter more than jumping straight to parameters.

Why this chapter belongs before the modeling details

Later chapters will talk about:

  • weather files
  • soil profiles
  • management sections
  • phenology variables
  • genotype coefficients

Those topics make much more sense once you see the agronomic story underneath them.

Crop models are not only software objects. They are field logic translated into structured inputs and equations.

Beginner takeaway

If you remember only one thing, let it be this:

a crop model is trying to simulate how a crop population responds to environment and management through time.

That is an agronomy problem before it becomes a file-format problem.

Concepts for Beginners

This chapter is for readers who are new to crop modeling, DSSAT, or both.

The goal is not to make you memorize jargon. The goal is to give you a working mental model so that the rest of the book feels natural instead of cryptic.

What a crop model is

A crop model is a mathematical description of how a crop grows through time under a given environment and management system.

In plain language, a crop model tries to answer:

  • when does the crop emerge?
  • when does it flower?
  • how much leaf, stem, root, and grain biomass does it produce?
  • how do weather, soil, water, planting date, cultivar, and fertilizer change that outcome?

The model does not "see" the field directly. It only sees the numbers we give it through input files and parameters.

What DSSAT is

DSSAT stands for Decision Support System for Agrotechnology Transfer.

At a practical level, DSSAT is:

  • a collection of crop models
  • a file format ecosystem
  • an executable that reads experiment files and writes outputs

That last point matters. DSSAT is not just one crop model. It is a family of models that share a common workflow but differ in their crop-specific logic.

Why wrappers exist

If DSSAT already runs, why build an R wrapper?

Because DSSAT by itself is not optimized for:

  • large batch experimentation from code
  • calibration loops
  • sensitivity analysis
  • automated validation
  • reproducible research workflows

A wrapper lets us treat DSSAT as a callable engine inside a broader scientific workflow.

The basic logic of a DSSAT run

Every simulation needs four big ingredients:

  1. Weather Daily minimum and maximum temperature, radiation, rainfall, and related variables.
  2. Soil Layer-by-layer information about texture, water holding, bulk density, and initial conditions.
  3. Management Planting date, planting density, irrigation, fertilizer, harvest rules, and other field operations.
  4. Genetics Cultivar, ecotype, and species coefficients that tell the model how this crop behaves.

The experiment file ties those pieces together.

Why the same crop can still have multiple models

This is one of the most important beginner ideas in this repo.

In DSSAT, "crop" and "model family" are related but not identical.

Examples:

  • wheat can be run through CERES or NWHEAT
  • sugarcane has multiple engines such as CANEGRO, CASUPRO, and SAMUCA
  • cassava can involve CSYCA or CSCAS

That is exactly why the DSSAT_omniwrapper() exists. A single hard-coded wrapper is not enough if we want to work across the broader DSSAT ecosystem.

What calibration means

Calibration means adjusting uncertain model parameters so the model reproduces observed field behavior more closely.

Typical targets include:

  • flowering date
  • maturity date
  • aboveground biomass
  • stem biomass
  • grain yield
  • leaf area

Calibration is not guessing until the curve looks nice. It should be guided by:

  • real observations
  • clearly defined objective functions
  • a biologically sensible order of parameter adjustment

A good beginner workflow

If you are starting from zero, this is the safest order:

  1. Make the model run without errors.
  2. Confirm that the correct crop family and module are being used.
  3. Confirm that the experiment file points to the intended weather, soil, and management setup.
  4. Confirm that observations are being read correctly.
  5. Compare simulated and observed time series.
  6. Only then start changing parameters.

That order sounds slow, but it prevents most confusing failures later.

What success looks like

A successful workflow does not mean "the model matches perfectly."

It means:

  • the run is reproducible
  • the files and assumptions are transparent
  • the comparison with observations is explicit
  • improvements can be explained rather than guessed

That is the standard this book and this repo aim for.

What Is Crop Modeling?

This chapter is written for a reader who may be strong in coding but new to agronomy, or strong in agronomy but new to modeling.

The goal is simple:

understand what a crop model is trying to represent before worrying about file formats, wrapper functions, or calibration metrics.

A field is a process, not just an outcome

When a farmer or researcher looks at a field, they usually care about outcomes:

  • emergence
  • flowering
  • yield
  • fiber production
  • grain production
  • biomass

A crop model tries to represent the processes that create those outcomes.

That means the model asks questions like:

  • How warm was each day?
  • How long was the day?
  • How much water could roots access?
  • How much nitrogen was available?
  • How fast could leaves expand?
  • When did the plant switch from vegetative growth to reproductive growth?

The model does not "know" the plant the way a human observer does. It knows only the processes and parameters encoded in its equations and inputs.

What a crop model does each day

Most process-based crop models operate on a daily time step.

For each simulated day, the model takes inputs such as:

  • weather
  • soil water and soil nitrogen status
  • crop development stage
  • management events such as irrigation or fertilizer

Then it updates internal state variables such as:

  • days since planting
  • thermal time or photothermal time
  • leaf area
  • root growth
  • dry matter production
  • biomass partitioning to leaves, stems, roots, seed, or storage organs

The next day starts from the updated state of the previous day.

This is why one bad input or one poor parameter choice can affect the whole season instead of only one date.

Why process-based models are useful

Process-based models are useful because they let us reason beyond a single observed season.

If the model captures the main processes reasonably well, we can ask:

  • What if planting date changes?
  • What if rainfall changes?
  • What if the crop is grown at a different latitude?
  • What if soil depth is shallower?
  • What if cultivar photoperiod sensitivity is different?

That is why crop models are used in:

  • agronomy
  • breeding
  • climate adaptation studies
  • irrigation and fertilizer analysis
  • systems research

What a model is not

A crop model is not:

  • a perfect copy of reality
  • a replacement for field data
  • an excuse to ignore measurement quality
  • automatically transferable from one crop or location to another

A model is a disciplined simplification.

The useful question is not "Is the model true?"

The useful questions are:

  • Is it transparent?
  • Is it reproducible?
  • Is it good enough for the question being asked?
  • Do the errors make biological sense?

The three big ingredients of crop-model quality

The quality of a simulation usually depends on three broad things:

  1. input quality
  2. parameter quality
  3. model structure

Input quality

If weather, soil, or management inputs are wrong, the model can fail even if the equations are sensible.

Parameter quality

If cultivar or ecotype coefficients are unrealistic, the crop may flower too early, accumulate too little biomass, or partition biomass incorrectly.

Model structure

Even with good data and reasonable parameters, a model can still struggle if its underlying assumptions are not suitable for the crop or environment.

This matters a lot in new-crop work such as hemp adaptation.

Why beginners often feel lost

Beginners often see file names first and biology second.

They encounter:

  • .WTH
  • .SOL
  • .CUL
  • .ECO
  • PlantGro.OUT
  • GSTD
  • CWAD

and it feels like a codebase without a story.

The story is:

  • weather drives daily opportunity and stress
  • soil controls what roots can access
  • management defines what the farmer did
  • genetics defines how this crop behaves
  • the model combines them to simulate growth through time

Once that story is clear, the file system starts to make sense.

Where DSSAT fits in

DSSAT is a crop modeling platform that provides:

  • crop-specific model engines
  • standard file structures
  • a simulation executable
  • output conventions

This repository is not DSSAT itself.

This repository adds a wrapper around DSSAT so that the model can be used more cleanly from R for:

  • reproducible runs
  • validation
  • calibration
  • paper reproduction
  • multi-family experimentation

How to read the rest of this book

If this chapter is your starting point, the safest next order is:

  1. How DSSAT Thinks Day by Day
  2. Hemp Biology for Modelers
  3. Growth Stages, GDD, and Photoperiod
  4. DSSAT File Anatomy

That path gives you the biology and process logic before you dive into files and code.

How DSSAT Thinks Day by Day

This chapter gives a mental model of a daily DSSAT simulation.

If you understand this daily loop, many confusing terms become easier:

  • thermal time
  • phenology
  • biomass accumulation
  • partitioning
  • stress
  • stage transitions

The daily loop

At a high level, DSSAT repeats a sequence like this:

  1. read today's weather
  2. update development progress
  3. compute growth potential
  4. reduce that potential if water, nitrogen, or temperature are limiting
  5. partition new biomass among organs
  6. update soil and plant states
  7. move to the next day

That sounds simple, but each step contains real biology and many assumptions.

Step 1: read today's weather

The model uses daily weather variables such as:

  • solar radiation
  • minimum temperature
  • maximum temperature
  • rainfall

Some models also depend strongly on:

  • daylength
  • vapor pressure or humidity-related effects
  • potential evapotranspiration

These values set the opportunity for photosynthesis, development, and water balance on that day.

Step 2: update development progress

The model keeps track of where the crop is in its life cycle.

Examples of broad phases are:

  • planting to emergence
  • emergence to vegetative growth
  • vegetative growth to flowering
  • flowering to seed fill or maturation

The model uses rules such as:

  • accumulated heat
  • daylength sensitivity
  • photothermal response
  • stress effects

to decide how quickly the crop moves through these phases.

This is called phenology.

Step 3: compute growth potential

Once the model knows the crop's stage and environment for that day, it estimates how much dry matter the crop could produce under those conditions.

This is influenced by:

  • intercepted radiation
  • radiation use efficiency
  • canopy size or leaf area
  • temperature

Potential growth is what the crop could do before water, nitrogen, or other constraints reduce it.

Step 4: apply stress reductions

Real crops rarely grow at full potential every day.

The model therefore applies reductions based on stresses such as:

  • too little water
  • too much or too little temperature
  • too little nitrogen
  • other family-specific limitations

This is one reason the same cultivar can behave very differently in different soils or seasons.

Step 5: partition new biomass

New growth is not sent to one single pool.

The model decides how much of today's new biomass goes to:

  • leaves
  • stems
  • roots
  • grain
  • flowers
  • storage organs

The proportions depend on crop type and development stage.

This is called partitioning.

For a fiber-oriented hemp workflow, stem partitioning matters a lot. For a seed-oriented workflow, reproductive partitioning matters more.

Step 6: update soil and plant states

The model then updates internal state variables such as:

  • total biomass
  • leaf area
  • root depth
  • soil water
  • soil nitrogen
  • stage counters

These updated states are what the next simulated day starts from.

That is why calibration errors can accumulate through time. A mismatch in early growth may still matter at flowering or harvest.

Why outputs look the way they do

When DSSAT writes a time-series file such as PlantGro.OUT, it is recording parts of this daily state.

Variables often represent:

  • a stage or stage counter
  • an organ biomass pool
  • a morphological trait
  • a stress-related progression value

If you treat outputs as random column names, the files feel opaque.

If you remember they are snapshots of a daily process loop, they become much easier to read.

Why this matters for debugging

Suppose biomass is too low.

That could happen because:

  • the crop developed too quickly and ended vegetative growth early
  • leaf area stayed too small
  • radiation was too low
  • water stress reduced growth
  • nitrogen stress reduced growth
  • partitioning moved too much biomass away from the organ you are studying

The point is that a final error at harvest is often the result of earlier daily processes, not just one wrong coefficient.

Why this matters for calibration

Calibration is much easier when you think in process order:

  1. get emergence and phenology roughly right
  2. get canopy and vegetative growth roughly right
  3. get partitioning and harvest components right

Trying to fit final biomass before understanding the timing of stage transitions usually creates confusion.

The beginner takeaway

The most important thing to remember is this:

DSSAT is not jumping straight from planting date to final yield.

It is simulating a chain of daily decisions and state updates.

That is the logic behind almost everything else in this book.

Hemp Biology for Modelers

This chapter is not a full botany text.

Its purpose is narrower:

to give enough hemp biology that a new modeler understands what the model is trying to reproduce.

Why hemp is a special case in modeling

Hemp is not just "another generic broadleaf crop."

It can be grown for very different purposes:

  • fiber
  • seed or grain
  • dual-purpose production
  • sometimes floral biomass in other contexts

Those goals matter because they change what traits we care about:

  • stem growth
  • partitioning
  • flowering timing
  • plant height
  • branching
  • grain production

Major growth phases in practical terms

For a beginner, it helps to think about hemp growth in broad stages:

  1. germination and emergence
  2. vegetative growth
  3. floral initiation and flowering
  4. seed set or reproductive filling
  5. maturation or harvest stage

During vegetative growth, the plant is mostly building structure:

  • leaves
  • stems
  • roots

Later, more of the plant's energy may shift toward reproductive structures and seed production, depending on genotype and management.

Why flowering timing matters so much

Hemp is often strongly sensitive to photoperiod, meaning daylength influences when it transitions toward reproductive development.

This matters because flowering timing controls:

  • how long the crop stays vegetative
  • how much biomass can accumulate before reproduction
  • how tall the crop becomes
  • how much stem biomass is available for fiber
  • how much time remains for seed filling

If a hemp model flowers too early, it may produce too little total biomass. If it flowers too late, it may overgrow or miss the observed reproductive pattern.

Fiber versus seed logic

For fiber-oriented production, modelers often care especially about:

  • stem dry weight
  • plant height
  • canopy development
  • timing of flowering

For seed-oriented production, modelers may care more about:

  • flowering and reproductive timing
  • grain or seed biomass
  • partitioning to reproductive organs
  • maturation timing

The same crop model may be used in both contexts, but the most important target variables and the most sensitive parameters can differ.

Cultivar differences

Not all hemp cultivars respond the same way.

Differences may include:

  • photoperiod sensitivity
  • maturity type
  • growth habit
  • total biomass potential
  • partitioning pattern
  • height and canopy structure

That is why genotype coefficients matter so much in hemp adaptation work.

Why latitude and planting date matter

Because daylength changes with latitude and season, planting date can change hemp behavior dramatically.

Two crops planted in different places or on different dates may experience:

  • different daylength trajectories
  • different temperature accumulation
  • different rainfall environments

That means a hemp model must usually represent both:

  • temperature-driven development
  • photoperiod-driven development

What an intern should watch first

When inspecting hemp simulations, the most useful first questions are often:

  • Is the crop flowering at roughly the right time?
  • Is stem biomass tracking observations?
  • Is total biomass trajectory plausible?
  • Is plant height developing at a reasonable pace?
  • Are differences among cultivars visible?

Those are often more informative early on than focusing immediately on every single output column.

Why hemp adaptation in DSSAT matters

When a crop is new to a modeling platform, the task is not only "make it run."

The task is to decide whether the platform's structure can represent:

  • the right developmental signals
  • the right partitioning behavior
  • the right observed responses to environment and management

That is why the hemp case study in this repository is so valuable. It tests the wrapper against a real adaptation problem rather than only against legacy crops.

Growth Stages, GDD, and Photoperiod

This chapter introduces several terms that confuse beginners very quickly:

  • growth stage
  • phenology
  • GDD
  • thermal time
  • photoperiod
  • photothermal day

They are closely related, but they are not the same thing.

Growth stages

A crop does not behave the same way from planting to harvest.

It passes through recognizable developmental phases such as:

  • sowing or planting
  • emergence
  • vegetative growth
  • flowering
  • seed fill
  • maturity

These phases are called growth stages or phenological stages.

The exact names vary by crop and model family, but the basic idea is always the same: stage affects what the plant can do and how the model treats it.

Phenology

Phenology is the study or simulation of when these developmental stages occur.

In crop models, phenology is critical because it controls:

  • how long the crop remains vegetative
  • when it begins reproduction
  • when partitioning rules change
  • when final harvest components can accumulate

If phenology is wrong, many later variables can also be wrong even if the rest of the model is reasonable.

What is GDD?

GDD means Growing Degree Days.

It is a simple way to express how much useful warmth a crop experienced.

The broad idea is:

  • crops develop faster on warmer days
  • but only within a biologically meaningful range

A simple textbook version looks like:

GDD = mean daily temperature - base temperature

with rules to prevent unrealistic values when temperatures are too low or too high.

This is only the basic concept. Real crop models may use more detailed thermal response functions than this one-line formula.

Why GDD matters

GDD helps explain why calendar days alone are not enough.

Two crops planted on the same date can accumulate very different development if one season is cool and the other is warm.

That affects:

  • emergence timing
  • flowering timing
  • maturity timing

Thermal time versus GDD

People sometimes use these terms loosely, but thermal time is the broader concept.

GDD is one common way of expressing thermal time.

In practice, when modelers say:

  • "the crop needs more thermal time to flower"

they often mean the model needs more accumulated temperature-driven progress before stage transition.

Photoperiod

Photoperiod means daylength.

Some crops are sensitive to how long the day is, not only how warm it is.

For those crops, development may slow or accelerate depending on whether the day is longer or shorter than the crop prefers.

This matters strongly in hemp because flowering behavior is often tied to daylength response.

Photothermal day

Some model formulations combine temperature and photoperiod into a single idea, often called a photothermal day or something similar.

The exact equation depends on the model family, but the concept is:

  • development is not driven by temperature alone
  • temperature and daylength together shape progress toward flowering or later stages

This is especially important when comparing:

  • different planting dates
  • different latitudes
  • different cultivars with different photoperiod sensitivity

Why stage variables matter in outputs

When you see variables related to stages in DSSAT output, they are often trying to summarize this development logic.

Even if you are mainly interested in biomass, stage-related outputs help answer:

  • Did the model flower too early?
  • Did it remain vegetative too long?
  • Did reproductive development begin at the right time?

Those questions are often the key to understanding yield or biomass mismatches.

A practical way to think about calibration

For a beginner, a useful rule is:

  • first fix timing
  • then fix size
  • then fix partitioning

In other words:

  1. get stage transitions roughly right
  2. get total growth roughly right
  3. get organ-specific biomass roughly right

Trying to fit stem or seed biomass while flowering timing is still wrong often leads to misleading parameter changes.

Common beginner confusion

"The model has the right harvest date, so phenology must be fine"

Not necessarily.

It may still have the wrong:

  • emergence date
  • flowering date
  • duration of vegetative growth
  • duration of reproductive growth

"More GDD always means better growth"

Not necessarily.

Too much heat, stress, or an unfavorable daylength response can still reduce performance even if thermal accumulation is high.

"Photoperiod only matters at flowering"

Often flowering is the most visible effect, but photoperiod sensitivity can influence the whole developmental pathway around the transition to reproduction.

Beginner takeaway

If you remember only one thing from this chapter, let it be this:

calendar date is not enough.

Crop models care deeply about how temperature and daylength accumulate through time, because that is how they decide when the crop changes from one stage to the next.

Repository Tour

This chapter explains what is in the repository and why each piece exists.

Top-level structure

The key folders and files are:

  • src/ The mdBook source files for the learning site.
  • examples/ Public-safe example materials and case-study scripts.
  • .github/workflows/ GitHub Actions, including documentation deployment.
  • book.toml The mdBook configuration.
  • README.md The short project-facing introduction.

The src/ folder

This folder is the book.

Hosted documentation should not be treated as decoration. In a project like this, the documentation is part of the reproducibility layer.

The purpose of the book is to:

  • teach beginners how the system fits together
  • explain why design decisions were made
  • document validation status
  • make onboarding easier for future contributors

The examples/ folder

This folder is where model-specific case studies can live without defining the identity of the whole repository.

Right now it contains a hemp-oriented DSSAT case-study example.

Over time it could also contain:

  • APSIM examples
  • STICS examples
  • cross-model comparison examples
  • ensemble examples

Where the wrapper source lives conceptually

The DSSAT-wrapper is part of the ecosystem, but it is not the identity of this repository.

Its dedicated documentation lives separately here:

This repository should link to such model-specific modules rather than pretending they are the whole system.

What is deliberately outside the repo

The umbrella cropmodelling repo is not the same thing as a full local runtime workspace.

A local research workspace may also contain:

  • a DSSAT installation such as C:/path/to/DSSAT48
  • cloned example data such as dssat-csm-data
  • APSIM installations
  • STICS installations
  • project-specific reproduction scripts
  • temporary outputs and figures

The goal of this repo is to teach how to work with those pieces without assuming that everyone has the same machine layout or the same preferred model.

DSSAT in This Ecosystem

This chapter explains how DSSAT fits inside the broader cropmodelling repository.

DSSAT is one model ecosystem, not the whole repo

DSSAT is currently the most developed pathway in this project because:

  • it is mature
  • it has rich file structures
  • it has real example data
  • it has already been used in the hemp case-study work

But DSSAT is still one ecosystem among several that this repository aims to support over time.

Where the wrapper belongs

The DSSAT-wrapper work belongs in the project as:

  • one implementation module
  • one teaching pathway
  • one reproducibility layer

It should not define the identity of the whole repository.

Why the hosted wrapper site matters

The wrapper already has its own hosted documentation:

That means this repository does not need to duplicate every wrapper-level detail as if it were the only story.

Instead, this repository can:

  • explain why the wrapper matters
  • teach the beginner concepts around it
  • link out to the dedicated wrapper site when deep DSSAT-wrapper detail is needed

What DSSAT currently contributes here

DSSAT currently provides:

  • the clearest end-to-end lesson path
  • a rich example of weather, soil, management, and genotype interaction
  • a hemp adaptation and paper-reproduction case study
  • a concrete example of why wrappers and validation matter

What comes later

As the project grows, the same pattern can be used for other ecosystems:

  • APSIM section
  • STICS section
  • shared comparison sections

The important idea is that each model can be both:

  • its own ecosystem
  • part of a broader shared crop-modeling framework

DSSAT File Anatomy

One of the biggest beginner hurdles in DSSAT is that everything seems to be "just files." This chapter explains what those files do.

The most important idea

The experiment file is the conductor.

It does not contain every piece of data itself. Instead, it points to:

  • the crop setup
  • the field and soil context
  • the planting and fertilizer treatments
  • the weather station
  • the cultivar and ecotype identifiers

If you understand the experiment file, DSSAT becomes much easier to reason about.

Experiment files

Examples:

  • *.WHX for wheat examples
  • *.MZX for maize examples
  • *.HMX for hemp examples

An experiment file usually contains sections like:

  • *TREATMENTS
  • *CULTIVARS
  • *FIELDS
  • *SOIL ANALYSIS
  • *INITIAL CONDITIONS
  • *PLANTING DETAILS
  • *FERTILIZERS
  • *HARVEST DETAILS
  • *SIMULATION CONTROLS

When you read an experiment file, think of it as a structured recipe rather than a flat text file.

Companion observation files

Two observation file types matter often:

  • *.T Time-course observations.
  • *.A End-of-season or summary observations.

In hemp case studies you may see:

  • .HMT
  • .HMA

Those are the crop-specific versions of the same idea.

Genotype files

These usually live in the DSSAT Genotype folder.

Common file types:

  • .CUL Cultivar coefficients.
  • .ECO Ecotype coefficients.
  • .SPE Species-level definitions.

Important caution:

Not all crop/model families use the exact same genotype structure. Some have no .ECO file at all. That is one reason a truly universal wrapper cannot assume a single fixed genotype pattern.

Soil files

These define the physical and hydraulic context of the field.

Typical concerns include:

  • layer depth
  • bulk density
  • lower and upper water limits
  • saturation
  • organic carbon
  • initial water and nitrogen conditions

If biomass or stress behavior looks unrealistic, soil inputs are often one of the first places to inspect.

Weather files

Weather files are daily drivers. If the weather is wrong, the simulation can be beautifully reproducible and still biologically wrong.

Typical variables include:

  • solar radiation
  • minimum temperature
  • maximum temperature
  • rainfall

Weather files should be checked for:

  • missing dates
  • unrealistic spikes
  • unit consistency
  • correct station linkage from the experiment file

Metadata files in the DSSAT install

Two install-level files are especially important for this repo:

SIMULATION.CDE

This provides model and crop metadata.

DSSATPRO.V48

This maps crop codes to installed folders and default modules.

These files are crucial for the registry-driven wrapper because they let us infer model context instead of forcing the user to hard-code everything.

Outputs you will see often

PlantGro.OUT

The main time-series output file for growth variables.

Evaluate.OUT

Often provides end-of-run summary metrics and stage-related information.

Other output files

Some families also write additional outputs or family-specific variables.

That is why the omniwrapper uses:

  • a generic output reader
  • variable alias logic
  • fallback parsing where necessary

Practical reading order

When debugging a run, a good order is:

  1. experiment file
  2. companion observation files
  3. genotype files
  4. install metadata
  5. output files

That order helps you understand not only what happened, but why the wrapper made the choices it made.

DSSAT Weather Files Explained

Weather is one of the most important DSSAT inputs because it drives the crop every single day.

If the weather data are poor, everything downstream can look coherent and still be wrong.

What a DSSAT weather file does

A weather file provides the daily atmospheric conditions the crop experiences.

Typical daily drivers include:

  • solar radiation
  • maximum temperature
  • minimum temperature
  • rainfall

Depending on workflow, other variables may also be relevant or derived.

Why weather matters so much

Weather influences almost every major process:

  • emergence
  • phenology
  • photosynthesis potential
  • evapotranspiration
  • soil water balance
  • stress timing

Even a small issue such as missing rainfall on the wrong date can shift water stress and therefore biomass or flowering patterns.

The basic structure of a weather file

A DSSAT weather file typically contains:

  • station metadata
  • geographic information
  • a header defining daily columns
  • one row per day

The station metadata usually help DSSAT understand:

  • where the site is
  • what climate station identifier to use
  • how to compute daylength and seasonal context

The daily rows provide the actual time series.

Common weather variables

The exact names can vary, but beginners should understand the common concepts.

Solar radiation

This represents incoming shortwave energy available to drive canopy interception and photosynthesis-related processes.

If radiation is too low or too high, biomass can be strongly biased.

Minimum and maximum temperature

These influence:

  • development rate
  • heat stress or cold stress
  • evapotranspiration
  • potential growth

Temperature errors often show up first as phenology errors.

Rainfall

Rainfall feeds the soil water balance.

If rainfall timing is wrong, the model may create water stress at the wrong stage even if seasonal rainfall totals look reasonable.

Why location metadata matter

Latitude and longitude are not decoration.

They influence:

  • daylength
  • seasonal solar geometry
  • sometimes weather interpretation choices

For crops with photoperiod sensitivity, such as hemp, location metadata can be especially important.

Common data-quality problems

Beginner workflows often run into:

  • missing dates
  • duplicate dates
  • impossible temperatures
  • negative radiation
  • unit confusion
  • station metadata copied from the wrong location

Any of these can damage a simulation quietly.

How weather files relate to experiment files

The experiment file typically references a weather station identifier rather than embedding the daily weather table directly.

That means debugging weather requires checking both:

  • the weather file itself
  • the station linkage used by the experiment file

Practical checks before you trust a weather file

Before running calibration, check:

  1. Does the file cover the full simulation period?
  2. Are all dates present?
  3. Do units match the expected DSSAT convention for that variable?
  4. Do min and max temperatures look physically plausible?
  5. Does rainfall timing look believable?
  6. Does the station metadata match the intended site?

Beginner takeaway

When a run looks biologically wrong, do not assume the genotype is the problem.

Weather is often the first major input to inspect because it drives development and growth every day of the season.

How to Build Climate Data for DSSAT

This chapter is about workflow rather than theory.

It explains how you go from real weather information to a weather file that DSSAT can actually use.

Where weather data usually come from

Common sources include:

  • on-site weather stations
  • nearby public weather stations
  • gridded weather products
  • satellite-supported products
  • reanalysis products

The best source depends on:

  • how close it is to the field
  • whether it covers the full date range
  • whether it provides the required variables
  • whether its quality control is trustworthy

The core rule: daily completeness matters

DSSAT usually needs a complete daily time series across the simulation period.

That means you should aim for:

  • no missing dates
  • consistent daily units
  • no duplicated rows
  • no accidental mixing of stations

Even if a source is scientifically reputable, your assembled file can still be wrong if the processing pipeline is careless.

A practical build workflow

Step 1: define the site and period

Write down:

  • station or site identifier
  • latitude
  • longitude
  • elevation if available
  • start date
  • end date

These become the backbone of the weather file.

Step 2: collect the daily variables

At minimum, assemble the variables required by your DSSAT workflow, often:

  • daily solar radiation
  • daily minimum temperature
  • daily maximum temperature
  • daily rainfall

If your source does not provide one of these directly, document how it was derived.

Step 3: standardize units

This is one of the most common failure points.

Before writing the file, make sure each variable is in the convention expected by the DSSAT weather format you are targeting.

Never assume two data providers use the same units.

Step 4: fill or flag missing values

If data are missing, decide explicitly whether to:

  • gap-fill from another source
  • interpolate where scientifically justified
  • mark the series as unsuitable

Do not silently leave gaps in a file that is supposed to drive a daily simulation.

Step 5: run quality checks

Plot and inspect:

  • rainfall through time
  • min and max temperature through time
  • radiation through time

Also check summary statistics such as:

  • hottest day
  • coldest day
  • wettest day
  • total seasonal rainfall

These quick checks often reveal transcription or unit errors immediately.

Step 6: write the DSSAT weather file

Once the data are clean, write them into the required DSSAT structure with:

  • correct station header information
  • correct variable names
  • one row per day

Step 7: test with a known experiment

Before using the file in serious calibration, run a simple known experiment and confirm:

  • the simulation period is covered
  • DSSAT reads the file
  • no weather-related missing-data warnings appear

When weather must be estimated

Sometimes field experiments do not have perfect station data.

In that case, you may use:

  • nearby stations
  • merged products
  • satellite-supported products
  • reanalysis products

The important thing is not to pretend the data are more direct than they are. Document the source and any transformation clearly.

A good habit for reproducibility

Keep a separate record of:

  • source dataset names
  • source URLs or citations
  • download date
  • unit conversions
  • gap-filling rules
  • final weather file name used in DSSAT

That record is often more valuable than the final file alone.

Beginner takeaway

Building weather for DSSAT is not just "put climate in a text file."

It is a careful process of:

  • sourcing
  • cleaning
  • standardizing
  • validating
  • documenting

That discipline is what makes later calibration and paper reproduction credible.

Denmark ISIMIP Workflow Walkthrough

This chapter records what we actually did to build the Denmark weather deliverables, which scripts were involved, where they lived on the remote HPC, what outputs were produced, and how the soil side should be structured if we later extend the Denmark project using the Ethiopia soil-first pattern.

The goal is not just to say "the files were made." The goal is to leave a traceable map that another researcher or intern can follow.

The main question we answered

The practical Denmark question was:

"Can we take ISIMIP climate data for Denmark and push it all the way to DSSAT-ready weather outputs, using a workflow that can later be paired with a proper soil pipeline?"

The answer became yes.

By the end of the work, we had:

  • processed Denmark climate NetCDF files in crop-model units
  • parquet zonal summaries
  • TAV and AMP outputs
  • DSSAT .WTH files
  • repaired elevation headers
  • validation of processed units

The real project locations

Denmark climate and weather project on HPC

  • /home/<user>/DEN
  • /home/<user>/DEN/denmark

The denmark folder was created as a clean project wrapper so we would not have to move heavy climate data around unnecessarily.

Ethiopia soil reference project on HPC

  • /home/<user>/Ethiopia
  • /home/<user>/Ethiopia/soil_data

This Ethiopia project is important because it already demonstrates the soil-first pattern that we want future projects to follow.

What already existed before the Denmark cleanup

Before the cleanup and restructuring, Denmark already had important raw and processed ingredients:

  • Denmark shapefiles under /home/<user>/DEN/shapefiles
  • Denmark COP30 DEM tiles under /home/<user>/DEN/COP30
  • processed ISIMIP climate NetCDFs under each model's processed/ folder
  • legacy zonal outputs under /home/<user>/DEN/zonal_stats

Those legacy zonal outputs were not yet in the Ethiopia-style parquet layout. They existed mostly as per-grid Excel outputs.

The Denmark wrapper we built

We created a clean wrapper at:

  • /home/<user>/DEN/denmark

Inside that wrapper we organized:

  • /home/<user>/DEN/denmark/scripts
  • /home/<user>/DEN/denmark/docs
  • /home/<user>/DEN/denmark/config
  • /home/<user>/DEN/denmark/deliverables
  • /home/<user>/DEN/denmark/logs

This mattered because it separated:

  • large source data that already existed
  • lightweight workflow code
  • final deliverables
  • validation outputs

The Denmark scripts we used

These are the numbered workflow scripts we set up for Denmark:

  • /home/<user>/DEN/denmark/scripts/01_download_isimip.sh
  • /home/<user>/DEN/denmark/scripts/02_prepare_shapefiles_and_grid.sh
  • /home/<user>/DEN/denmark/scripts/03_create_denmark_grid.py
  • /home/<user>/DEN/denmark/scripts/04_prepare_dem.sh
  • /home/<user>/DEN/denmark/scripts/05_merge_clip_convert_climate.sh
  • /home/<user>/DEN/denmark/scripts/06_compute_tav_amp.py
  • /home/<user>/DEN/denmark/scripts/07_extract_zonal_stats_parquet.py
  • /home/<user>/DEN/denmark/scripts/08_generate_wth_files.py
  • /home/<user>/DEN/denmark/scripts/09_validate_outputs.py
  • /home/<user>/DEN/denmark/scripts/10_run_model_scenario_weather.sh
  • /home/<user>/DEN/denmark/scripts/11_submit_denmark_weather.sh

What each Denmark script did

01_download_isimip.sh

This is the placeholder entry for climate acquisition.

In practice, much of the Denmark climate data already existed under model folders, so the project did not depend on a fresh redownload to finish the weather pipeline.

02_prepare_shapefiles_and_grid.sh

This script represents the shapefile preparation stage.

Conceptually this stage ties together:

  • Denmark extent
  • Denmark 0.5 degree grid
  • the grid-cell logic that later drives zonal extraction and WTH station IDs

03_create_denmark_grid.py

This is the grid-construction step.

It formalizes the Denmark climate grid as the geographic framework against which climate and later soil can be mapped.

04_prepare_dem.sh

This script prepared the Denmark DEM support raster used for WTH elevation.

It produced the merged DEM used in the workflow:

  • /home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif

This step mattered because the WTH headers need defensible elevation values.

05_merge_clip_convert_climate.sh

This was the bridge from raw model outputs to processed Denmark climate files.

The underlying climate conversion logic matched the older project script:

  • merge yearly files
  • clip to Denmark extent
  • convert variables to crop-model units

The effective unit targets were:

  • pr -> mm/day
  • tas -> degC
  • tasmax -> degC
  • tasmin -> degC
  • rsds -> MJ m-2 day-1

06_compute_tav_amp.py

This script computed DSSAT header climatology summaries:

  • TAV
  • AMP

It reads processed tasmax and tasmin, aligns them by date, and writes:

  • /home/<user>/DEN/denmark/deliverables/tav_amp/{MODEL}/{SCENARIO}/{PERIOD}/tav_amp.csv

07_extract_zonal_stats_parquet.py

This script converted processed Denmark NetCDF climate variables into Ethiopia-style zonal parquet outputs.

It writes outputs such as:

  • /home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/pr/pr.parquet
  • /home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/tasmax/tasmax.parquet
  • /home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/tasmin/tasmin.parquet
  • /home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/rsds/rsds.parquet

This was the key format bridge that let us move from NetCDF to WTH generation.

08_generate_wth_files.py

This script assembled the final DSSAT weather files.

It merged the parquet variables:

  • pr
  • tasmax
  • tasmin
  • rsds

Then it wrote DSSAT .WTH files into:

  • /home/<user>/DEN/denmark/deliverables/wth/{MODEL}/{SCENARIO}/{PERIOD}

It also handled:

  • station header coordinates
  • TAV and AMP
  • DEM-based elevation

Later we repaired this script so that zero-elevation coastal cells would borrow the nearest valid land-based elevation instead of remaining at 0 m.

09_validate_outputs.py

This script counted the main deliverables:

  • parquet files
  • WTH files
  • TAV and AMP files

It was useful for fast completeness checks during batch reruns.

10_run_model_scenario_weather.sh

This was the per-model, per-scenario orchestration script.

It did:

  1. determine the time span from processed pr
  2. compute TAV and AMP
  3. build all parquet variables
  4. generate WTH files

This script made it easy to re-run only one broken scenario after a repair.

11_submit_denmark_weather.sh

This was the batch submission wrapper for the cluster.

It was used to submit the multi-model, multi-scenario Denmark weather runs.

The main climate processing logic we followed

The Denmark weather build followed this chain:

  1. use the existing processed model folders under /home/<user>/DEN
  2. ensure the processed NetCDF values were in DSSAT-ready units
  3. prepare a merged Denmark DEM for elevation support
  4. compute TAV and AMP
  5. convert daily climate grids to zone-level parquet summaries
  6. generate WTH files for each Denmark grid cell
  7. validate units, counts, and headers

Problems we had to repair

The work was not a simple one-pass run.

We had to repair real issues.

1. Broken batch dependencies and environment issues

Early jobs were stuck behind dependency logic or failed because the right geospatial environment was not consistently activated.

We fixed that by making the wrapper scripts explicit about:

  • Python interpreter
  • PROJ_LIB
  • PATH
  • environment cleanup

2. Broken processed files for MPI-ESM1-2-HR ssp370

These files were genuinely corrupted and had to be rebuilt from raw inputs:

  • /home/<user>/DEN/MPI-ESM1-2-HR/processed/MPI-ESM1-2-HR_ssp370_tasmin_denmark.nc
  • /home/<user>/DEN/MPI-ESM1-2-HR/processed/MPI-ESM1-2-HR_ssp370_rsds_denmark.nc

They were reconstructed using the same logic as the main climate processing step:

  • merge raw decade files
  • clip to Denmark
  • convert to the correct units

3. Duplicate parquet and TAV/AMP test artifacts

During iteration we created a few stale outputs from older path logic.

These were later cleaned so the final deliverables reflected only the intended production runs.

4. Zero-elevation WTH headers

Many coastal Denmark grid cells initially got ELEV = 0.

This happened because centroid sampling and some polygon means landed in very water-dominated cells.

We fixed that by improving the elevation fallback logic in:

  • /home/<user>/DEN/denmark/scripts/08_generate_wth_files.py

The final fallback order became:

  1. land-intersection centroid DEM sample
  2. land-polygon mean
  3. land-polygon max
  4. nearest grid cell with a valid land-based elevation

Final Denmark weather results

The Denmark weather deliverables finished cleanly.

Final validated counts were:

  • 80 parquet files
  • 1160 WTH files
  • 20 TAV/AMP files

That corresponds to:

  • 5 models
  • 4 scenarios each
  • 20 model-scenario combinations
  • 58 WTH files per combination

Key validation outputs:

  • /home/<user>/DEN/denmark/deliverables/validation/processed_units_validation.csv
  • /home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif

Key bundled deliverables:

  • /home/<user>/DEN/denmark/deliverables/denmark_wth_bundle.tar.gz

What we proved about the processed climate files

We validated all 100 processed climate files across:

  • pr
  • tas
  • tasmax
  • tasmin
  • rsds

Final result:

  • 100/100 files passed unit and range checks

The final expected units were:

  • pr: mm/day
  • tas: degC
  • tasmax: degC
  • tasmin: degC
  • rsds: MJ m-2 day-1

The Ethiopia soil-first pattern we should imitate

The Ethiopia project is the best reference for how soil should be prepared in a more model-agnostic and reusable way.

The core soil workflow is documented at:

  • /home/<user>/Ethiopia/soil_data/docs/soil_first_workflow.md
  • /home/<user>/Ethiopia/soil_data/docs/soil_hydraulics_runoff.md

Its core idea is simple:

  • keep one authoritative processed soil backbone
  • export DSSAT, APSIM, and STICS from that same backbone

Important precision note:

  • in the current HPC tree, the documented Ethiopia soil workflow is easiest to confirm through the files under /home/<user>/Ethiopia/soil_data rather than through a complete set of matching upstream script files
  • the docs clearly describe the producer scripts by name, but the present walkthrough only treats the existing soil_data files and docs as confirmed paths unless a script path was directly observed

Ethiopia authoritative soil backbone

The files to treat as source of truth are:

  • /home/<user>/Ethiopia/soil_data/processed/final_soil_profile_all_points.csv
  • /home/<user>/Ethiopia/soil_data/processed/HWSD2_hydraulic_parameters_from_profile.csv
  • /home/<user>/Ethiopia/soil_data/processed/soil_with_slro.csv
  • /home/<user>/Ethiopia/soil_data/processed/horizon.csv

In the current Ethiopia tree we directly confirmed at least:

  • /home/<user>/Ethiopia/soil_data/processed/soil_with_slro.csv
  • /home/<user>/Ethiopia/soil_data/processed/horizon.csv

Ethiopia downstream soil exports

  • DSSAT soil file: /home/<user>/Ethiopia/soil_data/dssat/ET.SOL
  • APSIM soils: /home/<user>/Ethiopia/soil_data/apsim/soils/*.json
  • STICS soils: /home/<user>/Ethiopia/soil_data/stics/sols.xml

This is exactly the pattern worth copying.

What the Denmark soil side should look like

If Denmark soil preparation is added later, it should not start from DSSAT format first.

It should start from a model-agnostic processed soil core and then export to model-specific products.

  • /home/<user>/DEN/soil_data/raw
  • /home/<user>/DEN/soil_data/processed
  • /home/<user>/DEN/soil_data/docs
  • /home/<user>/DEN/soil_data/reports
  • /home/<user>/DEN/soil_data/dssat
  • /home/<user>/DEN/soil_data/apsim
  • /home/<user>/DEN/soil_data/stics

At minimum, Denmark should end up with processed tables analogous to Ethiopia:

  • final_soil_profile_all_points.csv
  • hydraulic_parameters_from_profile.csv
  • soil_with_slro.csv
  • horizon.csv

One final DSSAT soil file such as:

  • DN.SOL

or an equivalent country-specific .SOL file containing all Denmark station profiles.

Each station or grid cell should ultimately have:

  • point or grid identifier
  • coordinates
  • layer bottoms
  • sand, silt, clay
  • bulk density
  • organic carbon
  • pH if available
  • hydraulic limits such as lower limit, drained upper limit, saturation
  • runoff/slope enrichment such as SLRO
  • root-limiting depth or impermeable-layer depth where available

What ISIMIP soil input data actually provides

For ISIMIP3 soil input data, the official DOI page states that two newer HWSD 1.2-based files are provided:

  • hwsd_soil_data_all_land.nc
  • hwsd_soil_data_on_cropland.nc

For agriculture work, the important one is usually:

  • hwsd_soil_data_on_cropland.nc

because it represents the soil predominantly occurring on cropland within each grid cell.

The DOI metadata also states that these files provide fields such as:

  • texture_class
  • mu_global
  • soil_ph
  • soil_caco3
  • bulk_density
  • cec_soil
  • oc
  • root_obstacles
  • impermeable_layer
  • awc
  • sand
  • silt
  • clay
  • gravel
  • ece
  • bs_soil
  • issoil

That means ISIMIP soil input data is not already a DSSAT .SOL file.

It is an upstream static geographic soil dataset that still needs to be turned into:

  • a canonical processed soil table
  • hydraulic estimates or pedotransfer outputs
  • runoff or slope enrichments
  • final model exports

What that means for a Denmark soil build

If we want to build Denmark soil inputs inspired by the Ethiopia project, the correct logic is:

  1. start from an ISIMIP soil NetCDF, ideally cropland-focused if the target is crop modeling
  2. sample the soil properties to the Denmark grid or station points
  3. create a canonical processed table with one row per layer per station
  4. derive hydraulic quantities that DSSAT needs
  5. add slope/runoff support from the DEM if needed
  6. create a final horizon.csv or equivalent initialization table
  7. export a Denmark .SOL file for DSSAT
  8. optionally export APSIM and STICS soils from the same processed backbone

The key lesson from the Denmark weather work

The weather pipeline succeeded because it had a disciplined chain:

  • one clear wrapper
  • numbered scripts
  • validated units
  • validated counts
  • explicit deliverables

The soil pipeline should follow the same discipline.

The strongest future design is therefore:

  • weather: NetCDF -> parquet -> WTH
  • soil: ISIMIP/HWSD-like source -> processed soil core -> .SOL and other model exports

The practical answer to the soil question

If you want the soil data "prepared correctly" for a Denmark-style project, motivated by Ethiopia, then the final soil deliverables should not only be a single .SOL file.

They should be a package containing:

  1. a raw source record showing which ISIMIP soil file was used
  2. a processed canonical table with one row per layer per station or grid cell
  3. a hydraulic-enriched soil table
  4. a horizon or initial-condition table
  5. a DSSAT .SOL export
  6. QC notes describing fallback assumptions and missing values

That is the structure that makes the soil side auditable, reusable, and ready for future models beyond DSSAT.

Denmark ISIMIP3b to DSSAT: Full Script Walkthrough

Full guide: github.com/Ongevic/cropmodelling

This chapter walks through every script that turned raw ISIMIP3b climate data into DSSAT-ready weather files for Denmark.

It does not summarise the scripts. It goes through each one block by block, explaining what the code does and why each decision was made.

By the end you should be able to reproduce the full pipeline from scratch, or adapt it for any other country or climate dataset.


What this pipeline produces

At the end of this workflow you have:

  • 1160 DSSAT .WTH files covering all of Denmark
  • 20 TAV/AMP summary CSVs
  • 80 Parquet zonal-statistic files as intermediate data

That corresponds to:

  • 5 climate models
  • 4 scenarios per model (historical, ssp126, ssp370, ssp585)
  • 58 Denmark grid cells per model-scenario combination

The full data flow

ISIMIP data portal
        │
        │  01_download_isimip.sh
        ▼
Raw ISIMIP3b NetCDF files
(one file per model / scenario / variable / year-chunk)
        │
        │  05_merge_clip_convert_climate.sh
        ▼
Processed Denmark NetCDF files
(units converted, clipped to Denmark, one file per model/scenario/variable)
        │
        ├──── 06_compute_tav_amp.py ──────────► tav_amp/*.csv
        │
        ├──── 07_extract_zonal_stats_parquet.py ► *.parquet
        │     (also needs 02/03 shapefiles)
        │
        └──── 08_generate_wth_files.py ──────► *.WTH
              (also needs 04 DEM, parquet, shapefile)

Validation
        │
        │  09_validate_outputs.py
        ▼
Counts check + processed_units_validation.csv

Orchestration
        │
        │  10_run_model_scenario_weather.sh
        │  11_submit_denmark_weather.sh (Slurm)
        ▼
All 20 model-scenario combinations run on HPC in parallel

Project directory layout on HPC

Everything lives under /home/<user>/DEN.

/home/<user>/DEN/
├── GFDL_ESM4/
│   ├── raw/                ← original downloaded NetCDF files
│   └── processed/          ← unit-converted, Denmark-clipped NetCDFs
├── IPSL-CM6A-LR/
│   ├── raw/
│   └── processed/
├── MPI-ESM1-2-HR/
│   ├── raw/
│   └── processed/
├── MRI-ESM2-0/
│   ├── raw/
│   └── processed/
├── UKESM1-0-LL/
│   ├── raw/
│   └── processed/
├── shapefiles/
│   ├── denmark_extent.shp
│   └── denmark_grid_05deg.shp
├── COP30/
│   └── *.tif               ← Copernicus 30m DEM tiles
└── denmark/                ← clean wrapper (scripts + deliverables)
    ├── scripts/
    ├── docs/
    ├── config/
    ├── logs/
    └── deliverables/
        ├── wth/
        ├── tav_amp/
        ├── parquet_zonal_stats/
        └── validation/

The large climate data files stay in place under each model folder. The denmark/ wrapper only holds scripts and final outputs. That separation avoids expensive file copies on HPC.


The five climate models

Short nameFull CMIP6 identifier
GFDL_ESM4NOAA Geophysical Fluid Dynamics Lab
IPSL-CM6A-LRInstitut Pierre Simon Laplace
MPI-ESM1-2-HRMax Planck Institute, high resolution
MRI-ESM2-0Meteorological Research Institute Japan
UKESM1-0-LLUK Earth System Model

These five are the standard ISIMIP3b bias-corrected set used by the climate-impacts community.


The five climate variables

CMIP6 nameMeaningDSSAT nameTarget units
prPrecipitationRAINmm/day
tasmaxDaily maximum temperatureTMAX°C
tasminDaily minimum temperatureTMIN°C
rsdsSurface downwelling shortwave radiationSRADMJ m⁻² day⁻¹
tasMean temperature (used only for TAV/AMP)°C

DSSAT does not need tas directly in the weather file but the processed tas NetCDF is validated as a consistency check.


Script 01 — Download ISIMIP climate data

File: 01_download_isimip.sh

#!/bin/bash
set -euo pipefail

exec /home/<user>/DEN/scripts/download_isimip.sh "$@"

What this script does

This is a thin wrapper. It delegates immediately to the core download script that lives in the parent project tree.

The real download logic is in:

/home/<user>/DEN/scripts/download_isimip.sh

That older script uses the ISIMIP data portal API to pull all model-scenario- variable combinations to the raw/ folders.

Line-by-line breakdown

#!/bin/bash

The shebang. Tells the shell this file is a Bash script. This is the first line of every shell script in this pipeline.

set -euo pipefail

Three safety flags in one line. Always include these at the top of any Bash script you write.

  • -e — stop the script immediately if any command returns a non-zero exit code
  • -u — treat any unset variable as an error instead of silently expanding to empty
  • -o pipefail — if any command in a pipeline fails, the whole pipeline is marked as failed

Without these, a failing download could silently continue and you would only discover the problem much later when WTH generation crashes.

exec /home/<user>/DEN/scripts/download_isimip.sh "$@"

exec replaces the current process with the target script rather than forking a child process. That means any Slurm job accounting stays with the real script and there is no intermediate shell process sitting idle.

"$@" passes all arguments through unchanged. If you call this wrapper with --model GFDL_ESM4 those flags reach the real download script.

Why the data were already present

For the Denmark project, the ISIMIP downloads had been completed in earlier work. The pipeline could therefore proceed directly to the processing step without waiting for a fresh download.

In a new project you would run this step first and let it finish before anything else.


Scripts 02 and 03 — Prepare shapefiles and grid

File: 02_prepare_shapefiles_and_grid.sh

#!/bin/bash
set -euo pipefail

echo "Use existing Denmark shapefiles in /home/<user>/DEN/shapefiles"
echo "Grid builder source: /home/<user>/DEN/scripts/create_denmark_grid.py"

File: 03_create_denmark_grid.py

#!/usr/bin/env python3
import runpy

runpy.run_path("/home/<user>/DEN/scripts/create_denmark_grid.py",
               run_name="__main__")

What these scripts do

These two scripts represent the geographic foundation.

The Denmark project needed two shapefiles:

  1. denmark_extent.shp — the outer boundary of Denmark used for clipping climate grids
  2. denmark_grid_05deg.shp — a 0.5-degree grid of polygons covering Denmark, with each polygon assigned a unique grid_id

The grid_id values in denmark_grid_05deg.shp become the station identifiers in the final WTH filenames and headers.

Why 0.5 degrees

ISIMIP3b bias-corrected climate data are delivered on a 0.5-degree global grid. Building the Denmark analysis grid at the same resolution means every grid polygon maps exactly to one or more NetCDF cells without interpolation.

Denmark spans roughly:

  • latitude: 54.5 N to 57.8 N
  • longitude: 8.0 E to 15.2 E

At 0.5-degree spacing that gives approximately 58 grid cells that contain meaningful land area.

The runpy pattern

runpy.run_path("/home/<user>/DEN/scripts/create_denmark_grid.py",
               run_name="__main__")

runpy.run_path executes a Python file as if it were run directly from the command line. Setting run_name="__main__" triggers any if __name__ == "__main__": guard in the target file.

This pattern lets the numbered wrapper scripts stay lean while pointing to the real logic files that already existed before the clean wrapper was built.


Script 04 — Prepare the DEM

File: 04_prepare_dem.sh

#!/bin/bash
set -euo pipefail

export PATH="$HOME/.local/bin:$PATH"
unset PYTHONPATH || true
GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"

BASE=/home/<user>/DEN
OUTDIR="$BASE/denmark/deliverables/validation"
mkdir -p "$OUTDIR"

echo "COP30 source tiles: $BASE/COP30" > "$OUTDIR/dem_status.txt"
find "$BASE/COP30" -maxdepth 1 -name '*.tif' | sort >> "$OUTDIR/dem_status.txt"
MERGED="$OUTDIR/denmark_dem_merged.tif"

if [ -f "$MERGED" ]; then
  echo "Merged DEM already exists: $MERGED" >> "$OUTDIR/dem_status.txt"
  exit 0
fi

$GEO_PY - <<'PY'
from pathlib import Path
import rasterio
from rasterio.merge import merge

base = Path("/home/<user>/DEN/COP30")
out = Path("/home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif")
tiles = sorted(base.glob("*.tif"))
if not tiles:
    raise SystemExit("No COP30 tiles found")

srcs = [rasterio.open(tile) for tile in tiles]
mosaic, transform = merge(srcs)
meta = srcs[0].meta.copy()
meta.update(
    {
        "driver": "GTiff",
        "height": mosaic.shape[1],
        "width": mosaic.shape[2],
        "transform": transform,
    }
)

out.parent.mkdir(parents=True, exist_ok=True)
with rasterio.open(out, "w", **meta) as dst:
    dst.write(mosaic)

for src in srcs:
    src.close()

print(out)
PY

echo "Merged DEM: $MERGED" >> "$OUTDIR/dem_status.txt"

What this script does

It merges individual Copernicus COP30 DEM tiles covering Denmark into one single raster file.

That merged raster is later used by Script 08 to read the elevation of each Denmark grid cell centroid, which goes into the ELEV field of every WTH header.

Why elevation matters in WTH files

The DSSAT weather file header looks like this:

@ INSI      LAT     LONG  ELEV   TAV   AMP REFHT WNDHT
     1    54.95     8.83     5   8.7  15.7   -99   -99

The ELEV field (here 5 metres) tells DSSAT how high above sea level the station is. DSSAT uses this for atmospheric pressure adjustments and related calculations.

If ELEV = 0 for coastal grid cells that actually sit at 3 to 8 metres, the header is technically wrong even if DSSAT can still run.

The DEM fixes that.

Line-by-line breakdown

export PATH="$HOME/.local/bin:$PATH"
unset PYTHONPATH || true
GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"

These three lines solve the most common HPC Python problem: the wrong interpreter or the wrong package set gets loaded.

  • PATH is extended so locally installed tools come first
  • PYTHONPATH is cleared so no old project packages accidentally load
  • GEO_PY points to the exact Python binary inside the geo_env conda environment, which has rasterio and geopandas installed

Always use the full interpreter path in HPC scripts. Never rely on python3 resolving to the right environment.

BASE=/home/<user>/DEN
OUTDIR="$BASE/denmark/deliverables/validation"
mkdir -p "$OUTDIR"

BASE is the root of the Denmark project. OUTDIR is where the merged DEM and all validation outputs will go. mkdir -p creates the directory tree if it does not already exist.

echo "COP30 source tiles: $BASE/COP30" > "$OUTDIR/dem_status.txt"
find "$BASE/COP30" -maxdepth 1 -name '*.tif' | sort >> "$OUTDIR/dem_status.txt"

Before doing any work, the script writes a log. It records where the source tiles came from and lists each file. This gives you a permanent audit trail without opening any tool.

MERGED="$OUTDIR/denmark_dem_merged.tif"

if [ -f "$MERGED" ]; then
  echo "Merged DEM already exists: $MERGED" >> "$OUTDIR/dem_status.txt"
  exit 0
fi

If the merged file already exists, the script skips the merge and exits cleanly. This makes the script safe to rerun. Rerunning 04_prepare_dem.sh a second time never damages the output.

$GEO_PY - <<'PY'
...
PY

This runs a Python heredoc. The shell passes everything between <<'PY' and PY directly to the Python interpreter as standard input. The single-quoted 'PY' delimiter means no shell variable expansion happens inside the Python block.

Inside the Python block:

tiles = sorted(base.glob("*.tif"))
if not tiles:
    raise SystemExit("No COP30 tiles found")

sorted() guarantees a consistent tile order. raise SystemExit gives a clear error message if the COP30 folder is empty, rather than silently producing a broken output.

srcs = [rasterio.open(tile) for tile in tiles]
mosaic, transform = merge(srcs)

rasterio.merge.merge stitches the tiles together into a single array (mosaic) and computes the combined geotransform (transform) that describes where the merged raster sits on Earth.

meta = srcs[0].meta.copy()
meta.update(
    {
        "driver": "GTiff",
        "height": mosaic.shape[1],
        "width": mosaic.shape[2],
        "transform": transform,
    }
)

The metadata dictionary from the first tile is copied and updated with the new dimensions and transform. The driver stays as GTiff because the COP30 tiles are GeoTIFF files.

with rasterio.open(out, "w", **meta) as dst:
    dst.write(mosaic)

for src in srcs:
    src.close()

The merged raster is written to disk. Then every source file handle is closed. Forgetting to close file handles on HPC can leave locked files and cause problems for later scripts.

Practical check

After this script runs, confirm the merged DEM exists and has the right extent:

gdalinfo /home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif \
  | grep -E "Origin|Pixel Size|Size"

The corner coordinates should bracket Denmark. If the extent covers only part of Denmark, some COP30 tiles were missing from the COP30/ folder.


Script 05 — Merge, clip, and convert climate data

File: 05_merge_clip_convert_climate.sh

#!/bin/bash
set -euo pipefail

exec /home/<user>/DEN/scripts/merge_denmark.sh "$@"

What this script does

Like Script 01, this is a thin wrapper. The real processing logic is in the pre-existing script:

/home/<user>/DEN/scripts/merge_denmark.sh

That script does the three most important preprocessing steps for each model-scenario-variable combination:

  1. Merge — yearly or decade-chunked NetCDF files are concatenated into one continuous time series
  2. Clip — the global or European climate field is cut to Denmark's bounding box using the extent shapefile
  3. Convert — raw ISIMIP3b units are converted to DSSAT-ready units

Why unit conversion is the critical step

ISIMIP3b delivers data in SI units (inherited from the underlying CMIP6 models):

VariableRaw ISIMIP3b unitMeaning
prkg m⁻² s⁻¹precipitation flux
tas, tasmax, tasminKtemperature in Kelvin
rsdsW m⁻²watts per square metre

DSSAT expects:

VariableDSSAT unit
RAINmm/day
TMAX, TMIN°C
SRADMJ m⁻² day⁻¹

The conversion formulas used are:

pr [mm/day]        = pr [kg/m²/s] × 86400
tas [°C]           = tas [K] − 273.15
tasmax [°C]        = tasmax [K] − 273.15
tasmin [°C]        = tasmin [K] − 273.15
rsds [MJ/m²/day]   = rsds [W/m²] × 0.0864

The 86400 factor for precipitation converts seconds to days (1 day = 86400 s). The 0.0864 factor for radiation converts W/m² to MJ/m²/day (1 W/m² × 86400 s/day ÷ 1,000,000 J/MJ = 0.0864).

Output files produced

For each model and scenario, this step produces five NetCDF files:

/home/<user>/DEN/{MODEL}/processed/
    {MODEL}_{SCENARIO}_pr_denmark.nc
    {MODEL}_{SCENARIO}_tas_denmark.nc
    {MODEL}_{SCENARIO}_tasmax_denmark.nc
    {MODEL}_{SCENARIO}_tasmin_denmark.nc
    {MODEL}_{SCENARIO}_rsds_denmark.nc

Five models × four scenarios × five variables = 100 processed NetCDF files in total.

A real problem we hit here

During production, two files for MPI-ESM1-2-HR / ssp370 were found corrupted:

MPI-ESM1-2-HR_ssp370_tasmin_denmark.nc
MPI-ESM1-2-HR_ssp370_rsds_denmark.nc

They had to be rebuilt from raw by rerunning the merge-clip-convert logic on the raw source files. After rebuilding, the downstream WTH generation for that model-scenario combination ran correctly.

If you see a model-scenario combination missing from the final WTH output, always check whether its processed NetCDFs are intact first.


Script 06 — Compute TAV and AMP

File: 06_compute_tav_amp.py

#!/usr/bin/env python3
from pathlib import Path
import argparse
import pandas as pd
import xarray as xr


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Compute Denmark TAV/AMP from processed NetCDFs."
    )
    parser.add_argument("--model", required=True)
    parser.add_argument("--scenario", required=True)
    parser.add_argument("--start-year", type=int)
    parser.add_argument("--end-year", type=int)
    args = parser.parse_args()

    base = Path("/home/<user>/DEN")
    tasmax = (
        base / args.model / "processed"
        / f"{args.model}_{args.scenario}_tasmax_denmark.nc"
    )
    tasmin = (
        base / args.model / "processed"
        / f"{args.model}_{args.scenario}_tasmin_denmark.nc"
    )
    if not tasmax.exists() or not tasmin.exists():
        raise SystemExit("Missing tasmax or tasmin processed NetCDF")

    dsx = xr.open_dataset(tasmax)
    dsn = xr.open_dataset(tasmin)
    tx = (
        dsx["tasmax"]
        .mean(dim=[d for d in dsx["tasmax"].dims if d != "time"])
        .to_series()
    )
    tn = (
        dsn["tasmin"]
        .mean(dim=[d for d in dsn["tasmin"].dims if d != "time"])
        .to_series()
    )
    df = pd.DataFrame(
        {"date": pd.to_datetime(tx.index), "tasmax": tx.values}
    ).merge(
        pd.DataFrame({"date": pd.to_datetime(tn.index), "tasmin": tn.values}),
        on="date",
        how="inner",
    )
    if args.start_year is not None:
        df = df[df["date"].dt.year >= args.start_year]
    if args.end_year is not None:
        df = df[df["date"].dt.year <= args.end_year]
    if df.empty:
        raise SystemExit("No overlapping tasmax/tasmin dates after filtering")

    tavg = (df["tasmax"] + df["tasmin"]) / 2.0
    tav_val = float(tavg.mean())
    monthly = (
        df.assign(month=df["date"].dt.month)
        .groupby("month")[["tasmax", "tasmin"]]
        .mean()
    )
    monthly_tavg = (monthly["tasmax"] + monthly["tasmin"]) / 2.0
    amp_val = float(monthly_tavg.max() - monthly_tavg.min())

    period = f"{df['date'].dt.year.min()}_{df['date'].dt.year.max()}"
    outdir = (
        base / "denmark" / "deliverables" / "tav_amp"
        / args.model / args.scenario / period
    )
    outdir.mkdir(parents=True, exist_ok=True)
    out = outdir / "tav_amp.csv"
    pd.DataFrame(
        [
            {
                "model": args.model,
                "scenario": args.scenario,
                "tav_degC": round(tav_val, 3),
                "amp_degC": round(amp_val, 3),
            }
        ]
    ).to_csv(out, index=False)
    dsx.close()
    dsn.close()
    print(out)


if __name__ == "__main__":
    main()

What this script produces

For each model-scenario pair it writes one CSV file:

deliverables/tav_amp/GFDL_ESM4/historical/1971_2014/tav_amp.csv

That CSV contains two numbers:

model,scenario,tav_degC,amp_degC
GFDL_ESM4,historical,7.951,16.528

Those two numbers end up in the header of every WTH file for that model-scenario combination.

What TAV and AMP mean for DSSAT

TAV is the long-term mean annual temperature at the site in degrees Celsius. DSSAT uses it for soil temperature initialisation.

AMP is the mean annual amplitude — the difference between the warmest month's average temperature and the coldest month's average temperature. It describes how strongly the site's temperature cycles through the year. A site in Denmark has a large AMP (~16 °C) because summers and winters are very different. A tropical site has a small AMP (3 to 5 °C).

Line-by-line breakdown

parser = argparse.ArgumentParser(
    description="Compute Denmark TAV/AMP from processed NetCDFs."
)
parser.add_argument("--model", required=True)
parser.add_argument("--scenario", required=True)
parser.add_argument("--start-year", type=int)
parser.add_argument("--end-year", type=int)
args = parser.parse_args()

The script takes four command-line arguments. --model and --scenario are required. The year filters are optional — when omitted, the full time range in the NetCDF is used.

This design means the script can be called for any model-scenario combination without editing the code.

tasmax = (
    base / args.model / "processed"
    / f"{args.model}_{args.scenario}_tasmax_denmark.nc"
)
tasmin = (
    base / args.model / "processed"
    / f"{args.model}_{args.scenario}_tasmin_denmark.nc"
)
if not tasmax.exists() or not tasmin.exists():
    raise SystemExit("Missing tasmax or tasmin processed NetCDF")

The script builds the expected file paths from the arguments and checks they exist before trying to open them. A SystemExit with a clear message is better than letting xr.open_dataset crash with a cryptic error.

dsx = xr.open_dataset(tasmax)
dsn = xr.open_dataset(tasmin)
tx = (
    dsx["tasmax"]
    .mean(dim=[d for d in dsx["tasmax"].dims if d != "time"])
    .to_series()
)

This opens both NetCDF files and reduces the spatial dimensions to a single Denmark-wide daily mean. The expression [d for d in dsx["tasmax"].dims if d != "time"] builds a list of all dimensions except time — typically ["lat", "lon"] — and averages over them. The result is a daily time series of the Denmark-mean maximum temperature.

Taking a spatial mean here is a simplification. It produces one TAV/AMP pair per model-scenario combination rather than one per grid cell. The per-cell TAV and AMP values are computed later inside Script 08 from the actual zone data.

df = pd.DataFrame(
    {"date": pd.to_datetime(tx.index), "tasmax": tx.values}
).merge(
    pd.DataFrame({"date": pd.to_datetime(tn.index), "tasmin": tn.values}),
    on="date",
    how="inner",
)

The two time series (tasmax and tasmin) are joined on date using an inner merge. Inner merge means only dates present in both series are kept. This protects against misaligned time coordinates between the two NetCDF files.

if args.start_year is not None:
    df = df[df["date"].dt.year >= args.start_year]
if args.end_year is not None:
    df = df[df["date"].dt.year <= args.end_year]
if df.empty:
    raise SystemExit("No overlapping tasmax/tasmin dates after filtering")

The optional year filters are applied. If after filtering nothing remains (the requested years are outside the NetCDF range), the script exits with a clear error.

tavg = (df["tasmax"] + df["tasmin"]) / 2.0
tav_val = float(tavg.mean())

TAV is the mean of the daily mean temperatures across the whole period. The daily mean temperature is approximated as (tasmax + tasmin) / 2.

monthly = (
    df.assign(month=df["date"].dt.month)
    .groupby("month")[["tasmax", "tasmin"]]
    .mean()
)
monthly_tavg = (monthly["tasmax"] + monthly["tasmin"]) / 2.0
amp_val = float(monthly_tavg.max() - monthly_tavg.min())

AMP is derived from monthly averages. Each month's mean maximum and minimum are averaged, then AMP is the difference between the warmest and coldest month. This is a 12-value lookup, not day-by-day.

period = f"{df['date'].dt.year.min()}_{df['date'].dt.year.max()}"
outdir = (
    base / "denmark" / "deliverables" / "tav_amp"
    / args.model / args.scenario / period
)
outdir.mkdir(parents=True, exist_ok=True)

The output directory is constructed from the data itself. If the NetCDF spans 1971 to 2014, period becomes "1971_2014". This makes the output path self-documenting and avoids hardcoding year ranges.

pd.DataFrame(
    [
        {
            "model": args.model,
            "scenario": args.scenario,
            "tav_degC": round(tav_val, 3),
            "amp_degC": round(amp_val, 3),
        }
    ]
).to_csv(out, index=False)
dsx.close()
dsn.close()
print(out)

The output CSV has exactly one row. Values are rounded to 3 decimal places. Both dataset handles are closed before the script exits. The path of the output file is printed so the calling shell script can log it.


Script 07 — Extract zonal statistics to Parquet

File: 07_extract_zonal_stats_parquet.py

#!/usr/bin/env python3
"""Convert Denmark processed NetCDF files into Ethiopia-style zonal parquet outputs."""

import argparse
from pathlib import Path

import geopandas as gpd
import numpy as np
import pandas as pd
import xarray as xr
from shapely.geometry import Point


def find_cells(geometry, lons, lats):
    bounds = geometry.bounds
    lon_idx = np.where((lons >= bounds[0]) & (lons <= bounds[2]))[0]
    lat_idx = np.where((lats >= bounds[1]) & (lats <= bounds[3]))[0]
    hits = []
    for yi in lat_idx:
        for xi in lon_idx:
            if geometry.contains(Point(lons[xi], lats[yi])):
                hits.append((yi, xi))
    return hits


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--scenario", required=True)
    parser.add_argument("--variable", required=True)
    parser.add_argument("--start-year", type=int)
    parser.add_argument("--end-year", type=int)
    args = parser.parse_args()

    base = Path("/home/<user>/DEN")
    nc_file = (
        base / args.model / "processed"
        / f"{args.model}_{args.scenario}_{args.variable}_denmark.nc"
    )
    shp = base / "shapefiles" / "denmark_grid_05deg.shp"
    gdf = gpd.read_file(shp)
    ds = xr.open_dataset(nc_file)
    var = ds[args.variable]
    lats = ds["lat"].values if "lat" in ds.coords else ds["latitude"].values
    lons = ds["lon"].values if "lon" in ds.coords else ds["longitude"].values
    times = pd.to_datetime(ds["time"].values)
    if args.start_year is not None:
        mask = times.year >= args.start_year
        var = var.isel(time=mask)
        times = times[mask]
    if args.end_year is not None:
        mask = times.year <= args.end_year
        var = var.isel(time=mask)
        times = times[mask]
    if len(times) == 0:
        raise SystemExit("No timesteps remain after filtering")
    period = f"{times.min().year}_{times.max().year}"
    out_dir = (
        base
        / "denmark"
        / "deliverables"
        / "parquet_zonal_stats"
        / args.model
        / args.scenario
        / period
        / args.variable
    )
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{args.variable}.parquet"

    rows = []
    for idx, row in gdf.iterrows():
        zone_id = int(row.get("grid_id", idx + 1))
        cells = find_cells(row.geometry, lons, lats)
        if not cells:
            continue
        yi = [c[0] for c in cells]
        xi = [c[1] for c in cells]
        values = var.values[:, yi, xi]
        mean_values = np.nanmean(values, axis=1)
        counts = np.sum(~np.isnan(values), axis=1)
        rows.append(
            pd.DataFrame(
                {
                    "zone_id": zone_id,
                    "date": times,
                    "value": mean_values,
                    "cell_count": counts,
                }
            )
        )

    if not rows:
        raise SystemExit("No zonal rows created")

    pd.concat(rows, ignore_index=True).to_parquet(out_file, index=False)
    ds.close()
    print(out_file)


if __name__ == "__main__":
    main()

What this script produces

For each model, scenario, variable, and period it writes one Parquet file:

deliverables/parquet_zonal_stats/
    GFDL_ESM4/
        historical/
            1971_2014/
                pr/pr.parquet
                tasmax/tasmax.parquet
                tasmin/tasmin.parquet
                rsds/rsds.parquet

Each Parquet file has one row per zone per day:

zone_iddatevaluecell_count
11971-01-010.01
11971-01-021.61
............
582014-12-312.31

What Parquet is and why it is used here

Parquet is a column-oriented binary file format. Compared to CSV:

  • much smaller file size for the same data
  • much faster to read in Python with pandas or pyarrow
  • stores data types exactly (no ambiguity between integers and strings)

For a daily time series over 58 zones and 44 years, a CSV would have about 930,000 rows per variable. Parquet compresses this by 5 to 10 times and reads it roughly 10 times faster.

The find_cells function

def find_cells(geometry, lons, lats):
    bounds = geometry.bounds
    lon_idx = np.where((lons >= bounds[0]) & (lons <= bounds[2]))[0]
    lat_idx = np.where((lats >= bounds[1]) & (lats <= bounds[3]))[0]
    hits = []
    for yi in lat_idx:
        for xi in lon_idx:
            if geometry.contains(Point(lons[xi], lats[yi])):
                hits.append((yi, xi))
    return hits

This function answers the question: which NetCDF grid points fall inside this shapefile polygon?

Step 1 — bounding box pre-filter:

bounds = geometry.bounds
lon_idx = np.where((lons >= bounds[0]) & (lons <= bounds[2]))[0]
lat_idx = np.where((lats >= bounds[1]) & (lats <= bounds[3]))[0]

geometry.bounds returns (minx, miny, maxx, maxy) for the polygon. The numpy where calls find all longitude indices that fall within the east-west range and all latitude indices that fall within the north-south range. This pre-filter makes the full search much faster by limiting the candidate cells before the slower point-in-polygon test.

Step 2 — point-in-polygon test:

for yi in lat_idx:
    for xi in lon_idx:
        if geometry.contains(Point(lons[xi], lats[yi])):
            hits.append((yi, xi))

For each candidate cell, a Point is constructed from its longitude and latitude and tested against the polygon using Shapely's contains method. Only cells whose centre point falls strictly inside the polygon are kept.

The function returns a list of (yi, xi) index pairs — the row and column indices into the NetCDF data array.

Line-by-line breakdown of main

lats = ds["lat"].values if "lat" in ds.coords else ds["latitude"].values
lons = ds["lon"].values if "lon" in ds.coords else ds["longitude"].values

Different ISIMIP providers use slightly different coordinate names. Some use lat/lon, others use latitude/longitude. This one-liner handles both without crashing.

times = pd.to_datetime(ds["time"].values)
if args.start_year is not None:
    mask = times.year >= args.start_year
    var = var.isel(time=mask)
    times = times[mask]

The time dimension is converted to pandas DatetimeIndex immediately. Pandas datetime operations (.year, .month, etc.) are simpler and more reliable than xarray's own time filtering. The isel call selects only the time steps that pass the year filter, keeping the data and times in sync.

for idx, row in gdf.iterrows():
    zone_id = int(row.get("grid_id", idx + 1))
    cells = find_cells(row.geometry, lons, lats)
    if not cells:
        continue
    yi = [c[0] for c in cells]
    xi = [c[1] for c in cells]
    values = var.values[:, yi, xi]
    mean_values = np.nanmean(values, axis=1)
    counts = np.sum(~np.isnan(values), axis=1)

The outer loop iterates over every Denmark grid polygon. For each polygon:

  • find_cells returns the NetCDF indices for all grid points inside it
  • var.values[:, yi, xi] extracts a 2D array of shape (n_days, n_cells)
  • np.nanmean(..., axis=1) averages across cells for each day, ignoring NaN values (ocean cells in the NetCDF return NaN)
  • counts records how many non-NaN cells contributed to each day's average, which is useful for QC later
rows.append(
    pd.DataFrame(
        {
            "zone_id": zone_id,
            "date": times,
            "value": mean_values,
            "cell_count": counts,
        }
    )
)

Each zone becomes one DataFrame with one row per day. All zone DataFrames are collected in rows.

pd.concat(rows, ignore_index=True).to_parquet(out_file, index=False)

All zone DataFrames are concatenated and written as a single Parquet file. ignore_index=True resets the row index to 0, 1, 2, … so there are no duplicate index values. index=False keeps the Parquet file clean by not writing the index as an extra column.


Script 08 — Generate DSSAT WTH files

File: 08_generate_wth_files.py

#!/usr/bin/env python3
"""Generate Denmark DSSAT WTH files from Denmark parquet zonal stats."""

from __future__ import annotations

import argparse
from pathlib import Path

import pandas as pd

VARIABLES = ["pr", "tasmax", "tasmin", "rsds"]
COLUMN_MAP = {"pr": "RAIN", "tasmax": "TMAX", "tasmin": "TMIN", "rsds": "SRAD"}
DEFAULT_DEM = Path(
    "/home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif"
)
MIN_VALID_ELEV = 1.0


def load_zone_metadata(shapefile: Path, dem_raster: Path | None):
    import geopandas as gpd
    import rasterio
    from rasterstats import zonal_stats

    gdf = gpd.read_file(shapefile)
    extent_path = Path("/home/<user>/DEN/shapefiles/denmark_extent.shp")
    land_geom = None
    if extent_path.exists():
        extent_gdf = gpd.read_file(extent_path)
        land_geom = (
            extent_gdf.union_all()
            if hasattr(extent_gdf, "union_all")
            else extent_gdf.unary_union
        )
    meta = {}
    sample_points = []
    zone_ids = []
    land_shapes = []
    for idx, row in gdf.iterrows():
        zone_id = int(row.get("grid_id", idx + 1))
        geom = row.geometry
        if land_geom is not None:
            clipped = geom.intersection(land_geom)
            if not clipped.is_empty:
                geom = clipped
        centroid = geom.centroid
        meta[zone_id] = {"lat": float(centroid.y), "lon": float(centroid.x), "elev": 0.0}
        sample_points.append((float(centroid.x), float(centroid.y)))
        zone_ids.append(zone_id)
        land_shapes.append(geom)

    if dem_raster and dem_raster.exists():
        with rasterio.open(dem_raster) as src:
            nodata = src.nodata
            samples = list(src.sample(sample_points))
        stats = zonal_stats(land_shapes, dem_raster, stats=["mean", "max"], nodata=nodata)
        for zone_id, sample in zip(zone_ids, samples):
            elev = float(sample[0]) if sample is not None and len(sample) else 0.0
            if nodata is not None and elev == nodata:
                elev = 0.0
            meta[zone_id]["elev"] = elev
        for zone_id, stat in zip(zone_ids, stats):
            if meta[zone_id]["elev"] <= 0 and stat.get("mean") is not None:
                meta[zone_id]["elev"] = float(stat["mean"])
            if meta[zone_id]["elev"] <= 0 and stat.get("max") is not None:
                meta[zone_id]["elev"] = float(stat["max"])

        positive = [zid for zid in zone_ids if meta[zid]["elev"] >= MIN_VALID_ELEV]
        for zone_id in zone_ids:
            if meta[zone_id]["elev"] >= MIN_VALID_ELEV or not positive:
                continue
            lat = meta[zone_id]["lat"]
            lon = meta[zone_id]["lon"]
            nearest = min(
                positive,
                key=lambda zid: (meta[zid]["lat"] - lat) ** 2
                + (meta[zid]["lon"] - lon) ** 2,
            )
            meta[zone_id]["elev"] = meta[nearest]["elev"]
    return meta


def normalize_srad(series: pd.Series) -> pd.Series:
    max_val = float(series.max()) if not series.empty else 0.0
    if max_val > 10000:
        return series / 1_000_000.0
    if max_val > 100:
        return series * 0.0864
    return series


def write_wth(
    output_path: Path,
    zone_id: int,
    zone_df: pd.DataFrame,
    zone_meta: dict[str, float],
) -> None:
    tavg = (zone_df["TMAX"] + zone_df["TMIN"]) / 2.0
    tav = round(float(tavg.mean()), 1)
    monthly = (
        zone_df.assign(month=pd.to_datetime(zone_df["date"]).dt.month)
        .groupby("month")[["TMAX", "TMIN"]]
        .mean()
    )
    monthly_tavg = (monthly["TMAX"] + monthly["TMIN"]) / 2.0
    amp = round(float(monthly_tavg.max() - monthly_tavg.min()), 1)
    years = sorted(zone_df["year"].unique())
    start_year = years[0]

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="ascii", newline="\n") as h:
        h.write(f"*WEATHER DATA : {zone_id:04d} {start_year}\n")
        h.write("@ INSI      LAT     LONG  ELEV   TAV   AMP REFHT WNDHT\n")
        h.write(
            f"{zone_id:>6}"
            f"{zone_meta['lat']:>9.2f}"
            f"{zone_meta['lon']:>9.2f}"
            f"{zone_meta['elev']:>6.0f}"
            f"{tav:>6.1f}"
            f"{amp:>6.1f}"
            f"{-99:>6}"
            f"{-99:>6}\n"
        )
        h.write("@DATE  SRAD  TMAX  TMIN  RAIN\n")
        for year, doy, srad, tmax, tmin, rain in zone_df[
            ["year", "doy", "SRAD", "TMAX", "TMIN", "RAIN"]
        ].itertuples(index=False, name=None):
            h.write(
                f"{int(year) % 100:02d}{int(doy):03d}"
                f"{float(srad):6.1f}"
                f"{float(tmax):6.1f}"
                f"{float(tmin):6.1f}"
                f"{float(rain):6.1f}\n"
            )


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--scenario", required=True)
    parser.add_argument("--period", required=True)
    parser.add_argument("--dem-raster", default=str(DEFAULT_DEM))
    args = parser.parse_args()

    base = Path("/home/<user>/DEN/denmark/deliverables")
    zonal_dir = base / "parquet_zonal_stats" / args.model / args.scenario / args.period
    out_dir = base / "wth" / args.model / args.scenario / args.period
    meta = load_zone_metadata(
        Path("/home/<user>/DEN/shapefiles/denmark_grid_05deg.shp"),
        Path(args.dem_raster) if args.dem_raster else None,
    )

    zone_data = {}
    for var in VARIABLES:
        pq = zonal_dir / var / f"{var}.parquet"
        if not pq.exists():
            raise SystemExit(f"Missing {pq}")
        zone_data[var] = pd.read_parquet(pq)

    merged = zone_data["pr"][["zone_id", "date", "value"]].rename(
        columns={"value": "RAIN"}
    )
    for var in VARIABLES[1:]:
        df = zone_data[var][["zone_id", "date", "value"]].rename(
            columns={"value": COLUMN_MAP[var]}
        )
        merged = merged.merge(df, on=["zone_id", "date"], how="inner")

    merged["date"] = pd.to_datetime(merged["date"])
    merged["SRAD"] = normalize_srad(merged["SRAD"])
    merged["year"] = merged["date"].dt.year
    merged["doy"] = merged["date"].dt.dayofyear

    for zone_id, zone_df in merged.groupby("zone_id"):
        years = sorted(zone_df["year"].unique())
        filename = f"{int(zone_id):04d}{years[0] % 100:02d}{len(years):02d}.WTH"
        write_wth(
            out_dir / filename,
            int(zone_id),
            zone_df.sort_values("date"),
            meta[int(zone_id)],
        )

    print(out_dir)


if __name__ == "__main__":
    main()

What this script produces

One .WTH file per Denmark grid cell per model-scenario combination:

deliverables/wth/GFDL_ESM4/historical/1971_2014/
    00017144.WTH
    00027144.WTH
    ...
    05807144.WTH

The filename pattern is {zone_id:04d}{yy:02d}{n_years:02d}.WTH.

For zone 1, starting in 1971, spanning 44 years: 00017144.WTH.

Each file looks like this:

*WEATHER DATA : 0001 1971
@ INSI      LAT     LONG  ELEV   TAV   AMP REFHT WNDHT
     1    54.95     8.83     5   8.7  15.7   -99   -99
@DATE  SRAD  TMAX  TMIN  RAIN
71001   1.2   2.6  -4.7   0.0
71002   2.9  -3.3  -7.4   0.0
...

The module-level constants

VARIABLES = ["pr", "tasmax", "tasmin", "rsds"]
COLUMN_MAP = {"pr": "RAIN", "tasmax": "TMAX", "tasmin": "TMIN", "rsds": "SRAD"}
DEFAULT_DEM = Path(
    "/home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif"
)
MIN_VALID_ELEV = 1.0

Defining constants at the top of the file rather than buried in functions makes them easy to find and change. MIN_VALID_ELEV = 1.0 sets the threshold below which an elevation value is considered suspect (likely a water cell or a nodata misread). Any zone with elevation below 1 metre gets the fallback logic.

The load_zone_metadata function

This function returns a dictionary keyed by zone_id, with each value containing the latitude, longitude, and elevation for that zone.

land_geom = None
if extent_path.exists():
    extent_gdf = gpd.read_file(extent_path)
    land_geom = (
        extent_gdf.union_all()
        if hasattr(extent_gdf, "union_all")
        else extent_gdf.unary_union
    )

The Denmark extent shapefile is read and dissolved into a single geometry representing all land area. This is used in the next block to clip the grid polygons so coastal cells only sample elevation over their land portion.

union_all is the newer geopandas method; unary_union is the older one. The hasattr check makes the script compatible with both.

for idx, row in gdf.iterrows():
    zone_id = int(row.get("grid_id", idx + 1))
    geom = row.geometry
    if land_geom is not None:
        clipped = geom.intersection(land_geom)
        if not clipped.is_empty:
            geom = clipped
    centroid = geom.centroid

Each grid polygon is intersected with the land boundary. For a purely inland grid cell the intersection returns the same polygon. For a coastal cell that is half sea, the intersection returns only the land portion. The centroid is then computed from the land-only shape, so it sits on land rather than in the sea.

The four-level elevation fallback:

# Level 1: DEM point sample at the land centroid
with rasterio.open(dem_raster) as src:
    nodata = src.nodata
    samples = list(src.sample(sample_points))
for zone_id, sample in zip(zone_ids, samples):
    elev = float(sample[0]) if sample is not None and len(sample) else 0.0
    if nodata is not None and elev == nodata:
        elev = 0.0
    meta[zone_id]["elev"] = elev

# Level 2 and 3: zonal mean, then max, if point sample failed
stats = zonal_stats(land_shapes, dem_raster, stats=["mean", "max"], nodata=nodata)
for zone_id, stat in zip(zone_ids, stats):
    if meta[zone_id]["elev"] <= 0 and stat.get("mean") is not None:
        meta[zone_id]["elev"] = float(stat["mean"])
    if meta[zone_id]["elev"] <= 0 and stat.get("max") is not None:
        meta[zone_id]["elev"] = float(stat["max"])

# Level 4: nearest valid neighbour
positive = [zid for zid in zone_ids if meta[zid]["elev"] >= MIN_VALID_ELEV]
for zone_id in zone_ids:
    if meta[zone_id]["elev"] >= MIN_VALID_ELEV or not positive:
        continue
    nearest = min(
        positive,
        key=lambda zid: (meta[zid]["lat"] - lat) ** 2
        + (meta[zid]["lon"] - lon) ** 2,
    )
    meta[zone_id]["elev"] = meta[nearest]["elev"]

Before this four-level fallback was added, many coastal Denmark grid cells got ELEV = 0 because their centroid fell in the sea and the DEM returned nodata there. The fallback ensures every zone ends up with a physically plausible elevation.

The normalize_srad function

def normalize_srad(series: pd.Series) -> pd.Series:
    max_val = float(series.max()) if not series.empty else 0.0
    if max_val > 10000:
        return series / 1_000_000.0
    if max_val > 100:
        return series * 0.0864
    return series

This function handles solar radiation unit ambiguity robustly.

ISIMIP3b data sometimes arrives as W/m² and sometimes the unit conversion in Script 05 already applied the 0.0864 factor. The maximum value in the series reveals which case we are in:

  • If max > 10,000: the values are in J/m²/day (raw energy totals). Divide by 1,000,000 to get MJ/m²/day.
  • If max > 100: the values are in W/m². Multiply by 0.0864 to convert to MJ/m²/day.
  • Otherwise: values are already in MJ/m²/day. Return unchanged.

This defensive approach prevents radiation values 1,000 times too large from silently corrupting the WTH files.

The write_wth function — building the WTH header

tavg = (zone_df["TMAX"] + zone_df["TMIN"]) / 2.0
tav = round(float(tavg.mean()), 1)
monthly = (
    zone_df.assign(month=pd.to_datetime(zone_df["date"]).dt.month)
    .groupby("month")[["TMAX", "TMIN"]]
    .mean()
)
monthly_tavg = (monthly["TMAX"] + monthly["TMIN"]) / 2.0
amp = round(float(monthly_tavg.max() - monthly_tavg.min()), 1)

TAV and AMP are computed per zone from the actual zone data. These are therefore slightly different from the domain-average TAV/AMP in Script 06. The values written to the WTH header are the per-zone values here.

with open(output_path, "w", encoding="ascii", newline="\n") as h:
    h.write(f"*WEATHER DATA : {zone_id:04d} {start_year}\n")
    h.write("@ INSI      LAT     LONG  ELEV   TAV   AMP REFHT WNDHT\n")
    h.write(
        f"{zone_id:>6}"
        f"{zone_meta['lat']:>9.2f}"
        f"{zone_meta['lon']:>9.2f}"
        f"{zone_meta['elev']:>6.0f}"
        f"{tav:>6.1f}"
        f"{amp:>6.1f}"
        f"{-99:>6}"
        f"{-99:>6}\n"
    )

Two things matter in this block:

  1. encoding="ascii" — DSSAT reads weather files as plain ASCII. Using UTF-8 or Latin-1 can cause invisible character problems.
  2. newline="\n" — DSSAT expects Unix-style line endings. If you let Python use the platform default on Windows, you get \r\n line endings, which can confuse DSSAT's fixed-width parser.

The format strings use right-aligned field widths (>6, >9.2f, etc.) to match the exact column positions expected by DSSAT. Even one character off can cause DSSAT to misread a value.

-99 for REFHT and WNDHT is the DSSAT missing-value code, indicating that reference height for humidity and wind height are not provided. DSSAT handles this gracefully.

Writing the daily data rows

h.write("@DATE  SRAD  TMAX  TMIN  RAIN\n")
for year, doy, srad, tmax, tmin, rain in zone_df[
    ["year", "doy", "SRAD", "TMAX", "TMIN", "RAIN"]
].itertuples(index=False, name=None):
    h.write(
        f"{int(year) % 100:02d}{int(doy):03d}"
        f"{float(srad):6.1f}"
        f"{float(tmax):6.1f}"
        f"{float(tmin):6.1f}"
        f"{float(rain):6.1f}\n"
    )

int(year) % 100 converts a 4-digit year to a 2-digit year: 1971 becomes 71, 2031 becomes 31. DSSAT's @DATE field is YYDDD format — two-digit year plus three-digit day-of-year. So 1 January 1971 becomes 71001.

itertuples is used instead of iterrows because it is significantly faster when writing many rows. The name=None argument suppresses the named-tuple wrapper for a small additional speed gain.

Building the WTH filename

filename = f"{int(zone_id):04d}{years[0] % 100:02d}{len(years):02d}.WTH"

For zone 1, starting in 1971, spanning 44 years: 00017144.WTH.

  • 0001 — zone ID, zero-padded to 4 digits
  • 71 — two-digit start year (1971 % 100)
  • 44 — number of years in the file

This naming convention is consistent with how DSSAT weather files are typically named.


Script 09 — Validate outputs

File: 09_validate_outputs.py

#!/usr/bin/env python3
from pathlib import Path

base = Path("/home/<user>/DEN/denmark/deliverables")
for label, pattern in [
    ("parquet", "parquet_zonal_stats/**/*.parquet"),
    ("wth", "wth/**/*.WTH"),
    ("tav_amp", "tav_amp/**/*.csv"),
]:
    print(label, len(list(base.rglob(pattern.split("**/")[-1]))))

What this script does

It counts the deliverables produced.

After a full production run, running this script should print:

parquet 80
wth 1160
tav_amp 20

Why those numbers

  • 80 parquet files = 5 models × 4 scenarios × 4 variables
  • 1160 WTH files = 5 models × 4 scenarios × 58 Denmark grid cells
  • 20 TAV/AMP files = 5 models × 4 scenarios

If any number is wrong, you know which layer of the pipeline to investigate. If parquet is 76 instead of 80, four Parquet jobs failed. If wth is 1102 instead of 1160, one model-scenario combination's 58 WTH files are missing.

The rglob trick

base.rglob(pattern.split("**/")[-1])

split("**/")[-1] extracts just the filename extension pattern (e.g. *.parquet). Then rglob searches recursively for all files matching that pattern. This avoids needing to know the exact directory depth, which changes between models and scenarios.


Script 10 — Run one model-scenario combination

File: 10_run_model_scenario_weather.sh

#!/bin/bash
set -euo pipefail

export PATH="$HOME/.local/bin:$PATH"
unset PYTHONPATH || true
GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"
GEO_ENV_ROOT="/home/<user>/.conda/envs/mamba_env/envs/geo_env"
if [ -d "$GEO_ENV_ROOT/share/proj" ]; then
  export PROJ_LIB="$GEO_ENV_ROOT/share/proj"
fi

if [ $# -ne 2 ]; then
  echo "Usage: $0 MODEL SCENARIO"
  exit 1
fi

MODEL="$1"
SCENARIO="$2"
BASE="/home/<user>/DEN/denmark"

echo "=================================================="
echo "DENMARK WEATHER PIPELINE"
echo "MODEL: $MODEL"
echo "SCENARIO: $SCENARIO"
echo "START: $(date)"
echo "=================================================="

read START_YEAR END_YEAR <<< "$($GEO_PY - <<PY
from pathlib import Path
import xarray as xr
base = Path("/home/<user>/DEN")
model = "$MODEL"
scenario = "$SCENARIO"
p = base / model / "processed" / f"{model}_{scenario}_pr_denmark.nc"
ds = xr.open_dataset(p)
t = ds["time"].values
print(str(t[0])[:4], str(t[-1])[:4])
ds.close()
PY
)"

$GEO_PY "$BASE/scripts/06_compute_tav_amp.py" \
  --model "$MODEL" \
  --scenario "$SCENARIO" \
  --start-year "$START_YEAR" \
  --end-year "$END_YEAR"

for VAR in pr tasmax tasmin rsds; do
  echo "Extracting parquet for $MODEL / $SCENARIO / $VAR"
  $GEO_PY "$BASE/scripts/07_extract_zonal_stats_parquet.py" \
    --model "$MODEL" \
    --scenario "$SCENARIO" \
    --variable "$VAR" \
    --start-year "$START_YEAR" \
    --end-year "$END_YEAR"
done

PERIOD="${START_YEAR}_${END_YEAR}"

echo "Generating WTH files for $MODEL / $SCENARIO / $PERIOD"
$GEO_PY "$BASE/scripts/08_generate_wth_files.py" \
  --model "$MODEL" \
  --scenario "$SCENARIO" \
  --period "$PERIOD"

echo "DONE: $MODEL / $SCENARIO at $(date)"

What this script does

This is the main per-job orchestration script.

When you call:

./10_run_model_scenario_weather.sh GFDL_ESM4 historical

it runs Scripts 06, 07 (×4 variables), and 08 in sequence for that one model-scenario combination.

Line-by-line breakdown

GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"
GEO_ENV_ROOT="/home/<user>/.conda/envs/mamba_env/envs/geo_env"
if [ -d "$GEO_ENV_ROOT/share/proj" ]; then
  export PROJ_LIB="$GEO_ENV_ROOT/share/proj"
fi

PROJ_LIB tells GDAL and rasterio where to find the PROJ projection database. When this is not set correctly, geopandas operations involving coordinate reference systems fail with cryptic messages about missing proj.db.

On the WUR HPC, this path inside the conda environment contains the correct PROJ database. Setting it here prevents the problem.

if [ $# -ne 2 ]; then
  echo "Usage: $0 MODEL SCENARIO"
  exit 1
fi

$# is the number of arguments passed to the script. If not exactly 2, a usage message is printed and the script exits. This catches typos before any long-running computation starts.

read START_YEAR END_YEAR <<< "$($GEO_PY - <<PY
from pathlib import Path
import xarray as xr
base = Path("/home/<user>/DEN")
model = "$MODEL"
scenario = "$SCENARIO"
p = base / model / "processed" / f"{model}_{scenario}_pr_denmark.nc"
ds = xr.open_dataset(p)
t = ds["time"].values
print(str(t[0])[:4], str(t[-1])[:4])
ds.close()
PY
)"

This is the most complex line in the script. It:

  1. Opens a Python heredoc that reads the precipitation NetCDF for this model-scenario
  2. Extracts the first and last timestamps and prints their 4-digit years
  3. Captures that printed output back into the shell
  4. Uses read START_YEAR END_YEAR to split the two space-separated years into two shell variables

The result: START_YEAR=1971, END_YEAR=2014 for the historical period — without hardcoding those years anywhere.

This matters because the historical and future periods differ:

  • historical: 1971–2014
  • ssp scenarios: 2031–2100

Using the NetCDF itself as the source of truth means the same script works for all scenarios without modification.

$GEO_PY "$BASE/scripts/06_compute_tav_amp.py" \
  --model "$MODEL" \
  --scenario "$SCENARIO" \
  --start-year "$START_YEAR" \
  --end-year "$END_YEAR"

Script 06 runs first because TAV and AMP are conceptually a summary of the full period, and computing them at this stage is a useful early sanity check before the heavier zonal extraction runs.

for VAR in pr tasmax tasmin rsds; do
  echo "Extracting parquet for $MODEL / $SCENARIO / $VAR"
  $GEO_PY "$BASE/scripts/07_extract_zonal_stats_parquet.py" \
    --model "$MODEL" \
    --scenario "$SCENARIO" \
    --variable "$VAR" \
    --start-year "$START_YEAR" \
    --end-year "$END_YEAR"
done

Script 07 runs four times — once per variable. The echo before each call means the Slurm log file shows exactly which variable is being processed and when.

PERIOD="${START_YEAR}_${END_YEAR}"

echo "Generating WTH files for $MODEL / $SCENARIO / $PERIOD"
$GEO_PY "$BASE/scripts/08_generate_wth_files.py" \
  --model "$MODEL" \
  --scenario "$SCENARIO" \
  --period "$PERIOD"

Script 08 runs last because it depends on the Parquet files from Script 07. PERIOD is assembled from the same variables used in the Parquet output paths, so the directory lookup in Script 08 always matches what Script 07 wrote.


Script 11 — Submit all jobs to Slurm

File: 11_submit_denmark_weather.sh

#!/bin/bash
set -euo pipefail

BASE="/home/<user>/DEN/denmark"
LOGDIR="$BASE/logs"
mkdir -p "$LOGDIR"

MODELS=("GFDL_ESM4" "IPSL-CM6A-LR" "MPI-ESM1-2-HR" "MRI-ESM2-0" "UKESM1-0-LL")
SCENARIOS=("historical" "ssp126" "ssp370" "ssp585")

MODELS_TO_RUN=("${MODELS[@]}")
SCENARIOS_TO_RUN=("${SCENARIOS[@]}")

if [ $# -ge 1 ] && [ -n "${1:-}" ]; then
  IFS=',' read -r -a MODELS_TO_RUN <<< "$1"
fi
if [ $# -ge 2 ] && [ -n "${2:-}" ]; then
  IFS=',' read -r -a SCENARIOS_TO_RUN <<< "$2"
fi

echo "Submitting DEM preparation job..."
DEM_MERGED="$BASE/deliverables/validation/denmark_dem_merged.tif"
DEPENDENCY_ARGS=()
if [ -f "$DEM_MERGED" ]; then
  echo "DEM already available: $DEM_MERGED"
else
  DEM_JOB=$(sbatch \
    --parsable \
    --job-name=den_dem \
    --partition=cpuq \
    --account=pr_climate \
    --time=01:00:00 \
    --cpus-per-task=2 \
    --mem=8G \
    --output="$LOGDIR/dem-%j.out" \
    --error="$LOGDIR/dem-%j.err" \
    --wrap="$BASE/scripts/04_prepare_dem.sh")
  echo "DEM job: $DEM_JOB"
  DEPENDENCY_ARGS=(--dependency="afterok:${DEM_JOB}")
fi

for MODEL in "${MODELS_TO_RUN[@]}"; do
  for SCENARIO in "${SCENARIOS_TO_RUN[@]}"; do
    JOBID=$(sbatch \
      --parsable \
      "${DEPENDENCY_ARGS[@]}" \
      --job-name="den_${MODEL}_${SCENARIO}" \
      --partition=cpuq \
      --account=pr_climate \
      --time=04:00:00 \
      --cpus-per-task=4 \
      --mem=24G \
      --output="$LOGDIR/${MODEL}_${SCENARIO}-%j.out" \
      --error="$LOGDIR/${MODEL}_${SCENARIO}-%j.err" \
      --wrap="$BASE/scripts/10_run_model_scenario_weather.sh $MODEL $SCENARIO")
    echo "$MODEL $SCENARIO $JOBID"
  done
done

What this script does

It submits all 20 model-scenario combinations to the HPC cluster as separate Slurm jobs, all running in parallel.

It also handles the DEM job as a mandatory prerequisite: if the merged DEM does not yet exist, it submits the DEM job first and makes all weather jobs depend on it.

Line-by-line breakdown

MODELS=("GFDL_ESM4" "IPSL-CM6A-LR" "MPI-ESM1-2-HR" "MRI-ESM2-0" "UKESM1-0-LL")
SCENARIOS=("historical" "ssp126" "ssp370" "ssp585")

MODELS_TO_RUN=("${MODELS[@]}")
SCENARIOS_TO_RUN=("${SCENARIOS[@]}")

Two arrays define all models and scenarios. They are immediately copied to MODELS_TO_RUN and SCENARIOS_TO_RUN. Those copies are what the loop actually iterates. This pattern allows the full list to serve as documentation while the run list can be overridden.

if [ $# -ge 1 ] && [ -n "${1:-}" ]; then
  IFS=',' read -r -a MODELS_TO_RUN <<< "$1"
fi
if [ $# -ge 2 ] && [ -n "${2:-}" ]; then
  IFS=',' read -r -a SCENARIOS_TO_RUN <<< "$2"
fi

Optional arguments let you run a subset. For example:

./11_submit_denmark_weather.sh "MPI-ESM1-2-HR" "ssp370,ssp585"

submits only the two broken-and-repaired combinations instead of all 20.

IFS=',' sets the field separator so read -a splits on commas rather than spaces. ${1:-} uses an empty default so the variable is considered set even if the argument is empty, avoiding the -u error.

DEM_MERGED="$BASE/deliverables/validation/denmark_dem_merged.tif"
DEPENDENCY_ARGS=()
if [ -f "$DEM_MERGED" ]; then
  echo "DEM already available: $DEM_MERGED"
else
  DEM_JOB=$(sbatch \
    --parsable \
    --job-name=den_dem \
    ...
    --wrap="$BASE/scripts/04_prepare_dem.sh")
  echo "DEM job: $DEM_JOB"
  DEPENDENCY_ARGS=(--dependency="afterok:${DEM_JOB}")
fi

The DEM check runs before the main loop. If the merged DEM already exists, DEPENDENCY_ARGS stays as an empty array and the weather jobs start immediately. If the DEM does not exist, sbatch --parsable returns only the job ID number, which is captured in DEM_JOB. The dependency array then holds --dependency=afterok:12345, meaning every weather job will only start after job 12345 finishes successfully.

DEM_JOB=$(sbatch \
  --parsable \
  --job-name=den_dem \
  --partition=cpuq \
  --account=pr_climate \
  --time=01:00:00 \
  --cpus-per-task=2 \
  --mem=8G \
  --output="$LOGDIR/dem-%j.out" \
  --error="$LOGDIR/dem-%j.err" \
  --wrap="$BASE/scripts/04_prepare_dem.sh")

The DEM job is allocated 2 CPUs, 8 GB RAM, and 1 hour. The %j in the log filenames is replaced by the Slurm job ID, so each run gets its own log file.

JOBID=$(sbatch \
  --parsable \
  "${DEPENDENCY_ARGS[@]}" \
  --job-name="den_${MODEL}_${SCENARIO}" \
  --partition=cpuq \
  --account=pr_climate \
  --time=04:00:00 \
  --cpus-per-task=4 \
  --mem=24G \
  --output="$LOGDIR/${MODEL}_${SCENARIO}-%j.out" \
  --error="$LOGDIR/${MODEL}_${SCENARIO}-%j.err" \
  --wrap="$BASE/scripts/10_run_model_scenario_weather.sh $MODEL $SCENARIO")
echo "$MODEL $SCENARIO $JOBID"

Each weather job gets 4 CPUs, 24 GB RAM, and 4 hours. The memory allocation covers loading climate NetCDFs (which can be several GB) plus the geopandas and xarray operations.

"${DEPENDENCY_ARGS[@]}" expands as either nothing (if DEM was already present) or --dependency=afterok:JOBID (if it was just submitted). This works because Bash expands an empty array to nothing rather than to an empty string.


The output files in detail

A WTH file header

*WEATHER DATA : 0001 1971
@ INSI      LAT     LONG  ELEV   TAV   AMP REFHT WNDHT
     1    54.95     8.83     5   8.7  15.7   -99   -99
FieldValueMeaning
INSI1Station identifier (zone_id)
LAT54.95Latitude in decimal degrees
LONG8.83Longitude in decimal degrees
ELEV5Elevation in metres above sea level
TAV8.7Mean annual temperature (°C)
AMP15.7Mean annual amplitude (°C)
REFHT-99Humidity measurement height (not provided)
WNDHT-99Wind measurement height (not provided)

A WTH daily row

71001   1.2   2.6  -4.7   0.0
Columns 1–5Col 6–11Col 12–17Col 18–23Col 24–29
71001 (YYDDD)1.2 SRAD2.6 TMAX-4.7 TMIN0.0 RAIN

The date 71001 means year 1971, day 1 (1 January).

A tav_amp CSV

model,scenario,tav_degC,amp_degC
GFDL_ESM4,historical,7.951,16.528

Problems we hit and what fixed them

1. Wrong Python environment on HPC

Symptom: ModuleNotFoundError: No module named 'geopandas'

Cause: The job was using the system Python rather than the geo_env conda environment.

Fix: Specify the full path to the Python binary in every script:

GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"

Never use python3 or python on HPC without knowing which environment they resolve to.

2. PROJ database not found

Symptom: ProjError: proj.db not found in proj search path

Cause: PROJ_LIB was not set, so geopandas could not find the coordinate reference system database.

Fix:

GEO_ENV_ROOT="/home/<user>/.conda/envs/mamba_env/envs/geo_env"
if [ -d "$GEO_ENV_ROOT/share/proj" ]; then
  export PROJ_LIB="$GEO_ENV_ROOT/share/proj"
fi

3. Corrupted processed NetCDF files

Symptom: MPI-ESM1-2-HR ssp370 WTH files were missing.

Cause: Two processed NetCDF files had been damaged:

MPI-ESM1-2-HR_ssp370_tasmin_denmark.nc
MPI-ESM1-2-HR_ssp370_rsds_denmark.nc

Fix: Rebuild them from raw source files using the same merge-clip-convert logic as Script 05. Once rebuilt, run Script 10 for that model-scenario only:

./10_run_model_scenario_weather.sh MPI-ESM1-2-HR ssp370

4. Zero-elevation coastal cells

Symptom: Many coastal Denmark WTH files had ELEV = 0.

Cause: The grid-cell centroid fell in the sea. The DEM returned its nodata value there. The script was treating that as zero instead of triggering a fallback.

Fix: The four-level elevation fallback in load_zone_metadata inside Script 08 (described above).

5. Wrong line endings on Windows

Symptom: DSSAT could not parse the WTH file or returned strange phenology results.

Cause: The WTH file was written with Windows \r\n line endings because it was generated on a Windows machine.

Fix: Always open WTH files with newline="\n" in Python:

with open(output_path, "w", encoding="ascii", newline="\n") as h:

Final validated deliverable counts

DeliverableCountFormula
Processed NetCDF files1005 models × 4 scenarios × 5 variables
Parquet zonal files805 models × 4 scenarios × 4 WTH variables
TAV/AMP CSV files205 models × 4 scenarios
WTH files11605 models × 4 scenarios × 58 grid cells

All 100 processed NetCDF files passed unit and range checks in processed_units_validation.csv:

  • pr mean: ~2.3–2.6 mm/day (plausible for Denmark)
  • tas mean: ~8–11 °C (plausible for Denmark)
  • rsds mean: ~10–11 MJ/m²/day (plausible for Denmark)

How to re-run from scratch

If you need to rebuild everything, the sequence is:

# Step 1: download (if data not already present)
./01_download_isimip.sh

# Step 2: merge, clip, convert (if processed NetCDFs not present)
# (runs the underlying merge_denmark.sh for each model/scenario/variable)
./05_merge_clip_convert_climate.sh

# Step 3: submit all weather jobs to Slurm
./11_submit_denmark_weather.sh

# Step 4: check counts once jobs complete
python 09_validate_outputs.py

To re-run only one broken model-scenario without submitting everything:

./10_run_model_scenario_weather.sh MPI-ESM1-2-HR ssp370

Script 14 — Fetch soil profiles and build DN.SOL

File: 14_fetch_soil_and_build_sol.py

This script is the soil counterpart to the weather pipeline.

It queries the MSU Decision Support and Informatics Soil Query API for each of the 60 Denmark grid centroids and assembles a complete DSSAT soil library file DN.SOL.

The API

The API is hosted by Michigan State University's Decision Support and Informatics group:

https://dsiweb.cse.msu.edu/soil-query-api/soil

It holds 1,984,797 global soil profiles sourced from ISRIC SoilGrids combined with the HC27 pedotransfer functions. A GET request with lat, lon, and format=json returns a single nearest-match profile.

GET /soil-query-api/soil?lat=54.751&lon=8.750&format=json

Example response (truncated):

{
  "profile": {
    "id": "DE01825304",
    "location": { "lat": 54.792, "lon": 8.708, "country_code": "DE" },
    "site": {
      "country_code_alpha3": "DEU",
      "scs_family": "HC_GEN0011",
      "texture": "Loam",
      "max_depth_cm": 200
    },
    "properties": {
      "scom": "BK", "salb": 0.1, "slu1": 6.0, "sldr": 0.5,
      "slro": 75.0, "slnf": 1.0, "slpf": 1.0,
      "smhb": "SA001", "smpx": "SA001", "smke": "SA001"
    },
    "layers": [
      { "slb": 5,  "slmh": "A",  "slll": 0.115, "sdul": 0.245, "ssat": 0.395,
        "srgf": 1.0, "ssks": 1.05, "sbdm": 1.26, "sloc": 4.01,
        "slcl": 18.87, "slsi": 40.46, "slcf": null, "slni": 0.12,
        "slhw": 6.13, "slhb": null,  "scec": 18.4, "sadc": null },
      ...5 more layers to 200 cm
    ],
    "metadata": { "source": "ISRIC soilgrids + HC27", "distance_km": 5.39 }
  }
}

The API also supports format=sol, which returns the profile already formatted as a DSSAT SOL block. The script uses format=json instead so it can assign Denmark-specific profile IDs (DK0001 to DK0060) and add a proper file header.

About the API — source code and data origin

The full API source code is publicly available at:

https://github.com/eusojk/soil-query

It is a Rust application built with Axum and SQLite, deployed on Railway.

The underlying data is from the 2015 Harvard Dataverse publication:

Han, Ines, and Koo — Global High-Resolution Soil Profile Database for Crop Modeling Applications

The database holds approximately 2 million profiles from 225 countries at roughly 10 km resolution, combining ISRIC SoilGrids with HC27 pedotransfer functions. Spatial lookups use an R-tree index and Haversine distance. Typical query latency is under 50 milliseconds.

The three API endpoints are:

EndpointWhat it does
GET /healthReturns {"status":"ok","version":"0.1.0","profiles":1984797}
GET /soil?lat=&lon=&format=jsonReturns nearest profile as JSON
GET /soil?lat=&lon=&format=solReturns nearest profile as DSSAT SOL text
GET /definitionsReturns field abbreviation descriptions

The format=sol option is convenient for quick inspection of one profile but returns the API's own profile ID (e.g. DE01825304), not our Denmark grid ID. The script uses format=json so it can assign the correct DK0001DK0060 identifiers.

One critical header requirement

A plain Python requests.get without extra headers returns an HTML page instead of JSON. The CDN protecting the server blocks requests that do not look like browser traffic.

The fix is two lines:

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0 Safari/537.36"
    ),
    "Referer": "https://dsiweb.cse.msu.edu/",
    "Accept": "application/json",
}

Without User-Agent and Referer, the request gets a 406 or an HTML page. This is not documented — it was found by probing the server directly.

The full script

#!/usr/bin/env python3
"""
14_fetch_soil_and_build_sol.py
-------------------------------
Query the MSU DSI Soil Query API for each Denmark 0.5-degree grid centroid
and assemble a complete DSSAT-format DN.SOL file.

Usage (local Windows):
    python 14_fetch_soil_and_build_sol.py

Usage (HPC):
    /home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python \\
        /home/<user>/DEN/denmark/scripts/14_fetch_soil_and_build_sol.py

Inputs:
    denmark_grid_05deg_centroids.csv  (grid_id, lon, lat — 60 rows)

Outputs:
    deliverables/soil/DN.SOL               DSSAT soil library
    deliverables/soil/soil_fetch_report.csv  per-centroid QC report
"""

from __future__ import annotations

import csv
import sys
import time
from pathlib import Path

import requests

# ── Detect HPC vs local ────────────────────────────────────────────────────────
_HPC_CENTROIDS   = Path("/home/<user>/DEN/denmark_grid_05deg_centroids.csv")
_LOCAL_CENTROIDS = Path(
    r"C:\Users\chich\Downloads\denmark_wth_bundle\denmark_grid_05deg_centroids.csv"
)

if _HPC_CENTROIDS.exists():
    CENTROIDS_CSV       = _HPC_CENTROIDS
    _BASE_DELIVERABLES  = Path("/home/<user>/DEN/denmark/deliverables")
elif _LOCAL_CENTROIDS.exists():
    CENTROIDS_CSV       = _LOCAL_CENTROIDS
    _BASE_DELIVERABLES  = Path(__file__).resolve().parent.parent / "deliverables"
else:
    sys.exit(
        "Cannot find denmark_grid_05deg_centroids.csv.\n"
        f"Expected at:\n  {_HPC_CENTROIDS}\n  {_LOCAL_CENTROIDS}"
    )

OUT_SOL    = _BASE_DELIVERABLES / "soil" / "DN.SOL"
OUT_REPORT = _BASE_DELIVERABLES / "soil" / "soil_fetch_report.csv"

# ── API ────────────────────────────────────────────────────────────────────────
API_URL = "https://dsiweb.cse.msu.edu/soil-query-api/soil"

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0 Safari/537.36"
    ),
    "Referer": "https://dsiweb.cse.msu.edu/",
    "Accept": "application/json",
}

REQUEST_TIMEOUT = 30
SLEEP_BETWEEN   = 0.6
MAX_RETRIES     = 3
MAX_DISTANCE_KM = 50.0
MISSING         = -99.0


# ── DSSAT SOL formatting ───────────────────────────────────────────────────────

def _fmt(val, width: int, dec: int) -> str:
    """Right-align a numeric value; substitute -99.0 for None."""
    v = MISSING if val is None else float(val)
    return f"{v:{width}.{dec}f}"


def build_sol_block(grid_id: int, lon: float, lat: float, profile: dict) -> str:
    loc    = profile.get("location",   {})
    site   = profile.get("site",       {})
    props  = profile.get("properties", {})
    layers = profile.get("layers",     [])
    meta   = profile.get("metadata",   {})

    profile_id = f"DK{grid_id:04d}"
    country_a3 = site.get("country_code_alpha3", "DNK")
    texture    = site.get("texture",     "Unknown")
    max_depth  = site.get("max_depth_cm", -99)
    source     = meta.get("source",      "MSU DSI API")
    api_lat    = loc.get("lat",  lat)
    api_lon    = loc.get("lon",  lon)
    scs_family = site.get("scs_family", "-99")

    scom = props.get("scom", "-99")
    salb = props.get("salb", MISSING)
    slu1 = props.get("slu1", MISSING)
    sldr = props.get("sldr", MISSING)
    slro = props.get("slro", MISSING)
    slnf = props.get("slnf", MISSING)
    slpf = props.get("slpf", MISSING)
    smhb = props.get("smhb", "SA001")
    smpx = props.get("smpx", "SA001")
    smke = props.get("smke", "SA001")

    lines: list[str] = []

    # Profile header
    lines.append(
        f"*{profile_id:<10}  {country_a3:<16} {texture:<8} {int(max_depth):>4}"
        f"    {source}"
    )

    # Site
    lines.append("@SITE        COUNTRY          LAT     LONG SCS Family")
    lines.append(
        f" {'-99':<12} {country_a3:<16}"
        f" {api_lat:>7.3f} {api_lon:>7.3f} {scs_family}"
    )

    # Properties
    lines.append(
        "@ SCOM  SALB  SLU1  SLDR  SLRO  SLNF  SLPF  SMHB  SMPX  SMKE"
    )
    lines.append(
        f"  {scom:>4}"
        f"{float(salb):>6.2f}"
        f"{float(slu1):>6.2f}"
        f"{float(sldr):>6.2f}"
        f"{float(slro):>6.2f}"
        f"{float(slnf):>6.2f}"
        f"{float(slpf):>6.2f}"
        f" {smhb:>5}"
        f" {smpx:>5}"
        f" {smke:>5}"
    )

    # Layers
    lines.append(
        "@  SLB  SLMH  SLLL  SDUL  SSAT  SRGF  SSKS  SBDM"
        "  SLOC  SLCL  SLSI  SLCF  SLNI  SLHW  SLHB  SCEC  SADC"
    )
    for lyr in layers:
        lines.append(
            f"{int(lyr.get('slb', 0)):>5} {str(lyr.get('slmh', '-99')):<5}"
            f"{_fmt(lyr.get('slll'), 6, 3)}"
            f"{_fmt(lyr.get('sdul'), 6, 3)}"
            f"{_fmt(lyr.get('ssat'), 6, 3)}"
            f"{_fmt(lyr.get('srgf'), 6, 2)}"
            f"{_fmt(lyr.get('ssks'), 6, 2)}"
            f"{_fmt(lyr.get('sbdm'), 6, 2)}"
            f"{_fmt(lyr.get('sloc'), 6, 2)}"
            f"{_fmt(lyr.get('slcl'), 6, 2)}"
            f"{_fmt(lyr.get('slsi'), 6, 2)}"
            f"{_fmt(lyr.get('slcf'), 6, 1)}"
            f"{_fmt(lyr.get('slni'), 6, 2)}"
            f"{_fmt(lyr.get('slhw'), 6, 2)}"
            f"{_fmt(lyr.get('slhb'), 6, 1)}"
            f"{_fmt(lyr.get('scec'), 6, 2)}"
            f"{_fmt(lyr.get('sadc'), 6, 1)}"
        )

    return "\n".join(lines)


# ── API fetch with retry ───────────────────────────────────────────────────────

def fetch_profile(lat: float, lon: float) -> dict | None:
    params = {"lat": round(lat, 6), "lon": round(lon, 6), "format": "json"}

    for attempt in range(1, MAX_RETRIES + 1):
        try:
            resp = requests.get(
                API_URL, params=params, headers=HEADERS, timeout=REQUEST_TIMEOUT
            )
            if 400 <= resp.status_code < 500:
                print(f"HTTP {resp.status_code} (no retry)")
                return None
            resp.raise_for_status()
            return resp.json().get("profile")

        except requests.exceptions.Timeout:
            print(f"timeout (attempt {attempt}/{MAX_RETRIES})", end="  ")
        except requests.exceptions.JSONDecodeError:
            print("bad JSON — likely got HTML; check headers")
            return None
        except requests.exceptions.RequestException as exc:
            print(f"{exc} (attempt {attempt}/{MAX_RETRIES})", end="  ")

        if attempt < MAX_RETRIES:
            time.sleep(2.0 * attempt)

    return None


# ── Main ───────────────────────────────────────────────────────────────────────

def main() -> None:
    # 1. Read centroids
    centroids: list[dict] = []
    with open(CENTROIDS_CSV, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            centroids.append(
                {"grid_id": int(row["grid_id"]),
                 "lon": float(row["lon"]),
                 "lat": float(row["lat"])}
            )

    print(f"Loaded {len(centroids)} centroids from {CENTROIDS_CSV.name}")
    print(f"Output SOL   : {OUT_SOL}")
    print(f"Output report: {OUT_REPORT}\n")

    # 2. Prepare output directory
    OUT_SOL.parent.mkdir(parents=True, exist_ok=True)

    # 3. Fetch profiles
    sol_blocks:  list[str]  = []
    report_rows: list[dict] = []

    for i, c in enumerate(centroids, 1):
        gid, lat, lon = c["grid_id"], c["lat"], c["lon"]

        print(
            f"[{i:>3}/{len(centroids)}]  grid={gid:>3}  "
            f"lat={lat:.5f}  lon={lon:.5f}  ",
            end="", flush=True,
        )

        profile = fetch_profile(lat, lon)

        if profile is None:
            print("FAILED — skipped")
            report_rows.append(dict(
                grid_id=gid, lat=lat, lon=lon,
                api_lat=None, api_lon=None,
                distance_km=None, n_layers=0,
                texture=None, max_depth_cm=None, status="API_FAILED",
            ))
            continue

        dist      = profile.get("metadata", {}).get("distance_km", 0.0)
        n_layers  = len(profile.get("layers", []))
        texture   = profile.get("site",     {}).get("texture",     "Unknown")
        max_depth = profile.get("site",     {}).get("max_depth_cm", None)
        api_lat   = profile.get("location", {}).get("lat", lat)
        api_lon   = profile.get("location", {}).get("lon", lon)
        status    = "OK" if dist <= MAX_DISTANCE_KM else "FAR"

        print(
            f"dist={dist:>6.1f} km  layers={n_layers}  "
            f"texture={texture:<10}  [{status}]"
        )

        sol_blocks.append(build_sol_block(gid, lon, lat, profile))
        report_rows.append(dict(
            grid_id=gid, lat=lat, lon=lon,
            api_lat=round(api_lat, 4), api_lon=round(api_lon, 4),
            distance_km=round(dist, 3), n_layers=n_layers,
            texture=texture, max_depth_cm=max_depth, status=status,
        ))

        time.sleep(SLEEP_BETWEEN)

    # 4. Write DN.SOL
    with open(OUT_SOL, "w", encoding="ascii", newline="\n") as fh:
        fh.write(
            "*SOILS: Denmark ISIMIP 0.5-deg Grid  "
            "[ISRIC SoilGrids + HC27 via MSU DSI Soil Query API]\n\n"
        )
        for block in sol_blocks:
            fh.write(block)
            fh.write("\n\n")

    # 5. Write QC report
    fields = [
        "grid_id", "lat", "lon", "api_lat", "api_lon",
        "distance_km", "n_layers", "texture", "max_depth_cm", "status",
    ]
    with open(OUT_REPORT, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(report_rows)

    # 6. Summary
    n_ok     = sum(1 for r in report_rows if r["status"] == "OK")
    n_far    = sum(1 for r in report_rows if r["status"] == "FAR")
    n_failed = sum(1 for r in report_rows if r["status"] == "API_FAILED")

    print(f"\n{'='*60}")
    print(f"Profiles fetched        : {n_ok + n_far} / {len(centroids)}")
    print(f"  within {MAX_DISTANCE_KM:.0f} km (OK)   : {n_ok}")
    print(f"  beyond {MAX_DISTANCE_KM:.0f} km (FAR)  : {n_far}")
    print(f"Failed / skipped        : {n_failed}")
    print(f"SOL  file               : {OUT_SOL}")
    print(f"QC report               : {OUT_REPORT}")
    if n_far:
        print("\nFAR profiles (nearest match > 50 km):")
        for r in report_rows:
            if r["status"] == "FAR":
                print(
                    f"  grid_id={r['grid_id']:>3}  "
                    f"lat={r['lat']:.3f}  lon={r['lon']:.3f}  "
                    f"dist={r['distance_km']:.1f} km"
                )
    print("=" * 60)


if __name__ == "__main__":
    main()

Line-by-line breakdown

The path detection block

_HPC_CENTROIDS   = Path("/home/<user>/DEN/denmark_grid_05deg_centroids.csv")
_LOCAL_CENTROIDS = Path(
    r"C:\Users\chich\Downloads\denmark_wth_bundle\denmark_grid_05deg_centroids.csv"
)

if _HPC_CENTROIDS.exists():
    CENTROIDS_CSV      = _HPC_CENTROIDS
    _BASE_DELIVERABLES = Path("/home/<user>/DEN/denmark/deliverables")
elif _LOCAL_CENTROIDS.exists():
    CENTROIDS_CSV      = _LOCAL_CENTROIDS
    _BASE_DELIVERABLES = Path(__file__).resolve().parent.parent / "deliverables"
else:
    sys.exit("Cannot find denmark_grid_05deg_centroids.csv. ...")

The same script runs on both the HPC and a Windows laptop without editing. It checks the HPC path first. If that does not exist it tries the local path. If neither exists it exits immediately with a clear message rather than crashing deep inside the fetch loop.

Path(__file__).resolve().parent.parent navigates from the script file (scripts/) up two levels to the project root (denmark/) and then into deliverables/ — so the output paths stay consistent regardless of where you launch the script from.

The _fmt helper

def _fmt(val, width: int, dec: int) -> str:
    v = MISSING if val is None else float(val)
    return f"{v:{width}.{dec}f}"

Every soil field in the API response can be null (when a property was not measured or estimated). DSSAT uses -99.0 as its missing-value code. This single helper applies that rule uniformly and right-aligns the value in a fixed-width field, keeping the SOL file columns aligned.

Without this, you would need an if val is None branch for every single field in every layer. With it, the layer formatter is a single readable block.

The build_sol_block function

This function converts one API response dictionary into a DSSAT SOL profile block. It has four sections matching the DSSAT format exactly.

Profile header line:

lines.append(
    f"*{profile_id:<10}  {country_a3:<16} {texture:<8} {int(max_depth):>4}"
    f"    {source}"
)

The * at the start marks a profile header to DSSAT. profile_id is DK0001 to DK0060 — the DK prefix identifies Denmark and the 4-digit number matches the grid_id in the centroids CSV and in the WTH filenames. This means every soil profile can be linked back to its WTH file by number.

Properties line:

lines.append(
    f"  {scom:>4}"
    f"{float(salb):>6.2f}"
    f"{float(slu1):>6.2f}"
    ...
)

SALB is albedo (dimensionless), SLU1 is the upper limit of soil water in the first layer (mm), SLDR is the drainage rate (fraction per day), SLRO is the runoff curve number. These site-level properties come directly from the API and do not need derivation.

Layer lines:

lines.append(
    f"{int(lyr.get('slb', 0)):>5} {str(lyr.get('slmh', '-99')):<5}"
    f"{_fmt(lyr.get('slll'), 6, 3)}"
    f"{_fmt(lyr.get('sdul'), 6, 3)}"
    ...
)

SLB is the bottom depth of the layer in cm. SLMH is the master horizon designation (A, AB, B, BC, C etc.). All remaining fields are hydraulic and chemical properties. The column widths — 5 for SLB, 6 for hydraulic values — match the DSSAT .SOL specification exactly. Even one character off causes DSSAT to misread the field.

The fetch_profile function

if 400 <= resp.status_code < 500:
    print(f"HTTP {resp.status_code} (no retry)")
    return None
resp.raise_for_status()
return resp.json().get("profile")

4xx responses are not retried. A 406 means the server rejected the request format. A 404 means no profile exists near that coordinate (open ocean, for example). Neither is worth retrying.

5xx responses and timeouts raise exceptions, which are caught and retried up to MAX_RETRIES times with increasing wait between attempts.

resp.json().get("profile") extracts the nested profile object. If the response structure ever changes and "profile" is missing, this returns None cleanly instead of raising a KeyError.

The main loop

for i, c in enumerate(centroids, 1):
    ...
    print(
        f"[{i:>3}/{len(centroids)}]  grid={gid:>3}  "
        f"lat={lat:.5f}  lon={lon:.5f}  ",
        end="", flush=True,
    )
    profile = fetch_profile(lat, lon)
    ...
    time.sleep(SLEEP_BETWEEN)

end="" keeps the cursor on the same line so the result (dist=... [OK]) prints alongside the coordinate it refers to, giving a compact one-line-per- centroid log.

flush=True forces the output buffer to write immediately. Without it, when running in a Slurm job the log file shows nothing until the script finishes.

time.sleep(SLEEP_BETWEEN) waits 0.6 seconds between every request. The API serves 1.9 million profiles to many users. Being polite costs you 36 seconds on 60 centroids and keeps the service available for everyone.

Writing the SOL file

with open(OUT_SOL, "w", encoding="ascii", newline="\n") as fh:
    fh.write(
        "*SOILS: Denmark ISIMIP 0.5-deg Grid  "
        "[ISRIC SoilGrids + HC27 via MSU DSI Soil Query API]\n\n"
    )
    for block in sol_blocks:
        fh.write(block)
        fh.write("\n\n")

encoding="ascii" and newline="\n" are the same two rules as Script 08. DSSAT reads SOL files as plain ASCII with Unix line endings. Writing on Windows with the default settings would produce \r\n endings that confuse DSSAT's fixed-width parser.

The *SOILS: line is the file-level header. DSSAT skips it but it tells a human reader what is in the file, where the data came from, and what pedotransfer approach was used.

Two blank lines ("\n\n") after each block separate profiles clearly.

SOL format strings verified against the Rust source

Reading the upstream Rust source at crates/soil-query/src/parser.rs reveals the exact format strings used by to_sol_format(). The Python script translates these directly.

Profile header line:

#![allow(unused)]
fn main() {
// Rust
format!("*{:<14} {:<10} {:>10} {:>5}    {}\n", id, country_a3, texture, depth, source)
}
# Python equivalent
f"*{profile_id:<14} {country_a3:<10} {texture:>10} {int(max_depth):>5}    {source}"

Site data line:

#![allow(unused)]
fn main() {
// Rust  — note: uses 2-letter location.country_code, NOT 3-letter country_code_alpha3
format!(" -99              {:>2} {:>11.3} {:>9.3}     {}\n",
        location.country_code, lat, lon, scs_family)
}
# Python equivalent  — country_2 = loc.get("country_code"), NOT country_a3
f" -99              {country_2:>2} {api_lat:>11.3f} {api_lon:>9.3f}     {scs_family}"

Properties line:

#![allow(unused)]
fn main() {
// Rust
format!("    {} {:5.2} {:5.2} {:5.2} {:5.2} {:5.2} {:5.2} {} {} {}\n",
        scom, salb, slu1, sldr, slro, slnf, slpf, smhb, smpx, smke)
}

Layer line — two different formatters:

#![allow(unused)]
fn main() {
// Rust — fmt_layer_value (3 decimals) only for SLLL, SDUL, SSAT
let fmt_layer_value = |opt: Option<f64>| match opt {
    Some(val) => format!("{:5.3}", val),
    None      => "-99.0".to_string(),    // ← 5-char literal, NOT -99.00
};

// Rust — fmt_optional (2 decimals) for all other layer fields
let fmt_optional = |opt: Option<f64>| match opt {
    Some(val) => format!("{:5.2}", val),
    None      => "-99.0".to_string(),    // ← 5-char literal, NOT -99.00
};
}
# Python equivalent
MISSING_STR = "-99.0"   # must be 5 chars — "-99.00" breaks DSSAT

def _fmt_water(val) -> str:   # SLLL, SDUL, SSAT only
    return MISSING_STR if val is None else f"{float(val):5.3f}"

def _fmt(val) -> str:         # everything else
    return MISSING_STR if val is None else f"{float(val):5.2f}"

The critical insight is that None maps to the string "-99.0", not to f"{-99.0:5.2f}" which would produce "-99.00" — six characters. That extra character shifts every column to the right and causes DSSAT to misread every value in the layer.

Which fields get 3 decimals vs 2:

FieldDecimalsReason
SLLL, SDUL, SSAT3Volumetric water content needs 3 sig figs (e.g. 0.115)
SRGF, SSKS, SBDM, SLOC, SLCL, SLSI, SLCF, SLNI, SLHW, SLHB, SCEC, SADC22 sig figs sufficient

Country code confusion — why there are two:

country_a3 = site.get("country_code_alpha3", "DNK")  # 3-letter → profile header
country_2  = loc.get("country_code",          "DK")   # 2-letter → site data line

The Rust SiteProperties.country_code_alpha3 is 3-letter (from the SOL header line). The Location.country_code is 2-letter ISO (from the database record). The Rust to_sol_format() uses:

  • self.site.country_code_alpha3 on the profile header line
  • self.location.country_code on the site data line

Denmark centroids often resolve to profiles stored as DE (Germany) because the database does not store many profiles inside Denmark's borders and the nearest land profile may be across the German border. That is expected — the 2-letter code reflects where the profile record came from, not where we queried.

What the output looks like

DN.SOL (first two profiles shown):

*SOILS: Denmark ISIMIP 0.5-deg Grid  [ISRIC SoilGrids + HC27 via MSU DSI Soil Query API]

*DK0001       DNK              Loam      200    ISRIC soilgrids + HC27
@SITE        COUNTRY          LAT     LONG SCS Family
 -99          DNK              54.792   8.708 HC_GEN0011
@ SCOM  SALB  SLU1  SLDR  SLRO  SLNF  SLPF  SMHB  SMPX  SMKE
    BK  0.10  6.00  0.50 75.00  1.00  1.00 SA001 SA001 SA001
@  SLB  SLMH  SLLL  SDUL  SSAT  SRGF  SSKS  SBDM  SLOC  SLCL  SLSI  SLCF  SLNI  SLHW  SLHB  SCEC  SADC
    5 A     0.115 0.245 0.395  1.00  1.05  1.26  4.01 18.87 40.46 -99.0  0.12  6.13 -99.0 18.40 -99.0
   15 A     0.126 0.256 0.399  0.85  0.89  1.28  3.40 20.69 39.68 -99.0  0.09  6.20 -99.0 16.10 -99.0
   30 AB    0.140 0.271 0.403  0.70  0.72  1.31  2.59 23.02 38.58 -99.0  0.07  6.29 -99.0 15.60 -99.0
   60 BA    0.153 0.284 0.408  0.50  0.59  1.36  1.66 25.26 37.26 -99.0  0.06  6.41 -99.0 16.30 -99.0
  100 B     0.153 0.282 0.407  0.38  0.60  1.42  0.96 25.27 36.68 -99.0  0.05  6.55 -99.0 16.50 -99.0
  200 BC    0.142 0.270 0.402  0.05  0.72  1.47  0.55 23.51 36.27 -99.0  0.05  6.73 -99.0 16.40 -99.0

*DK0002       DNK              Sandy loam 200    ISRIC soilgrids + HC27
...

soil_fetch_report.csv (first rows):

grid_idlatlonapi_latapi_londistance_kmn_layerstexturemax_depth_cmstatus
154.750778.7554.7928.7085.3916Loam200OK
254.750779.2554.7929.2473.2636Sandy loam200OK

What the DSSAT soil fields mean

FieldFull nameUnit
SLLLLower limit (wilting point)cm³/cm³
SDULDrained upper limit (field capacity)cm³/cm³
SSATSaturation upper limitcm³/cm³
SRGFRoot growth factor0–1
SSKSSaturated hydraulic conductivitycm/h
SBDMBulk densityg/cm³
SLOCOrganic carbon%
SLCLClay content%
SLSISilt content%
SLCFCoarse fragments%
SLNITotal nitrogen%
SLHWpH in water
SLHBpH in buffer
SCECCation exchange capacitycmol/kg
SALBAlbedofraction
SLRORunoff curve number

The profile ID DK0001 in DN.SOL corresponds to grid_id = 1 in the centroids CSV. That same grid_id produces WTH filename prefix 0001:

00017144.WTH   ← zone 1, start year 71, 44 years

And in the WTH file header:

@ INSI ...
     1    54.95  ...

The INSI value 1 matches the zone. In the DSSAT experiment file, you link WSTA = 0001 (the weather station) and SOIL_ID = DK0001 (the soil profile) to run a simulation at Denmark grid cell 1.

Running the script

On the HPC:

GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"
$GEO_PY /home/<user>/DEN/denmark/scripts/14_fetch_soil_and_build_sol.py

Locally (Windows):

python 14_fetch_soil_and_build_sol.py

Expected console output:

Loaded 60 centroids from denmark_grid_05deg_centroids.csv
Output SOL   : .../deliverables/soil/DN.SOL
Output report: .../deliverables/soil/soil_fetch_report.csv

[  1/ 60]  grid=  1  lat=54.75077  lon=8.75000   dist=   5.4 km  layers=6  Loam        [OK]
[  2/ 60]  grid=  2  lat=54.75077  lon=9.25000   dist=   3.3 km  layers=6  Sandy loam  [OK]
...
[ 60/ 60]  grid= 60  lat=57.75086  lon=10.75000  dist=   8.1 km  layers=6  Loam        [OK]

============================================================
Profiles fetched        : 60 / 60
  within 50 km (OK)   : 58
  beyond 50 km (FAR)  : 2
Failed / skipped        : 0
SOL  file               : .../deliverables/soil/DN.SOL
QC report               : .../deliverables/soil/soil_fetch_report.csv
============================================================

What FAR means and what to do about it

A FAR flag means the nearest profile in the database was more than 50 km from the Denmark centroid. This typically happens for grid cells that are mostly open sea — the API found the nearest land-based soil record but that record belongs to a coastal point or an island some distance away.

For those cells, you have three choices:

  1. Accept the profile — the soil at 60 km distance may still be representative of the coastal fringe of Denmark.
  2. Use the nearest OK profile from an adjacent grid cell instead.
  3. Mark those cells as excluded and do not run DSSAT simulations for them.

The soil_fetch_report.csv lists all FAR cases with their actual distance, so you can make an informed decision for each one.


Why some centroids fall in water

If you plot the 60 Denmark grid cells on a map you will notice that several centroids land in open sea — most visibly the cells around Bornholm (grid IDs 10 and 11, lon ~14.75–15.25 E, lat ~54.75 N) and the Kattegat cells (IDs 22 and 23, lon ~14.75–15.25 E, lat ~55.25 N).

This is not a bug. It is a direct consequence of how the grid was built.

How cells with water centroids enter the grid

denmark_grid_05deg.shp is built by overlaying a regular 0.5-degree global grid on Denmark's administrative and territory boundary, then keeping every cell that has any geometric overlap with Danish territory — even a single pixel of coastline is enough to include the whole 0.5-degree polygon.

Bornholm island sits at roughly 14.9 E, 55.1 N, surrounded by open Baltic Sea. Its 0.5-degree grid polygon is approximately 55 km × 35 km. The centroid of that large polygon falls in the sea west of the island. But because the polygon does touch Bornholm's coastline, the cell is included and gets a grid_id. The same logic applies to any other cell that clips a sliver of Danish coastal territory or outlying island.

Grid polygon (0.5° × 0.5°)
┌─────────────────────────────────┐
│                                 │
│        Open Baltic Sea          │
│                                 │
│    ×  ← centroid in water       │
│                                 │
│                    ███ Bornholm │ ← overlap keeps this cell
└─────────────────────────────────┘

The centroid is just the geometric centre of the polygon. It does not know whether it is over land or sea.

How the pipeline handled water-centroid cells at each stage

Stage 1 — Climate (WTH files)

ISIMIP3b climate grids cover the entire globe, land and ocean alike. Every 0.5-degree cell in the world has values for temperature, precipitation, radiation, and wind. When 07_extract_zonal_stats_parquet.py does its point-in-polygon extraction, it finds NetCDF grid points that fall inside the polygon — and for a mostly-sea polygon it averages ocean climate signal. A complete parquet file is produced and 08_generate_wth_files.py writes a valid .WTH file. DSSAT reads it without complaint.

The ocean-influenced climate signal for Bornholm cells is actually reasonable: Bornholm's climate is genuinely maritime — warmer winters, cooler summers, higher humidity than inland Denmark. Using the sea-cell climate to represent the island is defensible.

Stage 2 — Soil (SOL file)

The MSU Soil API performs a nearest-neighbour spatial lookup. When the query point is in the sea, the API walks outward through its R-tree index until it finds a real soil profile on land. For cells 10 and 11 it found profiles 24–30 km away — still within the 50 km acceptance threshold — and returned them as OK. The assigned profiles come from a land pixel near the coast or on Bornholm itself, so they are geologically reasonable.

The soil_fetch_report.csv records the actual distance_km between your query point and the profile that was returned:

grid_id,lat,lon,api_lat,api_lon,distance_km,...,status
10,54.75077,14.74999,54.958,15.042,29.67,...,OK
11,54.75077,15.24999,54.958,15.125,24.39,...,OK

A distance of 25–30 km means the profile came from a land location about that far from your water centroid — entirely plausible for a coastal island.

What this means for your DSSAT runs

Having a cell in the dataset does not mean you must simulate crops there. The table below gives a practical decision framework:

Cell typeVerdictRationale
Centroid in water, polygon covers a real island (e.g. Bornholm)KeepThe polygon contains farmland; climate and soil are representative
Centroid in water, polygon covers a narrow coastal stripReviewClimate is ocean-dominated; check whether the strip has agricultural land
Centroid is mostly open sea, minimal landExcludeRunning DSSAT there produces output but it has no agronomic meaning

A clean way to filter after the fact: cross the 60 cells with an agricultural land-cover raster (e.g. ESA CCI land cover or MODIS MCD12Q1) and compute the cropland fraction per cell. Cells where cropland fraction is below a threshold (for example 5 %) can be dropped from the ensemble analysis without affecting the representativeness of the results for actual Danish farmland.

The pipeline deliberately includes all overlapping cells and leaves this scientific decision to the analyst. The numbered scripts produce outputs; the analyst decides which outputs are meaningful.


Beginner takeaway

The full pipeline is more than a collection of scripts.

It is a chain of decisions:

  • which variables to download
  • which units to convert to
  • how to handle spatial averaging
  • how to handle missing elevation
  • how to format the output so DSSAT can read it

Each script in the chain has one clear job.

When something breaks, the numbering tells you where to look. The logs from each Slurm job tell you what went wrong. The validate script tells you whether the outputs are complete.

That combination of numbered scripts, explicit logging, and a final count check is the discipline that makes a pipeline reproducible rather than just runnable once.

Denmark ISIMIP Technical Record

This chapter is the detailed project log for the Denmark ISIMIP work.

It is meant to answer the practical questions a future researcher will ask:

  • where did the climate data come from
  • what was already present
  • which scripts were used
  • what each script produced
  • what broke and how it was repaired
  • how the final weather and soil products were validated
  • how the results were proven in a local DSSAT run

This is not just a conceptual workflow. It is the concrete record of the work that was actually carried out across the HPC project tree and the local DSSAT installation.

Project locations

Remote HPC project

  • /home/<user>/DEN
  • /home/<user>/DEN/denmark
  • /home/<user>/DEN/soil_data

Ethiopia soil reference on HPC

  • /home/<user>/Ethiopia
  • /home/<user>/Ethiopia/soil_data

Local DSSAT environment

  • C:\DSSAT48
  • C:\DSSAT48\Peanut
  • C:\DSSAT48\Weather
  • C:\DSSAT48\Soil

Local documentation workspace

  • C:\Users\chich\Downloads\cropmodelling

The main objective

The practical Denmark objective was to take ISIMIP climate inputs and build a reproducible chain ending in DSSAT-ready weather files, while also testing a first Denmark soil linkage strongly enough to prove the files could run in a real DSSAT example.

By the end of the work, the project had:

  • processed Denmark climate NetCDF files in crop-model units
  • parquet zonal outputs
  • TAV and AMP outputs
  • DSSAT .WTH files
  • repaired DEM-based elevation metadata
  • a Denmark soil core sampled from ISIMIP soil input data
  • a first Denmark .SOL file for local smoke testing
  • a local DSSAT experiment proving Denmark weather and soil could execute

Climate data source

The Denmark climate workflow was based on ISIMIP data already staged under the model folders in /home/<user>/DEN.

The project worked with five climate models:

  • GFDL_ESM4
  • IPSL-CM6A-LR
  • MPI-ESM1-2-HR
  • MRI-ESM2-0
  • UKESM1-0-LL

The scenario set used in the Denmark pipeline was:

  • historical
  • ssp126
  • ssp370
  • ssp585

The climate grid logic was based on the ISIMIP global 0.5 degree grid, which is why the Denmark project also used a 0.5 degree Denmark grid.

Soil data source

The soil source used for the Denmark test build came from ISIMIP3a soil input data.

Two upstream soil products were considered:

  • ISIMIP3a/InputData/geo_conditions/soil/hwsd_soil_data_all_land.nc
  • ISIMIP3a/InputData/geo_conditions/soil/hwsd_soil_data_on_cropland.nc

For crop modeling, the preferred source became the cropland-focused product:

  • ISIMIP3a/InputData/geo_conditions/soil/hwsd_soil_data_on_cropland.nc

The reasoning was simple:

  • all_land represents dominant soil across land in a grid cell
  • on_cropland represents soil conditions more relevant to agricultural use

The Denmark soil core was therefore built from the cropland product and the all-land product was kept only as a comparison reference.

What already existed before restructuring

Before the wrapper was created, Denmark already had many key ingredients in place:

  • shapefiles under /home/<user>/DEN/shapefiles
  • DEM tiles under /home/<user>/DEN/COP30
  • processed climate files under each model folder
  • legacy zonal outputs under /home/<user>/DEN/zonal_stats

Important existing shapefiles included:

  • /home/<user>/DEN/shapefiles/denmark_grid_05deg.shp
  • /home/<user>/DEN/shapefiles/denmark_extent.shp

The large climate data were not moved during cleanup because copying them would have been slow, heavy, and unnecessary on HPC.

The wrapper created for the Denmark project

To make the work shareable and reproducible, a clean wrapper was created at:

  • /home/<user>/DEN/denmark

Inside that wrapper, the project was organized into:

  • /home/<user>/DEN/denmark/scripts
  • /home/<user>/DEN/denmark/docs
  • /home/<user>/DEN/denmark/config
  • /home/<user>/DEN/denmark/deliverables
  • /home/<user>/DEN/denmark/logs

This wrapper separated:

  • heavy source inputs already present in the parent project
  • workflow logic
  • validation artifacts
  • final deliverables for sharing

Denmark weather workflow scripts

The numbered weather workflow scripts in the Denmark wrapper were:

  • /home/<user>/DEN/denmark/scripts/01_download_isimip.sh
  • /home/<user>/DEN/denmark/scripts/02_prepare_shapefiles_and_grid.sh
  • /home/<user>/DEN/denmark/scripts/03_create_denmark_grid.py
  • /home/<user>/DEN/denmark/scripts/04_prepare_dem.sh
  • /home/<user>/DEN/denmark/scripts/05_merge_clip_convert_climate.sh
  • /home/<user>/DEN/denmark/scripts/06_compute_tav_amp.py
  • /home/<user>/DEN/denmark/scripts/07_extract_zonal_stats_parquet.py
  • /home/<user>/DEN/denmark/scripts/08_generate_wth_files.py
  • /home/<user>/DEN/denmark/scripts/09_validate_outputs.py
  • /home/<user>/DEN/denmark/scripts/10_run_model_scenario_weather.sh
  • /home/<user>/DEN/denmark/scripts/11_submit_denmark_weather.sh
  • /home/<user>/DEN/denmark/scripts/12_validate_weather_and_soil.py

What each Denmark weather script did

01_download_isimip.sh

This is the top-level climate acquisition stage.

In practice, the Denmark project was able to proceed without a fresh bulk download because the climate data were already present under the model folders inside /home/<user>/DEN.

02_prepare_shapefiles_and_grid.sh

This represents the shapefile staging step, linking the Denmark extent and the Denmark 0.5 degree climate grid.

03_create_denmark_grid.py

This is the explicit grid-construction step used to formalize the Denmark gridded geometry that later drives zonal extraction and WTH station creation.

04_prepare_dem.sh

This step prepared the DEM support raster used for WTH elevation values.

The key merged raster produced by this logic was:

  • /home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif

05_merge_clip_convert_climate.sh

This step transformed climate inputs into processed Denmark NetCDF files ready for crop-model use.

Its logic included:

  • merging yearly or chunked files
  • clipping to Denmark extent
  • converting variables into DSSAT-appropriate units

The target units were:

  • pr to mm/day
  • tas to degC
  • tasmax to degC
  • tasmin to degC
  • rsds to MJ m-2 day-1

06_compute_tav_amp.py

This script computed the two climatological values needed in DSSAT weather headers:

  • TAV
  • AMP

Outputs were written under:

  • /home/<user>/DEN/denmark/deliverables/tav_amp/{MODEL}/{SCENARIO}/{PERIOD}/tav_amp.csv

07_extract_zonal_stats_parquet.py

This step converted processed climate grids into zone-level parquet outputs following the same general logic used in the Ethiopia workflow.

Outputs were written under:

  • /home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/pr/pr.parquet
  • /home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/tasmax/tasmax.parquet
  • /home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/tasmin/tasmin.parquet
  • /home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/rsds/rsds.parquet

08_generate_wth_files.py

This script assembled final DSSAT weather files by combining:

  • rainfall from pr
  • maximum temperature from tasmax
  • minimum temperature from tasmin
  • solar radiation from rsds
  • TAV
  • AMP
  • DEM-based elevation

Outputs were written under:

  • /home/<user>/DEN/denmark/deliverables/wth/{MODEL}/{SCENARIO}/{PERIOD}

09_validate_outputs.py

This script was used for quick production counts of:

  • parquet files
  • WTH files
  • TAV and AMP files

10_run_model_scenario_weather.sh

This was the scenario-level orchestration script. It allowed a single model-scenario combination to be rebuilt cleanly.

11_submit_denmark_weather.sh

This was the cluster batch-submission wrapper used to launch the Denmark weather production across the full model-scenario matrix.

12_validate_weather_and_soil.py

This script validated two things together:

  • the final weather files
  • the Denmark soil core

It wrote:

  • /home/<user>/DEN/soil_data/reports/wth_validation.csv
  • /home/<user>/DEN/soil_data/reports/soil_core_validation.csv
  • /home/<user>/DEN/soil_data/reports/weather_soil_validation_summary.json

The main weather processing chain

The Denmark weather chain was:

  1. use the processed climate files already staged in /home/<user>/DEN/{MODEL}/processed
  2. confirm and repair unit conversions where needed
  3. prepare a merged Denmark DEM
  4. compute TAV and AMP
  5. generate parquet zonal outputs
  6. generate DSSAT .WTH files
  7. validate unit metadata, counts, and final file sanity

Major problems and repairs during the weather build

Batch execution and dependency issues

Early submission attempts left jobs pending behind dependencies or failing for environment reasons. The wrapper scripts were tightened so they launched with a more explicit environment and clearer control flow.

Corrupted processed climate files

One real data failure affected MPI-ESM1-2-HR / ssp370.

Two processed files were corrupted and had to be regenerated:

  • /home/<user>/DEN/MPI-ESM1-2-HR/processed/MPI-ESM1-2-HR_ssp370_tasmin_denmark.nc
  • /home/<user>/DEN/MPI-ESM1-2-HR/processed/MPI-ESM1-2-HR_ssp370_rsds_denmark.nc

After they were rebuilt from raw source, the missing Denmark WTH set completed successfully.

Mislabelled temperature metadata

Unit values across the processed climate outputs were checked after production.

The physical values were correct, but two files had temperature metadata still labelled as Kelvin even though the values were already Celsius. Those metadata labels were fixed so the final processed set was internally consistent.

Zero-elevation WTH headers

Many coastal or water-dominated Denmark grid cells initially ended up with ELEV = 0.

The elevation logic in /home/<user>/DEN/denmark/scripts/08_generate_wth_files.py was improved so the fallback order became:

  1. DEM sample at land-intersection centroid
  2. land-polygon mean
  3. land-polygon max
  4. nearest grid cell with valid land-based elevation

After this repair, the final Denmark WTH outputs no longer carried zero elevations in the production set.

Final Denmark weather deliverables

The weather production finished with:

  • 80 parquet files
  • 20 TAV/AMP files
  • 1160 WTH files

This corresponds to:

  • 5 models
  • 4 scenarios
  • 20 model-scenario combinations
  • 58 WTH files per combination

Important deliverables included:

  • /home/<user>/DEN/denmark/deliverables/wth
  • /home/<user>/DEN/denmark/deliverables/parquet_zonal_stats
  • /home/<user>/DEN/denmark/deliverables/tav_amp
  • /home/<user>/DEN/denmark/deliverables/validation/processed_units_validation.csv
  • /home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif
  • /home/<user>/DEN/denmark_wth_bundle.tar.gz

Processed climate validation

All processed Denmark climate files were checked across:

  • pr
  • tas
  • tasmax
  • tasmin
  • rsds

The final validation result was:

  • 100/100 processed NetCDF files passed

The expected final units were:

  • pr: mm/day
  • tas: degC
  • tasmax: degC
  • tasmin: degC
  • rsds: MJ m-2 day-1

Denmark soil build

The soil side did not begin as a full DSSAT .SOL project. It began with a grid-aligned soil core sampled from ISIMIP soil input data.

The local and remote staging paths used for the Denmark soil work included:

  • /home/<user>/DEN/soil_data/raw
  • /home/<user>/DEN/soil_data/processed
  • /home/<user>/DEN/soil_data/reports

First downloaded soil source

The first downloaded source was:

  • /home/<user>/DEN/soil_data/raw/hwsd_soil_data_all_land.nc

This was used to inspect the available fields and to test whether the Denmark grid aligned naturally with the ISIMIP 0.5 degree soil grid.

Preferred soil source for crop modeling

The final preferred source became:

  • /home/<user>/DEN/soil_data/raw/hwsd_soil_data_on_cropland.nc

because it better represents agricultural land within each cell.

Soil properties extracted from ISIMIP

The core ISIMIP soil variables used in the Denmark build were:

  • texture_class
  • mu_global
  • soil_ph
  • soil_caco3
  • bulk_density
  • cec_soil
  • oc
  • root_obstacles
  • impermeable_layer
  • awc
  • sand
  • silt
  • clay
  • gravel
  • ece
  • bs_soil
  • issoil

Denmark soil outputs produced on HPC

The two main processed Denmark soil-core tables created were:

  • /home/<user>/DEN/soil_data/processed/denmark_soil_core_all_land.csv
  • /home/<user>/DEN/soil_data/processed/denmark_soil_core_cropland.csv

The main reports created were:

  • /home/<user>/DEN/soil_data/reports/denmark_soil_core_summary.csv
  • /home/<user>/DEN/soil_data/reports/denmark_soil_core_cropland_summary.csv
  • /home/<user>/DEN/soil_data/reports/denmark_soil_all_land_vs_cropland.csv

The cropland-core build became the preferred source of truth.

Denmark soil-core validation status

The grid-aligned Denmark cropland core reached:

  • 60 rows
  • 55 direct land-centroid matches
  • 5 nearest-valid fallbacks for tiny coastal cells
  • 0 missing soil_ph
  • 0 missing bulk_density
  • 0 missing awc

This means the soil core was usable as an upstream Denmark soil backbone, even though it was not yet a full layered DSSAT soil library by itself.

Weather and soil joint validation

The combined Denmark weather and soil validation stage produced:

  • /home/<user>/DEN/soil_data/reports/wth_validation.csv
  • /home/<user>/DEN/soil_data/reports/soil_core_validation.csv
  • /home/<user>/DEN/soil_data/reports/weather_soil_validation_summary.json

The main outcome was:

  • 1160/1160 WTH files parsed successfully
  • 60/60 Denmark soil-core rows passed the sanity checks

That proved:

  • the weather files were structurally usable
  • the soil core was numerically sane and aligned to the Denmark grid

It did not yet prove a final Denmark .SOL because the local DSSAT smoke test had not been performed at that stage.

Local DSSAT smoke-test build

To prove that Denmark weather and soil could actually be executed by DSSAT, a local smoke test was built in the Windows DSSAT installation at C:\DSSAT48.

Baseline example used for local DSSAT

The first step was to run the unmodified peanut example:

  • C:\DSSAT48\Peanut\UFGA7901.PNX

This baseline run succeeded using:

  • C:\DSSAT48\DSCSM048.EXE MPN A UFGA7901.PNX

That confirmed the local DSSAT installation was healthy before the Denmark files were introduced.

Local Denmark staging files

The Denmark local smoke-test staging area was:

  • C:\Users\chich\Downloads\DEN\local_test\source
  • C:\Users\chich\Downloads\DEN\local_test\reports

The remote Denmark files copied locally for this setup included:

  • a validated Denmark WTH source file from HPC
  • the Denmark cropland soil core CSV

Local preparation script

The main local setup script written for this proof was:

  • C:\Users\chich\Downloads\DEN\denmark\scripts\13_prepare_local_dssat_smoke_test.py

This script:

  • created a local single-year DSSAT weather file from the Denmark WTH export
  • built a first Denmark soil library at C:\DSSAT48\Soil\DN.SOL
  • synchronized Denmark profiles into C:\DSSAT48\Soil\SOIL.SOL
  • created a copied DSSAT peanut experiment linked to Denmark weather and soil
  • wrote a local setup manifest

Local DSSAT smoke-test outputs

The key local outputs were:

  • C:\DSSAT48\Weather\DK017101.WTH
  • C:\DSSAT48\Soil\DN.SOL
  • C:\DSSAT48\Peanut\UFGA7103.PNX
  • C:\Users\chich\Downloads\DEN\local_test\reports\local_dssat_smoke_setup.json

Local DSSAT test logic

The test was intentionally conservative:

  • keep the stock peanut example structure
  • change only the weather and soil linkage to Denmark
  • shift the experiment dates into a Denmark-compatible weather year
  • refine the setup until DSSAT accepted the files and ran

What happened in the local Denmark run

The local Denmark-linked peanut example eventually ran successfully.

That proved:

  • the Denmark weather file was readable by DSSAT
  • the Denmark soil file was readable by DSSAT
  • DSSAT could execute an experiment using Denmark weather and the Denmark soil profile

What the local smoke test did not prove

The smoke test used peanut because the objective was technical proof of file execution, not agronomic realism.

The final run still showed freeze-related crop failure later in the season, which is expected for a peanut example under Denmark conditions.

So the smoke test proved technical integration, not that peanut is an appropriate Denmark production crop.

Why the Ethiopia project still matters

The Ethiopia soil project remains the best architectural reference for what the Denmark soil side should become.

The key confirmed Ethiopia files were:

  • /home/<user>/Ethiopia/soil_data/docs/soil_first_workflow.md
  • /home/<user>/Ethiopia/soil_data/docs/soil_hydraulics_runoff.md
  • /home/<user>/Ethiopia/soil_data/processed/soil_with_slro.csv
  • /home/<user>/Ethiopia/soil_data/processed/horizon.csv
  • /home/<user>/Ethiopia/soil_data/dssat/ET.SOL
  • /home/<user>/Ethiopia/soil_data/stics/sols.xml
  • /home/<user>/Ethiopia/soil_data/apsim/soils/*.json

The design lesson is that Denmark should eventually have:

  • one authoritative processed soil backbone
  • model-specific exports derived from that backbone
  • QC notes and fallback logic documented explicitly

The future Denmark soil structure should look like:

  • /home/<user>/DEN/soil_data/raw
  • /home/<user>/DEN/soil_data/processed
  • /home/<user>/DEN/soil_data/docs
  • /home/<user>/DEN/soil_data/reports
  • /home/<user>/DEN/soil_data/dssat
  • /home/<user>/DEN/soil_data/apsim
  • /home/<user>/DEN/soil_data/stics

At minimum, the authoritative processed tables should eventually include:

  • final_soil_profile_all_points.csv
  • hydraulic_parameters_from_profile.csv
  • soil_with_slro.csv
  • horizon.csv

And the model-specific outputs should include:

  • DN.SOL
  • APSIM soil exports
  • STICS soil exports

Final result summary

The Denmark climate and soil work reached a point where the pipeline is now fully traceable.

The project now has:

  • clearly identified upstream climate and soil sources
  • a named HPC wrapper structure
  • numbered workflow scripts
  • documented repairs
  • validated climate outputs
  • validated grid-aligned soil core outputs
  • a first Denmark DSSAT soil library
  • a successful local DSSAT smoke test linking Denmark weather and soil

The most important practical conclusion is this:

the Denmark inputs are no longer just files sitting in folders. They are now a documented, validated workflow with named scripts, named outputs, and a clear record from source data to runnable DSSAT proof.

DSSAT Soil Files Explained

Soil files are where many beginners start to feel overwhelmed.

That is understandable because soil inputs combine:

  • physical structure
  • hydrology
  • chemistry
  • initial conditions

The good news is that you do not need to understand all of pedology to work carefully with DSSAT soil files.

What a soil file is trying to represent

A DSSAT soil file tries to describe the rooting environment layer by layer.

That includes questions like:

  • How deep can roots grow?
  • How much water can the soil hold?
  • How easily is water extracted?
  • How dense is the soil?
  • How much organic matter or carbon is present?
  • What initial water and nitrogen conditions existed?

Why layers matter

Soils are not uniform from the surface to depth.

One layer may be:

  • sandy
  • shallow
  • low in organic matter
  • easy to dry out

Another may be:

  • denser
  • finer textured
  • higher in water-holding capacity

DSSAT therefore uses layered soil descriptions rather than one single number for "the soil."

Common soil concepts beginners should know

Bulk density

Bulk density is a measure of how compact the soil is.

It affects rooting and water behavior.

Lower limit and drained upper limit

These help define plant-available water.

In practical terms:

  • the lower limit is near the dry point where roots struggle to extract water
  • the drained upper limit is near field capacity after excess water drains away

The difference between them is a major part of the plant-available water range.

Saturation

This is the water content when pore space is essentially full of water.

Organic carbon or organic matter

These influence soil fertility, structure, and in some workflows nitrogen dynamics.

Rooting depth

Even a well-watered surface may not rescue a crop if effective rooting depth is too small.

Why soil can strongly affect biomass

Soil influences:

  • water supply
  • nitrogen availability
  • root exploration
  • stress timing

That means a crop can have correct weather and decent genotype coefficients but still simulate poorly if the soil profile is unrealistic.

Soil file versus experiment-level soil information

There are two related but distinct ideas:

  • the general soil profile definition
  • experiment-specific initial conditions or soil analysis sections

The soil file usually provides the main profile structure. The experiment file may refine starting conditions for a particular run.

Common beginner problems

Typical issues include:

  • copying a soil profile from another site without checking depth or texture
  • missing lower layers
  • unrealistic hydraulic values
  • inconsistent initial water content
  • confusion between measured values and estimated defaults

Practical questions to ask about a soil profile

Before trusting a profile, ask:

  1. Does the depth structure make sense for the site?
  2. Are the texture or hydraulic values plausible by layer?
  3. Is rooting depth likely to be limited?
  4. Are initial water and nitrogen conditions documented?
  5. Was the profile measured, estimated, or borrowed?

Beginner takeaway

Think of a soil file as the hidden reservoir and root environment beneath the crop.

If weather drives what the atmosphere offers, soil determines what the plant can actually access.

How to Build Soil Profiles for DSSAT

This chapter focuses on workflow.

The goal is to help a beginner move from field or laboratory information to a soil profile that DSSAT can use responsibly.

Start with honesty about where the data came from

A soil profile can be:

  • directly measured in the field and laboratory
  • partly measured and partly estimated
  • borrowed from a similar nearby site
  • assembled from published or mapped information

These are not equally strong situations.

Always document which one applies.

A practical build workflow

Step 1: define the site and profile depth

Record:

  • site identifier
  • coordinates if available
  • effective soil depth
  • any known restrictive layer

If you do not know how deep roots can realistically explore, note that as an uncertainty rather than hiding it.

Step 2: define soil layers

Choose layers that reflect the real profile as well as the resolution supported by your data.

Typical layer information includes:

  • top and bottom depth
  • texture or textural class
  • bulk density
  • hydraulic information
  • carbon or organic matter

Step 3: gather hydraulic information

DSSAT needs a reasonable description of water behavior.

This often includes values related to:

  • lower water limit
  • drained upper limit
  • saturation

If these were not measured directly, document how they were estimated.

Pedotransfer functions or standard tables may be used, but they should not be treated as exact field truth.

Step 4: define initial conditions

A soil profile alone is not the whole story.

You also need to think about the starting condition at planting, such as:

  • initial soil water
  • initial mineral nitrogen
  • residue or previous-crop effects if relevant

These may be represented partly in the experiment file rather than only in the main soil profile.

Step 5: check consistency

Common consistency questions include:

  • Are layer depths continuous?
  • Do hydraulic values increase or decrease in physically impossible ways?
  • Is saturation greater than drained upper limit?
  • Is drained upper limit greater than lower limit?
  • Is bulk density plausible for the texture?

These checks catch many silent data problems.

Step 6: write the DSSAT soil file

Once the profile is internally consistent, write it into the DSSAT format with:

  • correct site identifier
  • correct layer boundaries
  • correct layer variables

Step 7: sanity test in a simulation

Before using the profile for serious calibration, run a simple test and inspect:

  • early growth
  • stress behavior
  • whether the crop appears unrealistically drought-limited or unstressed

If stress patterns look impossible, revisit the soil assumptions before changing genetics.

When borrowing a profile is acceptable

Sometimes the real site profile is unavailable.

Borrowing a profile may be acceptable if:

  • the soil type is genuinely similar
  • the climate and management context are not wildly different
  • the borrowed profile is clearly labeled as provisional

Borrowed soil should be treated as a starting point, not as a final truth.

Common calibration trap

One frequent mistake is trying to fix a soil problem by changing cultivar coefficients.

For example:

  • if soil water is too available, biomass may be too high
  • if soil water is too restrictive, biomass may be too low

Changing phenology or partitioning coefficients to compensate can make the model less biologically meaningful.

Beginner takeaway

Building a DSSAT soil profile is part measurement, part estimation, and part quality control.

The important thing is not pretending the file is perfect. The important thing is making every assumption visible and internally consistent.

Management Files and Experiment Sections

Weather and soil describe the environment.

Management describes what people did to the crop.

This is one of the most important beginner ideas in crop modeling:

the model does not simulate an abstract crop in abstract weather.

It simulates a managed crop in a specific field context.

What management usually includes

Management information can include:

  • planting date
  • planting method
  • planting density
  • row spacing
  • fertilizer timing and amount
  • irrigation timing and amount
  • tillage or residue conditions
  • harvest timing or harvest rules

Different projects will use different subsets of these.

Why management matters so much

Two experiments with the same cultivar, soil, and weather can still behave very differently if management differs.

Examples:

  • planting two weeks earlier changes temperature and daylength exposure
  • higher plant density changes canopy development and competition
  • more nitrogen changes growth potential and partitioning
  • irrigation changes stress timing

Where management appears in DSSAT

In many DSSAT workflows, management is encoded inside the experiment file rather than one standalone file.

That means sections such as:

  • planting details
  • fertilizer applications
  • irrigation events
  • harvest details

are part of the experimental recipe.

Planting information

Planting details often include:

  • planting date
  • seed or plant population
  • row spacing
  • planting depth

These variables strongly affect:

  • establishment
  • canopy development
  • competition for light
  • development trajectory

Fertilizer information

Fertilizer management usually specifies:

  • application date
  • nutrient type
  • amount applied
  • sometimes method or depth

Nitrogen management can strongly influence biomass and partitioning, so these entries should be checked carefully in any calibration workflow.

Irrigation information

Irrigation events tell the model when and how much water was added.

Even if your main scientific question is not about irrigation, getting these events wrong can dramatically change stress patterns.

Harvest information

Harvest sections matter because they define what outcome is being represented and when it is measured.

This becomes especially important when comparing simulated versus observed harvest biomass or stage timing.

Why experiment sections matter for paper reproduction

In published case studies, the experiment file is often where the actual treatment logic lives.

That means understanding a paper's:

  • cultivar treatments
  • planting dates
  • nitrogen levels
  • harvest schedule

often requires reading the experiment file closely, not only the article text.

Beginner takeaway

Management is the bridge between "what the environment allowed" and "what the researchers or farmers actually did."

If a simulation seems wrong, never inspect genetics alone. Management often contains the simplest explanation.

Genotype, Cultivar, and Ecotype in DSSAT

These terms are easy to confuse, especially for new modelers.

This chapter explains the practical meaning of each one in a DSSAT workflow.

Why genotype information exists

Weather, soil, and management describe the environment around the crop.

Genotype-related files describe how this crop type responds within that environment.

Without genotype coefficients, the model would have no crop-specific identity.

Species, cultivar, and ecotype

In practical DSSAT usage, beginners often encounter three levels:

  • species-level information
  • cultivar-level information
  • ecotype-level information

The exact details vary by family, but the broad idea is:

Species

Species-level definitions often hold more general crop information shared across many cultivars.

Cultivar

Cultivar coefficients distinguish one named or coded cultivar from another.

These may affect:

  • development timing
  • partitioning
  • morphology
  • maturity behavior

Ecotype

Ecotype coefficients often describe broader adaptation patterns related to environmental response, especially development and phenology.

However, not all DSSAT families use ecotype files in the same way, and some do not use .ECO files at all.

Why this matters for a general wrapper

The original crop-specific wrapper could assume more about file patterns.

A broader omniwrapper cannot assume:

  • every crop has .ECO
  • every cultivar key is stored under the same column name
  • every family matches genotype files the same way

That is why registry logic and family-aware parsing matter in this repository.

What calibration usually changes

In many workflows, calibration focuses on cultivar or ecotype coefficients rather than species-level definitions.

Typical goals may include improving:

  • flowering date
  • maturity timing
  • biomass accumulation
  • stem or grain partitioning
  • leaf-area development

Common beginner misunderstanding

Beginners sometimes think cultivar coefficients are just labels.

They are not.

They are part of the biological identity of the simulation.

Changing them is changing how the virtual crop behaves.

Why coefficient changes should be disciplined

Because many coefficients interact, it is easy to create a nice-looking fit that has poor biological meaning.

A better approach is to ask:

  • Which process is wrong?
  • Which parameter family controls that process?
  • Is the change biologically plausible?

That is much safer than tuning everything at once.

Beginner takeaway

Genotype files are where the model learns what kind of crop it is simulating.

If weather and soil describe the world around the plant, genotype coefficients describe the plant's built-in behavior inside that world.

Worked Case Study: Weather, Soil, and Management Construction

This chapter is a fully worked teaching example.

It is meant to answer the beginner question:

"If someone hands me raw notes from a field experiment, how do I turn them into the pieces DSSAT actually needs?"

Important note before we start

This is an instructional case study, not a claim that every number below came from a published hemp dataset.

The purpose is to show the workflow clearly and realistically:

  • what information is needed
  • what assumptions are acceptable
  • what should be checked before running DSSAT

The scenario

Imagine an intern receives the following field summary from a hemp trial.

Raw field notes

Site:

  • site name: North Farm Block A
  • latitude: 29.70 N
  • longitude: 82.41 W
  • elevation: 32 m

Crop:

  • crop: industrial hemp
  • cultivar: IH Williams
  • purpose: fiber-focused biomass evaluation

Management notes:

  • planting date: 2022-05-12
  • row spacing: 0.76 m
  • target population: 60 plants m-2
  • planting depth: 2.5 cm
  • nitrogen fertilizer:
    • 60 kg N ha-1 at planting
    • 120 kg N ha-1 on 2022-06-10
  • irrigation:
    • 20 mm on 2022-05-14
    • 25 mm on 2022-06-01
  • biomass harvests:
    • 2022-06-20
    • 2022-07-05
    • 2022-07-22
    • 2022-08-15

Weather notes:

  • automatic station near field
  • daily data available for:
    • solar radiation
    • minimum temperature
    • maximum temperature
    • rainfall

Soil notes:

  • sandy loam surface
  • deeper layer becomes sandy clay loam
  • effective rooting depth about 120 cm
  • bulk density measured for three depth ranges
  • field team estimated initial soil water as "moderately moist"

At first glance, this is useful, but it is not yet a DSSAT-ready dataset.

The rest of the chapter shows how to transform it.

Step 1: separate the problem into DSSAT input categories

The raw notes need to be reorganized into four modeling categories:

  1. weather
  2. soil
  3. management
  4. genotype

For this chapter we focus on the first three, because those are the pieces most new interns struggle to construct from raw notes.

Step 2: build the weather component

What raw weather information must become

For DSSAT, the weather input must become:

  • one station definition
  • one daily time series
  • consistent daily units
  • continuous dates across the simulation period

Example weather table before DSSAT formatting

Suppose the station export begins like this:

DateSolarRad_MJ_m2Tmin_CTmax_CRain_mm
2022-05-1224.118.731.20.0
2022-05-1322.919.230.54.3
2022-05-1425.318.931.70.0
2022-05-1523.820.132.01.2

Before writing this into a DSSAT weather file, the intern should ask:

  • Are there any missing days?
  • Are radiation units really MJ m-2 day-1?
  • Are rainfall values daily totals?
  • Are minimum and maximum temperatures ever swapped?

Weather quality-control checklist

For this example, the intern should do all of the following:

  1. confirm the file has every date from planting through the harvest period
  2. confirm there are no duplicate rows
  3. confirm TMAX >= TMIN for every day
  4. inspect whether radiation values are physically plausible
  5. inspect whether rainfall totals look reasonable relative to the site and season

Station metadata

The station definition should also be made explicit:

ItemValue
Station codeNFBA
Latitude29.70
Longitude-82.41
Elevation32

Even if the intern does not yet know the exact final .WTH syntax by memory, they should already understand that DSSAT needs both:

  • the station description
  • the day-by-day weather values

What matters scientifically

For hemp, weather is not just a background file. It affects:

  • thermal accumulation
  • photoperiod context through site location
  • water balance through rainfall
  • biomass opportunity through solar radiation

If flowering or biomass later look wrong, weather is one of the first things to recheck.

Step 3: build the soil component

What the raw soil notes must become

The field notes said:

  • sandy loam surface
  • deeper layer becomes sandy clay loam
  • effective rooting depth about 120 cm
  • bulk density measured for three depth ranges
  • initial soil water moderately moist

That description must become a layered soil profile.

Example layer table before DSSAT formatting

Layer bottom (cm)TextureBulk density (g cm-3)Lower limitDrained upper limitSaturationOrganic C (%)
15Sandy loam1.420.080.200.391.20
30Sandy loam1.450.090.210.380.95
60Sandy clay loam1.480.120.260.400.70
90Sandy clay loam1.500.130.270.400.55
120Sandy clay loam1.520.140.280.410.45

This is now much closer to DSSAT-ready thinking, because the soil is expressed layer by layer rather than as a paragraph.

What the intern should check

Before this becomes a DSSAT .SOL entry, the intern should confirm:

  1. layer depths are continuous and increasing
  2. saturation is greater than drained upper limit
  3. drained upper limit is greater than lower limit
  4. bulk density values are plausible for the textures listed
  5. the effective rooting depth is consistent with the deepest useful layer

Initial conditions

The field note "moderately moist" is not yet numeric enough for a simulation.

The intern needs to decide whether initial soil water will come from:

  • direct measured layer water contents
  • estimated fractions between lower limit and drained upper limit
  • default assumptions documented in the experiment notes

For teaching purposes, a careful provisional assumption might be:

  • upper two layers start near 75% of plant-available water
  • deeper layers start near 65% of plant-available water

That is not perfect truth. It is an explicit assumption that can later be replaced with measurements.

Why this matters scientifically

If the soil profile holds too much water, the crop may look unrealistically vigorous.

If it holds too little water, the crop may show water stress too early.

That is why soil should be checked before blaming cultivar coefficients.

Step 4: build the management component

Raw management notes

From the original notes:

  • planting date: 2022-05-12
  • row spacing: 0.76 m
  • target population: 60 plants m-2
  • planting depth: 2.5 cm
  • nitrogen:
    • 60 kg N ha-1 at planting
    • 120 kg N ha-1 on 2022-06-10
  • irrigation:
    • 20 mm on 2022-05-14
    • 25 mm on 2022-06-01
  • biomass harvests:
    • 2022-06-20
    • 2022-07-05
    • 2022-07-22
    • 2022-08-15

Management interpretation table

ConceptDSSAT-relevant interpretation
Planting dateplanting operation date
Populationplants per square meter
Row spacingrow geometry affecting canopy setup
Planting depthestablishment input
Nitrogen applicationsfertilizer events with date and amount
Irrigation applicationsirrigation events with date and amount
Harvest datesobservation schedule and possibly harvest management

A management sheet the intern can draft

Planting

ItemValue
Planting date2022-05-12
Population60 plants m-2
Row spacing0.76 m
Depth2.5 cm

Fertilizer

DateNutrientAmount
2022-05-12N60 kg ha-1
2022-06-10N120 kg ha-1

Irrigation

DateAmount
2022-05-1420 mm
2022-06-0125 mm

Harvest observations

DatePurpose
2022-06-20destructive biomass sample
2022-07-05destructive biomass sample
2022-07-22destructive biomass sample
2022-08-15destructive biomass sample

This gives the intern a structured view before they ever touch DSSAT syntax.

Step 5: connect the pieces in the experiment logic

At this point the intern has the three major non-genetic building blocks:

  • a cleaned daily weather series
  • a layered soil profile
  • a structured management schedule

The next conceptual step is to connect them in one experiment description:

  • the experiment references the weather station
  • the field references the soil profile
  • the treatment references the planting, fertilizer, irrigation, and harvest logic
  • the treatment also references the cultivar choice

That connection is exactly why experiment files are the "conductor" of a DSSAT run.

Step 6: define what still remains uncertain

A realistic teaching case should not hide uncertainty.

For this example, open questions might still include:

  • Were hydraulic limits measured or estimated?
  • Was initial soil water measured or assumed?
  • Were plant populations final established counts or target planting rates?
  • Did the field station sit exactly at the plot location?

These are not embarrassments. They are part of honest model documentation.

Step 7: what the intern should save as project artifacts

By the end of this construction phase, a careful intern should have:

  1. a cleaned weather table
  2. a weather metadata note
  3. a layered soil table
  4. a management summary table
  5. a written list of assumptions and unresolved uncertainties

That package is often more important than the final DSSAT files alone.

What this case study teaches

This example shows that "building DSSAT inputs" is really a chain of decisions:

  • data collection
  • unit checking
  • structure definition
  • assumption logging
  • model-oriented formatting

That is the discipline that later makes calibration and paper reproduction trustworthy.

Installed DSSAT Setup

This path is the simplest if DSSAT is already installed on your machine.

Requirements

  • a local DSSAT installation
  • R
  • the required R packages:
    • DSSAT
    • dplyr
    • tidyr
    • lubridate

Install the R packages if needed:

install.packages(c("DSSAT", "dplyr", "tidyr", "lubridate"))

Source the wrapper

From your local DSSAT-wrapper clone:

source("C:/path/to/DSSAT-wrapper/R/DSSAT_omniwrapper.R")

Minimal example

result <- DSSAT_omniwrapper(
  model_options = list(
    DSSAT_path = "C:/path/to/DSSAT48",
    DSSAT_exe = "DSCSM048.EXE",
    project_file = "C:/path/to/DSSAT48/Wheat/KSAS8101.WHX",
    suppress_output = TRUE
  ),
  situation = "KSAS8101_1",
  var = "GSTD"
)

What the wrapper needs

At minimum, the omniwrapper expects:

  • DSSAT_path
  • project_file

It can often infer:

  • crop code
  • crop folder
  • default module
  • genotype file names

When to provide Crop

Provide Crop when the experiment file lives outside the installed DSSAT crop folder and the wrapper should know which installed crop directory to stage against.

When to provide module_code

Provide module_code when you want to force an alternate engine instead of the default module in the DSSAT profile. Examples:

  • MZIXM048
  • WHAPS048
  • TFAPS048
  • SUOIL048
  • SCCSP048
  • SCSAM048

Basic validation

You can run the bundled smoke test:

Rscript C:\path\to\DSSAT-wrapper\R\test_DSSAT_omniwrapper.R

And the broader family validator:

Rscript C:\path\to\DSSAT-wrapper\R\validate_DSSAT_omniwrapper_families.R

Using GitHub-Sourced Example Data

You do not need every experiment file to live inside the DSSAT installation folder.

A common pattern is:

  • keep DSSAT installed locally
  • clone or download an example-data repository
  • point project_file to the external experiment file
  • optionally set Crop
  • optionally set module_code

Example layout

C:/work/
  DSSAT48/
  dssat-csm-data/

Example: alternate maize engine

source("C:/path/to/DSSAT-wrapper/R/DSSAT_omniwrapper.R")

result <- DSSAT_omniwrapper(
  model_options = list(
    DSSAT_path = "C:/work/DSSAT48",
    DSSAT_exe = "DSCSM048.EXE",
    Crop = "Maize",
    project_file = "C:/work/dssat-csm-data/Maize/UFGA8201.MZX",
    module_code = "MZIXM048",
    suppress_output = TRUE
  ),
  situation = "UFGA8201_1",
  var = "CWAD"
)

Example: wheat with NWHEAT

result <- DSSAT_omniwrapper(
  model_options = list(
    DSSAT_path = "C:/work/DSSAT48",
    DSSAT_exe = "DSCSM048.EXE",
    Crop = "Wheat",
    project_file = "C:/work/dssat-csm-data/Wheat/KSAS8101.WHX",
    module_code = "WHAPS048",
    suppress_output = TRUE
  ),
  situation = "KSAS8101_1",
  var = "GSTD"
)

Why this works

The omniwrapper stages a temporary run directory and copies in the files DSSAT needs for that run:

  • the experiment file
  • companion *.A and *.T files when present
  • genotype files from the local DSSAT installation or the project folder

That means your source-data checkout stays clean and DSSAT still runs from the local executable.

Good practice

  • Keep DSSAT itself installed locally.
  • Treat GitHub example repositories as input-data sources.
  • Do not hard-code personal paths in scripts you plan to share.
  • Use placeholders like C:/path/to/... in public documentation.

Your First Run

This chapter shows the shortest useful path from a fresh machine to a working DSSAT run from R.

Step 1: Prepare the prerequisites

You need:

  • R installed
  • DSSAT installed locally
  • this wrapper repository available locally

For example:

  • DSSAT at C:/path/to/DSSAT48
  • the repo in any local folder

Step 2: Install R packages

Run:

install.packages(c("DSSAT", "dplyr", "tidyr", "lubridate"))

These are the minimum packages used by the wrapper core.

Step 3: Source the omniwrapper

This step assumes you have also cloned the DSSAT-wrapper module separately.

For example:

  • C:/path/to/DSSAT-wrapper
source("C:/path/to/DSSAT-wrapper/R/DSSAT_omniwrapper.R")

This loads the public function and the helper modules behind it.

Step 4: Choose a known-good example

A good first example is a shipped DSSAT case that is already validated.

For instance:

result <- DSSAT_omniwrapper(
  model_options = list(
    DSSAT_path = "C:/path/to/DSSAT48",
    DSSAT_exe = "DSCSM048.EXE",
    project_file = "C:/path/to/DSSAT48/Wheat/KSAS8101.WHX",
    suppress_output = TRUE
  ),
  situation = "KSAS8101_1",
  var = "GSTD"
)

Step 5: Inspect the result

names(result)
names(result$sim_list)
head(result$sim_list[["KSAS8101_1"]])

At this point, the main question is not yet "does the biology look perfect?"

The main question is:

"Did the wrapper successfully identify the model context, launch DSSAT, and read the outputs back?"

Step 6: Read observations if available

obs <- DSSAT_omni_read_obs(
  model_options = list(
    DSSAT_path = "C:/path/to/DSSAT48",
    DSSAT_exe = "DSCSM048.EXE",
    project_file = "C:/path/to/DSSAT48/Wheat/KSAS8101.WHX"
  ),
  situation = "KSAS8101_1",
  read_end_season = TRUE
)

This is the bridge between "a run works" and "a run can be evaluated."

Common beginner mistakes

Using the wrong project file

Make sure the file extension and crop family make sense for the case you want to run.

Pointing only to copied crop inputs

Some crops also require install-level metadata to be present and consistent.

Confusing crop choice with module choice

Some crops have alternate engines. If a non-default engine is required, you may need to provide module_code.

Example:

module_code = "WHAPS048"

What to do if the first run fails

Do not immediately start editing coefficients.

Instead:

  1. run the self-check
  2. verify the project file path
  3. verify the DSSAT install path
  4. verify that the required genotype files exist
  5. check whether the example is supposed to use a non-default engine

That sequence saves a lot of time.

Self-Check and Validation

This chapter explains how the wrapper protects itself against common failures.

Why a self-check matters

Many DSSAT failures happen before the biology starts:

  • a required file is missing
  • the wrong crop folder is assumed
  • a genotype file is absent
  • the project file lives outside the installed DSSAT crop folder
  • the requested variable does not exist for that family

If we only discover those issues after a failed run, debugging becomes slow and noisy.

DSSAT_omni_self_check()

The self-check is designed to fail early and readably.

It inspects things like:

  • DSSAT executable
  • SIMULATION.CDE
  • DSSATPRO.V48
  • project file
  • genotype directory
  • species, cultivar, and ecotype files when relevant
  • companion observation files
  • situation naming consistency
  • requested variable availability

This check runs before DSSAT_omniwrapper() launches DSSAT.

Why this is especially important in a multi-family wrapper

A crop-specific wrapper can rely on more fixed assumptions.

A multi-family wrapper cannot.

That means validation has to be more explicit because we are working across:

  • different model engines
  • different genotype patterns
  • different output conventions
  • different companion observation layouts

Smoke tests vs broader validation

The repo uses more than one validation layer.

Smoke tests

These confirm the core path is still alive.

Examples:

  • DSSAT-wrapper/R/test_DSSAT_omni_self_check.R
  • DSSAT-wrapper/R/test_DSSAT_omniwrapper.R
  • DSSAT-wrapper/R/test_DSSAT_omni_read_obs.R

Cross-family validation

This confirms the architecture still works across the broader supported set.

Example:

  • DSSAT-wrapper/R/validate_DSSAT_omniwrapper_families.R

These validation scripts are designed for other users to run on their own machines. They use:

  • DSSAT_PATH for the local DSSAT installation
  • DSSAT_CSM_DATA for an optional local clone of dssat-csm-data

If DSSAT_PATH is not set, the scripts try the common Windows default C:/DSSAT48.

Why validated families matter

There is a difference between:

  • "the code recognizes this family"
  • "the code can launch this family"
  • "the code has been validated end-to-end"

The documentation should always keep those categories separate. That is how we avoid overclaiming support.

Practical workflow for contributors

When changing wrapper internals:

  1. run the self-check test
  2. run the smoke test
  3. run the cross-family validator
  4. if you touched observation logic, run the observation regression test too

That sequence should be treated as part of development, not as an afterthought.

How Calibration Actually Works

Calibration is one of the most misunderstood parts of crop modeling.

Many beginners think calibration means:

"change numbers until the curve looks good."

That is not a reliable modeling strategy.

Calibration should be structured, biologically informed, and documented.

What calibration is

Calibration means adjusting uncertain model parameters so that simulated behavior agrees better with observed behavior for a defined dataset and question.

In crop-model work, common calibration targets include:

  • emergence date
  • flowering date
  • maturity date
  • biomass trajectory
  • stem biomass
  • grain yield
  • leaf area

What calibration is not

Calibration is not:

  • hiding a bad weather file
  • compensating for a wrong soil profile by changing genetics
  • forcing every variable to match perfectly
  • proof that the model is universally correct

A good calibration improves fit while preserving biological meaning.

A safer calibration order

For many projects, especially beginner projects, the safest order is:

  1. confirm the run and inputs are correct
  2. calibrate timing and stage transitions
  3. calibrate overall biomass trajectory
  4. calibrate organ partitioning
  5. evaluate on cases not used for adjustment when possible

This order follows process logic rather than aesthetic curve fitting.

Why timing usually comes first

If flowering timing is wrong, many later outputs can be wrong for the wrong reason.

For example:

  • the crop may stop vegetative growth too early
  • stem biomass may be too low simply because the crop transitioned too soon
  • seed filling may begin at the wrong time

That is why phenology often deserves attention before final biomass.

Why input checking comes before parameter changes

Before changing cultivar or ecotype parameters, confirm:

  • weather is correct
  • soil is plausible
  • management matches the experiment
  • observations are being read correctly
  • the correct model family and module are actually running

Otherwise, you may calibrate around an avoidable input error.

Objective functions

Calibration needs a way to judge whether one parameter set is better than another.

Common criteria include:

  • RMSE
  • agreement indices such as Willmott's d
  • bias
  • visual fit of time series

No single metric tells the whole story, so it is often best to inspect both numbers and plots.

Multi-variable calibration

Real projects often fit more than one variable at once.

For hemp, this may include:

  • flowering time
  • total aboveground biomass
  • stem biomass
  • plant height

This creates tradeoffs. A parameter change that improves one variable may worsen another.

That is why calibration should be guided by process understanding, not only by one summary score.

Validation after calibration

Calibration is stronger when you can test the calibrated model on:

  • another season
  • another site
  • another treatment set

If the fit is good only on the exact data used for tuning, the model may be overfit.

Beginner takeaway

Calibration is best thought of as a scientific workflow:

  • check inputs
  • define targets
  • adjust plausible parameters
  • evaluate carefully
  • document everything

That mindset is much more valuable than chasing a single perfect metric.

Hopf Paper Case Study

This chapter explains how the wrapper can be used in a real reproduction task.

The example is the hemp paper:

Hopf et al. (2025), "Adaptation of the process-based CSM-CROPGRO model to simulate the growth and development of industrial hemp for seed and fiber production."

Why this case study matters

It is a good teaching example because it combines:

  • a real publication
  • real observed data files
  • real genotype coefficients
  • a model family that required install metadata to be consistent
  • a workflow that can later inform work in other model frameworks

The four key experiment files

In the local reproduction workspace, the exact paper experiment files came from dssat-csm-data/Hemp:

  • UFCI2101.HMX
  • UFCI2201.HMX
  • UFJA2101.HMX
  • UFJA2201.HMX

Together they define the 15 Florida cases listed in the paper's experiment table.

The matching observation files

Each experiment also had:

  • a time-course observation file
  • an anthesis or summary observation file

For hemp these appeared as:

  • .HMT
  • .HMA

That is exactly the kind of structure the wrapper's observation-reading logic needs to support if it is going to help with paper reproduction.

Why the observation-path fix mattered

Originally, the new wrapper could read observations only from the installed DSSAT crop folder.

That was not enough for this paper workflow, because the exact paper experiment files and observations lived in an external project directory.

The wrapper was updated so that observation files can be found relative to the project_file directory when needed.

That small technical change is a good example of how real case studies improve general wrapper design.

First-pass reproduction logic

The reproduction workflow does this:

  1. define the 15 paper cases explicitly
  2. run each case through DSSAT_omniwrapper()
  3. read the matching observations
  4. join simulated and observed values on date
  5. compute first-pass metrics such as d and RMSE
  6. compare observed and simulated flowering dates

What the first pass can and cannot claim

It can claim:

  • the paper experiments rerun successfully
  • the observed and simulated data can be aligned reproducibly
  • the resulting performance metrics are strong enough to justify deeper analysis

It should not yet claim:

  • that every published metric has been matched exactly
  • that every figure in the paper has been rebuilt identically
  • that every aggregation choice used by the authors has already been reproduced

That distinction is part of good scientific communication.

Why this case study belongs in the book

Because it shows the difference between:

  • building a wrapper in theory
  • building one that survives contact with a real publication workflow

That is the level of evidence that helps a community trust a tool.

End-to-End Intern Exercise: From Raw Notes to a Finished Run

This exercise is designed for a beginner who knows little or nothing about:

  • agronomy
  • DSSAT
  • hemp
  • weather files
  • soil profiles

The goal is not only to finish a run.

The goal is to learn how to think from raw field information all the way to a traceable simulation result.

Exercise goal

By the end of this exercise, the intern should be able to:

  1. organize raw field notes into DSSAT input categories
  2. identify missing information and explicit assumptions
  3. build weather, soil, and management tables in a DSSAT-oriented way
  4. connect those inputs to a project file
  5. launch a run with the wrapper
  6. inspect the outputs and explain what happened at a beginner level

The starting packet

Imagine the intern receives this packet.

Field note summary

Site:

  • site code: TRN1
  • latitude: 29.70 N
  • longitude: 82.41 W
  • elevation: 32 m

Crop:

  • crop: industrial hemp
  • cultivar: IH Williams
  • objective: evaluate stem and total biomass under moderate nitrogen

Management:

  • planting date: 2022-05-12
  • population: 60 plants m-2
  • row spacing: 0.76 m
  • planting depth: 2.5 cm
  • nitrogen applied:
    • 60 kg N ha-1 on planting day
    • 120 kg N ha-1 on 2022-06-10
  • irrigation applied:
    • 20 mm on 2022-05-14
    • 25 mm on 2022-06-01

Observation dates:

  • 2022-06-20
  • 2022-07-05
  • 2022-07-22
  • 2022-08-15

Soil notes:

  • sandy loam surface
  • sandy clay loam subsoil
  • profile depth 120 cm
  • moderate initial moisture

Weather source:

  • nearby station export with daily radiation, minimum temperature, maximum temperature, and rainfall

Part 1: organize the packet

Task 1

Create a four-part note with headings:

  • weather
  • soil
  • management
  • genotype

Expected result

The intern should place:

  • station coordinates and daily weather under weather
  • texture, depth, and water status clues under soil
  • planting, fertilizer, irrigation, and observation dates under management
  • cultivar name under genotype

Why this matters

This is the first step in learning to think like a modeler rather than like a reader of mixed field notes.

Part 2: build a weather preparation sheet

Task 2

Create a table with columns:

  • Date
  • SRAD
  • TMIN
  • TMAX
  • RAIN

Populate it from the station export and check:

  1. no missing dates
  2. no duplicate dates
  3. TMAX >= TMIN
  4. rainfall units are daily totals
  5. radiation units are consistent

Deliverable

A cleaned daily weather table plus a one-paragraph note describing:

  • source of data
  • date coverage
  • any corrections made

Common beginner trap

Do not jump straight into a DSSAT weather file before doing the quality checks.

Part 3: build a soil preparation sheet

Task 3

Turn the soil notes into a layered table with at least five layers down to 120 cm.

Suggested columns:

  • Layer bottom
  • Texture
  • Bulk density
  • LL
  • DUL
  • SAT
  • Organic C

Deliverable

A provisional soil profile table plus a note that states clearly which values were:

  • measured
  • estimated
  • borrowed from standard references or nearby profiles

Common beginner trap

Do not pretend estimated hydraulic values are measured truths.

Part 4: build a management preparation sheet

Task 4

Create separate tables for:

  • planting
  • fertilizer
  • irrigation
  • harvest or observation schedule

Deliverable

A management packet that could later be encoded in an experiment file.

Common beginner trap

Do not mix destructive biomass harvest dates with routine standing observations without labeling the difference clearly.

Part 5: document assumptions

Task 5

Write a short list of assumptions such as:

  • initial soil moisture estimated from field note "moderately moist"
  • weather station assumed representative of the field
  • cultivar name mapped to the intended DSSAT cultivar entry

Deliverable

A plain-language assumptions log.

Why this matters

Good model work is not about hiding uncertainty. It is about making uncertainty visible and manageable.

Part 6: prepare the run context

For the first execution exercise, the intern does not need to hand-author every DSSAT file from scratch.

Instead, the intern should work from a known project template and inspect how its file structure maps onto the prepared notes.

Task 6

Open a known hemp project file and identify:

  • weather linkage
  • soil linkage
  • planting section
  • fertilizer section
  • cultivar section
  • simulation controls

Deliverable

A short annotation note explaining where each of those pieces lives.

Why this step is important

This builds the bridge between the preparation sheets and the real DSSAT file ecosystem.

Part 7: launch a first run

Once the intern has a valid project file and a working DSSAT installation, they can launch a test run with the wrapper.

Example run pattern

source("C:/path/to/DSSAT-wrapper/R/DSSAT_omniwrapper.R")

result <- DSSAT_omniwrapper(
  model_options = list(
    DSSAT_path = "C:/path/to/DSSAT48",
    DSSAT_exe = "DSCSM048.EXE",
    project_file = "C:/path/to/project/Hemp/TRN1.HMX",
    suppress_output = TRUE
  ),
  situation = "TRN1_1",
  var = "CWAD"
)

Task 7

Run the project and record:

  • whether the wrapper returned without error
  • how many rows were in the simulation output
  • which variables were available
  • the first and last simulated dates

Deliverable

A short run note summarizing the simulation status.

Part 8: perform a first interpretation

Task 8

Look at the simulation output and answer:

  1. Did the crop simulate through the intended season?
  2. Is the requested biomass variable present?
  3. Does the output look like a daily time series?
  4. Does the timing appear plausible relative to planting and harvest dates?

Deliverable

A one-page beginner interpretation note.

Example interpretation language

Good:

  • "The run completed and produced a daily biomass series through the expected season."
  • "I still need to verify whether flowering timing and biomass magnitude are realistic."

Not good:

  • "The model is correct because it ran."

Part 9: optional observation comparison

If observation files are available, the intern can take one more step and read them through the wrapper:

obs <- DSSAT_omni_read_obs(
  model_options = list(
    DSSAT_path = "C:/path/to/DSSAT48",
    DSSAT_exe = "DSCSM048.EXE",
    project_file = "C:/path/to/project/Hemp/TRN1.HMX"
  ),
  situation = "TRN1_1",
  read_end_season = TRUE
)

Task 9

Check whether observation data were found and whether their dates overlap with the simulated series.

Deliverable

A note answering:

  • Were observations present?
  • Were they in-season, end-season, or both?
  • Are the dates aligned enough for later plotting or metrics?

What a successful exercise submission should contain

A strong intern submission should include:

  1. an organized raw-notes summary
  2. a cleaned weather table
  3. a layered soil table
  4. a management table set
  5. an assumptions log
  6. a project-file annotation note
  7. a run summary
  8. a first interpretation note

What this exercise is really teaching

This is not just a software exercise.

It teaches the intern how to move across four ways of thinking:

  • field observation
  • structured data preparation
  • simulation setup
  • scientific interpretation

That transition is the heart of practical crop-model work.

Visualizing Results

Running a model is only the first half of the job.

The second half is making the results interpretable.

Why visualization matters in crop-model workflows

Good plots help you answer different kinds of questions:

  • Did the model run at all?
  • Are the observed and simulated values on the same scale?
  • Is the mismatch systematic or occasional?
  • Which treatments are hardest for the model?
  • Are errors coming from timing, magnitude, or both?

Without plots, calibration can become blind.

What to visualize first

For a beginner, the most useful first figures are:

  1. observed vs simulated scatter plots
  2. time-series overlays
  3. flowering or stage-date comparisons
  4. summary metric panels

That sequence moves from broad agreement to case-specific diagnosis.

A practical pattern

For real projects, keep figure generation separate from the main run logic.

That separation makes it easier to:

  • rerun figures without rerunning every simulation
  • audit the data feeding each figure
  • debug joins and aggregation choices

The Hopf case-study figures

In the local reproduction workspace, a focused figure script was built around the reproduction outputs rather than around older dashboard-specific code.

The idea is general:

  • use the wrapper to generate clean outputs
  • save joined time-series and metric tables
  • build plots from those saved outputs

That pattern is more portable than a plotting script that is tightly coupled to one personal machine layout.

Common figure types in this project

Observed vs simulated panels

Best for:

  • quick agreement checks
  • identifying bias
  • comparing multiple variables side by side

Time-series overlays

Best for:

  • checking whether timing is right
  • checking growth trajectory shape
  • seeing whether the model diverges early or late

Stage-date alignment plots

Best for:

  • phenology validation
  • identifying treatment-specific timing issues

Metric summaries

Best for:

  • reporting
  • comparing variables at a glance
  • identifying which variables need more work

Advice for contributors

Treat plotting code as part of the reproducibility workflow, not as a cosmetic extra.

That means:

  • keep inputs explicit
  • save derived tables
  • use consistent naming
  • avoid hard-coded personal directories
  • document what each figure is meant to prove

Troubleshooting

This chapter collects common failure modes and how to think through them.

Problem: the wrapper cannot find a file

Likely causes:

  • wrong DSSAT_path
  • wrong project_file
  • missing genotype files
  • companion observation files not where the wrapper expects them

What to do:

  1. run the self-check
  2. inspect the reported paths
  3. confirm whether the project lives in the DSSAT install or in an external directory

Problem: the run exits with no PlantGro.OUT

Likely causes:

  • DSSAT itself rejected the experiment
  • the wrong module was used
  • a required companion file is missing
  • the experiment is valid only with a non-default engine

What to do:

  1. inspect ERROR.OUT
  2. inspect WARNING.OUT
  3. confirm the intended module_code

Problem: observations are not found

Likely causes:

  • the observation files live beside the project file, not in the installed crop folder
  • file suffixes differ from what you expected
  • the requested experiment name does not match the actual observation filename stem

What to do:

  1. check the project directory
  2. check the installed crop folder
  3. inspect the experiment stem and crop code suffix

Problem: the variable you requested is missing

Likely causes:

  • that family uses a different output variable name
  • the variable is not written in that output file
  • the run succeeded but your request needs an alias

What to do:

  1. inspect the names returned by the simulation object
  2. check whether another family-specific variable is the right target
  3. extend alias handling if this is a legitimate recurring case

Problem: the model launches but the biology looks wrong

That is a different class of problem.

Now you should inspect:

  • weather linkage
  • soil parameters
  • planting density
  • fertilizer schedule
  • cultivar and ecotype choice
  • target variable definition

At that stage, the wrapper may be working perfectly and the scientific setup may still need revision.

Attribution and Disclaimer

Attribution

This fork builds on the original DSSAT wrapper work associated with the AgMIP Calibration Phase III effort.

The original wrapper header credits:

  • Jing Qi
  • Amir Souissi
  • Samuel Buis
  • Vakhtang Shelia

This fork extends that base with:

  • a registry-driven omniwrapper
  • support for additional DSSAT model families
  • alternate-engine launching
  • broader validation scripts
  • hostable public documentation

Why attribution is kept visible

This repository is not presented as a disconnected rewrite.

It is a continuation and extension of earlier wrapper work, so preserving that lineage is part of being technically honest and community-friendly.

License note

At the time of writing, the checked upstream wrapper snapshot used for this fork does not include an explicit LICENSE file.

Because of that, this documentation does not invent license terms on behalf of the upstream project. Users should check the current upstream repository state for any later license clarification before redistributing code more broadly.

The DSSAT model source and data repositories are separate from this wrapper repository and may have their own licensing terms.

Disclaimer

This work is shared in good faith for research and educational use, but it is provided as-is without warranty, and users are responsible for verifying suitability for their own data, workflows, and redistribution context.

pSIMS Explained for Crop Modellers

Who this is for: you understand crop models (DSSAT, STICS, APSIM) and know what a simulation needs — climate, soil, management. You are not a software developer. This page explains what pSIMS does in plain language, what files it needs, what it produces, and how to run it.


What Is pSIMS?

Normally, to run DSSAT you prepare .WTH weather files, SOIL.SOL soil profiles, and *.X experiment files by hand. For one or two sites that is manageable. For 1,000 sites across Ethiopia it becomes impossible.

pSIMS is an assembly line that does all of that automatically.

You give it:

  • Climate data for every site (one standard file format)
  • Soil data for every site (same format)
  • Your experiment settings (planting date, cultivar, fertilizer — once)
  • A config file saying which model to use

pSIMS then:

  1. Extracts climate and soil for each site from those files
  2. Converts them into whatever format the chosen model needs
  3. Runs the model for every site
  4. Pulls the outputs (yield, anthesis date, etc.) into one clean results file

The same climate and soil files work for DSSAT, STICS, or APSIM — you just change one line in the config to switch models.


The Assembly Line — Step by Step

When you run pSIMS, it executes these steps in order for each site:

1. Stage inputs     — copy climate and soil tiles to a working folder
2. Convert climate  — turn NetCDF climate tile → model weather file
                      (Psims2Wth for DSSAT, Psims2Met for APSIM, Psims2Stics for STICS)
3. Build experiment — merge your JSON experiment settings into the model input format
                      (Camp2Json → Jsons2Dssat / Jsons2Apsimx / Jsons2Stics)
4. Run model        — call DSSAT / APSIM / STICS binary
5. Collect output   — read model output → one standard NetCDF results file
6. Stage outputs    — move results to the outputs/ folder

Each step is a small Python class. If one step fails, pSIMS prints False for that step and stops — you can see exactly where the problem is.


Section 1 — Input Data

pSIMS needs three kinds of input.

1a. Climate data — the "tile" file

File: clim_0001_0001.tile.nc4

This is a NetCDF4 file (a scientific data format — think of it as a structured spreadsheet that holds daily time series for one grid cell). It contains:

VariableWhat it is
tmaxDaily maximum air temperature (°C)
tminDaily minimum air temperature (°C)
precipDaily precipitation (mm)
solarDaily solar radiation (MJ/m²/day)

One tile file = one grid cell (e.g. 0.5° × 0.5°) for the full climate period (e.g. 1980–2010).

You do not create these by hand. They come from processed climate products (CMIP6 GCMs, ERA5, CHIRPS, etc.) that are already in pSIMS tile format. For the Ethiopia ensemble, these are prepared by the climate pipeline (ws2 scripts).

1b. Soil data — the "tile" file

File: soil_0001_0001.tile.nc4

Same NetCDF4 format, but contains soil profile properties for the same grid cell:

VariableWhat it is
sltSilt fraction per layer (%)
clyClay fraction per layer (%)
sndSand fraction per layer (%)
socSoil organic carbon (%)
bdBulk density (g/cm³)
phSoil pH

Multiple soil layers are stored along a depth dimension. For the Ethiopia ensemble, these come from SoilGrids via the soil pipeline (ws1 scripts).

1c. Experiment settings — the campaign file

File: experiment.json (for APSIM) or exp_template.json (for DSSAT)

This is where you describe your agronomy — the same information you would enter in the DSSAT experiment file or APSIM manager script, but written in JSON format.

APSIM example (simple — 18 lines):

{
  "StartDate":   "1980-01-01",
  "EndDate":     "2010-12-31",
  "Crop":        "Maize",
  "PlantingDate":"15-Apr",
  "Cultivar":    "B_110",
  "SowingDensity": 7.5,
  "RowSpacing":   750,
  "SowingDepth":  30
}

DSSAT example (detailed — ~220 lines) includes:

  • Planting date, density, depth
  • Fertilizer applications (date, amount, type, placement)
  • Irrigation schedule
  • Harvest rule
  • Simulation control flags (water balance method, N balance on/off, etc.)

Crop modeller note: the JSON keys map 1-to-1 to the fields you already know from DSSAT X-files or APSIM manager. The values are the same numbers in the same units. The format is just different (JSON instead of fixed-width text).


Section 2 — The Config File (params)

File: params.dssat48.point.sample (or similar name)

This is a YAML file — essentially a list of key: value settings. It is the only file you normally edit when setting up a run.

Here are the most important settings, explained:

# Which model to use
model: dssat48

# Where your climate tiles are
weather: samples/dssat48_point_bundle/weather

# Where your soil tiles are
soils: samples/dssat48_point_bundle/soils

# Where your experiment JSON is
campaign: samples/dssat48_point_bundle/campaign

# Path to the model executable
executable: /path/to/DSCSM048.EXE

# Grid cell size in arc-minutes (30 = 0.5 degrees)
delta: "30,30"

# How many grid cells in this run
num_lats: 1
num_lons: 1

# What output variables to extract
variables: harwt,adat,mdat,lai_max

# Reference year for the climate tiles
ref_year: 1980

# Output file name
out_file: dssat48-point

That is it. You set the paths, pick your variables, and run.


Section 3 — Output Data

File: outputs/output_0001_0001.psims.nc

This is another NetCDF4 file. It contains one value per simulation year for each variable you requested. For example, if your climate spans 1980–2010 (30 years), you get 30 yield values, 30 anthesis dates, 30 maturity dates.

VariableWhat it meansUnits
harwtGrain yield at harvestkg/ha
adatAnthesis (silking) dateday of year
mdatMaturity dateday of year
lai_maxMaximum leaf area indexm²/m²

Opening the output

In Python:

import netCDF4
nc = netCDF4.Dataset('outputs/output_0001_0001.psims.nc')
yield_kgha = nc.variables['harwt'][:]   # array of 30 values
print(yield_kgha)
nc.close()

In R:

library(ncdf4)
nc <- nc_open('outputs/output_0001_0001.psims.nc')
yield <- ncvar_get(nc, 'harwt')
print(yield)
nc_close(nc)

CSV output (easier for Excel/R)

Add csv: true in your params file and pSIMS also writes a .csv next to the NetCDF. You can open that directly in Excel or read it with read.csv() in R.


Section 4 — The Run Command, Decoded

python pysims/pysims.py \
    --param   params/params.dssat48.point.sample \
    --campaign samples/dssat48_point_bundle/campaign \
    --tlatidx  0001 \
    --tlonidx  0001 \
    --latidx   1 \
    --lonidx   1
PieceWhat it means
python pysims/pysims.pyRun the pSIMS pipeline script
--param params/...Use this params file (your config)
--campaign samples/.../campaignUse this folder for experiment JSON files
--tlatidx 0001Which latitude tile to process (4-digit, zero-padded)
--tlonidx 0001Which longitude tile to process
--latidx 1Which row inside that tile (1-based)
--lonidx 1Which column inside that tile

For the sample data there is only one tile (0001) containing one point (1, 1). For a 1,000-site ensemble you would loop over all tile/point combinations — that is what the Slurm batch scripts handle automatically.


Section 5 — Can a Bachelor Intern Run This?

Honest answer: maybe, with a good setup guide and a mentor for the first session.

Here is where interns will succeed and where they will struggle:

Easy parts

  • Editing the params file — it is just key-value pairs
  • Reading outputs in R or Python — two lines of code
  • Running the command once everything is installed — it is one command
  • Understanding what each input means — it maps to familiar agronomy concepts

Hard parts

ChallengeWhy it is hard
InstallationPython virtual environment, model binaries, Java, paths — many steps, many ways to fail
Path errorsRelative vs absolute paths; running from the wrong directory causes cryptic errors
NetCDF formatNot familiar; requires a library; error messages are not beginner-friendly
Debugging FalseWhen a pipeline step fails with False, finding the log and interpreting the traceback requires Python familiarity
Tile structureUnderstanding 0001/clim_0001_0001.tile.nc4 naming convention is non-obvious
HPCSSH, Slurm, modules, environment variables — a second learning curve on top of pSIMS

Verdict: An intern can run pre-configured sample data successfully on their first day. Building a new experiment from scratch (new crop, new region, new climate dataset) requires 2–3 weeks of guided work. Running the full HPC ensemble independently requires several months of experience.


Section 6 — Simplifications We Recommend

The following changes would make pSIMS significantly more intern-friendly without breaking anything.

6a. A one-line run script per model

Instead of remembering the full command with 6 flags, create small shell scripts in the pSIMS root:

# run_dssat.sh
#!/bin/bash
cd "$(dirname "$0")"
python pysims/pysims.py \
    --param   params/params.dssat48.point.sample \
    --campaign samples/dssat48_point_bundle/campaign \
    --tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1

Then an intern just types bash run_dssat.sh. Done. Equivalent scripts: run_stics.sh, run_apsim.sh, run_all.sh.

6b. Simpler params file names

The current name params.dssat48.point.sample is not intuitive. Rename (or add symlinks) to something like:

Old nameSuggested name
params.dssat48.point.sampleparams_dssat_sample.yaml
params.stics10.point.sampleparams_stics_sample.yaml
params.apsimx.point.linux.sampleparams_apsim_linux_sample.yaml

6c. A "quick check" script

A one-page script that reads the output NetCDF and prints a readable summary:

# check_output.py
import sys, netCDF4, numpy as np

nc = netCDF4.Dataset(sys.argv[1])
print(f"\n{'Variable':<12} {'Mean':>10} {'Min':>10} {'Max':>10}")
print("-" * 45)
for var in nc.variables:
    if var in ('lat','lon','time'): continue
    data = nc.variables[var][:].flatten()
    data = data[~np.isnan(data)]
    if len(data):
        print(f"{var:<12} {data.mean():>10.1f} {data.min():>10.1f} {data.max():>10.1f}")
nc.close()

Usage: python check_output.py outputs/output_0001_0001.psims.nc

6d. Cleaner error messages

When a pipeline step returns False, pSIMS currently prints only:

0001/0001, ApsimX, run, ..., False

A small patch to pysims.py could print the last few lines of the model's stderr, which is what you actually need to diagnose the problem.

6e. A minimal DSSAT experiment template

The current exp_template.json is 220 lines covering every possible option. For a standard rainfed maize run most of those lines are noise. A "minimal template" with only the 10–15 fields that actually change between experiments would be much less intimidating as a starting point.


Summary

WhatFormatTool to open
Climate inputNetCDF4 tile (.tile.nc4)Python netCDF4, R ncdf4, Panoply
Soil inputNetCDF4 tile (.tile.nc4)same
Experiment settingsJSON (.json)any text editor
ConfigYAML (.yaml or .sample)any text editor
ResultsNetCDF4 (.psims.nc)Python, R, Panoply
Results (optional)CSV (.csv)Excel, R, Python

The only files you edit for a new run:

  1. experiment.json — your agronomy (planting date, cultivar, management)
  2. params_*.yaml — your paths, model choice, and output variables

Everything else is infrastructure that, once set up, you do not touch.

pSIMS 2.0 — Windows Setup and Run Guide

Status as of 2026-05-21: fully operational. All three models (DSSAT 4.8, STICS 10, APSIM-X) run successfully on Windows.


Quick reference — what is where

ThingPath
pSIMS working directoryC:\Users\<user>\Downloads\psims-release-2.0\
Python (venv)psims-release-2.0\.venv\Scripts\python.exe
DSSAT binarysamples\dssat48_point_bundle\refdata\DSCSM048.EXE (relative to pSIMS root)
STICS launcherC:\Users\<user>\Downloads\JavaSTICS-10.4.1-STICS-10.4.1\...\JavaSticsCmd.exe
APSIM-X binaryC:\Program Files\APSIM<version>\bin\Models.exe
Sample datapsims-release-2.0\samples\
Params (Windows)psims-release-2.0\params.*.sample (in root)
Outputspsims-release-2.0\outputs\

Important: always run from the psims-release-2.0\ directory. The params files use relative paths like samples/... and pysims/... that only resolve correctly when you start there.


Prerequisites

These must already be installed on your machine. If any are missing, install them before trying to run a model.

1. Python virtual environment

The venv lives at psims-release-2.0\.venv\. It is created once and reused for all runs.

To create it (first time only):

cd C:\Users\<user>\Downloads\psims-release-2.0
python -m venv .venv
.venv\Scripts\python.exe -m pip install -r requirements.txt

Verify:

.venv\Scripts\python.exe -c "import netCDF4, scipy, numpy; print('OK')"

2. DSSAT 4.8

The DSCSM048.EXE binary is bundled inside the repo at samples\dssat48_point_bundle\refdata\DSCSM048.EXE. No separate installation is needed for the sample test.

For a full installation, DSSAT 4.8 can be installed from the official DSSAT website. pSIMS will use the binary path set in the params file.

3. STICS 10

Install from the JavaStics package. After installation, note the path to JavaSticsCmd.exe — it will be something like:

C:\Users\<user>\Downloads\JavaSTICS-10.4.1-STICS-10.4.1\
  JavaSTICS-10.4.1-STICS-10.4.1\JavaSticsCmd.exe

The params file (params.stics.point.sample) needs both:

  • executable — full path to JavaSticsCmd.exe
  • workdir — directory containing JavaSticsCmd.exe (JavaStics must be run from its own folder)

4. APSIM-X (Next Generation)

Install APSIM Next Generation from the APSIM website. After installation, the binary will be at:

C:\Program Files\APSIM<version>\bin\Models.exe

Check params.apsimx.point.sample for the exact version path that was verified — update it if your installed version differs.


Adjusting paths in params files

Before running, open each params file and confirm the executable paths match your local installation.

params.dssat48.point.sample — no changes needed for the sample test. DSSAT runs from the refdata directory using the bundled EXE.

params.stics.point.sample — set these two lines:

executable:  "C:\\Users\\<user>\\Downloads\\JavaSTICS-10.4.1-STICS-10.4.1\\JavaSTICS-10.4.1-STICS-10.4.1\\JavaSticsCmd.exe"
workdir:     "C:\\Users\\<user>\\Downloads\\JavaSTICS-10.4.1-STICS-10.4.1\\JavaSTICS-10.4.1-STICS-10.4.1"

params.apsimx.point.sample — set:

executable:  "C:\\Program Files\\APSIM<version>\\bin\\Models.exe"

Use double backslashes (\\) inside YAML quoted strings.


Running a model — template command

Open PowerShell or Git Bash, cd to the pSIMS root, then:

cd C:\Users\<user>\Downloads\psims-release-2.0

.venv\Scripts\python.exe pysims\pysims.py `
    --param  params.<model>.point.sample `
    --campaign samples\<model>_point_bundle\campaign `
    --tlatidx 0001 --tlonidx 0001 `
    --latidx  1    --lonidx  1

(In Git Bash, use forward slashes and \ line continuation instead of the PowerShell backtick.)


Running each model

DSSAT 4.8

cd C:\Users\<user>\Downloads\psims-release-2.0

.venv\Scripts\python.exe pysims\pysims.py `
    --param params.dssat48.point.sample `
    --campaign samples\dssat48_point_bundle\campaign `
    --tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1

Expected output — every line ends with True:

0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run,      ..., True
0001/0001, Camp2Json,           run,      ..., True
0001/0001, Psims2Wth,           run,      ..., True
0001/0001, Jsons2Dssat,         run,      ..., True
0001/0001, Dssat48,             run,      ..., True
0001/0001, Out2Psims,           run,      ..., True
0001/0001, StageOutputsSharedFS,run,      ..., True

Known harmless warning (safe to ignore):

UserWarning: Real number is too long

STICS 10

cd C:\Users\<user>\Downloads\psims-release-2.0

.venv\Scripts\python.exe pysims\pysims.py `
    --param params.stics.point.sample `
    --campaign samples\stics_point_bundle\campaign `
    --tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1

Expected output:

0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run,      ..., True
0001/0001, Psims2Stics,         run,      ..., True
0001/0001, Jsons2Stics,         run,      ..., True
0001/0001, Stics,               run,      ..., True
0001/0001, Out2Psims,           run,      ..., True
0001/0001, StageOutputsSharedFS,run,      ..., True

How STICS runs on Windows:
pSIMS calls JavaSticsCmd.exe --run <workspace_abs_path> maize from the JavaStics install directory (workdir). JavaSticsCmd.exe bridges the XML workspace format to the underlying stics_modulo binary.


APSIM-X (Next Generation)

cd C:\Users\<user>\Downloads\psims-release-2.0

.venv\Scripts\python.exe pysims\pysims.py `
    --param params.apsimx.point.sample `
    --campaign samples\apsimx_point_bundle\campaign `
    --tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1

Expected output:

0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run,      ..., True
0001/0001, Psims2Met,           run,      ..., True
0001/0001, Jsons2Apsimx,        run,      ..., True
0001/0001, ApsimX,              run,      ..., True
0001/0001, Out2Psims,           run,      ..., True
0001/0001, StageOutputsSharedFS,run,      ..., True

Checking output values

After any run the output is written to psims-release-2.0\outputs\output_0001_0001.psims.nc.

cd C:\Users\<user>\Downloads\psims-release-2.0

.venv\Scripts\python.exe -c "
import netCDF4
nc = netCDF4.Dataset('outputs/output_0001_0001.psims.nc')
for v in nc.variables:
    if v not in ('lon', 'lat', 'time', 'scen'):
        val = float(nc.variables[v][:].flatten()[0])
        print(f'{v} = {val} {nc.variables[v].units}')
nc.close()
"

Verified reference outputs (sample data, 1982 Gainesville FL)

ModelVariableValueUnits
DSSAT 4.8HWAM (grain yield)2283kg/ha
DSSAT 4.8PDAT (planting DOY)57DOY
DSSAT 4.8MDAT (maturity)128Days
STICS 10masec(n) (biomass)1.1875t/ha
STICS 10mafruit (grain)0.8032t/ha
APSIM-XYield2143.31kg/ha

These match the HPC Linux runs exactly.


Troubleshooting

Any step shows False

The failed step will print a Python traceback above the result line.

SymptomLikely causeFix
FileNotFoundError: DSCSM048.EXEWrong executable pathCheck executable in params file
FileNotFoundError: JavaSticsCmd.exeWrong STICS pathUpdate executable and workdir in params
FileNotFoundError: Models.exeWrong APSIM path or versionCheck APSIM installation directory
ModuleNotFoundErrorWrong Python usedMake sure you use .venv\Scripts\python.exe
FileNotFoundError in StageInputsSharedFSRunning from wrong directorycd to psims-release-2.0\ first

"Real number is too long" warning

This comes from DSSAT's CUL/ECO file reader and is harmless. The run completes correctly and values are read correctly.

APSIM takes longer than expected

APSIM-X takes ~10–12 seconds on the first run (JIT warm-up). Subsequent runs in the same session are faster. This is normal.

Output values don't match reference

Possible causes:

  1. Different APSIM version — APSIM results can vary slightly between versions. Check your installed version against what the params file references.
  2. Different STICS version — same applies.
  3. The run completed but with a different input — check that the campaign and weather tile are the same sample files.

Architecture overview

params file  ─────────────────────────────────────────────┐
                                                           ▼
pysims.py → StageInputsSharedFS  (copy tiles to workdir)
          → [tappcmp]            (campaign → experiment.json)
          → tappwth              (climate tile → model weather file)
          → tappinp              (soil tile + experiment → model input)
          → model binary         (DSSAT / STICS / APSIM)
          → postprocess          (model output → psims.nc)
          → StageOutputsSharedFS (copy psims.nc to outputs/)

Each step is a Python class. The pipeline logs each step with its runtime and a True/False pass flag.


What is different between Windows and Linux

AspectWindowsLinux HPC
Python.venv\Scripts\python.exe~/.conda/envs/mamba_env/envs/psims/bin/python
Working directorypsims-release-2.0\~/psims_2.0/
DSSAT binaryDSCSM048.EXE (Windows PE)dscsm048 (Linux ELF)
STICS invocationJavaSticsCmd.exe called directlyjava -jar JavaSticsCmd.exe (bundled JRE 17)
APSIM invocationModels.exe called directlyShell wrapper → apptainer exec datamill.sif Models
Params locationroot of working directoryparams/ subdirectory
Path separatorsbackslash (or forward slash in YAML)forward slash only

The pipeline code, sample data, and output format are identical on both platforms.

pSIMS 2.0 — Linux HPC Setup and Run Guide

Status as of 2026-05-21: fully operational. All three models (DSSAT 4.8, STICS 10, APSIM-X) run successfully on the HPC.


Quick reference — what is where

ThingPath on HPC
pSIMS root~/psims_2.0/
Python environment~/.conda/envs/mamba_env/envs/psims/bin/python
DSSAT binary~/psims_2.0/tools/dssat48/dscsm048
STICS JAR~/psims_2.0/tools/stics10/JavaSticsCmd.exe
STICS Java runtime~/psims_2.0/tools/jdk-17.0.19+10-jre/bin/java
APSIM-X wrapper~/psims_2.0/tools/apsimx/Models
APSIM-X container~/crop-ensemble/containers/datamill.sif
Sample data~/psims_2.0/samples/
Params (Linux)~/psims_2.0/params/
Outputs~/psims_2.0/outputs/

Replace ~ with /home/<user> if you need absolute paths.


How to connect

ssh <user>@<hpc-host>

If you have an SSH alias configured (e.g. hpc in ~/.ssh/config), you can just run ssh hpc. See your SSH config for the actual hostname.


Running a model — template command

All three models follow the same pattern:

cd ~/psims_2.0

~/.conda/envs/mamba_env/envs/psims/bin/python pysims/pysims.py \
    --param  params/<params-file>     \
    --campaign samples/<bundle>/campaign \
    --tlatidx 0001 --tlonidx 0001    \
    --latidx  1    --lonidx  1

The --tlatidx/--tlonidx flags choose which tile to read from disk. --latidx/--lonidx choose the point within that tile. For the single-point sample data both are always 0001 / 1.


Running each model

DSSAT 4.8

cd ~/psims_2.0

~/.conda/envs/mamba_env/envs/psims/bin/python pysims/pysims.py \
    --param params/params.dssat48.point.linux.sample \
    --campaign samples/dssat48_point_bundle/campaign \
    --tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1

Expected output — every line ends with True:

0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run,      ..., True
0001/0001, Camp2Json,           run,      ..., True
0001/0001, Psims2Wth,           run,      ..., True
0001/0001, Jsons2Dssat,         run,      ..., True
0001/0001, Dssat48,             run,      ..., True
0001/0001, Out2Psims,           run,      ..., True
0001/0001, StageOutputsSharedFS,run,      ..., True

Known harmless warning (safe to ignore):

UserWarning: Real number is too long

STICS 10

cd ~/psims_2.0

~/.conda/envs/mamba_env/envs/psims/bin/python pysims/pysims.py \
    --param params/params.stics.point.linux.sample \
    --campaign samples/stics_point_bundle/campaign \
    --tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1

Expected output:

0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run,      ..., True
0001/0001, Psims2Stics,         run,      ..., True
0001/0001, Jsons2Stics,         run,      ..., True
0001/0001, Stics,               run,      ..., True
0001/0001, Out2Psims,           run,      ..., True
0001/0001, StageOutputsSharedFS,run,      ..., True

How STICS runs on Linux:
pSIMS invokes java -jar JavaSticsCmd.exe --run <workspace> maize using the bundled JRE 17 (tools/jdk-17.0.19+10-jre/). The java_executable param in the params file activates this mode. No system Java or JavaStics installation is needed.


APSIM-X (Next Generation)

cd ~/psims_2.0

~/.conda/envs/mamba_env/envs/psims/bin/python pysims/pysims.py \
    --param params/params.apsimx.point.linux.sample \
    --campaign samples/apsimx_point_bundle/campaign \
    --tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1

Expected output:

0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run,      ..., True
0001/0001, Psims2Met,           run,      ..., True
0001/0001, Jsons2Apsimx,        run,      ..., True
0001/0001, ApsimX,              run,      ..., True
0001/0001, Out2Psims,           run,      ..., True
0001/0001, StageOutputsSharedFS,run,      ..., True

How APSIM runs on Linux:
The file tools/apsimx/Models is a shell script that calls:

apptainer exec --bind "$(pwd):$(pwd)" \
  ~/crop-ensemble/containers/datamill.sif \
  Models "$@"

The Models binary lives inside the datamill.sif Apptainer container (AgriScale v1.2.0). The container auto-binds $HOME, so all paths under ~/psims_2.0/ are accessible to APSIM inside the container.


Checking output values

After any run, the output is written to ~/psims_2.0/outputs/output_0001_0001.psims.nc.

cd ~/psims_2.0
~/.conda/envs/mamba_env/envs/psims/bin/python - << 'EOF'
import netCDF4
nc = netCDF4.Dataset('outputs/output_0001_0001.psims.nc')
for v in nc.variables:
    if v not in ('lon', 'lat', 'time', 'scen'):
        val = float(nc.variables[v][:].flatten()[0])
        print(f'  {v} = {val} {nc.variables[v].units}')
nc.close()
EOF

Verified reference outputs (sample data, 1982 Gainesville FL)

ModelVariableValueUnits
DSSAT 4.8HWAM (grain yield)2283kg/ha
DSSAT 4.8PDAT (planting DOY)57DOY
DSSAT 4.8MDAT (maturity)128Days
STICS 10masec(n) (biomass)1.1875t/ha
STICS 10mafruit (grain)0.8032t/ha
APSIM-XYield2143.31kg/ha

These values match exactly with the Windows runs — the pipeline is deterministic across platforms for the sample data.


Troubleshooting

Any step shows False

Check the full output (remove 2>/dev/null from your command). The failed step will print a Python traceback. Common causes:

SymptomLikely causeFix
FileNotFoundError in StageInputsSharedFSMissing sample data directoryCheck samples/<model>_point_bundle/ structure
ApsimX step failsApptainer can't find the containerVerify ~/crop-ensemble/containers/datamill.sif exists
Stics step failsJava not foundCheck tools/jdk-17.0.19+10-jre/bin/java exists
Dssat48 step failsBinary not executableRun chmod +x tools/dssat48/dscsm048

Output file not created

If the Out2Psims or StageOutputsSharedFS step shows False, the output file will not exist. Fix the root cause and re-run. You can safely delete outputs/output_0001_0001.psims.nc between runs.

Conda environment not found

The activate step is not needed when using the full path:

~/.conda/envs/mamba_env/envs/psims/bin/python

If that path does not exist, check with:

conda env list

Architecture overview

params file  ─────────────────────────────────────────────┐
                                                           ▼
pysims.py → StageInputsSharedFS  (copy tiles to workdir)
          → [tappcmp]            (campaign → experiment.json)
          → tappwth              (climate tile → model weather file)
          → tappinp              (soil tile + experiment → model input)
          → model binary         (DSSAT / STICS / APSIM)
          → postprocess          (model output → psims.nc)
          → StageOutputsSharedFS (copy psims.nc to outputs/)

Each step is a Python class. The pipeline logs each step with its runtime and a True/False pass flag.


Sample data layout

samples/
  dssat48_point_bundle/
    campaign/   Campaign.nc4 + exp_template.json
    weather/    0001/clim_0001_0001.tile.nc4
    soils/      0001/soil_0001_0001.tile.nc4
    refdata/    DSCSM048.EXE, CUL/ECO/SPE files, ...

  stics_point_bundle/
    campaign/   experiment.json
    weather/    0001/clim_0001_0001.tile.nc4
    soils/      0001/soil_0001_0001.tile.nc4
    refdata/    workspace_template/ (STICS XML files)

  apsimx_point_bundle/
    campaign/   experiment.json + Maize.apsimx
    weather/ →  symlink to dssat48_point_bundle/weather/
    soils/   →  symlink to dssat48_point_bundle/soils/
    refdata/    Maize.apsimx

The APSIM weather and soils directories are symlinks to the DSSAT bundle — the pSIMS NetCDF tile format is the same for all models.


Re-deploying after code changes

When you push pSIMS code changes from Windows, sync them to the HPC with:

# From your local machine (Git Bash or WSL)
rsync -av --exclude='*.pyc' --exclude='__pycache__' \
  /path/to/psims-release-2.0/pysims/ \
  <user>@<hpc-host>:~/psims_2.0/pysims/

Sample data and params files only need to be re-synced if they change.