Crop Modelling Guide
This book is the learning and design core of the cropmodelling repository.
It is not only about one wrapper, one crop, or one model family.
It is meant to grow into a broader crop-modeling guide that can support:
- beginners learning the field from zero
- interns learning to move from raw notes to simulation
- researchers building reproducible workflows
- contributors working across multiple crop-model ecosystems
What this repository is trying to become
The long-term aim is a crop-modeling environment that can eventually support:
- DSSAT workflows
- APSIM workflows
- STICS workflows
- multiple wrappers and runners
- ensemble modeling
- model comparison
- crop-model-agnostic orchestration
That means the repository should not be framed as "the DSSAT-wrapper repo."
Instead, DSSAT-wrapper is one concrete lesson and implementation pathway inside a larger crop-modeling system.
What this book tries to do
This book combines several layers:
- beginner agronomy and crop-modeling concepts
- DSSAT-oriented file and workflow lessons
- a hemp case-study pathway
- practical onboarding exercises
- a broader ecosystem vision for future models and runners
Where DSSAT-wrapper fits
The DSSAT-wrapper work is already hosted separately:
Inside this book, DSSAT-wrapper is treated as:
- one practical implementation example
- one lesson path for running a real model family
- one stepping stone toward broader crop-modeling infrastructure
Suggested reading paths
If you are completely new
Start here:
- Agronomy Basics for Modelers
- Concepts for Beginners
- What Is Crop Modeling?
- How DSSAT Thinks Day by Day
- Growth Stages, GDD, and Photoperiod
- DSSAT Weather Files Explained
- DSSAT Soil Files Explained
- Worked Case Study: Weather, Soil, and Management Construction
- End-to-End Intern Exercise: From Raw Notes to a Finished Run
If you want the current DSSAT lesson path
Start here:
- DSSAT in This Ecosystem
- DSSAT File Anatomy
- Installed DSSAT Setup
- Using GitHub-Sourced Example Data
- Hopf Paper Case Study
If you care about the bigger platform direction
Start here:
What has changed in this reframed version
This guide is no longer presented as a wrapper manual with extra lessons around it.
It is now presented as a crop-modeling guide with:
- a DSSAT section
- a hemp example pathway
- a training path for interns
- a future-facing vision for multiple models and multiple runners
That change is intentional.
Model Ecosystem Vision
This chapter explains the strategic direction of the repository.
The core idea
This repository is not meant to stop at one wrapper around one model.
The broader goal is to support a crop-modeling ecosystem that can work across:
- model families
- crops
- calibration workflows
- data-preparation pipelines
- analysis and visualization layers
Why this matters
In practice, crop-model work often becomes fragmented:
- one model has one workflow
- another model has a completely different runner
- data preparation is repeated by hand
- comparisons across models become difficult
That fragmentation makes teaching, reuse, and ensemble work harder than it needs to be.
The direction we want
Over time, this repository can grow toward a layered system:
- learning materials
- input-preparation workflows
- model-specific wrappers and runners
- shared evaluation and visualization tools
- ensemble and comparison workflows
- a crop-model-agnostic orchestration layer
What "crop-model agnostic" means here
It does not mean every model is identical.
It means we aim to build shared concepts and workflow layers for things like:
- weather preparation
- soil preparation
- management encoding
- observation handling
- metric calculation
- plotting
- reproducibility checks
while still respecting that each model family has its own structure.
Why wrappers still matter
Model-agnostic design does not eliminate model-specific wrappers.
Wrappers remain valuable because they provide clean entry points into each system.
The point is to avoid stopping there.
We want to move from:
- one wrapper per model
to:
- one broader system that can work with several wrappers and models coherently
Near-term examples
Today, the strongest implemented pathway is DSSAT.
Tomorrow, this repository may also include:
- APSIM-oriented lessons and runner logic
- STICS-oriented lessons and runner logic
- shared case studies expressed across multiple models
Ensemble modeling
Ensemble modeling matters because no single crop model captures every process or every uncertainty perfectly.
A stronger long-term system should eventually support:
- running multiple models on the same scenario
- comparing their outputs systematically
- understanding agreement and disagreement
- building ensemble summaries
That is one of the motivations for keeping the repo broader than one wrapper.
Agronomy Basics for Modelers
This chapter comes before file formats, wrappers, and calibration on purpose.
If you do not have a basic agronomy picture in your head, crop-model outputs can feel like disconnected columns instead of a field story.
The goal here is simple:
understand the agronomic logic that most crop models are trying to represent.
What agronomy is
Agronomy is the practical science of crop production and field management.
It asks questions like:
- what was planted?
- where was it planted?
- when was it planted?
- what soil was it planted into?
- what weather did it experience?
- how was water and fertility managed?
- what stresses reduced growth?
- what was finally harvested?
For a modeler, agronomy is not background decoration. It is the structure behind the inputs.
The basic agronomy triangle
At the simplest level, field outcomes come from the interaction of three big things:
CropThe species, cultivar, and biological traits being grown.EnvironmentWeather, soil, landscape position, and seasonal conditions.ManagementWhat people did to the field: planting, irrigation, fertilization, harvest, and related decisions.
Most crop models are trying to simulate what happens when those three meet through time.
A crop is a living population, not a single plant
Beginners sometimes picture one ideal plant.
Agronomy usually works at field scale, where we care about a crop stand or population.
That means questions such as:
- how many plants emerged?
- how evenly are they spaced?
- how quickly did the canopy close?
- how deep can roots explore?
- how much competition exists among plants?
This matters because yield is usually the result of population performance, not one perfect individual.
Establishment comes first
Before a crop can produce biomass or yield, it has to establish successfully.
Establishment includes:
- seed quality or planting material quality
- planting depth
- planting date
- soil temperature
- soil moisture near planting
- emergence success
- early survival
If establishment is poor, everything later in the season starts from a weaker base.
For modelers, this means early assumptions matter more than they may appear.
The crop needs resources every day
From an agronomy perspective, daily growth depends on access to key resources:
- light
- water
- nutrients
- temperature conditions within a usable range
A crop model turns those ideas into equations, but the agronomic story stays the same:
- leaves intercept light
- roots access water and nutrients
- temperature regulates development speed
- stress reduces what the crop could otherwise do
Soil is more than dirt
In crop modeling, soil is one of the most important agronomic controls.
Soil affects:
- how much water can be stored
- how quickly water drains
- how deep roots can grow
- how much nitrogen may become available
- whether roots encounter physical restrictions
Two fields with the same weather and cultivar can behave very differently if one has deep, well-structured soil and the other has shallow or restrictive soil.
That is why soil files are never just administrative details.
Weather drives opportunity and stress
Agronomically, weather does two things at once:
- it creates growth opportunity
- it creates stress risk
Warmth and radiation can support faster growth. Too little rain, too much heat, or cold conditions can reduce it.
That is why crop models care so much about daily weather rather than seasonal averages.
The crop experiences the season one day at a time.
Management changes the season the crop experiences
Management is not separate from the environment in practice. It changes the environment that the crop actually feels.
Examples:
- irrigation changes water availability
- fertilizer changes nutrient supply
- planting date changes the daylength and temperature pattern the crop sees
- plant density changes competition and canopy structure
- harvest timing changes what outcome is counted as success
This is one reason management files are so influential in crop modeling.
Growth is not the same as development
This distinction helps beginners a lot.
Growth usually refers to increase in size or biomass.
Development usually refers to progress through life stages such as:
- emergence
- vegetative growth
- flowering
- grain fill or reproductive filling
- maturity
A crop can keep developing even when growth is stressed. It can also accumulate biomass rapidly during some stages and slowly during others.
Good agronomy thinking keeps both ideas in view.
Yield is built from components
Agronomists do not usually think of yield as magic that appears at harvest.
Yield is built gradually from intermediate processes and components such as:
- plant population
- tiller or branch number in some crops
- canopy size
- flowering success
- seed or grain number
- individual seed or grain weight
- harvest index or partitioning pattern
If yield is wrong in a model, the most useful question is often:
which component went wrong first?
Stress is central to field reality
Real fields are almost never operating at perfect potential.
Common agronomic stresses include:
- drought
- nutrient deficiency
- heat stress
- cold stress
- waterlogging
- salinity
- pests and diseases
- weed competition
Not every model represents all of these equally well. But agronomically, they matter because they explain why actual fields often underperform relative to ideal conditions.
The same species can behave very differently
One of the easiest mistakes is to talk about a crop species as if it had one fixed behavior.
In reality, cultivars can differ in:
- maturity timing
- photoperiod sensitivity
- biomass potential
- partitioning
- height
- stress tolerance
That is why genotype information matters so much in model-based work.
Agronomic data are usually imperfect
Field agronomy is messy.
You may encounter:
- missing weather days
- incomplete fertilizer records
- uncertain planting depth
- borrowed soil profiles
- observations collected on different dates across variables
- trials with uneven stands or edge effects
A strong modeler does not hide that mess. They document assumptions and keep track of uncertainty.
What a modeler should ask about any field experiment
Before trusting a dataset, start with questions like:
- What crop and cultivar were grown?
- What was the target production goal: grain, fiber, seed, biomass?
- When was the crop planted and harvested?
- What soil was used, and how was it described?
- What weather source represents the site?
- What irrigation and fertilizer were applied, and when?
- What observations were actually measured rather than inferred?
- What obvious stress or management events might have shaped the season?
Those questions usually matter more than jumping straight to parameters.
Why this chapter belongs before the modeling details
Later chapters will talk about:
- weather files
- soil profiles
- management sections
- phenology variables
- genotype coefficients
Those topics make much more sense once you see the agronomic story underneath them.
Crop models are not only software objects. They are field logic translated into structured inputs and equations.
Beginner takeaway
If you remember only one thing, let it be this:
a crop model is trying to simulate how a crop population responds to environment and management through time.
That is an agronomy problem before it becomes a file-format problem.
Concepts for Beginners
This chapter is for readers who are new to crop modeling, DSSAT, or both.
The goal is not to make you memorize jargon. The goal is to give you a working mental model so that the rest of the book feels natural instead of cryptic.
What a crop model is
A crop model is a mathematical description of how a crop grows through time under a given environment and management system.
In plain language, a crop model tries to answer:
- when does the crop emerge?
- when does it flower?
- how much leaf, stem, root, and grain biomass does it produce?
- how do weather, soil, water, planting date, cultivar, and fertilizer change that outcome?
The model does not "see" the field directly. It only sees the numbers we give it through input files and parameters.
What DSSAT is
DSSAT stands for Decision Support System for Agrotechnology Transfer.
At a practical level, DSSAT is:
- a collection of crop models
- a file format ecosystem
- an executable that reads experiment files and writes outputs
That last point matters. DSSAT is not just one crop model. It is a family of models that share a common workflow but differ in their crop-specific logic.
Why wrappers exist
If DSSAT already runs, why build an R wrapper?
Because DSSAT by itself is not optimized for:
- large batch experimentation from code
- calibration loops
- sensitivity analysis
- automated validation
- reproducible research workflows
A wrapper lets us treat DSSAT as a callable engine inside a broader scientific workflow.
The basic logic of a DSSAT run
Every simulation needs four big ingredients:
WeatherDaily minimum and maximum temperature, radiation, rainfall, and related variables.SoilLayer-by-layer information about texture, water holding, bulk density, and initial conditions.ManagementPlanting date, planting density, irrigation, fertilizer, harvest rules, and other field operations.GeneticsCultivar, ecotype, and species coefficients that tell the model how this crop behaves.
The experiment file ties those pieces together.
Why the same crop can still have multiple models
This is one of the most important beginner ideas in this repo.
In DSSAT, "crop" and "model family" are related but not identical.
Examples:
- wheat can be run through
CERESorNWHEAT - sugarcane has multiple engines such as
CANEGRO,CASUPRO, andSAMUCA - cassava can involve
CSYCAorCSCAS
That is exactly why the DSSAT_omniwrapper() exists. A single hard-coded
wrapper is not enough if we want to work across the broader DSSAT ecosystem.
What calibration means
Calibration means adjusting uncertain model parameters so the model reproduces observed field behavior more closely.
Typical targets include:
- flowering date
- maturity date
- aboveground biomass
- stem biomass
- grain yield
- leaf area
Calibration is not guessing until the curve looks nice. It should be guided by:
- real observations
- clearly defined objective functions
- a biologically sensible order of parameter adjustment
A good beginner workflow
If you are starting from zero, this is the safest order:
- Make the model run without errors.
- Confirm that the correct crop family and module are being used.
- Confirm that the experiment file points to the intended weather, soil, and management setup.
- Confirm that observations are being read correctly.
- Compare simulated and observed time series.
- Only then start changing parameters.
That order sounds slow, but it prevents most confusing failures later.
What success looks like
A successful workflow does not mean "the model matches perfectly."
It means:
- the run is reproducible
- the files and assumptions are transparent
- the comparison with observations is explicit
- improvements can be explained rather than guessed
That is the standard this book and this repo aim for.
What Is Crop Modeling?
This chapter is written for a reader who may be strong in coding but new to agronomy, or strong in agronomy but new to modeling.
The goal is simple:
understand what a crop model is trying to represent before worrying about file formats, wrapper functions, or calibration metrics.
A field is a process, not just an outcome
When a farmer or researcher looks at a field, they usually care about outcomes:
- emergence
- flowering
- yield
- fiber production
- grain production
- biomass
A crop model tries to represent the processes that create those outcomes.
That means the model asks questions like:
- How warm was each day?
- How long was the day?
- How much water could roots access?
- How much nitrogen was available?
- How fast could leaves expand?
- When did the plant switch from vegetative growth to reproductive growth?
The model does not "know" the plant the way a human observer does. It knows only the processes and parameters encoded in its equations and inputs.
What a crop model does each day
Most process-based crop models operate on a daily time step.
For each simulated day, the model takes inputs such as:
- weather
- soil water and soil nitrogen status
- crop development stage
- management events such as irrigation or fertilizer
Then it updates internal state variables such as:
- days since planting
- thermal time or photothermal time
- leaf area
- root growth
- dry matter production
- biomass partitioning to leaves, stems, roots, seed, or storage organs
The next day starts from the updated state of the previous day.
This is why one bad input or one poor parameter choice can affect the whole season instead of only one date.
Why process-based models are useful
Process-based models are useful because they let us reason beyond a single observed season.
If the model captures the main processes reasonably well, we can ask:
- What if planting date changes?
- What if rainfall changes?
- What if the crop is grown at a different latitude?
- What if soil depth is shallower?
- What if cultivar photoperiod sensitivity is different?
That is why crop models are used in:
- agronomy
- breeding
- climate adaptation studies
- irrigation and fertilizer analysis
- systems research
What a model is not
A crop model is not:
- a perfect copy of reality
- a replacement for field data
- an excuse to ignore measurement quality
- automatically transferable from one crop or location to another
A model is a disciplined simplification.
The useful question is not "Is the model true?"
The useful questions are:
- Is it transparent?
- Is it reproducible?
- Is it good enough for the question being asked?
- Do the errors make biological sense?
The three big ingredients of crop-model quality
The quality of a simulation usually depends on three broad things:
- input quality
- parameter quality
- model structure
Input quality
If weather, soil, or management inputs are wrong, the model can fail even if the equations are sensible.
Parameter quality
If cultivar or ecotype coefficients are unrealistic, the crop may flower too early, accumulate too little biomass, or partition biomass incorrectly.
Model structure
Even with good data and reasonable parameters, a model can still struggle if its underlying assumptions are not suitable for the crop or environment.
This matters a lot in new-crop work such as hemp adaptation.
Why beginners often feel lost
Beginners often see file names first and biology second.
They encounter:
.WTH.SOL.CUL.ECOPlantGro.OUTGSTDCWAD
and it feels like a codebase without a story.
The story is:
- weather drives daily opportunity and stress
- soil controls what roots can access
- management defines what the farmer did
- genetics defines how this crop behaves
- the model combines them to simulate growth through time
Once that story is clear, the file system starts to make sense.
Where DSSAT fits in
DSSAT is a crop modeling platform that provides:
- crop-specific model engines
- standard file structures
- a simulation executable
- output conventions
This repository is not DSSAT itself.
This repository adds a wrapper around DSSAT so that the model can be used more cleanly from R for:
- reproducible runs
- validation
- calibration
- paper reproduction
- multi-family experimentation
How to read the rest of this book
If this chapter is your starting point, the safest next order is:
- How DSSAT Thinks Day by Day
- Hemp Biology for Modelers
- Growth Stages, GDD, and Photoperiod
- DSSAT File Anatomy
That path gives you the biology and process logic before you dive into files and code.
How DSSAT Thinks Day by Day
This chapter gives a mental model of a daily DSSAT simulation.
If you understand this daily loop, many confusing terms become easier:
- thermal time
- phenology
- biomass accumulation
- partitioning
- stress
- stage transitions
The daily loop
At a high level, DSSAT repeats a sequence like this:
- read today's weather
- update development progress
- compute growth potential
- reduce that potential if water, nitrogen, or temperature are limiting
- partition new biomass among organs
- update soil and plant states
- move to the next day
That sounds simple, but each step contains real biology and many assumptions.
Step 1: read today's weather
The model uses daily weather variables such as:
- solar radiation
- minimum temperature
- maximum temperature
- rainfall
Some models also depend strongly on:
- daylength
- vapor pressure or humidity-related effects
- potential evapotranspiration
These values set the opportunity for photosynthesis, development, and water balance on that day.
Step 2: update development progress
The model keeps track of where the crop is in its life cycle.
Examples of broad phases are:
- planting to emergence
- emergence to vegetative growth
- vegetative growth to flowering
- flowering to seed fill or maturation
The model uses rules such as:
- accumulated heat
- daylength sensitivity
- photothermal response
- stress effects
to decide how quickly the crop moves through these phases.
This is called phenology.
Step 3: compute growth potential
Once the model knows the crop's stage and environment for that day, it estimates how much dry matter the crop could produce under those conditions.
This is influenced by:
- intercepted radiation
- radiation use efficiency
- canopy size or leaf area
- temperature
Potential growth is what the crop could do before water, nitrogen, or other constraints reduce it.
Step 4: apply stress reductions
Real crops rarely grow at full potential every day.
The model therefore applies reductions based on stresses such as:
- too little water
- too much or too little temperature
- too little nitrogen
- other family-specific limitations
This is one reason the same cultivar can behave very differently in different soils or seasons.
Step 5: partition new biomass
New growth is not sent to one single pool.
The model decides how much of today's new biomass goes to:
- leaves
- stems
- roots
- grain
- flowers
- storage organs
The proportions depend on crop type and development stage.
This is called partitioning.
For a fiber-oriented hemp workflow, stem partitioning matters a lot. For a seed-oriented workflow, reproductive partitioning matters more.
Step 6: update soil and plant states
The model then updates internal state variables such as:
- total biomass
- leaf area
- root depth
- soil water
- soil nitrogen
- stage counters
These updated states are what the next simulated day starts from.
That is why calibration errors can accumulate through time. A mismatch in early growth may still matter at flowering or harvest.
Why outputs look the way they do
When DSSAT writes a time-series file such as PlantGro.OUT, it is recording
parts of this daily state.
Variables often represent:
- a stage or stage counter
- an organ biomass pool
- a morphological trait
- a stress-related progression value
If you treat outputs as random column names, the files feel opaque.
If you remember they are snapshots of a daily process loop, they become much easier to read.
Why this matters for debugging
Suppose biomass is too low.
That could happen because:
- the crop developed too quickly and ended vegetative growth early
- leaf area stayed too small
- radiation was too low
- water stress reduced growth
- nitrogen stress reduced growth
- partitioning moved too much biomass away from the organ you are studying
The point is that a final error at harvest is often the result of earlier daily processes, not just one wrong coefficient.
Why this matters for calibration
Calibration is much easier when you think in process order:
- get emergence and phenology roughly right
- get canopy and vegetative growth roughly right
- get partitioning and harvest components right
Trying to fit final biomass before understanding the timing of stage transitions usually creates confusion.
The beginner takeaway
The most important thing to remember is this:
DSSAT is not jumping straight from planting date to final yield.
It is simulating a chain of daily decisions and state updates.
That is the logic behind almost everything else in this book.
Hemp Biology for Modelers
This chapter is not a full botany text.
Its purpose is narrower:
to give enough hemp biology that a new modeler understands what the model is trying to reproduce.
Why hemp is a special case in modeling
Hemp is not just "another generic broadleaf crop."
It can be grown for very different purposes:
- fiber
- seed or grain
- dual-purpose production
- sometimes floral biomass in other contexts
Those goals matter because they change what traits we care about:
- stem growth
- partitioning
- flowering timing
- plant height
- branching
- grain production
Major growth phases in practical terms
For a beginner, it helps to think about hemp growth in broad stages:
- germination and emergence
- vegetative growth
- floral initiation and flowering
- seed set or reproductive filling
- maturation or harvest stage
During vegetative growth, the plant is mostly building structure:
- leaves
- stems
- roots
Later, more of the plant's energy may shift toward reproductive structures and seed production, depending on genotype and management.
Why flowering timing matters so much
Hemp is often strongly sensitive to photoperiod, meaning daylength influences when it transitions toward reproductive development.
This matters because flowering timing controls:
- how long the crop stays vegetative
- how much biomass can accumulate before reproduction
- how tall the crop becomes
- how much stem biomass is available for fiber
- how much time remains for seed filling
If a hemp model flowers too early, it may produce too little total biomass. If it flowers too late, it may overgrow or miss the observed reproductive pattern.
Fiber versus seed logic
For fiber-oriented production, modelers often care especially about:
- stem dry weight
- plant height
- canopy development
- timing of flowering
For seed-oriented production, modelers may care more about:
- flowering and reproductive timing
- grain or seed biomass
- partitioning to reproductive organs
- maturation timing
The same crop model may be used in both contexts, but the most important target variables and the most sensitive parameters can differ.
Cultivar differences
Not all hemp cultivars respond the same way.
Differences may include:
- photoperiod sensitivity
- maturity type
- growth habit
- total biomass potential
- partitioning pattern
- height and canopy structure
That is why genotype coefficients matter so much in hemp adaptation work.
Why latitude and planting date matter
Because daylength changes with latitude and season, planting date can change hemp behavior dramatically.
Two crops planted in different places or on different dates may experience:
- different daylength trajectories
- different temperature accumulation
- different rainfall environments
That means a hemp model must usually represent both:
- temperature-driven development
- photoperiod-driven development
What an intern should watch first
When inspecting hemp simulations, the most useful first questions are often:
- Is the crop flowering at roughly the right time?
- Is stem biomass tracking observations?
- Is total biomass trajectory plausible?
- Is plant height developing at a reasonable pace?
- Are differences among cultivars visible?
Those are often more informative early on than focusing immediately on every single output column.
Why hemp adaptation in DSSAT matters
When a crop is new to a modeling platform, the task is not only "make it run."
The task is to decide whether the platform's structure can represent:
- the right developmental signals
- the right partitioning behavior
- the right observed responses to environment and management
That is why the hemp case study in this repository is so valuable. It tests the wrapper against a real adaptation problem rather than only against legacy crops.
Growth Stages, GDD, and Photoperiod
This chapter introduces several terms that confuse beginners very quickly:
- growth stage
- phenology
- GDD
- thermal time
- photoperiod
- photothermal day
They are closely related, but they are not the same thing.
Growth stages
A crop does not behave the same way from planting to harvest.
It passes through recognizable developmental phases such as:
- sowing or planting
- emergence
- vegetative growth
- flowering
- seed fill
- maturity
These phases are called growth stages or phenological stages.
The exact names vary by crop and model family, but the basic idea is always the same: stage affects what the plant can do and how the model treats it.
Phenology
Phenology is the study or simulation of when these developmental stages occur.
In crop models, phenology is critical because it controls:
- how long the crop remains vegetative
- when it begins reproduction
- when partitioning rules change
- when final harvest components can accumulate
If phenology is wrong, many later variables can also be wrong even if the rest of the model is reasonable.
What is GDD?
GDD means Growing Degree Days.
It is a simple way to express how much useful warmth a crop experienced.
The broad idea is:
- crops develop faster on warmer days
- but only within a biologically meaningful range
A simple textbook version looks like:
GDD = mean daily temperature - base temperature
with rules to prevent unrealistic values when temperatures are too low or too high.
This is only the basic concept. Real crop models may use more detailed thermal response functions than this one-line formula.
Why GDD matters
GDD helps explain why calendar days alone are not enough.
Two crops planted on the same date can accumulate very different development if one season is cool and the other is warm.
That affects:
- emergence timing
- flowering timing
- maturity timing
Thermal time versus GDD
People sometimes use these terms loosely, but thermal time is the broader
concept.
GDD is one common way of expressing thermal time.
In practice, when modelers say:
- "the crop needs more thermal time to flower"
they often mean the model needs more accumulated temperature-driven progress before stage transition.
Photoperiod
Photoperiod means daylength.
Some crops are sensitive to how long the day is, not only how warm it is.
For those crops, development may slow or accelerate depending on whether the day is longer or shorter than the crop prefers.
This matters strongly in hemp because flowering behavior is often tied to daylength response.
Photothermal day
Some model formulations combine temperature and photoperiod into a single idea,
often called a photothermal day or something similar.
The exact equation depends on the model family, but the concept is:
- development is not driven by temperature alone
- temperature and daylength together shape progress toward flowering or later stages
This is especially important when comparing:
- different planting dates
- different latitudes
- different cultivars with different photoperiod sensitivity
Why stage variables matter in outputs
When you see variables related to stages in DSSAT output, they are often trying to summarize this development logic.
Even if you are mainly interested in biomass, stage-related outputs help answer:
- Did the model flower too early?
- Did it remain vegetative too long?
- Did reproductive development begin at the right time?
Those questions are often the key to understanding yield or biomass mismatches.
A practical way to think about calibration
For a beginner, a useful rule is:
- first fix timing
- then fix size
- then fix partitioning
In other words:
- get stage transitions roughly right
- get total growth roughly right
- get organ-specific biomass roughly right
Trying to fit stem or seed biomass while flowering timing is still wrong often leads to misleading parameter changes.
Common beginner confusion
"The model has the right harvest date, so phenology must be fine"
Not necessarily.
It may still have the wrong:
- emergence date
- flowering date
- duration of vegetative growth
- duration of reproductive growth
"More GDD always means better growth"
Not necessarily.
Too much heat, stress, or an unfavorable daylength response can still reduce performance even if thermal accumulation is high.
"Photoperiod only matters at flowering"
Often flowering is the most visible effect, but photoperiod sensitivity can influence the whole developmental pathway around the transition to reproduction.
Beginner takeaway
If you remember only one thing from this chapter, let it be this:
calendar date is not enough.
Crop models care deeply about how temperature and daylength accumulate through time, because that is how they decide when the crop changes from one stage to the next.
Repository Tour
This chapter explains what is in the repository and why each piece exists.
Top-level structure
The key folders and files are:
src/ThemdBooksource files for the learning site.examples/Public-safe example materials and case-study scripts..github/workflows/GitHub Actions, including documentation deployment.book.tomlThemdBookconfiguration.README.mdThe short project-facing introduction.
The src/ folder
This folder is the book.
Hosted documentation should not be treated as decoration. In a project like this, the documentation is part of the reproducibility layer.
The purpose of the book is to:
- teach beginners how the system fits together
- explain why design decisions were made
- document validation status
- make onboarding easier for future contributors
The examples/ folder
This folder is where model-specific case studies can live without defining the identity of the whole repository.
Right now it contains a hemp-oriented DSSAT case-study example.
Over time it could also contain:
- APSIM examples
- STICS examples
- cross-model comparison examples
- ensemble examples
Where the wrapper source lives conceptually
The DSSAT-wrapper is part of the ecosystem, but it is not the identity of this repository.
Its dedicated documentation lives separately here:
This repository should link to such model-specific modules rather than pretending they are the whole system.
What is deliberately outside the repo
The umbrella cropmodelling repo is not the same thing as a full local runtime workspace.
A local research workspace may also contain:
- a DSSAT installation such as
C:/path/to/DSSAT48 - cloned example data such as
dssat-csm-data - APSIM installations
- STICS installations
- project-specific reproduction scripts
- temporary outputs and figures
The goal of this repo is to teach how to work with those pieces without assuming that everyone has the same machine layout or the same preferred model.
DSSAT in This Ecosystem
This chapter explains how DSSAT fits inside the broader cropmodelling
repository.
DSSAT is one model ecosystem, not the whole repo
DSSAT is currently the most developed pathway in this project because:
- it is mature
- it has rich file structures
- it has real example data
- it has already been used in the hemp case-study work
But DSSAT is still one ecosystem among several that this repository aims to support over time.
Where the wrapper belongs
The DSSAT-wrapper work belongs in the project as:
- one implementation module
- one teaching pathway
- one reproducibility layer
It should not define the identity of the whole repository.
Why the hosted wrapper site matters
The wrapper already has its own hosted documentation:
That means this repository does not need to duplicate every wrapper-level detail as if it were the only story.
Instead, this repository can:
- explain why the wrapper matters
- teach the beginner concepts around it
- link out to the dedicated wrapper site when deep DSSAT-wrapper detail is needed
What DSSAT currently contributes here
DSSAT currently provides:
- the clearest end-to-end lesson path
- a rich example of weather, soil, management, and genotype interaction
- a hemp adaptation and paper-reproduction case study
- a concrete example of why wrappers and validation matter
What comes later
As the project grows, the same pattern can be used for other ecosystems:
- APSIM section
- STICS section
- shared comparison sections
The important idea is that each model can be both:
- its own ecosystem
- part of a broader shared crop-modeling framework
DSSAT File Anatomy
One of the biggest beginner hurdles in DSSAT is that everything seems to be "just files." This chapter explains what those files do.
The most important idea
The experiment file is the conductor.
It does not contain every piece of data itself. Instead, it points to:
- the crop setup
- the field and soil context
- the planting and fertilizer treatments
- the weather station
- the cultivar and ecotype identifiers
If you understand the experiment file, DSSAT becomes much easier to reason about.
Experiment files
Examples:
*.WHXfor wheat examples*.MZXfor maize examples*.HMXfor hemp examples
An experiment file usually contains sections like:
*TREATMENTS*CULTIVARS*FIELDS*SOIL ANALYSIS*INITIAL CONDITIONS*PLANTING DETAILS*FERTILIZERS*HARVEST DETAILS*SIMULATION CONTROLS
When you read an experiment file, think of it as a structured recipe rather than a flat text file.
Companion observation files
Two observation file types matter often:
*.TTime-course observations.*.AEnd-of-season or summary observations.
In hemp case studies you may see:
.HMT.HMA
Those are the crop-specific versions of the same idea.
Genotype files
These usually live in the DSSAT Genotype folder.
Common file types:
.CULCultivar coefficients..ECOEcotype coefficients..SPESpecies-level definitions.
Important caution:
Not all crop/model families use the exact same genotype structure. Some have no
.ECO file at all. That is one reason a truly universal wrapper cannot assume a
single fixed genotype pattern.
Soil files
These define the physical and hydraulic context of the field.
Typical concerns include:
- layer depth
- bulk density
- lower and upper water limits
- saturation
- organic carbon
- initial water and nitrogen conditions
If biomass or stress behavior looks unrealistic, soil inputs are often one of the first places to inspect.
Weather files
Weather files are daily drivers. If the weather is wrong, the simulation can be beautifully reproducible and still biologically wrong.
Typical variables include:
- solar radiation
- minimum temperature
- maximum temperature
- rainfall
Weather files should be checked for:
- missing dates
- unrealistic spikes
- unit consistency
- correct station linkage from the experiment file
Metadata files in the DSSAT install
Two install-level files are especially important for this repo:
SIMULATION.CDE
This provides model and crop metadata.
DSSATPRO.V48
This maps crop codes to installed folders and default modules.
These files are crucial for the registry-driven wrapper because they let us infer model context instead of forcing the user to hard-code everything.
Outputs you will see often
PlantGro.OUT
The main time-series output file for growth variables.
Evaluate.OUT
Often provides end-of-run summary metrics and stage-related information.
Other output files
Some families also write additional outputs or family-specific variables.
That is why the omniwrapper uses:
- a generic output reader
- variable alias logic
- fallback parsing where necessary
Practical reading order
When debugging a run, a good order is:
- experiment file
- companion observation files
- genotype files
- install metadata
- output files
That order helps you understand not only what happened, but why the wrapper made the choices it made.
DSSAT Weather Files Explained
Weather is one of the most important DSSAT inputs because it drives the crop every single day.
If the weather data are poor, everything downstream can look coherent and still be wrong.
What a DSSAT weather file does
A weather file provides the daily atmospheric conditions the crop experiences.
Typical daily drivers include:
- solar radiation
- maximum temperature
- minimum temperature
- rainfall
Depending on workflow, other variables may also be relevant or derived.
Why weather matters so much
Weather influences almost every major process:
- emergence
- phenology
- photosynthesis potential
- evapotranspiration
- soil water balance
- stress timing
Even a small issue such as missing rainfall on the wrong date can shift water stress and therefore biomass or flowering patterns.
The basic structure of a weather file
A DSSAT weather file typically contains:
- station metadata
- geographic information
- a header defining daily columns
- one row per day
The station metadata usually help DSSAT understand:
- where the site is
- what climate station identifier to use
- how to compute daylength and seasonal context
The daily rows provide the actual time series.
Common weather variables
The exact names can vary, but beginners should understand the common concepts.
Solar radiation
This represents incoming shortwave energy available to drive canopy interception and photosynthesis-related processes.
If radiation is too low or too high, biomass can be strongly biased.
Minimum and maximum temperature
These influence:
- development rate
- heat stress or cold stress
- evapotranspiration
- potential growth
Temperature errors often show up first as phenology errors.
Rainfall
Rainfall feeds the soil water balance.
If rainfall timing is wrong, the model may create water stress at the wrong stage even if seasonal rainfall totals look reasonable.
Why location metadata matter
Latitude and longitude are not decoration.
They influence:
- daylength
- seasonal solar geometry
- sometimes weather interpretation choices
For crops with photoperiod sensitivity, such as hemp, location metadata can be especially important.
Common data-quality problems
Beginner workflows often run into:
- missing dates
- duplicate dates
- impossible temperatures
- negative radiation
- unit confusion
- station metadata copied from the wrong location
Any of these can damage a simulation quietly.
How weather files relate to experiment files
The experiment file typically references a weather station identifier rather than embedding the daily weather table directly.
That means debugging weather requires checking both:
- the weather file itself
- the station linkage used by the experiment file
Practical checks before you trust a weather file
Before running calibration, check:
- Does the file cover the full simulation period?
- Are all dates present?
- Do units match the expected DSSAT convention for that variable?
- Do min and max temperatures look physically plausible?
- Does rainfall timing look believable?
- Does the station metadata match the intended site?
Beginner takeaway
When a run looks biologically wrong, do not assume the genotype is the problem.
Weather is often the first major input to inspect because it drives development and growth every day of the season.
How to Build Climate Data for DSSAT
This chapter is about workflow rather than theory.
It explains how you go from real weather information to a weather file that DSSAT can actually use.
Where weather data usually come from
Common sources include:
- on-site weather stations
- nearby public weather stations
- gridded weather products
- satellite-supported products
- reanalysis products
The best source depends on:
- how close it is to the field
- whether it covers the full date range
- whether it provides the required variables
- whether its quality control is trustworthy
The core rule: daily completeness matters
DSSAT usually needs a complete daily time series across the simulation period.
That means you should aim for:
- no missing dates
- consistent daily units
- no duplicated rows
- no accidental mixing of stations
Even if a source is scientifically reputable, your assembled file can still be wrong if the processing pipeline is careless.
A practical build workflow
Step 1: define the site and period
Write down:
- station or site identifier
- latitude
- longitude
- elevation if available
- start date
- end date
These become the backbone of the weather file.
Step 2: collect the daily variables
At minimum, assemble the variables required by your DSSAT workflow, often:
- daily solar radiation
- daily minimum temperature
- daily maximum temperature
- daily rainfall
If your source does not provide one of these directly, document how it was derived.
Step 3: standardize units
This is one of the most common failure points.
Before writing the file, make sure each variable is in the convention expected by the DSSAT weather format you are targeting.
Never assume two data providers use the same units.
Step 4: fill or flag missing values
If data are missing, decide explicitly whether to:
- gap-fill from another source
- interpolate where scientifically justified
- mark the series as unsuitable
Do not silently leave gaps in a file that is supposed to drive a daily simulation.
Step 5: run quality checks
Plot and inspect:
- rainfall through time
- min and max temperature through time
- radiation through time
Also check summary statistics such as:
- hottest day
- coldest day
- wettest day
- total seasonal rainfall
These quick checks often reveal transcription or unit errors immediately.
Step 6: write the DSSAT weather file
Once the data are clean, write them into the required DSSAT structure with:
- correct station header information
- correct variable names
- one row per day
Step 7: test with a known experiment
Before using the file in serious calibration, run a simple known experiment and confirm:
- the simulation period is covered
- DSSAT reads the file
- no weather-related missing-data warnings appear
When weather must be estimated
Sometimes field experiments do not have perfect station data.
In that case, you may use:
- nearby stations
- merged products
- satellite-supported products
- reanalysis products
The important thing is not to pretend the data are more direct than they are. Document the source and any transformation clearly.
A good habit for reproducibility
Keep a separate record of:
- source dataset names
- source URLs or citations
- download date
- unit conversions
- gap-filling rules
- final weather file name used in DSSAT
That record is often more valuable than the final file alone.
Beginner takeaway
Building weather for DSSAT is not just "put climate in a text file."
It is a careful process of:
- sourcing
- cleaning
- standardizing
- validating
- documenting
That discipline is what makes later calibration and paper reproduction credible.
Denmark ISIMIP Workflow Walkthrough
This chapter records what we actually did to build the Denmark weather deliverables, which scripts were involved, where they lived on the remote HPC, what outputs were produced, and how the soil side should be structured if we later extend the Denmark project using the Ethiopia soil-first pattern.
The goal is not just to say "the files were made." The goal is to leave a traceable map that another researcher or intern can follow.
The main question we answered
The practical Denmark question was:
"Can we take ISIMIP climate data for Denmark and push it all the way to DSSAT-ready weather outputs, using a workflow that can later be paired with a proper soil pipeline?"
The answer became yes.
By the end of the work, we had:
- processed Denmark climate NetCDF files in crop-model units
- parquet zonal summaries
- TAV and AMP outputs
- DSSAT
.WTHfiles - repaired elevation headers
- validation of processed units
The real project locations
Denmark climate and weather project on HPC
/home/<user>/DEN/home/<user>/DEN/denmark
The denmark folder was created as a clean project wrapper so we would not
have to move heavy climate data around unnecessarily.
Ethiopia soil reference project on HPC
/home/<user>/Ethiopia/home/<user>/Ethiopia/soil_data
This Ethiopia project is important because it already demonstrates the soil-first pattern that we want future projects to follow.
What already existed before the Denmark cleanup
Before the cleanup and restructuring, Denmark already had important raw and processed ingredients:
- Denmark shapefiles under
/home/<user>/DEN/shapefiles - Denmark COP30 DEM tiles under
/home/<user>/DEN/COP30 - processed ISIMIP climate NetCDFs under each model's
processed/folder - legacy zonal outputs under
/home/<user>/DEN/zonal_stats
Those legacy zonal outputs were not yet in the Ethiopia-style parquet layout. They existed mostly as per-grid Excel outputs.
The Denmark wrapper we built
We created a clean wrapper at:
/home/<user>/DEN/denmark
Inside that wrapper we organized:
/home/<user>/DEN/denmark/scripts/home/<user>/DEN/denmark/docs/home/<user>/DEN/denmark/config/home/<user>/DEN/denmark/deliverables/home/<user>/DEN/denmark/logs
This mattered because it separated:
- large source data that already existed
- lightweight workflow code
- final deliverables
- validation outputs
The Denmark scripts we used
These are the numbered workflow scripts we set up for Denmark:
/home/<user>/DEN/denmark/scripts/01_download_isimip.sh/home/<user>/DEN/denmark/scripts/02_prepare_shapefiles_and_grid.sh/home/<user>/DEN/denmark/scripts/03_create_denmark_grid.py/home/<user>/DEN/denmark/scripts/04_prepare_dem.sh/home/<user>/DEN/denmark/scripts/05_merge_clip_convert_climate.sh/home/<user>/DEN/denmark/scripts/06_compute_tav_amp.py/home/<user>/DEN/denmark/scripts/07_extract_zonal_stats_parquet.py/home/<user>/DEN/denmark/scripts/08_generate_wth_files.py/home/<user>/DEN/denmark/scripts/09_validate_outputs.py/home/<user>/DEN/denmark/scripts/10_run_model_scenario_weather.sh/home/<user>/DEN/denmark/scripts/11_submit_denmark_weather.sh
What each Denmark script did
01_download_isimip.sh
This is the placeholder entry for climate acquisition.
In practice, much of the Denmark climate data already existed under model folders, so the project did not depend on a fresh redownload to finish the weather pipeline.
02_prepare_shapefiles_and_grid.sh
This script represents the shapefile preparation stage.
Conceptually this stage ties together:
- Denmark extent
- Denmark 0.5 degree grid
- the grid-cell logic that later drives zonal extraction and WTH station IDs
03_create_denmark_grid.py
This is the grid-construction step.
It formalizes the Denmark climate grid as the geographic framework against which climate and later soil can be mapped.
04_prepare_dem.sh
This script prepared the Denmark DEM support raster used for WTH elevation.
It produced the merged DEM used in the workflow:
/home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif
This step mattered because the WTH headers need defensible elevation values.
05_merge_clip_convert_climate.sh
This was the bridge from raw model outputs to processed Denmark climate files.
The underlying climate conversion logic matched the older project script:
- merge yearly files
- clip to Denmark extent
- convert variables to crop-model units
The effective unit targets were:
pr->mm/daytas->degCtasmax->degCtasmin->degCrsds->MJ m-2 day-1
06_compute_tav_amp.py
This script computed DSSAT header climatology summaries:
- TAV
- AMP
It reads processed tasmax and tasmin, aligns them by date, and writes:
/home/<user>/DEN/denmark/deliverables/tav_amp/{MODEL}/{SCENARIO}/{PERIOD}/tav_amp.csv
07_extract_zonal_stats_parquet.py
This script converted processed Denmark NetCDF climate variables into Ethiopia-style zonal parquet outputs.
It writes outputs such as:
/home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/pr/pr.parquet/home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/tasmax/tasmax.parquet/home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/tasmin/tasmin.parquet/home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/rsds/rsds.parquet
This was the key format bridge that let us move from NetCDF to WTH generation.
08_generate_wth_files.py
This script assembled the final DSSAT weather files.
It merged the parquet variables:
prtasmaxtasminrsds
Then it wrote DSSAT .WTH files into:
/home/<user>/DEN/denmark/deliverables/wth/{MODEL}/{SCENARIO}/{PERIOD}
It also handled:
- station header coordinates
- TAV and AMP
- DEM-based elevation
Later we repaired this script so that zero-elevation coastal cells would borrow
the nearest valid land-based elevation instead of remaining at 0 m.
09_validate_outputs.py
This script counted the main deliverables:
- parquet files
- WTH files
- TAV and AMP files
It was useful for fast completeness checks during batch reruns.
10_run_model_scenario_weather.sh
This was the per-model, per-scenario orchestration script.
It did:
- determine the time span from processed
pr - compute TAV and AMP
- build all parquet variables
- generate WTH files
This script made it easy to re-run only one broken scenario after a repair.
11_submit_denmark_weather.sh
This was the batch submission wrapper for the cluster.
It was used to submit the multi-model, multi-scenario Denmark weather runs.
The main climate processing logic we followed
The Denmark weather build followed this chain:
- use the existing processed model folders under
/home/<user>/DEN - ensure the processed NetCDF values were in DSSAT-ready units
- prepare a merged Denmark DEM for elevation support
- compute TAV and AMP
- convert daily climate grids to zone-level parquet summaries
- generate WTH files for each Denmark grid cell
- validate units, counts, and headers
Problems we had to repair
The work was not a simple one-pass run.
We had to repair real issues.
1. Broken batch dependencies and environment issues
Early jobs were stuck behind dependency logic or failed because the right geospatial environment was not consistently activated.
We fixed that by making the wrapper scripts explicit about:
- Python interpreter
PROJ_LIBPATH- environment cleanup
2. Broken processed files for MPI-ESM1-2-HR ssp370
These files were genuinely corrupted and had to be rebuilt from raw inputs:
/home/<user>/DEN/MPI-ESM1-2-HR/processed/MPI-ESM1-2-HR_ssp370_tasmin_denmark.nc/home/<user>/DEN/MPI-ESM1-2-HR/processed/MPI-ESM1-2-HR_ssp370_rsds_denmark.nc
They were reconstructed using the same logic as the main climate processing step:
- merge raw decade files
- clip to Denmark
- convert to the correct units
3. Duplicate parquet and TAV/AMP test artifacts
During iteration we created a few stale outputs from older path logic.
These were later cleaned so the final deliverables reflected only the intended production runs.
4. Zero-elevation WTH headers
Many coastal Denmark grid cells initially got ELEV = 0.
This happened because centroid sampling and some polygon means landed in very water-dominated cells.
We fixed that by improving the elevation fallback logic in:
/home/<user>/DEN/denmark/scripts/08_generate_wth_files.py
The final fallback order became:
- land-intersection centroid DEM sample
- land-polygon mean
- land-polygon max
- nearest grid cell with a valid land-based elevation
Final Denmark weather results
The Denmark weather deliverables finished cleanly.
Final validated counts were:
80parquet files1160WTH files20TAV/AMP files
That corresponds to:
5models4scenarios each20model-scenario combinations58WTH files per combination
Key validation outputs:
/home/<user>/DEN/denmark/deliverables/validation/processed_units_validation.csv/home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif
Key bundled deliverables:
/home/<user>/DEN/denmark/deliverables/denmark_wth_bundle.tar.gz
What we proved about the processed climate files
We validated all 100 processed climate files across:
prtastasmaxtasminrsds
Final result:
100/100files passed unit and range checks
The final expected units were:
pr:mm/daytas:degCtasmax:degCtasmin:degCrsds:MJ m-2 day-1
The Ethiopia soil-first pattern we should imitate
The Ethiopia project is the best reference for how soil should be prepared in a more model-agnostic and reusable way.
The core soil workflow is documented at:
/home/<user>/Ethiopia/soil_data/docs/soil_first_workflow.md/home/<user>/Ethiopia/soil_data/docs/soil_hydraulics_runoff.md
Its core idea is simple:
- keep one authoritative processed soil backbone
- export DSSAT, APSIM, and STICS from that same backbone
Important precision note:
- in the current HPC tree, the documented Ethiopia soil workflow is easiest to
confirm through the files under
/home/<user>/Ethiopia/soil_datarather than through a complete set of matching upstream script files - the docs clearly describe the producer scripts by name, but the present
walkthrough only treats the existing
soil_datafiles and docs as confirmed paths unless a script path was directly observed
Ethiopia authoritative soil backbone
The files to treat as source of truth are:
/home/<user>/Ethiopia/soil_data/processed/final_soil_profile_all_points.csv/home/<user>/Ethiopia/soil_data/processed/HWSD2_hydraulic_parameters_from_profile.csv/home/<user>/Ethiopia/soil_data/processed/soil_with_slro.csv/home/<user>/Ethiopia/soil_data/processed/horizon.csv
In the current Ethiopia tree we directly confirmed at least:
/home/<user>/Ethiopia/soil_data/processed/soil_with_slro.csv/home/<user>/Ethiopia/soil_data/processed/horizon.csv
Ethiopia downstream soil exports
- DSSAT soil file:
/home/<user>/Ethiopia/soil_data/dssat/ET.SOL - APSIM soils:
/home/<user>/Ethiopia/soil_data/apsim/soils/*.json - STICS soils:
/home/<user>/Ethiopia/soil_data/stics/sols.xml
This is exactly the pattern worth copying.
What the Denmark soil side should look like
If Denmark soil preparation is added later, it should not start from DSSAT format first.
It should start from a model-agnostic processed soil core and then export to model-specific products.
Recommended Denmark soil project layout
/home/<user>/DEN/soil_data/raw/home/<user>/DEN/soil_data/processed/home/<user>/DEN/soil_data/docs/home/<user>/DEN/soil_data/reports/home/<user>/DEN/soil_data/dssat/home/<user>/DEN/soil_data/apsim/home/<user>/DEN/soil_data/stics
Recommended authoritative Denmark processed soil tables
At minimum, Denmark should end up with processed tables analogous to Ethiopia:
final_soil_profile_all_points.csvhydraulic_parameters_from_profile.csvsoil_with_slro.csvhorizon.csv
Recommended Denmark DSSAT output
One final DSSAT soil file such as:
DN.SOL
or an equivalent country-specific .SOL file containing all Denmark station
profiles.
Recommended Denmark per-station profile logic
Each station or grid cell should ultimately have:
- point or grid identifier
- coordinates
- layer bottoms
- sand, silt, clay
- bulk density
- organic carbon
- pH if available
- hydraulic limits such as lower limit, drained upper limit, saturation
- runoff/slope enrichment such as
SLRO - root-limiting depth or impermeable-layer depth where available
What ISIMIP soil input data actually provides
For ISIMIP3 soil input data, the official DOI page states that two newer HWSD 1.2-based files are provided:
hwsd_soil_data_all_land.nchwsd_soil_data_on_cropland.nc
For agriculture work, the important one is usually:
hwsd_soil_data_on_cropland.nc
because it represents the soil predominantly occurring on cropland within each grid cell.
The DOI metadata also states that these files provide fields such as:
texture_classmu_globalsoil_phsoil_caco3bulk_densitycec_soilocroot_obstaclesimpermeable_layerawcsandsiltclaygravelecebs_soilissoil
That means ISIMIP soil input data is not already a DSSAT .SOL file.
It is an upstream static geographic soil dataset that still needs to be turned into:
- a canonical processed soil table
- hydraulic estimates or pedotransfer outputs
- runoff or slope enrichments
- final model exports
What that means for a Denmark soil build
If we want to build Denmark soil inputs inspired by the Ethiopia project, the correct logic is:
- start from an ISIMIP soil NetCDF, ideally cropland-focused if the target is crop modeling
- sample the soil properties to the Denmark grid or station points
- create a canonical processed table with one row per layer per station
- derive hydraulic quantities that DSSAT needs
- add slope/runoff support from the DEM if needed
- create a final
horizon.csvor equivalent initialization table - export a Denmark
.SOLfile for DSSAT - optionally export APSIM and STICS soils from the same processed backbone
The key lesson from the Denmark weather work
The weather pipeline succeeded because it had a disciplined chain:
- one clear wrapper
- numbered scripts
- validated units
- validated counts
- explicit deliverables
The soil pipeline should follow the same discipline.
The strongest future design is therefore:
- weather: NetCDF -> parquet -> WTH
- soil: ISIMIP/HWSD-like source -> processed soil core ->
.SOLand other model exports
The practical answer to the soil question
If you want the soil data "prepared correctly" for a Denmark-style project,
motivated by Ethiopia, then the final soil deliverables should not only be a
single .SOL file.
They should be a package containing:
- a raw source record showing which ISIMIP soil file was used
- a processed canonical table with one row per layer per station or grid cell
- a hydraulic-enriched soil table
- a horizon or initial-condition table
- a DSSAT
.SOLexport - QC notes describing fallback assumptions and missing values
That is the structure that makes the soil side auditable, reusable, and ready for future models beyond DSSAT.
Denmark ISIMIP3b to DSSAT: Full Script Walkthrough
Full guide: github.com/Ongevic/cropmodelling
This chapter walks through every script that turned raw ISIMIP3b climate data into DSSAT-ready weather files for Denmark.
It does not summarise the scripts. It goes through each one block by block, explaining what the code does and why each decision was made.
By the end you should be able to reproduce the full pipeline from scratch, or adapt it for any other country or climate dataset.
What this pipeline produces
At the end of this workflow you have:
1160DSSAT.WTHfiles covering all of Denmark20TAV/AMP summary CSVs80Parquet zonal-statistic files as intermediate data
That corresponds to:
5climate models4scenarios per model (historical,ssp126,ssp370,ssp585)58Denmark grid cells per model-scenario combination
The full data flow
ISIMIP data portal
│
│ 01_download_isimip.sh
▼
Raw ISIMIP3b NetCDF files
(one file per model / scenario / variable / year-chunk)
│
│ 05_merge_clip_convert_climate.sh
▼
Processed Denmark NetCDF files
(units converted, clipped to Denmark, one file per model/scenario/variable)
│
├──── 06_compute_tav_amp.py ──────────► tav_amp/*.csv
│
├──── 07_extract_zonal_stats_parquet.py ► *.parquet
│ (also needs 02/03 shapefiles)
│
└──── 08_generate_wth_files.py ──────► *.WTH
(also needs 04 DEM, parquet, shapefile)
Validation
│
│ 09_validate_outputs.py
▼
Counts check + processed_units_validation.csv
Orchestration
│
│ 10_run_model_scenario_weather.sh
│ 11_submit_denmark_weather.sh (Slurm)
▼
All 20 model-scenario combinations run on HPC in parallel
Project directory layout on HPC
Everything lives under /home/<user>/DEN.
/home/<user>/DEN/
├── GFDL_ESM4/
│ ├── raw/ ← original downloaded NetCDF files
│ └── processed/ ← unit-converted, Denmark-clipped NetCDFs
├── IPSL-CM6A-LR/
│ ├── raw/
│ └── processed/
├── MPI-ESM1-2-HR/
│ ├── raw/
│ └── processed/
├── MRI-ESM2-0/
│ ├── raw/
│ └── processed/
├── UKESM1-0-LL/
│ ├── raw/
│ └── processed/
├── shapefiles/
│ ├── denmark_extent.shp
│ └── denmark_grid_05deg.shp
├── COP30/
│ └── *.tif ← Copernicus 30m DEM tiles
└── denmark/ ← clean wrapper (scripts + deliverables)
├── scripts/
├── docs/
├── config/
├── logs/
└── deliverables/
├── wth/
├── tav_amp/
├── parquet_zonal_stats/
└── validation/
The large climate data files stay in place under each model folder.
The denmark/ wrapper only holds scripts and final outputs.
That separation avoids expensive file copies on HPC.
The five climate models
| Short name | Full CMIP6 identifier |
|---|---|
| GFDL_ESM4 | NOAA Geophysical Fluid Dynamics Lab |
| IPSL-CM6A-LR | Institut Pierre Simon Laplace |
| MPI-ESM1-2-HR | Max Planck Institute, high resolution |
| MRI-ESM2-0 | Meteorological Research Institute Japan |
| UKESM1-0-LL | UK Earth System Model |
These five are the standard ISIMIP3b bias-corrected set used by the climate-impacts community.
The five climate variables
| CMIP6 name | Meaning | DSSAT name | Target units |
|---|---|---|---|
pr | Precipitation | RAIN | mm/day |
tasmax | Daily maximum temperature | TMAX | °C |
tasmin | Daily minimum temperature | TMIN | °C |
rsds | Surface downwelling shortwave radiation | SRAD | MJ m⁻² day⁻¹ |
tas | Mean temperature (used only for TAV/AMP) | — | °C |
DSSAT does not need tas directly in the weather file but the processed
tas NetCDF is validated as a consistency check.
Script 01 — Download ISIMIP climate data
File: 01_download_isimip.sh
#!/bin/bash
set -euo pipefail
exec /home/<user>/DEN/scripts/download_isimip.sh "$@"
What this script does
This is a thin wrapper. It delegates immediately to the core download script that lives in the parent project tree.
The real download logic is in:
/home/<user>/DEN/scripts/download_isimip.sh
That older script uses the ISIMIP data portal API to pull all model-scenario-
variable combinations to the raw/ folders.
Line-by-line breakdown
#!/bin/bash
The shebang. Tells the shell this file is a Bash script. This is the first line of every shell script in this pipeline.
set -euo pipefail
Three safety flags in one line. Always include these at the top of any Bash script you write.
-e— stop the script immediately if any command returns a non-zero exit code-u— treat any unset variable as an error instead of silently expanding to empty-o pipefail— if any command in a pipeline fails, the whole pipeline is marked as failed
Without these, a failing download could silently continue and you would only discover the problem much later when WTH generation crashes.
exec /home/<user>/DEN/scripts/download_isimip.sh "$@"
exec replaces the current process with the target script rather than forking
a child process. That means any Slurm job accounting stays with the real
script and there is no intermediate shell process sitting idle.
"$@" passes all arguments through unchanged. If you call this wrapper with
--model GFDL_ESM4 those flags reach the real download script.
Why the data were already present
For the Denmark project, the ISIMIP downloads had been completed in earlier work. The pipeline could therefore proceed directly to the processing step without waiting for a fresh download.
In a new project you would run this step first and let it finish before anything else.
Scripts 02 and 03 — Prepare shapefiles and grid
File: 02_prepare_shapefiles_and_grid.sh
#!/bin/bash
set -euo pipefail
echo "Use existing Denmark shapefiles in /home/<user>/DEN/shapefiles"
echo "Grid builder source: /home/<user>/DEN/scripts/create_denmark_grid.py"
File: 03_create_denmark_grid.py
#!/usr/bin/env python3
import runpy
runpy.run_path("/home/<user>/DEN/scripts/create_denmark_grid.py",
run_name="__main__")
What these scripts do
These two scripts represent the geographic foundation.
The Denmark project needed two shapefiles:
denmark_extent.shp— the outer boundary of Denmark used for clipping climate gridsdenmark_grid_05deg.shp— a 0.5-degree grid of polygons covering Denmark, with each polygon assigned a uniquegrid_id
The grid_id values in denmark_grid_05deg.shp become the station identifiers
in the final WTH filenames and headers.
Why 0.5 degrees
ISIMIP3b bias-corrected climate data are delivered on a 0.5-degree global grid. Building the Denmark analysis grid at the same resolution means every grid polygon maps exactly to one or more NetCDF cells without interpolation.
Denmark spans roughly:
- latitude: 54.5 N to 57.8 N
- longitude: 8.0 E to 15.2 E
At 0.5-degree spacing that gives approximately 58 grid cells that contain meaningful land area.
The runpy pattern
runpy.run_path("/home/<user>/DEN/scripts/create_denmark_grid.py",
run_name="__main__")
runpy.run_path executes a Python file as if it were run directly from the
command line. Setting run_name="__main__" triggers any if __name__ == "__main__":
guard in the target file.
This pattern lets the numbered wrapper scripts stay lean while pointing to the real logic files that already existed before the clean wrapper was built.
Script 04 — Prepare the DEM
File: 04_prepare_dem.sh
#!/bin/bash
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
unset PYTHONPATH || true
GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"
BASE=/home/<user>/DEN
OUTDIR="$BASE/denmark/deliverables/validation"
mkdir -p "$OUTDIR"
echo "COP30 source tiles: $BASE/COP30" > "$OUTDIR/dem_status.txt"
find "$BASE/COP30" -maxdepth 1 -name '*.tif' | sort >> "$OUTDIR/dem_status.txt"
MERGED="$OUTDIR/denmark_dem_merged.tif"
if [ -f "$MERGED" ]; then
echo "Merged DEM already exists: $MERGED" >> "$OUTDIR/dem_status.txt"
exit 0
fi
$GEO_PY - <<'PY'
from pathlib import Path
import rasterio
from rasterio.merge import merge
base = Path("/home/<user>/DEN/COP30")
out = Path("/home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif")
tiles = sorted(base.glob("*.tif"))
if not tiles:
raise SystemExit("No COP30 tiles found")
srcs = [rasterio.open(tile) for tile in tiles]
mosaic, transform = merge(srcs)
meta = srcs[0].meta.copy()
meta.update(
{
"driver": "GTiff",
"height": mosaic.shape[1],
"width": mosaic.shape[2],
"transform": transform,
}
)
out.parent.mkdir(parents=True, exist_ok=True)
with rasterio.open(out, "w", **meta) as dst:
dst.write(mosaic)
for src in srcs:
src.close()
print(out)
PY
echo "Merged DEM: $MERGED" >> "$OUTDIR/dem_status.txt"
What this script does
It merges individual Copernicus COP30 DEM tiles covering Denmark into one single raster file.
That merged raster is later used by Script 08 to read the elevation of each
Denmark grid cell centroid, which goes into the ELEV field of every WTH
header.
Why elevation matters in WTH files
The DSSAT weather file header looks like this:
@ INSI LAT LONG ELEV TAV AMP REFHT WNDHT
1 54.95 8.83 5 8.7 15.7 -99 -99
The ELEV field (here 5 metres) tells DSSAT how high above sea level the
station is. DSSAT uses this for atmospheric pressure adjustments and related
calculations.
If ELEV = 0 for coastal grid cells that actually sit at 3 to 8 metres, the
header is technically wrong even if DSSAT can still run.
The DEM fixes that.
Line-by-line breakdown
export PATH="$HOME/.local/bin:$PATH"
unset PYTHONPATH || true
GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"
These three lines solve the most common HPC Python problem: the wrong interpreter or the wrong package set gets loaded.
PATHis extended so locally installed tools come firstPYTHONPATHis cleared so no old project packages accidentally loadGEO_PYpoints to the exact Python binary inside thegeo_envconda environment, which hasrasterioandgeopandasinstalled
Always use the full interpreter path in HPC scripts. Never rely on python3
resolving to the right environment.
BASE=/home/<user>/DEN
OUTDIR="$BASE/denmark/deliverables/validation"
mkdir -p "$OUTDIR"
BASE is the root of the Denmark project. OUTDIR is where the merged DEM
and all validation outputs will go. mkdir -p creates the directory tree if
it does not already exist.
echo "COP30 source tiles: $BASE/COP30" > "$OUTDIR/dem_status.txt"
find "$BASE/COP30" -maxdepth 1 -name '*.tif' | sort >> "$OUTDIR/dem_status.txt"
Before doing any work, the script writes a log. It records where the source tiles came from and lists each file. This gives you a permanent audit trail without opening any tool.
MERGED="$OUTDIR/denmark_dem_merged.tif"
if [ -f "$MERGED" ]; then
echo "Merged DEM already exists: $MERGED" >> "$OUTDIR/dem_status.txt"
exit 0
fi
If the merged file already exists, the script skips the merge and exits
cleanly. This makes the script safe to rerun. Rerunning 04_prepare_dem.sh
a second time never damages the output.
$GEO_PY - <<'PY'
...
PY
This runs a Python heredoc. The shell passes everything between <<'PY' and
PY directly to the Python interpreter as standard input. The single-quoted
'PY' delimiter means no shell variable expansion happens inside the Python
block.
Inside the Python block:
tiles = sorted(base.glob("*.tif"))
if not tiles:
raise SystemExit("No COP30 tiles found")
sorted() guarantees a consistent tile order. raise SystemExit gives a
clear error message if the COP30 folder is empty, rather than silently
producing a broken output.
srcs = [rasterio.open(tile) for tile in tiles]
mosaic, transform = merge(srcs)
rasterio.merge.merge stitches the tiles together into a single array
(mosaic) and computes the combined geotransform (transform) that describes
where the merged raster sits on Earth.
meta = srcs[0].meta.copy()
meta.update(
{
"driver": "GTiff",
"height": mosaic.shape[1],
"width": mosaic.shape[2],
"transform": transform,
}
)
The metadata dictionary from the first tile is copied and updated with the
new dimensions and transform. The driver stays as GTiff because the COP30
tiles are GeoTIFF files.
with rasterio.open(out, "w", **meta) as dst:
dst.write(mosaic)
for src in srcs:
src.close()
The merged raster is written to disk. Then every source file handle is closed. Forgetting to close file handles on HPC can leave locked files and cause problems for later scripts.
Practical check
After this script runs, confirm the merged DEM exists and has the right extent:
gdalinfo /home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif \
| grep -E "Origin|Pixel Size|Size"
The corner coordinates should bracket Denmark. If the extent covers only part
of Denmark, some COP30 tiles were missing from the COP30/ folder.
Script 05 — Merge, clip, and convert climate data
File: 05_merge_clip_convert_climate.sh
#!/bin/bash
set -euo pipefail
exec /home/<user>/DEN/scripts/merge_denmark.sh "$@"
What this script does
Like Script 01, this is a thin wrapper. The real processing logic is in the pre-existing script:
/home/<user>/DEN/scripts/merge_denmark.sh
That script does the three most important preprocessing steps for each model-scenario-variable combination:
- Merge — yearly or decade-chunked NetCDF files are concatenated into one continuous time series
- Clip — the global or European climate field is cut to Denmark's bounding box using the extent shapefile
- Convert — raw ISIMIP3b units are converted to DSSAT-ready units
Why unit conversion is the critical step
ISIMIP3b delivers data in SI units (inherited from the underlying CMIP6 models):
| Variable | Raw ISIMIP3b unit | Meaning |
|---|---|---|
pr | kg m⁻² s⁻¹ | precipitation flux |
tas, tasmax, tasmin | K | temperature in Kelvin |
rsds | W m⁻² | watts per square metre |
DSSAT expects:
| Variable | DSSAT unit |
|---|---|
RAIN | mm/day |
TMAX, TMIN | °C |
SRAD | MJ m⁻² day⁻¹ |
The conversion formulas used are:
pr [mm/day] = pr [kg/m²/s] × 86400
tas [°C] = tas [K] − 273.15
tasmax [°C] = tasmax [K] − 273.15
tasmin [°C] = tasmin [K] − 273.15
rsds [MJ/m²/day] = rsds [W/m²] × 0.0864
The 86400 factor for precipitation converts seconds to days (1 day = 86400 s). The 0.0864 factor for radiation converts W/m² to MJ/m²/day (1 W/m² × 86400 s/day ÷ 1,000,000 J/MJ = 0.0864).
Output files produced
For each model and scenario, this step produces five NetCDF files:
/home/<user>/DEN/{MODEL}/processed/
{MODEL}_{SCENARIO}_pr_denmark.nc
{MODEL}_{SCENARIO}_tas_denmark.nc
{MODEL}_{SCENARIO}_tasmax_denmark.nc
{MODEL}_{SCENARIO}_tasmin_denmark.nc
{MODEL}_{SCENARIO}_rsds_denmark.nc
Five models × four scenarios × five variables = 100 processed NetCDF files in total.
A real problem we hit here
During production, two files for MPI-ESM1-2-HR / ssp370 were found
corrupted:
MPI-ESM1-2-HR_ssp370_tasmin_denmark.nc
MPI-ESM1-2-HR_ssp370_rsds_denmark.nc
They had to be rebuilt from raw by rerunning the merge-clip-convert logic on the raw source files. After rebuilding, the downstream WTH generation for that model-scenario combination ran correctly.
If you see a model-scenario combination missing from the final WTH output, always check whether its processed NetCDFs are intact first.
Script 06 — Compute TAV and AMP
File: 06_compute_tav_amp.py
#!/usr/bin/env python3
from pathlib import Path
import argparse
import pandas as pd
import xarray as xr
def main() -> None:
parser = argparse.ArgumentParser(
description="Compute Denmark TAV/AMP from processed NetCDFs."
)
parser.add_argument("--model", required=True)
parser.add_argument("--scenario", required=True)
parser.add_argument("--start-year", type=int)
parser.add_argument("--end-year", type=int)
args = parser.parse_args()
base = Path("/home/<user>/DEN")
tasmax = (
base / args.model / "processed"
/ f"{args.model}_{args.scenario}_tasmax_denmark.nc"
)
tasmin = (
base / args.model / "processed"
/ f"{args.model}_{args.scenario}_tasmin_denmark.nc"
)
if not tasmax.exists() or not tasmin.exists():
raise SystemExit("Missing tasmax or tasmin processed NetCDF")
dsx = xr.open_dataset(tasmax)
dsn = xr.open_dataset(tasmin)
tx = (
dsx["tasmax"]
.mean(dim=[d for d in dsx["tasmax"].dims if d != "time"])
.to_series()
)
tn = (
dsn["tasmin"]
.mean(dim=[d for d in dsn["tasmin"].dims if d != "time"])
.to_series()
)
df = pd.DataFrame(
{"date": pd.to_datetime(tx.index), "tasmax": tx.values}
).merge(
pd.DataFrame({"date": pd.to_datetime(tn.index), "tasmin": tn.values}),
on="date",
how="inner",
)
if args.start_year is not None:
df = df[df["date"].dt.year >= args.start_year]
if args.end_year is not None:
df = df[df["date"].dt.year <= args.end_year]
if df.empty:
raise SystemExit("No overlapping tasmax/tasmin dates after filtering")
tavg = (df["tasmax"] + df["tasmin"]) / 2.0
tav_val = float(tavg.mean())
monthly = (
df.assign(month=df["date"].dt.month)
.groupby("month")[["tasmax", "tasmin"]]
.mean()
)
monthly_tavg = (monthly["tasmax"] + monthly["tasmin"]) / 2.0
amp_val = float(monthly_tavg.max() - monthly_tavg.min())
period = f"{df['date'].dt.year.min()}_{df['date'].dt.year.max()}"
outdir = (
base / "denmark" / "deliverables" / "tav_amp"
/ args.model / args.scenario / period
)
outdir.mkdir(parents=True, exist_ok=True)
out = outdir / "tav_amp.csv"
pd.DataFrame(
[
{
"model": args.model,
"scenario": args.scenario,
"tav_degC": round(tav_val, 3),
"amp_degC": round(amp_val, 3),
}
]
).to_csv(out, index=False)
dsx.close()
dsn.close()
print(out)
if __name__ == "__main__":
main()
What this script produces
For each model-scenario pair it writes one CSV file:
deliverables/tav_amp/GFDL_ESM4/historical/1971_2014/tav_amp.csv
That CSV contains two numbers:
model,scenario,tav_degC,amp_degC
GFDL_ESM4,historical,7.951,16.528
Those two numbers end up in the header of every WTH file for that model-scenario combination.
What TAV and AMP mean for DSSAT
TAV is the long-term mean annual temperature at the site in degrees Celsius. DSSAT uses it for soil temperature initialisation.
AMP is the mean annual amplitude — the difference between the warmest month's average temperature and the coldest month's average temperature. It describes how strongly the site's temperature cycles through the year. A site in Denmark has a large AMP (~16 °C) because summers and winters are very different. A tropical site has a small AMP (3 to 5 °C).
Line-by-line breakdown
parser = argparse.ArgumentParser(
description="Compute Denmark TAV/AMP from processed NetCDFs."
)
parser.add_argument("--model", required=True)
parser.add_argument("--scenario", required=True)
parser.add_argument("--start-year", type=int)
parser.add_argument("--end-year", type=int)
args = parser.parse_args()
The script takes four command-line arguments. --model and --scenario are
required. The year filters are optional — when omitted, the full time range
in the NetCDF is used.
This design means the script can be called for any model-scenario combination without editing the code.
tasmax = (
base / args.model / "processed"
/ f"{args.model}_{args.scenario}_tasmax_denmark.nc"
)
tasmin = (
base / args.model / "processed"
/ f"{args.model}_{args.scenario}_tasmin_denmark.nc"
)
if not tasmax.exists() or not tasmin.exists():
raise SystemExit("Missing tasmax or tasmin processed NetCDF")
The script builds the expected file paths from the arguments and checks they
exist before trying to open them. A SystemExit with a clear message is
better than letting xr.open_dataset crash with a cryptic error.
dsx = xr.open_dataset(tasmax)
dsn = xr.open_dataset(tasmin)
tx = (
dsx["tasmax"]
.mean(dim=[d for d in dsx["tasmax"].dims if d != "time"])
.to_series()
)
This opens both NetCDF files and reduces the spatial dimensions to a single
Denmark-wide daily mean. The expression
[d for d in dsx["tasmax"].dims if d != "time"] builds a list of all
dimensions except time — typically ["lat", "lon"] — and averages over
them. The result is a daily time series of the Denmark-mean maximum temperature.
Taking a spatial mean here is a simplification. It produces one TAV/AMP pair per model-scenario combination rather than one per grid cell. The per-cell TAV and AMP values are computed later inside Script 08 from the actual zone data.
df = pd.DataFrame(
{"date": pd.to_datetime(tx.index), "tasmax": tx.values}
).merge(
pd.DataFrame({"date": pd.to_datetime(tn.index), "tasmin": tn.values}),
on="date",
how="inner",
)
The two time series (tasmax and tasmin) are joined on date using an inner merge. Inner merge means only dates present in both series are kept. This protects against misaligned time coordinates between the two NetCDF files.
if args.start_year is not None:
df = df[df["date"].dt.year >= args.start_year]
if args.end_year is not None:
df = df[df["date"].dt.year <= args.end_year]
if df.empty:
raise SystemExit("No overlapping tasmax/tasmin dates after filtering")
The optional year filters are applied. If after filtering nothing remains (the requested years are outside the NetCDF range), the script exits with a clear error.
tavg = (df["tasmax"] + df["tasmin"]) / 2.0
tav_val = float(tavg.mean())
TAV is the mean of the daily mean temperatures across the whole period. The
daily mean temperature is approximated as (tasmax + tasmin) / 2.
monthly = (
df.assign(month=df["date"].dt.month)
.groupby("month")[["tasmax", "tasmin"]]
.mean()
)
monthly_tavg = (monthly["tasmax"] + monthly["tasmin"]) / 2.0
amp_val = float(monthly_tavg.max() - monthly_tavg.min())
AMP is derived from monthly averages. Each month's mean maximum and minimum are averaged, then AMP is the difference between the warmest and coldest month. This is a 12-value lookup, not day-by-day.
period = f"{df['date'].dt.year.min()}_{df['date'].dt.year.max()}"
outdir = (
base / "denmark" / "deliverables" / "tav_amp"
/ args.model / args.scenario / period
)
outdir.mkdir(parents=True, exist_ok=True)
The output directory is constructed from the data itself. If the NetCDF
spans 1971 to 2014, period becomes "1971_2014". This makes the output
path self-documenting and avoids hardcoding year ranges.
pd.DataFrame(
[
{
"model": args.model,
"scenario": args.scenario,
"tav_degC": round(tav_val, 3),
"amp_degC": round(amp_val, 3),
}
]
).to_csv(out, index=False)
dsx.close()
dsn.close()
print(out)
The output CSV has exactly one row. Values are rounded to 3 decimal places. Both dataset handles are closed before the script exits. The path of the output file is printed so the calling shell script can log it.
Script 07 — Extract zonal statistics to Parquet
File: 07_extract_zonal_stats_parquet.py
#!/usr/bin/env python3
"""Convert Denmark processed NetCDF files into Ethiopia-style zonal parquet outputs."""
import argparse
from pathlib import Path
import geopandas as gpd
import numpy as np
import pandas as pd
import xarray as xr
from shapely.geometry import Point
def find_cells(geometry, lons, lats):
bounds = geometry.bounds
lon_idx = np.where((lons >= bounds[0]) & (lons <= bounds[2]))[0]
lat_idx = np.where((lats >= bounds[1]) & (lats <= bounds[3]))[0]
hits = []
for yi in lat_idx:
for xi in lon_idx:
if geometry.contains(Point(lons[xi], lats[yi])):
hits.append((yi, xi))
return hits
def main() -> None:
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--scenario", required=True)
parser.add_argument("--variable", required=True)
parser.add_argument("--start-year", type=int)
parser.add_argument("--end-year", type=int)
args = parser.parse_args()
base = Path("/home/<user>/DEN")
nc_file = (
base / args.model / "processed"
/ f"{args.model}_{args.scenario}_{args.variable}_denmark.nc"
)
shp = base / "shapefiles" / "denmark_grid_05deg.shp"
gdf = gpd.read_file(shp)
ds = xr.open_dataset(nc_file)
var = ds[args.variable]
lats = ds["lat"].values if "lat" in ds.coords else ds["latitude"].values
lons = ds["lon"].values if "lon" in ds.coords else ds["longitude"].values
times = pd.to_datetime(ds["time"].values)
if args.start_year is not None:
mask = times.year >= args.start_year
var = var.isel(time=mask)
times = times[mask]
if args.end_year is not None:
mask = times.year <= args.end_year
var = var.isel(time=mask)
times = times[mask]
if len(times) == 0:
raise SystemExit("No timesteps remain after filtering")
period = f"{times.min().year}_{times.max().year}"
out_dir = (
base
/ "denmark"
/ "deliverables"
/ "parquet_zonal_stats"
/ args.model
/ args.scenario
/ period
/ args.variable
)
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / f"{args.variable}.parquet"
rows = []
for idx, row in gdf.iterrows():
zone_id = int(row.get("grid_id", idx + 1))
cells = find_cells(row.geometry, lons, lats)
if not cells:
continue
yi = [c[0] for c in cells]
xi = [c[1] for c in cells]
values = var.values[:, yi, xi]
mean_values = np.nanmean(values, axis=1)
counts = np.sum(~np.isnan(values), axis=1)
rows.append(
pd.DataFrame(
{
"zone_id": zone_id,
"date": times,
"value": mean_values,
"cell_count": counts,
}
)
)
if not rows:
raise SystemExit("No zonal rows created")
pd.concat(rows, ignore_index=True).to_parquet(out_file, index=False)
ds.close()
print(out_file)
if __name__ == "__main__":
main()
What this script produces
For each model, scenario, variable, and period it writes one Parquet file:
deliverables/parquet_zonal_stats/
GFDL_ESM4/
historical/
1971_2014/
pr/pr.parquet
tasmax/tasmax.parquet
tasmin/tasmin.parquet
rsds/rsds.parquet
Each Parquet file has one row per zone per day:
| zone_id | date | value | cell_count |
|---|---|---|---|
| 1 | 1971-01-01 | 0.0 | 1 |
| 1 | 1971-01-02 | 1.6 | 1 |
| ... | ... | ... | ... |
| 58 | 2014-12-31 | 2.3 | 1 |
What Parquet is and why it is used here
Parquet is a column-oriented binary file format. Compared to CSV:
- much smaller file size for the same data
- much faster to read in Python with pandas or pyarrow
- stores data types exactly (no ambiguity between integers and strings)
For a daily time series over 58 zones and 44 years, a CSV would have about 930,000 rows per variable. Parquet compresses this by 5 to 10 times and reads it roughly 10 times faster.
The find_cells function
def find_cells(geometry, lons, lats):
bounds = geometry.bounds
lon_idx = np.where((lons >= bounds[0]) & (lons <= bounds[2]))[0]
lat_idx = np.where((lats >= bounds[1]) & (lats <= bounds[3]))[0]
hits = []
for yi in lat_idx:
for xi in lon_idx:
if geometry.contains(Point(lons[xi], lats[yi])):
hits.append((yi, xi))
return hits
This function answers the question: which NetCDF grid points fall inside this shapefile polygon?
Step 1 — bounding box pre-filter:
bounds = geometry.bounds
lon_idx = np.where((lons >= bounds[0]) & (lons <= bounds[2]))[0]
lat_idx = np.where((lats >= bounds[1]) & (lats <= bounds[3]))[0]
geometry.bounds returns (minx, miny, maxx, maxy) for the polygon.
The numpy where calls find all longitude indices that fall within the
east-west range and all latitude indices that fall within the north-south
range. This pre-filter makes the full search much faster by limiting the
candidate cells before the slower point-in-polygon test.
Step 2 — point-in-polygon test:
for yi in lat_idx:
for xi in lon_idx:
if geometry.contains(Point(lons[xi], lats[yi])):
hits.append((yi, xi))
For each candidate cell, a Point is constructed from its longitude and
latitude and tested against the polygon using Shapely's contains method.
Only cells whose centre point falls strictly inside the polygon are kept.
The function returns a list of (yi, xi) index pairs — the row and column
indices into the NetCDF data array.
Line-by-line breakdown of main
lats = ds["lat"].values if "lat" in ds.coords else ds["latitude"].values
lons = ds["lon"].values if "lon" in ds.coords else ds["longitude"].values
Different ISIMIP providers use slightly different coordinate names. Some use
lat/lon, others use latitude/longitude. This one-liner handles both
without crashing.
times = pd.to_datetime(ds["time"].values)
if args.start_year is not None:
mask = times.year >= args.start_year
var = var.isel(time=mask)
times = times[mask]
The time dimension is converted to pandas DatetimeIndex immediately. Pandas
datetime operations (.year, .month, etc.) are simpler and more reliable
than xarray's own time filtering. The isel call selects only the time steps
that pass the year filter, keeping the data and times in sync.
for idx, row in gdf.iterrows():
zone_id = int(row.get("grid_id", idx + 1))
cells = find_cells(row.geometry, lons, lats)
if not cells:
continue
yi = [c[0] for c in cells]
xi = [c[1] for c in cells]
values = var.values[:, yi, xi]
mean_values = np.nanmean(values, axis=1)
counts = np.sum(~np.isnan(values), axis=1)
The outer loop iterates over every Denmark grid polygon. For each polygon:
find_cellsreturns the NetCDF indices for all grid points inside itvar.values[:, yi, xi]extracts a 2D array of shape(n_days, n_cells)np.nanmean(..., axis=1)averages across cells for each day, ignoring NaN values (ocean cells in the NetCDF return NaN)countsrecords how many non-NaN cells contributed to each day's average, which is useful for QC later
rows.append(
pd.DataFrame(
{
"zone_id": zone_id,
"date": times,
"value": mean_values,
"cell_count": counts,
}
)
)
Each zone becomes one DataFrame with one row per day. All zone DataFrames
are collected in rows.
pd.concat(rows, ignore_index=True).to_parquet(out_file, index=False)
All zone DataFrames are concatenated and written as a single Parquet file.
ignore_index=True resets the row index to 0, 1, 2, … so there are no
duplicate index values. index=False keeps the Parquet file clean by not
writing the index as an extra column.
Script 08 — Generate DSSAT WTH files
File: 08_generate_wth_files.py
#!/usr/bin/env python3
"""Generate Denmark DSSAT WTH files from Denmark parquet zonal stats."""
from __future__ import annotations
import argparse
from pathlib import Path
import pandas as pd
VARIABLES = ["pr", "tasmax", "tasmin", "rsds"]
COLUMN_MAP = {"pr": "RAIN", "tasmax": "TMAX", "tasmin": "TMIN", "rsds": "SRAD"}
DEFAULT_DEM = Path(
"/home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif"
)
MIN_VALID_ELEV = 1.0
def load_zone_metadata(shapefile: Path, dem_raster: Path | None):
import geopandas as gpd
import rasterio
from rasterstats import zonal_stats
gdf = gpd.read_file(shapefile)
extent_path = Path("/home/<user>/DEN/shapefiles/denmark_extent.shp")
land_geom = None
if extent_path.exists():
extent_gdf = gpd.read_file(extent_path)
land_geom = (
extent_gdf.union_all()
if hasattr(extent_gdf, "union_all")
else extent_gdf.unary_union
)
meta = {}
sample_points = []
zone_ids = []
land_shapes = []
for idx, row in gdf.iterrows():
zone_id = int(row.get("grid_id", idx + 1))
geom = row.geometry
if land_geom is not None:
clipped = geom.intersection(land_geom)
if not clipped.is_empty:
geom = clipped
centroid = geom.centroid
meta[zone_id] = {"lat": float(centroid.y), "lon": float(centroid.x), "elev": 0.0}
sample_points.append((float(centroid.x), float(centroid.y)))
zone_ids.append(zone_id)
land_shapes.append(geom)
if dem_raster and dem_raster.exists():
with rasterio.open(dem_raster) as src:
nodata = src.nodata
samples = list(src.sample(sample_points))
stats = zonal_stats(land_shapes, dem_raster, stats=["mean", "max"], nodata=nodata)
for zone_id, sample in zip(zone_ids, samples):
elev = float(sample[0]) if sample is not None and len(sample) else 0.0
if nodata is not None and elev == nodata:
elev = 0.0
meta[zone_id]["elev"] = elev
for zone_id, stat in zip(zone_ids, stats):
if meta[zone_id]["elev"] <= 0 and stat.get("mean") is not None:
meta[zone_id]["elev"] = float(stat["mean"])
if meta[zone_id]["elev"] <= 0 and stat.get("max") is not None:
meta[zone_id]["elev"] = float(stat["max"])
positive = [zid for zid in zone_ids if meta[zid]["elev"] >= MIN_VALID_ELEV]
for zone_id in zone_ids:
if meta[zone_id]["elev"] >= MIN_VALID_ELEV or not positive:
continue
lat = meta[zone_id]["lat"]
lon = meta[zone_id]["lon"]
nearest = min(
positive,
key=lambda zid: (meta[zid]["lat"] - lat) ** 2
+ (meta[zid]["lon"] - lon) ** 2,
)
meta[zone_id]["elev"] = meta[nearest]["elev"]
return meta
def normalize_srad(series: pd.Series) -> pd.Series:
max_val = float(series.max()) if not series.empty else 0.0
if max_val > 10000:
return series / 1_000_000.0
if max_val > 100:
return series * 0.0864
return series
def write_wth(
output_path: Path,
zone_id: int,
zone_df: pd.DataFrame,
zone_meta: dict[str, float],
) -> None:
tavg = (zone_df["TMAX"] + zone_df["TMIN"]) / 2.0
tav = round(float(tavg.mean()), 1)
monthly = (
zone_df.assign(month=pd.to_datetime(zone_df["date"]).dt.month)
.groupby("month")[["TMAX", "TMIN"]]
.mean()
)
monthly_tavg = (monthly["TMAX"] + monthly["TMIN"]) / 2.0
amp = round(float(monthly_tavg.max() - monthly_tavg.min()), 1)
years = sorted(zone_df["year"].unique())
start_year = years[0]
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w", encoding="ascii", newline="\n") as h:
h.write(f"*WEATHER DATA : {zone_id:04d} {start_year}\n")
h.write("@ INSI LAT LONG ELEV TAV AMP REFHT WNDHT\n")
h.write(
f"{zone_id:>6}"
f"{zone_meta['lat']:>9.2f}"
f"{zone_meta['lon']:>9.2f}"
f"{zone_meta['elev']:>6.0f}"
f"{tav:>6.1f}"
f"{amp:>6.1f}"
f"{-99:>6}"
f"{-99:>6}\n"
)
h.write("@DATE SRAD TMAX TMIN RAIN\n")
for year, doy, srad, tmax, tmin, rain in zone_df[
["year", "doy", "SRAD", "TMAX", "TMIN", "RAIN"]
].itertuples(index=False, name=None):
h.write(
f"{int(year) % 100:02d}{int(doy):03d}"
f"{float(srad):6.1f}"
f"{float(tmax):6.1f}"
f"{float(tmin):6.1f}"
f"{float(rain):6.1f}\n"
)
def main() -> None:
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--scenario", required=True)
parser.add_argument("--period", required=True)
parser.add_argument("--dem-raster", default=str(DEFAULT_DEM))
args = parser.parse_args()
base = Path("/home/<user>/DEN/denmark/deliverables")
zonal_dir = base / "parquet_zonal_stats" / args.model / args.scenario / args.period
out_dir = base / "wth" / args.model / args.scenario / args.period
meta = load_zone_metadata(
Path("/home/<user>/DEN/shapefiles/denmark_grid_05deg.shp"),
Path(args.dem_raster) if args.dem_raster else None,
)
zone_data = {}
for var in VARIABLES:
pq = zonal_dir / var / f"{var}.parquet"
if not pq.exists():
raise SystemExit(f"Missing {pq}")
zone_data[var] = pd.read_parquet(pq)
merged = zone_data["pr"][["zone_id", "date", "value"]].rename(
columns={"value": "RAIN"}
)
for var in VARIABLES[1:]:
df = zone_data[var][["zone_id", "date", "value"]].rename(
columns={"value": COLUMN_MAP[var]}
)
merged = merged.merge(df, on=["zone_id", "date"], how="inner")
merged["date"] = pd.to_datetime(merged["date"])
merged["SRAD"] = normalize_srad(merged["SRAD"])
merged["year"] = merged["date"].dt.year
merged["doy"] = merged["date"].dt.dayofyear
for zone_id, zone_df in merged.groupby("zone_id"):
years = sorted(zone_df["year"].unique())
filename = f"{int(zone_id):04d}{years[0] % 100:02d}{len(years):02d}.WTH"
write_wth(
out_dir / filename,
int(zone_id),
zone_df.sort_values("date"),
meta[int(zone_id)],
)
print(out_dir)
if __name__ == "__main__":
main()
What this script produces
One .WTH file per Denmark grid cell per model-scenario combination:
deliverables/wth/GFDL_ESM4/historical/1971_2014/
00017144.WTH
00027144.WTH
...
05807144.WTH
The filename pattern is {zone_id:04d}{yy:02d}{n_years:02d}.WTH.
For zone 1, starting in 1971, spanning 44 years: 00017144.WTH.
Each file looks like this:
*WEATHER DATA : 0001 1971
@ INSI LAT LONG ELEV TAV AMP REFHT WNDHT
1 54.95 8.83 5 8.7 15.7 -99 -99
@DATE SRAD TMAX TMIN RAIN
71001 1.2 2.6 -4.7 0.0
71002 2.9 -3.3 -7.4 0.0
...
The module-level constants
VARIABLES = ["pr", "tasmax", "tasmin", "rsds"]
COLUMN_MAP = {"pr": "RAIN", "tasmax": "TMAX", "tasmin": "TMIN", "rsds": "SRAD"}
DEFAULT_DEM = Path(
"/home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif"
)
MIN_VALID_ELEV = 1.0
Defining constants at the top of the file rather than buried in functions
makes them easy to find and change. MIN_VALID_ELEV = 1.0 sets the
threshold below which an elevation value is considered suspect (likely a
water cell or a nodata misread). Any zone with elevation below 1 metre gets
the fallback logic.
The load_zone_metadata function
This function returns a dictionary keyed by zone_id, with each value
containing the latitude, longitude, and elevation for that zone.
land_geom = None
if extent_path.exists():
extent_gdf = gpd.read_file(extent_path)
land_geom = (
extent_gdf.union_all()
if hasattr(extent_gdf, "union_all")
else extent_gdf.unary_union
)
The Denmark extent shapefile is read and dissolved into a single geometry representing all land area. This is used in the next block to clip the grid polygons so coastal cells only sample elevation over their land portion.
union_all is the newer geopandas method; unary_union is the older one.
The hasattr check makes the script compatible with both.
for idx, row in gdf.iterrows():
zone_id = int(row.get("grid_id", idx + 1))
geom = row.geometry
if land_geom is not None:
clipped = geom.intersection(land_geom)
if not clipped.is_empty:
geom = clipped
centroid = geom.centroid
Each grid polygon is intersected with the land boundary. For a purely inland grid cell the intersection returns the same polygon. For a coastal cell that is half sea, the intersection returns only the land portion. The centroid is then computed from the land-only shape, so it sits on land rather than in the sea.
The four-level elevation fallback:
# Level 1: DEM point sample at the land centroid
with rasterio.open(dem_raster) as src:
nodata = src.nodata
samples = list(src.sample(sample_points))
for zone_id, sample in zip(zone_ids, samples):
elev = float(sample[0]) if sample is not None and len(sample) else 0.0
if nodata is not None and elev == nodata:
elev = 0.0
meta[zone_id]["elev"] = elev
# Level 2 and 3: zonal mean, then max, if point sample failed
stats = zonal_stats(land_shapes, dem_raster, stats=["mean", "max"], nodata=nodata)
for zone_id, stat in zip(zone_ids, stats):
if meta[zone_id]["elev"] <= 0 and stat.get("mean") is not None:
meta[zone_id]["elev"] = float(stat["mean"])
if meta[zone_id]["elev"] <= 0 and stat.get("max") is not None:
meta[zone_id]["elev"] = float(stat["max"])
# Level 4: nearest valid neighbour
positive = [zid for zid in zone_ids if meta[zid]["elev"] >= MIN_VALID_ELEV]
for zone_id in zone_ids:
if meta[zone_id]["elev"] >= MIN_VALID_ELEV or not positive:
continue
nearest = min(
positive,
key=lambda zid: (meta[zid]["lat"] - lat) ** 2
+ (meta[zid]["lon"] - lon) ** 2,
)
meta[zone_id]["elev"] = meta[nearest]["elev"]
Before this four-level fallback was added, many coastal Denmark grid cells
got ELEV = 0 because their centroid fell in the sea and the DEM returned
nodata there. The fallback ensures every zone ends up with a physically
plausible elevation.
The normalize_srad function
def normalize_srad(series: pd.Series) -> pd.Series:
max_val = float(series.max()) if not series.empty else 0.0
if max_val > 10000:
return series / 1_000_000.0
if max_val > 100:
return series * 0.0864
return series
This function handles solar radiation unit ambiguity robustly.
ISIMIP3b data sometimes arrives as W/m² and sometimes the unit conversion in Script 05 already applied the 0.0864 factor. The maximum value in the series reveals which case we are in:
- If max > 10,000: the values are in J/m²/day (raw energy totals). Divide by 1,000,000 to get MJ/m²/day.
- If max > 100: the values are in W/m². Multiply by 0.0864 to convert to MJ/m²/day.
- Otherwise: values are already in MJ/m²/day. Return unchanged.
This defensive approach prevents radiation values 1,000 times too large from silently corrupting the WTH files.
The write_wth function — building the WTH header
tavg = (zone_df["TMAX"] + zone_df["TMIN"]) / 2.0
tav = round(float(tavg.mean()), 1)
monthly = (
zone_df.assign(month=pd.to_datetime(zone_df["date"]).dt.month)
.groupby("month")[["TMAX", "TMIN"]]
.mean()
)
monthly_tavg = (monthly["TMAX"] + monthly["TMIN"]) / 2.0
amp = round(float(monthly_tavg.max() - monthly_tavg.min()), 1)
TAV and AMP are computed per zone from the actual zone data. These are therefore slightly different from the domain-average TAV/AMP in Script 06. The values written to the WTH header are the per-zone values here.
with open(output_path, "w", encoding="ascii", newline="\n") as h:
h.write(f"*WEATHER DATA : {zone_id:04d} {start_year}\n")
h.write("@ INSI LAT LONG ELEV TAV AMP REFHT WNDHT\n")
h.write(
f"{zone_id:>6}"
f"{zone_meta['lat']:>9.2f}"
f"{zone_meta['lon']:>9.2f}"
f"{zone_meta['elev']:>6.0f}"
f"{tav:>6.1f}"
f"{amp:>6.1f}"
f"{-99:>6}"
f"{-99:>6}\n"
)
Two things matter in this block:
encoding="ascii"— DSSAT reads weather files as plain ASCII. Using UTF-8 or Latin-1 can cause invisible character problems.newline="\n"— DSSAT expects Unix-style line endings. If you let Python use the platform default on Windows, you get\r\nline endings, which can confuse DSSAT's fixed-width parser.
The format strings use right-aligned field widths (>6, >9.2f, etc.) to
match the exact column positions expected by DSSAT. Even one character off
can cause DSSAT to misread a value.
-99 for REFHT and WNDHT is the DSSAT missing-value code, indicating
that reference height for humidity and wind height are not provided. DSSAT
handles this gracefully.
Writing the daily data rows
h.write("@DATE SRAD TMAX TMIN RAIN\n")
for year, doy, srad, tmax, tmin, rain in zone_df[
["year", "doy", "SRAD", "TMAX", "TMIN", "RAIN"]
].itertuples(index=False, name=None):
h.write(
f"{int(year) % 100:02d}{int(doy):03d}"
f"{float(srad):6.1f}"
f"{float(tmax):6.1f}"
f"{float(tmin):6.1f}"
f"{float(rain):6.1f}\n"
)
int(year) % 100 converts a 4-digit year to a 2-digit year: 1971 becomes
71, 2031 becomes 31. DSSAT's @DATE field is YYDDD format — two-digit
year plus three-digit day-of-year. So 1 January 1971 becomes 71001.
itertuples is used instead of iterrows because it is significantly faster
when writing many rows. The name=None argument suppresses the named-tuple
wrapper for a small additional speed gain.
Building the WTH filename
filename = f"{int(zone_id):04d}{years[0] % 100:02d}{len(years):02d}.WTH"
For zone 1, starting in 1971, spanning 44 years: 00017144.WTH.
0001— zone ID, zero-padded to 4 digits71— two-digit start year (1971 % 100)44— number of years in the file
This naming convention is consistent with how DSSAT weather files are typically named.
Script 09 — Validate outputs
File: 09_validate_outputs.py
#!/usr/bin/env python3
from pathlib import Path
base = Path("/home/<user>/DEN/denmark/deliverables")
for label, pattern in [
("parquet", "parquet_zonal_stats/**/*.parquet"),
("wth", "wth/**/*.WTH"),
("tav_amp", "tav_amp/**/*.csv"),
]:
print(label, len(list(base.rglob(pattern.split("**/")[-1]))))
What this script does
It counts the deliverables produced.
After a full production run, running this script should print:
parquet 80
wth 1160
tav_amp 20
Why those numbers
- 80 parquet files = 5 models × 4 scenarios × 4 variables
- 1160 WTH files = 5 models × 4 scenarios × 58 Denmark grid cells
- 20 TAV/AMP files = 5 models × 4 scenarios
If any number is wrong, you know which layer of the pipeline to investigate.
If parquet is 76 instead of 80, four Parquet jobs failed. If wth is 1102
instead of 1160, one model-scenario combination's 58 WTH files are missing.
The rglob trick
base.rglob(pattern.split("**/")[-1])
split("**/")[-1] extracts just the filename extension pattern
(e.g. *.parquet). Then rglob searches recursively for all files
matching that pattern. This avoids needing to know the exact directory depth,
which changes between models and scenarios.
Script 10 — Run one model-scenario combination
File: 10_run_model_scenario_weather.sh
#!/bin/bash
set -euo pipefail
export PATH="$HOME/.local/bin:$PATH"
unset PYTHONPATH || true
GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"
GEO_ENV_ROOT="/home/<user>/.conda/envs/mamba_env/envs/geo_env"
if [ -d "$GEO_ENV_ROOT/share/proj" ]; then
export PROJ_LIB="$GEO_ENV_ROOT/share/proj"
fi
if [ $# -ne 2 ]; then
echo "Usage: $0 MODEL SCENARIO"
exit 1
fi
MODEL="$1"
SCENARIO="$2"
BASE="/home/<user>/DEN/denmark"
echo "=================================================="
echo "DENMARK WEATHER PIPELINE"
echo "MODEL: $MODEL"
echo "SCENARIO: $SCENARIO"
echo "START: $(date)"
echo "=================================================="
read START_YEAR END_YEAR <<< "$($GEO_PY - <<PY
from pathlib import Path
import xarray as xr
base = Path("/home/<user>/DEN")
model = "$MODEL"
scenario = "$SCENARIO"
p = base / model / "processed" / f"{model}_{scenario}_pr_denmark.nc"
ds = xr.open_dataset(p)
t = ds["time"].values
print(str(t[0])[:4], str(t[-1])[:4])
ds.close()
PY
)"
$GEO_PY "$BASE/scripts/06_compute_tav_amp.py" \
--model "$MODEL" \
--scenario "$SCENARIO" \
--start-year "$START_YEAR" \
--end-year "$END_YEAR"
for VAR in pr tasmax tasmin rsds; do
echo "Extracting parquet for $MODEL / $SCENARIO / $VAR"
$GEO_PY "$BASE/scripts/07_extract_zonal_stats_parquet.py" \
--model "$MODEL" \
--scenario "$SCENARIO" \
--variable "$VAR" \
--start-year "$START_YEAR" \
--end-year "$END_YEAR"
done
PERIOD="${START_YEAR}_${END_YEAR}"
echo "Generating WTH files for $MODEL / $SCENARIO / $PERIOD"
$GEO_PY "$BASE/scripts/08_generate_wth_files.py" \
--model "$MODEL" \
--scenario "$SCENARIO" \
--period "$PERIOD"
echo "DONE: $MODEL / $SCENARIO at $(date)"
What this script does
This is the main per-job orchestration script.
When you call:
./10_run_model_scenario_weather.sh GFDL_ESM4 historical
it runs Scripts 06, 07 (×4 variables), and 08 in sequence for that one model-scenario combination.
Line-by-line breakdown
GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"
GEO_ENV_ROOT="/home/<user>/.conda/envs/mamba_env/envs/geo_env"
if [ -d "$GEO_ENV_ROOT/share/proj" ]; then
export PROJ_LIB="$GEO_ENV_ROOT/share/proj"
fi
PROJ_LIB tells GDAL and rasterio where to find the PROJ projection
database. When this is not set correctly, geopandas operations involving
coordinate reference systems fail with cryptic messages about missing proj.db.
On the WUR HPC, this path inside the conda environment contains the correct PROJ database. Setting it here prevents the problem.
if [ $# -ne 2 ]; then
echo "Usage: $0 MODEL SCENARIO"
exit 1
fi
$# is the number of arguments passed to the script. If not exactly 2, a
usage message is printed and the script exits. This catches typos before any
long-running computation starts.
read START_YEAR END_YEAR <<< "$($GEO_PY - <<PY
from pathlib import Path
import xarray as xr
base = Path("/home/<user>/DEN")
model = "$MODEL"
scenario = "$SCENARIO"
p = base / model / "processed" / f"{model}_{scenario}_pr_denmark.nc"
ds = xr.open_dataset(p)
t = ds["time"].values
print(str(t[0])[:4], str(t[-1])[:4])
ds.close()
PY
)"
This is the most complex line in the script. It:
- Opens a Python heredoc that reads the precipitation NetCDF for this model-scenario
- Extracts the first and last timestamps and prints their 4-digit years
- Captures that printed output back into the shell
- Uses
read START_YEAR END_YEARto split the two space-separated years into two shell variables
The result: START_YEAR=1971, END_YEAR=2014 for the historical period —
without hardcoding those years anywhere.
This matters because the historical and future periods differ:
- historical: 1971–2014
- ssp scenarios: 2031–2100
Using the NetCDF itself as the source of truth means the same script works for all scenarios without modification.
$GEO_PY "$BASE/scripts/06_compute_tav_amp.py" \
--model "$MODEL" \
--scenario "$SCENARIO" \
--start-year "$START_YEAR" \
--end-year "$END_YEAR"
Script 06 runs first because TAV and AMP are conceptually a summary of the full period, and computing them at this stage is a useful early sanity check before the heavier zonal extraction runs.
for VAR in pr tasmax tasmin rsds; do
echo "Extracting parquet for $MODEL / $SCENARIO / $VAR"
$GEO_PY "$BASE/scripts/07_extract_zonal_stats_parquet.py" \
--model "$MODEL" \
--scenario "$SCENARIO" \
--variable "$VAR" \
--start-year "$START_YEAR" \
--end-year "$END_YEAR"
done
Script 07 runs four times — once per variable. The echo before each call
means the Slurm log file shows exactly which variable is being processed
and when.
PERIOD="${START_YEAR}_${END_YEAR}"
echo "Generating WTH files for $MODEL / $SCENARIO / $PERIOD"
$GEO_PY "$BASE/scripts/08_generate_wth_files.py" \
--model "$MODEL" \
--scenario "$SCENARIO" \
--period "$PERIOD"
Script 08 runs last because it depends on the Parquet files from Script 07.
PERIOD is assembled from the same variables used in the Parquet output
paths, so the directory lookup in Script 08 always matches what Script 07
wrote.
Script 11 — Submit all jobs to Slurm
File: 11_submit_denmark_weather.sh
#!/bin/bash
set -euo pipefail
BASE="/home/<user>/DEN/denmark"
LOGDIR="$BASE/logs"
mkdir -p "$LOGDIR"
MODELS=("GFDL_ESM4" "IPSL-CM6A-LR" "MPI-ESM1-2-HR" "MRI-ESM2-0" "UKESM1-0-LL")
SCENARIOS=("historical" "ssp126" "ssp370" "ssp585")
MODELS_TO_RUN=("${MODELS[@]}")
SCENARIOS_TO_RUN=("${SCENARIOS[@]}")
if [ $# -ge 1 ] && [ -n "${1:-}" ]; then
IFS=',' read -r -a MODELS_TO_RUN <<< "$1"
fi
if [ $# -ge 2 ] && [ -n "${2:-}" ]; then
IFS=',' read -r -a SCENARIOS_TO_RUN <<< "$2"
fi
echo "Submitting DEM preparation job..."
DEM_MERGED="$BASE/deliverables/validation/denmark_dem_merged.tif"
DEPENDENCY_ARGS=()
if [ -f "$DEM_MERGED" ]; then
echo "DEM already available: $DEM_MERGED"
else
DEM_JOB=$(sbatch \
--parsable \
--job-name=den_dem \
--partition=cpuq \
--account=pr_climate \
--time=01:00:00 \
--cpus-per-task=2 \
--mem=8G \
--output="$LOGDIR/dem-%j.out" \
--error="$LOGDIR/dem-%j.err" \
--wrap="$BASE/scripts/04_prepare_dem.sh")
echo "DEM job: $DEM_JOB"
DEPENDENCY_ARGS=(--dependency="afterok:${DEM_JOB}")
fi
for MODEL in "${MODELS_TO_RUN[@]}"; do
for SCENARIO in "${SCENARIOS_TO_RUN[@]}"; do
JOBID=$(sbatch \
--parsable \
"${DEPENDENCY_ARGS[@]}" \
--job-name="den_${MODEL}_${SCENARIO}" \
--partition=cpuq \
--account=pr_climate \
--time=04:00:00 \
--cpus-per-task=4 \
--mem=24G \
--output="$LOGDIR/${MODEL}_${SCENARIO}-%j.out" \
--error="$LOGDIR/${MODEL}_${SCENARIO}-%j.err" \
--wrap="$BASE/scripts/10_run_model_scenario_weather.sh $MODEL $SCENARIO")
echo "$MODEL $SCENARIO $JOBID"
done
done
What this script does
It submits all 20 model-scenario combinations to the HPC cluster as separate Slurm jobs, all running in parallel.
It also handles the DEM job as a mandatory prerequisite: if the merged DEM does not yet exist, it submits the DEM job first and makes all weather jobs depend on it.
Line-by-line breakdown
MODELS=("GFDL_ESM4" "IPSL-CM6A-LR" "MPI-ESM1-2-HR" "MRI-ESM2-0" "UKESM1-0-LL")
SCENARIOS=("historical" "ssp126" "ssp370" "ssp585")
MODELS_TO_RUN=("${MODELS[@]}")
SCENARIOS_TO_RUN=("${SCENARIOS[@]}")
Two arrays define all models and scenarios. They are immediately copied to
MODELS_TO_RUN and SCENARIOS_TO_RUN. Those copies are what the loop
actually iterates. This pattern allows the full list to serve as
documentation while the run list can be overridden.
if [ $# -ge 1 ] && [ -n "${1:-}" ]; then
IFS=',' read -r -a MODELS_TO_RUN <<< "$1"
fi
if [ $# -ge 2 ] && [ -n "${2:-}" ]; then
IFS=',' read -r -a SCENARIOS_TO_RUN <<< "$2"
fi
Optional arguments let you run a subset. For example:
./11_submit_denmark_weather.sh "MPI-ESM1-2-HR" "ssp370,ssp585"
submits only the two broken-and-repaired combinations instead of all 20.
IFS=',' sets the field separator so read -a splits on commas rather than
spaces. ${1:-} uses an empty default so the variable is considered set even
if the argument is empty, avoiding the -u error.
DEM_MERGED="$BASE/deliverables/validation/denmark_dem_merged.tif"
DEPENDENCY_ARGS=()
if [ -f "$DEM_MERGED" ]; then
echo "DEM already available: $DEM_MERGED"
else
DEM_JOB=$(sbatch \
--parsable \
--job-name=den_dem \
...
--wrap="$BASE/scripts/04_prepare_dem.sh")
echo "DEM job: $DEM_JOB"
DEPENDENCY_ARGS=(--dependency="afterok:${DEM_JOB}")
fi
The DEM check runs before the main loop. If the merged DEM already exists,
DEPENDENCY_ARGS stays as an empty array and the weather jobs start
immediately. If the DEM does not exist, sbatch --parsable returns only the
job ID number, which is captured in DEM_JOB. The dependency array then
holds --dependency=afterok:12345, meaning every weather job will only start
after job 12345 finishes successfully.
DEM_JOB=$(sbatch \
--parsable \
--job-name=den_dem \
--partition=cpuq \
--account=pr_climate \
--time=01:00:00 \
--cpus-per-task=2 \
--mem=8G \
--output="$LOGDIR/dem-%j.out" \
--error="$LOGDIR/dem-%j.err" \
--wrap="$BASE/scripts/04_prepare_dem.sh")
The DEM job is allocated 2 CPUs, 8 GB RAM, and 1 hour. The %j in the log
filenames is replaced by the Slurm job ID, so each run gets its own log file.
JOBID=$(sbatch \
--parsable \
"${DEPENDENCY_ARGS[@]}" \
--job-name="den_${MODEL}_${SCENARIO}" \
--partition=cpuq \
--account=pr_climate \
--time=04:00:00 \
--cpus-per-task=4 \
--mem=24G \
--output="$LOGDIR/${MODEL}_${SCENARIO}-%j.out" \
--error="$LOGDIR/${MODEL}_${SCENARIO}-%j.err" \
--wrap="$BASE/scripts/10_run_model_scenario_weather.sh $MODEL $SCENARIO")
echo "$MODEL $SCENARIO $JOBID"
Each weather job gets 4 CPUs, 24 GB RAM, and 4 hours. The memory allocation covers loading climate NetCDFs (which can be several GB) plus the geopandas and xarray operations.
"${DEPENDENCY_ARGS[@]}" expands as either nothing (if DEM was already
present) or --dependency=afterok:JOBID (if it was just submitted). This
works because Bash expands an empty array to nothing rather than to an empty
string.
The output files in detail
A WTH file header
*WEATHER DATA : 0001 1971
@ INSI LAT LONG ELEV TAV AMP REFHT WNDHT
1 54.95 8.83 5 8.7 15.7 -99 -99
| Field | Value | Meaning |
|---|---|---|
INSI | 1 | Station identifier (zone_id) |
LAT | 54.95 | Latitude in decimal degrees |
LONG | 8.83 | Longitude in decimal degrees |
ELEV | 5 | Elevation in metres above sea level |
TAV | 8.7 | Mean annual temperature (°C) |
AMP | 15.7 | Mean annual amplitude (°C) |
REFHT | -99 | Humidity measurement height (not provided) |
WNDHT | -99 | Wind measurement height (not provided) |
A WTH daily row
71001 1.2 2.6 -4.7 0.0
| Columns 1–5 | Col 6–11 | Col 12–17 | Col 18–23 | Col 24–29 |
|---|---|---|---|---|
71001 (YYDDD) | 1.2 SRAD | 2.6 TMAX | -4.7 TMIN | 0.0 RAIN |
The date 71001 means year 1971, day 1 (1 January).
A tav_amp CSV
model,scenario,tav_degC,amp_degC
GFDL_ESM4,historical,7.951,16.528
Problems we hit and what fixed them
1. Wrong Python environment on HPC
Symptom: ModuleNotFoundError: No module named 'geopandas'
Cause: The job was using the system Python rather than the geo_env conda environment.
Fix: Specify the full path to the Python binary in every script:
GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"
Never use python3 or python on HPC without knowing which environment
they resolve to.
2. PROJ database not found
Symptom: ProjError: proj.db not found in proj search path
Cause: PROJ_LIB was not set, so geopandas could not find the
coordinate reference system database.
Fix:
GEO_ENV_ROOT="/home/<user>/.conda/envs/mamba_env/envs/geo_env"
if [ -d "$GEO_ENV_ROOT/share/proj" ]; then
export PROJ_LIB="$GEO_ENV_ROOT/share/proj"
fi
3. Corrupted processed NetCDF files
Symptom: MPI-ESM1-2-HR ssp370 WTH files were missing.
Cause: Two processed NetCDF files had been damaged:
MPI-ESM1-2-HR_ssp370_tasmin_denmark.nc
MPI-ESM1-2-HR_ssp370_rsds_denmark.nc
Fix: Rebuild them from raw source files using the same merge-clip-convert logic as Script 05. Once rebuilt, run Script 10 for that model-scenario only:
./10_run_model_scenario_weather.sh MPI-ESM1-2-HR ssp370
4. Zero-elevation coastal cells
Symptom: Many coastal Denmark WTH files had ELEV = 0.
Cause: The grid-cell centroid fell in the sea. The DEM returned its nodata value there. The script was treating that as zero instead of triggering a fallback.
Fix: The four-level elevation fallback in load_zone_metadata inside
Script 08 (described above).
5. Wrong line endings on Windows
Symptom: DSSAT could not parse the WTH file or returned strange phenology results.
Cause: The WTH file was written with Windows \r\n line endings because
it was generated on a Windows machine.
Fix: Always open WTH files with newline="\n" in Python:
with open(output_path, "w", encoding="ascii", newline="\n") as h:
Final validated deliverable counts
| Deliverable | Count | Formula |
|---|---|---|
| Processed NetCDF files | 100 | 5 models × 4 scenarios × 5 variables |
| Parquet zonal files | 80 | 5 models × 4 scenarios × 4 WTH variables |
| TAV/AMP CSV files | 20 | 5 models × 4 scenarios |
| WTH files | 1160 | 5 models × 4 scenarios × 58 grid cells |
All 100 processed NetCDF files passed unit and range checks in
processed_units_validation.csv:
prmean: ~2.3–2.6 mm/day (plausible for Denmark)tasmean: ~8–11 °C (plausible for Denmark)rsdsmean: ~10–11 MJ/m²/day (plausible for Denmark)
How to re-run from scratch
If you need to rebuild everything, the sequence is:
# Step 1: download (if data not already present)
./01_download_isimip.sh
# Step 2: merge, clip, convert (if processed NetCDFs not present)
# (runs the underlying merge_denmark.sh for each model/scenario/variable)
./05_merge_clip_convert_climate.sh
# Step 3: submit all weather jobs to Slurm
./11_submit_denmark_weather.sh
# Step 4: check counts once jobs complete
python 09_validate_outputs.py
To re-run only one broken model-scenario without submitting everything:
./10_run_model_scenario_weather.sh MPI-ESM1-2-HR ssp370
Script 14 — Fetch soil profiles and build DN.SOL
File: 14_fetch_soil_and_build_sol.py
This script is the soil counterpart to the weather pipeline.
It queries the MSU Decision Support and Informatics Soil Query API for each
of the 60 Denmark grid centroids and assembles a complete DSSAT soil library
file DN.SOL.
The API
The API is hosted by Michigan State University's Decision Support and Informatics group:
https://dsiweb.cse.msu.edu/soil-query-api/soil
It holds 1,984,797 global soil profiles sourced from ISRIC SoilGrids combined
with the HC27 pedotransfer functions. A GET request with lat, lon, and
format=json returns a single nearest-match profile.
GET /soil-query-api/soil?lat=54.751&lon=8.750&format=json
Example response (truncated):
{
"profile": {
"id": "DE01825304",
"location": { "lat": 54.792, "lon": 8.708, "country_code": "DE" },
"site": {
"country_code_alpha3": "DEU",
"scs_family": "HC_GEN0011",
"texture": "Loam",
"max_depth_cm": 200
},
"properties": {
"scom": "BK", "salb": 0.1, "slu1": 6.0, "sldr": 0.5,
"slro": 75.0, "slnf": 1.0, "slpf": 1.0,
"smhb": "SA001", "smpx": "SA001", "smke": "SA001"
},
"layers": [
{ "slb": 5, "slmh": "A", "slll": 0.115, "sdul": 0.245, "ssat": 0.395,
"srgf": 1.0, "ssks": 1.05, "sbdm": 1.26, "sloc": 4.01,
"slcl": 18.87, "slsi": 40.46, "slcf": null, "slni": 0.12,
"slhw": 6.13, "slhb": null, "scec": 18.4, "sadc": null },
...5 more layers to 200 cm
],
"metadata": { "source": "ISRIC soilgrids + HC27", "distance_km": 5.39 }
}
}
The API also supports format=sol, which returns the profile already
formatted as a DSSAT SOL block. The script uses format=json instead so it
can assign Denmark-specific profile IDs (DK0001 to DK0060) and add a
proper file header.
About the API — source code and data origin
The full API source code is publicly available at:
https://github.com/eusojk/soil-query
It is a Rust application built with Axum and SQLite, deployed on Railway.
The underlying data is from the 2015 Harvard Dataverse publication:
Han, Ines, and Koo — Global High-Resolution Soil Profile Database for Crop Modeling Applications
The database holds approximately 2 million profiles from 225 countries at roughly 10 km resolution, combining ISRIC SoilGrids with HC27 pedotransfer functions. Spatial lookups use an R-tree index and Haversine distance. Typical query latency is under 50 milliseconds.
The three API endpoints are:
| Endpoint | What it does |
|---|---|
GET /health | Returns {"status":"ok","version":"0.1.0","profiles":1984797} |
GET /soil?lat=&lon=&format=json | Returns nearest profile as JSON |
GET /soil?lat=&lon=&format=sol | Returns nearest profile as DSSAT SOL text |
GET /definitions | Returns field abbreviation descriptions |
The format=sol option is convenient for quick inspection of one profile but
returns the API's own profile ID (e.g. DE01825304), not our Denmark grid ID.
The script uses format=json so it can assign the correct DK0001–DK0060
identifiers.
One critical header requirement
A plain Python requests.get without extra headers returns an HTML page
instead of JSON. The CDN protecting the server blocks requests that do not
look like browser traffic.
The fix is two lines:
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0 Safari/537.36"
),
"Referer": "https://dsiweb.cse.msu.edu/",
"Accept": "application/json",
}
Without User-Agent and Referer, the request gets a 406 or an HTML page.
This is not documented — it was found by probing the server directly.
The full script
#!/usr/bin/env python3
"""
14_fetch_soil_and_build_sol.py
-------------------------------
Query the MSU DSI Soil Query API for each Denmark 0.5-degree grid centroid
and assemble a complete DSSAT-format DN.SOL file.
Usage (local Windows):
python 14_fetch_soil_and_build_sol.py
Usage (HPC):
/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python \\
/home/<user>/DEN/denmark/scripts/14_fetch_soil_and_build_sol.py
Inputs:
denmark_grid_05deg_centroids.csv (grid_id, lon, lat — 60 rows)
Outputs:
deliverables/soil/DN.SOL DSSAT soil library
deliverables/soil/soil_fetch_report.csv per-centroid QC report
"""
from __future__ import annotations
import csv
import sys
import time
from pathlib import Path
import requests
# ── Detect HPC vs local ────────────────────────────────────────────────────────
_HPC_CENTROIDS = Path("/home/<user>/DEN/denmark_grid_05deg_centroids.csv")
_LOCAL_CENTROIDS = Path(
r"C:\Users\chich\Downloads\denmark_wth_bundle\denmark_grid_05deg_centroids.csv"
)
if _HPC_CENTROIDS.exists():
CENTROIDS_CSV = _HPC_CENTROIDS
_BASE_DELIVERABLES = Path("/home/<user>/DEN/denmark/deliverables")
elif _LOCAL_CENTROIDS.exists():
CENTROIDS_CSV = _LOCAL_CENTROIDS
_BASE_DELIVERABLES = Path(__file__).resolve().parent.parent / "deliverables"
else:
sys.exit(
"Cannot find denmark_grid_05deg_centroids.csv.\n"
f"Expected at:\n {_HPC_CENTROIDS}\n {_LOCAL_CENTROIDS}"
)
OUT_SOL = _BASE_DELIVERABLES / "soil" / "DN.SOL"
OUT_REPORT = _BASE_DELIVERABLES / "soil" / "soil_fetch_report.csv"
# ── API ────────────────────────────────────────────────────────────────────────
API_URL = "https://dsiweb.cse.msu.edu/soil-query-api/soil"
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0 Safari/537.36"
),
"Referer": "https://dsiweb.cse.msu.edu/",
"Accept": "application/json",
}
REQUEST_TIMEOUT = 30
SLEEP_BETWEEN = 0.6
MAX_RETRIES = 3
MAX_DISTANCE_KM = 50.0
MISSING = -99.0
# ── DSSAT SOL formatting ───────────────────────────────────────────────────────
def _fmt(val, width: int, dec: int) -> str:
"""Right-align a numeric value; substitute -99.0 for None."""
v = MISSING if val is None else float(val)
return f"{v:{width}.{dec}f}"
def build_sol_block(grid_id: int, lon: float, lat: float, profile: dict) -> str:
loc = profile.get("location", {})
site = profile.get("site", {})
props = profile.get("properties", {})
layers = profile.get("layers", [])
meta = profile.get("metadata", {})
profile_id = f"DK{grid_id:04d}"
country_a3 = site.get("country_code_alpha3", "DNK")
texture = site.get("texture", "Unknown")
max_depth = site.get("max_depth_cm", -99)
source = meta.get("source", "MSU DSI API")
api_lat = loc.get("lat", lat)
api_lon = loc.get("lon", lon)
scs_family = site.get("scs_family", "-99")
scom = props.get("scom", "-99")
salb = props.get("salb", MISSING)
slu1 = props.get("slu1", MISSING)
sldr = props.get("sldr", MISSING)
slro = props.get("slro", MISSING)
slnf = props.get("slnf", MISSING)
slpf = props.get("slpf", MISSING)
smhb = props.get("smhb", "SA001")
smpx = props.get("smpx", "SA001")
smke = props.get("smke", "SA001")
lines: list[str] = []
# Profile header
lines.append(
f"*{profile_id:<10} {country_a3:<16} {texture:<8} {int(max_depth):>4}"
f" {source}"
)
# Site
lines.append("@SITE COUNTRY LAT LONG SCS Family")
lines.append(
f" {'-99':<12} {country_a3:<16}"
f" {api_lat:>7.3f} {api_lon:>7.3f} {scs_family}"
)
# Properties
lines.append(
"@ SCOM SALB SLU1 SLDR SLRO SLNF SLPF SMHB SMPX SMKE"
)
lines.append(
f" {scom:>4}"
f"{float(salb):>6.2f}"
f"{float(slu1):>6.2f}"
f"{float(sldr):>6.2f}"
f"{float(slro):>6.2f}"
f"{float(slnf):>6.2f}"
f"{float(slpf):>6.2f}"
f" {smhb:>5}"
f" {smpx:>5}"
f" {smke:>5}"
)
# Layers
lines.append(
"@ SLB SLMH SLLL SDUL SSAT SRGF SSKS SBDM"
" SLOC SLCL SLSI SLCF SLNI SLHW SLHB SCEC SADC"
)
for lyr in layers:
lines.append(
f"{int(lyr.get('slb', 0)):>5} {str(lyr.get('slmh', '-99')):<5}"
f"{_fmt(lyr.get('slll'), 6, 3)}"
f"{_fmt(lyr.get('sdul'), 6, 3)}"
f"{_fmt(lyr.get('ssat'), 6, 3)}"
f"{_fmt(lyr.get('srgf'), 6, 2)}"
f"{_fmt(lyr.get('ssks'), 6, 2)}"
f"{_fmt(lyr.get('sbdm'), 6, 2)}"
f"{_fmt(lyr.get('sloc'), 6, 2)}"
f"{_fmt(lyr.get('slcl'), 6, 2)}"
f"{_fmt(lyr.get('slsi'), 6, 2)}"
f"{_fmt(lyr.get('slcf'), 6, 1)}"
f"{_fmt(lyr.get('slni'), 6, 2)}"
f"{_fmt(lyr.get('slhw'), 6, 2)}"
f"{_fmt(lyr.get('slhb'), 6, 1)}"
f"{_fmt(lyr.get('scec'), 6, 2)}"
f"{_fmt(lyr.get('sadc'), 6, 1)}"
)
return "\n".join(lines)
# ── API fetch with retry ───────────────────────────────────────────────────────
def fetch_profile(lat: float, lon: float) -> dict | None:
params = {"lat": round(lat, 6), "lon": round(lon, 6), "format": "json"}
for attempt in range(1, MAX_RETRIES + 1):
try:
resp = requests.get(
API_URL, params=params, headers=HEADERS, timeout=REQUEST_TIMEOUT
)
if 400 <= resp.status_code < 500:
print(f"HTTP {resp.status_code} (no retry)")
return None
resp.raise_for_status()
return resp.json().get("profile")
except requests.exceptions.Timeout:
print(f"timeout (attempt {attempt}/{MAX_RETRIES})", end=" ")
except requests.exceptions.JSONDecodeError:
print("bad JSON — likely got HTML; check headers")
return None
except requests.exceptions.RequestException as exc:
print(f"{exc} (attempt {attempt}/{MAX_RETRIES})", end=" ")
if attempt < MAX_RETRIES:
time.sleep(2.0 * attempt)
return None
# ── Main ───────────────────────────────────────────────────────────────────────
def main() -> None:
# 1. Read centroids
centroids: list[dict] = []
with open(CENTROIDS_CSV, newline="", encoding="utf-8") as fh:
for row in csv.DictReader(fh):
centroids.append(
{"grid_id": int(row["grid_id"]),
"lon": float(row["lon"]),
"lat": float(row["lat"])}
)
print(f"Loaded {len(centroids)} centroids from {CENTROIDS_CSV.name}")
print(f"Output SOL : {OUT_SOL}")
print(f"Output report: {OUT_REPORT}\n")
# 2. Prepare output directory
OUT_SOL.parent.mkdir(parents=True, exist_ok=True)
# 3. Fetch profiles
sol_blocks: list[str] = []
report_rows: list[dict] = []
for i, c in enumerate(centroids, 1):
gid, lat, lon = c["grid_id"], c["lat"], c["lon"]
print(
f"[{i:>3}/{len(centroids)}] grid={gid:>3} "
f"lat={lat:.5f} lon={lon:.5f} ",
end="", flush=True,
)
profile = fetch_profile(lat, lon)
if profile is None:
print("FAILED — skipped")
report_rows.append(dict(
grid_id=gid, lat=lat, lon=lon,
api_lat=None, api_lon=None,
distance_km=None, n_layers=0,
texture=None, max_depth_cm=None, status="API_FAILED",
))
continue
dist = profile.get("metadata", {}).get("distance_km", 0.0)
n_layers = len(profile.get("layers", []))
texture = profile.get("site", {}).get("texture", "Unknown")
max_depth = profile.get("site", {}).get("max_depth_cm", None)
api_lat = profile.get("location", {}).get("lat", lat)
api_lon = profile.get("location", {}).get("lon", lon)
status = "OK" if dist <= MAX_DISTANCE_KM else "FAR"
print(
f"dist={dist:>6.1f} km layers={n_layers} "
f"texture={texture:<10} [{status}]"
)
sol_blocks.append(build_sol_block(gid, lon, lat, profile))
report_rows.append(dict(
grid_id=gid, lat=lat, lon=lon,
api_lat=round(api_lat, 4), api_lon=round(api_lon, 4),
distance_km=round(dist, 3), n_layers=n_layers,
texture=texture, max_depth_cm=max_depth, status=status,
))
time.sleep(SLEEP_BETWEEN)
# 4. Write DN.SOL
with open(OUT_SOL, "w", encoding="ascii", newline="\n") as fh:
fh.write(
"*SOILS: Denmark ISIMIP 0.5-deg Grid "
"[ISRIC SoilGrids + HC27 via MSU DSI Soil Query API]\n\n"
)
for block in sol_blocks:
fh.write(block)
fh.write("\n\n")
# 5. Write QC report
fields = [
"grid_id", "lat", "lon", "api_lat", "api_lon",
"distance_km", "n_layers", "texture", "max_depth_cm", "status",
]
with open(OUT_REPORT, "w", newline="", encoding="utf-8") as fh:
writer = csv.DictWriter(fh, fieldnames=fields)
writer.writeheader()
writer.writerows(report_rows)
# 6. Summary
n_ok = sum(1 for r in report_rows if r["status"] == "OK")
n_far = sum(1 for r in report_rows if r["status"] == "FAR")
n_failed = sum(1 for r in report_rows if r["status"] == "API_FAILED")
print(f"\n{'='*60}")
print(f"Profiles fetched : {n_ok + n_far} / {len(centroids)}")
print(f" within {MAX_DISTANCE_KM:.0f} km (OK) : {n_ok}")
print(f" beyond {MAX_DISTANCE_KM:.0f} km (FAR) : {n_far}")
print(f"Failed / skipped : {n_failed}")
print(f"SOL file : {OUT_SOL}")
print(f"QC report : {OUT_REPORT}")
if n_far:
print("\nFAR profiles (nearest match > 50 km):")
for r in report_rows:
if r["status"] == "FAR":
print(
f" grid_id={r['grid_id']:>3} "
f"lat={r['lat']:.3f} lon={r['lon']:.3f} "
f"dist={r['distance_km']:.1f} km"
)
print("=" * 60)
if __name__ == "__main__":
main()
Line-by-line breakdown
The path detection block
_HPC_CENTROIDS = Path("/home/<user>/DEN/denmark_grid_05deg_centroids.csv")
_LOCAL_CENTROIDS = Path(
r"C:\Users\chich\Downloads\denmark_wth_bundle\denmark_grid_05deg_centroids.csv"
)
if _HPC_CENTROIDS.exists():
CENTROIDS_CSV = _HPC_CENTROIDS
_BASE_DELIVERABLES = Path("/home/<user>/DEN/denmark/deliverables")
elif _LOCAL_CENTROIDS.exists():
CENTROIDS_CSV = _LOCAL_CENTROIDS
_BASE_DELIVERABLES = Path(__file__).resolve().parent.parent / "deliverables"
else:
sys.exit("Cannot find denmark_grid_05deg_centroids.csv. ...")
The same script runs on both the HPC and a Windows laptop without editing. It checks the HPC path first. If that does not exist it tries the local path. If neither exists it exits immediately with a clear message rather than crashing deep inside the fetch loop.
Path(__file__).resolve().parent.parent navigates from the script file
(scripts/) up two levels to the project root (denmark/) and then into
deliverables/ — so the output paths stay consistent regardless of where you
launch the script from.
The _fmt helper
def _fmt(val, width: int, dec: int) -> str:
v = MISSING if val is None else float(val)
return f"{v:{width}.{dec}f}"
Every soil field in the API response can be null (when a property was not
measured or estimated). DSSAT uses -99.0 as its missing-value code. This
single helper applies that rule uniformly and right-aligns the value in a
fixed-width field, keeping the SOL file columns aligned.
Without this, you would need an if val is None branch for every single
field in every layer. With it, the layer formatter is a single readable block.
The build_sol_block function
This function converts one API response dictionary into a DSSAT SOL profile block. It has four sections matching the DSSAT format exactly.
Profile header line:
lines.append(
f"*{profile_id:<10} {country_a3:<16} {texture:<8} {int(max_depth):>4}"
f" {source}"
)
The * at the start marks a profile header to DSSAT. profile_id is
DK0001 to DK0060 — the DK prefix identifies Denmark and the 4-digit
number matches the grid_id in the centroids CSV and in the WTH filenames.
This means every soil profile can be linked back to its WTH file by number.
Properties line:
lines.append(
f" {scom:>4}"
f"{float(salb):>6.2f}"
f"{float(slu1):>6.2f}"
...
)
SALB is albedo (dimensionless), SLU1 is the upper limit of soil water
in the first layer (mm), SLDR is the drainage rate (fraction per day),
SLRO is the runoff curve number. These site-level properties come directly
from the API and do not need derivation.
Layer lines:
lines.append(
f"{int(lyr.get('slb', 0)):>5} {str(lyr.get('slmh', '-99')):<5}"
f"{_fmt(lyr.get('slll'), 6, 3)}"
f"{_fmt(lyr.get('sdul'), 6, 3)}"
...
)
SLB is the bottom depth of the layer in cm. SLMH is the master horizon
designation (A, AB, B, BC, C etc.). All remaining fields are hydraulic and
chemical properties. The column widths — 5 for SLB, 6 for hydraulic values
— match the DSSAT .SOL specification exactly. Even one character off
causes DSSAT to misread the field.
The fetch_profile function
if 400 <= resp.status_code < 500:
print(f"HTTP {resp.status_code} (no retry)")
return None
resp.raise_for_status()
return resp.json().get("profile")
4xx responses are not retried. A 406 means the server rejected the request format. A 404 means no profile exists near that coordinate (open ocean, for example). Neither is worth retrying.
5xx responses and timeouts raise exceptions, which are caught and retried
up to MAX_RETRIES times with increasing wait between attempts.
resp.json().get("profile") extracts the nested profile object. If the
response structure ever changes and "profile" is missing, this returns
None cleanly instead of raising a KeyError.
The main loop
for i, c in enumerate(centroids, 1):
...
print(
f"[{i:>3}/{len(centroids)}] grid={gid:>3} "
f"lat={lat:.5f} lon={lon:.5f} ",
end="", flush=True,
)
profile = fetch_profile(lat, lon)
...
time.sleep(SLEEP_BETWEEN)
end="" keeps the cursor on the same line so the result (dist=... [OK])
prints alongside the coordinate it refers to, giving a compact one-line-per-
centroid log.
flush=True forces the output buffer to write immediately. Without it, when
running in a Slurm job the log file shows nothing until the script finishes.
time.sleep(SLEEP_BETWEEN) waits 0.6 seconds between every request. The API
serves 1.9 million profiles to many users. Being polite costs you 36 seconds
on 60 centroids and keeps the service available for everyone.
Writing the SOL file
with open(OUT_SOL, "w", encoding="ascii", newline="\n") as fh:
fh.write(
"*SOILS: Denmark ISIMIP 0.5-deg Grid "
"[ISRIC SoilGrids + HC27 via MSU DSI Soil Query API]\n\n"
)
for block in sol_blocks:
fh.write(block)
fh.write("\n\n")
encoding="ascii" and newline="\n" are the same two rules as Script 08.
DSSAT reads SOL files as plain ASCII with Unix line endings. Writing on
Windows with the default settings would produce \r\n endings that confuse
DSSAT's fixed-width parser.
The *SOILS: line is the file-level header. DSSAT skips it but it tells a
human reader what is in the file, where the data came from, and what
pedotransfer approach was used.
Two blank lines ("\n\n") after each block separate profiles clearly.
SOL format strings verified against the Rust source
Reading the upstream Rust source at
crates/soil-query/src/parser.rs reveals the exact format strings used by
to_sol_format(). The Python script translates these directly.
Profile header line:
#![allow(unused)] fn main() { // Rust format!("*{:<14} {:<10} {:>10} {:>5} {}\n", id, country_a3, texture, depth, source) }
# Python equivalent
f"*{profile_id:<14} {country_a3:<10} {texture:>10} {int(max_depth):>5} {source}"
Site data line:
#![allow(unused)] fn main() { // Rust — note: uses 2-letter location.country_code, NOT 3-letter country_code_alpha3 format!(" -99 {:>2} {:>11.3} {:>9.3} {}\n", location.country_code, lat, lon, scs_family) }
# Python equivalent — country_2 = loc.get("country_code"), NOT country_a3
f" -99 {country_2:>2} {api_lat:>11.3f} {api_lon:>9.3f} {scs_family}"
Properties line:
#![allow(unused)] fn main() { // Rust format!(" {} {:5.2} {:5.2} {:5.2} {:5.2} {:5.2} {:5.2} {} {} {}\n", scom, salb, slu1, sldr, slro, slnf, slpf, smhb, smpx, smke) }
Layer line — two different formatters:
#![allow(unused)] fn main() { // Rust — fmt_layer_value (3 decimals) only for SLLL, SDUL, SSAT let fmt_layer_value = |opt: Option<f64>| match opt { Some(val) => format!("{:5.3}", val), None => "-99.0".to_string(), // ← 5-char literal, NOT -99.00 }; // Rust — fmt_optional (2 decimals) for all other layer fields let fmt_optional = |opt: Option<f64>| match opt { Some(val) => format!("{:5.2}", val), None => "-99.0".to_string(), // ← 5-char literal, NOT -99.00 }; }
# Python equivalent
MISSING_STR = "-99.0" # must be 5 chars — "-99.00" breaks DSSAT
def _fmt_water(val) -> str: # SLLL, SDUL, SSAT only
return MISSING_STR if val is None else f"{float(val):5.3f}"
def _fmt(val) -> str: # everything else
return MISSING_STR if val is None else f"{float(val):5.2f}"
The critical insight is that None maps to the string "-99.0", not to
f"{-99.0:5.2f}" which would produce "-99.00" — six characters. That
extra character shifts every column to the right and causes DSSAT to misread
every value in the layer.
Which fields get 3 decimals vs 2:
| Field | Decimals | Reason |
|---|---|---|
| SLLL, SDUL, SSAT | 3 | Volumetric water content needs 3 sig figs (e.g. 0.115) |
| SRGF, SSKS, SBDM, SLOC, SLCL, SLSI, SLCF, SLNI, SLHW, SLHB, SCEC, SADC | 2 | 2 sig figs sufficient |
Country code confusion — why there are two:
country_a3 = site.get("country_code_alpha3", "DNK") # 3-letter → profile header
country_2 = loc.get("country_code", "DK") # 2-letter → site data line
The Rust SiteProperties.country_code_alpha3 is 3-letter (from the SOL
header line). The Location.country_code is 2-letter ISO (from the database
record). The Rust to_sol_format() uses:
self.site.country_code_alpha3on the profile header lineself.location.country_codeon the site data line
Denmark centroids often resolve to profiles stored as DE (Germany) because
the database does not store many profiles inside Denmark's borders and the
nearest land profile may be across the German border. That is expected — the
2-letter code reflects where the profile record came from, not where we
queried.
What the output looks like
DN.SOL (first two profiles shown):
*SOILS: Denmark ISIMIP 0.5-deg Grid [ISRIC SoilGrids + HC27 via MSU DSI Soil Query API]
*DK0001 DNK Loam 200 ISRIC soilgrids + HC27
@SITE COUNTRY LAT LONG SCS Family
-99 DNK 54.792 8.708 HC_GEN0011
@ SCOM SALB SLU1 SLDR SLRO SLNF SLPF SMHB SMPX SMKE
BK 0.10 6.00 0.50 75.00 1.00 1.00 SA001 SA001 SA001
@ SLB SLMH SLLL SDUL SSAT SRGF SSKS SBDM SLOC SLCL SLSI SLCF SLNI SLHW SLHB SCEC SADC
5 A 0.115 0.245 0.395 1.00 1.05 1.26 4.01 18.87 40.46 -99.0 0.12 6.13 -99.0 18.40 -99.0
15 A 0.126 0.256 0.399 0.85 0.89 1.28 3.40 20.69 39.68 -99.0 0.09 6.20 -99.0 16.10 -99.0
30 AB 0.140 0.271 0.403 0.70 0.72 1.31 2.59 23.02 38.58 -99.0 0.07 6.29 -99.0 15.60 -99.0
60 BA 0.153 0.284 0.408 0.50 0.59 1.36 1.66 25.26 37.26 -99.0 0.06 6.41 -99.0 16.30 -99.0
100 B 0.153 0.282 0.407 0.38 0.60 1.42 0.96 25.27 36.68 -99.0 0.05 6.55 -99.0 16.50 -99.0
200 BC 0.142 0.270 0.402 0.05 0.72 1.47 0.55 23.51 36.27 -99.0 0.05 6.73 -99.0 16.40 -99.0
*DK0002 DNK Sandy loam 200 ISRIC soilgrids + HC27
...
soil_fetch_report.csv (first rows):
| grid_id | lat | lon | api_lat | api_lon | distance_km | n_layers | texture | max_depth_cm | status |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 54.75077 | 8.75 | 54.792 | 8.708 | 5.391 | 6 | Loam | 200 | OK |
| 2 | 54.75077 | 9.25 | 54.792 | 9.247 | 3.263 | 6 | Sandy loam | 200 | OK |
What the DSSAT soil fields mean
| Field | Full name | Unit |
|---|---|---|
SLLL | Lower limit (wilting point) | cm³/cm³ |
SDUL | Drained upper limit (field capacity) | cm³/cm³ |
SSAT | Saturation upper limit | cm³/cm³ |
SRGF | Root growth factor | 0–1 |
SSKS | Saturated hydraulic conductivity | cm/h |
SBDM | Bulk density | g/cm³ |
SLOC | Organic carbon | % |
SLCL | Clay content | % |
SLSI | Silt content | % |
SLCF | Coarse fragments | % |
SLNI | Total nitrogen | % |
SLHW | pH in water | — |
SLHB | pH in buffer | — |
SCEC | Cation exchange capacity | cmol/kg |
SALB | Albedo | fraction |
SLRO | Runoff curve number | — |
How to link soil profiles to WTH files
The profile ID DK0001 in DN.SOL corresponds to grid_id = 1 in the
centroids CSV. That same grid_id produces WTH filename prefix 0001:
00017144.WTH ← zone 1, start year 71, 44 years
And in the WTH file header:
@ INSI ...
1 54.95 ...
The INSI value 1 matches the zone. In the DSSAT experiment file, you
link WSTA = 0001 (the weather station) and SOIL_ID = DK0001 (the soil
profile) to run a simulation at Denmark grid cell 1.
Running the script
On the HPC:
GEO_PY="/home/<user>/.conda/envs/mamba_env/envs/geo_env/bin/python"
$GEO_PY /home/<user>/DEN/denmark/scripts/14_fetch_soil_and_build_sol.py
Locally (Windows):
python 14_fetch_soil_and_build_sol.py
Expected console output:
Loaded 60 centroids from denmark_grid_05deg_centroids.csv
Output SOL : .../deliverables/soil/DN.SOL
Output report: .../deliverables/soil/soil_fetch_report.csv
[ 1/ 60] grid= 1 lat=54.75077 lon=8.75000 dist= 5.4 km layers=6 Loam [OK]
[ 2/ 60] grid= 2 lat=54.75077 lon=9.25000 dist= 3.3 km layers=6 Sandy loam [OK]
...
[ 60/ 60] grid= 60 lat=57.75086 lon=10.75000 dist= 8.1 km layers=6 Loam [OK]
============================================================
Profiles fetched : 60 / 60
within 50 km (OK) : 58
beyond 50 km (FAR) : 2
Failed / skipped : 0
SOL file : .../deliverables/soil/DN.SOL
QC report : .../deliverables/soil/soil_fetch_report.csv
============================================================
What FAR means and what to do about it
A FAR flag means the nearest profile in the database was more than 50 km
from the Denmark centroid. This typically happens for grid cells that are
mostly open sea — the API found the nearest land-based soil record but that
record belongs to a coastal point or an island some distance away.
For those cells, you have three choices:
- Accept the profile — the soil at 60 km distance may still be representative of the coastal fringe of Denmark.
- Use the nearest
OKprofile from an adjacent grid cell instead. - Mark those cells as excluded and do not run DSSAT simulations for them.
The soil_fetch_report.csv lists all FAR cases with their actual distance,
so you can make an informed decision for each one.
Why some centroids fall in water
If you plot the 60 Denmark grid cells on a map you will notice that several centroids land in open sea — most visibly the cells around Bornholm (grid IDs 10 and 11, lon ~14.75–15.25 E, lat ~54.75 N) and the Kattegat cells (IDs 22 and 23, lon ~14.75–15.25 E, lat ~55.25 N).
This is not a bug. It is a direct consequence of how the grid was built.
How cells with water centroids enter the grid
denmark_grid_05deg.shp is built by overlaying a regular 0.5-degree global
grid on Denmark's administrative and territory boundary, then keeping every
cell that has any geometric overlap with Danish territory — even a single
pixel of coastline is enough to include the whole 0.5-degree polygon.
Bornholm island sits at roughly 14.9 E, 55.1 N, surrounded by open Baltic
Sea. Its 0.5-degree grid polygon is approximately 55 km × 35 km. The centroid
of that large polygon falls in the sea west of the island. But because the
polygon does touch Bornholm's coastline, the cell is included and gets a
grid_id. The same logic applies to any other cell that clips a sliver of
Danish coastal territory or outlying island.
Grid polygon (0.5° × 0.5°)
┌─────────────────────────────────┐
│ │
│ Open Baltic Sea │
│ │
│ × ← centroid in water │
│ │
│ ███ Bornholm │ ← overlap keeps this cell
└─────────────────────────────────┘
The centroid is just the geometric centre of the polygon. It does not know whether it is over land or sea.
How the pipeline handled water-centroid cells at each stage
Stage 1 — Climate (WTH files)
ISIMIP3b climate grids cover the entire globe, land and ocean alike.
Every 0.5-degree cell in the world has values for temperature, precipitation,
radiation, and wind. When 07_extract_zonal_stats_parquet.py does its
point-in-polygon extraction, it finds NetCDF grid points that fall inside the
polygon — and for a mostly-sea polygon it averages ocean climate signal. A
complete parquet file is produced and 08_generate_wth_files.py writes a
valid .WTH file. DSSAT reads it without complaint.
The ocean-influenced climate signal for Bornholm cells is actually reasonable: Bornholm's climate is genuinely maritime — warmer winters, cooler summers, higher humidity than inland Denmark. Using the sea-cell climate to represent the island is defensible.
Stage 2 — Soil (SOL file)
The MSU Soil API performs a nearest-neighbour spatial lookup. When the query
point is in the sea, the API walks outward through its R-tree index until it
finds a real soil profile on land. For cells 10 and 11 it found profiles
24–30 km away — still within the 50 km acceptance threshold — and returned
them as OK. The assigned profiles come from a land pixel near the coast or
on Bornholm itself, so they are geologically reasonable.
The soil_fetch_report.csv records the actual distance_km between your
query point and the profile that was returned:
grid_id,lat,lon,api_lat,api_lon,distance_km,...,status
10,54.75077,14.74999,54.958,15.042,29.67,...,OK
11,54.75077,15.24999,54.958,15.125,24.39,...,OK
A distance of 25–30 km means the profile came from a land location about that far from your water centroid — entirely plausible for a coastal island.
What this means for your DSSAT runs
Having a cell in the dataset does not mean you must simulate crops there. The table below gives a practical decision framework:
| Cell type | Verdict | Rationale |
|---|---|---|
| Centroid in water, polygon covers a real island (e.g. Bornholm) | Keep | The polygon contains farmland; climate and soil are representative |
| Centroid in water, polygon covers a narrow coastal strip | Review | Climate is ocean-dominated; check whether the strip has agricultural land |
| Centroid is mostly open sea, minimal land | Exclude | Running DSSAT there produces output but it has no agronomic meaning |
A clean way to filter after the fact: cross the 60 cells with an agricultural land-cover raster (e.g. ESA CCI land cover or MODIS MCD12Q1) and compute the cropland fraction per cell. Cells where cropland fraction is below a threshold (for example 5 %) can be dropped from the ensemble analysis without affecting the representativeness of the results for actual Danish farmland.
The pipeline deliberately includes all overlapping cells and leaves this scientific decision to the analyst. The numbered scripts produce outputs; the analyst decides which outputs are meaningful.
Beginner takeaway
The full pipeline is more than a collection of scripts.
It is a chain of decisions:
- which variables to download
- which units to convert to
- how to handle spatial averaging
- how to handle missing elevation
- how to format the output so DSSAT can read it
Each script in the chain has one clear job.
When something breaks, the numbering tells you where to look. The logs from each Slurm job tell you what went wrong. The validate script tells you whether the outputs are complete.
That combination of numbered scripts, explicit logging, and a final count check is the discipline that makes a pipeline reproducible rather than just runnable once.
Denmark ISIMIP Technical Record
This chapter is the detailed project log for the Denmark ISIMIP work.
It is meant to answer the practical questions a future researcher will ask:
- where did the climate data come from
- what was already present
- which scripts were used
- what each script produced
- what broke and how it was repaired
- how the final weather and soil products were validated
- how the results were proven in a local DSSAT run
This is not just a conceptual workflow. It is the concrete record of the work that was actually carried out across the HPC project tree and the local DSSAT installation.
Project locations
Remote HPC project
/home/<user>/DEN/home/<user>/DEN/denmark/home/<user>/DEN/soil_data
Ethiopia soil reference on HPC
/home/<user>/Ethiopia/home/<user>/Ethiopia/soil_data
Local DSSAT environment
C:\DSSAT48C:\DSSAT48\PeanutC:\DSSAT48\WeatherC:\DSSAT48\Soil
Local documentation workspace
C:\Users\chich\Downloads\cropmodelling
The main objective
The practical Denmark objective was to take ISIMIP climate inputs and build a reproducible chain ending in DSSAT-ready weather files, while also testing a first Denmark soil linkage strongly enough to prove the files could run in a real DSSAT example.
By the end of the work, the project had:
- processed Denmark climate NetCDF files in crop-model units
- parquet zonal outputs
- TAV and AMP outputs
- DSSAT
.WTHfiles - repaired DEM-based elevation metadata
- a Denmark soil core sampled from ISIMIP soil input data
- a first Denmark
.SOLfile for local smoke testing - a local DSSAT experiment proving Denmark weather and soil could execute
Climate data source
The Denmark climate workflow was based on ISIMIP data already staged under the
model folders in /home/<user>/DEN.
The project worked with five climate models:
GFDL_ESM4IPSL-CM6A-LRMPI-ESM1-2-HRMRI-ESM2-0UKESM1-0-LL
The scenario set used in the Denmark pipeline was:
historicalssp126ssp370ssp585
The climate grid logic was based on the ISIMIP global 0.5 degree grid, which
is why the Denmark project also used a 0.5 degree Denmark grid.
Soil data source
The soil source used for the Denmark test build came from ISIMIP3a soil input data.
Two upstream soil products were considered:
ISIMIP3a/InputData/geo_conditions/soil/hwsd_soil_data_all_land.ncISIMIP3a/InputData/geo_conditions/soil/hwsd_soil_data_on_cropland.nc
For crop modeling, the preferred source became the cropland-focused product:
ISIMIP3a/InputData/geo_conditions/soil/hwsd_soil_data_on_cropland.nc
The reasoning was simple:
all_landrepresents dominant soil across land in a grid cellon_croplandrepresents soil conditions more relevant to agricultural use
The Denmark soil core was therefore built from the cropland product and the all-land product was kept only as a comparison reference.
What already existed before restructuring
Before the wrapper was created, Denmark already had many key ingredients in place:
- shapefiles under
/home/<user>/DEN/shapefiles - DEM tiles under
/home/<user>/DEN/COP30 - processed climate files under each model folder
- legacy zonal outputs under
/home/<user>/DEN/zonal_stats
Important existing shapefiles included:
/home/<user>/DEN/shapefiles/denmark_grid_05deg.shp/home/<user>/DEN/shapefiles/denmark_extent.shp
The large climate data were not moved during cleanup because copying them would have been slow, heavy, and unnecessary on HPC.
The wrapper created for the Denmark project
To make the work shareable and reproducible, a clean wrapper was created at:
/home/<user>/DEN/denmark
Inside that wrapper, the project was organized into:
/home/<user>/DEN/denmark/scripts/home/<user>/DEN/denmark/docs/home/<user>/DEN/denmark/config/home/<user>/DEN/denmark/deliverables/home/<user>/DEN/denmark/logs
This wrapper separated:
- heavy source inputs already present in the parent project
- workflow logic
- validation artifacts
- final deliverables for sharing
Denmark weather workflow scripts
The numbered weather workflow scripts in the Denmark wrapper were:
/home/<user>/DEN/denmark/scripts/01_download_isimip.sh/home/<user>/DEN/denmark/scripts/02_prepare_shapefiles_and_grid.sh/home/<user>/DEN/denmark/scripts/03_create_denmark_grid.py/home/<user>/DEN/denmark/scripts/04_prepare_dem.sh/home/<user>/DEN/denmark/scripts/05_merge_clip_convert_climate.sh/home/<user>/DEN/denmark/scripts/06_compute_tav_amp.py/home/<user>/DEN/denmark/scripts/07_extract_zonal_stats_parquet.py/home/<user>/DEN/denmark/scripts/08_generate_wth_files.py/home/<user>/DEN/denmark/scripts/09_validate_outputs.py/home/<user>/DEN/denmark/scripts/10_run_model_scenario_weather.sh/home/<user>/DEN/denmark/scripts/11_submit_denmark_weather.sh/home/<user>/DEN/denmark/scripts/12_validate_weather_and_soil.py
What each Denmark weather script did
01_download_isimip.sh
This is the top-level climate acquisition stage.
In practice, the Denmark project was able to proceed without a fresh bulk
download because the climate data were already present under the model folders
inside /home/<user>/DEN.
02_prepare_shapefiles_and_grid.sh
This represents the shapefile staging step, linking the Denmark extent and the
Denmark 0.5 degree climate grid.
03_create_denmark_grid.py
This is the explicit grid-construction step used to formalize the Denmark gridded geometry that later drives zonal extraction and WTH station creation.
04_prepare_dem.sh
This step prepared the DEM support raster used for WTH elevation values.
The key merged raster produced by this logic was:
/home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif
05_merge_clip_convert_climate.sh
This step transformed climate inputs into processed Denmark NetCDF files ready for crop-model use.
Its logic included:
- merging yearly or chunked files
- clipping to Denmark extent
- converting variables into DSSAT-appropriate units
The target units were:
prtomm/daytastodegCtasmaxtodegCtasmintodegCrsdstoMJ m-2 day-1
06_compute_tav_amp.py
This script computed the two climatological values needed in DSSAT weather headers:
TAVAMP
Outputs were written under:
/home/<user>/DEN/denmark/deliverables/tav_amp/{MODEL}/{SCENARIO}/{PERIOD}/tav_amp.csv
07_extract_zonal_stats_parquet.py
This step converted processed climate grids into zone-level parquet outputs following the same general logic used in the Ethiopia workflow.
Outputs were written under:
/home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/pr/pr.parquet/home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/tasmax/tasmax.parquet/home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/tasmin/tasmin.parquet/home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/{MODEL}/{SCENARIO}/{PERIOD}/rsds/rsds.parquet
08_generate_wth_files.py
This script assembled final DSSAT weather files by combining:
- rainfall from
pr - maximum temperature from
tasmax - minimum temperature from
tasmin - solar radiation from
rsds TAVAMP- DEM-based elevation
Outputs were written under:
/home/<user>/DEN/denmark/deliverables/wth/{MODEL}/{SCENARIO}/{PERIOD}
09_validate_outputs.py
This script was used for quick production counts of:
- parquet files
- WTH files
- TAV and AMP files
10_run_model_scenario_weather.sh
This was the scenario-level orchestration script. It allowed a single model-scenario combination to be rebuilt cleanly.
11_submit_denmark_weather.sh
This was the cluster batch-submission wrapper used to launch the Denmark weather production across the full model-scenario matrix.
12_validate_weather_and_soil.py
This script validated two things together:
- the final weather files
- the Denmark soil core
It wrote:
/home/<user>/DEN/soil_data/reports/wth_validation.csv/home/<user>/DEN/soil_data/reports/soil_core_validation.csv/home/<user>/DEN/soil_data/reports/weather_soil_validation_summary.json
The main weather processing chain
The Denmark weather chain was:
- use the processed climate files already staged in
/home/<user>/DEN/{MODEL}/processed - confirm and repair unit conversions where needed
- prepare a merged Denmark DEM
- compute
TAVandAMP - generate parquet zonal outputs
- generate DSSAT
.WTHfiles - validate unit metadata, counts, and final file sanity
Major problems and repairs during the weather build
Batch execution and dependency issues
Early submission attempts left jobs pending behind dependencies or failing for environment reasons. The wrapper scripts were tightened so they launched with a more explicit environment and clearer control flow.
Corrupted processed climate files
One real data failure affected MPI-ESM1-2-HR / ssp370.
Two processed files were corrupted and had to be regenerated:
/home/<user>/DEN/MPI-ESM1-2-HR/processed/MPI-ESM1-2-HR_ssp370_tasmin_denmark.nc/home/<user>/DEN/MPI-ESM1-2-HR/processed/MPI-ESM1-2-HR_ssp370_rsds_denmark.nc
After they were rebuilt from raw source, the missing Denmark WTH set completed successfully.
Mislabelled temperature metadata
Unit values across the processed climate outputs were checked after production.
The physical values were correct, but two files had temperature metadata still labelled as Kelvin even though the values were already Celsius. Those metadata labels were fixed so the final processed set was internally consistent.
Zero-elevation WTH headers
Many coastal or water-dominated Denmark grid cells initially ended up with
ELEV = 0.
The elevation logic in /home/<user>/DEN/denmark/scripts/08_generate_wth_files.py
was improved so the fallback order became:
- DEM sample at land-intersection centroid
- land-polygon mean
- land-polygon max
- nearest grid cell with valid land-based elevation
After this repair, the final Denmark WTH outputs no longer carried zero elevations in the production set.
Final Denmark weather deliverables
The weather production finished with:
80parquet files20TAV/AMP files1160WTH files
This corresponds to:
5models4scenarios20model-scenario combinations58WTH files per combination
Important deliverables included:
/home/<user>/DEN/denmark/deliverables/wth/home/<user>/DEN/denmark/deliverables/parquet_zonal_stats/home/<user>/DEN/denmark/deliverables/tav_amp/home/<user>/DEN/denmark/deliverables/validation/processed_units_validation.csv/home/<user>/DEN/denmark/deliverables/validation/denmark_dem_merged.tif/home/<user>/DEN/denmark_wth_bundle.tar.gz
Processed climate validation
All processed Denmark climate files were checked across:
prtastasmaxtasminrsds
The final validation result was:
100/100processed NetCDF files passed
The expected final units were:
pr:mm/daytas:degCtasmax:degCtasmin:degCrsds:MJ m-2 day-1
Denmark soil build
The soil side did not begin as a full DSSAT .SOL project. It began with a
grid-aligned soil core sampled from ISIMIP soil input data.
The local and remote staging paths used for the Denmark soil work included:
/home/<user>/DEN/soil_data/raw/home/<user>/DEN/soil_data/processed/home/<user>/DEN/soil_data/reports
First downloaded soil source
The first downloaded source was:
/home/<user>/DEN/soil_data/raw/hwsd_soil_data_all_land.nc
This was used to inspect the available fields and to test whether the Denmark
grid aligned naturally with the ISIMIP 0.5 degree soil grid.
Preferred soil source for crop modeling
The final preferred source became:
/home/<user>/DEN/soil_data/raw/hwsd_soil_data_on_cropland.nc
because it better represents agricultural land within each cell.
Soil properties extracted from ISIMIP
The core ISIMIP soil variables used in the Denmark build were:
texture_classmu_globalsoil_phsoil_caco3bulk_densitycec_soilocroot_obstaclesimpermeable_layerawcsandsiltclaygravelecebs_soilissoil
Denmark soil outputs produced on HPC
The two main processed Denmark soil-core tables created were:
/home/<user>/DEN/soil_data/processed/denmark_soil_core_all_land.csv/home/<user>/DEN/soil_data/processed/denmark_soil_core_cropland.csv
The main reports created were:
/home/<user>/DEN/soil_data/reports/denmark_soil_core_summary.csv/home/<user>/DEN/soil_data/reports/denmark_soil_core_cropland_summary.csv/home/<user>/DEN/soil_data/reports/denmark_soil_all_land_vs_cropland.csv
The cropland-core build became the preferred source of truth.
Denmark soil-core validation status
The grid-aligned Denmark cropland core reached:
60rows55direct land-centroid matches5nearest-valid fallbacks for tiny coastal cells0missingsoil_ph0missingbulk_density0missingawc
This means the soil core was usable as an upstream Denmark soil backbone, even though it was not yet a full layered DSSAT soil library by itself.
Weather and soil joint validation
The combined Denmark weather and soil validation stage produced:
/home/<user>/DEN/soil_data/reports/wth_validation.csv/home/<user>/DEN/soil_data/reports/soil_core_validation.csv/home/<user>/DEN/soil_data/reports/weather_soil_validation_summary.json
The main outcome was:
1160/1160WTH files parsed successfully60/60Denmark soil-core rows passed the sanity checks
That proved:
- the weather files were structurally usable
- the soil core was numerically sane and aligned to the Denmark grid
It did not yet prove a final Denmark .SOL because the local DSSAT smoke test
had not been performed at that stage.
Local DSSAT smoke-test build
To prove that Denmark weather and soil could actually be executed by DSSAT, a
local smoke test was built in the Windows DSSAT installation at C:\DSSAT48.
Baseline example used for local DSSAT
The first step was to run the unmodified peanut example:
C:\DSSAT48\Peanut\UFGA7901.PNX
This baseline run succeeded using:
C:\DSSAT48\DSCSM048.EXE MPN A UFGA7901.PNX
That confirmed the local DSSAT installation was healthy before the Denmark files were introduced.
Local Denmark staging files
The Denmark local smoke-test staging area was:
C:\Users\chich\Downloads\DEN\local_test\sourceC:\Users\chich\Downloads\DEN\local_test\reports
The remote Denmark files copied locally for this setup included:
- a validated Denmark WTH source file from HPC
- the Denmark cropland soil core CSV
Local preparation script
The main local setup script written for this proof was:
C:\Users\chich\Downloads\DEN\denmark\scripts\13_prepare_local_dssat_smoke_test.py
This script:
- created a local single-year DSSAT weather file from the Denmark WTH export
- built a first Denmark soil library at
C:\DSSAT48\Soil\DN.SOL - synchronized Denmark profiles into
C:\DSSAT48\Soil\SOIL.SOL - created a copied DSSAT peanut experiment linked to Denmark weather and soil
- wrote a local setup manifest
Local DSSAT smoke-test outputs
The key local outputs were:
C:\DSSAT48\Weather\DK017101.WTHC:\DSSAT48\Soil\DN.SOLC:\DSSAT48\Peanut\UFGA7103.PNXC:\Users\chich\Downloads\DEN\local_test\reports\local_dssat_smoke_setup.json
Local DSSAT test logic
The test was intentionally conservative:
- keep the stock peanut example structure
- change only the weather and soil linkage to Denmark
- shift the experiment dates into a Denmark-compatible weather year
- refine the setup until DSSAT accepted the files and ran
What happened in the local Denmark run
The local Denmark-linked peanut example eventually ran successfully.
That proved:
- the Denmark weather file was readable by DSSAT
- the Denmark soil file was readable by DSSAT
- DSSAT could execute an experiment using Denmark weather and the Denmark soil profile
What the local smoke test did not prove
The smoke test used peanut because the objective was technical proof of file execution, not agronomic realism.
The final run still showed freeze-related crop failure later in the season, which is expected for a peanut example under Denmark conditions.
So the smoke test proved technical integration, not that peanut is an appropriate Denmark production crop.
Why the Ethiopia project still matters
The Ethiopia soil project remains the best architectural reference for what the Denmark soil side should become.
The key confirmed Ethiopia files were:
/home/<user>/Ethiopia/soil_data/docs/soil_first_workflow.md/home/<user>/Ethiopia/soil_data/docs/soil_hydraulics_runoff.md/home/<user>/Ethiopia/soil_data/processed/soil_with_slro.csv/home/<user>/Ethiopia/soil_data/processed/horizon.csv/home/<user>/Ethiopia/soil_data/dssat/ET.SOL/home/<user>/Ethiopia/soil_data/stics/sols.xml/home/<user>/Ethiopia/soil_data/apsim/soils/*.json
The design lesson is that Denmark should eventually have:
- one authoritative processed soil backbone
- model-specific exports derived from that backbone
- QC notes and fallback logic documented explicitly
Recommended Denmark soil target structure
The future Denmark soil structure should look like:
/home/<user>/DEN/soil_data/raw/home/<user>/DEN/soil_data/processed/home/<user>/DEN/soil_data/docs/home/<user>/DEN/soil_data/reports/home/<user>/DEN/soil_data/dssat/home/<user>/DEN/soil_data/apsim/home/<user>/DEN/soil_data/stics
At minimum, the authoritative processed tables should eventually include:
final_soil_profile_all_points.csvhydraulic_parameters_from_profile.csvsoil_with_slro.csvhorizon.csv
And the model-specific outputs should include:
DN.SOL- APSIM soil exports
- STICS soil exports
Final result summary
The Denmark climate and soil work reached a point where the pipeline is now fully traceable.
The project now has:
- clearly identified upstream climate and soil sources
- a named HPC wrapper structure
- numbered workflow scripts
- documented repairs
- validated climate outputs
- validated grid-aligned soil core outputs
- a first Denmark DSSAT soil library
- a successful local DSSAT smoke test linking Denmark weather and soil
The most important practical conclusion is this:
the Denmark inputs are no longer just files sitting in folders. They are now a documented, validated workflow with named scripts, named outputs, and a clear record from source data to runnable DSSAT proof.
DSSAT Soil Files Explained
Soil files are where many beginners start to feel overwhelmed.
That is understandable because soil inputs combine:
- physical structure
- hydrology
- chemistry
- initial conditions
The good news is that you do not need to understand all of pedology to work carefully with DSSAT soil files.
What a soil file is trying to represent
A DSSAT soil file tries to describe the rooting environment layer by layer.
That includes questions like:
- How deep can roots grow?
- How much water can the soil hold?
- How easily is water extracted?
- How dense is the soil?
- How much organic matter or carbon is present?
- What initial water and nitrogen conditions existed?
Why layers matter
Soils are not uniform from the surface to depth.
One layer may be:
- sandy
- shallow
- low in organic matter
- easy to dry out
Another may be:
- denser
- finer textured
- higher in water-holding capacity
DSSAT therefore uses layered soil descriptions rather than one single number for "the soil."
Common soil concepts beginners should know
Bulk density
Bulk density is a measure of how compact the soil is.
It affects rooting and water behavior.
Lower limit and drained upper limit
These help define plant-available water.
In practical terms:
- the lower limit is near the dry point where roots struggle to extract water
- the drained upper limit is near field capacity after excess water drains away
The difference between them is a major part of the plant-available water range.
Saturation
This is the water content when pore space is essentially full of water.
Organic carbon or organic matter
These influence soil fertility, structure, and in some workflows nitrogen dynamics.
Rooting depth
Even a well-watered surface may not rescue a crop if effective rooting depth is too small.
Why soil can strongly affect biomass
Soil influences:
- water supply
- nitrogen availability
- root exploration
- stress timing
That means a crop can have correct weather and decent genotype coefficients but still simulate poorly if the soil profile is unrealistic.
Soil file versus experiment-level soil information
There are two related but distinct ideas:
- the general soil profile definition
- experiment-specific initial conditions or soil analysis sections
The soil file usually provides the main profile structure. The experiment file may refine starting conditions for a particular run.
Common beginner problems
Typical issues include:
- copying a soil profile from another site without checking depth or texture
- missing lower layers
- unrealistic hydraulic values
- inconsistent initial water content
- confusion between measured values and estimated defaults
Practical questions to ask about a soil profile
Before trusting a profile, ask:
- Does the depth structure make sense for the site?
- Are the texture or hydraulic values plausible by layer?
- Is rooting depth likely to be limited?
- Are initial water and nitrogen conditions documented?
- Was the profile measured, estimated, or borrowed?
Beginner takeaway
Think of a soil file as the hidden reservoir and root environment beneath the crop.
If weather drives what the atmosphere offers, soil determines what the plant can actually access.
How to Build Soil Profiles for DSSAT
This chapter focuses on workflow.
The goal is to help a beginner move from field or laboratory information to a soil profile that DSSAT can use responsibly.
Start with honesty about where the data came from
A soil profile can be:
- directly measured in the field and laboratory
- partly measured and partly estimated
- borrowed from a similar nearby site
- assembled from published or mapped information
These are not equally strong situations.
Always document which one applies.
A practical build workflow
Step 1: define the site and profile depth
Record:
- site identifier
- coordinates if available
- effective soil depth
- any known restrictive layer
If you do not know how deep roots can realistically explore, note that as an uncertainty rather than hiding it.
Step 2: define soil layers
Choose layers that reflect the real profile as well as the resolution supported by your data.
Typical layer information includes:
- top and bottom depth
- texture or textural class
- bulk density
- hydraulic information
- carbon or organic matter
Step 3: gather hydraulic information
DSSAT needs a reasonable description of water behavior.
This often includes values related to:
- lower water limit
- drained upper limit
- saturation
If these were not measured directly, document how they were estimated.
Pedotransfer functions or standard tables may be used, but they should not be treated as exact field truth.
Step 4: define initial conditions
A soil profile alone is not the whole story.
You also need to think about the starting condition at planting, such as:
- initial soil water
- initial mineral nitrogen
- residue or previous-crop effects if relevant
These may be represented partly in the experiment file rather than only in the main soil profile.
Step 5: check consistency
Common consistency questions include:
- Are layer depths continuous?
- Do hydraulic values increase or decrease in physically impossible ways?
- Is saturation greater than drained upper limit?
- Is drained upper limit greater than lower limit?
- Is bulk density plausible for the texture?
These checks catch many silent data problems.
Step 6: write the DSSAT soil file
Once the profile is internally consistent, write it into the DSSAT format with:
- correct site identifier
- correct layer boundaries
- correct layer variables
Step 7: sanity test in a simulation
Before using the profile for serious calibration, run a simple test and inspect:
- early growth
- stress behavior
- whether the crop appears unrealistically drought-limited or unstressed
If stress patterns look impossible, revisit the soil assumptions before changing genetics.
When borrowing a profile is acceptable
Sometimes the real site profile is unavailable.
Borrowing a profile may be acceptable if:
- the soil type is genuinely similar
- the climate and management context are not wildly different
- the borrowed profile is clearly labeled as provisional
Borrowed soil should be treated as a starting point, not as a final truth.
Common calibration trap
One frequent mistake is trying to fix a soil problem by changing cultivar coefficients.
For example:
- if soil water is too available, biomass may be too high
- if soil water is too restrictive, biomass may be too low
Changing phenology or partitioning coefficients to compensate can make the model less biologically meaningful.
Beginner takeaway
Building a DSSAT soil profile is part measurement, part estimation, and part quality control.
The important thing is not pretending the file is perfect. The important thing is making every assumption visible and internally consistent.
Management Files and Experiment Sections
Weather and soil describe the environment.
Management describes what people did to the crop.
This is one of the most important beginner ideas in crop modeling:
the model does not simulate an abstract crop in abstract weather.
It simulates a managed crop in a specific field context.
What management usually includes
Management information can include:
- planting date
- planting method
- planting density
- row spacing
- fertilizer timing and amount
- irrigation timing and amount
- tillage or residue conditions
- harvest timing or harvest rules
Different projects will use different subsets of these.
Why management matters so much
Two experiments with the same cultivar, soil, and weather can still behave very differently if management differs.
Examples:
- planting two weeks earlier changes temperature and daylength exposure
- higher plant density changes canopy development and competition
- more nitrogen changes growth potential and partitioning
- irrigation changes stress timing
Where management appears in DSSAT
In many DSSAT workflows, management is encoded inside the experiment file rather than one standalone file.
That means sections such as:
- planting details
- fertilizer applications
- irrigation events
- harvest details
are part of the experimental recipe.
Planting information
Planting details often include:
- planting date
- seed or plant population
- row spacing
- planting depth
These variables strongly affect:
- establishment
- canopy development
- competition for light
- development trajectory
Fertilizer information
Fertilizer management usually specifies:
- application date
- nutrient type
- amount applied
- sometimes method or depth
Nitrogen management can strongly influence biomass and partitioning, so these entries should be checked carefully in any calibration workflow.
Irrigation information
Irrigation events tell the model when and how much water was added.
Even if your main scientific question is not about irrigation, getting these events wrong can dramatically change stress patterns.
Harvest information
Harvest sections matter because they define what outcome is being represented and when it is measured.
This becomes especially important when comparing simulated versus observed harvest biomass or stage timing.
Why experiment sections matter for paper reproduction
In published case studies, the experiment file is often where the actual treatment logic lives.
That means understanding a paper's:
- cultivar treatments
- planting dates
- nitrogen levels
- harvest schedule
often requires reading the experiment file closely, not only the article text.
Beginner takeaway
Management is the bridge between "what the environment allowed" and "what the researchers or farmers actually did."
If a simulation seems wrong, never inspect genetics alone. Management often contains the simplest explanation.
Genotype, Cultivar, and Ecotype in DSSAT
These terms are easy to confuse, especially for new modelers.
This chapter explains the practical meaning of each one in a DSSAT workflow.
Why genotype information exists
Weather, soil, and management describe the environment around the crop.
Genotype-related files describe how this crop type responds within that environment.
Without genotype coefficients, the model would have no crop-specific identity.
Species, cultivar, and ecotype
In practical DSSAT usage, beginners often encounter three levels:
- species-level information
- cultivar-level information
- ecotype-level information
The exact details vary by family, but the broad idea is:
Species
Species-level definitions often hold more general crop information shared across many cultivars.
Cultivar
Cultivar coefficients distinguish one named or coded cultivar from another.
These may affect:
- development timing
- partitioning
- morphology
- maturity behavior
Ecotype
Ecotype coefficients often describe broader adaptation patterns related to environmental response, especially development and phenology.
However, not all DSSAT families use ecotype files in the same way, and some do
not use .ECO files at all.
Why this matters for a general wrapper
The original crop-specific wrapper could assume more about file patterns.
A broader omniwrapper cannot assume:
- every crop has
.ECO - every cultivar key is stored under the same column name
- every family matches genotype files the same way
That is why registry logic and family-aware parsing matter in this repository.
What calibration usually changes
In many workflows, calibration focuses on cultivar or ecotype coefficients rather than species-level definitions.
Typical goals may include improving:
- flowering date
- maturity timing
- biomass accumulation
- stem or grain partitioning
- leaf-area development
Common beginner misunderstanding
Beginners sometimes think cultivar coefficients are just labels.
They are not.
They are part of the biological identity of the simulation.
Changing them is changing how the virtual crop behaves.
Why coefficient changes should be disciplined
Because many coefficients interact, it is easy to create a nice-looking fit that has poor biological meaning.
A better approach is to ask:
- Which process is wrong?
- Which parameter family controls that process?
- Is the change biologically plausible?
That is much safer than tuning everything at once.
Beginner takeaway
Genotype files are where the model learns what kind of crop it is simulating.
If weather and soil describe the world around the plant, genotype coefficients describe the plant's built-in behavior inside that world.
Worked Case Study: Weather, Soil, and Management Construction
This chapter is a fully worked teaching example.
It is meant to answer the beginner question:
"If someone hands me raw notes from a field experiment, how do I turn them into the pieces DSSAT actually needs?"
Important note before we start
This is an instructional case study, not a claim that every number below came from a published hemp dataset.
The purpose is to show the workflow clearly and realistically:
- what information is needed
- what assumptions are acceptable
- what should be checked before running DSSAT
The scenario
Imagine an intern receives the following field summary from a hemp trial.
Raw field notes
Site:
- site name:
North Farm Block A - latitude:
29.70 N - longitude:
82.41 W - elevation:
32 m
Crop:
- crop: industrial hemp
- cultivar:
IH Williams - purpose: fiber-focused biomass evaluation
Management notes:
- planting date:
2022-05-12 - row spacing:
0.76 m - target population:
60 plants m-2 - planting depth:
2.5 cm - nitrogen fertilizer:
60 kg N ha-1at planting120 kg N ha-1on2022-06-10
- irrigation:
20 mmon2022-05-1425 mmon2022-06-01
- biomass harvests:
2022-06-202022-07-052022-07-222022-08-15
Weather notes:
- automatic station near field
- daily data available for:
- solar radiation
- minimum temperature
- maximum temperature
- rainfall
Soil notes:
- sandy loam surface
- deeper layer becomes sandy clay loam
- effective rooting depth about
120 cm - bulk density measured for three depth ranges
- field team estimated initial soil water as "moderately moist"
At first glance, this is useful, but it is not yet a DSSAT-ready dataset.
The rest of the chapter shows how to transform it.
Step 1: separate the problem into DSSAT input categories
The raw notes need to be reorganized into four modeling categories:
- weather
- soil
- management
- genotype
For this chapter we focus on the first three, because those are the pieces most new interns struggle to construct from raw notes.
Step 2: build the weather component
What raw weather information must become
For DSSAT, the weather input must become:
- one station definition
- one daily time series
- consistent daily units
- continuous dates across the simulation period
Example weather table before DSSAT formatting
Suppose the station export begins like this:
| Date | SolarRad_MJ_m2 | Tmin_C | Tmax_C | Rain_mm |
|---|---|---|---|---|
| 2022-05-12 | 24.1 | 18.7 | 31.2 | 0.0 |
| 2022-05-13 | 22.9 | 19.2 | 30.5 | 4.3 |
| 2022-05-14 | 25.3 | 18.9 | 31.7 | 0.0 |
| 2022-05-15 | 23.8 | 20.1 | 32.0 | 1.2 |
Before writing this into a DSSAT weather file, the intern should ask:
- Are there any missing days?
- Are radiation units really
MJ m-2 day-1? - Are rainfall values daily totals?
- Are minimum and maximum temperatures ever swapped?
Weather quality-control checklist
For this example, the intern should do all of the following:
- confirm the file has every date from planting through the harvest period
- confirm there are no duplicate rows
- confirm
TMAX >= TMINfor every day - inspect whether radiation values are physically plausible
- inspect whether rainfall totals look reasonable relative to the site and season
Station metadata
The station definition should also be made explicit:
| Item | Value |
|---|---|
| Station code | NFBA |
| Latitude | 29.70 |
| Longitude | -82.41 |
| Elevation | 32 |
Even if the intern does not yet know the exact final .WTH syntax by memory,
they should already understand that DSSAT needs both:
- the station description
- the day-by-day weather values
What matters scientifically
For hemp, weather is not just a background file. It affects:
- thermal accumulation
- photoperiod context through site location
- water balance through rainfall
- biomass opportunity through solar radiation
If flowering or biomass later look wrong, weather is one of the first things to recheck.
Step 3: build the soil component
What the raw soil notes must become
The field notes said:
- sandy loam surface
- deeper layer becomes sandy clay loam
- effective rooting depth about
120 cm - bulk density measured for three depth ranges
- initial soil water moderately moist
That description must become a layered soil profile.
Example layer table before DSSAT formatting
| Layer bottom (cm) | Texture | Bulk density (g cm-3) | Lower limit | Drained upper limit | Saturation | Organic C (%) |
|---|---|---|---|---|---|---|
| 15 | Sandy loam | 1.42 | 0.08 | 0.20 | 0.39 | 1.20 |
| 30 | Sandy loam | 1.45 | 0.09 | 0.21 | 0.38 | 0.95 |
| 60 | Sandy clay loam | 1.48 | 0.12 | 0.26 | 0.40 | 0.70 |
| 90 | Sandy clay loam | 1.50 | 0.13 | 0.27 | 0.40 | 0.55 |
| 120 | Sandy clay loam | 1.52 | 0.14 | 0.28 | 0.41 | 0.45 |
This is now much closer to DSSAT-ready thinking, because the soil is expressed layer by layer rather than as a paragraph.
What the intern should check
Before this becomes a DSSAT .SOL entry, the intern should confirm:
- layer depths are continuous and increasing
- saturation is greater than drained upper limit
- drained upper limit is greater than lower limit
- bulk density values are plausible for the textures listed
- the effective rooting depth is consistent with the deepest useful layer
Initial conditions
The field note "moderately moist" is not yet numeric enough for a simulation.
The intern needs to decide whether initial soil water will come from:
- direct measured layer water contents
- estimated fractions between lower limit and drained upper limit
- default assumptions documented in the experiment notes
For teaching purposes, a careful provisional assumption might be:
- upper two layers start near
75%of plant-available water - deeper layers start near
65%of plant-available water
That is not perfect truth. It is an explicit assumption that can later be replaced with measurements.
Why this matters scientifically
If the soil profile holds too much water, the crop may look unrealistically vigorous.
If it holds too little water, the crop may show water stress too early.
That is why soil should be checked before blaming cultivar coefficients.
Step 4: build the management component
Raw management notes
From the original notes:
- planting date:
2022-05-12 - row spacing:
0.76 m - target population:
60 plants m-2 - planting depth:
2.5 cm - nitrogen:
60 kg N ha-1at planting120 kg N ha-1on2022-06-10
- irrigation:
20 mmon2022-05-1425 mmon2022-06-01
- biomass harvests:
2022-06-202022-07-052022-07-222022-08-15
Management interpretation table
| Concept | DSSAT-relevant interpretation |
|---|---|
| Planting date | planting operation date |
| Population | plants per square meter |
| Row spacing | row geometry affecting canopy setup |
| Planting depth | establishment input |
| Nitrogen applications | fertilizer events with date and amount |
| Irrigation applications | irrigation events with date and amount |
| Harvest dates | observation schedule and possibly harvest management |
A management sheet the intern can draft
Planting
| Item | Value |
|---|---|
| Planting date | 2022-05-12 |
| Population | 60 plants m-2 |
| Row spacing | 0.76 m |
| Depth | 2.5 cm |
Fertilizer
| Date | Nutrient | Amount |
|---|---|---|
| 2022-05-12 | N | 60 kg ha-1 |
| 2022-06-10 | N | 120 kg ha-1 |
Irrigation
| Date | Amount |
|---|---|
| 2022-05-14 | 20 mm |
| 2022-06-01 | 25 mm |
Harvest observations
| Date | Purpose |
|---|---|
| 2022-06-20 | destructive biomass sample |
| 2022-07-05 | destructive biomass sample |
| 2022-07-22 | destructive biomass sample |
| 2022-08-15 | destructive biomass sample |
This gives the intern a structured view before they ever touch DSSAT syntax.
Step 5: connect the pieces in the experiment logic
At this point the intern has the three major non-genetic building blocks:
- a cleaned daily weather series
- a layered soil profile
- a structured management schedule
The next conceptual step is to connect them in one experiment description:
- the experiment references the weather station
- the field references the soil profile
- the treatment references the planting, fertilizer, irrigation, and harvest logic
- the treatment also references the cultivar choice
That connection is exactly why experiment files are the "conductor" of a DSSAT run.
Step 6: define what still remains uncertain
A realistic teaching case should not hide uncertainty.
For this example, open questions might still include:
- Were hydraulic limits measured or estimated?
- Was initial soil water measured or assumed?
- Were plant populations final established counts or target planting rates?
- Did the field station sit exactly at the plot location?
These are not embarrassments. They are part of honest model documentation.
Step 7: what the intern should save as project artifacts
By the end of this construction phase, a careful intern should have:
- a cleaned weather table
- a weather metadata note
- a layered soil table
- a management summary table
- a written list of assumptions and unresolved uncertainties
That package is often more important than the final DSSAT files alone.
What this case study teaches
This example shows that "building DSSAT inputs" is really a chain of decisions:
- data collection
- unit checking
- structure definition
- assumption logging
- model-oriented formatting
That is the discipline that later makes calibration and paper reproduction trustworthy.
Installed DSSAT Setup
This path is the simplest if DSSAT is already installed on your machine.
Requirements
- a local DSSAT installation
- R
- the required R packages:
DSSATdplyrtidyrlubridate
Install the R packages if needed:
install.packages(c("DSSAT", "dplyr", "tidyr", "lubridate"))
Source the wrapper
From your local DSSAT-wrapper clone:
source("C:/path/to/DSSAT-wrapper/R/DSSAT_omniwrapper.R")
Minimal example
result <- DSSAT_omniwrapper(
model_options = list(
DSSAT_path = "C:/path/to/DSSAT48",
DSSAT_exe = "DSCSM048.EXE",
project_file = "C:/path/to/DSSAT48/Wheat/KSAS8101.WHX",
suppress_output = TRUE
),
situation = "KSAS8101_1",
var = "GSTD"
)
What the wrapper needs
At minimum, the omniwrapper expects:
DSSAT_pathproject_file
It can often infer:
- crop code
- crop folder
- default module
- genotype file names
When to provide Crop
Provide Crop when the experiment file lives outside the installed DSSAT crop
folder and the wrapper should know which installed crop directory to stage
against.
When to provide module_code
Provide module_code when you want to force an alternate engine instead of the
default module in the DSSAT profile. Examples:
MZIXM048WHAPS048TFAPS048SUOIL048SCCSP048SCSAM048
Basic validation
You can run the bundled smoke test:
Rscript C:\path\to\DSSAT-wrapper\R\test_DSSAT_omniwrapper.R
And the broader family validator:
Rscript C:\path\to\DSSAT-wrapper\R\validate_DSSAT_omniwrapper_families.R
Using GitHub-Sourced Example Data
You do not need every experiment file to live inside the DSSAT installation folder.
A common pattern is:
- keep DSSAT installed locally
- clone or download an example-data repository
- point
project_fileto the external experiment file - optionally set
Crop - optionally set
module_code
Example layout
C:/work/
DSSAT48/
dssat-csm-data/
Example: alternate maize engine
source("C:/path/to/DSSAT-wrapper/R/DSSAT_omniwrapper.R")
result <- DSSAT_omniwrapper(
model_options = list(
DSSAT_path = "C:/work/DSSAT48",
DSSAT_exe = "DSCSM048.EXE",
Crop = "Maize",
project_file = "C:/work/dssat-csm-data/Maize/UFGA8201.MZX",
module_code = "MZIXM048",
suppress_output = TRUE
),
situation = "UFGA8201_1",
var = "CWAD"
)
Example: wheat with NWHEAT
result <- DSSAT_omniwrapper(
model_options = list(
DSSAT_path = "C:/work/DSSAT48",
DSSAT_exe = "DSCSM048.EXE",
Crop = "Wheat",
project_file = "C:/work/dssat-csm-data/Wheat/KSAS8101.WHX",
module_code = "WHAPS048",
suppress_output = TRUE
),
situation = "KSAS8101_1",
var = "GSTD"
)
Why this works
The omniwrapper stages a temporary run directory and copies in the files DSSAT needs for that run:
- the experiment file
- companion
*.Aand*.Tfiles when present - genotype files from the local DSSAT installation or the project folder
That means your source-data checkout stays clean and DSSAT still runs from the local executable.
Good practice
- Keep DSSAT itself installed locally.
- Treat GitHub example repositories as input-data sources.
- Do not hard-code personal paths in scripts you plan to share.
- Use placeholders like
C:/path/to/...in public documentation.
Your First Run
This chapter shows the shortest useful path from a fresh machine to a working DSSAT run from R.
Step 1: Prepare the prerequisites
You need:
- R installed
- DSSAT installed locally
- this wrapper repository available locally
For example:
- DSSAT at
C:/path/to/DSSAT48 - the repo in any local folder
Step 2: Install R packages
Run:
install.packages(c("DSSAT", "dplyr", "tidyr", "lubridate"))
These are the minimum packages used by the wrapper core.
Step 3: Source the omniwrapper
This step assumes you have also cloned the DSSAT-wrapper module separately.
For example:
C:/path/to/DSSAT-wrapper
source("C:/path/to/DSSAT-wrapper/R/DSSAT_omniwrapper.R")
This loads the public function and the helper modules behind it.
Step 4: Choose a known-good example
A good first example is a shipped DSSAT case that is already validated.
For instance:
result <- DSSAT_omniwrapper(
model_options = list(
DSSAT_path = "C:/path/to/DSSAT48",
DSSAT_exe = "DSCSM048.EXE",
project_file = "C:/path/to/DSSAT48/Wheat/KSAS8101.WHX",
suppress_output = TRUE
),
situation = "KSAS8101_1",
var = "GSTD"
)
Step 5: Inspect the result
names(result)
names(result$sim_list)
head(result$sim_list[["KSAS8101_1"]])
At this point, the main question is not yet "does the biology look perfect?"
The main question is:
"Did the wrapper successfully identify the model context, launch DSSAT, and read the outputs back?"
Step 6: Read observations if available
obs <- DSSAT_omni_read_obs(
model_options = list(
DSSAT_path = "C:/path/to/DSSAT48",
DSSAT_exe = "DSCSM048.EXE",
project_file = "C:/path/to/DSSAT48/Wheat/KSAS8101.WHX"
),
situation = "KSAS8101_1",
read_end_season = TRUE
)
This is the bridge between "a run works" and "a run can be evaluated."
Common beginner mistakes
Using the wrong project file
Make sure the file extension and crop family make sense for the case you want to run.
Pointing only to copied crop inputs
Some crops also require install-level metadata to be present and consistent.
Confusing crop choice with module choice
Some crops have alternate engines. If a non-default engine is required, you may
need to provide module_code.
Example:
module_code = "WHAPS048"
What to do if the first run fails
Do not immediately start editing coefficients.
Instead:
- run the self-check
- verify the project file path
- verify the DSSAT install path
- verify that the required genotype files exist
- check whether the example is supposed to use a non-default engine
That sequence saves a lot of time.
Self-Check and Validation
This chapter explains how the wrapper protects itself against common failures.
Why a self-check matters
Many DSSAT failures happen before the biology starts:
- a required file is missing
- the wrong crop folder is assumed
- a genotype file is absent
- the project file lives outside the installed DSSAT crop folder
- the requested variable does not exist for that family
If we only discover those issues after a failed run, debugging becomes slow and noisy.
DSSAT_omni_self_check()
The self-check is designed to fail early and readably.
It inspects things like:
- DSSAT executable
SIMULATION.CDEDSSATPRO.V48- project file
- genotype directory
- species, cultivar, and ecotype files when relevant
- companion observation files
- situation naming consistency
- requested variable availability
This check runs before DSSAT_omniwrapper() launches DSSAT.
Why this is especially important in a multi-family wrapper
A crop-specific wrapper can rely on more fixed assumptions.
A multi-family wrapper cannot.
That means validation has to be more explicit because we are working across:
- different model engines
- different genotype patterns
- different output conventions
- different companion observation layouts
Smoke tests vs broader validation
The repo uses more than one validation layer.
Smoke tests
These confirm the core path is still alive.
Examples:
DSSAT-wrapper/R/test_DSSAT_omni_self_check.RDSSAT-wrapper/R/test_DSSAT_omniwrapper.RDSSAT-wrapper/R/test_DSSAT_omni_read_obs.R
Cross-family validation
This confirms the architecture still works across the broader supported set.
Example:
DSSAT-wrapper/R/validate_DSSAT_omniwrapper_families.R
These validation scripts are designed for other users to run on their own machines. They use:
DSSAT_PATHfor the local DSSAT installationDSSAT_CSM_DATAfor an optional local clone ofdssat-csm-data
If DSSAT_PATH is not set, the scripts try the common Windows default
C:/DSSAT48.
Why validated families matter
There is a difference between:
- "the code recognizes this family"
- "the code can launch this family"
- "the code has been validated end-to-end"
The documentation should always keep those categories separate. That is how we avoid overclaiming support.
Practical workflow for contributors
When changing wrapper internals:
- run the self-check test
- run the smoke test
- run the cross-family validator
- if you touched observation logic, run the observation regression test too
That sequence should be treated as part of development, not as an afterthought.
How Calibration Actually Works
Calibration is one of the most misunderstood parts of crop modeling.
Many beginners think calibration means:
"change numbers until the curve looks good."
That is not a reliable modeling strategy.
Calibration should be structured, biologically informed, and documented.
What calibration is
Calibration means adjusting uncertain model parameters so that simulated behavior agrees better with observed behavior for a defined dataset and question.
In crop-model work, common calibration targets include:
- emergence date
- flowering date
- maturity date
- biomass trajectory
- stem biomass
- grain yield
- leaf area
What calibration is not
Calibration is not:
- hiding a bad weather file
- compensating for a wrong soil profile by changing genetics
- forcing every variable to match perfectly
- proof that the model is universally correct
A good calibration improves fit while preserving biological meaning.
A safer calibration order
For many projects, especially beginner projects, the safest order is:
- confirm the run and inputs are correct
- calibrate timing and stage transitions
- calibrate overall biomass trajectory
- calibrate organ partitioning
- evaluate on cases not used for adjustment when possible
This order follows process logic rather than aesthetic curve fitting.
Why timing usually comes first
If flowering timing is wrong, many later outputs can be wrong for the wrong reason.
For example:
- the crop may stop vegetative growth too early
- stem biomass may be too low simply because the crop transitioned too soon
- seed filling may begin at the wrong time
That is why phenology often deserves attention before final biomass.
Why input checking comes before parameter changes
Before changing cultivar or ecotype parameters, confirm:
- weather is correct
- soil is plausible
- management matches the experiment
- observations are being read correctly
- the correct model family and module are actually running
Otherwise, you may calibrate around an avoidable input error.
Objective functions
Calibration needs a way to judge whether one parameter set is better than another.
Common criteria include:
- RMSE
- agreement indices such as Willmott's
d - bias
- visual fit of time series
No single metric tells the whole story, so it is often best to inspect both numbers and plots.
Multi-variable calibration
Real projects often fit more than one variable at once.
For hemp, this may include:
- flowering time
- total aboveground biomass
- stem biomass
- plant height
This creates tradeoffs. A parameter change that improves one variable may worsen another.
That is why calibration should be guided by process understanding, not only by one summary score.
Validation after calibration
Calibration is stronger when you can test the calibrated model on:
- another season
- another site
- another treatment set
If the fit is good only on the exact data used for tuning, the model may be overfit.
Beginner takeaway
Calibration is best thought of as a scientific workflow:
- check inputs
- define targets
- adjust plausible parameters
- evaluate carefully
- document everything
That mindset is much more valuable than chasing a single perfect metric.
Hopf Paper Case Study
This chapter explains how the wrapper can be used in a real reproduction task.
The example is the hemp paper:
Hopf et al. (2025), "Adaptation of the process-based CSM-CROPGRO model to simulate the growth and development of industrial hemp for seed and fiber production."
Why this case study matters
It is a good teaching example because it combines:
- a real publication
- real observed data files
- real genotype coefficients
- a model family that required install metadata to be consistent
- a workflow that can later inform work in other model frameworks
The four key experiment files
In the local reproduction workspace, the exact paper experiment files came from
dssat-csm-data/Hemp:
UFCI2101.HMXUFCI2201.HMXUFJA2101.HMXUFJA2201.HMX
Together they define the 15 Florida cases listed in the paper's experiment table.
The matching observation files
Each experiment also had:
- a time-course observation file
- an anthesis or summary observation file
For hemp these appeared as:
.HMT.HMA
That is exactly the kind of structure the wrapper's observation-reading logic needs to support if it is going to help with paper reproduction.
Why the observation-path fix mattered
Originally, the new wrapper could read observations only from the installed DSSAT crop folder.
That was not enough for this paper workflow, because the exact paper experiment files and observations lived in an external project directory.
The wrapper was updated so that observation files can be found relative to the
project_file directory when needed.
That small technical change is a good example of how real case studies improve general wrapper design.
First-pass reproduction logic
The reproduction workflow does this:
- define the 15 paper cases explicitly
- run each case through
DSSAT_omniwrapper() - read the matching observations
- join simulated and observed values on date
- compute first-pass metrics such as
dandRMSE - compare observed and simulated flowering dates
What the first pass can and cannot claim
It can claim:
- the paper experiments rerun successfully
- the observed and simulated data can be aligned reproducibly
- the resulting performance metrics are strong enough to justify deeper analysis
It should not yet claim:
- that every published metric has been matched exactly
- that every figure in the paper has been rebuilt identically
- that every aggregation choice used by the authors has already been reproduced
That distinction is part of good scientific communication.
Why this case study belongs in the book
Because it shows the difference between:
- building a wrapper in theory
- building one that survives contact with a real publication workflow
That is the level of evidence that helps a community trust a tool.
End-to-End Intern Exercise: From Raw Notes to a Finished Run
This exercise is designed for a beginner who knows little or nothing about:
- agronomy
- DSSAT
- hemp
- weather files
- soil profiles
The goal is not only to finish a run.
The goal is to learn how to think from raw field information all the way to a traceable simulation result.
Exercise goal
By the end of this exercise, the intern should be able to:
- organize raw field notes into DSSAT input categories
- identify missing information and explicit assumptions
- build weather, soil, and management tables in a DSSAT-oriented way
- connect those inputs to a project file
- launch a run with the wrapper
- inspect the outputs and explain what happened at a beginner level
The starting packet
Imagine the intern receives this packet.
Field note summary
Site:
- site code:
TRN1 - latitude:
29.70 N - longitude:
82.41 W - elevation:
32 m
Crop:
- crop: industrial hemp
- cultivar:
IH Williams - objective: evaluate stem and total biomass under moderate nitrogen
Management:
- planting date:
2022-05-12 - population:
60 plants m-2 - row spacing:
0.76 m - planting depth:
2.5 cm - nitrogen applied:
60 kg N ha-1on planting day120 kg N ha-1on2022-06-10
- irrigation applied:
20 mmon2022-05-1425 mmon2022-06-01
Observation dates:
2022-06-202022-07-052022-07-222022-08-15
Soil notes:
- sandy loam surface
- sandy clay loam subsoil
- profile depth
120 cm - moderate initial moisture
Weather source:
- nearby station export with daily radiation, minimum temperature, maximum temperature, and rainfall
Part 1: organize the packet
Task 1
Create a four-part note with headings:
- weather
- soil
- management
- genotype
Expected result
The intern should place:
- station coordinates and daily weather under
weather - texture, depth, and water status clues under
soil - planting, fertilizer, irrigation, and observation dates under
management - cultivar name under
genotype
Why this matters
This is the first step in learning to think like a modeler rather than like a reader of mixed field notes.
Part 2: build a weather preparation sheet
Task 2
Create a table with columns:
DateSRADTMINTMAXRAIN
Populate it from the station export and check:
- no missing dates
- no duplicate dates
TMAX >= TMIN- rainfall units are daily totals
- radiation units are consistent
Deliverable
A cleaned daily weather table plus a one-paragraph note describing:
- source of data
- date coverage
- any corrections made
Common beginner trap
Do not jump straight into a DSSAT weather file before doing the quality checks.
Part 3: build a soil preparation sheet
Task 3
Turn the soil notes into a layered table with at least five layers down to
120 cm.
Suggested columns:
Layer bottomTextureBulk densityLLDULSATOrganic C
Deliverable
A provisional soil profile table plus a note that states clearly which values were:
- measured
- estimated
- borrowed from standard references or nearby profiles
Common beginner trap
Do not pretend estimated hydraulic values are measured truths.
Part 4: build a management preparation sheet
Task 4
Create separate tables for:
- planting
- fertilizer
- irrigation
- harvest or observation schedule
Deliverable
A management packet that could later be encoded in an experiment file.
Common beginner trap
Do not mix destructive biomass harvest dates with routine standing observations without labeling the difference clearly.
Part 5: document assumptions
Task 5
Write a short list of assumptions such as:
- initial soil moisture estimated from field note "moderately moist"
- weather station assumed representative of the field
- cultivar name mapped to the intended DSSAT cultivar entry
Deliverable
A plain-language assumptions log.
Why this matters
Good model work is not about hiding uncertainty. It is about making uncertainty visible and manageable.
Part 6: prepare the run context
For the first execution exercise, the intern does not need to hand-author every DSSAT file from scratch.
Instead, the intern should work from a known project template and inspect how its file structure maps onto the prepared notes.
Task 6
Open a known hemp project file and identify:
- weather linkage
- soil linkage
- planting section
- fertilizer section
- cultivar section
- simulation controls
Deliverable
A short annotation note explaining where each of those pieces lives.
Why this step is important
This builds the bridge between the preparation sheets and the real DSSAT file ecosystem.
Part 7: launch a first run
Once the intern has a valid project file and a working DSSAT installation, they can launch a test run with the wrapper.
Example run pattern
source("C:/path/to/DSSAT-wrapper/R/DSSAT_omniwrapper.R")
result <- DSSAT_omniwrapper(
model_options = list(
DSSAT_path = "C:/path/to/DSSAT48",
DSSAT_exe = "DSCSM048.EXE",
project_file = "C:/path/to/project/Hemp/TRN1.HMX",
suppress_output = TRUE
),
situation = "TRN1_1",
var = "CWAD"
)
Task 7
Run the project and record:
- whether the wrapper returned without error
- how many rows were in the simulation output
- which variables were available
- the first and last simulated dates
Deliverable
A short run note summarizing the simulation status.
Part 8: perform a first interpretation
Task 8
Look at the simulation output and answer:
- Did the crop simulate through the intended season?
- Is the requested biomass variable present?
- Does the output look like a daily time series?
- Does the timing appear plausible relative to planting and harvest dates?
Deliverable
A one-page beginner interpretation note.
Example interpretation language
Good:
- "The run completed and produced a daily biomass series through the expected season."
- "I still need to verify whether flowering timing and biomass magnitude are realistic."
Not good:
- "The model is correct because it ran."
Part 9: optional observation comparison
If observation files are available, the intern can take one more step and read them through the wrapper:
obs <- DSSAT_omni_read_obs(
model_options = list(
DSSAT_path = "C:/path/to/DSSAT48",
DSSAT_exe = "DSCSM048.EXE",
project_file = "C:/path/to/project/Hemp/TRN1.HMX"
),
situation = "TRN1_1",
read_end_season = TRUE
)
Task 9
Check whether observation data were found and whether their dates overlap with the simulated series.
Deliverable
A note answering:
- Were observations present?
- Were they in-season, end-season, or both?
- Are the dates aligned enough for later plotting or metrics?
What a successful exercise submission should contain
A strong intern submission should include:
- an organized raw-notes summary
- a cleaned weather table
- a layered soil table
- a management table set
- an assumptions log
- a project-file annotation note
- a run summary
- a first interpretation note
What this exercise is really teaching
This is not just a software exercise.
It teaches the intern how to move across four ways of thinking:
- field observation
- structured data preparation
- simulation setup
- scientific interpretation
That transition is the heart of practical crop-model work.
Visualizing Results
Running a model is only the first half of the job.
The second half is making the results interpretable.
Why visualization matters in crop-model workflows
Good plots help you answer different kinds of questions:
- Did the model run at all?
- Are the observed and simulated values on the same scale?
- Is the mismatch systematic or occasional?
- Which treatments are hardest for the model?
- Are errors coming from timing, magnitude, or both?
Without plots, calibration can become blind.
What to visualize first
For a beginner, the most useful first figures are:
- observed vs simulated scatter plots
- time-series overlays
- flowering or stage-date comparisons
- summary metric panels
That sequence moves from broad agreement to case-specific diagnosis.
A practical pattern
For real projects, keep figure generation separate from the main run logic.
That separation makes it easier to:
- rerun figures without rerunning every simulation
- audit the data feeding each figure
- debug joins and aggregation choices
The Hopf case-study figures
In the local reproduction workspace, a focused figure script was built around the reproduction outputs rather than around older dashboard-specific code.
The idea is general:
- use the wrapper to generate clean outputs
- save joined time-series and metric tables
- build plots from those saved outputs
That pattern is more portable than a plotting script that is tightly coupled to one personal machine layout.
Common figure types in this project
Observed vs simulated panels
Best for:
- quick agreement checks
- identifying bias
- comparing multiple variables side by side
Time-series overlays
Best for:
- checking whether timing is right
- checking growth trajectory shape
- seeing whether the model diverges early or late
Stage-date alignment plots
Best for:
- phenology validation
- identifying treatment-specific timing issues
Metric summaries
Best for:
- reporting
- comparing variables at a glance
- identifying which variables need more work
Advice for contributors
Treat plotting code as part of the reproducibility workflow, not as a cosmetic extra.
That means:
- keep inputs explicit
- save derived tables
- use consistent naming
- avoid hard-coded personal directories
- document what each figure is meant to prove
Troubleshooting
This chapter collects common failure modes and how to think through them.
Problem: the wrapper cannot find a file
Likely causes:
- wrong
DSSAT_path - wrong
project_file - missing genotype files
- companion observation files not where the wrapper expects them
What to do:
- run the self-check
- inspect the reported paths
- confirm whether the project lives in the DSSAT install or in an external directory
Problem: the run exits with no PlantGro.OUT
Likely causes:
- DSSAT itself rejected the experiment
- the wrong module was used
- a required companion file is missing
- the experiment is valid only with a non-default engine
What to do:
- inspect
ERROR.OUT - inspect
WARNING.OUT - confirm the intended
module_code
Problem: observations are not found
Likely causes:
- the observation files live beside the project file, not in the installed crop folder
- file suffixes differ from what you expected
- the requested experiment name does not match the actual observation filename stem
What to do:
- check the project directory
- check the installed crop folder
- inspect the experiment stem and crop code suffix
Problem: the variable you requested is missing
Likely causes:
- that family uses a different output variable name
- the variable is not written in that output file
- the run succeeded but your request needs an alias
What to do:
- inspect the names returned by the simulation object
- check whether another family-specific variable is the right target
- extend alias handling if this is a legitimate recurring case
Problem: the model launches but the biology looks wrong
That is a different class of problem.
Now you should inspect:
- weather linkage
- soil parameters
- planting density
- fertilizer schedule
- cultivar and ecotype choice
- target variable definition
At that stage, the wrapper may be working perfectly and the scientific setup may still need revision.
Attribution and Disclaimer
Attribution
This fork builds on the original DSSAT wrapper work associated with the AgMIP Calibration Phase III effort.
The original wrapper header credits:
- Jing Qi
- Amir Souissi
- Samuel Buis
- Vakhtang Shelia
This fork extends that base with:
- a registry-driven omniwrapper
- support for additional DSSAT model families
- alternate-engine launching
- broader validation scripts
- hostable public documentation
Why attribution is kept visible
This repository is not presented as a disconnected rewrite.
It is a continuation and extension of earlier wrapper work, so preserving that lineage is part of being technically honest and community-friendly.
License note
At the time of writing, the checked upstream wrapper snapshot used for this fork
does not include an explicit LICENSE file.
Because of that, this documentation does not invent license terms on behalf of the upstream project. Users should check the current upstream repository state for any later license clarification before redistributing code more broadly.
The DSSAT model source and data repositories are separate from this wrapper repository and may have their own licensing terms.
Disclaimer
This work is shared in good faith for research and educational use, but it is provided as-is without warranty, and users are responsible for verifying suitability for their own data, workflows, and redistribution context.
pSIMS Explained for Crop Modellers
Who this is for: you understand crop models (DSSAT, STICS, APSIM) and know what a simulation needs — climate, soil, management. You are not a software developer. This page explains what pSIMS does in plain language, what files it needs, what it produces, and how to run it.
What Is pSIMS?
Normally, to run DSSAT you prepare .WTH weather files, SOIL.SOL soil profiles,
and *.X experiment files by hand. For one or two sites that is manageable.
For 1,000 sites across Ethiopia it becomes impossible.
pSIMS is an assembly line that does all of that automatically.
You give it:
- Climate data for every site (one standard file format)
- Soil data for every site (same format)
- Your experiment settings (planting date, cultivar, fertilizer — once)
- A config file saying which model to use
pSIMS then:
- Extracts climate and soil for each site from those files
- Converts them into whatever format the chosen model needs
- Runs the model for every site
- Pulls the outputs (yield, anthesis date, etc.) into one clean results file
The same climate and soil files work for DSSAT, STICS, or APSIM — you just change one line in the config to switch models.
The Assembly Line — Step by Step
When you run pSIMS, it executes these steps in order for each site:
1. Stage inputs — copy climate and soil tiles to a working folder
2. Convert climate — turn NetCDF climate tile → model weather file
(Psims2Wth for DSSAT, Psims2Met for APSIM, Psims2Stics for STICS)
3. Build experiment — merge your JSON experiment settings into the model input format
(Camp2Json → Jsons2Dssat / Jsons2Apsimx / Jsons2Stics)
4. Run model — call DSSAT / APSIM / STICS binary
5. Collect output — read model output → one standard NetCDF results file
6. Stage outputs — move results to the outputs/ folder
Each step is a small Python class. If one step fails, pSIMS prints False for that step
and stops — you can see exactly where the problem is.
Section 1 — Input Data
pSIMS needs three kinds of input.
1a. Climate data — the "tile" file
File: clim_0001_0001.tile.nc4
This is a NetCDF4 file (a scientific data format — think of it as a structured spreadsheet that holds daily time series for one grid cell). It contains:
| Variable | What it is |
|---|---|
tmax | Daily maximum air temperature (°C) |
tmin | Daily minimum air temperature (°C) |
precip | Daily precipitation (mm) |
solar | Daily solar radiation (MJ/m²/day) |
One tile file = one grid cell (e.g. 0.5° × 0.5°) for the full climate period (e.g. 1980–2010).
You do not create these by hand. They come from processed climate products (CMIP6 GCMs, ERA5, CHIRPS, etc.) that are already in pSIMS tile format. For the Ethiopia ensemble, these are prepared by the climate pipeline (ws2 scripts).
1b. Soil data — the "tile" file
File: soil_0001_0001.tile.nc4
Same NetCDF4 format, but contains soil profile properties for the same grid cell:
| Variable | What it is |
|---|---|
slt | Silt fraction per layer (%) |
cly | Clay fraction per layer (%) |
snd | Sand fraction per layer (%) |
soc | Soil organic carbon (%) |
bd | Bulk density (g/cm³) |
ph | Soil pH |
Multiple soil layers are stored along a depth dimension. For the Ethiopia ensemble, these come from SoilGrids via the soil pipeline (ws1 scripts).
1c. Experiment settings — the campaign file
File: experiment.json (for APSIM) or exp_template.json (for DSSAT)
This is where you describe your agronomy — the same information you would enter in the DSSAT experiment file or APSIM manager script, but written in JSON format.
APSIM example (simple — 18 lines):
{
"StartDate": "1980-01-01",
"EndDate": "2010-12-31",
"Crop": "Maize",
"PlantingDate":"15-Apr",
"Cultivar": "B_110",
"SowingDensity": 7.5,
"RowSpacing": 750,
"SowingDepth": 30
}
DSSAT example (detailed — ~220 lines) includes:
- Planting date, density, depth
- Fertilizer applications (date, amount, type, placement)
- Irrigation schedule
- Harvest rule
- Simulation control flags (water balance method, N balance on/off, etc.)
Crop modeller note: the JSON keys map 1-to-1 to the fields you already know from DSSAT X-files or APSIM manager. The values are the same numbers in the same units. The format is just different (JSON instead of fixed-width text).
Section 2 — The Config File (params)
File: params.dssat48.point.sample (or similar name)
This is a YAML file — essentially a list of key: value settings.
It is the only file you normally edit when setting up a run.
Here are the most important settings, explained:
# Which model to use
model: dssat48
# Where your climate tiles are
weather: samples/dssat48_point_bundle/weather
# Where your soil tiles are
soils: samples/dssat48_point_bundle/soils
# Where your experiment JSON is
campaign: samples/dssat48_point_bundle/campaign
# Path to the model executable
executable: /path/to/DSCSM048.EXE
# Grid cell size in arc-minutes (30 = 0.5 degrees)
delta: "30,30"
# How many grid cells in this run
num_lats: 1
num_lons: 1
# What output variables to extract
variables: harwt,adat,mdat,lai_max
# Reference year for the climate tiles
ref_year: 1980
# Output file name
out_file: dssat48-point
That is it. You set the paths, pick your variables, and run.
Section 3 — Output Data
File: outputs/output_0001_0001.psims.nc
This is another NetCDF4 file. It contains one value per simulation year for each variable you requested. For example, if your climate spans 1980–2010 (30 years), you get 30 yield values, 30 anthesis dates, 30 maturity dates.
| Variable | What it means | Units |
|---|---|---|
harwt | Grain yield at harvest | kg/ha |
adat | Anthesis (silking) date | day of year |
mdat | Maturity date | day of year |
lai_max | Maximum leaf area index | m²/m² |
Opening the output
In Python:
import netCDF4
nc = netCDF4.Dataset('outputs/output_0001_0001.psims.nc')
yield_kgha = nc.variables['harwt'][:] # array of 30 values
print(yield_kgha)
nc.close()
In R:
library(ncdf4)
nc <- nc_open('outputs/output_0001_0001.psims.nc')
yield <- ncvar_get(nc, 'harwt')
print(yield)
nc_close(nc)
CSV output (easier for Excel/R)
Add csv: true in your params file and pSIMS also writes a .csv next to the NetCDF.
You can open that directly in Excel or read it with read.csv() in R.
Section 4 — The Run Command, Decoded
python pysims/pysims.py \
--param params/params.dssat48.point.sample \
--campaign samples/dssat48_point_bundle/campaign \
--tlatidx 0001 \
--tlonidx 0001 \
--latidx 1 \
--lonidx 1
| Piece | What it means |
|---|---|
python pysims/pysims.py | Run the pSIMS pipeline script |
--param params/... | Use this params file (your config) |
--campaign samples/.../campaign | Use this folder for experiment JSON files |
--tlatidx 0001 | Which latitude tile to process (4-digit, zero-padded) |
--tlonidx 0001 | Which longitude tile to process |
--latidx 1 | Which row inside that tile (1-based) |
--lonidx 1 | Which column inside that tile |
For the sample data there is only one tile (0001) containing one point (1, 1).
For a 1,000-site ensemble you would loop over all tile/point combinations —
that is what the Slurm batch scripts handle automatically.
Section 5 — Can a Bachelor Intern Run This?
Honest answer: maybe, with a good setup guide and a mentor for the first session.
Here is where interns will succeed and where they will struggle:
Easy parts
- Editing the params file — it is just key-value pairs
- Reading outputs in R or Python — two lines of code
- Running the command once everything is installed — it is one command
- Understanding what each input means — it maps to familiar agronomy concepts
Hard parts
| Challenge | Why it is hard |
|---|---|
| Installation | Python virtual environment, model binaries, Java, paths — many steps, many ways to fail |
| Path errors | Relative vs absolute paths; running from the wrong directory causes cryptic errors |
| NetCDF format | Not familiar; requires a library; error messages are not beginner-friendly |
Debugging False | When a pipeline step fails with False, finding the log and interpreting the traceback requires Python familiarity |
| Tile structure | Understanding 0001/clim_0001_0001.tile.nc4 naming convention is non-obvious |
| HPC | SSH, Slurm, modules, environment variables — a second learning curve on top of pSIMS |
Verdict: An intern can run pre-configured sample data successfully on their first day. Building a new experiment from scratch (new crop, new region, new climate dataset) requires 2–3 weeks of guided work. Running the full HPC ensemble independently requires several months of experience.
Section 6 — Simplifications We Recommend
The following changes would make pSIMS significantly more intern-friendly without breaking anything.
6a. A one-line run script per model
Instead of remembering the full command with 6 flags, create small shell scripts in the pSIMS root:
# run_dssat.sh
#!/bin/bash
cd "$(dirname "$0")"
python pysims/pysims.py \
--param params/params.dssat48.point.sample \
--campaign samples/dssat48_point_bundle/campaign \
--tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1
Then an intern just types bash run_dssat.sh. Done.
Equivalent scripts: run_stics.sh, run_apsim.sh, run_all.sh.
6b. Simpler params file names
The current name params.dssat48.point.sample is not intuitive.
Rename (or add symlinks) to something like:
| Old name | Suggested name |
|---|---|
params.dssat48.point.sample | params_dssat_sample.yaml |
params.stics10.point.sample | params_stics_sample.yaml |
params.apsimx.point.linux.sample | params_apsim_linux_sample.yaml |
6c. A "quick check" script
A one-page script that reads the output NetCDF and prints a readable summary:
# check_output.py
import sys, netCDF4, numpy as np
nc = netCDF4.Dataset(sys.argv[1])
print(f"\n{'Variable':<12} {'Mean':>10} {'Min':>10} {'Max':>10}")
print("-" * 45)
for var in nc.variables:
if var in ('lat','lon','time'): continue
data = nc.variables[var][:].flatten()
data = data[~np.isnan(data)]
if len(data):
print(f"{var:<12} {data.mean():>10.1f} {data.min():>10.1f} {data.max():>10.1f}")
nc.close()
Usage: python check_output.py outputs/output_0001_0001.psims.nc
6d. Cleaner error messages
When a pipeline step returns False, pSIMS currently prints only:
0001/0001, ApsimX, run, ..., False
A small patch to pysims.py could print the last few lines of the model's stderr,
which is what you actually need to diagnose the problem.
6e. A minimal DSSAT experiment template
The current exp_template.json is 220 lines covering every possible option.
For a standard rainfed maize run most of those lines are noise.
A "minimal template" with only the 10–15 fields that actually change between experiments
would be much less intimidating as a starting point.
Summary
| What | Format | Tool to open |
|---|---|---|
| Climate input | NetCDF4 tile (.tile.nc4) | Python netCDF4, R ncdf4, Panoply |
| Soil input | NetCDF4 tile (.tile.nc4) | same |
| Experiment settings | JSON (.json) | any text editor |
| Config | YAML (.yaml or .sample) | any text editor |
| Results | NetCDF4 (.psims.nc) | Python, R, Panoply |
| Results (optional) | CSV (.csv) | Excel, R, Python |
The only files you edit for a new run:
experiment.json— your agronomy (planting date, cultivar, management)params_*.yaml— your paths, model choice, and output variables
Everything else is infrastructure that, once set up, you do not touch.
pSIMS 2.0 — Windows Setup and Run Guide
Status as of 2026-05-21: fully operational. All three models (DSSAT 4.8, STICS 10, APSIM-X) run successfully on Windows.
Quick reference — what is where
| Thing | Path |
|---|---|
| pSIMS working directory | C:\Users\<user>\Downloads\psims-release-2.0\ |
| Python (venv) | psims-release-2.0\.venv\Scripts\python.exe |
| DSSAT binary | samples\dssat48_point_bundle\refdata\DSCSM048.EXE (relative to pSIMS root) |
| STICS launcher | C:\Users\<user>\Downloads\JavaSTICS-10.4.1-STICS-10.4.1\...\JavaSticsCmd.exe |
| APSIM-X binary | C:\Program Files\APSIM<version>\bin\Models.exe |
| Sample data | psims-release-2.0\samples\ |
| Params (Windows) | psims-release-2.0\params.*.sample (in root) |
| Outputs | psims-release-2.0\outputs\ |
Important: always run from the
psims-release-2.0\directory. The params files use relative paths likesamples/...andpysims/...that only resolve correctly when you start there.
Prerequisites
These must already be installed on your machine. If any are missing, install them before trying to run a model.
1. Python virtual environment
The venv lives at psims-release-2.0\.venv\. It is created once and
reused for all runs.
To create it (first time only):
cd C:\Users\<user>\Downloads\psims-release-2.0
python -m venv .venv
.venv\Scripts\python.exe -m pip install -r requirements.txt
Verify:
.venv\Scripts\python.exe -c "import netCDF4, scipy, numpy; print('OK')"
2. DSSAT 4.8
The DSCSM048.EXE binary is bundled inside the repo at
samples\dssat48_point_bundle\refdata\DSCSM048.EXE. No separate
installation is needed for the sample test.
For a full installation, DSSAT 4.8 can be installed from the official DSSAT website. pSIMS will use the binary path set in the params file.
3. STICS 10
Install from the JavaStics package. After installation, note the path to
JavaSticsCmd.exe — it will be something like:
C:\Users\<user>\Downloads\JavaSTICS-10.4.1-STICS-10.4.1\
JavaSTICS-10.4.1-STICS-10.4.1\JavaSticsCmd.exe
The params file (params.stics.point.sample) needs both:
executable— full path toJavaSticsCmd.exeworkdir— directory containingJavaSticsCmd.exe(JavaStics must be run from its own folder)
4. APSIM-X (Next Generation)
Install APSIM Next Generation from the APSIM website. After installation, the binary will be at:
C:\Program Files\APSIM<version>\bin\Models.exe
Check params.apsimx.point.sample for the exact version path that was
verified — update it if your installed version differs.
Adjusting paths in params files
Before running, open each params file and confirm the executable paths match your local installation.
params.dssat48.point.sample — no changes needed for the sample test.
DSSAT runs from the refdata directory using the bundled EXE.
params.stics.point.sample — set these two lines:
executable: "C:\\Users\\<user>\\Downloads\\JavaSTICS-10.4.1-STICS-10.4.1\\JavaSTICS-10.4.1-STICS-10.4.1\\JavaSticsCmd.exe"
workdir: "C:\\Users\\<user>\\Downloads\\JavaSTICS-10.4.1-STICS-10.4.1\\JavaSTICS-10.4.1-STICS-10.4.1"
params.apsimx.point.sample — set:
executable: "C:\\Program Files\\APSIM<version>\\bin\\Models.exe"
Use double backslashes (\\) inside YAML quoted strings.
Running a model — template command
Open PowerShell or Git Bash, cd to the pSIMS root, then:
cd C:\Users\<user>\Downloads\psims-release-2.0
.venv\Scripts\python.exe pysims\pysims.py `
--param params.<model>.point.sample `
--campaign samples\<model>_point_bundle\campaign `
--tlatidx 0001 --tlonidx 0001 `
--latidx 1 --lonidx 1
(In Git Bash, use forward slashes and \ line continuation instead of
the PowerShell backtick.)
Running each model
DSSAT 4.8
cd C:\Users\<user>\Downloads\psims-release-2.0
.venv\Scripts\python.exe pysims\pysims.py `
--param params.dssat48.point.sample `
--campaign samples\dssat48_point_bundle\campaign `
--tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1
Expected output — every line ends with True:
0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run, ..., True
0001/0001, Camp2Json, run, ..., True
0001/0001, Psims2Wth, run, ..., True
0001/0001, Jsons2Dssat, run, ..., True
0001/0001, Dssat48, run, ..., True
0001/0001, Out2Psims, run, ..., True
0001/0001, StageOutputsSharedFS,run, ..., True
Known harmless warning (safe to ignore):
UserWarning: Real number is too long
STICS 10
cd C:\Users\<user>\Downloads\psims-release-2.0
.venv\Scripts\python.exe pysims\pysims.py `
--param params.stics.point.sample `
--campaign samples\stics_point_bundle\campaign `
--tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1
Expected output:
0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run, ..., True
0001/0001, Psims2Stics, run, ..., True
0001/0001, Jsons2Stics, run, ..., True
0001/0001, Stics, run, ..., True
0001/0001, Out2Psims, run, ..., True
0001/0001, StageOutputsSharedFS,run, ..., True
How STICS runs on Windows:
pSIMS calls JavaSticsCmd.exe --run <workspace_abs_path> maize from the
JavaStics install directory (workdir). JavaSticsCmd.exe bridges the
XML workspace format to the underlying stics_modulo binary.
APSIM-X (Next Generation)
cd C:\Users\<user>\Downloads\psims-release-2.0
.venv\Scripts\python.exe pysims\pysims.py `
--param params.apsimx.point.sample `
--campaign samples\apsimx_point_bundle\campaign `
--tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1
Expected output:
0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run, ..., True
0001/0001, Psims2Met, run, ..., True
0001/0001, Jsons2Apsimx, run, ..., True
0001/0001, ApsimX, run, ..., True
0001/0001, Out2Psims, run, ..., True
0001/0001, StageOutputsSharedFS,run, ..., True
Checking output values
After any run the output is written to
psims-release-2.0\outputs\output_0001_0001.psims.nc.
cd C:\Users\<user>\Downloads\psims-release-2.0
.venv\Scripts\python.exe -c "
import netCDF4
nc = netCDF4.Dataset('outputs/output_0001_0001.psims.nc')
for v in nc.variables:
if v not in ('lon', 'lat', 'time', 'scen'):
val = float(nc.variables[v][:].flatten()[0])
print(f'{v} = {val} {nc.variables[v].units}')
nc.close()
"
Verified reference outputs (sample data, 1982 Gainesville FL)
| Model | Variable | Value | Units |
|---|---|---|---|
| DSSAT 4.8 | HWAM (grain yield) | 2283 | kg/ha |
| DSSAT 4.8 | PDAT (planting DOY) | 57 | DOY |
| DSSAT 4.8 | MDAT (maturity) | 128 | Days |
| STICS 10 | masec(n) (biomass) | 1.1875 | t/ha |
| STICS 10 | mafruit (grain) | 0.8032 | t/ha |
| APSIM-X | Yield | 2143.31 | kg/ha |
These match the HPC Linux runs exactly.
Troubleshooting
Any step shows False
The failed step will print a Python traceback above the result line.
| Symptom | Likely cause | Fix |
|---|---|---|
FileNotFoundError: DSCSM048.EXE | Wrong executable path | Check executable in params file |
FileNotFoundError: JavaSticsCmd.exe | Wrong STICS path | Update executable and workdir in params |
FileNotFoundError: Models.exe | Wrong APSIM path or version | Check APSIM installation directory |
ModuleNotFoundError | Wrong Python used | Make sure you use .venv\Scripts\python.exe |
FileNotFoundError in StageInputsSharedFS | Running from wrong directory | cd to psims-release-2.0\ first |
"Real number is too long" warning
This comes from DSSAT's CUL/ECO file reader and is harmless. The run completes correctly and values are read correctly.
APSIM takes longer than expected
APSIM-X takes ~10–12 seconds on the first run (JIT warm-up). Subsequent runs in the same session are faster. This is normal.
Output values don't match reference
Possible causes:
- Different APSIM version — APSIM results can vary slightly between versions. Check your installed version against what the params file references.
- Different STICS version — same applies.
- The run completed but with a different input — check that the campaign and weather tile are the same sample files.
Architecture overview
params file ─────────────────────────────────────────────┐
▼
pysims.py → StageInputsSharedFS (copy tiles to workdir)
→ [tappcmp] (campaign → experiment.json)
→ tappwth (climate tile → model weather file)
→ tappinp (soil tile + experiment → model input)
→ model binary (DSSAT / STICS / APSIM)
→ postprocess (model output → psims.nc)
→ StageOutputsSharedFS (copy psims.nc to outputs/)
Each step is a Python class. The pipeline logs each step with its
runtime and a True/False pass flag.
What is different between Windows and Linux
| Aspect | Windows | Linux HPC |
|---|---|---|
| Python | .venv\Scripts\python.exe | ~/.conda/envs/mamba_env/envs/psims/bin/python |
| Working directory | psims-release-2.0\ | ~/psims_2.0/ |
| DSSAT binary | DSCSM048.EXE (Windows PE) | dscsm048 (Linux ELF) |
| STICS invocation | JavaSticsCmd.exe called directly | java -jar JavaSticsCmd.exe (bundled JRE 17) |
| APSIM invocation | Models.exe called directly | Shell wrapper → apptainer exec datamill.sif Models |
| Params location | root of working directory | params/ subdirectory |
| Path separators | backslash (or forward slash in YAML) | forward slash only |
The pipeline code, sample data, and output format are identical on both platforms.
pSIMS 2.0 — Linux HPC Setup and Run Guide
Status as of 2026-05-21: fully operational. All three models (DSSAT 4.8, STICS 10, APSIM-X) run successfully on the HPC.
Quick reference — what is where
| Thing | Path on HPC |
|---|---|
| pSIMS root | ~/psims_2.0/ |
| Python environment | ~/.conda/envs/mamba_env/envs/psims/bin/python |
| DSSAT binary | ~/psims_2.0/tools/dssat48/dscsm048 |
| STICS JAR | ~/psims_2.0/tools/stics10/JavaSticsCmd.exe |
| STICS Java runtime | ~/psims_2.0/tools/jdk-17.0.19+10-jre/bin/java |
| APSIM-X wrapper | ~/psims_2.0/tools/apsimx/Models |
| APSIM-X container | ~/crop-ensemble/containers/datamill.sif |
| Sample data | ~/psims_2.0/samples/ |
| Params (Linux) | ~/psims_2.0/params/ |
| Outputs | ~/psims_2.0/outputs/ |
Replace ~ with /home/<user> if you need absolute paths.
How to connect
ssh <user>@<hpc-host>
If you have an SSH alias configured (e.g. hpc in ~/.ssh/config), you
can just run ssh hpc. See your SSH config for the actual hostname.
Running a model — template command
All three models follow the same pattern:
cd ~/psims_2.0
~/.conda/envs/mamba_env/envs/psims/bin/python pysims/pysims.py \
--param params/<params-file> \
--campaign samples/<bundle>/campaign \
--tlatidx 0001 --tlonidx 0001 \
--latidx 1 --lonidx 1
The --tlatidx/--tlonidx flags choose which tile to read from disk.
--latidx/--lonidx choose the point within that tile.
For the single-point sample data both are always 0001 / 1.
Running each model
DSSAT 4.8
cd ~/psims_2.0
~/.conda/envs/mamba_env/envs/psims/bin/python pysims/pysims.py \
--param params/params.dssat48.point.linux.sample \
--campaign samples/dssat48_point_bundle/campaign \
--tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1
Expected output — every line ends with True:
0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run, ..., True
0001/0001, Camp2Json, run, ..., True
0001/0001, Psims2Wth, run, ..., True
0001/0001, Jsons2Dssat, run, ..., True
0001/0001, Dssat48, run, ..., True
0001/0001, Out2Psims, run, ..., True
0001/0001, StageOutputsSharedFS,run, ..., True
Known harmless warning (safe to ignore):
UserWarning: Real number is too long
STICS 10
cd ~/psims_2.0
~/.conda/envs/mamba_env/envs/psims/bin/python pysims/pysims.py \
--param params/params.stics.point.linux.sample \
--campaign samples/stics_point_bundle/campaign \
--tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1
Expected output:
0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run, ..., True
0001/0001, Psims2Stics, run, ..., True
0001/0001, Jsons2Stics, run, ..., True
0001/0001, Stics, run, ..., True
0001/0001, Out2Psims, run, ..., True
0001/0001, StageOutputsSharedFS,run, ..., True
How STICS runs on Linux:
pSIMS invokes java -jar JavaSticsCmd.exe --run <workspace> maize using
the bundled JRE 17 (tools/jdk-17.0.19+10-jre/). The java_executable
param in the params file activates this mode. No system Java or JavaStics
installation is needed.
APSIM-X (Next Generation)
cd ~/psims_2.0
~/.conda/envs/mamba_env/envs/psims/bin/python pysims/pysims.py \
--param params/params.apsimx.point.linux.sample \
--campaign samples/apsimx_point_bundle/campaign \
--tlatidx 0001 --tlonidx 0001 --latidx 1 --lonidx 1
Expected output:
0001/0001, StageInputsSharedFS, run_tile, ..., True
0001/0001, StageInputsSharedFS, run, ..., True
0001/0001, Psims2Met, run, ..., True
0001/0001, Jsons2Apsimx, run, ..., True
0001/0001, ApsimX, run, ..., True
0001/0001, Out2Psims, run, ..., True
0001/0001, StageOutputsSharedFS,run, ..., True
How APSIM runs on Linux:
The file tools/apsimx/Models is a shell script that calls:
apptainer exec --bind "$(pwd):$(pwd)" \
~/crop-ensemble/containers/datamill.sif \
Models "$@"
The Models binary lives inside the datamill.sif Apptainer container
(AgriScale v1.2.0). The container auto-binds $HOME, so all paths under
~/psims_2.0/ are accessible to APSIM inside the container.
Checking output values
After any run, the output is written to
~/psims_2.0/outputs/output_0001_0001.psims.nc.
cd ~/psims_2.0
~/.conda/envs/mamba_env/envs/psims/bin/python - << 'EOF'
import netCDF4
nc = netCDF4.Dataset('outputs/output_0001_0001.psims.nc')
for v in nc.variables:
if v not in ('lon', 'lat', 'time', 'scen'):
val = float(nc.variables[v][:].flatten()[0])
print(f' {v} = {val} {nc.variables[v].units}')
nc.close()
EOF
Verified reference outputs (sample data, 1982 Gainesville FL)
| Model | Variable | Value | Units |
|---|---|---|---|
| DSSAT 4.8 | HWAM (grain yield) | 2283 | kg/ha |
| DSSAT 4.8 | PDAT (planting DOY) | 57 | DOY |
| DSSAT 4.8 | MDAT (maturity) | 128 | Days |
| STICS 10 | masec(n) (biomass) | 1.1875 | t/ha |
| STICS 10 | mafruit (grain) | 0.8032 | t/ha |
| APSIM-X | Yield | 2143.31 | kg/ha |
These values match exactly with the Windows runs — the pipeline is deterministic across platforms for the sample data.
Troubleshooting
Any step shows False
Check the full output (remove 2>/dev/null from your command). The failed
step will print a Python traceback. Common causes:
| Symptom | Likely cause | Fix |
|---|---|---|
FileNotFoundError in StageInputsSharedFS | Missing sample data directory | Check samples/<model>_point_bundle/ structure |
ApsimX step fails | Apptainer can't find the container | Verify ~/crop-ensemble/containers/datamill.sif exists |
Stics step fails | Java not found | Check tools/jdk-17.0.19+10-jre/bin/java exists |
Dssat48 step fails | Binary not executable | Run chmod +x tools/dssat48/dscsm048 |
Output file not created
If the Out2Psims or StageOutputsSharedFS step shows False, the
output file will not exist. Fix the root cause and re-run. You can safely
delete outputs/output_0001_0001.psims.nc between runs.
Conda environment not found
The activate step is not needed when using the full path:
~/.conda/envs/mamba_env/envs/psims/bin/python
If that path does not exist, check with:
conda env list
Architecture overview
params file ─────────────────────────────────────────────┐
▼
pysims.py → StageInputsSharedFS (copy tiles to workdir)
→ [tappcmp] (campaign → experiment.json)
→ tappwth (climate tile → model weather file)
→ tappinp (soil tile + experiment → model input)
→ model binary (DSSAT / STICS / APSIM)
→ postprocess (model output → psims.nc)
→ StageOutputsSharedFS (copy psims.nc to outputs/)
Each step is a Python class. The pipeline logs each step with its
runtime and a True/False pass flag.
Sample data layout
samples/
dssat48_point_bundle/
campaign/ Campaign.nc4 + exp_template.json
weather/ 0001/clim_0001_0001.tile.nc4
soils/ 0001/soil_0001_0001.tile.nc4
refdata/ DSCSM048.EXE, CUL/ECO/SPE files, ...
stics_point_bundle/
campaign/ experiment.json
weather/ 0001/clim_0001_0001.tile.nc4
soils/ 0001/soil_0001_0001.tile.nc4
refdata/ workspace_template/ (STICS XML files)
apsimx_point_bundle/
campaign/ experiment.json + Maize.apsimx
weather/ → symlink to dssat48_point_bundle/weather/
soils/ → symlink to dssat48_point_bundle/soils/
refdata/ Maize.apsimx
The APSIM weather and soils directories are symlinks to the DSSAT bundle — the pSIMS NetCDF tile format is the same for all models.
Re-deploying after code changes
When you push pSIMS code changes from Windows, sync them to the HPC with:
# From your local machine (Git Bash or WSL)
rsync -av --exclude='*.pyc' --exclude='__pycache__' \
/path/to/psims-release-2.0/pysims/ \
<user>@<hpc-host>:~/psims_2.0/pysims/
Sample data and params files only need to be re-synced if they change.