Machine learning advances
Machine learning (ML) is a data-driven approach that trains a regression
or classification model, typically a complex nonlinear mapping with
adjustable parameters, on a training data set. Several recent river
carbon cycle studies have used random forest (RF) algorithms; for
example, Maavara et al. (2023) calibrated an RF that extrapolated GPP to
almost 100,000 river reaches and lakes within the watershed using
available predictor data such as flow, temperature, and canopy cover
(Abbe & Montgomery, 1996). Segatto et al. (2021, 2023) also improved
metabolic upscaling by adding a temporal dimension to their predictions
of metabolic regimes: they trained RFs on long-term, sensor-based
estimates of GPP and ER in the Ybbs River catchment in Austria, together
with catchment physical and climate properties. However,
RFs typically require large datasets and their transferability to
systems for which they have not been trained can be problematic. Deep
learning (DL) is a further branch of ML, distinguished by neural network
architectures with multiple layers of neurons, which give a greater
capacity to represent complex functions than shallow neural networks
(Zhang et al., 2021a).
Accurate quantification of carbon emissions from aquatic systems remains
constrained by scientific uncertainties and by the high complexity of
physical and chemical process linkages, which are characterized by
non-stationarity, dynamism, and non-linearity. As a result, prediction
and forecasting with
process-driven methods can be inaccurate; for rivers, water temperature
and discharge data currently provide the best opportunities for
forecasting, whereas research on near-term biological/chemical
predictions has advanced more quickly for lakes (McClure et al., 2021).
DL has been suggested as a potential means to overcome uncertainty and
nonlinearity in river sciences (Shen, 2018) and is now being applied to
hydrologic predictions (e.g. water level and discharge; Xu et al.,
2022), regional rainfall-runoff linkages (Zhang et al., 2021a) and water
quality dynamics (Zheng et al., 2023). These applications matter because
climate change is increasing the need to reduce flood risk. DL also has
relevance in aquatic ecosystem prediction, including data mining and
identifying outliers (Kim et al., 2022). For water quality data, DL
methods have shown potential to predict N and P concentrations from
physical data that can be collected more easily with sensors (e.g. pH,
turbidity, temperature, DO, conductivity) (Ba-Alawi et al., 2023).
Moreover, DL can serve both as an auxiliary tool for process-driven
methods, reducing computational loads in uncertainty analyses (Li et
al., 2020), and as a component of process-driven models, describing a
process that is difficult to characterize mathematically (Huang et al.,
2022).
Physical models can now be embedded into DL models to improve
performance and mitigate risks by providing important supplementary
information (Reichstein et al., 2019; Huang et al., 2022).
Physics-informed neural network (PINN) models incorporate the residual
of physical principles (e.g. governing equations) as a regularization
term in the loss function, so that predictions that violate those
principles are penalized during training (Tartakovsky et al., 2020).
PINNs are increasingly being applied in areas such as estimating water
quantity and quality (Liang et al., 2019). The development of
physics-informed surrogate models that link DOM concentrations and other
water quality data with river flows could therefore offer the potential
for forecasting carbon emissions with greater accuracy and with improved
consideration of uncertainty propagation.
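
As a minimal sketch of the physics-informed loss described above, the
Python snippet below adds a penalty on the residual of a governing
equation to an ordinary data-misfit term. The governing equation
(first-order decay of a solute along a reach, dC/dx = -kC), the rate
constant k, the penalty weight, the synthetic observations and all
function names are illustrative assumptions rather than anything taken
from the cited studies, and plain functions stand in for a neural
network.

```python
import math

def physics_informed_loss(predict, obs_x, obs_c, colloc_x, k, weight=1.0, h=1e-4):
    """Return (total, data, physics) loss for a candidate model `predict`.

    Assumed governing equation (illustrative only): dC/dx + k*C = 0.
    """
    # Data term: mean squared misfit at the observed locations.
    data = sum((predict(x) - c) ** 2 for x, c in zip(obs_x, obs_c)) / len(obs_x)
    # Physics term: mean squared residual of the governing equation at
    # collocation points, with dC/dx from a central finite difference.
    residuals = []
    for x in colloc_x:
        dcdx = (predict(x + h) - predict(x - h)) / (2 * h)
        residuals.append(dcdx + k * predict(x))
    physics = sum(r ** 2 for r in residuals) / len(residuals)
    return data + weight * physics, data, physics

# Hypothetical setup: two observations that both candidates fit.
k = 0.5
obs_x = [0.0, 2.0]
obs_c = [10.0, 10.0 * math.exp(-k * 2.0)]
colloc_x = [0.0, 0.5, 1.0, 1.5, 2.0]

def decay_curve(x):
    # Candidate that satisfies dC/dx = -k*C.
    return 10.0 * math.exp(-k * x)

def straight_line(x):
    # Candidate that fits the observations but ignores the physics.
    return obs_c[0] + (obs_c[1] - obs_c[0]) * x / 2.0

loss_decay = physics_informed_loss(decay_curve, obs_x, obs_c, colloc_x, k)
loss_line = physics_informed_loss(straight_line, obs_x, obs_c, colloc_x, k)
print("decay curve   (total, data, physics):", loss_decay)
print("straight line (total, data, physics):", loss_line)
```

Both candidates have essentially zero data misfit, so only the physics
term separates them. In an actual PINN the same combined loss would be
minimized with respect to the network weights, with derivatives usually
obtained by automatic differentiation rather than finite differences.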
Transfer learning (TL) developments offer additional potential for DL
applications in water resource science and management. TL takes
knowledge gained from a previous task and applies it to a new task (Pan
& Yang, 2010). The previous task is usually represented by an efficient
ML model trained on large datasets, whereas the new tasks are related to
the previous task but have smaller datasets. TL methods in hydrology
have focused mainly on data interpolation and prediction in areas where
observed data are missing or unavailable. For example, Willard et al.
(2021) showed how lake water temperature can be predicted in areas
without monitoring, and Zhou (2020) developed real-time predictions of
river water quality for situations where data were missing (e.g. broken
sensors).
Applications to river carbon cycle understanding and management could
include learning between catchments that differ in data availability
(e.g. Figure 1), enabling knowledge gained from the better-studied
catchment(s) to advance understanding of the less-studied system(s).
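
The pre-train and fine-tune pattern behind this kind of cross-catchment
learning can be sketched in a few lines of Python. This is a toy
illustration under stated assumptions: a two-parameter linear model
stands in for a deep network, both "catchments" are synthetic, and the
coefficients, learning rates and step counts are hypothetical choices
made only so that the example is deterministic.

```python
def fit_linear(xs, ys, w=0.0, b=0.0, lr=0.1, steps=20):
    """Gradient descent on mean squared error for the model y ~ w*x + b."""
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def mse(w, b, xs, ys):
    """Mean squared error of the linear model on (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Data-rich "source" catchment (synthetic): y = 2.0*x + 1.0.
source_x = [i / 99 for i in range(100)]
source_y = [2.0 * x + 1.0 for x in source_x]

# Data-poor "target" catchment (synthetic): related but different response.
target_x = [0.1, 0.3, 0.5, 0.7, 0.9]
target_y = [2.2 * x + 1.5 for x in target_x]
eval_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
eval_y = [2.2 * x + 1.5 for x in eval_x]

# 1) Pre-train on the large source data set.
w_src, b_src = fit_linear(source_x, source_y, lr=0.5, steps=2000)
# 2) Fine-tune: start from the source parameters, few steps on target data.
w_ft, b_ft = fit_linear(target_x, target_y, w=w_src, b=b_src)
# 3) Baseline: same small budget on target data, starting from zero.
w_new, b_new = fit_linear(target_x, target_y)

mse_source_only = mse(w_src, b_src, eval_x, eval_y)
mse_finetuned = mse(w_ft, b_ft, eval_x, eval_y)
mse_scratch = mse(w_new, b_new, eval_x, eval_y)
print(mse_source_only, mse_finetuned, mse_scratch)
```

In this synthetic setting the fine-tuned model has a lower evaluation
error than both the unadapted source model and the model trained from
scratch on the small target data set, which is the behaviour TL aims
for. Whether the same holds for real catchments depends on how closely
the source and target systems are related.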
Despite numerous successful DL applications in aquatic sciences,
challenges and risks remain in applying these approaches for aquatic
carbon management. First, an overarching issue for all ML applications
is the potential for sensor and data-processing security breaches
(Richards et al., 2023), which pose risks for water security. A second
issue concerns detection, as the accuracy of DL methods relies on the
quantity of observational data. Insufficient data may prevent DL from
achieving satisfactory precision (Cao et al., 2022). Even in developed
countries with well-established infrastructure, the cost of obtaining a
substantial volume of high-precision environmental monitoring data, such
as that needed for river carbon cycle estimation, could hinder the
application of DL in some locations (Richards et al., 2023). Moreover,
water quality monitoring networks in developing countries are often
limited by financial resources and technical capabilities and so must
prioritize resource allocation. Third, DL methods work well only when
training and test data are drawn from the same feature space and
distribution (Pan & Yang, 2010). This implies that DL methods must be
specifically designed and tailored to context. Because of factors such
as geometry and land cover, aquatic systems often differ between
watersheds, meaning that models developed in other study areas can lead
to errors in prediction and risks for decision-making. However, DL
models that incorporate explicit mechanisms into the training process
are beginning to emerge to overcome these issues, offering strong
potential to further advance our understanding of river carbon cycling
and emissions.
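
One simple way to check the third issue above before transferring a
model is to compare the distribution of each predictor in the training
catchment with that in the new catchment. The sketch below computes the
two-sample Kolmogorov-Smirnov statistic for a single predictor. The
choice of this statistic, the function name and the water temperature
values are assumptions made for illustration; the cited studies do not
prescribe a particular test.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distribution functions of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value in the pooled sample.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

# Hypothetical water temperatures (degrees C) for one predictor.
train_temp = [8, 9, 10, 11, 12, 13, 14, 15]
new_temp = [14, 15, 16, 17, 18, 19, 20, 21]

shift = ks_statistic(train_temp, new_temp)
print(shift)
```

The statistic ranges from 0 (identical empirical distributions) to 1
(no overlap). A large value for an important predictor suggests that the
model would be extrapolating in the new catchment; what counts as
"large" is a judgment that depends on the application.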