Abstract

Observational uncertainty poses major challenges to groundwater model calibration. As the primary source of information for multi-source data assimilation, monitoring network design is critical for accurately characterizing subsurface dynamics. Under limited measurement accuracy or cost constraints, monitoring networks must remain robust to observational errors. This study develops a multivariate network design framework that quantifies the uncertainty of multicomponent responses using joint entropy and employs deep learning to accelerate computations. Case study results show that the framework reliably estimates non-Gaussian permeability fields even under high-noise observations. The calibrated reactive transport model demonstrates strong capability in reproducing historical data and predicting system responses. This work advances the understanding of multi-source data fusion and supports the development of groundwater monitoring networks under observational uncertainty. Moreover, the proposed approach can be extended to the design of geophysical survey lines that integrate geophysical data.

Plain Language Summary

Groundwater monitoring is essential for understanding subsurface flow and contaminant transport. To obtain more reliable estimates of hydrogeological parameters, researchers often integrate diverse field observations with numerical models. However, observational uncertainty critically limits the reliability of groundwater modeling. We developed a monitoring network design framework grounded in the joint uncertainty of multivariate responses, employing an entropy-based data discretization strategy to guide the optimization algorithm in selecting monitoring stations. This framework enables robust estimation of non-Gaussian permeability fields under high-noise observations at optimal networks. Overall, the proposed approach provides a new perspective on groundwater monitoring and data utilization in the presence of observational uncertainty.

Key Points

A multivariate groundwater network design framework is developed by combining information entropy theories and deep learning methods
The developed framework can identify non-Gaussian permeability fields by integrating multi-source observations with high noise
The calibrated model honors both historical and predictive responses of water flow and multicomponent transport

1 Introduction

Accurate hydrological modeling is essential for predicting the spatiotemporal evolution of water flow and contaminant transport in aquifer systems (Sheng et al., 2024). Nevertheless, limited knowledge of system dynamics and parameter uncertainty restrict the accurate simulation of real-world hydrological processes (M. Cao et al., 2024; Olorunsaye & Heiss, 2024). Data assimilation (DA) frameworks, which integrate observational data to update model states (Zhan et al., 2022), hydraulic parameters (C. Cao et al., 2024), and initial (Meyer et al., 2019) and boundary conditions (Chen et al., 2021), provide a powerful means of reducing this uncertainty.

Within the Bayesian probabilistic framework, DA algorithms explicitly characterize discrepancies between model predictions and observations, enabling optimal integration and posterior uncertainty quantification. The deterministic inversion methods can produce a unique solution (Chen & Dai, 2024), while stochastic DA provides a comprehensive representation of uncertainty in both model states and parameters (Parrish et al., 2012). DA is therefore particularly well suited to complex hydrological systems characterized by substantial parameter uncertainty and limited data availability (Ghorbanidehno et al., 2020). The advancement of generalized DA methods for nonlinear, high-dimensional, and non-Gaussian problems critically depends on the quality of observational information and the performance of the inversion algorithms (J. Zhang et al., 2024). On the algorithmic front, substantial progress has been made, evolving from basic sampling approaches, such as Markov Chain Monte Carlo (Vrugt et al., 2008), to more efficient ensemble-based methods, including the ensemble Kalman filter (Xue & Zhang, 2014) and the ensemble smoother (Bao et al., 2020). These methods have been further enhanced through local update strategies to better address nonlinearity, such as the Iterative Local Update Smoother (Zhang et al., 2018), and through adaptive techniques for parameter estimation and uncertainty reduction (Ju et al., 2018).

Despite advances in DA methods, their performance remains constrained by the information value of the observational data (Zheng et al., 2018). Multi-source data fusion is a vital strategy for enhancing the diversity of response information (Gettelman et al., 2022). However, the acquisition of observational data is inherently subject to multiple sources of uncertainty, including instrument-related errors, temporal and spatial variability in measurements, environmental noise, and human-induced biases during data collection or preprocessing. Moreover, different data types often exhibit varying accuracies and sensitivities to system dynamics, leading to complex error structures during data fusion (da Silveira Barcellos & de Souza, 2022).

Optimizing monitoring networks is crucial for enhancing data utilization, particularly under observational uncertainty (Meggiorin et al., 2024). This requires defining an optimization criterion to quantify the information value of each station. Common sensitivity-based design criteria (Hsu & Yeh, 1989; Knopman et al., 1991; Sciortino et al., 2002) optimize monitoring stations iteratively by maximizing sensitivity and minimizing correlation. However, these criteria, which rely on the first two-order moments of parameter distributions, are not well suited for complex systems with multimodal parameter distributions. Entropy-based metrics provide a quantitative framework for evaluating the uncertainty associated with monitoring variables and the information transfer among stations (Shannon, 1948). Commonly used measures include marginal entropy, joint entropy, conditional entropy, transinformation, and total correlation. Keum et al. (2017) reviewed the application of entropy theory in designing monitoring networks for precipitation, streamflow, water level, soil moisture, groundwater, and water quality. Building on this foundation, multivariate network designs based on conditional entropy have been proposed by Keum and Coulibaly (2017), further demonstrating the potential of information-theoretic approaches to support integrated hydrological observations.

Incorporating optimized networks with hydrological models and DA frameworks is essential for validating their effectiveness and enhancing credibility (Ma et al., 2025). Careful consideration must also be given to data discretization strategies (Bosserelle & Hughes, 2024), objective functions (Foroozand & Weijs, 2020), temporal dynamics (Wang et al., 2018), and boundary constraints (Leach et al., 2016). Because entropy-based information estimation replaces probabilities with frequencies, the analysis requires data sets that accurately represent system uncertainty. Constructing such data sets in complex hydrological systems poses major challenges, including the “curse of dimensionality” and high computational costs. Recent advances in deep learning provide alternatives by reducing computational demands and enhancing practical applicability (Zhi et al., 2024). For instance, Generative Adversarial Networks (Ling & Jafarpour, 2024) and Denoising Diffusion Probabilistic Models (X. Zhang et al., 2024) enable the parameterization of high-dimensional model parameters into low-dimensional feature vectors, while deep learning-based surrogate models can efficiently replace CPU-intensive forward models (Zhan et al., 2025).

Our previous work (Cao, Dai, Chen, et al., 2025; Chen et al., 2022) incorporated these recent advances in information entropy and deep learning into a monitoring network optimization framework and achieved reliable inversion results under ideal observations. However, as discussed above, measurement data in practical applications often have limited accuracy, and cost constraints may require a higher tolerance to observational errors. Therefore, the core objective of this study is to develop monitoring networks that can still ensure reliable estimation under high-noise observations. To this end, we developed a multivariate monitoring network design framework that integrates joint entropy with deep learning.

Accordingly, a multivariate, multi-objective optimization algorithm was established for optimal network design, in which the uncertainties of hydraulic heads, pH, and multicomponent concentrations were quantified through joint entropy. To improve computational efficiency, an integrated surrogate modeling framework was introduced, combining deep convolutional generative adversarial networks for reconstructing non-Gaussian permeability fields with deep convolutional residual networks for predicting multivariate system responses. A data discretization strategy grounded in information entropy was further employed to guide the optimization algorithm in monitoring station selection, thereby enhancing the robustness of multi-source DA under observational noise. The framework was validated with a coupled hydrogeochemical model, successfully identifying non-Gaussian permeability fields through assimilation of high-noise, multi-source observations. Overall, this study provides a novel theoretical foundation for integrating multi-source response data and designing multivariate groundwater networks with enhanced tolerance to high-noise observations, offering new insights for practical applications in complex groundwater modeling.

2 Methodology

This section introduces the methodology developed in this study, with the overall framework illustrated in Figure 1 and further details provided in Sections 2.1-2.3.

**Figure 1**

(a) Integrated surrogate modeling framework, combining a Deep Convolutional Generative Adversarial Network (DCGAN) for parameterization and a Deep Convolutional Residual Network (DCRN) as the surrogate model. (b) Schematic illustration of the multivariate groundwater network design method. The acronym MOO denotes multi-objective optimization. (c) Workflow diagram of network design strategies under high-noise observations.

2.1 Integrated Surrogate Modeling

This section introduces a framework of deep learning–based integrated surrogate modeling, as illustrated in Figure 1a. The framework couples a deep convolutional generative adversarial network (DCGAN) for the parameterization of non-Gaussian permeability fields with a deep convolutional residual network (DCRN) for the replacement of forward models.

DCGAN, proposed by Radford et al. (2015), consists of two adversarial networks: a generator $G$ and a discriminator $D$ . The generator $G(\mathrm{z})$ takes a noise vector $\mathrm{z}\mathit{\sim }{p}_{\mathrm{z}}(\mathrm{z})$ , sampled from a prior distribution, and reconstructs synthetic data resembling real data in the training set. Then, the discriminator $D(\mathrm{x})$ evaluates the generated data against real instances $\mathrm{x}\mathit{\sim }{p}_{\text{data}}(\mathrm{x})$ . The generator is trained to maximize the probability of the discriminator misclassifying its outputs as real. DCGAN improves training stability by employing strided convolutions in the discriminator and fractional-strided convolutions in the generator, making it particularly effective for image-like data. In this study, DCGAN is trained to capture complex spatial features and reconstruct non-Gaussian permeability fields from low-dimensional latent vectors, enabling efficient sampling. Furthermore, we employ DCRN to approximate the mapping of the forward model from permeability fields to system responses. The residual structure of the DCRN network mitigates the vanishing gradient issue in high-dimensional mappings. The training data set takes reconstructed non-Gaussian permeability fields as inputs and normalized multivariate response fields as outputs. Specifically, all response outputs from the numerical model are uniformly processed using min–max normalization (Chen et al., 2021), ensuring that different types of response data share the same scale. This facilitates surrogate model training and enables simultaneous multivariate prediction. The framework enables efficient sampling of non-Gaussian permeability fields and rapid prediction of multivariate dynamic responses, thereby substantially reducing the computational burden for multivariate network design and performance evaluation under uncertainty.
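The min–max normalization step described above can be illustrated with a minimal sketch. The function name `min_max_normalize`, the variable names, and the sample values are hypothetical; the study applies the normalization to full response fields from the numerical model, whereas this example uses four grid cells per variable.

```python
def min_max_normalize(values):
    """Scale a list of floats to [0, 1]; a constant field maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0.0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical responses at four grid cells (values are illustrative only).
responses = {
    "head_m": [100.0, 112.5, 137.5, 150.0],
    "pH": [6.8, 7.0, 7.4, 7.6],
}

# Each response type is normalized separately so that all share the same scale.
normalized = {name: min_max_normalize(vals) for name, vals in responses.items()}
print(normalized["head_m"])
```

After this step, heads, pH, and concentrations all lie in [0, 1], which is what allows a single surrogate network to predict them simultaneously.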

2.2 Multivariate Network Design

As shown in Figure 1b, let $\mathcal{P}=\left\{X_{ij}^{P}\mid i=1,2,\ldots,M;\ j=1,2,\ldots,m\right\}$ represent an initial hydrological monitoring network comprising $M$ densely distributed preselected stations, each recording $m$ hydrological variables, and let $X_{i}^{P}$, $i=1,2,\ldots,M$, denote the multivariate random variable at station $i$. Unlike the marginal entropy in the univariate case, the uncertainty of $X_{i}^{P}$ is quantified by its joint entropy, as follows:

$$H\left(X_{i}^{P}\right)=H\left(X_{i1}^{P},X_{i2}^{P},\ldots,X_{im}^{P}\right)=-\sum_{r_{1}=1}^{n_{1}}\cdots\sum_{r_{m}=1}^{n_{m}}p_{r_{1},\ldots,r_{m}}\log_{2}p_{r_{1},\ldots,r_{m}} \tag{1}$$

where $p_{r_{1},\ldots,r_{m}}$ denotes the joint probability of the $m$ hydrological variables, estimated from empirical frequencies. To ensure computational feasibility, we adopted the merging algorithm (Alfonso et al., 2010), which reformulates multivariate entropy into lower-dimensional forms without information loss (e.g., $H(X,Y,Z)=H(\langle X,Y\rangle,Z)$).
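The joint entropy of Equation 1 and the merging identity can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the sample series are hypothetical, and merging is done here by pairing labels into tuples, which is equivalent to assigning a new label to each distinct combination.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits, with probabilities replaced by frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def merge(*variables):
    """Merge discretized variables into one variable whose labels are tuples."""
    return list(zip(*variables))


def joint_entropy(*variables):
    """Joint entropy of several discretized variables (Equation 1)."""
    return entropy(merge(*variables))


# Hypothetical discretized samples of three variables at one station.
head = [0, 0, 1, 1, 2, 2, 0, 1]
ph = [0, 1, 0, 1, 0, 1, 0, 1]
ca = [1, 1, 0, 0, 1, 0, 1, 0]

h_direct = joint_entropy(head, ph, ca)          # H(X, Y, Z)
h_merged = joint_entropy(merge(head, ph), ca)   # H(<X, Y>, Z)
print(h_direct, h_merged)
```

Because merging is a one-to-one relabeling of the joint states, the two values agree, which is why the reduction to lower-dimensional forms loses no information.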

The optimal monitoring networks are determined using a multi-objective optimization algorithm based on a multivariate maximum information minimum redundancy (MIMR) criterion, which is formulated as:

$$\underset{\text{multivariate-MIMR}}{\max}=\lambda_{1}H\left(X_{1}^{S},X_{2}^{S},\ldots,X_{K}^{S}\right)+\sum_{i=1}^{L}T\left(X_{1}^{S},X_{2}^{S},\ldots,X_{K}^{S};X_{i}^{U}\right)-\lambda_{2}C\left(X_{1}^{S},X_{2}^{S},\ldots,X_{K}^{S}\right) \tag{2}$$

where $K$ and $L$ denote the numbers of selected and unselected monitoring stations, respectively. The term

$$T\left(X_{1}^{S},X_{2}^{S},\ldots,X_{K}^{S};X_{i}^{U}\right)=\sum_{i_{1}=1}^{n_{1}}\cdots\sum_{i_{K}=1}^{n_{K}}\sum_{i^{\prime}=1}^{n^{\prime}}p_{i_{1},\ldots,i_{K},i^{\prime}}\log_{2}\frac{p_{i_{1},\ldots,i_{K},i^{\prime}}}{p_{i_{1},\ldots,i_{K}}\,p_{i^{\prime}}}$$

defines the transinformation between the set of selected stations and each individual unselected station, and

$$C\left(X_{1}^{S},X_{2}^{S},\ldots,X_{K}^{S}\right)=\sum_{i=1}^{K}H\left(X_{i}^{S}\right)-H\left(X_{1}^{S},X_{2}^{S},\ldots,X_{K}^{S}\right)$$

quantifies redundancy among selected stations. The weighting parameters λ₁ and λ₂ balance the trade-off between effective information and redundant information, and are set to 0.8 and 0.2, following Li et al. (2012).
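The three terms of Equation 2 can be evaluated from empirical frequencies, as in the sketch below. It assumes that each station has already been merged into a single discretized label series, and it computes transinformation through the standard identity T(S; U) = H(S) + H(U) - H(S, U), which is equivalent to the summation form above. Function names and sample data are hypothetical.

```python
import math
from collections import Counter


def H(*variables):
    """Joint entropy in bits of one or more discretized label series."""
    joint = list(zip(*variables))
    n = len(joint)
    return -sum((c / n) * math.log2(c / n) for c in Counter(joint).values())


def transinformation(selected, candidate):
    """T(S; U) = H(S) + H(U) - H(S, U)."""
    return H(*selected) + H(candidate) - H(*selected, candidate)


def total_correlation(selected):
    """C(S) = sum_i H(X_i) - H(X_1, ..., X_K)."""
    return sum(H(x) for x in selected) - H(*selected)


def mimr_score(selected, unselected, lam1=0.8, lam2=0.2):
    """Objective of Equation 2 as written: lam1*H + sum(T) - lam2*C."""
    transfer = sum(transinformation(selected, u) for u in unselected)
    return lam1 * H(*selected) + transfer - lam2 * total_correlation(selected)


# Hypothetical label series: a and b are independent, c duplicates a.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = list(a)

print(total_correlation([a, b]))      # no redundancy between a and b
print(total_correlation([a, c]))      # c repeats a, so one bit is redundant
print(mimr_score([a, b], [c]))
```

In this toy case the pair (a, b) carries two bits with no redundancy and fully explains the unselected station c, whereas the pair (a, c) would be penalized for carrying the same bit twice.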

The multivariate groundwater monitoring network design procedure developed in this study is summarized as follows. First, specify the information ratio ${R}_{sp}\in (0,1)$ and initialize the selected network set $\mathcal{S}$ . Then, compute the multivariate joint entropy by Equation 1 for each preselected station and add the one with the highest uncertainty to $\mathcal{S}$ . The set $\mathcal{S}$ is then expanded according to Equation 2. After each addition, the joint entropy $H(\mathcal{S})$ is updated, and the process terminates when $H(\mathcal{S})\mathit{\ge }{R}_{sp}\times H(\mathcal{P})$ . Subsequently, perform random sampling of the permeability field, repeat the above selection procedure, and incorporate additional stations into $\mathcal{S}$ . The ensemble network is finalized when successive trials fail to add new stations.
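The single-network part of this procedure can be sketched as a greedy loop. This is a simplified illustration under stated assumptions: each station is represented by one merged label series, candidates are ranked by the Equation 2 objective evaluated on the enlarged set, and the data are synthetic. The outer loop over random permeability samples that builds the ensemble network is not shown.

```python
import math
import random
from collections import Counter


def H(*variables):
    """Joint entropy in bits of one or more discretized label series."""
    joint = list(zip(*variables))
    n = len(joint)
    return -sum((c / n) * math.log2(c / n) for c in Counter(joint).values())


def score(sel, unsel, lam1=0.8, lam2=0.2):
    """Equation 2 objective for a selected set and the remaining stations."""
    h_sel = H(*sel)
    transfer = sum(h_sel + H(u) - H(*sel, u) for u in unsel)
    redundancy = sum(H(x) for x in sel) - h_sel
    return lam1 * h_sel + transfer - lam2 * redundancy


def select_network(stations, r_sp=0.99):
    """Greedy selection; `stations` maps a station id to its label series."""
    h_total = H(*stations.values())
    # Start from the station with the highest (joint) entropy.
    chosen = [max(stations, key=lambda s: H(stations[s]))]
    # Expand until the selected set covers the requested share of information.
    while H(*(stations[s] for s in chosen)) < r_sp * h_total:
        rest = [s for s in stations if s not in chosen]

        def gain(cand):
            sel = [stations[s] for s in chosen + [cand]]
            unsel = [stations[s] for s in rest if s != cand]
            return score(sel, unsel)

        chosen.append(max(rest, key=gain))
    return chosen


# Synthetic example: three independent stations plus two exact duplicates.
rng = random.Random(0)
base = {k: [rng.randrange(4) for _ in range(200)] for k in "abc"}
stations = dict(base, d=list(base["a"]), e=list(base["b"]))

network = select_network(stations)
print(network)
```

The duplicated stations add no joint entropy and raise the redundancy term, so the loop is expected to stop after picking one station from each independent group.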

2.3 Network Design Amid High-Noise Observations

2.3.1 Data Discretization Setups

The data discretization influences the probability distribution of model responses and thus determines the calculation of information entropy (Cellucci et al., 2005). In this study, multivariate random variables were normalized to the interval [0,1] and discretized according to the following expression:

$$x^{\prime}=\left\lfloor\frac{x}{a}\right\rfloor \tag{3}$$

where $a$ denotes the bin width. Finer discretization (i.e., a smaller $a$) provides a closer approximation to the true data distribution, but limited sample sizes may introduce statistical fluctuations that compromise the accuracy of probability estimation.

With a sufficiently large sample size, applying finer discretization to continuous data enables more accurate quantification of data uncertainty. In contrast, coarser discretization is preferable for entropy estimation when samples are limited or data are noisy. In this study, three bin-width settings were defined: $a=0.1$ for fine discretization, $a=0.25$ for moderate discretization, and $a=0.5$ for coarse discretization. These correspond to 10, 4, and 2 categorical divisions of the model responses, respectively, reflecting different levels of tolerance to data errors.
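A small sketch shows how the three bin widths label the same normalized value and why wider bins are less sensitive to perturbations. The function name and sample values are hypothetical. One assumption is added for illustration: the boundary value x = 1 is folded into the top bin so that exactly 10, 4, and 2 labels occur.

```python
import math


def discretize(x, a):
    """Bin index floor(x / a) for a value x normalized to [0, 1] (Equation 3)."""
    n_bins = round(1.0 / a)
    # Assumption: x = 1.0 is assigned to the last bin instead of opening a new one.
    return min(math.floor(x / a), n_bins - 1)


clean = 0.52   # hypothetical normalized response
noisy = 0.62   # the same response with an error of 0.1 added

for a in (0.1, 0.25, 0.5):
    print(a, round(1.0 / a), discretize(clean, a), discretize(noisy, a))
```

With $a=0.1$ the perturbed value moves to a different bin, whereas with $a=0.25$ and $a=0.5$ both values keep the same label, so the estimated frequencies are unaffected by this error.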

2.3.2 Multi-Source Data Assimilation

We employ the iterative local updating ensemble smoother (ILUES) developed by Zhang et al. (2018) to update parameters of interest by integration of multi-source observations, as illustrated in Figure 1c. Compared with the standard ensemble smoother, ILUES offers improved stability and convergence in nonlinear problems, making it more suitable for geological modeling and subsurface flow simulations under high uncertainty. The ILUES update equation is as follows:

$$p_{k}^{n+1}=p_{k}^{n}+K^{n}\left(d_{k}^{\ast}-f\left(G\left(p_{k}^{n}\right)\right)\right) \tag{4}$$

where the Kalman gain is computed as $K^{n}=C_{PM}^{n}\left(C_{MM}^{n}+N_{\text{iter}}\cdot C_{D}\right)^{-1}$, in which $C_{PM}^{n}$ and $C_{MM}^{n}$ denote the cross-covariance between parameters and model responses and the auto-covariance of model responses, respectively, and $C_{D}$ denotes the observation error covariance matrix. The perturbed observation is defined as $d_{k}^{\ast}=d+\sqrt{N_{\text{iter}}}\cdot e_{k}^{n}$, with $e_{k}^{n}\sim\mathcal{N}\left(0,C_{D}\right)$ representing the stochastic perturbation of the observation vector.
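The update of Equation 4 can be sketched for the simplest case of a single scalar observation, where the inverse in the Kalman gain reduces to a division. This is a simplified illustration, not ILUES itself: the local ensemble selection of ILUES and the DCGAN generator are omitted, and the linear forward model, ensemble size, and all numerical values are hypothetical.

```python
import math
import random


def smoother_step(ensemble, forward, d_obs, c_d, n_iter, rng):
    """One Equation 4 update of every ensemble member for one scalar observation."""
    n_e, n_p = len(ensemble), len(ensemble[0])
    preds = [forward(p) for p in ensemble]
    p_mean = [sum(p[j] for p in ensemble) / n_e for j in range(n_p)]
    m_mean = sum(preds) / n_e
    # Sample cross-covariance C_PM (one entry per parameter) and variance C_MM.
    c_pm = [
        sum((p[j] - p_mean[j]) * (m - m_mean) for p, m in zip(ensemble, preds)) / (n_e - 1)
        for j in range(n_p)
    ]
    c_mm = sum((m - m_mean) ** 2 for m in preds) / (n_e - 1)
    gain = [c / (c_mm + n_iter * c_d) for c in c_pm]
    updated = []
    for p, m in zip(ensemble, preds):
        # Perturbed observation d* = d + sqrt(N_iter) * e, with e ~ N(0, C_D).
        d_star = d_obs + math.sqrt(n_iter) * rng.gauss(0.0, math.sqrt(c_d))
        updated.append([p[j] + gain[j] * (d_star - m) for j in range(n_p)])
    return updated


rng = random.Random(0)
forward = lambda p: p[0] + 2.0 * p[1]     # hypothetical linear forward model
d_obs, c_d, n_iter = 5.0, 0.01, 4         # illustrative values only
prior = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]

ensemble = prior
for _ in range(n_iter):
    ensemble = smoother_step(ensemble, forward, d_obs, c_d, n_iter, rng)

prior_misfit = abs(sum(map(forward, prior)) / len(prior) - d_obs)
post_misfit = abs(sum(map(forward, ensemble)) / len(ensemble) - d_obs)
print(prior_misfit, post_misfit)
```

Inflating the observation error by $N_{\text{iter}}$ damps each individual update, so the information in the observation is absorbed gradually over the iterations rather than in one step.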

3 Numerical Example

The primary purpose of the modeling case is to demonstrate the feasibility of the developed multivariate groundwater monitoring network design framework under high observational noise. To balance computational efficiency against model complexity, a two-dimensional multicomponent reactive transport model was employed to validate the framework. It should be emphasized, however, that the proposed approach is generally applicable and can be readily adapted to three-dimensional models.

The model domain covers an 80 × 80 ${\text{km}}^{2}$ area discretized into a uniform 80 × 80 grid, and simulations are conducted with TOUGHREACT. Groundwater flow is governed by prescribed boundary conditions, including constant hydraulic heads on the left and right sides with a head difference of ΔH = 50 m, while the remaining boundaries are treated as no-flow. In addition to boundary conditions, the model incorporates the following assumptions. Groundwater flow is considered saturated and incompressible, following Darcy's law. The reactive transport processes include advection, dispersion, and chemical reactions, which were predefined based on previous studies, with reaction types and associated parameters treated as determined.

To represent non-Gaussian heterogeneity, the channelized field is generated from a training image (Figure 2a) using multiple-point statistics (MPS) simulations. The implementation follows the code provided by Zhang et al. (2020), while further details of the MPS approach can be found in Mariethoz et al. (2010). Figure 2b shows several random realizations of the channelized field. We followed the method proposed by Mo et al. (2020) to generate both the reference value and random realizations of the non-Gaussian permeability field. In TOUGHREACT, the permeability of each grid cell was defined as ${k}_{i}={\alpha }_{i}{k}_{\text{ref}}$ , with ${k}_{\text{ref}}=1.0\times {10}^{-14}\,{\mathrm{m}}^{2}$ in this study. The permeability modifier ${\alpha }_{i}$ of each facies follows a log-Gaussian distribution and is assigned distinct heterogeneous properties: ${\mathit{ln}\left({\alpha }_{i}\right)}_{1}\sim \mathcal{N}\left(2,{0.5}^{2}\right)$ for the low-permeability matrix, and ${\mathit{ln}\left({\alpha }_{i}\right)}_{2}\sim \mathcal{N}\left(5,{0.5}^{2}\right)$ for the high-permeability channels. Following the methodology outlined in Section 2.1, non-Gaussian permeability fields were reconstructed by a DCGAN model trained on a prior ensemble of 10,000 samples, implemented in PyTorch. Representative realizations are shown in Figure 2c. The performance of the generative model was assessed by comparing one-dimensional marginal probability density functions and two-dimensional scatter plots at randomly selected stations in the model domain with those derived from the training image (Figure S1 in Supporting Information S1). The reference permeability field used in this study is presented in Figure 2d, along with the uniformly distributed preselected monitoring stations across the model domain.
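The facies-based permeability assignment can be sketched as below. This is an illustration under a strong assumption: the modifier is drawn independently for every cell, with no spatial correlation inside a facies, which may differ from the procedure of Mo et al. (2020) used in the study. The tiny facies map and the function name are hypothetical.

```python
import math
import random

K_REF = 1.0e-14                       # reference permeability in m^2
MEAN_LN_ALPHA = {0: 2.0, 1: 5.0}      # 0 = matrix, 1 = channel
SIGMA_LN_ALPHA = 0.5


def sample_permeability(facies, rng):
    """Return k = alpha * k_ref per cell, with ln(alpha) Gaussian per facies."""
    return [
        [K_REF * math.exp(rng.gauss(MEAN_LN_ALPHA[f], SIGMA_LN_ALPHA)) for f in row]
        for row in facies
    ]


# Hypothetical 3 x 4 facies map; the study uses an 80 x 80 channelized field.
facies = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
k = sample_permeability(facies, random.Random(42))
print(k[0])
```

With these settings the median permeability is about $e^{2}k_{\text{ref}}$ in the matrix and $e^{5}k_{\text{ref}}$ in the channels, a contrast of roughly a factor of 20.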

The reactive transport model is adapted from Dai and Samper (2006) and Chen et al. (2021), accounting for the main geochemical reactions in the aquifer system, including aqueous complexation, cation exchange, and calcite dissolution/precipitation. The reaction equations are listed in Table S1 in Supporting Information S1. The primary species in the model include ${\mathrm{H}}_{2}\mathrm{O}$ , ${\mathrm{H}}^{+}$ , $\text{HC}{\mathrm{O}}_{3}^{-}$ , $\mathrm{S}{\mathrm{O}}_{4}^{2-}$ , ${\text{Ca}}^{2+}$ , ${\text{Mg}}^{2+}$ , ${\text{Na}}^{+}$ , ${\mathrm{K}}^{+}$ and XOH. Due to the manuscript length limitation, the associated model parameters, including cation exchange capacity, cation exchange coefficients, and the initial/boundary concentrations of the primary species, are provided in Table S2 in Supporting Information S1. TOUGHREACT calculates the concentrations of secondary species at each time step by solving coupled geochemical equilibrium/kinetic reactions and mass transport equations, where their concentrations are derived from the chemical speciation of primary species based on thermodynamics or reaction kinetics. Note that secondary species concentrations are intermediate variables only and not directly used in subsequent calculations. The multivariate dynamic responses included hydraulic heads, pH, and the concentrations of Ca²⁺, HCO₃⁻, K⁺, Mg²⁺, and Na⁺, with the reference case simulations shown in Figure 2e.

The surrogate for the forward model was trained on a data set constructed from non-Gaussian permeability fields generated by DCGAN and multivariate responses simulated with TOUGHREACT. The performance of the surrogate model was evaluated by comparing its outputs with those of the numerical simulations. Details of the training parameters and error evaluation are provided in Text S1 in Supporting Information S1.

4 Results and Discussions

This section presents the results of the optimal monitoring networks and their application to multi-source DA under noisy observations for identifying non-Gaussian permeability fields. Section 4.1 investigates how data discretization influences the spatial distribution of monitoring stations. Section 4.2 assesses inversion results for non-Gaussian permeability fields across different noise levels, with emphasis on posterior estimates of historical and predictive responses under high-noise observations.

4.1 Optimal Monitoring Networks

Uniformly spaced preselected monitoring stations (361 in total, arranged in a 19 × 19 grid) were established across the modeling domain. Since information entropy is calculated by approximating probabilities through frequency counts, a sufficiently large sample data set of model responses is required to accurately quantify uncertainty. We evaluated the relationship between the information content at preselected stations and the data size (Figure S2 in Supporting Information S1). The results show that both the mean and variance of information entropy fluctuate significantly when the data size is small, but become stable when the sample size exceeds 10,000. Accordingly, a data size of $N=10,000$ was adopted for subsequent analyses. For a uniform distribution, the theoretical upper limit of information entropy is log₂ (N), while the actual entropy is determined by the specific data distribution and variable variances.
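As a worked check of this upper limit for the adopted sample size (a sketch; $n_{j}$ denotes the number of bins of variable $j$ as in Equation 1):

$$H_{\max}=\log_{2}N=\log_{2}10^{4}=4\log_{2}10\approx 13.29\ \text{bits}$$

More generally, an entropy estimated from $N$ samples of $m$ discretized variables cannot exceed either the sample-size limit or the number of distinguishable joint states, so that $H\le\min\left(\log_{2}N,\ \sum_{j=1}^{m}\log_{2}n_{j}\right)$.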

Figures S3–S5 in Supporting Information S1 present the stochastic simulation results of optimal single networks under the three data discretization schemes. The results show that only 3–6 stations are needed to reach the 99% information coverage requirement (the value of Rsp specified in this study). This is because the information arises from the joint monitoring of multiple variables, allowing each station to convey substantial information. In particular, under finer discretization, each station transmits richer and more detailed information. Because single optimal networks derived from random simulations are inherently uncertain, our previous work (Cao, Dai, Chen, et al., 2025) introduced the concept of an ensemble monitoring network by aggregating all simulation outcomes. However, this approach is limited in that stations appearing only in a few simulations are often redundant and may impair parameter inversion. To address this, the present study proposes a practical termination criterion: the ensemble network is finalized once 6–8 consecutive trials fail to add any new stations. The range was determined empirically. Specifically, Figure S6 in Supporting Information S1 shows the number of newly added stations in each stochastic simulation, while Figure S7 in Supporting Information S1 illustrates the spatial distribution of the final optimal network and subsequently added stations. The results indicate that only a few stations were added in the later stages, and they were mainly located in non-critical areas. Accordingly, the final optimal networks comprise 36, 30, and 49 monitoring stations for the three data discretization schemes, respectively (Figure 3a). Spatially, the optimization based on a coarser discretization results in a larger number of monitoring stations with a more concentrated distribution. In contrast, the optimization based on a finer discretization yields fewer monitoring stations that are more widely dispersed.

4.2 Inversion and Model Calibration Under Noisy Observations

In this case study, synthetic noise-free observations are first generated from the reference field, to which Gaussian noise ${\epsilon}\mathit{\sim }\mathcal{N}\left(0,{\sigma }^{2}\right)$ is added at four relative levels (5%, 10%, 25%, and 50%) to represent potential measurement uncertainties. In theory, a larger ensemble size for inversion provides a better representation of parameter uncertainty. However, computational efficiency must also be balanced against reliability. Based on sensitivity analysis, the ensemble size in this study was set to ${N}_{e}=1000$ . Because the inversion parameters essentially converged to the true values after the 5th iteration, the number of iterations was set to ${N}_{\mathrm{i}\mathrm{t}\mathrm{e}\mathrm{r}}=10$ . The local smoother factor was set to $\alpha =0.1$ , following the findings of Zhang et al. (2018).

Figure 3b presents the inversion results of the non-Gaussian permeability field under different levels of noisy observations, including the posterior ensemble means and their deviations from the true field for three optimal monitoring network scenarios. For the same data discretization scenario, increasing observational noise consistently reduces inversion reliability, whereas optimal monitoring networks designed with a larger discretization bin width enhance the tolerance of parameter inversion to noisy observations. At a fine discretization scenario (a = 0.1), the estimated accuracy is sensitive to higher noise, whereas it exhibits improved noise tolerance at a moderate discretization scenario (a = 0.25). It is noteworthy that under the high-noise scenario with a 50% observation error, the monitoring network designed with a discretization bin width of a = 0.5 yields the most reliable posterior estimates. This is because, with a sufficiently large sample size, finer discretization of the response data at a monitoring station can capture more detailed system information. Nevertheless, it should be emphasized that all optimal monitoring networks in this study provide the same total amount of information, which approaches the theoretical upper limit determined by the sample data set. Thus, monitoring networks derived from fine discretization consist of fewer stations and therefore require high-precision observations to ensure reliable inversion. In contrast, networks designed with coarse discretization involve a larger number of stations, which are often distributed in relatively concentrated areas. Such a setting enhances tolerance against observational noise, allowing reliable inversion results even when measurement accuracy is limited. This advantage provides a practical solution for real-world applications, where observational data are often subject to considerable uncertainty.

To further evaluate the posterior estimates under high observational noise (50%), we compared the calibrated model performance. Specifically, we reconstructed the long-term groundwater system evolution by fitting the spatial distributions of hydraulic heads, pH, and multicomponent concentrations over the past 100,000 years with 10,000-year time steps. A systematic comparison of the breakthrough curves for Bayesian prior, posterior, and reference responses was further carried out at monitoring stations that were not included in the DA. For clarity, Figure 3c illustrates the data-matching results at two randomly selected stations. The results indicate that, overall, the proposed inversion framework effectively reduces the uncertainty of posterior ensembles of model responses even under high observational noise. In contrast, cases with lower reliability in the posterior permeability fields exhibit larger historical fitting errors in the calibrated models. The results clearly indicate that when monitoring data accuracy is limited, optimizing the monitoring network with a larger discretization bin width can maximize the utility of available observations and thereby improve modeling accuracy. Calibrated models were further evaluated by comparing predicted hydraulic head, pH, and multi-component concentration fields over the next 10,000 years against the reference system responses. As shown in Figure 3d, the ensemble means of model responses accurately capture both high-permeability channels and low-permeability matrices across all scenarios. Examination of deviations from the reference fields further confirms the earlier conclusion that, under high-noise conditions, optimal monitoring networks designed with a larger discretization bin width can more effectively integrate observational information, thereby improving parameter inversion and model accuracy.

5 Summary and Conclusions

This study develops an entropy-based framework for multivariate groundwater monitoring network design. When observational data are limited or of low precision, the optimal network derived from this framework exhibits greater tolerance to observational errors. The framework is therefore advantageous for practical network optimization and multi-source DA: with high-quality data, the number of monitoring stations can be reduced to lower costs without compromising inversion accuracy, whereas under restricted monitoring conditions, the framework still ensures reliable inversion. Overall, the proposed approach offers valuable guidance for multivariate data integration strategies and for the design of optimal monitoring networks under high-noise observations.

Despite these contributions, the framework has some limitations in application: it relies on prior knowledge of the aquifer, it uses borehole data that may be sparse in practice, and the case study was validated under the assumption of a deterministic process model. Nonetheless, the proposed approach and key findings are broadly applicable and can be extended to situations with uncertainty in geostatistical parameters or transport processes. Moreover, when borehole data are limited, geophysical data can be integrated within this framework to guide survey grid design.

Acknowledgments

This work was funded by the National Natural Science Foundation of China (NSFC: U2267217, 42402241, 42141011) and Shandong Key Water Conservancy Science and Technology Project (2024370203001957).

Conflict of Interest

The authors declare no conflicts of interest relevant to this study.

Open Research

Data Availability Statement

Data are available from Cao, Dai, and Chen (2025).

Supporting Information

Filename	Description
2025GL117466-sup-0001-Supporting Information SI-S01.docx (7 MB)	Supporting Information S1

References

Alfonso, L., Lobbrecht, A., & Price, R. (2010). Optimization of water level monitoring network in polder systems using information theory. Water Resources Research, 46(12). https://doi.org/10.1029/2009WR008953
Bao, J., Li, L., & Redoloza, F. (2020). Coupling ensemble smoother and deep learning with generative adversarial networks to deal with non-Gaussianity in flow and transport data assimilation. Journal of Hydrology, 590, 125443. https://doi.org/10.1016/j.jhydrol.2020.125443
Bosserelle, A. L., & Hughes, M. W. (2024). Groundwater monitoring infrastructure: Evaluation of the shallow urban and coastal network in Ōtautahi Christchurch. Journal of Hydrology: Regional Studies, 55, 101934. https://doi.org/10.1016/j.ejrh.2024.101934
Cao, C., Zhang, J., Gan, W., Nan, T., & Lu, C. (2024). A deep learning-based data assimilation approach to characterizing coastal aquifers amid non-linearity and non-Gaussianity challenges. Water Resources Research, 60(7), e2023WR036899. https://doi.org/10.1029/2023WR036899
Cao, M., Dai, Z., & Chen, J. (2025). Entropy-guided multivariate groundwater network design for multi-source data assimilation under observational uncertainty. https://doi.org/10.5281/zenodo.15605285
Cao, M., Dai, Z., Chen, J., Yin, H., Zhang, X., Wu, J., et al. (2025). An integrated framework of deep learning and entropy theory for enhanced high-dimensional permeability field identification in heterogeneous aquifers. Water Research, 268, 122706. https://doi.org/10.1016/j.watres.2024.122706
Cao, M., Dai, Z., Jia, S., Samper, J., Ling, H., Du, Z., et al. (2024). Identification of solute transport parameters in fractured granites with heterogeneous apertures. Journal of Hydrology, 633, 130938. https://doi.org/10.1016/j.jhydrol.2024.130938
Cellucci, C. J., Albano, A. M., & Rapp, P. E. (2005). Statistical validation of mutual information calculations: Comparison of alternative numerical algorithms. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 71(6), 066208. https://doi.org/10.1103/PhysRevE.71.066208
Chen, J., & Dai, Z. (2024). Metaheuristic algorithms for groundwater model parameter inversion: Advances and prospects. Deep Resources Engineering, 1(2), 100009. https://doi.org/10.1016/j.deepre.2024.100009
Chen, J., Dai, Z., Dong, S., Zhang, X., Sun, G., Wu, J., et al. (2022). Integration of deep learning and information theory for designing monitoring networks in heterogeneous aquifer systems. Water Resources Research, 58(10), e2022WR032429. https://doi.org/10.1029/2022WR032429
Chen, J., Dai, Z., Yang, Z., Pan, Y., Zhang, X., Wu, J., & Reza Soltanian, M. (2021). An improved tandem neural network architecture for inverse modeling of multicomponent reactive transport in porous media. Water Resources Research, 57(12), e2021WR030595. https://doi.org/10.1029/2021WR030595
Dai, Z., & Samper, J. (2006). Inverse modeling of water flow and multicomponent reactive transport in coastal aquifer systems. Journal of Hydrology, 327(3–4), 447–461. https://doi.org/10.1016/j.jhydrol.2005.11.052
da Silveira Barcellos, D., & de Souza, F. T. (2022). Optimization of water quality monitoring programs by data mining. Water Research, 221, 118805. https://doi.org/10.1016/j.watres.2022.118805
Foroozand, H., & Weijs, S. V. (2020). Objective functions for information-theoretical monitoring network design: What is optimal? Hydrology and Earth System Sciences Discussions, 2020(2), 1–28. https://doi.org/10.5194/hess-25-831-2021
Gettelman, A., Geer, A. J., Forbes, R. M., Carmichael, G. R., Feingold, G., Posselt, D. J., et al. (2022). The future of Earth system prediction: Advances in model-data fusion. Science Advances, 8(14), eabn3488. https://doi.org/10.1126/sciadv.abn3488
Ghorbanidehno, H., Kokkinaki, A., Lee, J., & Darve, E. (2020). Recent developments in fast and scalable inverse modeling and data assimilation methods in hydrology. Journal of Hydrology, 591, 125266. https://doi.org/10.1016/j.jhydrol.2020.125266
Hsu, N. S., & Yeh, W. W. G. (1989). Optimum experimental design for parameter identification in groundwater hydrology. Water Resources Research, 25(5), 1025–1040. https://doi.org/10.1029/WR025i005p01025
Ju, L., Zhang, J., Meng, L., Wu, L., & Zeng, L. (2018). An adaptive Gaussian process-based iterative ensemble smoother for data assimilation. Advances in Water Resources, 115, 125–135. https://doi.org/10.1016/j.advwatres.2018.03.010
Keum, J., & Coulibaly, P. (2017). Information theory-based decision support system for integrated design of multivariable hydrometric networks. Water Resources Research, 53(7), 6239–6259. https://doi.org/10.1002/2016WR019981
Keum, J., Kornelsen, K. C., Leach, J. M., & Coulibaly, P. (2017). Entropy applications to water monitoring network design: A review. Entropy, 19(11), 613. https://doi.org/10.3390/e19110613
Knopman, D. S., Voss, C. I., & Garabedian, S. P. (1991). Sampling design for groundwater solute transport: Tests of methods and analysis of Cape Cod tracer test data. Water Resources Research, 27(5), 925–949. https://doi.org/10.1029/90WR02657
Leach, J. M., Coulibaly, P., & Guo, Y. (2016). Entropy based groundwater monitoring network design considering spatial distribution of annual recharge. Advances in Water Resources, 96, 108–119. https://doi.org/10.1016/j.advwatres.2016.07.006
Li, C., Singh, V. P., & Mishra, A. K. (2012). Entropy theory-based criterion for hydrometric network evaluation and design: Maximum information minimum redundancy. Water Resources Research, 48(5). https://doi.org/10.1029/2011WR011251
Ling, W., & Jafarpour, B. (2024). Improving the parameterization of complex subsurface flow properties with style-based generative adversarial network (StyleGAN). Water Resources Research, 60(11), e2024WR037630. https://doi.org/10.1029/2024WR037630
Ma, F., Chen, J., Dai, Z., Cai, F., Wang, D., & Ma, Y. (2025). Impact of groundwater extraction intensity on the monitoring design for seawater intrusion in heterogeneous coastal aquifers. Journal of Hydrology, 661, 133638. https://doi.org/10.1016/j.jhydrol.2025.133638
Mariethoz, G., Renard, P., & Straubhaar, J. (2010). The direct sampling method to perform multiple-point geostatistical simulations. Water Resources Research, 46(11). https://doi.org/10.1029/2008WR007621
Meggiorin, M., Naranjo-Fernández, N., Passadore, G., Sottani, A., Botter, G., & Rinaldo, A. (2024). Data-driven statistical optimization of a groundwater monitoring network. Journal of Hydrology, 631, 130667. https://doi.org/10.1016/j.jhydrol.2024.130667
Meyer, R., Engesgaard, P., & Sonnenborg, T. O. (2019). Origin and dynamics of saltwater intrusion in a regional aquifer: Combining 3-D saltwater modeling with geophysical and geochemical data. Water Resources Research, 55(3), 1792–1813. https://doi.org/10.1029/2018WR023624
Mo, S., Zabaras, N., Shi, X., & Wu, J. (2020). Integration of adversarial autoencoders with residual dense convolutional networks for estimation of non-Gaussian hydraulic conductivities. Water Resources Research, 56(2), e2019WR026082. https://doi.org/10.1029/2019WR026082
Olorunsaye, O., & Heiss, J. W. (2024). Stability of saltwater-freshwater mixing zones in beach aquifers with geologic heterogeneity. Water Resources Research, 60(8), e2023WR036056. https://doi.org/10.1029/2023WR036056
Parrish, M. A., Moradkhani, H., & DeChant, C. M. (2012). Toward reduction of model uncertainty: Integration of Bayesian model averaging and data assimilation. Water Resources Research, 48(3). https://doi.org/10.1029/2011WR011116
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. https://doi.org/10.48550/arXiv.1511.06434
Sciortino, A., Harmon, T. C., & Yeh, W. W. G. (2002). Experimental design and model parameter estimation for locating a dissolving dense nonaqueous phase liquid pool in groundwater. Water Resources Research, 38(5), 15-11–15-19. https://doi.org/10.1029/2000WR000134
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Sheng, C., Jiao, J. J., Zhang, J., Yao, Y., Luo, X., Yu, S., et al. (2024). Evolution of groundwater system in the Pearl River Delta and its adjacent shelf since the late Pleistocene. Science Advances, 10(15), eadn3924. https://doi.org/10.1126/sciadv.adn3924
Vrugt, J. A., Ter Braak, C. J., Clark, M. P., Hyman, J. M., & Robinson, B. A. (2008). Treatment of input uncertainty in hydrologic modeling: Doing hydrology backward with Markov chain Monte Carlo simulation. Water Resources Research, 44(12). https://doi.org/10.1029/2007WR006720
Wang, W., Wang, D., Singh, V. P., Wang, Y., Wu, J., Wang, L., et al. (2018). Optimization of rainfall networks using information entropy and temporal variability analysis. Journal of Hydrology, 559, 136–155. https://doi.org/10.1016/j.jhydrol.2018.02.010
Xue, L., & Zhang, D. (2014). A multimodel data assimilation framework via the ensemble Kalman filter. Water Resources Research, 50(5), 4197–4219. https://doi.org/10.1002/2013WR014525
Zhan, C., Dai, Z., Jiao, J. J., Soltanian, M. R., Yin, H., & Carroll, K. C. (2025). Toward artificial general intelligence in hydrogeological modeling with an integrated latent diffusion framework. Geophysical Research Letters, 52(3), e2024GL114298. https://doi.org/10.1029/2024GL114298
Zhan, C., Dai, Z., Soltanian, M. R., & Zhang, X. (2022). Stage-wise stochastic deep learning inversion framework for subsurface sedimentary structure identification. Geophysical Research Letters, 49(1), e2021GL095823. https://doi.org/10.1029/2021GL095823
Zhang, J., Cao, C., Nan, T., Ju, L., Zhou, H., & Zeng, L. (2024). A novel deep learning approach for data assimilation of complex hydrological systems. Water Resources Research, 60(2), e2023WR035389. https://doi.org/10.1029/2023WR035389
Zhang, J., Lin, G., Li, W., Wu, L., & Zeng, L. (2018). An iterative local updating ensemble smoother for estimation and uncertainty assessment of hydrologic model parameters with multimodal distributions. Water Resources Research, 54(3), 1716–1733. https://doi.org/10.1002/2017WR020906
Zhang, J., Zheng, Q., Wu, L., & Zeng, L. (2020). Using deep learning to improve ensemble smoother: Applications to subsurface characterization. Water Resources Research, 56(12), e2020WR027399. https://doi.org/10.1029/2020WR027399
Zhang, X., Jiang, S., Zheng, N., Xia, X., Li, Z., Zhang, R., et al. (2024). Integration of DDPM and ILUES for simultaneous identification of contaminant source parameters and Non-Gaussian channelized hydraulic conductivity field. Water Resources Research, 60(9), e2023WR036893. https://doi.org/10.1029/2023WR036893
Zheng, F., Tao, R., Maier, H. R., See, L., Savic, D., Zhang, T., et al. (2018). Crowdsourcing methods for data collection in geophysics: State of the art, issues, and future directions. Reviews of Geophysics, 56(4), 698–740. https://doi.org/10.1029/2018RG000616
Zhi, W., Appling, A. P., Golden, H. E., Podgorski, J., & Li, L. (2024). Deep learning for water quality. Nature Water, 2(3), 228–241. https://doi.org/10.1038/s44221-024-00202-z

Entropy-Guided Multivariate Groundwater Network Design for Multi-Source Data Assimilation Under Observational Uncertainty