Parameterization of Shallow Cumulus Entrainment and Detrainment Rates Using Global Satellite Observations: Physical and Machine Learning Approaches

https://doi.org/10.1029/2025GL117775
2025-09-12
Geophysical Research Letters, Volume 52, Issue 18
Lei Zhu, Chunsong Lu, Dezhi Xiao, Sinan Gao, Yichuan Wang, Yannian Zhu, Xin He, Yubin Li, Jing Yang

Abstract

Accurate parameterizations of entrainment and detrainment rates (ε and δ) of shallow cumulus are crucial for improving simulations of atmospheric energy and water cycles. Existing ε and δ parameterizations are often based on theoretical derivations, limited observations, or numerical simulations, which lack comprehensive global observational support. To address this limitation, parameterizations of shallow cumulus ε and δ are developed and evaluated using a global satellite-derived data set. For parameterizations based on the physical approach, the recommended scheme incorporates environmental relative humidity (RHe) and vertical velocity to parameterize ε, while δ is parameterized using ε and RHe. Furthermore, the machine learning (ML) approach trained on thermodynamic, dynamic, and cloud microphysical properties can accurately predict ε and δ. Comparative analysis reveals that ML performs better than the physical approach. These findings provide valuable insights for refining cumulus parameterizations and enhancing the accuracy of climate model simulations.

Plain Language Summary

Clouds play a critical role in shaping the Earth's climate, but accurately representing their behavior in climate models remains a challenge. Entrainment and detrainment are two key processes that describe how cumulus clouds interact with the atmosphere, influencing cloud formation, lifespan, and their impact on climate. However, most existing methods for determining entrainment and detrainment rely heavily on limited data or theoretical assumptions, which can introduce inaccuracies in climate simulations. To address this, our study used a global satellite data set to improve the representation of entrainment and detrainment in climate models. We explored two approaches: a physically based one and one using machine learning (ML), where algorithms learn patterns from extensive cloud and atmospheric data to make predictions. Our results show that while the physical approach offers valuable insights, the ML approach performs better at predicting entrainment and detrainment. This is because ML can uncover complex relationships in the data that traditional approaches are unable to capture. These findings help improve cloud representations in climate models, thereby enhancing the accuracy of long-term climate change predictions.

Key Points

  • Entrainment and detrainment rate parameterizations for shallow cumulus are derived using global satellite observations for the first time

  • Optimal entrainment and detrainment rate parameterizations based on the physical approach are determined

  • Machine learning demonstrates superior performance compared to the physical approach in predicting entrainment and detrainment rates

1 Introduction

Shallow cumulus clouds, which cover an average of 5%–30% of the sky (Mieslinger et al., 2019; Norris, 1998), significantly impact the atmospheric thermal structure (Li et al., 2019; Neggers et al., 2007), large-scale circulation (Nie, 2013; A. Siebesma, 1998), and cloud-climate feedback (Bony & Dufresne, 2005), and their effects are represented through parameterizations in current climate models (Arakawa, 2004; Gu et al., 2020; Yang et al., 2021). Cumulus parameterizations are essential for simulating climate phenomena, such as typhoons (Zhao et al., 2018a), the Madden-Julian Oscillation (Del Genio et al., 2012; Jiang et al., 2020), monsoons (Zou & Zhou, 2011), and the El Niño-Southern Oscillation (B. Lu and Ren, 2016). Among cumulus parameterizations, entrainment and detrainment rates (ε and δ) are two of the most essential parameters (Stanfield et al., 2019; Zhao, Wang, Liu and Wu, 2024; Zhao, Wang, Liu, Wu, et al., 2024; Zhu, Lu, Xu, et al., 2024). In commonly used mass-flux schemes, ε and δ govern the mass exchange between cumulus clouds and the environment (Arakawa & Schubert, 1974; Luo et al., 2010; Z. Wang, 2020). However, as noted by de Rooy et al. (2013), ε and δ are poorly constrained and show significant spatial and temporal variability. This uncertainty constitutes a significant limitation in cumulus parameterizations and directly affects their ability to realistically represent cumulus processes. Numerous studies have highlighted the pivotal role of ε and δ in simulating climate phenomena (Bush et al., 2015; Sanderson et al., 2008; Tokioka et al., 1988; Zhao et al., 2012), underscoring the need for improved parameterizations of ε and δ.

Based on observational data or model simulations, numerous studies have developed parameterizations for ε and δ by linking them to thermodynamic and dynamic factors (Li et al., 2025; Villalba-Pradas & Tapiador, 2022). However, a wide variety of such schemes exist, and no consensus has been reached, nor has a unified scheme been established (de Rooy et al., 2013). Parameterizations of ε are often developed based on cloud radius (Simpson, 1971; Squires & Turner, 1962; Takahashi et al., 2021; Turner, 1962), constant values (Derbyshire et al., 2004; A. Siebesma, 1998; Soares et al., 2004), or cloud height (Gregory & Rowntree, 1990; A. P. Siebesma and Cuijpers, 1995). However, these parameterizations are overly simplified and fail to represent the intricate entrainment-mixing processes occurring in the atmosphere. Consequently, researchers have utilized both observational data and model simulations to explore more advanced parameterizations based on vertical velocity, buoyancy, and their derivative forms (Gregory, 2001; Lin, 1999; Lu et al., 2016; Xu et al., 2021; Zhang et al., 2016). Environmental relative humidity (RHe) is another commonly used parameter for ε parameterization, but its relationship with ε is still under debate. Some studies reported a positive correlation between ε and RHe (Jensen & Del Genio, 2006; Lu et al., 2018), while others found an inverse relationship (Bechtold et al., 2008; Zhao et al., 2018b), making it challenging to parameterize ε accurately. The differing ε-RHe relationships arise from two competing mechanisms related to cloud size and natural selection (Derbyshire et al., 2011). The first operates in convection-favorable environments (with higher RHe), where larger clouds tend to develop more adiabatic cores and thus exhibit lower ε, that is, a negative relationship between ε and RHe. The second becomes significant when there is sufficient variability: a positive ε-RHe correlation can emerge because, in drier environments, only clouds with lower ε can survive. Furthermore, large-eddy simulations by Romps (2010) suggest that ε exhibits weak or no correlation with cloud height, buoyancy, or buoyancy gradient. These results underscore the persistent uncertainties in ε parameterizations.

Because of the difficulties in estimating δ, parameterizations for δ are relatively scarce. Early cumulus parameterizations assumed that detrainment occurs only near the cloud top (Arakawa & Schubert, 1974) or at levels of neutral buoyancy (Emanuel, 1991; Moorthi & Suarez, 1992). In many studies, δ is either prescribed as a constant (Gregory, 2001; Soares et al., 2004) or set equal to ε (Han & Pan, 2011; Tiedtke, 1989). Given the strong coupling between detrainment and entrainment processes, several studies have incorporated ε into the parameterization of δ (A. Siebesma, 1998). In the classical Kain-Fritsch scheme (Kain, 2004; Kain & Fritsch, 1990), δ is parameterized as a function of the critical mixing fraction between environmental air and cloud air (Böing et al., 2012; Bretherton et al., 2004; Dawe & Austin, 2013). In addition, RHe is also commonly employed in δ parameterization (Bush et al., 2015; Stirling & Stratton, 2012).

Existing parameterizations for ε and δ are primarily derived from theoretical derivations, limited observational data sets, or numerical simulations, and thus lack support from extensive observational data sets. Moreover, most parameterizations rely on one or two influencing factors and are developed using conventional empirical curve-fitting approaches (Villalba-Pradas & Tapiador, 2022), hereafter referred to as “physical approaches”, which fail to account for the full complexity of processes affecting ε and δ. Machine learning (ML) approaches, capable of identifying nonlinear relationships between multiple variables via data-driven training, have been widely applied in geosciences (Christopoulos et al., 2024; Colfescu et al., 2024; Eusebi et al., 2025; Gao et al., 2024; Su et al., 2020; Wickramasinghe et al., 2024; Zhang et al., 2024; Zhao et al., 2023). For instance, Shin and Baik (2022) used ML to predict ε and δ using data obtained from large-eddy simulations. However, their results lack validation against global observational data sets. To address this gap, this study utilizes the global ε and δ data set derived from satellite observations by Zhu et al. (2025) and develops parameterizations for shallow cumulus ε and δ through both physical and ML approaches. The performance of the two approaches is evaluated and compared, providing insights into the improvement of cumulus parameterizations.

2 Data and Methods

2.1 Data Set

The ε and δ data set used in this study is derived from global observations from June–August 2017 by the Visible Infrared Imaging Radiometer Suite (VIIRS) sensor onboard the Suomi National Polar-orbiting Partnership (SNPP) satellite. Additionally, the 6-hourly NCEP FNL reanalysis data (National Centers for Environmental Prediction, 2000) and National Oceanic and Atmospheric Administration (NOAA) sea surface temperature (SST) data (Huang et al., 2021), both on a 1° × 1° grid, were used in the retrieval process. The cloud product retrievals and the derivation of the ε and δ data sets are detailed in Zhu et al. (2025). The uncertainty analysis indicates that ε and δ exhibit relatively small uncertainty under the current retrieval framework (see Text S1 in Supporting Information S1 for details). Cloud droplet number concentration (Nd) was retrieved using cloud optical thickness (COT) and cloud droplet effective radius (re) (Szczodrak et al., 2001; Wang et al., 2021, 2023). The RHe was obtained from the reanalysis data, while cloud base vertical velocity (w) was retrieved using the algorithm proposed by Zheng and Rosenfeld (2015). A total of 83,240 cloud samples were collected on a global scale, exhibiting broad spatial coverage across both land and ocean areas with no significant seasonal or regional biases (Figures S1 and S2 in Supporting Information S1). To ensure the reliability and generalizability of the parameterizations, 80% of the data was randomly selected for parameterization development, while the remaining 20% was reserved for independent evaluation. In the ML approach, the 80% portion used for model construction was further split into training and validation sets in an 8:2 ratio. The sampling process was repeated ten times with different random seeds to ensure robustness and quantify uncertainty.
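As a minimal sketch, the splitting protocol described above (80% development / 20% evaluation, with the development portion further split 8:2 into training and validation, repeated over ten random seeds) could be implemented as follows; the sample count is from the text, while function and variable names are illustrative:

```python
import numpy as np

def split_indices(n_samples, seed):
    """Randomly split sample indices: 80% development / 20% evaluation,
    then split the development part 8:2 into training and validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_dev = int(0.8 * n_samples)
    dev, test = idx[:n_dev], idx[n_dev:]
    n_train = int(0.8 * len(dev))
    train, val = dev[:n_train], dev[n_train:]
    return train, val, test

# Repeat with ten different random seeds, as in the study,
# to assess robustness and quantify uncertainty.
splits = [split_indices(83240, seed) for seed in range(10)]
```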

2.2 Machine Learning Approach

Machine learning has found widespread applications in data analysis across multiple disciplines due to its ability to automatically learn data patterns and optimize models. Among various ML algorithms, gradient boosting, an ensemble learning method, combines multiple weak learners into a robust predictive model. Studies have demonstrated that it outperforms random forests, artificial neural networks, and convolutional neural networks (Abdollahi-Arpanahi et al., 2020; Ma et al., 2018; Yan et al., 2021). Light Gradient Boosting Machine (LightGBM), developed by Microsoft, is a gradient-boosting algorithm optimized for efficiently handling large-scale structured data sets (Ke et al., 2017). It offers advantages such as high training speed, low memory usage, and the ability to perform parallel computations. Gao et al. (2024) compared four ML algorithms for predicting cloud droplet number concentration and cloud droplet relative dispersion and found that LightGBM achieves the best performance in both computational efficiency and prediction accuracy, emphasizing its potential for advancing research on entrainment-mixing processes.

To achieve the study's objective of predicting ε and δ during entrainment-mixing processes with ML, LightGBM was selected as the preferred algorithm. Model input features were chosen based on their relevance to ε and δ and their availability in numerical models. For predicting ε, the selected features included environmental variables (RHe, specific humidity (qve), environmental temperature (Te), and environmental pressure (Pe)), in-cloud variables (w, liquid water content (LWC), Nd, re, liquid water path, and COT), and meteorological conditions (cloud fraction, lower tropospheric stability (LTS), SST, surface relative humidity (RHsurf), and terrain height (Hterrain)). For predicting δ, the observed ε was included as an additional input feature. Note that all variables used are temporally matched to the satellite observation time and spatially matched to a 1° × 1° grid, as detailed in Zhu et al. (2025).

One of the most widely used methods for interpreting ML is the SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017), which has been widely applied for ML interpretation in geoscientific applications (García & Aznarte, 2020). SHAP values provide a fair allocation of each feature's contribution to the prediction, facilitating a deeper understanding of how ML makes predictions. For any given feature, a positive SHAP value indicates a positive contribution relative to the baseline prediction, while a negative value indicates a negative contribution. The mean absolute SHAP value is a standard metric for ranking feature importance. Given that entrainment-mixing is a small-scale process characterized by complex nonlinear interactions among variables, SHAP offers a robust framework for elucidating the complex physical mechanisms underlying ML prediction of ε and δ (Gao et al., 2024).

3 Results

3.1 Parameterization of Entrainment and Detrainment Rates Based on the Physical Approach

Previous studies have proposed parameterizations through curve fitting and other physical approaches (Li et al., 2025; Xu et al., 2021; Zhang et al., 2016). In this section, parameterizations for ε and δ are developed using physical approaches. Following similar methods in earlier research (Lu et al., 2016; Xu et al., 2021), ε is parameterized as a function of RHe and w. Figure 1 presents the fitted functions derived from a randomly selected 80% of the data, as well as the correlation coefficient (R) and root mean squared error (RMSE) between the fitted and calculated values. Additionally, it shows the R and RMSE between the fitted and calculated ε for the remaining 20% of the data, with calculated ε serving as the ground truth for evaluation. The parameterization based on RHe (Figure 1a) outperforms that based on w (Figure 1b), achieving higher R and lower RMSE across both the fitting and test data sets. Due to the complexity of cumulus processes (Li et al., 2019; Wu et al., 2023; Yeom et al., 2025), ε is influenced by multiple factors (Lu et al., 2016; Xu et al., 2021). Therefore, a multi-variable parameterization using both RHe and w, which are independent of each other, is presented in Figure 1c. The results indicate that the multi-variable parameterization (Figure 1c) yields higher R and lower RMSE than the single-variable cases (Figures 1a and 1b) in both fitting and test data sets. Overall, for the parameterization of ε, the multi-variable parameterization based on RHe and w performs better than single-variable parameterizations.
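As an illustration of this curve-fitting step, power-law forms such as ε = a·RHe^b·w^c can be fitted by ordinary least squares after a log transform; the sketch below uses synthetic data with known parameters (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
RHe = rng.uniform(0.4, 1.0, n)
w = rng.uniform(0.2, 3.0, n)
# Synthetic eps from a known power law with multiplicative noise.
a_true, b_true, c_true = 1.5, 2.0, -0.5
eps = a_true * RHe**b_true * w**c_true * np.exp(rng.normal(0, 0.1, n))

# Taking logs turns eps = a * RHe**b * w**c into a linear regression:
#   log eps = log a + b log RHe + c log w
A = np.column_stack([np.ones(n), np.log(RHe), np.log(w)])
coef, *_ = np.linalg.lstsq(A, np.log(eps), rcond=None)
a_fit, b_fit, c_fit = np.exp(coef[0]), coef[1], coef[2]
```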

Figure 1. Relationship between the fitted and calculated values of entrainment (ε) or detrainment rate (δ) based on satellite observations. The parameterizations of ε use environmental relative humidity (RHe) and vertical velocity (w), and the parameterizations of δ use ε and RHe. The total number of samples is given in the upper-right corner of each subplot, with 80% of the data used for parameterization development and 20% used for testing the scheme. R and RMSE represent the correlation coefficient and root mean square error, respectively.

For the parameterization of δ, similar to Gregory and Rowntree (1990), a linear fit based on ε demonstrates strong performance, with high R and low RMSE in both the fitting and test data sets (Figure 1d). Moreover, except for a few outliers, the fitted and calculated values of δ align closely with the 1:1 line. Compared to the linear fitting, a power-law fitting based on ε achieves comparable R yet lower RMSE, and the agreement between fitted and calculated values is closer to the 1:1 line (Figures 1d and 1e). These simple parameterizations of δ thus perform adequately. However, given the significant influence of RHe on δ and its frequent inclusion in δ parameterizations, this study also explores two alternative formulations (δ = ε(a − RHe) and δ = aε(1 − RHe), where a is a fitting parameter), inspired by Bechtold et al. (2014) and Stratton and Stirling (2012). Figures 1f and 1g illustrate the performance of these two parameterizations. Although the two forms are similar, the parameterization based on δ = ε(a − RHe) outperforms the alternative, showing higher R and reduced RMSE in both the fitting and test data sets. Considering that the single-variable power-law parameterization exhibits enhanced performance, this study further optimizes the two-variable parameterizations by introducing a power-law exponent, resulting in two new parameterizations: δ = ε^b(a − RHe) and δ = aε^b(1 − RHe). Compared to the original schemes from Bechtold et al. (2014) and Stratton and Stirling (2012) (Figures 1f and 1g), the new parameterizations show comparable or improved R and reduced RMSE in both the fitting and test data sets. Moreover, the comparison between fitted and calculated values under the new parameterizations aligns more closely with the 1:1 line (Figures 1h and 1i). Overall, the δ parameterization based on δ = ε^b(a − RHe) demonstrates the strongest performance.
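For illustration, a two-parameter form such as δ = ε^b(a − RHe) can be fitted with nonlinear least squares, e.g. via `scipy.optimize.curve_fit`; the data below are synthetic and the parameter values illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def delta_form(X, a, b):
    """delta = eps**b * (a - RHe), a two-variable power-law form."""
    eps, RHe = X
    return eps**b * (a - RHe)

rng = np.random.default_rng(3)
n = 1000
eps = rng.uniform(0.1, 3.0, n)
RHe = rng.uniform(0.4, 0.9, n)
# Synthetic delta from known parameters plus small noise.
a_true, b_true = 1.2, 0.8
delta = delta_form((eps, RHe), a_true, b_true) + rng.normal(0, 0.02, n)

# Recover the fitted parameters from the noisy synthetic data.
(a_fit, b_fit), _ = curve_fit(delta_form, (eps, RHe), delta, p0=(1.0, 1.0))
```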

3.2 Machine Learning-Based Prediction of Entrainment and Detrainment Rates

Due to the inherent complexity of the cumulus entrainment-mixing process, ε and δ are influenced by multiple factors, including environmental meteorological conditions, cloud microphysical properties, and macrophysical characteristics. As a result, in addition to developing traditional parameterizations, ML offers an alternative method for predicting ε and δ. Previous studies have shown that LightGBM outperforms other algorithms in investigating entrainment processes (Gao et al., 2024). Accordingly, this section utilizes the LightGBM to predict ε and δ.

With LightGBM trained on 80% of the data, Figures 2a and 2b present the validation results obtained using the remaining 20% of the data. The ε values predicted by the ML approach align well with the calculated values, with a high R and a low RMSE. Similarly, the predicted δ values exhibit strong correlations with the calculated values and align closely along the 1:1 line, also showing high R and low RMSE. Overall, the ML approach demonstrates its effectiveness in accurately predicting both ε and δ.

Figure 2. Comparison between machine learning (ML) predictions and satellite-derived values of entrainment and detrainment rates (ε and δ), and interpretation of feature contributions. Panels (a) and (b) show the relationships between predicted and calculated ε and δ, respectively, with the total number of samples indicated in the upper-right corner. In each case, 80% of the data were used for training the ML model and 20% were used for testing it. R and RMSE denote the correlation coefficient and root mean square error, respectively. Panels (c) and (d) present scatter plots of each feature based on the SHapley Additive exPlanations values for predicting (c) ε and (d) δ. Panels (e) and (f) show the corresponding rankings of feature importance. qve: specific humidity; RHe: relative humidity; Te: environmental temperature; w: cloud base vertical velocity; Pe: environmental pressure; LWC: liquid water content; Hterrain: terrain height; LTS: lower tropospheric stability; SST: sea surface temperature; RHsurf: surface relative humidity; CF: cloud fraction; COT: cloud optical thickness; re: effective radius; Nd: cloud droplet concentration; LWP: liquid water path.

Although the ML approach shows strong performance in predicting ε and δ, interpreting how the ML generates these predictions is equally important. This not only enhances the physical interpretability of the ML approach but also offers useful insights for future parameterizations. Through SHAP analysis, the contributions of each input feature to the predictions can be explored. Figure 2 also presents the scatter plots of SHAP values for the features contributing to the predictions of ε and δ, as well as their rankings. From the SHAP scatter plots for predicting ε (Figure 2c), higher qve and RHe contribute positively to the prediction, while higher Te, w, Pe, and LWC have negative contributions. The feature ranking (Figure 2e) indicates that the top six features influencing ε are qve, RHe, Te, w, Pe, and LWC, followed by Hterrain, LTS, and SST. The remaining features have negligible impact. Among these, qve and RHe are the most influential factors, with SHAP value contributions accounting for 26.6% and 20.6%, respectively. These results are consistent with physical expectations: higher RHe reduces buoyancy and vertical velocity within the cloud core, allowing more time for cloud-environment mixing and thus increasing ε (Lu et al., 2016, 2018; Zhu et al., 2021; Zhu, Lu, Chen, et al., 2024). The global satellite observations used for training include cloud samples with sufficient variability (relevant to the natural selection mechanism) and are not confined to convection-favorable environments (associated with the cloud-size mechanism). Therefore, the ε-RHe relationship learned by the ML approach is more likely to reflect the natural selection mechanism rather than the cloud-size mechanism, as discussed in Section 1 (Derbyshire et al., 2011). The ML approach effectively identifies the key factors influencing ε. Beyond RHe and w, the interpretation of the ML also highlights the significant influence of Te and Pe.

For the prediction of δ, Figures 2d and 2f also present the SHAP-based interpretation. The results reveal that an increase in ε contributes positively to the prediction of δ, and ε is the most significant one among the input features, with a SHAP value contribution as high as 55.9%. Other features that play relatively important roles in predicting δ include Pe, w, LWC, RHe, Te, SST, qve, and RHsurf. Among these, Pe, w, LWC, and Te contribute positively to δ prediction, while RHe, SST, qve and RHsurf have negative contributions. The remaining features exhibit negligible correlation with δ and thus can be neglected. The dominant contribution of ε to δ prediction is consistent with physical expectations, as entrainment and detrainment processes are closely related. Beyond ε, the contributions of environmental variables and w to the prediction of δ are also notable.

Note that the ML approach identifies several physically meaningful variables not directly involved in the retrieval of ε and δ as important predictors (such as w and RHe), and SHAP analysis confirms that their relationships with ε and δ align with those in physical parameterizations, indicating that the ML approach likely captures underlying physical processes rather than retrieval-related patterns. Moreover, results show that the ML approach consistently yields accurate predictions of both ε and δ across diverse spatial and environmental conditions (Figures S3, S4, and S5 in Supporting Information S1), supporting the generalizability of the developed ML-based parameterizations. Nevertheless, we still acknowledge the possibility that the ML approach may learn the systematic and structural biases embedded in the retrieval algorithm.

3.3 Discussion: Comparison Between Physical and Machine Learning Approaches

The results from the previous two sections indicate that both physical and ML approaches achieve strong performance in predicting ε and δ. Table 1 provides a comprehensive comparison of the ε and δ parameterizations developed using the two approaches, based on statistical evaluation metrics (see Text S2 for details) from ten independent runs with different random seeds. Compared to the physical approach, the LightGBM trained on all features (LGB1) yields predictions with lower mean absolute error (MAE), mean squared error (MSE), and RMSE, indicating that ML outperforms the physical approach in accurately predicting ε. Notably, the physical approach relies on only one or two parameters for parameterization (e.g., the ε parameterization based on RHe is referred to as P1_a; based on w as P1_b; and based on both RHe and w as P1_c). In contrast, LGB1 uses 14 features for training and predicting ε. To ensure a fair comparison, this study further trained ML using the same input parameters as those employed in the physical approach. Specifically, the LightGBM trained on RHe is labeled as LGB1_a, the LightGBM trained on w as LGB1_b, and the LightGBM trained on both RHe and w as LGB1_c. As shown in Table 1, LGB1_a, LGB1_b, and LGB1_c yield lower MAE, MSE, and RMSE compared to P1_a, P1_b, and P1_c, respectively. This demonstrates that, even when using the same parameters for parameterization, ML consistently outperforms the physical approach in accurately predicting ε. These results demonstrate the superiority of ML over the physical approach in developing ε parameterizations.
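The error metrics used throughout this comparison can be computed directly; a minimal sketch (the arrays below are placeholders for predicted and satellite-derived values):

```python
import numpy as np

def evaluate(pred, truth):
    """Return MAE, MSE, and RMSE between predicted and calculated values."""
    err = np.asarray(pred) - np.asarray(truth)
    mae = np.mean(np.abs(err))  # mean absolute error
    mse = np.mean(err**2)       # mean squared error
    rmse = np.sqrt(mse)         # root mean squared error
    return mae, mse, rmse

# Placeholder arrays standing in for predicted and calculated values.
mae, mse, rmse = evaluate([1.0, 2.0, 3.5], [1.5, 2.0, 3.0])
```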

Table 1. Comparison of the Evaluation Metrics for Entrainment (ε) or Detrainment Rate (δ) Parameterizations Established Using Physical and Machine Learning Approaches
Predicted variable | Parameterization | Specific form or features | MAE | MSE | RMSE
ε | P1_a | ε = aRHe^b | 0.589 | 1.126 | 1.061
ε | P1_b | ε = aw^b | 0.753 | 1.669 | 1.292
ε | P1_c | ε = aRHe^b·w^c | 0.578 | 1.133 | 1.063
ε | P1_GAM | RHe, qve, Te, Pe, LWC, Nd, re, LWP, COT, CF, LTS, SST, RHsurf, Hterrain | 0.491 | 0.835 | 0.914
ε | LGB1_a | RHe | 0.521 | 1.068 | 1.033
ε | LGB1_b | w | 0.737 | 1.638 | 1.280
ε | LGB1_c | RHe, w | 0.526 | 1.043 | 1.021
ε | LGB1 | RHe, qve, Te, Pe, LWC, Nd, re, LWP, COT, CF, LTS, SST, RHsurf, Hterrain | 0.329 | 0.510 | 0.714
δ | P2_a | δ = aε | 0.314 | 0.221 | 0.469
δ | P2_b | δ = aε^b | 0.199 | 0.158 | 0.396
δ | P2_c | δ = ε(a − RHe) | 0.395 | 0.284 | 0.532
δ | P2_d | δ = aε(1 − RHe) | 0.763 | 1.289 | 1.135
δ | P2_e | δ = ε^b(a − RHe) | 0.226 | 0.173 | 0.415
δ | P2_f | δ = aε^b(1 − RHe) | 0.376 | 0.514 | 0.716
δ | P2_GAM | ε, RHe, qve, Te, Pe, LWC, Nd, re, LWP, COT, CF, LTS, SST, RHsurf, Hterrain | 0.134 | 0.129 | 0.358
δ | LGB2_a | ε | 0.153 | 0.145 | 0.379
δ | LGB2_b | ε, RHe | 0.140 | 0.141 | 0.374
δ | LGB2 | ε, RHe, qve, Te, Pe, LWC, Nd, re, LWP, COT, CF, LTS, SST, RHsurf, Hterrain | 0.072 | 0.086 | 0.292
  • Note. For entrainment rate (ε), the physical parameterization based on environmental relative humidity (RHe) is denoted as P1_a, that based on vertical velocity (w) is P1_b, and that based on both RHe and w is P1_c, where a, b, and c are the fitted parameters. The generalized additive model trained on all features for ε prediction is denoted as P1_GAM. The LightGBM models trained on RHe, on w, on both RHe and w, and on all features are denoted as LGB1_a, LGB1_b, LGB1_c, and LGB1, respectively. For detrainment rate (δ), the physical parameterizations based on ε are denoted as P2_a and P2_b, while those based on both ε and RHe are denoted as P2_c, P2_d, P2_e, and P2_f. The generalized additive model trained on all features for δ prediction is denoted as P2_GAM. The LightGBM models trained on ε, on both ε and RHe, and on all features are denoted as LGB2_a, LGB2_b, and LGB2, respectively.

Table 1 also summarizes the performance evaluation of the δ parameterizations. Compared to the physical approach, the LightGBM trained on all features (LGB2) significantly reduces the MAE, MSE, and RMSE between the predicted and calculated δ, demonstrating its superior predictive accuracy. Similarly, ML for δ prediction was developed using the same parameters as those used in the physical parameterizations. Specifically, the LightGBM trained on ε is labeled as LGB2_a, and the LightGBM trained on ε and RHe is labeled as LGB2_b. The results show that LGB2_a and LGB2_b yield lower MAE, MSE, and RMSE compared to their respective physical parameterizations. This indicates that ML exhibits superior accuracy in predicting δ compared to the physical approach.

Compared to the physical approach, the ML approach inherently possesses more flexible functional forms, making the comparison structurally unbalanced. To ensure a more balanced comparison, a generalized additive model (GAM) with comparable complexity was applied to predict ε and δ using all 14 input variables. As shown in Table 1, the ML approach yields lower MAE, MSE, and RMSE than the GAM, confirming its superior predictive performance. Based on the evaluation of the prediction performance for ε and δ (Table 1), the parameterizations developed using ML exhibit significantly higher accuracy than those developed using physical approaches. This advantage likely arises from the fact that physical approaches are limited in their ability to account for the combined effects of multiple influencing factors, whereas ML can capture complex nonlinear relationships between variables through data-driven training. Furthermore, Figure S6 in Supporting Information S1 shows that the training and validation loss curves of the abovementioned ML models exhibit convergence without signs of overfitting (Gao et al., 2024; Seifert & Rasp, 2020). These findings offer valuable guidance for the future development of ε and δ parameterizations and support their applicability in broader climate and weather modeling contexts.

4 Concluding Remarks

The current parameterization schemes for entrainment and detrainment rates (ε and δ) are typically based on theoretical derivations, observations from individual sites, or numerical simulations, lacking support from global observational data sets. To fill this gap, this study utilizes a data set of ε and δ derived from global satellite observations for the development and evaluation of parameterizations using both physical and ML approaches, followed by a comparison of the two approaches. The main conclusions are as follows:

First, parameterizations of ε and δ are developed using physical approaches. It is recommended that ε be parameterized with a scheme incorporating RHe and vertical velocity, and that δ be parameterized with a scheme incorporating ε and RHe.

Second, ML-based predictions for ε and δ are developed and interpreted. The ML approach demonstrates high accuracy in predicting ε and δ. SHAP analysis reveals that, for predicting ε, environmental variables and vertical velocity make significant contributions. For predicting δ, ε contributes the most, while environmental variables and vertical velocity make smaller but still meaningful contributions.

Finally, comparative analysis highlights the significant advantages of ML in predicting ε and δ. Compared to physical approaches, ML achieves lower prediction errors for ε and δ. This advantage primarily stems from the ability of ML to effectively capture complex nonlinear relationships. In future efforts to improve parameterizations for numerical models, ML can be applied to predict variables that cannot be directly obtained from models or observations. Although challenges such as model interpretability and limited sample sizes may arise in practical implementation, the results of this study provide confidence for further application of ML in meteorological research.

In this study, parameterizations for ε and δ are developed based on the only currently available global-scale data set derived from the SNPP satellite. We acknowledge that the parameterizations developed in this study have limited temporal representativeness, as they are based on a data set confined to 3 months in 2017. This temporal constraint may limit their ability to capture the full range of seasonal variability and extreme meteorological regimes. In the future, the spatial and seasonal variability of ε and δ can be further investigated by incorporating more data sets covering the entire year. Meanwhile, the universality and robustness of the proposed parameterizations can be further evaluated against ε and δ data sets derived from other satellite platforms, such as the Terra and Aqua satellites, the Himawari-8/9 satellite, and the Fengyun series satellites.

Acknowledgments

This research is supported by the National Natural Science Foundation of China (42325503, 42475089). The appointment of Chunsong Lu at Nanjing University of Information Science & Technology is partially supported by the Jiangsu Specially-Appointed Professor (R2024T01). Sinan Gao is supported by the National Science Foundation of China (42305091), Guangdong Basic and Applied Basic Research Foundation (2024A1515510021), and Open Grants of the China Meteorological Administration Aerosol-Cloud and Precipitation Key Laboratory (KDW2414). We acknowledge the High Performance Computing Center of Nanjing University of Information Science & Technology for the support of this work.

Conflict of Interest

The authors declare no conflicts of interest relevant to this study.

Data Availability Statement

The data set used in this study is available from the Level-1 and Atmosphere Archive and Distribution System (LAADS) Distributed Active Archive Center, https://dx.doi.org/10.5067/VIIRS/CLDPROP_L2_VIIRS_SNPP.011 (Platnick et al., 2017), the National Center for Atmospheric Research (NCAR), https://doi.org/10.5065/D6M043C6 (National Centers for Environmental Prediction, 2000), and NOAA, https://psl.noaa.gov/data/gridded/data.noaa.oisst.v2.highres.html (Huang et al., 2021). Figures are made with MATLAB version R2024a, available at https://www.mathworks.com/products/matlab.html (MathWorks, 2024).