Probabilistic Forecast of Visibility at Gimpo, Incheon, and Jeju International Airports Using Weighted Model Averaging

Abstract: In this study, weighted model averaging (WMA) was applied to calibrate ensemble forecasts generated using the Limited-area ENsemble prediction System (LENS). WMA is an easy-to-implement post-processing technique that assigns greater weight to ensemble member forecasts that exhibit better performance; it is used here to provide probabilistic visibility forecasts in the form of a predictive probability density function for the ensembles. The predictive probability density function is a mixture of a discrete point mass and two-sided truncated normal distribution components. Observations were obtained at Gimpo, Incheon, and Jeju International Airports, and 13 ensemble member forecasts were obtained using LENS, for the period of December 2018 to June 2019. Prior to applying WMA, a reliability analysis was conducted using rank histograms and reliability diagrams to assess the statistical consistency between the ensembles and the corresponding observations. The WMA method was then applied to each raw ensemble model, and a weighted predictive probability density function was proposed. Performance was evaluated using the mean absolute error, the continuous ranked probability score, the Brier score, and the probability integral transform. The results showed that the proposed method outperformed the raw ensembles, indicating that the raw ensembles were well calibrated by the predictive probability density function.


Introduction
The aviation sector requires new high-quality forecast information for all types of weather conditions. In particular, it is difficult to provide high-quality information on the weather variables directly related to aviation safety (e.g., wind direction, wind speed, cloud height, and visibility) with existing numerical weather prediction (NWP) systems. Visibility is one of the most important aspects of aviation weather. A lack of visibility is hazardous for airplane landing operations and detrimental to airport management, as it can lead to delays and cancellations. Efforts have been made to produce more reliable and accurate prediction information based on ensemble NWP systems. However, the visibility forecasts generated using NWP can be biased and poorly dispersed due to the limitations of the models. Therefore, deterministic and probabilistic forecasts can be applied to provide improved results by calibrating these uncertainties.
Several methods for forecasting visibility and post-processing forecasts in short-time forecasting have been developed. Vislocky and Fritsch [1] compared the performance of observation-based, model output statistics-based, and persistence climatology models for short-term deterministic ceiling and visibility forecasts. Leyton and Fritsch [2,3] extended the observation-based approach using high-density and, later, high-frequency networks of surface observations to produce probabilistic forecasts. Pasini et al. [4], Bremnes

Data
Visibility data obtained from Gimpo, Incheon, and Jeju International Airports in South Korea, and 13 ensemble member forecasts obtained using LENS, a numerical forecast model operated by the Korea Meteorological Administration (KMA), were used in this study. LENS consists of 13 members: 1 unperturbed control member and 12 perturbed members selected from the global ensemble prediction system with 32 km horizontal resolution. The horizontal resolution of LENS is 2.2 km for the domain covering the Korean Peninsula. In addition, LENS has 70 terrain-following vertical levels up to 40 km. The LENS microphysics scheme is based on Wilson and Ballard [23], and visibility is calculated as

Visibility = −ln ε / (β_air + β) = −ln ε / (β_air + πN r_m² Q η),

where the liminal contrast is ε = 0.05; β_air is the extinction coefficient of clean air (assumed to be equivalent to 100 km visibility); β is the extinction coefficient of droplets (dry or wet) in a volume, with N the droplet number concentration and r_m the mean droplet radius; Q is the extinction efficiency; and η is the ratio of the mean square radius to the square of the mean radius of the droplets (η = 0.75) [24,25].
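The extinction formula above can be sketched in Python. The default extinction efficiency `q_ext = 2.0` is an illustrative geometric-optics value, and the droplet inputs in the usage note are hypothetical; neither is stated in the paper or taken from LENS operational settings.

```python
import math

def visibility_m(n_droplets_per_m3, mean_radius_m, q_ext=2.0, eta=0.75, epsilon=0.05):
    """Visibility (m) from the droplet extinction formula above.

    q_ext = 2.0 is an assumed illustrative value, not a LENS setting.
    """
    # Clean-air extinction coefficient, chosen so that beta = 0 gives 100 km.
    beta_air = -math.log(epsilon) / 100_000.0
    # Droplet extinction coefficient: beta = pi * N * r_m^2 * Q * eta.
    beta = math.pi * n_droplets_per_m3 * mean_radius_m ** 2 * q_ext * eta
    return -math.log(epsilon) / (beta_air + beta)
```

With no droplets the function returns the assumed 100 km clean-air limit, and dense fog-like droplet loadings drive the result down to tens of meters.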
Datasets of hourly visibility, precipitation, and relative humidity were obtained between 1 December 2018 and 30 June 2019, and the ensemble forecasts were initialized at 0000 UTC and 1200 UTC for projection times of 1 h, 4 h, and 24 h. Since the model resolution changed on 29 November 2018, only the datasets from December 2018 to June 2019 were analyzed. Information on the observation stations is presented in Table 1. Prior to the analyses, we checked for missing values in the ensemble forecasts and their corresponding observations for the entire period; if either was absent, both were removed from the datasets. Because the characteristics of visibility may vary by season, the datasets were divided into seasonal subsets (winter and spring) for the three stations.
We first examined the empirical distributions of the observed visibility data and ensemble forecasts. The plots of the empirical distributions for Station 110 are presented in Figure 1. The first histogram presents the frequencies of all observed visibility values, and the second histogram shows only the frequencies of observations below 10 km. The majority of all visibility values were recorded at exactly 10 km (Figure 1(1)), which indicated that the observations were censored at 10 km even when the actual visibility exceeded 10 km. This plot therefore suggests a mixture of a discrete point mass at 10 km and a continuous probability distribution below it. In Figure 1(2), the observations are highly discretized. A histogram of the ensemble forecasts is presented in Figure 1(3). The scales of the observations and ensemble forecasts differed considerably: the ensemble forecasts generated visibility values much greater than 10 km and were not constrained at 10 km, in contrast to the observations. In addition, the ensemble forecasts were continuous and skewed to the right. For Stations 113 (Incheon) and 182 (Jeju), the empirical distributions of the observed visibility and ensemble forecasts showed patterns similar to those obtained for Station 110.

Materials and Methods
To set up the probabilistic prediction model for visibility, the characteristics of the observed visibility data and of the corresponding ensemble forecasts had to be considered. The observed visibility was censored at 10 km, while the corresponding ensemble forecasts generated using LENS took positive real values. In other words, even if visibility exceeded 10 km, the observed visibility was recorded as 10 km; in contrast, the corresponding values from the ensemble forecasts were used as generated. Therefore, we needed a probabilistic model that reconciled these two kinds of data.

Weighted Model Averaging
We consider WMA, which models the predictive PDF of a weather quantity of interest y as a mixture of conditional PDFs. Let f_k be the kth ensemble member forecast and h_k(y | f_k) the conditional PDF given the forecast f_k. The WMA predictive PDF is given by

p(y | f_1, …, f_K) = Σ_{k=1}^{K} w_k h_k(y | f_k),

where w_k is the weight of the kth ensemble member forecast and reflects its relative skill over the training period. The weights are constrained to be non-negative and to sum to 1.
To determine the mixture of conditional PDF h k (y| f k ) , we consider a two-component model. The first part consists of a point mass at 10 km and corresponds to the probability that the recorded visibility is 10 km, which is conditional on the kth forecast in the ensemble. The second component of the model assigns a member-specific PDF to visibility, given that it is less than 10 km. We use a two-sided truncated normal distribution defined in (0, 10).
First, we apply a logistic regression model to estimate the probability that the observed visibility is 10 km, given the forecast of the kth ensemble member, f_k, as follows:

logit P(y = 10 | f_k) = log [ P(y = 10 | f_k) / (1 − P(y = 10 | f_k)) ] = a_0k + a_1k f_k,

where a_0k and a_1k are regression coefficients. These parameters are estimated by logistic regression using the member forecasts in the training period as predictors and a vector of binary indicators of y = 10 as the response variable.
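As a sketch, the per-member logistic fit can be carried out with scikit-learn; the data-generating step below is purely synthetic and illustrative, standing in for one member's training-period forecasts and the censoring indicator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
f_k = rng.uniform(0, 20, 500)                              # synthetic member forecasts (km)
y_is_10 = (f_k + rng.normal(0, 3, 500) > 10).astype(int)   # synthetic indicator of y = 10

# Fit logit P(y = 10 | f_k) = a_0k + a_1k * f_k on the training period.
fit = LogisticRegression().fit(f_k.reshape(-1, 1), y_is_10)
a_1k, a_0k = fit.coef_[0, 0], fit.intercept_[0]
p10_at_15 = fit.predict_proba([[15.0]])[0, 1]              # estimated P(y = 10 | f_k = 15)
```

Because higher forecasts co-occur with the censoring event in this synthetic setup, the fitted slope a_1k is positive and the estimated probability at f_k = 15 km exceeds 0.5.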
To predict visibility when the observed visibility is less than 10 km, we consider a two-sided truncated normal distribution: the observed visibility y follows a normal distribution with mean µ and variance σ², truncated to 0 < y < 10, whose PDF is given by

g(y) = (1/σ) φ((y − µ)/σ) / [ Φ((10 − µ)/σ) − Φ((0 − µ)/σ) ], 0 < y < 10,

where φ(·) is the PDF of the standard normal distribution and Φ(·) is its cumulative distribution function.
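The truncated normal density can be checked numerically against SciPy's `truncnorm`, which parameterizes the truncation through the standardized bounds a = (0 − µ)/σ and b = (10 − µ)/σ; the µ and σ values below are arbitrary test values.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def truncnorm_pdf_0_10(y, mu, sigma):
    """PDF of N(mu, sigma^2) truncated to (0, 10), following the formula above."""
    denom = norm.cdf((10.0 - mu) / sigma) - norm.cdf((0.0 - mu) / sigma)
    return norm.pdf((y - mu) / sigma) / (sigma * denom)

mu, sigma = 4.0, 2.5
a, b = (0.0 - mu) / sigma, (10.0 - mu) / sigma   # SciPy's standardized bounds
manual = truncnorm_pdf_0_10(3.0, mu, sigma)
reference = truncnorm.pdf(3.0, a, b, loc=mu, scale=sigma)
```

The hand-written formula and the library value agree to floating-point precision, confirming the parameterization.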
Combining the two components of the model, we build the final conditional PDF for visibility, given the kth ensemble member forecast, as follows:

h_k(y | f_k) = P(y = 10 | f_k) 1(y = 10) + [1 − P(y = 10 | f_k)] g_k(y | f_k) 1(y < 10),

where g_k(y | f_k) is a member-specific truncated normal distribution with mean µ_k = b_0k + b_1k f_k and standard deviation σ_k. The final WMA model for the predictive probability density function of visibility y is then given by Equation (6):

p(y | f_1, …, f_K) = Σ_{k=1}^{K} w_k h_k(y | f_k).

We also consider a relative humidity variable for inclusion in each component of the model. The model for the binary outcome y = 10 is represented by

logit P(y = 10 | f_k, r_k) = a_0k + a_1k f_k + a_2k r_k,

where f_k and r_k are the member-specific visibility and relative humidity forecasts, respectively. Given that y < 10, the mean of the associated member-specific two-sided truncated normal distribution is specified as

µ_k = b_0k + b_1k f_k + b_2k r_k,

with standard deviation σ_k. By inserting these two components into Equation (6), we obtain a predictive PDF that takes the relative humidity variable into account.
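A minimal sketch of the member-specific conditional PDF h_k and the WMA mixture follows; the coefficient values passed in are placeholders standing in for the training-period fits, not values from the paper.

```python
import numpy as np
from scipy.stats import truncnorm

def member_density(y, f_k, a0, a1, b0, b1, sigma):
    """h_k(y | f_k): a point mass at 10 km mixed with a truncated normal on (0, 10).

    a0, a1, b0, b1, sigma are placeholders for the training-period estimates.
    """
    p10 = 1.0 / (1.0 + np.exp(-(a0 + a1 * f_k)))     # logistic P(y = 10 | f_k)
    mu = b0 + b1 * f_k
    a, b = (0.0 - mu) / sigma, (10.0 - mu) / sigma   # standardized truncation bounds
    if y == 10.0:
        return p10                                   # discrete component
    return (1.0 - p10) * truncnorm.pdf(y, a, b, loc=mu, scale=sigma)

def wma_density(y, forecasts, weights, params):
    """WMA predictive density: weighted mixture of the member densities."""
    return sum(w * member_density(y, f, *p)
               for w, f, p in zip(weights, forecasts, params))
```

With a0 = a1 = 0 the logistic component reduces to probability 0.5 at y = 10, which provides a quick sanity check of the discrete part.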
For the given observation of visibility less than 10 km, parameters b 0k and b 1k , and standard deviation σ k are estimated using the method of maximum likelihood.
Parameters w_1, …, w_K are estimated as follows. After estimating µ_k = b_0k + b_1k f_k, the standard deviation σ_k, and the probability that the observed visibility is 10 km given the forecast of the kth ensemble member f_k over the training period, the median derived from the predictive distribution in Equation (6) is used to estimate visibility below 10 km. The corresponding estimates of the observed visibility y during the training period are obtained as

ŷ_ik = 10, if the estimated probability P(y_i = 10 | f_ik) is at least 0.5,
ŷ_ik = the median of the member-specific truncated normal distribution, otherwise,

for i = 1, …, n and k = 1, …, K, where K is the total number of ensemble members and n is the total number of observations. We used the mean absolute error (MAE) and non-negative least squares to determine the weights w_1, …, w_K. The MAE-based weighting assigns the largest weight to the ensemble member forecast with the smallest prediction error in the training period. In contrast, the weights based on non-negative least squares are determined by minimizing

Σ_{i=1}^{n} ( y_i − Σ_{k=1}^{K} w_k ŷ_ik )², subject to w_k ≥ 0.

To select between the two sets of estimated weights, the prediction error of each is calculated over the training period; the weights w = (w_1, …, w_K) that yield the smaller prediction error are then selected. WMA is applied to each station individually.
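The non-negative least-squares weights can be obtained with `scipy.optimize.nnls` and then normalized to satisfy the sum-to-one constraint. The training matrix below is entirely synthetic (member 0 is made the most accurate on purpose), standing in for the per-member estimates ŷ_ik.

```python
import numpy as np
from scipy.optimize import nnls

# Synthetic training data: estimates y_hat_ik for K members over n cases.
rng = np.random.default_rng(1)
K, n = 5, 200
truth = rng.uniform(0, 10, n)                        # "observed" visibility (km)
noise = rng.normal(0.0, [0.5, 1.0, 2.0, 3.0, 4.0], (n, K))
Yhat = np.clip(truth[:, None] + noise, 0, 10)        # member estimates, censored at 10

w, _ = nnls(Yhat, truth)   # minimize ||truth - Yhat @ w||^2 subject to w >= 0
w = w / w.sum()            # normalize so the weights sum to 1 (WMA constraint)
```

Because member 0 has the smallest noise, it receives a larger weight than the noisiest member, mirroring the idea that better-performing members dominate the mixture.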

Scoring Rules
Prior to setting up a statistical model of the ensemble member forecasts, we evaluated the prediction skill of the ensembles. The measures used to compare prediction skill were the Brier score [26,27] for the binary event (y = 10), the continuous ranked probability score (CRPS), the mean absolute error (MAE), and the root mean square error (RMSE). The Brier score (BS) is defined as the mean squared error of the forecast probability for y = 10, as follows:

BS = (1/n) Σ_{i=1}^{n} (p_i − o_i)²,

where n is the number of observations; p_i is the forecast probability P(y = 10); and o_i is 1 if y = 10 and 0 otherwise. The BS takes values between 0 and 1, and a perfect BS has a value of zero. The CRPS [22,28] is a proper scoring rule defined as

CRPS(F, y) = ∫ [F(x) − 1(x ≥ y)]² dx,

where F(·) is the cumulative distribution function of the forecast, y is the observation, and 1(·) is the indicator function. The CRPS is a generalization of the MAE and is a more general measure of model fit than the BS.
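Both scores are straightforward to compute. For a finite ensemble, the CRPS integral reduces to the well-known energy form E|X − y| − ½ E|X − X′|, which the sketch below uses.

```python
import numpy as np

def brier_score(p, o):
    """Mean squared error of forecast probabilities for the binary event y = 10."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    return float(np.mean((p - o) ** 2))

def crps_ensemble(members, y):
    """Empirical CRPS of one ensemble forecast against a scalar observation,
    via the energy form CRPS = E|X - y| - 0.5 * E|X - X'|."""
    x = np.asarray(members, float)
    return float(np.mean(np.abs(x - y))
                 - 0.5 * np.mean(np.abs(x[:, None] - x[None, :])))
```

For a single-member "ensemble" the second term vanishes and the CRPS collapses to the absolute error, which is exactly the sense in which the CRPS generalizes the MAE.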

Results
Since visibility may vary depending on the season, the datasets were divided into seasonal subsets (winter and spring) for the three stations. For each station, the model was estimated using the training dataset, and the performance was evaluated using the test dataset. The training and test periods for each season are given in Table 2. The point forecast using WMA was obtained by evaluating the median of the predictive probability density function. Similarly, we took the median of the ensemble forecasts as the point forecast associated with the raw ensembles. In calculating the prediction performance, all ensemble forecasts greater than 10 km were set to 10 km to facilitate comparison between WMA and the ensemble forecasts.
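A sketch of the WMA median point forecast: the predictive median is 10 km whenever the continuous (below-10 km) component carries less than half the total mass, and otherwise it is the root of F(y) = 0.5. The parameter values in the usage checks are placeholders, not fitted values from the paper.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import truncnorm

def wma_median(weights, p10s, mus, sigmas):
    """Median of the WMA predictive distribution (point forecast)."""
    def cdf(y):  # CDF contribution of the continuous part of the mixture
        total = 0.0
        for w, p10, mu, s in zip(weights, p10s, mus, sigmas):
            a, b = (0.0 - mu) / s, (10.0 - mu) / s
            total += w * (1.0 - p10) * truncnorm.cdf(y, a, b, loc=mu, scale=s)
        return total
    if cdf(10.0 - 1e-9) < 0.5:
        return 10.0                       # point mass at 10 km holds the median
    return brentq(lambda y: cdf(y) - 0.5, 1e-9, 10.0 - 1e-9)
```

With a single member, no censoring probability, and a truncated normal symmetric about 5 km, the median is 5 km; with all mass on the point component, it is 10 km.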

Reliability Analysis
A rank histogram (RH) [29,30] was used to assess the reliability of the visibility ensemble forecasts and their corresponding observations at the three stations. The RH is a very useful visual tool for evaluating the reliability of ensemble forecasts and for identifying errors related to their mean and spread.
The RHs of the 13 ensemble forecasts and the corresponding observed visibility values for the three stations, that is, Gimpo (110), Incheon (113), and Jeju (182), are presented in Figure 2. In general, the RHs showed different trends and dispersions for each station. For Station 110 (Gimpo), an RH with high counts near the right extreme and low counts near the left extreme revealed a systematic error in the ensembles: the ensemble forecasts had a strong negative bias, indicating under-estimation. For Station 113 (Incheon), the RH showed nearly similar frequency counts at both extremes but a slightly higher frequency on the right. This implied that the ensemble forecasts may have been under-dispersive with a weak negative bias. However, the RH for Station 182 (Jeju) tended to have an almost uniform distribution, although it had a slightly higher frequency at the extreme left of the histogram.

The reliability index [31] was used to quantify the deviation of the rank distribution from uniformity. The reliability index (RI) is defined as

RI = Σ_{k=1}^{K} | p_k − 1/K |,

where K denotes the number of classes in the rank histogram and p_k denotes the observed relative frequency in class k. If the ensemble forecasts and observations were obtained from the same distribution, the RI would be zero. The RI for each station in Figure 2 is listed in Table 3, and the values show that the rank distributions were far from uniform.
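A sketch of the rank-histogram counts and the reliability index for an m-member ensemble (m + 1 rank classes). Breaking ties between an observation and ensemble members with a tiny random jitter is an assumed convention here, not necessarily the authors' choice.

```python
import numpy as np

def rank_histogram(obs, ens, seed=0):
    """Rank-histogram counts (m + 1 classes for m members) and the
    reliability index RI = sum_k |p_k - 1/(number of classes)|.

    Ties are broken with a tiny random jitter (an assumed convention).
    """
    rng = np.random.default_rng(seed)
    n, m = ens.shape
    ranks = np.empty(n, dtype=int)
    for i in range(n):
        jittered = ens[i] + rng.uniform(-1e-9, 1e-9, m)
        ranks[i] = 1 + np.sum(jittered < obs[i])      # rank of the observation
    counts = np.bincount(ranks, minlength=m + 2)[1:]  # classes 1..m+1
    p = counts / n
    ri = np.abs(p - 1.0 / (m + 1)).sum()
    return counts, ri
```

When every observation exceeds every member (a strong negative forecast bias), all mass piles into the rightmost class and the RI reaches its maximum for that shape.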

Prediction Skill
The prediction skills of the ensemble forecasts are listed in Table 4. From the table, it can be seen that the prediction skills for Jeju International Airport (182) were superior to those for the other stations in terms of the three scoring measures. This is similar to the results of the RHs: as the RH for Station 110 (Gimpo) had a strong negative bias, the corresponding prediction error was relatively larger than that for the other stations. We applied the WMA model to each projection time to obtain the probabilistic forecast, and the prediction performance was evaluated by comparing stations and projection times. The results for Station 110 (Gimpo) for 5 February 2019 and a projection time of 6 h are shown in Table 5 and Figure 3. Table 5 lists each ensemble forecast and the corresponding WMA output details, and Figure 3 depicts the WMA predictive PDF and the contributions from the weighted ensemble PDFs. From Table 5 and Figure 3, it can be seen that the ensemble forecasts ranged from approximately 0.9 to 3.0 km, while the verifying observation was 7 km. Although the observation lay outside the range of the ensemble forecasts, it was within the 80% central prediction interval obtained using WMA.
The comparison of the prediction performance of the ensemble median and the WMA median forecast for each station, in terms of the MAE, CRPS, and BS by season, is listed in Table 6. As shown in the table, although the prediction skills of the models differed slightly depending on the season, the WMA forecast outperformed the ensemble for all stations. The prediction error improved most markedly for Station 110, while the improvement for Station 182 (Jeju) was smaller. These improvements could also be inferred from the results of the previous reliability analysis: since the RH for Station 110 had a strong trend (negative bias), this bias was substantially calibrated by the predictive probability model, whereas the RH for Station 182 (Jeju) was generally uniform and therefore required relatively little bias calibration.

We then assessed whether adding additional predictors resulted in improved WMA forecasts. The available independent variables (predictors) in this study were relative humidity and quantitative precipitation. Quantitative precipitation was mostly 0, which was not useful for estimating visibility. Therefore, only relative humidity was significantly associated with observed visibility, and we limited the inclusion to this additional predictor. Table 7 lists the model performance scores of WMA for visibility (denoted WMA(vis)), WMA for visibility and relative humidity (denoted WMA(vis, rh)), and the raw ensembles. From the table, it can be seen that the WMA models performed better than the raw ensembles across all scores; in particular, the WMA(vis, rh) model showed a slight improvement over the WMA(vis) model. Moreover, for Station 110 (Gimpo), the prediction error improved significantly, while the improvement for Station 182 (Jeju) was smaller.
These results were consistent with the reliability analysis discussed above. The prediction performance in spring was better than that in winter, implying that prediction performance can be affected by season; that is, different prediction performance may be delivered depending on the season.
We then evaluated the performance of the complete predictive probability density function generated using the WMA model. Figures 4-9 show the verification RHs of the raw ensembles and the probability integral transform (PIT) histograms of the WMA forecasts for each station for winter and spring during the test period. The PIT histograms allow us to assess the improvement of the probabilistic forecasts generated with the WMA models over the raw ensembles. To generate the PIT histograms, each WMA cumulative distribution function was evaluated at the corresponding observation when it was less than 10 km; for observations of 10 km, the resulting probability was sampled randomly from a uniform distribution on the interval between 1 − P(y = 10) and 1. The RHs of the raw ensembles and the PIT histograms over the test period for each station in winter are presented in Figures 4-6.

First, the RH for Station 110 (Gimpo Airport) showed that the raw ensembles had a strong negative bias and weak under-dispersion. The PIT histograms of the WMA forecasts had nearly uniform distributions, indicating that the WMA models were well calibrated and showed substantial improvement over the raw ensembles. In particular, the WMA(vis, rh) model was better calibrated than WMA(vis). In the case of Station 113 (Incheon Airport), the RH showed under-dispersion and a weak positive bias. However, the PITs of the WMA forecasts indicated that these biases and dispersion errors were considerably reduced by the WMA predictive model. In addition, the more uniform pattern of the PIT of WMA(vis, rh) indicated that it was calibrated fairly well. Jeju Airport (182) showed a slightly different pattern from the other two stations: the RH had a slightly higher frequency at the low extreme but was almost uniform overall. Although there was little change in the overall pattern of the PIT, the extreme frequencies were reduced by the WMA models, indicating that this ensemble required less calibration.

Figures 7-9 show the PIT histograms of the WMA forecasts over the test period of spring. As reported in Figure 7, the RH for Station 110 showed a strong negative bias, whereas the PITs of the WMA forecasts showed nearly uniform distributions, implying that WMA was well calibrated over the range of the predictive PDFs. As mentioned above, the WMA(vis, rh) model was better calibrated than WMA(vis). Station 113 showed patterns almost similar to its winter results. In the case of Station 182, the PITs for spring were much better calibrated than those for winter, implying that the prediction error improved significantly in spring compared with winter.
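The randomized PIT described above can be sketched as follows. The argument `cdf_at_obs` holds each predictive CDF evaluated at its observation (it is only used where y < 10), and the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def pit_values(y_obs, cdf_at_obs, p10, seed=0):
    """Randomized PIT for a censored predictive distribution.

    For y < 10: the predictive CDF evaluated at the observation.
    For y = 10: a uniform draw from (1 - P(y = 10), 1), the interval
    spanned by the point-mass component at 10 km.
    """
    rng = np.random.default_rng(seed)
    u = rng.uniform(1.0 - np.asarray(p10), 1.0, size=len(y_obs))
    return np.where(np.asarray(y_obs) < 10.0, cdf_at_obs, u)
```

A histogram of these PIT values plays the same diagnostic role for the continuous WMA forecast as the rank histogram does for the raw ensemble: near-uniformity indicates good calibration.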

Conclusions
As low visibility poses a risk to flight operations and management and causes airplane delays and cancellations, visibility is one of the most important variables in aviation weather. Therefore, significant effort has been made to provide more reliable and accurate visibility forecasts based on ensemble numerical prediction systems. In this respect, this study provides a statistical post-processing method for verifying the performance of an ensemble prediction system and for calibrating the biases and dispersion errors present in it.
The characteristics of the ensemble member forecasts generated using LENS and of the observations were examined, and the bias and dispersion present in the ensemble forecasts were analyzed to construct an appropriate statistical model. Based on these results, we suggested a simple WMA method that provides a fully predictive PDF, making it easier to estimate the parameters of the model using the 13 ensemble member forecasts and the relative humidity generated using LENS. WMA is similar to Bayesian model averaging (BMA); however, WMA can estimate the weights more easily.
The resulting WMA predictive PDF was well calibrated with respect to the raw ensembles. The WMA model could resolve the problems arising from ensemble forecasts, including both systematic and random errors and the discrepancy in scale between ensembles and observations. In addition, we considered additional variables to increase the precision of the WMA forecasts. In this application, we found that the relative humidity forecast added some information to the visibility forecasts compared with the use of visibility ensembles alone. Because we did not compare the WMA and BMA results, we cannot determine which of the two methods yields better prediction performance; however, the WMA model has the advantage of being much easier to apply than the BMA model. In a further study, we aim to compare the performance of the two methods using various scoring rules. Finally, the method presented in this study is expected to be useful for the bias correction of other aviation variables and for probabilistic forecast analyses.