Predicting Water Quality Index (WQI) by feature selection and machine learning: A case study of An Kim Hai irrigation system
13/05/2024
Abstract
A variety of water quality indices have been used to assess the state of waterbodies all over the world. Traditional methods of calculating a Water Quality Index (WQI) require the evaluation of many water quality parameters, making them costly and time-consuming. In recent years, machine learning (ML) algorithms have emerged as an effective tool for solving many environmental problems, including water quality management. In this study, we investigate the performance of an ML-based method for calculating the WQI. We apply several feature selection techniques to select the key parameters fed to the ML models. Experiments are carried out to evaluate the WQI on a dataset collected from 2007 to 2020 in the An Kim Hai system, one of the most important irrigation systems in the north of Vietnam. The results show that feature selection significantly reduces the number of water quality parameters fed to the ML models without loss of accuracy. In particular, using the embedded method, we identify four parameters, Coliform, DO, Turbidity, and TSS, that have the greatest impact on water quality. Based on these parameters, the Random Forest model provides the best accuracy in predicting the WQI values of the An Kim Hai system, with a Similarity of 0.94. The combination of feature selection and ML methods can therefore be considered an effective alternative for calculating the WQI, delivering good performance with fewer input parameters. This reduces the cost, effort, and time of water quality monitoring.
Introduction
Surface water is an important resource for the environment and for human life. Fresh surface water sustains ecological systems, provides a habitat for countless aquatic animals and plants, and supports many human uses like drinking water, domestic uses, irrigation, livestock, commercial uses, and industrial uses. Unfortunately, human exploitation has negative effects on water quality. Daily domestic wastewater, industrial wastewater, and agricultural wastewater have polluted surface water resources in rivers, lakes, ponds, and streams. In turn, the deterioration of surface water quality causes serious impacts on the surrounding ecosystem. In order to prevent those negative effects, monitoring and assessing the quality of surface water is an important task, allowing managers to take necessary actions to improve water quality once it is contaminated.
Surface water quality is determined by the physical, chemical, and biological components the water contains. These components arise from natural processes and human activities. For example, water flow erodes the banks and carries mud with it, leading to turbidity and suspended solids. Wastewater from livestock and farming activities contains biological contaminants like bacteria, viruses, and parasitic worms and chemical contaminants like phosphates, pesticides, and ammoniacal nitrogen, and it can also shift the pH. Meanwhile, industrial wastewater from manufacturing factories may carry heavy metals and surfactants. In short, many sources and factors affect water quality, which makes its assessment more difficult. In this context, a multitude of Water Quality Indices (WQIs) have been introduced as an effective measure for the overall assessment of water quality. A WQI typically integrates many physical, chemical, and biological parameters into a single value that describes the general health or status of a waterbody. WQI values typically range from 0 to 100, where a larger value indicates better water quality.
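As an illustration only, one widely used generic way of integrating parameters into a single value is a weighted arithmetic mean of sub-indices. The symbols below are assumptions introduced for this sketch (q_i is the sub-index score of parameter i, w_i its weight, n the number of parameters); this is not necessarily the formula applied to the An Kim Hai data.

```latex
\mathrm{WQI} = \sum_{i=1}^{n} w_i \, q_i,
\qquad \sum_{i=1}^{n} w_i = 1,
\qquad 0 \le q_i \le 100
```

Under this form the index stays within 0 to 100, and every q_i has to be measured before the index can be computed, so the monitoring cost grows with n.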
In the literature, there are several ways to calculate the WQI. Traditional methods formulate a single WQI as a combination of many sub-indices. To calculate the WQI, these methods therefore require measuring a large number of water quality parameters, which makes them costly and time-consuming. This is a major obstacle in assessing water quality, especially for developing countries, where infrastructure and investment are limited. Recently, machine learning (ML) based methods have been extensively used for calculating the WQI. Owing to their power in handling complex nonlinear relationships in data, ML algorithms can mine potential patterns, discover the underlying mechanisms in the data, and then provide a good prediction of the WQI value.
When calculating the WQI, the first step of an ML-based method is to select the crucial water quality parameters. This selection is very important, as it has a great impact on the effectiveness of the WQI calculation. The main advantage of ML-based methods over traditional methods is that they aim to use as few parameters as possible while achieving the highest possible accuracy. From this point of view, an ML-based model has little practical value if it requires the same inputs as the traditional methods. In the literature, most existing studies applied statistical methods such as the Pearson correlation coefficient (Asadollah et al., 2021; Kocer and Sevgili, 2014), the PCA method (Kim et al., 2017; Jiang et al., 2020), or expert opinion (Uddin et al., 2021) to determine important water quality parameters. Although these methods worked in some case studies, they do not take advantage of the other feature selection algorithms available in machine learning. Some studies even tried all possible combinations of parameters to find the most effective ones. This not only consumes time and computational effort but also makes such approaches difficult to apply in practice. To overcome these limitations, in this study, we propose a novel ML-based method that combines feature selection techniques with several ML algorithms for calculating the WQI. The performance of the proposed method is validated in a case study in which we evaluate the WQI on a dataset collected from the An Kim Hai (AKH) irrigation system, an important irrigation system in the north of Vietnam. The results show that the proposed method can significantly reduce the number of water quality parameters without losing accuracy in calculating the WQI.
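The correlation-based screening used in the earlier studies cited above can be sketched roughly as follows. This is a minimal illustration, not the procedure of any cited paper: the feature names `a`, `b`, `c`, the toy values, and the helper `rank_by_correlation` are all hypothetical, and the target stands in for a WQI series.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def rank_by_correlation(features, target):
    """Return (scores, names sorted by decreasing |r| with the target)."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return scores, ranked


# Toy data (hypothetical): three candidate parameters and a target index.
target = [1, 2, 3, 4, 5]
features = {
    "a": [2, 4, 6, 8, 10],  # perfectly correlated with the target
    "b": [5, 3, 4, 1, 2],   # strongly negatively correlated
    "c": [1, 1, 2, 1, 1],   # essentially uncorrelated
}

scores, ranked = rank_by_correlation(features, target)
top_two = ranked[:2]  # keep the two parameters with the largest |r|
print(top_two)
```

A filter of this kind scores each parameter on its own and captures only linear association, which is one reason to also consider selection methods that are tied to the predictive model.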
To the best of our knowledge, this is the first study in the literature to exploit different kinds of feature selection methods to automatically find the parameters that have the most influence on water quality, using fewer parameters while preserving the performance of the ML models. The main contributions of this study are summarized as follows:
1) We propose a novel ML-based method that combines feature selection methods and machine learning algorithms to estimate the WQI values. Experiments on a real dataset show that the proposed method gives a good prediction of WQI values, with a similarity of up to 0.94.
2) We explore the power of various feature selection techniques to find the most influential water quality parameters. These techniques allow us to reduce, significantly and efficiently, the number of water quality parameters used in predicting the WQI. In particular, by using the embedded method and the Random Forest algorithm, the number of parameters can be reduced from 10 to 4 while achieving the best accuracy in calculating the WQI.
3) We demonstrate through a case study that the ML-based method can be considered an effective alternative for calculating the WQI. The method not only evaluates the WQI values accurately but also saves considerable time and analytical cost. This finding has practical value, especially in cases where water quality parameters are not straightforward to measure.
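To make the idea of an embedded feature selection method concrete, the sketch below lets the model's own training decide which inputs to keep. To stay dependency-free it swaps in L1-regularised linear regression (Lasso, fitted by coordinate descent) instead of the Random Forest used in this study, and it runs on synthetic data in which only 4 of 10 features drive the target. The data, the penalty `lam = 0.3`, and all function names are assumptions made for illustration; nothing here reproduces the paper's experiments.

```python
import random


def standardize(columns):
    """Scale each feature column to zero mean and unit variance."""
    scaled = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        std = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        scaled.append([(v - mean) / std for v in col])
    return scaled


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(columns, y, lam, sweeps=200):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 for standardized columns."""
    n = len(y)
    weights = [0.0] * len(columns)
    residual = list(y)  # equals y - Xw while all weights are zero
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual, adding back its own part.
            rho = sum(x * (r + weights[j] * x) for x, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, lam)  # unit-variance columns: no rescaling
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - delta * x for x, r in zip(col, residual)]
                weights[j] = new_w
    return weights


# Synthetic data (hypothetical): 10 candidate features, only the first 4 matter.
rng = random.Random(0)
n_samples, n_features = 200, 10
raw = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)]
target = [
    3.0 * raw[0][i] - 2.0 * raw[1][i] + 1.5 * raw[2][i] + 1.0 * raw[3][i]
    + rng.gauss(0, 0.1)
    for i in range(n_samples)
]
target_mean = sum(target) / n_samples
centered = [t - target_mean for t in target]  # no intercept term is fitted

weights = lasso_coordinate_descent(standardize(raw), centered, lam=0.3)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print(selected)
```

Features whose weight is driven exactly to zero are dropped as a by-product of fitting. A tree-based embedded method would rank inputs by the model's feature importances instead, but the principle of selecting during training is the same.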
The rest of the paper is organized as follows. In Section 2, we present a brief review of the related work in the literature. Section 3 provides information about the case study, including the study area, the data collection, and the traditional method used to calculate the WQI values in Vietnam. The proposed ML-based method for computing the WQI is presented in Section 3. Section 4 explains the experiments and discusses the obtained results. Finally, some concluding remarks are given in Section 5.
Source: Predicting Water Quality Index (WQI) by feature selection and machine learning: A case study of An Kim Hai irrigation system
Authors: Bui Quoc Lap (a), Thi-Thu-Hong-Phan (b), Huu Du Nguyen (c, d), Le Xuan Quang (e), Phi Thi Hang (e), Nguyen Quang Phi (f), Vinh Truong Hoang (g), Pham Gia Linh (h), Bui Thi Thanh Hang (h)
(a) Faculty of Chemistry & Environment, ROOM team, Thuyloi University, 175 Tay Son, Dong Da, Hanoi, Viet Nam
(b) Department of Artificial Intelligent, FPT University, Danang, Viet Nam
(c) School of Applied Mathematics and Informatics, Hanoi University of Sciences and Technology, Hanoi, Viet Nam
(d) Department of Mathematics, Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi, Viet Nam
(e) Institute for Water and Environment, Vietnam Academy for Water Resources, 171 Tay Son, Dong Da, Hanoi, Viet Nam
(f) Faculty of Water Resources Engineering, Thuyloi University, 175 Tay Son, Dong Da, Hanoi, Viet Nam
(g) Ho Chi Minh City Open University, Ho Chi Minh, Viet Nam
(h) Undergraduate Course of Environmental Engineering, Thuyloi University, 175 Tay Son, Dong Da, Hanoi, Viet Nam
Ecological Informatics, Volume 74, May 2023, 101991