Author/Authors :
Shamsipour, Mansour Tehran University of Medical Sciences (TUMS) - School of Public Health, Center for Air Pollution Research (CAPR), Institute for Environmental Research (IER), Non-Communicable Diseases Research Center, Endocrinology and Metabolism Population Sciences Institute - Department of Epidemiology and Biostatistics, Tehran, Iran
Farzadfar, Farshad Tehran University of Medical Sciences (TUMS) - Non-Communicable Disease Research Center, Endocrinology and Metabolism Population Science Institute, Endocrinology and Metabolism Research Center, Endocrinology and Metabolism Research Institute, Tehran, Iran
Gohari, Kimiya Shahid Beheshti University of Medical Sciences - Faculty of Paramedical Sciences - Department of Biostatistics, Tehran, Iran
Gohari, Kimiya Tehran University of Medical Sciences (TUMS) - Non-Communicable Diseases Research Center, Endocrinology and Metabolism Population Sciences Institute, Tehran, Iran
Parsaeian, Mahboubeh Tehran University of Medical Sciences (TUMS) - School of Public Health, Non-Communicable Disease Research Center, Endocrinology and Metabolism Population Science Institute - Department of Epidemiology and Biostatistics, Tehran, Iran
Amini, Hassan Swiss Tropical and Public Health Institute (Swiss TPH) - Department of Epidemiology and Public Health, Switzerland
Amini, Hassan Kurdistan University of Medical Sciences - Kurdistan Environmental Health Research Center, Iran
Amini, Hassan University of Basel, Switzerland
Rabiei, Katayoun Isfahan University of Medical Sciences - Isfahan Cardiovascular Research Center, Cardiovascular Research Institute, Iran
Hssanvand, Mohammad Sadegh Tehran University of Medical Sciences (TUMS) - Center for Air Pollution Research (CAPR), Institute for Environmental Research (IER), School of Public Health - Department of Environmental Health Engineering, Tehran, Iran
Navidi, Iman Tehran University of Medical Sciences (TUMS) - School of Public Health, Non-Communicable Diseases Research Center, Endocrinology and Metabolism Population Sciences Institute - Department of Epidemiology and Biostatistics, Tehran, Iran
Fotouhi, Akbar Tehran University of Medical Sciences (TUMS) - School of Public Health - Department of Epidemiology and Biostatistics, Tehran, Iran
Naddafi, Kazem Tehran University of Medical Sciences (TUMS) - Center for Air Pollution Research (CAPR), Institute for Environmental Research (IER), School of Public Health - Department of Environmental Health Engineering, Tehran, Iran
Sarrafzadegan, Nizal Isfahan University of Medical Sciences - Isfahan Cardiovascular Research Institute, Isfahan Cardiovascular Research Center, Iran
Mansouri, Anita Shahid Beheshti University of Medical Sciences - Faculty of Paramedical Sciences - Department of Biostatistics, Tehran, Iran
Mansouri, Anita Tehran University of Medical Sciences (TUMS) - Non-Communicable Disease Research Center, Endocrinology and Metabolism Population Science Institute, Tehran, Iran
Mesdaghinia, Alireza Tehran University of Medical Sciences (TUMS) - Center for Air Pollution Research (CAPR), Institute for Environmental Research (IER), Center for Water Quality Research (CWQR), Institute for Environmental Research (IER), Tehran, Iran
Larijani, Bagher Tehran University of Medical Sciences (TUMS) - Endocrinology and Metabolism Research Center, Endocrinology and Metabolism Research Institute, Tehran, Iran
Yunesian, Masud Tehran University of Medical Sciences (TUMS) - Center for Air Pollution Research (CAPR), Institute for Environmental Research (IER), School of Public Health - Department of Environmental Health Engineering, Tehran, Iran
Abstract :
BACKGROUND: Managing and cleaning large environmental monitoring datasets is a particular challenge. In this article, we present a novel framework for exploring and cleaning such datasets. As a case study, we applied the method to air quality data from Tehran, Iran, from 1996 to 2013.
METHODS: The framework consists of data acquisition [here, data on particulate matter with aerodynamic diameter ≤10 µm (PM10)], database development, initial descriptive analyses, removal of inconsistent data using a plausibility range (PR), and detection of missing-data patterns. Additionally, we developed a novel tool, the spatiotemporal screening tool (SST), which considers both the spatial and the temporal nature of the data during outlier detection. We also evaluated the effect of dust storms in the outlier detection phase.
RESULTS: The raw mean PM10 concentration before implementation of the algorithms was 88.96 µg/m³ for 1996–2013 in Tehran. After implementing the algorithms, 5.7% of data points in total were identified as unacceptable outliers, of which 69% were detected by the SST and 1% by the dust storm algorithm. In addition, 29% of the unacceptable outlier values fell outside the PR. The mean PM10 concentration after implementation of the algorithms was 88.41 µg/m³. However, the standard deviation decreased significantly, from 90.86 µg/m³ to 61.64 µg/m³. There was no distinguishable significant pattern in the missing data by hour, day, month, or year.
CONCLUSION: We developed a novel framework for cleaning large environmental monitoring datasets, which can identify hidden patterns. We also presented a complete picture of PM10 in Tehran from 1996 to 2013. Finally, we propose implementing our framework on large spatiotemporal databases, especially in developing countries.
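To make the screening idea more concrete, the following is a minimal Python sketch of a two-stage check: a plausibility-range filter followed by a simple spatiotemporal comparison. It is not the authors' SST, whose details the abstract does not give. The range limits (0 to 1000 µg/m³), the `ratio` threshold, the `window` size, the median-based rule, and the function names `in_plausibility_range` and `screen_pm10` are all assumptions made for illustration.

```python
from statistics import median

# Hypothetical plausibility range for hourly PM10 in µg/m³ (assumed values,
# not taken from the abstract).
PLAUSIBLE_MIN = 0.0
PLAUSIBLE_MAX = 1000.0


def in_plausibility_range(value, low=PLAUSIBLE_MIN, high=PLAUSIBLE_MAX):
    """Return True if a reading lies inside the assumed plausibility range."""
    return low <= value <= high


def screen_pm10(readings, window=1, ratio=3.0):
    """Flag suspicious readings.

    readings: dict mapping station name -> list of hourly values of equal
              length, with None marking a missing value.
    Returns a dict mapping (station, hour_index) -> reason string.
    """
    flagged = {}
    for station, series in readings.items():
        for t, value in enumerate(series):
            if value is None:
                continue  # missing values are handled separately
            if not in_plausibility_range(value):
                flagged[(station, t)] = "plausibility_range"
                continue
            # Temporal neighbours: same station, adjacent hours.
            lo, hi = max(0, t - window), min(len(series), t + window + 1)
            temporal = [
                series[k] for k in range(lo, hi)
                if k != t and series[k] is not None
                and in_plausibility_range(series[k])
            ]
            # Spatial neighbours: other stations, same hour.
            spatial = [
                other[t] for name, other in readings.items()
                if name != station and other[t] is not None
                and in_plausibility_range(other[t])
            ]
            if not temporal or not spatial:
                continue  # not enough context to judge
            # Assumed rule: flag only if the value is far above BOTH contexts.
            if value > ratio * median(temporal) and value > ratio * median(spatial):
                flagged[(station, t)] = "spatiotemporal"
    return flagged


# Illustrative, made-up readings for three hypothetical stations.
readings = {
    "A": [80, 85, 900, 90, 88],
    "B": [75, 80, 82, 85, -5],
    "C": [90, 95, 92, None, 91],
}
flags = screen_pm10(readings)
print(flags)
```

In this sketch the isolated spike at station A is flagged as a spatiotemporal outlier, and the negative value at station B is rejected by the range check. Because a value must exceed both its temporal and its spatial neighbours, a city-wide episode in which every station rises at once would not be flagged. That is one simple way to avoid discarding genuine dust storm peaks, though the abstract does not say whether the authors' dust storm algorithm works this way.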