DOCTORAL DISSERTATION: Using Data Science Methods to Address Challenges in Sustainable Consumption and Production at Multiple Scales

Bu Zhao
Wednesday, October 27, 2021 - 9:00am to 10:00am
Event Sponsor: 
Center for Sustainable Systems
School for Environment and Sustainability


Sustainable consumption and production (SCP) have been extensively discussed, studied, and implemented by various stakeholders, including government, international organizations, industry, and academia, working in tandem to achieve greater gains in environmental sustainability. However, SCP research and practice still face substantial challenges due to various issues with data. The emerging field of data science can potentially provide alternative solutions to some of these data challenges. This dissertation shed light on how data science methods can help improve data availability and data resolution in SCP research and practice. It selected four specific questions to demonstrate how various data science methods can be applied to address data challenges in SCP at multiple scales.

This dissertation began by addressing the challenge of data availability through a case study in the area of life cycle assessment (LCA). I developed a supervised learning model to estimate missing unit process data relying solely on existing data. Built upon ecoinvent 3.1, a widely used life cycle inventory (LCI) database, the model can classify zero and non-zero flows with a low misclassification rate (0.79% when 10% of the data are missing). For non-zero flows, the model can estimate their values with an R² over 0.7 when less than 20% of the data are missing in one unit process. This model can provide critical data to complement primary LCI data for LCA studies and demonstrates the promising applications of machine learning techniques in LCA.

Next, I addressed the data availability issue that SCP research often faces at the supply chain scale. Specifically, I proposed a hybrid method to estimate regional input-output (IO) tables, which are widely used to evaluate the supply chain-wide environmental impacts of consumption, by combining the traditional RAS method with a Deep Neural Network (DNN) model. The DNN model improved performance significantly, raising R² from 0.6412 to 0.8726 when estimating IO tables one year later and from 0.5271 to 0.7893 when estimating them five years later. The estimated IO tables can be used to examine the environmental impacts of consumption for periods when primary IO tables are not available.
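
For readers unfamiliar with the RAS method mentioned above, the following is a minimal sketch of the classical procedure (biproportional scaling), written in plain Python. It is not the hybrid RAS-DNN method developed in the dissertation; the function name `ras_balance`, its parameters, and the toy 2-by-2 table are assumptions made purely for illustration. The idea is to rescale a base-year table alternately by rows and by columns until its row and column totals match the known totals for the target year.

```python
def ras_balance(base, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Rescale `base` so its row and column sums match the given targets.

    Illustrative sketch of classical RAS (biproportional scaling).
    All names here are hypothetical, not taken from the dissertation.
    """
    total_rows, total_cols = sum(row_targets), sum(col_targets)
    if abs(total_rows - total_cols) > 1e-6 * max(1.0, abs(total_rows)):
        raise ValueError("row and column targets must share the same grand total")

    table = [list(row) for row in base]  # work on a copy
    for _ in range(max_iter):
        # Row step: scale each row so it sums to its target.
        for i, row in enumerate(table):
            row_sum = sum(row)
            if row_sum > 0:
                factor = row_targets[i] / row_sum
                table[i] = [value * factor for value in row]

        # Column step: scale each column so it sums to its target.
        col_sums = [sum(col) for col in zip(*table)]
        for j, col_sum in enumerate(col_sums):
            if col_sum > 0:
                factor = col_targets[j] / col_sum
                for row in table:
                    row[j] *= factor

        # Columns now match exactly; stop once the rows match as well.
        worst = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if worst < tol:
            break
    return table


# Toy example: a made-up base-year table and made-up target-year totals.
base_table = [[10.0, 20.0], [30.0, 40.0]]
balanced = ras_balance(base_table, row_targets=[36.0, 84.0], col_targets=[50.0, 70.0])
```

Classical RAS needs only the base-year table and the target-year row and column totals, which is why it is commonly used when a full target-year table has not been published.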

I also addressed some of the data resolution challenges in SCP research. I started with a specific question: how to map environmental impacts of production and consumption (air pollution, in this study) at high spatiotemporal resolution at the urban scale. I developed a machine learning model to infer the distribution of PM2.5 concentration in Beijing at 1 km by 1 km and 1-hour resolution by using mobile monitoring data collected from a ride-hailing fleet equipped with low-cost sensors. The model was able to show both short- and long-term variations of urban PM2.5 concentration and identify local air pollution hotspots. Compared with a benchmark model that only uses data from stationary monitoring sites, my model showed significant improvement, with R² increasing from 0.56 to 0.80 and Root Mean Square Error (RMSE) decreasing from 12.6 to 8.1 μg/m³. These results demonstrated the potential and necessity of using fleet vehicles as routine mobile sensors, combined with advanced data science methods, to provide high-resolution urban air quality monitoring.
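
As a reference for the two metrics reported throughout this abstract, the sketch below shows how R² and RMSE are commonly computed from paired observed and predicted values. It assumes the coefficient-of-determination definition of R² (one minus the ratio of residual to total sum of squares); the abstract does not state which definition was used, and the function names and sample values are made up for illustration.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error, in the same units as the observations."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up concentrations (e.g. in μg/m³), not data from the study.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"R^2  = {r_squared(observed, predicted):.3f}")
print(f"RMSE = {rmse(observed, predicted):.3f}")
```

A higher R² means the model explains more of the variance in the observations, while a lower RMSE means smaller typical errors in the original units.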

Lastly, I studied the potential impacts of consumption and production systems on social well-being at the micro level as another case of improving data resolution. Specifically, I examined the relationship between all-cause mortality and individual risk factors by adopting more flexible machine learning models. I used survival trees and random survival forests (RSF) to detect, analyze, and visualize complex interactions between individual risk factors and all-cause mortality, based on data from the National Health and Nutrition Examination Survey (NHANES). Based on these models, I identified the most important physiological indicators and examined the associations between these indicators and all-cause mortality by constructing multi-dimensional heatmaps. The method used in this study can automatically identify the most influential factors while accounting for complex interactions and maintaining high predictive power, which significantly improves the flexibility of model construction for survival data.


Bu Zhao is a Ph.D. candidate in Resource Policy and Behavior (SEAS) and Scientific Computing (MICDE) at the University of Michigan. He is affiliated with the Center for Sustainable Systems, working under the supervision of Ming Xu. His research broadly focuses on applications of data science in environmental systems (e.g., life cycle assessment, input-output analysis, urban air pollution mapping). He uses data science techniques and machine learning tools to advance the understanding of sustainable environmental systems.


Join Zoom meeting:  |  Passcode: 231913
