Similarity-based link prediction for estimating life cycle inventory data
Life cycle assessment (LCA) measures the environmental impacts of a product over its whole life cycle, from resource extraction through disposal. A good LCA study relies on the availability and quality of life cycle inventory (LCI) data, which measure the inputs of energy and/or resources and the outputs of emissions and/or waste for each stage of a product's life cycle. Traditionally, LCI data are collected from a variety of sources, including direct reports from manufacturing operations (e.g., meter readings, operation logs/journals), publications, and government statistics. Because they often require on-site investigation of manufacturing processes, these approaches are time consuming and expensive. In addition, manufacturers often treat the data as confidential and are unwilling to make them public. In contrast to these traditional approaches, this study proposes a new computational approach to estimate missing LCI data. The approach builds on link prediction techniques from network science, which predict missing information in a network from limited observations. LCI databases are commonly represented as a matrix, with columns representing manufacturing processes and rows representing environmental interventions. This matrix can also be represented as a network with two types of nodes, one representing manufacturing processes and the other representing environmental interventions. Links connect the two types of nodes and indicate what type and how much of each environmental intervention a manufacturing process is associated with. The basic assumption of our approach is that similar processes in an LCI network tend to have similar environmental interventions, i.e., material/energy inputs and emission outputs.
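As an illustration of the matrix-to-network view described above, the following minimal sketch turns a small LCI-style matrix into a list of weighted links between intervention nodes and process nodes. The toy values and the function name `matrix_to_bipartite_edges` are hypothetical and chosen only for illustration; they are not taken from Ecoinvent or from the study.

```python
import numpy as np

# Hypothetical toy LCI matrix (illustrative values only):
# rows = environmental interventions, columns = manufacturing processes.
lci = np.array([
    [1.2, 0.0, 0.9],
    [0.0, 3.5, 0.0],
    [0.4, 0.1, 0.5],
])


def matrix_to_bipartite_edges(matrix):
    """Return one (intervention, process, weight) link per nonzero entry."""
    rows, cols = np.nonzero(matrix)
    return [(int(i), int(j), float(matrix[i, j])) for i, j in zip(rows, cols)]


edges = matrix_to_bipartite_edges(lci)
for intervention, process, weight in edges:
    print(f"intervention {intervention} -- process {process}: {weight}")
```

Each link carries the amount of an intervention associated with a process, so a zero entry in the matrix simply corresponds to an absent link.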
We used the Ecoinvent 3.1 database to test our method in the following steps: 1) randomly select and remove a certain number of data points in each process, treating them as missing; 2) estimate the missing data based on the similarities of this process with other processes; 3) evaluate the estimation by comparing the estimated data with the original data. The results show that the estimation error (root mean square error, RMSE) increases as the number of missing data points increases. The 80% RMSE is less than 1.65 × 10^-2. About 50% of the optimal training-data sizes, i.e., the number of processes used to achieve the best estimation, are less than 1,477. This study shows the promising potential of computational approaches to estimate missing LCI data. First, estimating missing LCI data computationally, without collecting additional empirical data, can significantly reduce the cost and time of LCA studies. Second, data used in an LCA often come from various sources with different quality and accuracy. By comparing the predicted results with the observed data, one can evaluate the quality of those observed data, identify inaccurate data, and guide future improvements. Lastly, new processes and products are constantly being invented. Predicting emerging links between processes and environmental interventions can help reasonably estimate LCI data for emerging technologies for which empirical LCI data are less available.
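The three evaluation steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the study's implementation: the abstract does not specify the similarity measure or the estimator, so cosine similarity over the remaining (non-removed) entries and a similarity-weighted average of the `k` most similar processes are assumed here, and all function names and the toy matrix are hypothetical.

```python
import numpy as np


def estimate_missing(lci, target, missing_rows, k=2):
    """Estimate removed entries of column `target` from its k most similar columns.

    Assumption: similarity is the cosine similarity computed on the rows
    that were not removed; the estimate is a similarity-weighted average.
    """
    observed = np.setdiff1d(np.arange(lci.shape[0]), missing_rows)
    others = np.array([j for j in range(lci.shape[1]) if j != target])
    t = lci[observed, target]
    sims = []
    for j in others:
        o = lci[observed, j]
        denom = np.linalg.norm(t) * np.linalg.norm(o)
        sims.append(float(t @ o) / denom if denom > 0 else 0.0)
    sims = np.array(sims)
    top = np.argsort(sims)[::-1][:k]           # indices of the k most similar
    weights = np.clip(sims[top], 0.0, None)    # ignore negative similarities
    if weights.sum() == 0:
        return np.zeros(len(missing_rows))
    neighbors = lci[np.ix_(missing_rows, others[top])]
    return neighbors @ weights / weights.sum()


def rmse(estimated, original):
    """Root mean square error between estimated and original values."""
    diff = np.asarray(estimated) - np.asarray(original)
    return float(np.sqrt(np.mean(diff ** 2)))


def evaluate_process(lci, target, n_missing, k=2, seed=0):
    """Step 1: remove random entries; step 2: estimate; step 3: score."""
    rng = np.random.default_rng(seed)
    missing_rows = rng.choice(lci.shape[0], size=n_missing, replace=False)
    estimated = estimate_missing(lci, target, missing_rows, k)
    return rmse(estimated, lci[missing_rows, target])


# Hypothetical toy data: processes 0 and 1 are identical, process 2 differs.
toy = np.array([
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 5.0],
    [3.0, 3.0, 0.0],
    [4.0, 4.0, 1.0],
])
toy_error = evaluate_process(toy, target=0, n_missing=1, k=1)
```

In this toy case the target process has an identical neighbor, so the removed value is recovered almost exactly; on real inventory data the error would depend on how many entries are removed and on how many processes are used for the estimate.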