One Health/ICABR – Società Italiana di Economia dello Sviluppo

The One Health Approach: The exposome and its implications for research and policy actions

The exposome

In 2005 Christopher Wild coined the term “exposome”, defining it as the result of cumulative life-course environmental exposures from the prenatal period onwards (Wild, 2005). Since then, this definition has been extended and refined by enumerating the constituent components of the exposome and by suggesting metrics and measurement methodologies.

In general terms, however, the exposome is an attempt to define in more meaningful and documented detail the environment variable in the equation phenotype = genotype + environment. It is semantically characterized by a parallelism with the genome and has been interpreted as an index of “nurture”, as opposed to the genome as an index of “nature”. As Miller and Jones put it: “… the exposome is even more expansive than what Dr Wild described 9 years ago. The exposome captures the essence of nurture; it is the summation and integration of external forces acting upon our genome throughout our lifespan” (Miller and Jones, 2014). The exposome also seems to recall the notion of human capital (Becker, 1964), itself a somewhat ambiguous term, which can also be interpreted as embodying the genome and the exposome, both interacting to produce the unique quality of a lifetime.

The exposome promises to be a revolution in biostatistics because it replaces the usual model of data collection with the notion of the cumulative value of multivariate exposure factors. This concept is sometimes used, though in a highly simplistic way, to measure stock variables such as capital and wealth, but in that use it does not include the heterogeneous exogenous and endogenous interactions that are distinctive features of the accumulation process. Because the exposome is defined as the cumulative measure of external exposures on an organism (external exposome) and the associated biological responses (internal exposome), its empirical measurement needs a time-dependent (ideally lifetime-dependent) set of exposures and reactions over long periods or, alternatively, in selected windows of time. The exposome research paradigm thus takes a radically different approach to causal inference from traditional data analysis because of three distinctive features: multiple exposure domains; integrated exposure and response data across multiple scales of variation, such as populations, time and space; and data-driven discovery from the investigation of multiple exposure–response relationships (Stingone et al., 2017).
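The idea of a cumulative measure over a lifetime or over selected windows of time can be illustrated with a minimal sketch. It assumes, purely for illustration, that each exposure domain is recorded as a list of (age, level) measurements and that the level varies linearly between measurements; the record format, the function name `cumulative_exposure` and the numbers are hypothetical and do not come from any of the cited studies.

```python
from typing import List, Optional, Tuple

# Hypothetical record format: (age in years, measured exposure level).
Series = List[Tuple[float, float]]


def cumulative_exposure(series: Series,
                        start: Optional[float] = None,
                        end: Optional[float] = None) -> float:
    """Area under a piecewise-linear exposure curve, clipped to [start, end]."""
    points = sorted(series)
    if len(points) < 2:
        return 0.0
    start = points[0][0] if start is None else start
    end = points[-1][0] if end is None else end
    total = 0.0
    for (t0, x0), (t1, x1) in zip(points, points[1:]):
        lo, hi = max(t0, start), min(t1, end)
        if hi <= lo:
            continue  # segment lies outside the window (or has zero length)
        slope = (x1 - x0) / (t1 - t0)
        x_lo = x0 + slope * (lo - t0)
        x_hi = x0 + slope * (hi - t0)
        total += 0.5 * (x_lo + x_hi) * (hi - lo)  # trapezoid rule
    return total


# Illustrative, made-up measurements for two exposure domains.
exposures = {
    "pm25": [(0, 10.0), (10, 20.0), (20, 20.0)],
    "noise": [(0, 50.0), (20, 50.0)],
}

# Cumulative exposure over the whole observed life course ...
lifetime = {name: cumulative_exposure(s) for name, s in exposures.items()}
# ... and over a selected window (ages 5 to 15).
window = {name: cumulative_exposure(s, start=5, end=15)
          for name, s in exposures.items()}
print(lifetime)
print(window)
```

The result is expressed in level-years (for example µg/m³ × years). A simple time integral of this kind captures accumulation only; it does not represent the interactions between exposures or the internal biological responses that the exposome concept also includes.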

This implies that cohort studies, rather than randomized controlled trials, will be the dominant study design for subjects identified by their exposure to complex risk factors such as those implied by the exposome model. It also implies that these studies will need mega-cohorts and more sophisticated statistical methodologies to disentangle the cause-effect relations among the myriad exposures and the high level of association across the different risk factors that characterize different individuals. More generally, exposome studies will require a new type of research infrastructure to support the expanded exposure assessment activities required by meaningful analysis of the exposome. These infrastructures will have to combine the capacity to collect, store and make available data on the genome and the exposome on an extended basis for large population samples over time. At the same time, they should also be the main repository of the analytical, bioinformatic and statistical tools needed to process, integrate and analyze high-dimensional data on individual genetic and exposure characteristics over time.

Exposome observational studies include a new generation of mega-cohort research projects, such as the UK Biobank or similar proposed US and Asian cohorts. In these studies, large samples of individuals are followed over time to investigate the impact of genetic variations, exposures to different environments and environmental changes, lifestyles, general health conditions, diseases and deaths. The sizes of these samples require a complex and as yet untested mixture of methodologies, which are largely still being developed and, at the moment, amount to no more than a mixture of traditional methodologies. These include, for example, different variants of multiple regression, “retrofitted” to adapt to large longitudinal samples with a high degree of association across most potentially explanatory variables. At the same time, the mega-cohort studies appear to require substantial investment, not only for direct data collection, but also to store and make available detailed exposome data in biobanks, beyond their present capacity to collect, preserve and provide access to biosamples. For example, UK Biobank will recruit half a million people at a cost of around £60 million ($110 million) in the initial phase. The proposal to establish a “Last Cohort” of 1 million people in the United States or a similar-sized Asian cohort (Wild et al., 2005) would presumably exceed this sum. These costs have to be considered in the context of a new wave of investment in sharing and open-data research infrastructures, such as biobanks and biological resource centers (OECD, 2008).

Traditional methods of assessing the environmental impact on human health are generally based on multiple regression models, where the effects of a few critical variables on health indicators are explored with multivariate models of statistical correlation. Estimates generally seek to attribute specific health effects (from increased morbidity to anticipated mortality) to exposure to one or more pollutants, with the implicit assumption that these exposures are not correlated over time and that no joint effect is produced by simultaneous exposure of the same subject. Recent studies have tried to apply the exposome research paradigm by investigating the human health response to multiple exposures and their interactions, with a variety of methods chosen on an ad hoc basis to deal with the high dimensionality of these data and the complexity of the statistical inference paths that could plausibly fit them. These methods include mixture analysis integrating the selection, shrinkage and grouping of correlated variables (e.g. LASSO, elastic net, adaptive elastic net), dimension reduction techniques (e.g. principal component analysis, partial least squares analysis) and Bayesian approaches such as Bayesian model averaging (BMA) and Bayesian kernel machine regression (BKMR) (Stafoggia et al., 2017; Lazarevic et al., 2019). Because of the exploratory nature of exposome studies, these methods combine statistical techniques with new software capabilities and are typically not framed by a theoretical model. They are only used to reduce the dimensionality of the phenomenon (the exposome as a set of multiple exposures). Consequently, they suffer from unstable model selection (shrinkage methods), poor interpretability of the latent variables (dimension reduction) and computational inefficiency (Bayesian models). In addition, they are rarely applied to large (>100 variables) and heterogeneous exposome data (omics, categorical and continuous variables).
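The selection instability of shrinkage methods under correlated exposures, and the grouping behaviour of the elastic net, can be seen in a deliberately tiny sketch. It is a plain-Python coordinate-descent solver written for illustration only, not the implementation used in the cited studies; the data are made up, with two perfectly correlated exposures (`x1`, `x2`) and one unrelated exposure (`x3`), and the penalty parametrization (`lam`, `l1_ratio`) is an assumption chosen to mirror common software conventions.

```python
from typing import List


def soft_threshold(r: float, t: float) -> float:
    """Shrink r toward zero by t; values within [-t, t] become exactly zero."""
    if r > t:
        return r - t
    if r < -t:
        return r + t
    return 0.0


def elastic_net(X: List[List[float]], y: List[float],
                lam: float, l1_ratio: float, n_sweeps: int = 200) -> List[float]:
    """Coordinate descent for
    (1/2n)||y - Xb||^2 + lam*l1_ratio*||b||_1 + 0.5*lam*(1-l1_ratio)*||b||^2.
    l1_ratio=1.0 gives the LASSO. No intercept: columns are assumed centred."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # mean square of feature j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam * l1_ratio) / (z + lam * (1 - l1_ratio))
    return beta


# Made-up data: x1 and x2 are identical exposures, x3 is unrelated to the outcome.
x1 = [-1.0, 1.0, -1.0, 1.0]
x2 = list(x1)
x3 = [-1.0, -1.0, 1.0, 1.0]
X = [list(row) for row in zip(x1, x2, x3)]
y = [a + b for a, b in zip(x1, x2)]  # outcome depends equally on x1 and x2

lasso = elastic_net(X, y, lam=0.5, l1_ratio=1.0)
enet = elastic_net(X, y, lam=0.5, l1_ratio=0.5)
print("LASSO coefficients:      ", lasso)
print("Elastic-net coefficients:", enet)
```

In this toy case the LASSO assigns the whole effect to whichever of the two identical exposures is updated first and sets the other to zero, so the "selected" exposure depends on column order. The elastic net spreads the effect equally over the correlated pair. Both drop the unrelated exposure.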

A further challenge of exposome-based approaches concerns data integration. This includes matching exposome and genome data from various sources and integrating different data flows over space and time to account for multiple exposures. While the ideal way of collecting data would be through longitudinal studies (prospective or retrospective) of lifetime exposures combined with dynamic genomic data, in practice this type of collection still lacks strong theoretical foundations and is being developed only slowly, since it is very difficult and costly to implement. However, even though the data produced with traditional techniques are not ideally suited to the study of multiple exposures, primary data from cohort studies can be combined with secondary data from environmental monitoring systems. In this respect the key integrating principle is that exposure to a single environmental insult (e.g., a high level of PM 2.5) is not in itself an event, but part of a multidimensional event that includes multiple exposures at a particular moment and their accumulated effects over time. An individual may thus appear to exhibit a response to a single exposure, but this response can only be understood in the context of a combination of concurrent exposures and is conditioned by the accumulation of all her past exposures (external exposome) and responses (internal exposome). Moreover, this concept of accumulating and concurring exposures over time is readily applicable to the environment itself. For example, an eco-exposome can be defined as the cumulative result of all exposures of an ecosystem to both natural and human-induced environmental changes over time.
Similarly, an urban exposome can be conceived as the qualitative and quantitative assessment of environment and health indicators that describe the framing and evolution of urban health and its interactions with urban infrastructure, climate, and small area (neighborhood) features (Andrianou and Makris, 2018).
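The combination of primary cohort data with secondary environmental monitoring data mentioned above can be sketched as a simple linkage by place and time. Everything in the example is hypothetical: the area codes, the years, the pollutant levels and the function name `link_exposures` are invented for illustration, and real linkage would need finer spatial and temporal resolution as well as a strategy for missing monitoring data.

```python
from typing import Dict, List, Tuple

# Hypothetical secondary data: annual mean levels per (area, year).
monitoring: Dict[Tuple[str, int], Dict[str, float]] = {
    ("A", 2000): {"pm25": 12.0, "no2": 20.0},
    ("A", 2001): {"pm25": 14.0, "no2": 22.0},
    ("B", 2001): {"pm25": 30.0, "no2": 40.0},
    ("B", 2002): {"pm25": 28.0, "no2": 38.0},
}

# Hypothetical primary data: (subject, area, first year, last year inclusive).
residence: List[Tuple[str, str, int, int]] = [
    ("s1", "A", 2000, 2000),
    ("s1", "B", 2001, 2002),
    ("s2", "A", 2000, 2002),
]


def link_exposures(residence, monitoring):
    """Accumulate every monitored pollutant over each subject's residence
    history, and count the person-years with no monitoring record."""
    cumulative: Dict[str, Dict[str, float]] = {}
    missing: Dict[str, int] = {}
    for subject, area, first, last in residence:
        totals = cumulative.setdefault(subject, {})
        for year in range(first, last + 1):
            record = monitoring.get((area, year))
            if record is None:
                missing[subject] = missing.get(subject, 0) + 1
                continue
            # All concurrent exposures in that place and year are added together
            # to the subject's running totals, one total per pollutant.
            for pollutant, level in record.items():
                totals[pollutant] = totals.get(pollutant, 0.0) + level
    return cumulative, missing


cumulative, missing = link_exposures(residence, monitoring)
print(cumulative)
print(missing)
```

Keeping the count of unmatched person-years alongside the totals is a design choice of this sketch: it makes gaps in the monitoring record visible instead of silently treating them as zero exposure.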

More generally, the concept of the exposome suggests a different way for national and international institutions to consider statistical evidence and organize routine data collection. At the moment, statistical data are gathered through traditional means (e.g., surveys based on questionnaires and baseline or periodic samples), administrative sources (including medical and biographical information) and, increasingly, direct observation through remote sensing equipment. The great heterogeneity resulting from all these different methods is not significantly reduced by standardized classifications, because integrating methodologies and procedures are lacking. Unlike the approach of traditional environmental epidemiology, which focuses on final outcomes such as mortality and morbidity, the exposome notion suggests that data gathering should pay special attention to the intermediate and more subtle effects related to environmental exposures, for example through internal biomarkers of exposure and response, using information increasingly available from biobanks and the application of omic technologies (Vineis et al., 2020). The main idea in this regard is that data collection should not only aim to document the accumulation of exposures over a lifetime (as expressed by the motto “from biography to biology”), but should also attempt to capture the sequence of intermediate molecular changes that characterize the process by which health outcomes emerge under the pressure of the environment.

Causal inference models would also be deeply affected by the new approach to causal attribution suggested by the exposome paradigm. Documenting the intermediate steps of this process would allow better causal attribution, even within the traditional inference model, for example by using a combination of external variables (such as different sources of pollution) and omic technologies. A new generation of simulation models based on simultaneous exposures and responses is needed to disentangle the exposome through a more general pattern of causal analysis than the one used in typical observational studies and in randomized controlled trials. These models should not be designed to predict outcomes of endogenous variables in response to exogenous changes, but “… to systematically explore possible counterfactual scenarios, grounded in thought experiments – what might happen if determinants of outcomes are changed” (Heckman and Pinto, 2022). New models of economic and social epidemiology should also be used to address the most challenging task facing policy analysts: “Forecasting the impacts of interventions (constructing counterfactual states associated with interventions) never previously implemented to various environments, including their impacts in terms of well-being” (Heckman and Pinto, 2022).

The name “exposome” suggests a natural association with “exposomics”, a term evoking the recent development of “omics” technologies (http://omics.org/), which aim to develop high-throughput analysis of a set of molecules. Omic technologies (OT), of which the first to appear was “genomics” (as opposed to genetics), make it possible to combine two very important properties of modern data analysis: interrogation and intervention. OT are increasingly capable of exploiting these two properties through (a) data recovery and processing from a variety of sources and through a plurality of computing facilities, and (b) automated analysis and selection of the production and application options available through modeling. The two main features that distinguish competing options in the data collection space are their automated data preparation via search and processing algorithms, and their use of artificial intelligence and deep learning to mine and manage the available information. A common data model makes analytics more agile because it is designed without being constrained by the individual data models and business definitions of a particular tool. Valid unified OT platforms for life sciences are also intimately linked to their advantages as part of the process of industry integration through horizontal infrastructure, interdisciplinary communication and analysis, and their agility and self-service capabilities. OT are themselves natural vehicles of high-throughput and instant communication, as shown by the unprecedentedly fast development of mRNA vaccines during the COVID-19 pandemic. Their main basis is data integration through array technologies, which produce gridded data complemented by a wealth of associated metadata, such as information on space and time, as well as on anomalies and other distinctive features.
Combined with deep learning and other automated information algorithms, array technologies have spread to a variety of applications, including the analysis of diseases. For example, “volatolomics”, a high-throughput, array-based technique that studies the relation between volatile compounds and molecular patterns, has been successfully used to discriminate COVID-19 patients via breath analysis (Mougang et al., 2021).