Launching a materials informatics initiative for industrial applications in materials science, chemistry, and engineering

Jeffrey M. Ting; Corinne E. Lipscomb

doi:10.1515/pac-2022-0101

Published/Copyright: April 1, 2022

From the journal Pure and Applied Chemistry Volume 94 Issue 6

Abstract

The advent of materials informatics (MI) with emerging global trends in digitalization, artificial intelligence, and automation has led to promising opportunities for transforming traditional scientific research workflows. However, new MI efforts rely critically on the establishment, management, and accessibility of high-quality thermophysical and chemical data, either by mining existing databases, labelling historical data in archives, or generating sufficient data sets as prerequisites to the creation of predictive machine learning models. For ambitious MI-driven projects, amassing systematic data can be a time-intensive and prohibitively costly endeavor in spaces where data is uncurated or scarce. Here, we describe a MI initiative that started in the 3M Corporate Research Laboratories (CRL), highlighting how we strategically applied MI tools and data-driven methodologies for industrial materials research and product development workflows. Robust web applications and cloud infrastructure were developed to structure, standardize, and aggregate materials data for specific CRL projects. This integrated approach leverages the diverse skills and deep technical expertise of subject-matter experts at 3M to build the foundations for MI through systematic data management in materials research and, ultimately, to advance core technology platforms with innovative, customer-driven product solutions. Key elements that have contributed to the ongoing implementation of this highly versatile MI program, as well as challenges encountered, are presented as lessons learned for the broader MI and cheminformatics communities.

Keywords: Cheminformatics; machine learning; materials informatics; polymer science

Introduction

Advances in materials informatics (MI) for gaining new insights in research and development (R&D) rely on collecting and preparing high-quality data. This often involves more than reformatting data in spreadsheets or correcting missing measurements—data needs to be sufficiently labelled to contextualize important features that are used in decision-making tasks for machine learning (ML) algorithms. Unfortunately, this is often (i) time-intensive, requiring manual support and upfront effort in cleaning large datasets, and (ii) difficult to execute because of the complexity of data types and inherent differences in how individual researchers and teams manage, visualize, and analyze their data for different materials applications [1], [2], [3]. In industrial R&D labs, MI has been proposed to augment and enhance manufacturability, scalability, and materials selection across different markets [4]. However, few perspectives on spearheading such endeavors have been openly discussed.

The purpose of this commentary is to showcase how a MI initiative was launched in the 3M Corporate Research Laboratories (CRL). We outline how our team has approached building a well-architected MI platform for our industrial R&D community as end users. Over the past three years, this effort has grown in breadth by making materials development data machine-actionable in a centralized software and cloud infrastructure. The 3M CRL MI Platform connects chemistries, processing variables, and characterization/testing and, ultimately, empowers industrial researchers to store their work in a searchable framework and build structure-property-function relationships in complex design state spaces through automated visualization and ML analyses. We attribute the progress of this initiative to (1) investments made in bringing together skilled people from science, engineering, software, and data science backgrounds, and (2) persistence and direct engagement from company leadership and 3M’s technical R&D community. Lessons learned from these factors are briefly highlighted, and some future challenges are presented. Extensive reviews of the ML landscape remain outside the scope of this commentary and can be found elsewhere [1], [2], [3], [4], [5], [6].

Implementing a materials informatics program at 3M

With a long history of creating scientific breakthroughs in products and inventions to address global challenges [7], 3M turns scientific, technical, and marketing innovations into over $32B in sales in ∼200 countries and >60,000 products across four major business groups: safety & industrial, transportation & electronics, consumer, and health care. The size of the company and the range of materials that underpin 3M technologies present considerable practical hurdles to initiating a MI program that can accommodate diverse workflows, from the bench in CRL to pilot lines outside of CRL, while providing programmatic access to different types of data. In the following section, we describe how the platform was developed for CRL.

Background and initial team building

The 3M MI initiative was initially proposed by a combination of technical employees and management across CRL. Acknowledging that the world of research was changing and could be accelerated by digital capabilities, the initiative was prioritized, scoped, and formally launched and resourced shortly thereafter. Initial resourcing brought together software and data science experts with subject matter experts from adhesive and abrasive backgrounds. Only a few months into the project, it was determined that ML could be used to help many technologies and product development programs. However, doing so at scale across 3M’s many diverse product lines required significant data infrastructure and user interface development. This process can take longer than anticipated, but early adopters and influencers within an organization can keep initial momentum going. Over time, the team has prototyped and refined a system that is flexible enough to capture many workflows while being structured enough for ML to become a new step in these workflows.

The system today has been beta-tested across several materials domains. Its capabilities include capturing data, whether entered through the interactive user interface (UI) or through automated ingestion of analytical data, and analyzing data via ML model training. In addition to developing this system, the MI team has set out to showcase the value of sequential learning in materials science internally through lighthouse projects. Sequential learning here means building ML models from limited data, then iteratively generating more data with guidance from the model coupled with subject matter expertise. The chosen projects exemplify two types: (1) entirely new programs where scientific simulation and ML combined with experimental data can enable sequential learning to explore design space as quickly as possible and identify the type or area of chemistry necessary for the project to move forward, and (2) optimization projects where ML can be used on experimental data to speed the parsing of nuanced details to optimize performance and finalize formulations for a given product.
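As an illustration only, the sequential learning loop described above can be sketched in a few lines of Python. Everything in this sketch is an assumption made for the example and not part of the 3M CRL MI Platform: the function `run_experiment` is a hypothetical stand-in for a lab measurement, the surrogate model is a simple inverse-distance-weighted interpolation chosen so the example needs no external libraries, and the `explore` weight is an arbitrary value that rewards candidates far from existing data.

```python
from typing import List, Tuple


def run_experiment(x: float) -> float:
    """Hypothetical stand-in for a lab measurement (best result near x = 0.7)."""
    return -(x - 0.7) ** 2


def predict(x: float, data: List[Tuple[float, float]]) -> float:
    """Inverse-distance-weighted estimate from the measured (x, y) pairs."""
    num = 0.0
    den = 0.0
    for xi, yi in data:
        d = abs(x - xi)
        if d == 0.0:
            return yi  # exact match with a measured point
        w = 1.0 / d ** 2
        num += w * yi
        den += w
    return num / den


def novelty(x: float, data: List[Tuple[float, float]]) -> float:
    """Distance to the nearest measured point, used as an exploration bonus."""
    return min(abs(x - xi) for xi, _ in data)


def sequential_learning(candidates: List[float], initial: List[float],
                        rounds: int, explore: float = 0.5) -> List[Tuple[float, float]]:
    """Start from limited data, then let the model suggest each next experiment."""
    data = [(x, run_experiment(x)) for x in initial]
    for _ in range(rounds):
        measured = {x for x, _ in data}
        pool = [x for x in candidates if x not in measured]
        if not pool:
            break
        # Pick the candidate with the best predicted value plus exploration bonus.
        nxt = max(pool, key=lambda x: predict(x, data) + explore * novelty(x, data))
        data.append((nxt, run_experiment(nxt)))
    return data


candidates = [i / 20 for i in range(21)]
history = sequential_learning(candidates, initial=[0.0, 1.0], rounds=6)
best_x, best_y = max(history, key=lambda p: p[1])
print(f"best candidate so far: x={best_x:.2f}, y={best_y:.4f}")
```

In practice, the selection step would also be reviewed by subject matter experts, and the surrogate would typically be a trained ML model with an uncertainty estimate instead of the interpolation used here.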

3M CRL MI Platform development

The overarching goal of the 3M CRL MI Platform is to provide researchers the ability to save, organize, structure, and share their material and product data in an intuitive, flexible, and user-friendly manner. An important early goal of the platform was to present R&D scientists with a no-code UI that is intuitive and not overly burdensome for data entry, visualization, and analysis. Thus, the first major milestone was the creation of an integrated digital platform that reduces the challenge of saving, connecting, and sharing 3M materials data. For cloud computing, Amazon Web Services (AWS) was chosen to host this platform, which is specifically designed for science and engineering data. On the front end of this AWS-based system, formulation data can be structured with best practices for teamwork, collaboration, and security, where the UI displays the contributor(s) that have modified entries and allows material subgroups to be filtered by sample names, contributor(s), dates, keywords, and other metrics. Another high-value component of the UI for researchers was the automated ingestion and processing of analytical data to the cloud via extract, transform, and load (ETL) pipelines that data engineers created for different R&D workflows. ETL processes have been developed for instruments that perform routine materials characterization, e.g., NMR spectroscopy and mechanical testing. The assignment of a universally unique identifier (UUID) links information across types for a single material sample. This allows materials, processes, properties, and performance to be linked.
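To make the UUID-based linking concrete, the following is a minimal sketch under stated assumptions. The `Sample` class, the `make_record` and `records_for` helpers, the record layout, and all field names and values are hypothetical and chosen for illustration; the source does not describe the platform's actual schema.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """A single material sample identified by a UUID (hypothetical schema)."""
    name: str
    sample_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def make_record(sample: Sample, kind: str, data: Dict) -> Dict:
    """Tag a material, process, or property record with the sample's UUID."""
    return {"sample_id": sample.sample_id, "kind": kind, "data": data}


def records_for(sample: Sample, store: List[Dict]) -> List[Dict]:
    """Return every record in the store that belongs to the given sample."""
    return [r for r in store if r["sample_id"] == sample.sample_id]


# Illustrative values only; none of these come from the 3M platform.
sample_a = Sample("adhesive-A")
sample_b = Sample("adhesive-B")
store = [
    make_record(sample_a, "material", {"monomer": "2-ethylhexyl acrylate"}),
    make_record(sample_a, "process", {"coating_speed_m_per_min": 5.0}),
    make_record(sample_a, "property", {"peel_force_N_per_dm": 42.0}),
    make_record(sample_b, "property", {"peel_force_N_per_dm": 37.5}),
]

# One lookup by UUID gathers the material, process, and property records together.
for record in records_for(sample_a, store):
    print(record["kind"], record["data"])
```

Because every record carries the same identifier, an ETL pipeline that ingests instrument output only needs to know the sample's UUID to attach new measurements to the right material.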

Figure 1 illustrates the main concept of the 3M CRL MI Platform’s architecture, consisting of inputs/outputs for data library and sample set management for the front-end UI. For library management, data from various sources are ingested through ETL pipelines and organized into classes. Material entries represent chemicals or other ingredients in formulations alongside relevant metadata that can be specified by users. We note that we rely on users to communicate what features and properties are important to capture as metadata for their workflows. Process parameters include information from unit operations like reactions, mixing, separations, coating, etc. in routine methods. Lastly, test properties consist of results taken from analytical instruments, such as measurements ranging from polymer characterization via size-exclusion chromatography and ¹H NMR to bulk peel tests of adhesive formulations. For sample set management, the platform’s main UI allows researchers to (i) search for UUID-assigned unique materials, processes, and properties, (ii) add elements to the UI that accompanies a spreadsheet to input tabular data, and (iii) connect elements together to enable materials search and data models. This organizes data in an intuitive manner that is accessible and secure for materials researchers. Users can then use completed datasets to share selected data with colleagues and generate new insight via ML models, ranging from simple linear and nonlinear regressions to advanced multivariable/multivariate regression or neural networks.

Fig. 1:

Schematic of how the 3M CRL MI Platform supports data management. The overarching goal of the data infrastructure is to enable data organization, access, and security for cross-team collaboration and end-use machine learning applications.

Lessons learned

While the initial version of the system was created to curate and store data involving materials for certain polymeric products, the 3M CRL MI Platform continues to expand and evolve as new materials domains are added and more sophisticated ML tools and pipelines are developed. This section is anecdotal and describes a set of recommendations. We do not provide a comprehensive list of all challenges that industrial labs may encounter; rather, these themes bring special attention to common needs for starting a new MI program in a large organization.

Invest in talent at the interface of data science and materials science

The foundation of building MI-driven initiatives that can accelerate materials development critically depends on assembling talented people from diverse technical backgrounds across materials science, data science, and software development. A large fraction of our growing MI team consists of Ph.D. researchers from materials science backgrounds in polymer science, engineering, and chemistry with direct experience with ML, high-throughput experimentation, and computational chemistry. These individuals bridge existing CRL projects to the 3M CRL MI Platform by identifying practical outcomes that can accelerate productivity and innovation. Often this involves some combination of data ingestion/storage, access/sharing, and visualization/analysis needs for managing historic data or for planning new project directions.

As data science tools and ML capabilities are becoming more prominent, new pedagogical challenges have emerged on how to best teach this knowledge to the broader industrial workforce. In general, it is important to demonstrate value to overcome resistance to changes in existing industrial workflows. In our company’s technical community, we are addressing this challenge on several fronts. Company-wide workshops and bootcamp-like events have been held to teach data science skills to early-, mid-, and late-career employees, focusing on topics such as mathematics and statistics, programming (e.g., Python or R), data processing, and ML algorithms. Additionally, MI team members established a MI Chapter of 3M’s Tech Forum, a grassroots organization whose goal is to foster a global network of 3M scientists and engineers and to encourage idea sharing, problem-solving, and cross-team collaborations across business groups and divisions. The MI Chapter virtually hosts internal educational seminars and external seminars with academic speakers. Open dialogues about what works well and what pain points exist can help align goals, articulate expectations, and foster buy-in of MI tools and capabilities.

Engage with leadership and technical communities for support

For industrial adoption of MI, one of the most frequent questions our group is asked involves outlining what led to top-down support for initiating a MI program. Clear identification of strategic objectives across industry and business operations, such as reducing a product’s development cycle from lab-scale R&D to market deployment using MI tools [4], can be an important driver as a value proposition for building an inclusive coalition of leadership support. We recommend scoping pilot projects involving regression or classification problems that simple ML can readily address, i.e., material performance optimization or physical property categorization, as case studies. For instance, with sufficiently labelled training data comprising a limited number of features for polymeric materials, project owners can create strategies to resemble reported works for physical, thermal, and mechanical property prediction [5]. At a high level, tailoring materials performance for project needs can be achieved by mining historical datasets or developing advanced ML algorithms to accelerate the pace of materials discovery, scale up, and product deployment. Each success over time gradually builds a strong framework for business leaders to deliver a digital transformation vision to stakeholders.

Simultaneously, the user experience design of end-use MI platforms and software must be tailored for technical researchers. This engagement is critical for implementing a MI initiative. The 3M CRL MI Platform has continued to evolve with routine feedback from a small group of super-users with distinct materials domain expertise, closely collaborating with data scientists and engineers. An initial focus was placed on 3M’s Adhesives Technology Platform—this technology platform covers a diverse range of products comprising curable adhesives, epoxies, and pressure-sensitive adhesives that are found in customer applications from airplanes and automobiles to electronics and medical dressings. Thus, materials, processes, and properties that were pertinent to this rich class of materials drove the development of the initial user experience for adhesive researchers. The team’s management approach uses agile principles [8] that are iterative and flexible in nature, allowing our team to quickly respond to interdependencies and user requests. This consistent rhythm of prototyping new features and data management tools enables diverse arrays of datasets to be aggregated for more efficient analysis and faster collaboration, which can be replicated at all levels of industrial R&D.

Opportunities ahead for MI in industry

We conclude by identifying unresolved challenges that need to be addressed for making the next leap forward in MI-driven industrial applications. Standardization of materials data in a machine-readable manner is key to transforming nascent MI initiatives into well-integrated programs. For data-driven materials discovery initiatives, the input data for small molecules and common chemicals can be ingested from existing resources such as PubChem, a National Institutes of Health open chemistry database [9], using a combination of identifiers like a SMILES string [10] or InChI key [11]. However, the chemical representation of more complex macromolecules (e.g., synthetic polymers that are often distributions rather than single entities) cannot be as easily conveyed with simple line notations, requiring organizations to simplify and create unique specification systems [5]. BigSMILES, a system extending from SMILES notation [12], helps with some of these challenges, but complete tools for full integration of more complex, realistic materials used in industry are still lacking. For such materials, the output data from experimental or simulation runs also require standardization decisions on file formats for measurement methods. In some cases, the context of a phenomenological property such as the glass transition is almost as important as the measurement value itself for ML algorithms. While we have made decisions on input/output standardization internally for the 3M CRL MI Platform, more work on developing best practices to increase flexibility and generalizability across material platforms is needed.
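As a small, hedged example of what machine-readable standardization can look like for small molecules, the sketch below stores a SMILES string and an InChIKey on a material entry and checks the InChIKey's fixed layout (14 letters, a hyphen, 10 letters, a hyphen, and 1 letter) before ingestion. The `MaterialEntry` class and `validate_entry` function are hypothetical names introduced for this illustration; they do not describe the 3M CRL MI Platform or any PubChem API, and the check covers format only, not chemical correctness.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Layout of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


@dataclass(frozen=True)
class MaterialEntry:
    """Hypothetical small-molecule entry keyed by line-notation identifiers."""
    name: str
    smiles: Optional[str] = None
    inchikey: Optional[str] = None


def validate_entry(entry: MaterialEntry) -> List[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    if not entry.smiles and not entry.inchikey:
        problems.append("entry needs at least one identifier (SMILES or InChIKey)")
    if entry.inchikey and not INCHIKEY_PATTERN.match(entry.inchikey):
        problems.append("InChIKey does not match the 14-10-1 letter layout")
    return problems


ethanol = MaterialEntry("ethanol", smiles="CCO",
                        inchikey="LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
malformed = MaterialEntry("mystery solvent", inchikey="not-a-key")

for entry in (ethanol, malformed):
    print(entry.name, validate_entry(entry))
```

A comparable check for polymers would need a richer representation, which is the gap the paragraph above points to.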

This leads into another important challenge that we recognize as MI in industry continues to expand: the need for reliable interoperability between internal and external organizations for chemistry data exchange. At 3M, one of the most challenging aspects of onboarding new teams to the platform involves not only understanding how domain knowledge is organized and stored, but also representing researchers’ workflows as connected information at the machine level. The flexible sample set management system can be specialized to accommodate diverse industrial research areas that align with CRL business goals, but this requires close teamwork between subject matter experts and the MI team. Meanwhile, for external collaborations with the MI community, data pipelines need to be constructed to reconcile differences in data architecture. Consistent terminology, model dictionaries, and pre-defined schemas for data processing among academic and industrial stakeholders can greatly accelerate this foundational step. Although 3M data will remain proprietary, we are pursuing interoperability by standardizing our ontology and schemas so that data can be easily curated and shared with external collaborators where it makes sense, including university partners, start-ups, and joint ventures. In our view, this approach will likely emerge as the model for how industry addresses interoperability in the MI community. Within the 3M CRL ecosystem, elements of the FAIR guiding principles [13] have been used to clearly define data and metadata. However, for industry in general this will be very project-dependent, where sensitive data and metadata need to be protected internally. Global organizations like the International Union of Pure and Applied Chemistry (IUPAC) are poised to address this challenge with leading projects such as developing InChI extensions for macromolecules, providing direction that industry can broadly adopt and follow as it makes sense with how the fields of cheminformatics and materials informatics evolve [14].

Finally, for any MI initiative in industry, agility and scalability are major hurdles to overcome. There are many flavors of MI that a company might be interested in (e.g., data management + ML, ML only, etc.). The agile framework [8] that our team adopts enables flexibility and adaptability for deploying MI-driven solutions. Current applicability of ML is not universal and depends on the problem at hand. At a recent AWS event, 3M’s senior vice president of digital transformation, Shaun Braun, presented a high-level overview of the company’s digital transformation progress, highlighting how our growing digital ecosystem has enabled key advances in 3M manufacturing, clean air technologies, and healthcare IT solutions at a global scale [15]. These specific real-world examples in our company portfolio showcase how the company leverages the foundational knowledge of thousands of 3M scientists and combines data science with materials science to design and deliver differentiated product solutions. Such examples are built on foundational investments: over time, the groundwork for successful MI adoption involves having a rigorous data management system, dispersed resources working on the same or related programs, and in-house IT resources available for support.

Altogether, starting a company-wide MI effort has been an exciting journey in the CRL. We are at a pivotal inflection point in how industrial R&D is done. Continued conversations on ideas and best practices for improving MI can greatly reinforce investments made at the interface of data/materials science in industry. By ushering in a flexible materials and chemical platform that can manage diverse sets of experimental and simulation data (and importantly, associated metadata) in a user-friendly web interface, there are an increasing number of opportunities to partner with 3M’s businesses to drive the development of purposeful products in new ways.

Article note:

A collection of invited papers on Cheminformatics: Data and Standards.

Corresponding author: Corinne E. Lipscomb, 3M Company, 3M Center, Maplewood 55144, MN, USA, e-mail: celipscomb@mmm.com

References

[1] J. J. de Pablo, N. E. Jackson, M. A. Webb, L.-Q. Chen, J. E. Moore, D. Morgan, R. Jacobs, T. Pollock, D. G. Schlom, E. S. Toberer, J. Analytis, I. Dabo, D. M. DeLongchamp, G. A. Fiete, G. M. Grason, G. Hautier, Y. Mo, K. Rajan, E. J. Reed, E. Rodriguez, V. Stevanovic, J. Suntivich, K. Thornton, J.-C. Zhao. Npj Comput. Mater. 5, 14 (2019), https://doi.org/10.1038/s41524-019-0173-4.

[2] R. Batra, L. Song, R. Ramprasad. Nat. Rev. Mater. 6, 655 (2021), https://doi.org/10.1038/s41578-020-00255-y.

[3] J. M. Cole. Acc. Chem. Res. 53, 599 (2020), https://doi.org/10.1021/acs.accounts.9b00470.

[4] B. Meredig. Curr. Opin. Solid State Mater. Sci. 21, 159 (2017), https://doi.org/10.1016/j.cossms.2017.01.003.

[5] D. J. Audus, J. J. de Pablo. ACS Macro Lett. 6, 1078 (2017), https://doi.org/10.1021/acsmacrolett.7b00228.

[6] W. Sha, Y. Li, S. Tang, J. Tian, Y. Zhao, Y. Guo, W. Zhang, X. Zhang, S. Lu, Y.-C. Cao, S. Cheng. InfoMat 3, 353 (2021), https://doi.org/10.1002/inf2.12167.

[7] E. von Hippel, S. Thomke, M. Sonnack. Harv. Bus. Rev. 5, 3 (1999).

[8] K. Schwaber, M. Beedle. Agile Software Development with Scrum, Prentice Hall, Upper Saddle River (2001).

[9] S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B. A. Shoemaker, P. A. Thiessen, B. Yu, L. Zaslavsky, J. Zhang, E. E. Bolton. Nucleic Acids Res. 49, D1388 (2021), https://doi.org/10.1093/nar/gkaa971.

[10] D. Weininger. J. Chem. Inf. Comput. Sci. 28, 31 (1988), https://doi.org/10.1021/ci00057a005.

[11] S. R. Heller, A. McNaught, I. Pletnev, S. Stein, D. Tchekhovskoi. J. Cheminf. 7, 1 (2015), https://doi.org/10.1186/s13321-015-0068-4.

[12] T.-S. Lin, C. W. Coley, H. Mochigase, H. K. Beech, W. Wang, Z. Wang, E. Woods, S. L. Craig, J. A. Johnson, J. A. Kalow, K. F. Jensen, B. D. Olsen. ACS Cent. Sci. 5, 1523 (2019), https://doi.org/10.1021/acscentsci.9b00476.

[13] M. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-Beltran, A. J. G. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. C. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, B. Mons. Sci. Data 3, 160018 (2016), https://doi.org/10.1038/sdata.2016.18.

[14] L. R. McEwen. Chem. Int. 39(2), 6 (2017), https://doi.org/10.1515/ci-2017-0205.

[15] AWS re:Invent 2021 - 3M Drives Digital Transformation with AWS, https://www.youtube.com/watch?v=NGixN9rCQy4 (accessed Jan 1, 2022).

Published Online: 2022-04-01

Published in Print: 2022-06-27

© 2022 IUPAC & De Gruyter. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For more information, please visit: http://creativecommons.org/licenses/by-nc-nd/4.0/
