Analysing a billion reactions with the RInChI

Jonathan M. Goodman; Gerd Blanke; Hans Kraut

doi:10.1515/pac-2021-2008

Published/Copyright: 5 May 2022

From the journal Pure and Applied Chemistry, Volume 94, Issue 6

Abstract

The RInChI is a canonical identifier for reactions which is widely used in reaction databases. It can be used to handle large collections of reactions and to link information from diverse data sources. How much information can it handle? Studies of the SAVI database, which contains more than a billion reactions, demonstrate that the RInChI is useful in analysing such a large collection of molecular data, and the reduced form of the Web-RInChIKey contains enough information to be an effective differentiator of reactions. Issues of NH tautomerism and stereochemistry are handled effectively. The RInChI illustrates that some of the properties of the algorithmically-generated SAVI database differ from SPRESI, which is a collection of experimental data. The RInChI has different properties to Reaction SMILES and both approaches provide useful and distinct information. We recommend that the RInChI be included in data models for reactions.

Keywords: Cheminformatics; InChI; integrating reaction data; reaction databases; RInChI; SAVI; SPRESI

Introduction

This is an exciting time in the computational analysis of reactions. The number of reactions which can be studied in detail through computational mechanistic analysis is growing fast [1], [2], [3] and it is now possible to look at very large collections of reaction data and draw general conclusions from them. The dramatic developments in synthesis planning programs over the last few years are consequences of this [4].

What are the biggest challenges in this area? Developing the right algorithms, understanding stereochemistry and producing ever more useful tools are all major opportunities. In this paper, we take a step back from this frontier. All of these new processes depend on the reaction data used to build them, and they are all limited by the quantity and the quality of the data available. A great deal of data is available.

SPRESI [5], Reaxys [6] and CAS [7] are very big, very high-quality databases which have been assembled from the chemical literature over the last century. The Pistachio reaction database [8] has been generated from patent data. Chemistry, however, is very complicated. These resources have data for tens of millions of reactions. Does this cover most possible chemistry, or 10% of it, or 1% of it? Perhaps it is even less than this. Expanding reaction data availability and linking different databases is critical both to answer this question and to develop a more comprehensive understanding of molecular reactivity.

In this paper, we look at how we can collect and integrate reaction data, using the RInChI, the Reaction-InChI [9, 10], which builds on the IUPAC International Identifier for Molecules, the InChI [11, 12]. The RInChI can describe all molecule-based reactions: provided that the starting materials or products can be written as InChI, it is possible to construct a RInChI.

InChI version 1.06 [12] has just been released, fixing a few security issues and correcting a small number of issues with unusual molecules. It has been tested on the hundred million molecules in PubChem and found to be more than 99.99% accurate. There are a few problematic structures which are still being investigated for the next version. The latest release of the InChI gives the same identifier as the previous releases for nearly everything, and it is extraordinarily reliable.

The InChI works for molecules. The RInChI, the reaction-InChI, is designed to do the same for reactions [9, 10]. It is now possible to take any reaction and, using our software or performing the operations by hand, generate a unique identifier, the RInChI. The software can be downloaded from the InChI Trust website: https://www.inchi-trust.org/downloads/. The process gives confidence that anyone else in the world, or any robot or AI anywhere in the world, looking at the same reaction would generate the same identifier.

Fig. 1 shows how the Reaction-InChI fits with the many developments of the InChI which are on-going. The centre of the target is the core InChI for representing molecules. Close around it are the areas for which this core InChI is undergoing the most intensive development. The InChI is already very good at handling stereochemistry and tautomers. It is reasonably good at labelling organometallics, but this is an area for which there are possibilities of improving the identifier even further.

Fig. 1:

The reaction-InChI in the InChI ecosystem.

The InChIKey makes it possible to compress the InChI into a fixed-length string, which is database and search-engine friendly. Built around these central resources are many other applications, including the RInChI, a canonical identifier for reactions. More information about these other projects is available from the InChI Trust [13]. We are currently developing the RInChI further to include more auxiliary information, to cover reactions with unexpected results, atom mapping and extended stereochemistry. Any upgrades to the core InChI just drop into the RInChI and are incorporated automatically. The RInChI uses a layer structure inspired by the InChI:

Layer 1: RInChI version and underlying InChI version
Layer 2 and 3: starting materials and products
Layer 4: solvents, catalysts, other molecules which survive the reaction
Layer 5: direction of the reaction: d+, d−, d= or unspecified
Layer 6: count of no-structure materials

A great deal of information is omitted from this description of a reaction. A requirement for too much information would discourage people from using the RInChI. It may also mean that different people, or different computers, provide different information about the same reaction, and so no two descriptions of a reaction will have the same RInChI. This might be good for uniqueness, but is bad for establishing identity and quantifying similarity. The RInChI uses only the information in these six layers; everything else is thrown away. The discarded data are important, but the RInChI is an identifier which helps us to keep track of them rather than a data format to store everything we know about reactions. Fuller data structures, such as the Unified Data Model (UDM) [14], are effective ways of storing comprehensive reaction data, although they will not provide unique and canonical descriptions of reactions. For this reason, they should include a RInChI as one of their fields.

Like the InChI, the RInChI can be compressed into a fixed-length string. There are several options available. The Long-RInChIKey comprises a list of the InChIKeys of the molecules in the reaction. The Short-RInChIKey hashes together the molecules in each layer (reactants, products, agents). In this study, we focus on the Web-RInChIKey, a 47-character string which discards the distinctions between starting materials, products and other molecules. Seventeen characters are a hashed compression of the major InChI layers of the molecules in the RInChI. The final 15 letters are a hash of the remaining structural information.

Ozone depletion is an important process associated with climate change:

O₃ + O + Cl → 2 O₂ + Cl

The RInChI for this process is:

RInChI=1.00.1S/O!O3/c1-3-2<>O2/c1-2<>Cl/d+

This comprises a list of InChI with separators to indicate whether they are starting materials (O, O₃ in this example), products (O₂) or catalysts (Cl). The InChI for all these molecules starts with “InChI=1S/”, which is omitted rather than repeated for all molecules. The InChI for O is “InChI=1S/O”, for Cl is “InChI=1S/Cl”, for O₂ is “InChI=1S/O2/c1-2” and for O₃ is “InChI=1S/O3/c1-3-2”. InChI contain “/” symbols, so this cannot be used as a separator in a list of InChI. The symbol “!” is used instead to separate molecules which have the same role in the reaction. For example, the starting materials, O and O₃, are written in sorted order “O!O3/c1-3-2”. A different separator is required to distinguish lists of molecules and “<>” is used for this purpose.

The first layer “RInChI=1.00.1S” gives the version number of the RInChI and the underlying InChI. The next layers list the InChI for one side of the reaction, ozone and oxygen atoms (“O!O3/c1-3-2”), then the other side, separated by “<>”, oxygen (“O2/c1-2”), and then the species which are present at the start and end of the reaction, chlorine atoms (“Cl”), also separated from the earlier lists of molecules by “<>”. The next layer “d+” tells us the reaction goes from ozone to oxygen rather than the other way around. The direction of a reaction may depend on the reaction conditions, and so this part of the RInChI is in a separate layer from the rest. The sixth layer is omitted, because all of the components of the reaction have a definable molecular structure. The stoichiometry of the reaction is omitted. How sure can we be that a reaction is really a one-to-one reaction? Might there be autocatalysis? Might the need for an experimental excess of a starting material be a requirement of thermodynamics or kinetics rather than mechanism? We do not always know the answers to these questions, and the RInChI is designed so it can be written without the requirement for a detailed mechanistic analysis. The RInChI contains no more information, although it can be associated with an auxiliary information string [9, 10]. A lot of information has been discarded in its construction.
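The layer structure of the ozone example can be unpacked mechanically. The following Python sketch splits the string on the separators described above and restores the omitted “InChI=1S/” prefix. The function name `parse_rinchi` and the dictionary keys are hypothetical and are not part of the RInChI software; the sketch ignores the optional no-structure layer and any auxiliary information.

```python
import re

INCHI_PREFIX = "InChI=1S/"  # prefix omitted inside a RInChI


def parse_rinchi(rinchi: str) -> dict:
    """Split a RInChI into version, molecule groups and direction.

    Illustrative only: it handles the layers used in the ozone example
    and ignores the optional no-structure layer.
    """
    header, slash, body = rinchi.partition("/")
    if not slash or not header.startswith("RInChI="):
        raise ValueError("not a RInChI string")
    version = header[len("RInChI="):]

    # The direction layer, if present, is the trailing /d+, /d- or /d=.
    direction = None
    match = re.search(r"/d([+\-=])$", body)
    if match:
        direction = "d" + match.group(1)
        body = body[:match.start()]

    # "<>" separates the groups; "!" separates molecules within a group.
    groups = body.split("<>")
    groups += [""] * (3 - len(groups))

    def to_inchis(group: str) -> list:
        if not group:
            return []
        return [INCHI_PREFIX + part for part in group.split("!")]

    return {
        "version": version,
        "group_1": to_inchis(groups[0]),
        "group_2": to_inchis(groups[1]),
        "agents": to_inchis(groups[2]),
        "direction": direction,
    }


ozone = parse_rinchi("RInChI=1.00.1S/O!O3/c1-3-2<>O2/c1-2<>Cl/d+")
for layer, value in ozone.items():
    print(layer, value)
```

The two sides of the reaction are deliberately labelled `group_1` and `group_2` rather than reactants and products, because which side is the starting material is carried by the separate direction layer.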

It is useful to compare the RInChI with the Reaction SMILES, which encodes similar information [15]. The structure of the Reaction SMILES is:

Reactants > Agents > Products

Reactants contribute one or more atoms to the products; agents neither contribute atoms to the product nor accept atoms from the reactants. The Reaction SMILES for this reaction can be written:

[O].O=[O+][O-]>[Cl]>O=O

Even in this simple example, more than one Reaction SMILES provides a good description of the reaction. The list of starting materials could have been written in a different order, or a different SMILES for ozone could have been selected. Decisions were required: what is a reactant and what is an agent? Without a detailed knowledge of the mechanism of the reaction, it may not be possible to be sure, and different people may come to different decisions and generate different Reaction SMILES. These differences are part of the power of the SMILES notation and are very useful in many situations. However, SMILES does not compete with the RInChI in the quest to be unique and canonical. The RInChI might appeal to the indecisive, because these decisions do not have to be made about details of the reactions, but should also appeal to people and computers who want to ensure that all the information in a database is definite and not contaminated with subjective additions without strong experimental foundations.

Is this simple summary of a reaction complicated enough to be useful? Have we thrown away too much information? This might be an uncomfortable question for people who have worked hard to create an identifier, but it is an essential one. In this paper, we address the question by investigating the use of the RInChI in a database of more than a billion reactions.

Results and discussion

The SAVI database [16] is a Synthetically Accessible Virtual Inventory of more than a billion compounds, and the reactions which form them. The compounds in SAVI should be easily synthesisable because they were generated using expert knowledge on chemical synthesis by applying 53 reaction processes to about 150 000 commercially-available starting materials. The team, led from the National Cancer Institute, aimed to produce a free, publicly-available dataset of molecules which should be well-suited for drug discovery studies. SAVI is different from other reaction databases which are available, such as SPRESI [5], Reaxys [6], CAS [7] and Pistachio [8].

All of these databases are tiny in comparison with SAVI, although they contain much more diverse information, which reflects experimental data reported in patents and papers rather than the algorithmically generated reactions in SAVI. SAVI is both free and very large: 53 different transforms were chosen from the Lhasa knowledge base and used to generate products from a large database of about 150 000 readily available starting materials (also called building blocks). The names of the transforms are used in SAVI to identify the reactions. The database contains RInChI and Reaction-InChI-Keys.

The SAVI database was downloaded from the NIH on 6 April 2021 [16] as 55 tar files, each of which comprised 200 compressed .csv files. Compressed, these occupy about 350 GB of disk space. Uncompressing all of the files together would require about 3.5 TB. This excludes the detailed molfile data which is also available. Multi-terabyte disks are readily available and not very expensive, so the uncompressed SAVI database can be stored on a large disk. What if it were a million or a billion times larger? Disk space would begin to be a problem even for the best-funded institutions. Since we are interested in understanding all of chemistry, we would like to have much bigger databases than SAVI so that we can have an effective coverage of all of chemical space. Storing the data is manageable. Processing a data store of this size is a significant challenge. Familiar Unix commands for searching and sorting run out of memory without strategies for breaking the analysis into parts and reassembling the answers from multiple operations.

The database contained 1 094 782 440 entries with Web-RInChIKeys and our analysis focussed on just these reactions.

Are any of the Web-RInChIKeys duplicated? It is quite hard to find out. The obvious approach is to take the very large csv file, sort it and search for unique Web-RInChIKeys. The files are so large that our computer ran out of memory rather than give an answer, even when this was done on quite small sections of the database, and so workarounds had to be found. We wanted our approach to be scalable to even larger databases, and so we chose to find out whether we could analyse the database using the computers we had rather than move to more powerful computers. We reduced the csv file to a list of Web-RInChIKeys, substantially reducing its size. This reduced file was then filtered into a series of smaller files using the initial characters of the Web-RInChIKey. For example, a file containing only the Web-RInChIKeys which start with ‘A’ is about 4% of the size of the whole database and shares no duplicates with the omitted Web-RInChIKeys, as none of these start with ‘A’. These smaller files could be sorted and searched for duplicates using standard Linux commands, making the search for duplicate Web-RInChIKeys possible.
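The partitioning strategy can be illustrated with a short Python sketch. The analysis described here used standard Linux commands, so this is an analogue rather than the procedure that was actually run. The function names, the bucket file names and the toy key list are assumptions made for illustration, and the keys are assumed to be stored without any “Web-RInChIKey=” prefix.

```python
import tempfile
from collections import Counter
from pathlib import Path


def split_by_first_letter(keys, out_dir):
    """Write each key to a bucket file chosen by its first character."""
    out_dir = Path(out_dir)
    handles = {}
    try:
        for key in keys:
            key = key.strip()
            if not key:
                continue
            handle = handles.get(key[0])
            if handle is None:
                path = out_dir / f"bucket_{key[0]}.txt"
                handle = handles[key[0]] = open(path, "w")
            handle.write(key + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(out_dir.glob("bucket_*.txt"))


def multiplicity_histogram(bucket_files):
    """Count how many keys occur once, twice, ..., one bucket at a time.

    Identical keys always land in the same bucket, so only one bucket
    has to fit in memory at any moment.
    """
    histogram = Counter()
    for path in bucket_files:
        with open(path) as handle:
            per_key = Counter(line.strip() for line in handle)
        histogram.update(per_key.values())
    return histogram


# Toy input: four entries, one key repeated.
toy_keys = [
    "IPWKGWBMOUVREXHYK-NUHFFFADPSCTJSA",
    "APTUKDWJULXEATQMU-NUHFFFADPSCTJSA",
    "IPWKGWBMOUVREXHYK-NUHFFFADPSCTJSA",
    "BBWWVTIKHJLOKGGRD-NUHFFFADPSCTJSA",
]
with tempfile.TemporaryDirectory() as tmp:
    buckets = split_by_first_letter(toy_keys, tmp)
    n_buckets = len(buckets)
    histogram = multiplicity_histogram(buckets)

print(n_buckets, dict(histogram))
```

If a single-letter bucket is still too large, the same idea can be applied again with the first two characters of the key.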

Some of the Web-RInChIKeys were duplicated, and there were only 1 050 824 321 unique values. What gives rise to almost 44 million repetitions in the database?

One possible explanation is that this arises from hash collisions. Web-RInChIKeys are generated by hashing the full RInChI into a short fixed-length string. An inevitable consequence of this is that there are fewer different Web-RInChIKeys available than there are possible RInChI. Each Web-RInChIKey can correspond to more than one RInChI. Do these repetitions arise because the Web-RInChIKey loses too much information? The Web-RInChIKey is divided into two strings of letters. The first group of 17 letters describes most of the structural information in the RInChI; the second group of 15 letters has information on the protonation state, the stereochemistry and secondary characteristics of the process. We expect there to be a large number of duplicates in the second group, as many of the reactions contain only uncharged species and molecules without stereochemical (or other non-structural) features. In fact, in 905 760 419 of the reactions this second group is the string NUHFFFADPSCTJSA, leaving about 189 million SAVI entries with some stereochemical features. The first group of 17 letters, however, should be evenly distributed over the 26¹⁷ possible values. If we choose a random number between one and 26¹⁷ one billion times, what is the chance that we choose the same number twice? 26¹⁷ is approximately 10²⁴, which is much larger than the square of one billion (10¹⁸). The probability of even a single hash collision within the first section of the Web-RInChIKey, therefore, is of the order of one in a million. We can be very confident, therefore, that the 44 million repetitions do not arise from hash collisions.
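The collision estimate can be reproduced with the standard birthday approximation, P ≈ 1 − exp(−n(n − 1)/2N), for n items drawn uniformly from N possible values. The sketch below is only as good as the assumption that the first group of the key is uniformly distributed; the constant and function names are illustrative.

```python
import math

KEY_SPACE = 26 ** 17          # possible values of the 17-letter first group
N_REACTIONS = 1_094_782_440   # SAVI entries with a Web-RInChIKey


def collision_probability(n_items: int, n_slots: int) -> float:
    """Birthday approximation for at least one collision among n_items."""
    exponent = -n_items * (n_items - 1) / (2 * n_slots)
    return -math.expm1(exponent)  # 1 - exp(exponent), stable for tiny values


log10_space = math.log10(KEY_SPACE)
p_collision = collision_probability(N_REACTIONS, KEY_SPACE)

print(f"26**17 is about 10**{log10_space:.2f}")
print(f"P(at least one collision) is about {p_collision:.1e}")
```

The result is a few parts in ten million, which is the same order of magnitude as the estimate in the text and far too small to explain tens of millions of repeated keys.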

Information is lost when a RInChI is compressed to a Web-RInChIKey. Two different RInChI might have the same Web-RInChIKey not because of a hash collision, but because they differ only in information which is ignored by the Web-RInChIKey. We also checked for this possibility and discovered that all of the duplicated Web-RInChIKey corresponded to duplicate RInChI.

Do these duplicates arise because the RInChI is omitting a vital piece of information about 44 million reactions in SAVI? An alternative explanation is that the SAVI algorithm generated the same reaction more than once and the RInChI has detected this. We therefore looked at the repetitions in more detail.

Were the duplicated Web-RInChIKeys repeated just once, or did some Web-RInChIKeys appear many times in the database? Table 1 shows the analysis:

Table 1:

How many times do Web-RInChIKeys occur in SAVI?

Number of occurrences	Count of Web-RInChIKeys	Structures with stereochemistry
1	1 050 824 321	175 139 336
2	43 815 752	6 926 936
3	25 177	2731
4	25 888	4889
5	0
6	2862	176
7	0
8	4	1
9	0
10	0
11	0
12	1	0
More than 12	0
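The repeat counts in Table 1 can be cross-checked against the totals quoted earlier. A key that occurs k times contributes k − 1 entries beyond its first appearance, so summing (k − 1) × count over the rows for two or more occurrences should give the difference between the number of entries and the number of unique keys. The sketch below uses only figures quoted in the text and in Table 1; the variable names are illustrative.

```python
TOTAL_ENTRIES = 1_094_782_440   # entries with a Web-RInChIKey
UNIQUE_KEYS = 1_050_824_321     # distinct Web-RInChIKeys

# Number of Web-RInChIKeys occurring k times, for k >= 2 (Table 1).
REPEATED = {2: 43_815_752, 3: 25_177, 4: 25_888, 6: 2_862, 8: 4, 12: 1}

surplus_from_table = sum((k - 1) * count for k, count in REPEATED.items())
surplus_from_totals = TOTAL_ENTRIES - UNIQUE_KEYS

print(surplus_from_table, surplus_from_totals)
```

Both routes give 43 958 119 surplus entries, the "almost 44 million repetitions" discussed above.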

We looked first at the Web-RInChIKey which is present 12 times: IPWKGWBMOUVREXHYK-NUHFFFADPSCTJSA. The second section of the Web-RInChIKey shows that the molecules involved in the reaction do not contain any stereochemical features. The reaction corresponding to this Web-RInChIKey is shown in Fig. 2.

Fig. 2:

IPWKGWBMOUVREXHYK-NUHFFFADPSCTJSA.
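Reading the two parts of a Web-RInChIKey, as done for the key in Fig. 2, is easy to automate. The sketch below splits a key into its 17-letter and 15-letter groups and flags keys whose second group is NUHFFFADPSCTJSA. The function name is hypothetical, and the optional “Web-RInChIKey=” prefix is an assumption about how the 47-character total is made up.

```python
PREFIX = "Web-RInChIKey="            # assumed prefix of the full key
PLAIN_SECOND_GROUP = "NUHFFFADPSCTJSA"


def split_web_rinchikey(key: str) -> tuple:
    """Return the (first, second) letter groups of a Web-RInChIKey."""
    if key.startswith(PREFIX):
        key = key[len(PREFIX):]
    first, dash, second = key.partition("-")
    if not dash or len(first) != 17 or len(second) != 15:
        raise ValueError(f"unexpected Web-RInChIKey layout: {key!r}")
    return first, second


first, second = split_web_rinchikey("IPWKGWBMOUVREXHYK-NUHFFFADPSCTJSA")
is_plain = second == PLAIN_SECOND_GROUP

print(first, second, is_plain)
```

Filtering a key list on `is_plain` is one way to separate the entries with stereochemical or other secondary features from the rest.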

Six of the 12 entries in SAVI for this process are labelled (in SAVI) as “Mitsunobu reactions” and the rest as “Ester or amide or thiolester formation”. Both of these labels are appropriate for the transformation and the diagram gives useful, but incomplete, information about the reaction. For example: which oxygen in the starting material corresponds to the non-carbonyl oxygen in the ester that is formed? The answer depends on details of the reaction conditions which are not listed in the database. A useful feature of the RInChI is that it ignores this detail of the reaction, whilst retaining the key features, and so generates a unique and canonical string.

The left-hand starting material has three possible N–H tautomers, and the right-hand starting material has two possible N–H tautomers. The starting materials, therefore, can be written in six different ways. The SMILES description of the molecules in SAVI distinguishes between these representations and the InChI does not. It would not be possible to separate the tautomers under the reaction conditions for this process, and so the RInChI is right not to distinguish them. The product can also be represented as six distinct tautomers and any of the tautomers of the starting materials could lead to any of the tautomers of the product. Thus, this reaction could be written in 36 different ways, all of which would have different Reaction SMILES, all of which would have the same RInChI and Web-RInChIKey, and only six of which are listed in SAVI.

This accounts for six of the 12 entries for IPWKGWBMOUVREXHYK-NUHFFFADPSCTJSA. The other six arise because two different SAVI transforms lead to the same reaction: a Mitsunobu and an ester formation. The combination of tautomers and alternative transforms leads to the 12 entries in the database. Reaction SMILES distinguishes them all, by separating the tautomers and by differentiating between the transforms by recording the list of starting materials in a different order. The RInChI emphasises the unity of all of these processes.
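The counting argument for this key can be written out explicitly. The labels below are placeholders for the tautomeric drawings, not real structures, and the figure of six drawings listed in SAVI is taken from the text.

```python
from itertools import product

left_tautomers = ["L1", "L2", "L3"]                  # three N-H tautomers
right_tautomers = ["R1", "R2"]                       # two N-H tautomers
product_tautomers = [f"P{i}" for i in range(1, 7)]   # six product tautomers
transforms = ["Mitsunobu Reaction", "Ester or Amide or Thiolester Formation"]

# Every way of drawing the pair of starting materials.
starting_forms = list(product(left_tautomers, right_tautomers))

# Every way of drawing the whole reaction.
reaction_drawings = list(product(starting_forms, product_tautomers))

# SAVI lists six drawings, each under both transform names.
DRAWINGS_IN_SAVI = 6
savi_entries = DRAWINGS_IN_SAVI * len(transforms)

print(len(starting_forms), len(reaction_drawings), savi_entries)
```

All of these drawings share one RInChI, which is why a single Web-RInChIKey collects the 12 entries.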

The SAVI database has a lot of information beyond structure. For example, the QED values (quantitative estimate of druglikeness) are calculated for all of the products. Rather surprisingly, the QED values are not the same for all 12 of these identical molecules.

No Web-RInChIKeys are present 11, 10 or 9 times, but four are present eight times (Fig. 3). All of these have four possible tautomeric forms, and all of them can be described either as ester formation or as a Mitsunobu reaction. The eight entries, therefore, are consistent with the previous reaction which had six tautomeric forms instead of four.

Fig. 3:

Web-RInChIKeys with eight entries in SAVI.

The last of these reactions contains a double bond and so the Web-RInChIKey does not end in NUHFFFADPSCTJSA. The reaction as described by the Reaction SMILES in SAVI does not define the double bond geometry, but the RInChI defines it as Z in the product but not in the starting material.

There are no septets of Web-RInChIKeys in SAVI, but there are 2862 groups of six. We might expect that these will all be amide or ester forming reactions with three possible tautomeric forms, but the situation turns out to be more complicated. If every group comprised three Mitsunobu reactions and three other ester formations, there would need to be 8586 Mitsunobu reactions in total. It turns out that there are only 8401 Mitsunobu reactions, and just 8441 “Ester or Amide or Thiolester Formation” reactions. Most of the sextets do comprise three of each, but there is one group with four Mitsunobu reactions and seven with six “Ester or Amide or Thiolester Formation” reactions (Fig. 4).

Fig. 4:

The unusual sextets.

The first of these, VROCYXOXYZMAPFSYT-NUHFFFADPSCTJSA, has four tautomeric forms, and so an octet might have been expected. Two of the “Ester or Amide or Thiolester Formation” entries are omitted.

The other seven do not include any Mitsunobu reactions, and this is not surprising as they all form amides rather than esters. Four of them (ABUMYERNUOPNDTCMM-NUHFFFADPSCTJSA, FIGAXTUPRABBRRCZL-NUHFFFADPSCTJSA, SLIDVDBMAILDODXAD-NUHFFFADPSCTJSA, ZERDZYWXTJCHKUJMX-NUHFFFADPSCTJSA) have nine possible tautomers, so a nonet might have been anticipated. For the remaining three (JSESGELBJNZYBXZTY-NUHFFFADPSCTJSA, XMYXKAOUNNOENVGUV-NUHFFFADPSCTJSA, YSVXWQVQMKLTKEGDV-NUHFFFADPSCTJSA) there are six tautomers and six entries, as expected.

The Mitsunobu reaction is widely used to form esters and is not usually used to form amides. It is no surprise that the SAVI database reflects this, probably as a result of the algorithms used to generate the data. It is notable, however, that a simple RInChI-based analysis of a billion reaction database makes this feature of the reaction so prominent and suggests that other, less obvious, insights might be found in similar ways.

The rest of the sextets represent a different reaction type: 330 “Benzimidazoles from o-Phenylenediamines” (Fig. 5). The benzimidazole ring formed in these reactions has two tautomers, for unsymmetrical structures, and so an even number of tautomers should be expected. Searching the 330 RInChI with a simple regular expression confirms that all of the coupling partners have three tautomeric forms, so the example in Fig. 5 is typical of this group of reactions.

Fig. 5:

Example of a “Benzimidazoles from o-Phenylenediamines” transform.
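The bookkeeping for the sextets can be checked with a few lines of arithmetic. The sketch assumes that the 330 benzimidazole entries fall into complete groups of six and that the group with four Mitsunobu entries has two ester-formation entries; the other figures are those quoted in the text.

```python
N_SEXTETS = 2_862
MITSUNOBU_ENTRIES = 8_401
ESTER_AMIDE_ENTRIES = 8_441
BENZIMIDAZOLE_ENTRIES = 330

# Every sextet holds six entries, so the three counts must add up.
entries_expected = N_SEXTETS * 6
entries_listed = (MITSUNOBU_ENTRIES + ESTER_AMIDE_ENTRIES
                  + BENZIMIDAZOLE_ENTRIES)

benzimidazole_sextets = BENZIMIDAZOLE_ENTRIES // 6
four_mitsunobu_sextets = 1   # four Mitsunobu + two ester formations
all_ester_sextets = 7        # six ester/amide formations, no Mitsunobu
regular_sextets = (N_SEXTETS - benzimidazole_sextets
                   - four_mitsunobu_sextets - all_ester_sextets)

# Rebuild the per-transform totals from the group compositions.
mitsunobu_rebuilt = 3 * regular_sextets + 4 * four_mitsunobu_sextets
ester_rebuilt = (3 * regular_sextets + 2 * four_mitsunobu_sextets
                 + 6 * all_ester_sextets)

print(entries_expected, entries_listed)
print(regular_sextets, mitsunobu_rebuilt, ester_rebuilt)
```

Under these assumptions the rebuilt totals match the quoted 8401 and 8441, so the unusual sextets account for all of the imbalance between the two transforms.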

There are no structures in SAVI for which the Web-RInChIKey is repeated in a quintet, nor for any other odd number above three. The triplets arise from triazoles, which have three NH tautomeric forms and are well represented in the structures drawn above. It is straightforward to imagine structures with five, seven or more NH tautomeric forms. Some examples are shown in Fig. 6. The absence of such structures from SAVI re-emphasises the point made by Pitt in 2009 [17] that only a small proportion of reasonable heterocycles have ever been used.

Fig. 6:

Heterocycles with five and seven NH tautomeric forms.

There are nearly 26 000 quartets of Web-RInChIKey in the database, which is slightly more than the number of triplets. Nearly all of these arise from the same processes that lead to the larger groups, and only a minority of them (4889) have stereochemical features:

275 “Benzimidazoles from o-Phenylenediamines”

12 771 “Mitsunobu Reaction”

12 799 “Ester or Amide or Thiolester Formation”

In addition to these, there are 43 quartets for the “Buchwald–Hartwig Reaction”, none of which have any stereochemistry. For some of these, a change in the order of the reactants appears to be enough to generate an extra entry in SAVI, although it has no effect on the RInChI. Fig. 7 gives an example of this. Other quartets are due to a pair of NH tautomers, giving four entries for one RInChI, as we have come to expect. Fig. 8 gives an example of this. The product in this case contains two stereogenic centres, but these are undefined and so omitted from the InChI and RInChI. It is not clear whether the products should be regarded as mixtures of stereoisomers or single substances with undefined structure.

Fig. 7:

APTUKDWJULXEATQMU-NUHFFFADPSCTJSA.

Fig. 8:

BBWWVTIKHJLOKGGRD-NUHFFFADPSCTJSA.

The 25 177 triplets come from a slightly wider range of transformations:

1 114 “Acylsulfonamide from Sulfonamide and Carboxylic Acid”

10 “Benzimidazoles from o-Phenylenediamines”

21 495 “Ester or Amide or Thiolester Formation”

5 360 “Mitsunobu Reaction”

4 “Mitsunobu SN2′ Reaction”

Many of these triplets arise from systems which have three tautomeric forms, but other examples have just three of the four possibilities which might be generated by swapping the order of the reactants and having a tautomeric pair. A total of 2801 triplets comprise one “ester or amide or thiolester formation” and two “Mitsunobu reactions”. There is only one example of a single Mitsunobu paired with two “ester or amide or thiolester formation” processes. It is not obvious how the last of these is distinctive. These omissions probably arise from the process used to generate the database and illustrate that it is not exhaustive.

There are 43 815 752 pairs of Web-RInChIKey and these represent a wide range of individual transforms:

11 140 “Acylsulfonamide from Sulfonamide and Carboxylic Acid”

1 600 816 “Benzimidazoles from o-Phenylenediamines”

1 108 “Buchwald–Hartwig Ether Formation”

122 444 “Buchwald–Hartwig Reaction”

41 097 289 “Ester or Amide or Thiolester Formation”

8 “Hiyama Carbonylative Cross-Coupling”

3 710 “Horner–Wadsworth–Emmons Olefination”

1 020 “Liebeskind–Srogl Heterocyclic Coupling”

7 032 “Mitsunobu Aryl Ether Formation”

20 “Mitsunobu carbon–carbon bond formation”

8 870 “Mitsunobu Imide Reaction”

40 206 971 “Mitsunobu Reaction”

30 “Mitsunobu SN2′ Reaction”

2 452 “Mitsunobu Sulfonamide Reaction”

28 “Paal-Knorr Pyrrole Synthesis”

48 “Pyrazoles from Beta Carbonyl Carboxylic Acid Derivatives”

56 “Sulfonamide alkylation with a cyclic ether”

2 536 380 “Sulfonamide from sulfonic acid and amine”

40 354 “Sulfonamide Schotten–Baumann from Aryl Bromide”

2 536 380 “Sulfonamide Schotten–Baumann from Sulfonate”

2 104 “Suzuki–Miyaura Cross-Coupling (Bromo)”

10 836 “Williamson Ether Synthesis”

3 710 “Wittig Reaction”

2 “Wittig via Methoxy-Ylide”

This list represents only 24 out of the 53 transforms used to generate SAVI. Further, the top five transforms in this list cover 99.8% of all of the pairs in SAVI. This suggests that most of the reactions in the database do not have the possibility of tautomerism and the order of the reagents only generates extra entries for some transforms.
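The share of the pairs covered by the most common transforms can be recomputed from the list above. The sketch takes the counts exactly as listed; the dictionary name and layout are illustrative.

```python
PAIR_COUNTS = {
    "Acylsulfonamide from Sulfonamide and Carboxylic Acid": 11_140,
    "Benzimidazoles from o-Phenylenediamines": 1_600_816,
    "Buchwald-Hartwig Ether Formation": 1_108,
    "Buchwald-Hartwig Reaction": 122_444,
    "Ester or Amide or Thiolester Formation": 41_097_289,
    "Hiyama Carbonylative Cross-Coupling": 8,
    "Horner-Wadsworth-Emmons Olefination": 3_710,
    "Liebeskind-Srogl Heterocyclic Coupling": 1_020,
    "Mitsunobu Aryl Ether Formation": 7_032,
    "Mitsunobu carbon-carbon bond formation": 20,
    "Mitsunobu Imide Reaction": 8_870,
    "Mitsunobu Reaction": 40_206_971,
    "Mitsunobu SN2' Reaction": 30,
    "Mitsunobu Sulfonamide Reaction": 2_452,
    "Paal-Knorr Pyrrole Synthesis": 28,
    "Pyrazoles from Beta Carbonyl Carboxylic Acid Derivatives": 48,
    "Sulfonamide alkylation with a cyclic ether": 56,
    "Sulfonamide from sulfonic acid and amine": 2_536_380,
    "Sulfonamide Schotten-Baumann from Aryl Bromide": 40_354,
    "Sulfonamide Schotten-Baumann from Sulfonate": 2_536_380,
    "Suzuki-Miyaura Cross-Coupling (Bromo)": 2_104,
    "Williamson Ether Synthesis": 10_836,
    "Wittig Reaction": 3_710,
    "Wittig via Methoxy-Ylide": 2,
}

n_transforms = len(PAIR_COUNTS)
total_listed = sum(PAIR_COUNTS.values())
top_five = sorted(PAIR_COUNTS.values(), reverse=True)[:5]
top_five_share = 100 * sum(top_five) / total_listed

print(n_transforms, f"{top_five_share:.1f}%")
```

The five largest counts belong to the ester or amide formation, the Mitsunobu reaction, the two sulfonamide transforms with identical counts and the benzimidazole synthesis.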

Reaction duplicate checks in SPRESI by InfoChem

InfoChem GmbH [18], Munich, Germany, publishes the chemical database SPRESI containing 5.8 million molecules abstracted from literature, nearly 6.6 million reactions, 700 000 references and 164 000 patents with the major focus on organic chemistry covering the years 1974–2014 [5].

To test the RInChI algorithm, the RInChI and the RInChI keys (Long-RInChIKey, Short-RInChIKey and Web-RInChIKey) for the entire reaction part of the database were calculated. Two hundred and thirty-nine calculations failed because of pseudo-atoms not being recognised by the InChI algorithm. That left 4 564 718 reactions to be further analysed.

If two or more reactions have identical keys, it is expected that their structural representations are indistinguishable and that they differ only in terms of reaction conditions. However, the analysis of the RInChI keys showed that Long-RInChIKeys and Short-RInChIKeys identify the same number of duplicate records, but the number of duplicated reactions identified by Web-RInChIKeys differed from those of the other two keys, as shown in Table 2.

Table 2:

Unique reactions, Short-RInChIKeys and Web-RInChIKeys.

	Reactions	Unique reactions with multiple entries in SPRESI	Total unique reactions
SPRESI_RINCHI	4 564 718
Short-RInChIKey: total number of reactions with at least one duplicate	418 540	176 454	4 322 632
Web-RInChIKey: total number of reactions with at least one duplicate	455 572	191 382	4 300 528
Difference between totals	37 032	14 928	−22 104
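The columns of Table 2 are linked by simple arithmetic: the number of unique reactions is the total, minus the reactions that have at least one duplicate, plus the number of distinct reactions those duplicates collapse to. The sketch below checks this relationship against the figures in the table; the dictionary keys are illustrative names for the table columns.

```python
TOTAL_REACTIONS = 4_564_718

TABLE_2 = {
    "Short-RInChIKey": {
        "with_duplicates": 418_540,
        "unique_repeated": 176_454,
        "total_unique": 4_322_632,
    },
    "Web-RInChIKey": {
        "with_duplicates": 455_572,
        "unique_repeated": 191_382,
        "total_unique": 4_300_528,
    },
}


def total_unique(total: int, with_duplicates: int, unique_repeated: int) -> int:
    """Collapse every group of duplicated reactions to a single reaction."""
    return total - with_duplicates + unique_repeated


computed = {
    key: total_unique(TOTAL_REACTIONS, row["with_duplicates"],
                      row["unique_repeated"])
    for key, row in TABLE_2.items()
}
difference = computed["Web-RInChIKey"] - computed["Short-RInChIKey"]

print(computed, difference)
```

The computed totals agree with the last column of Table 2, and the difference of −22 104 is the number of additional reactions that the Web-RInChIKey merges.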

Obviously, SPRESI contains duplicates: the same reaction is run under different conditions and, therefore, is reported in the database multiple times. As shown in Table 3 the number of duplicate entries in the database is large for pairs, but diminishes rapidly for triplets, quartets and higher levels of duplication.

Table 3:

Unique reactions, Short-RInChIKey and Web-RInChIKey.

Number of identical reactions	Identity checked by Long-RInChIKey/Short-RInChIKey: number of reactions	Identity checked by Web-RInChIKey: number of reactions
2	146 132	158 428
3	18 595	20 074
4	6384	6920
5	1752	1997
6	1304	1393
7	543	620
8	442	511
9	277	308
10	202	228
>10	1	1

There is one group of more than 10 reactions which is represented by the Web-RInChIKey = YNZRACMPSGLNHVVMS-NUHFFFADPSCTJSA. This occurs 175 times in the database and is related to the reactions in Table 4.

Table 4:

The most common Web-RInChIKey, distinguished by Long-RInChIKey.

Number of identical Long-RInChIKeys	Reaction from SPRESI
113	CO → CO₂
60	CO₂ → CO
1	CO + CO → CO₂
1	CO CO + CO → CO₂

All four reaction types have distinct Long-RInChIKeys and Short-RInChIKeys because they differ in reaction direction (113 forward reactions identified by the layer FUHFF, 60 backward reactions identified by the layer BUHFF), in the number of reactants (stoichiometric approach) and in the interpretation of agents. Some of the reactions are depicted in an unexpected way, and in these cases it is helpful that the Web-RInChIKey groups them together.

Because the Web-RInChIKey is compiled from the components without their roles in the reaction, and because each compound is considered only once in its calculation, all 175 reactions have the same Web-RInChIKey.

This shows the strength of the Web-RInChIKey: the key makes it straightforward to identify related forward and backward reactions. Reaction databases are generally limited to forward reactions [19]. To represent equilibrium reactions, the forward and the backward reaction have to be stored in the database together with their relationship. The introduction of the RInChI provides a new approach by using the Web-RInChIKey to group these reactions together to let them be identified as potential equilibrium reactions.

Because the Web-RInChIKey ignores the role of a compound and the number of times it occurs in a reaction, it treats as identical those reactions in which structures have multiple or different roles, like the last two reactions of Table 4.
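The effect of discarding roles and multiplicities can be imitated with a toy key. The sketch below is not the Web-RInChIKey algorithm: it simply hashes the sorted set of component InChIs with SHA-256, as a stand-in, to show why reactions built from the same compounds collapse to one key. The function name is hypothetical, and the last variant, with CO also listed as an agent, is a made-up example rather than a record from SPRESI.

```python
import hashlib

CO = "InChI=1S/CO/c1-2"
CO2 = "InChI=1S/CO2/c2-1-3"


def role_free_key(reactants, products, agents=()):
    """Toy role-agnostic key: a stand-in, NOT the Web-RInChIKey hash.

    Each compound is counted once, whatever its role or multiplicity.
    """
    components = sorted(set(reactants) | set(products) | set(agents))
    digest = hashlib.sha256("!".join(components).encode("utf-8"))
    return digest.hexdigest()[:16]


toy_keys = {
    "CO -> CO2": role_free_key([CO], [CO2]),
    "CO2 -> CO": role_free_key([CO2], [CO]),
    "CO + CO -> CO2": role_free_key([CO, CO], [CO2]),
    "CO -> CO2, CO as agent": role_free_key([CO], [CO2], agents=[CO]),
}
distinct_keys = set(toy_keys.values())

print(len(distinct_keys))
```

A key that kept the role of each compound, as the Long-RInChIKey and Short-RInChIKey do, would separate at least the forward and backward reactions.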

The Web-RInChIKey has been designed to make reactions searchable in data sets where the drawing rules are not known and the definitions for the storage of reactants, products and agents are not available. As shown in the example above, the resulting hit lists may contain more results than expected, but they ensure that no expected reaction is lost. This makes the Web-RInChIKey an ideal solution for reaction searches on the Internet, where neither the normalisation rules nor the reaction directions are generally known.

How does SAVI compare with SPRESI?

SPRESI contains only 0.5% of the number of reactions in SAVI. Because these reactions are extracted from the literature, the database is more diverse. The Long-RInChIKey and Short-RInChIKey are capable of identifying duplicates in the database, but the Web-RInChIKey is needed to group related forward and backward reactions together as well as to handle discrepancies in the reaction structure representations.

Conclusions

The RInChI can help analyse very large reaction databases. Web-RInChIKeys are very unlikely to be duplicated by hash collisions in a billion-reaction database. Most repeated structures arise from NH tautomerism and the current RInChI treatment of this is very effective. SAVI’s billion-reaction database is large enough to be hard to analyse, but not large enough to survey more than a tiny subset of chemical reaction space.
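The scale of the collision risk can be gauged with the standard birthday approximation. The sketch below assumes an idealised hash whose values are uniformly distributed; the bit widths are illustrative choices, not the actual size of the Web-RInChIKey hash, and `expected_collisions` is a hypothetical helper.

```python
def expected_collisions(n_items, hash_bits):
    """Expected number of colliding pairs among n_items uniform random hashes.

    Birthday approximation: (number of pairs) / (number of possible hash values).
    """
    n_pairs = n_items * (n_items - 1) / 2
    return n_pairs / 2 ** hash_bits


# Illustrative bit widths only; not taken from the RInChI specification.
for bits in (32, 64, 96):
    print(bits, expected_collisions(10**9, bits))
```

Under these assumptions, the expected number of accidental collisions among a billion keys falls by a factor of two for every additional bit of hash, so a modest increase in key length is enough to make chance duplicates negligible.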

SAVI is huge, but it is not large enough to include all chemistry. We guess that we need millions of times more information, probably billions of times more, and perhaps much more than this. The RInChI can help to curate SAVI and can also enable the linking of SAVI to other datasets: it is a consistent key which works on the scale of billions to link diverse data.

We need more reaction data. The RInChI provides an effective identifier to link diverse reaction databases together.

Corresponding author: Gerd Blanke, StructurePendium Technologies GmbH, Reulsbergweg 5, D-45257 Essen, Germany, e-mail: Gerd.Blanke@StructurePendium.com

Jonathan M. Goodman, Twitter: @goodman_j

Article note: A collection of invited papers on Cheminformatics: Data and Standards.

Acknowledgments

Members of the RInChI working group: Guenter Grethe, István Öri, Jan Holst Jensen and Nicki Davis.

Research funding: None declared.

References

[1] K. Ermanis, A. C. Colgan, R. S. J. Proctor, B. W. Hadrys, R. J. Phipps, J. M. Goodman. J. Am. Chem. Soc. 142, 21091 (2020), https://doi.org/10.1021/jacs.0c09668.

[2] S. Lee, J. M. Goodman. J. Am. Chem. Soc. 142, 9210 (2020), https://doi.org/10.1021/jacs.9b13449.

[3] J. P. Reid, L. Simon, J. M. Goodman. Acc. Chem. Res. 49, 1029 (2016), https://doi.org/10.1021/acs.accounts.6b00052.

[4] C. W. Coley, W. H. Green, K. F. Jensen. Acc. Chem. Res. 51, 1281 (2018), https://doi.org/10.1021/acs.accounts.8b00087.

[5] SPRESI, https://www.SPRESI.com/ (accessed Sep 14, 2021).

[6] Reaxys, https://www.reaxys.com (accessed Sep 28, 2021).

[7] CAS, https://www.cas.org (accessed Sep 28, 2021).

[8] Pistachio. NextMove Software Limited, Cambridge CB4 0WG.

[9] G. Grethe, J. M. Goodman, C. H. G. Allen. J. Cheminf. 5, 45 (2013), https://doi.org/10.1186/1758-2946-5-45.

[10] G. Grethe, G. Blanke, H. Kraut, J. M. Goodman. J. Cheminf. 10, 22 (2018), https://doi.org/10.1186/s13321-018-0277-8.

[11] S. R. Heller, I. Pletnev, S. Stein, D. Tchekhovskoi. J. Cheminf. 7, 23 (2015), https://doi.org/10.1186/s13321-015-0068-4.

[12] J. M. Goodman, I. Pletnev, P. Thiessen, E. Bolton, S. R. Heller. J. Cheminf. 13, 40 (2021), https://doi.org/10.1186/s13321-021-00517-z.

[13] InChI Trust, https://www.inchi-trust.org (accessed Sep, 2021).

[14] UDM, https://github.com/PistoiaAlliance/UDM (accessed Sep, 2021).

[15] A. R. Leach, J. Bradshaw, D. V. S. Green, M. M. Hann, J. J. Delany III. J. Chem. Inf. Comput. Sci. 39, 1161 (1999), https://doi.org/10.1021/ci9904259.

[16] H. Patel, W.-D. Ihlenfeldt, P. N. Judson, Y. S. Moroz, Y. Pevzner, M. L. Peach, V. Delannée, N. I. Tarasova, M. C. Nicklaus. Sci. Data 7, 384 (2020), https://doi.org/10.1038/s41597-020-00727-4.

[17] W. R. Pitt, D. M. Parry, B. G. Perry, C. R. Groom. J. Med. Chem. 52, 2952 (2009), https://doi.org/10.1021/jm801513z.

[18] InfoChem GmbH, https://www.deepmatter.io (accessed Sep 28, 2021).

Published Online: 2022-05-05

Published in Print: 2022-06-27

© 2022 IUPAC & De Gruyter. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For more information, please visit: http://creativecommons.org/licenses/by-nc-nd/4.0/
