QSPR Calculation of Normal Boiling Points of Organic Molecules Based on the Use of Correlation Weighting of Atomic Orbitals with Extended Connectivity of Zero- and First-Order Graphs of Atomic Orbitals

by Maykel Pérez González 1, Andrey A. Toropov 2, Pablo R. Duchowicz 3 and Eduardo A. Castro 3,*

1 Department of Drug Design, Experimental Sugar Cane Station “Villa Clara-Cienfuegos”, Ranchuelo, Villa Clara, C.P. 53100, Cuba
2 Vostok Holding Innovation Company, Sadik Azimov 4th Street, 15, Tashkent 700000, Uzbekistan
3 INIFTA, Suc.4, C.C. 16, La Plata 1900, Argentina
* Author to whom correspondence should be addressed.

Molecules 2004, 9(12), 1019-1033; https://doi.org/10.3390/91201019
Received: 8 July 2004 / Accepted: 4 August 2004 / Published: 31 December 2004


Abstract: We report the results of a calculation of the normal boiling points of a representative set of 200 organic molecules through the application of QSPR theory. For this purpose we have used a particular set of flexible molecular descriptors, the so-called Correlation Weighting of Atomic Orbitals with Extended Connectivity of Zero- and First-Order Graphs of Atomic Orbitals. Although in general the results show that this index is suitable for predicting this physical-chemistry property, the existence of some deviant behaviors points to a need to complement it with other kinds of molecular descriptors. Some possible extensions of this study are discussed.

Keywords: Boiling point; Flexible Molecular Descriptors; Correlation Weighting of Atomic Orbitals.


Introduction

One of the topics of continuing interest in structure-property studies is to arrive at simple correlations between the selected properties and the molecular structure. For such considerations the molecular structure is often represented as a simple mathematical object, such as a number, sequence, or a set of selected invariants of matrices, generally referred to as molecular descriptors. Multiple regression analysis is usually used in such studies in the hope that it might point to structural factors that influence a particular property. Of course, regression analysis does not establish a causal relationship between structural components and molecular properties. Nevertheless, it may help one in model building and assist in the design of molecules with prescribed desirable properties, which is an important goal in drug research. In chemistry, anything that can be said about the magnitude of the property and its dependence upon changes in the molecular structure depends on the chemist’s capability to establish valid relationships between structure and property. In many physical-chemistry, organic, biochemical and biological areas, it is increasingly necessary to translate those general relations into quantitative associations expressed in useful algebraic equations known as Quantitative Structure-Activity (-Property) Relationships (QSAR/QSPR). To obtain a significant correlation, it is crucial that appropriate descriptors be employed, whether they be theoretical, empirical or derived from readily available experimental features of the molecular structures. Many descriptors reflect simple molecular properties and thus they can provide some meaningful insights into the physical-chemistry nature of the activity/property under consideration.

Chemical graph theory [1] advocates an alternative approach to QSAR/QSPR studies based on mathematically derived molecular descriptors. 
Such descriptors, often referred to as topological indices [2], include the well-known Wiener index W [3], the Hosoya index Z [4], and the connectivity index χ [5]. The last three decades have witnessed an upsurge of interest in applications of graph theory in chemistry. Constitutional formulae of molecules are chemical graphs where vertices represent the set of atoms and edges represent chemical bonds [6]. The pattern of connectedness of atoms in a molecule is preserved by constitutional graphs. A graph G = [V,E] consists of a finite nonempty set V of points together with a prescribed set E of unordered pairs of distinct points of V [7].

The correlation and prediction of physical-chemistry properties of pure liquids and of mixtures, such as boiling point, density, viscosity, static dielectric constant, and refractive index, is of practical (process design and control) and theoretical (role of the molecular structure in determining the macroscopic properties of the solvent) relevance to both chemists and engineers. Traditionally, procedures for estimating these properties have been based either on theoretical relationships, often making use of empirical parameters that have to be fitted, or on empirical relationships derived from additive-constitutive schemes based on the contributions of atomic groups or bonds within the molecule [8,9,10,11,12]. More recently, the QSPR approach has been applied especially to predict boiling points (BPs), partition coefficients, chromatographic retention indexes, surface tension, critical temperatures, viscosity, refractive index, thermodynamic state functions and static dielectric constant, among other properties. 
The use of calculated molecular descriptors in QSPR analysis has two main advantages: (a) the descriptors can be univocally defined for any molecular structure or fragment; (b) thanks to the high and well-defined physical information content encoded in many theoretical descriptors, they can clarify the mechanism relating the studied property with the chemical structure. Furthermore, QSPR models based on calculated descriptors help in understanding the inter- and intramolecular interactions that are mainly responsible for the behavior of complex chemical systems and processes.

The normal BP (i.e. the boiling point at 1 atm) is one of the major physical-chemistry properties used to characterize and identify a compound. Besides being an indicator for the physical state (liquid or gas) of a compound, the BP also provides an indication of its volatility. In addition, BPs can be used to predict or estimate other physical properties, such as critical temperatures, flash points, enthalpies of vaporization, etc. [13,14,15]. The BP is often the first property measured for a new compound and one of the few parameters known for almost every volatile compound. Normal BPs are easy to determine, but when a chemical is unavailable, as yet unknown, or hazardous to handle, a reliable procedure for estimating its BP is required. Furthermore, the rapid and nearly explosive growth of combinatorial chemistry, where literally millions of new compounds are synthesized and tested without isolation, could render such a procedure very useful.

A large number of methods for estimating BPs have been devised, and numerous QSPR correlations of normal BPs have been reported; detailed reviews have been given elsewhere [15,16,17,18,19,20,21,22]. 
The aim of this study is to present the results derived from the use of a particular sort of flexible molecular descriptor to estimate the BPs of a representative set of organic molecules, in order to seek better ways of calculating physical-chemistry properties. Previous experience with this issue has shown the convenience of resorting to this special sort of molecular descriptor.

The paper is organized in the following way: the next section deals with the basic methodology, presenting some general properties of flexible molecular descriptors and some of their previous uses. Then, we describe the calculation strategy, after which we give and discuss the results. Finally, our conclusions are presented together with some possible future extensions of the method.

Molecular Descriptors

The basic algebraic expression of the fundamental principle governing QSAR/QSPR, i.e. the quantitative formula representing the structure-activity/property relationship, is

P = f({d})    (1)

where P stands for the activity/property, {d} is a set of molecular descriptors and f is an arbitrary function. The commonest and simplest case is that in which {d} reduces to a single variable d and f is a linear function, i.e.

P = a + bd    (2)

where the real numbers a and b (a, b ∈ ℝ) are determined by a standard least squares procedure.

Since there are many ways to choose the set of molecular descriptors, and since the descriptors can be highly interrelated, one faces a difficult situation that has been termed the nightmare of regression analysis. The drawbacks include the question of how to select the descriptors, ambiguities in the criteria used to select optimal descriptors, and uncertainties when choosing the order in which descriptors are to be orthogonalized. Naturally, none of these difficulties exists for simple regression based on a single molecular descriptor, particularly if the regression is linear. This is one of the major reasons why researchers are striving to find or to design novel descriptors that would produce a good correlation for a single molecular property of a set of compounds. However, not many molecular properties can be sufficiently well described by a single descriptor [23].

A quite interesting alternative to surmount these difficulties was proposed long ago by Randic [24]. It consists in defining {d} as a function of one or several variables that are determined during the search for the best correlation. Thus, in contrast to the traditional topological indices, which one can calculate after selecting a set of compounds to be studied before proceeding with the statistical analysis, the variable indices are initially non-numerical. Hence, they cannot be calculated in advance for the set of compounds. 
Instead, one starts with an arbitrary set of values for the yet undetermined variables and, through an iterative procedure, one varies these initial values seeking optimal values that will produce the smallest standard error for the property under consideration. It is clear that the use of variable descriptors (also called flexible descriptors) can only improve correlations relative to simple indices: if all variables take on a zero value (which is very unlikely), the results coincide with those based on the traditional rigid molecular descriptors. Current literature shows that the use of variable molecular descriptors dramatically improves regression statistics [23].

Among the different ways of choosing flexible molecular descriptors, one of us (A.A.T.) has presented the so-called Optimization of Correlation Weights of Local Graph Invariants (OCWLGI) procedure, which has proved to be a rather suitable way to calculate several biological activities and physical-chemistry properties [25,26,27,28,29,30,31,32,33,34]. The OCWLGI may be based on the labeled hydrogen filled graph (LHFG) [35] and on the graph of atomic orbitals (GAO) [36]. The OCWLGI based upon LHFGs yields reasonably good models of enthalpies of formation from elements of coordination compounds [37]. In addition, the OCWLGI based on LHFGs has been used to model the Flory-Huggins polymer-solvent interaction parameters [26]. The OCWLGI based upon GAOs gives rather good results in predicting stability constants of amino acid complexes [36].

Molecular descriptors DCW are calculated by means of the following relationship

[Equation (3): displayed as an image in the source; not recoverable from the text]    (3)

where CW(ao_k) and CW(1EC_k) are, respectively, the correlation weight of the atomic orbital represented by the k-th vertex of the GAO and the correlation weight of the first-order Morgan extended connectivity of that vertex. The Monte Carlo method is then applied to determine the optimum correlation weight values, i.e. those which produce the largest possible value of the correlation coefficient between the physical property and the descriptor computed via Eq. (3). Numerical data for the GAO local invariants are listed in Table 1 and an illustrative example is reproduced in Table 2.

Table 1. Correlation weights for calculating DCW0 and DCW1




Table 2. Calculation of the DCW1 for 1,1,3,3-tetramethyldisilazane (DCW1 = 8.39793)


Since a complete and detailed description of these flexible descriptors has been given before, we refer the reader interested in further details to the specific papers where they are reported at length [25,26,27,28,29,30,31,32,33,34].
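The optimization of correlation weights described in this section can be illustrated with a minimal sketch. Several things here are assumptions and not taken from the paper: the descriptor is aggregated as a plain sum of weights (this is not claimed to be Eq. (3)), the vertex labels "a", "b", "c" are hypothetical stand-ins for GAO local invariants, the property values are synthetic, and the search is a simple greedy random walk rather than the authors' actual Monte Carlo scheme.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def descriptor(labels, weights):
    # ASSUMPTION: additive aggregation of the weights, for illustration only.
    return sum(weights[lab] for lab in labels)


def optimize_weights(molecules, prop, n_steps=2000, step=0.1, seed=0):
    """Greedy random search for weights that maximize r(descriptor, property)."""
    rng = random.Random(seed)
    keys = sorted({lab for mol in molecules for lab in mol})
    weights = {k: 1.0 for k in keys}  # arbitrary starting values

    def score(w):
        return pearson_r([descriptor(m, w) for m in molecules], prop)

    start = best = score(weights)
    for _ in range(n_steps):
        k = rng.choice(keys)
        old = weights[k]
        weights[k] = old + rng.uniform(-step, step)  # perturb one weight
        trial = score(weights)
        if trial > best:
            best = trial          # keep the improvement
        else:
            weights[k] = old      # otherwise undo the move
    return weights, start, best


# Synthetic toy set: each "molecule" is a list of hypothetical vertex labels.
toy_molecules = [
    ["a", "a", "b"], ["a", "b", "b"], ["a", "a", "a", "b"],
    ["b", "b", "b"], ["a", "c"], ["a", "b", "c", "c"],
]
toy_property = [10.0, 14.0, 12.0, 18.0, 9.0, 20.0]

weights, r_start, r_best = optimize_weights(toy_molecules, toy_property)
print(f"r before: {r_start:.3f}, r after: {r_best:.3f}")
```

Because moves are accepted only when the correlation coefficient increases, the final value can never be lower than the starting one, which mirrors the remark that flexible descriptors can only improve on fixed ones.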

Results and Discussion

We have chosen a representative set of 200 organic molecules of varied composition to study their normal boiling points (NBPs). These molecules, with both linear and cyclic structures, comprise ketones, acids, esters, aldehydes, nitriles, amines, alcohols, and hydrocarbons, and contain a wide variety of atoms, such as C, H, O, N, Si, Cl, Br, F, P, S. The list of molecules is given in Table 3, together with their NBPs and the descriptors based on the zero- and first-order extended connectivity in the GAOs (DCW0 and DCW1, respectively).

Table 3. Organic molecules, experimental NBPs (degrees Celsius) and DCWs.


First we treated the complete set with the zero- and first-order descriptors, obtaining the following linear relationships:

NBP = 50.24 + 10.91 DCW0    (4)
n = 200, r = 0.8910, S = 53.7, F = 763

NBP = 25.83 + 8.87 DCW1    (5)
n = 200, r = 0.892, S = 56.0, F = 783

where the statistical parameters have the usual meanings (n is the number of molecules, r the correlation coefficient, S the standard deviation and F the Fisher ratio).

The statistical quality is only moderately satisfactory, and when Eqs. (4) and (5) are used to predict NBPs there are relatively large deviations for a significant number of molecules.

We then followed the more usual procedure for dealing with a large number of molecules, which consists of defining two disjoint sets: a training set to determine the regression equation and a test set to perform true predictions. The results are as follows:

NBP = 49.16 + 10.89 DCW0    (6)
n = 150, r = 0.8841, S = 55.1, F = 530 (training set)
n = 50, r = 0.9120, S = 49.3, F = 237 (test set)

NBP = 23.72 + 8.96 DCW1    (7)
n = 150, r = 0.9328, S = 42.5, F = 530 (training set)
n = 50, r = 0.8766, S = 57.6, F = 237 (test set)

These results are somewhat better than the previous ones and large deviations occur for a smaller number of molecules. Since the choice of the molecules comprising the training and test sets is somewhat arbitrary, we have tested several partitions of the compounds, but the final results do not depend markedly on the way the molecules are assigned to the two sets.

Since some molecules deviate strongly, we removed them (just five of the 200 molecules: numbers 11, 15, 56, 98 and 146 according to the identification number n in Table 3). The results are the following:

NBP = 43.25 + 11.41 DCW0    (8)
n = 145, r = 0.9199, S = 46.8, F = 787 (training set)
n = 50, r = 0.9120, S = 46.6, F = 237 (test set)

If molecules 4, 15, 53, 91 and 98 are removed, the statistical results are

NBP = 22.50 + 9.10 DCW1    (9)
n = 145, r = 0.9530, S = 36.1, F = 1414 (training set)
n = 50, r = 0.8765, S = 53.9, F = 159 (test set)

These results show that removing a few deviant molecules improves the statistics markedly and yields somewhat better predictions.

A final numerical test was made in which the training and test sets were defined by a clustering approach [38]. The k-Means Cluster Analysis (k-MCA) may be used in the design of training and test (or predictive) series [39,40]. The idea consists of partitioning the series of compounds into several statistically representative classes of chemicals. One may then select the members of the training and prediction series from each of these classes. This procedure ensures that every chemical class (as determined by the clusters derived from the k-MCA) will be represented in both series of compounds (i.e. training and test sets). It permits the design of training and prediction series that are both representative of the entire experimental universe.

NBP = 53.09 + 11.39 DCW0    (10)
n = 158, r = 0.9586, S = 34.8, F = 1770 (complete set)

NBP = 54.28 + 11.45 DCW0    (11)
n = 126, r = 0.9633, S = 33.3, F = 1599 (training set)
n = 32, r = 0.9391, S = 39.1, F = 224 (test set)

NBP = 23.50 + 9.119 DCW1    (12)
n = 144, r = 0.9592, S = 33.9, F = 1633 (training set)
n = 37, r = 0.9564, S = 34.8, F = 376 (test set)

These last results are the best among the equations presented and they represent a clear improvement with respect to those given by Equations (4-9). An additional possibility would be to employ both descriptors together, but this is not feasible since they are strongly correlated, as shown in Figure 1.

We cannot make any direct comparison with other theoretical results since, to the best of our knowledge, the standard literature does not register any calculation for this particular molecular set. This is understandable, since the molecules are quite diverse and it is well known that working with sets of similar molecules gives better results than working with a quite dissimilar set, as in the present case. However, our aim has been precisely this: to build a regression for quite different molecules via simple linear equations based on a single molecular descriptor to predict NBPs. A complete listing of NBP results derived from using Eqs. (4-12) is available upon request from the corresponding author.
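As an illustration of how statistics of this kind can be obtained for a one-descriptor model, the following minimal sketch fits NBP = a + b·DCW on a training set and evaluates n, r, S and F on both sets. It assumes the conventional definitions (S as the residual standard deviation with n − 2 degrees of freedom, F = r²(n − 2)/(1 − r²)); the paper does not state how S and F were computed for the test sets, so that part is an assumption. The function names and the descriptor and NBP values are hypothetical and synthetic.

```python
import math


def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def f_from_r(r, n):
    """Fisher ratio of a one-descriptor regression, from r and n (assumed form)."""
    return r * r * (n - 2) / (1.0 - r * r)


def regression_stats(x, y, a, b):
    """Return (n, r, S, F) for the line y = a + b*x applied to the data (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # residual standard deviation
    return n, r, s, f_from_r(r, n)


# Synthetic descriptor values and boiling points, for illustration only.
x_train = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
y_train = [70.0, 95.0, 118.0, 135.0, 160.0, 185.0]
x_test = [3.0, 7.0, 11.0]
y_test = [85.0, 124.0, 170.0]

a, b = fit_line(x_train, y_train)          # equation from the training set only
print("training:", regression_stats(x_train, y_train, a, b))
print("test:    ", regression_stats(x_test, y_test, a, b))

# Consistency check against values reported in the text.
print(round(f_from_r(0.8910, 200)))  # Eq. (4) reports F = 763
print(round(f_from_r(0.8841, 150)))  # Eq. (6), training set, reports F = 530
```

Under the assumed definition, F follows from r and n alone, and the last two lines reproduce the F values quoted for Eq. (4) and for the training set of Eq. (6).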


Figure 1. DCW1 (vertical axis) versus DCW0 (horizontal axis). Regression equation: DCW1 = 2.978 + 1.222 DCW0.
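The reason two strongly correlated descriptors add little when used together can be sketched numerically. In the example below the DCW0 values and the noise level are hypothetical; only the line DCW1 = 2.978 + 1.222 DCW0 is taken from Figure 1. The variance inflation factor, VIF = 1/(1 − r²), is a standard collinearity diagnostic and is introduced here for illustration; it is not used in the paper.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def variance_inflation(r):
    """VIF for a two-descriptor model whose descriptors correlate with coefficient r."""
    return 1.0 / (1.0 - r * r)


rng = random.Random(0)
# Hypothetical DCW0 values; DCW1 follows the Figure 1 line plus assumed noise.
dcw0 = [float(v) for v in range(1, 31)]
dcw1 = [2.978 + 1.222 * v + rng.gauss(0.0, 1.0) for v in dcw0]

r = pearson_r(dcw0, dcw1)
vif = variance_inflation(r)
print(f"r(DCW0, DCW1) = {r:.3f}, VIF = {vif:.1f}")
```

A VIF far above 1 means that the coefficients of a two-descriptor regression would be poorly determined, since the second descriptor carries almost the same information as the first.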


Conclusions

We have presented results on NBPs for a quite diverse molecular set, based upon simple linear regression equations depending on a single molecular descriptor, in order to test the capability of a special kind of parameter: a flexible molecular descriptor. The results are very encouraging and they show the power of this type of topological variable. In fact, although there are some large deviations when employing the complete initial set of very diverse organic molecules, the average deviations are quite reasonable. In order to judge the relative merits of the present approach one must take into consideration that a single number is being used to represent a physical-chemistry property (i.e. the NBP) which evidently depends on many molecular features that cannot all be encoded in a single topological descriptor. To reproduce a given property accurately, it is necessary to resort to a multivariable regression equation in which each variable accounts for a different molecular feature. Furthermore, one usually employs a set of similar molecules, but our main purpose has not been to make exact numerical predictions, but rather to show the real possibilities of a particular kind of flexible topological descriptor. We consider that this objective has been fully met. The next step is to complement these calculations with a multivariable approach, choosing other molecular descriptors in order to add physical molecular features which are not included in the OCWLGI. Work along this line of research is under way and results will be presented elsewhere very soon.


References

  1. King, R. B. (Ed.) Chemical Applications of Topology and Graph Theory; Elsevier: Amsterdam, 1983.
  2. Diudea, M. V. (Ed.) QSPR/QSAR Studies by Molecular Descriptors; Nova Science Publishers, Inc.: Huntington, New York, 2001.
  3. Wiener, H. Structural Determination of Paraffin Boiling Points. J. Am. Chem. Soc. 1947, 69, 17–20.
  4. Hosoya, H. Topological Index. A Newly Proposed Quantity Characterizing the Topological Nature of Structural Isomers of Saturated Hydrocarbons. Bull. Chem. Soc. Jpn. 1971, 44, 2332–2339.
  5. Randic, M. On Characterization of Molecular Branching. J. Am. Chem. Soc. 1975, 97, 6609–6615.
  6. Trinajstic, N. Graph Theory; CRC Press: Boca Raton, FL, 1983.
  7. Harary, F. Graph Theory; Addison-Wesley: Reading, MA, 1969.
  8. Cramer, R. D. BC(DEF) Parameters. 2. An Empirical Structure-Based Scheme for the Prediction of Some Physical Properties. J. Am. Chem. Soc. 1980, 102, 1849–1859.
  9. Monnery, W. D.; Svrcek, W. Y.; Mehrotra, A. K. Viscosity: A Critical Review of Practical Predictive and Correlative Methods. Can. J. Chem. Eng. 1995, 73, 3–40.
  10. Stein, S. E.; Brown, R. L. Estimation of Normal Boiling Points from Group Contributions. J. Chem. Inf. Comput. Sci. 1994, 34, 581–587.
  11. Pouchly, J.; Quin, A.; Munk, P. Excess Volume of Mixing and Equation of State Theory. J. Solution Chem. 1993, 22, 399–418.
  12. Elbro, H. S.; Fredenslund, A.; Rasmussen, P. Group Contribution Method for the Prediction of Liquid Densities as a Function of Temperatures for Solvents, Oligomers and Polymers. Ind. Eng. Chem. Res. 1991, 30, 2576–2593.
  13. Fisher, C. H. Boiling Point Gives Critical Temperatures. Chem. Eng. 1989, 96, 157–158.
  14. Satyanarayana, K.; Kakati, M. C. Note: Correlation of Flash Points. Fire Mater. 1991, 15, 97–100.
  15. Rechsteiner, C. E. Handbook of Chemical Property Estimation Methods; Lyman, W. J., Reehl, W. F., Rosenblatt, D. H., Eds.; McGraw-Hill: New York, 1982; Chapter 12.
  16. Katritzky, A. R.; Mu, L.; Lobanov, V. S.; Karelson, M. Correlation of Boiling Points with Molecular Structure. 1. A Training Set of 298 Diverse Organics and a Test Set of 9 Simple Inorganics. J. Phys. Chem. 1996, 100, 10400–10407.
  17. Horvath, A. L. Molecular Design: Chemical Structure Generation from the Properties of Pure Organic Compounds; Elsevier: Amsterdam, 1992.
  18. Wessel, M. D.; Jurs, P. C. Prediction of Normal Boiling Points for a Diverse Set of Industrially Important Organic Compounds from Molecular Structure. J. Chem. Inf. Comput. Sci. 1995, 35, 841–850.
  19. Lee, T. D.; Weers, J. G. QSPR and GCA Models for Predicting the Normal Boiling Points of Fluorocarbons. J. Phys. Chem. 1995, 99, 6739–6747.
  20. Komasa, A. Prediction of Boiling Points of Ketones Using a Quantitative Structure-Property Relationships Treatment. Polish J. Chem. 2003, 77, 1491–1499.
  21. Kompany-Zareh, M. A QSPR Study of Boiling Point of Saturated Alcohols Using Genetic Algorithm. Acta Chim. Slov. 2003, 50, 259–273.
  22. Öberg, T. Boiling Points of Halogenated Aliphatic Compounds: A Quantitative Structure-Property Relationship for Prediction and Validation. J. Chem. Inf. Comput. Sci. 2004, 44, 187–192.
  23. Randic, M.; Basak, S. C. Variable Molecular Descriptors. In Some Aspects of Mathematical Chemistry; Sinha, D. K., Basak, S. C., Mohanty, R. K., Busamallick, I. N., Eds.; Visva-Bharati University Press: Santiniketan (India), 1999.
  24. Randic, M. Novel Graph Theoretical Approach to Heteroatoms in QSAR. Chemom. Intell. Lab. Syst. 1991, 10, 213–223.
  25. Toropova, A. P.; Toropov, A. A.; Ishankhodzhaeva, M. M.; Parpiev, N. A. QSPR Modeling of Stability Constants of Coordination Compounds by Optimization Weights of Local Graph Invariants. Russ. J. Inorg. Chem. 2000, 45, 1057–1059.
  26. Toropov, A. A.; Voropaeva, N. L.; Ruban, I. N.; Rashidova, S. Sh. Quantitative Structure-Property Relationships for Binary Polymer-Solvent Systems: Correlation Weighting of the Local Invariants of Molecular Graphs. Polymer Science Ser. A 1999, 41, 975–985.
  27. Toropov, A.; Toropova, A.; Ismailov, T.; Bonchev, D. 3D Weighting of Molecular Descriptors for QSPR/QSAR by the Method of Ideal Symmetry (MIS). 1. Application to Boiling Points of Alkanes. J. Mol. Struct. THEOCHEM 1998, 424, 237–247.
  28. Krenkel, G.; Castro, E. A.; Toropov, A. A. Improved Molecular Descriptors Based on the Optimization of Correlation Weights of Local Graphs. Int. J. Molec. Sci. 2001, 2, 57–65.
  29. Toropov, A. A.; Toropova, A. A. Prediction of Heteroatomic Amine Mutagenicity by Means of Correlation Weighting of Atomic Orbital Graphs of Local Invariants. J. Mol. Struct. THEOCHEM 2001, 538, 287–293.
  30. Toropov, A. A.; Toropova, A. P. Modeling the Lipophilicity by Means of Correlation Weighting of Local Graph Invariants. J. Mol. Struct. THEOCHEM 2001, 538, 197–199.
  31. Mercader, A.; Castro, E. A.; Toropov, A. A. QSPR Modeling the Enthalpy of Formation from Elements by Means of Correlation Weighting of Local Invariants of Atomic Orbital Molecular Graphs. Chem. Phys. Lett. 2000, 330, 612–623.
  32. Toropov, A. A.; Toropova, A. P. QSAR Modeling of Toxicity on Optimization of Correlation Weights of Morgan Extended Connectivity. J. Mol. Struct. THEOCHEM 2002, 578, 129–134.
  33. Toropov, A. A.; Toropova, A. P. QSPR Modeling of Alkanes Properties Based on Graph of Atomic Orbitals. J. Mol. Struct. THEOCHEM 2003, 637, 1–10.
  34. Toropov, A. A.; Nesterov, I. V.; Nabiev, O. M. QSPR Modeling of Cycloalkanes Properties by Correlation Weighting of Extended Graph Valence Shells. J. Mol. Struct. THEOCHEM 2003, 637, 37–42.
  35. Basak, S. C.; Grunwald, G. D. Predicting Mutagenicity of Chemicals Using Topological and Quantum Chemical Parameters: A Similarity Based Study. Chemosphere 1995, 31, 2529.
  36. Toropov, A. A.; Toropova, A. P. QSPR Modeling of the Formation Constants for Complexes Using Atomic Orbital Graphs. Russ. J. Coord. Chem. 2000, 26, 398.
  37. Toropov, A. A.; Toropova, A. P. Optimization of Correlation Weights of the Local Graph Invariants: Use of the Enthalpies of Formation of Complex Compounds for the QSPR Modeling. Russ. J. Coord. Chem. 1998, 24, 81.
  38. Pérez-González, M.; González Díaz, H.; Molina Ruiz, R.; Cabrera, M. A.; Ramos de Armas, R. TOPS-MODE Based QSARs Derived from Heterogeneous Series of Compounds. Applications to the Design of New Herbicides. J. Chem. Inf. Comput. Sci. 2003, 43, 1192–1199.
  39. Kowalski, R. B.; Wold, S. Pattern Recognition in Chemistry. In Handbook of Statistics; Krishnaiah, P. R., Kanal, L. N., Eds.; North Holland Publishing Company: Amsterdam, 1982; pp. 673–697.
  40. McFarland, J. W.; Gans, D. J. Cluster Significance Analysis. In Methods and Principles in Medicinal Chemistry; Mannhold, R., Krogsgaard-Larsen, P., Timmerman, H., Eds.; VCH: Weinheim, 1995; Vol. 2 (Chemometric Methods in Molecular Design, van de Waterbeemd, H., Ed.); pp. 295–307.

© 2004 by MDPI (http://www.mdpi.org). Reproduction is permitted for noncommercial purposes.
