Multiple linear regression output from Molecular Modeling Pro Plus

This page shows and explains the output from the Molecular Modeling Pro Plus multiple linear regression routine.  The output contains several parts:

  • correlation matrix, eigenvalues and eigenvectors
  • analysis of variance table
  • the regression model (y = intercept + a*x(1) + b*x(2) + c*x(3)...)
  • table of observed, predicted and residual values for all compounds in the data base
  • plot of predicted versus observed values
  • plot of predicted versus residual values
  • PRESS analysis for all compounds in the data base and cross-validation
  • table of compounds sorted by highest values

Correlation matrix for regression variables:
  Note: Log Kow and MR are highly correlated, which may cause problems with the analysis...

                    Boiling point (C)   Log Kow   MR
Boiling point (C)   1.                  0.29434   0.61573
Log Kow             0.29434             1.        0.864
MR                  0.61573             0.864     1.


The Eigenvalues found after 1 rotations are:
W 1 = 1.846697 ; W 2 = .1533031 ;

The proportion of variance of each component is:
W 1 = .9233485 ; W 2 = 7.665153E-02 ;

The corresponding eigenvectors are:

        W 1        W 2
Log P   .7071068   .7071068
MR      .7071068   -.7071068
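
The proportion of variance for each component is its eigenvalue divided by the sum of the eigenvalues (for a correlation matrix, that sum equals the number of variables). The following is a minimal Python sketch of that arithmetic using the eigenvalues printed above; the variable names are illustrative and are not part of the program.

```python
# Eigenvalues copied from the output above.
eigenvalues = [1.846697, 0.1533031]

# For a correlation matrix the eigenvalues sum to the number of variables (here 2).
total = sum(eigenvalues)

# Proportion of variance carried by each component.
proportions = [w / total for w in eigenvalues]

for i, p in enumerate(proportions, start=1):
    print(f"W {i} = {p:.7f}")
```

A first component carrying about 92% of the variance is another sign that the two predictors are largely redundant.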

Analysis of variance
Variation source      df    sum of squares   mean square
Total (uncorrected)   296   7652412
Mean                  1     5624701.82
Total (corrected)     295   2027709.71
Regression            2     1137819.96       568909.98
Residual              293   889889.8         3037.17

Statistics: F = 187.32; r squared = 0.56114; s = 55.11

Note: probability of significant F =<0.0001
The above table shows that 56% of the variance in solvent boiling points is explained by Log Kow (calculated) and molar refractivity (MR, calculated) (we can say this because r squared is 0.56114).  The standard deviation of 55.11 degrees C may be larger than we can accept.
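
The three statistics can be recovered from the sums of squares and degrees of freedom in the table. The sketch below, in Python, shows the arithmetic; the variable names are illustrative, and the inputs are simply the rounded values printed above, so the results agree with the printed statistics only to the displayed precision.

```python
import math

# Values copied from the analysis of variance table above.
ss_total_corrected = 2027709.71
ss_regression = 1137819.96
ss_residual = 889889.8
df_regression = 2
df_residual = 293

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# r squared: fraction of the corrected total variation explained by the model.
r_squared = ss_regression / ss_total_corrected

# F: regression mean square over residual mean square.
f_statistic = ms_regression / ms_residual

# s: square root of the residual mean square.
s = math.sqrt(ms_residual)

print(f"r squared = {r_squared:.5f}")
print(f"F = {f_statistic:.2f}")
print(f"s = {s:.2f}")
```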

The Model:
Model coefficients and standard errors

parameter   coefficient   standard error   t       probability
intercept   28.53         6.51             4.38    0.0000162
Log Kow     -27.99        2.54             11.02   6.82 E-24
MR          4.768         0.2679           17.80   1.21 E-48

note: response variable: Literature Boiling Point (C)

The model is:   boiling point = 28.53 - 27.99*Log Kow + 4.768*MR (coefficients from column 2).  Both variables easily meet the criterion that the probability of their effect being due to chance is less than 0.05 (far right column).  The standard error of MR is fairly low relative to its coefficient, while that of Log Kow is somewhat higher.
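
As an illustration, the t values in the table are the coefficients divided by their standard errors, and the model can be written as a small function. This is a Python sketch, not part of the program; the function name and the input values in the usage line are hypothetical. The printed t for Log Kow is positive although its coefficient is negative, so the sketch assumes the table reports the absolute value.

```python
# (coefficient, standard error) pairs copied from the table above.
coefficients = {
    "intercept": (28.53, 6.51),
    "Log Kow": (-27.99, 2.54),
    "MR": (4.768, 0.2679),
}

# t = coefficient / standard error; the absolute value is taken on the
# assumption that the table prints magnitudes (Log Kow is shown as 11.02).
t_values = {name: abs(b / se) for name, (b, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name:10s} t = {t:.2f}")


def predict_boiling_point(log_kow, mr):
    """Predicted boiling point (C) from the fitted model above."""
    return 28.53 - 27.99 * log_kow + 4.768 * mr


# Hypothetical inputs chosen only to show the call; not taken from the data base.
print(predict_boiling_point(log_kow=1.0, mr=20.0))
```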

Printout of response values, predicted values and residuals:
                   observed   predicted   residual
acetal             102        160.71      -58.7105
acetaldehyde       21         97.6349     -76.6349
acetic acid        117        105.186     11.814
acetic anhydride   139        159.054     -20.0538

... and so on until all 300+ observations are printed out...

Examine the above table to find outliers (compounds that are poorly predicted and have large residuals).  You may want to take them out and redo the analysis.  You can also find outliers on the plots that follow.  In MAP you can click on a data point on the plot to learn its identity.
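
Each residual is the observed value minus the predicted value. One way to scan a long table for outliers is to rank compounds by the size of the residual, as in the Python sketch below. It uses only the four rows shown above; dividing by s (55.11, from the analysis of variance) to put residuals on a common scale is a common rule of thumb assumed here, not something the program prints.

```python
# (observed, predicted) boiling points copied from the rows shown above.
rows = {
    "acetal": (102, 160.71),
    "acetaldehyde": (21, 97.6349),
    "acetic acid": (117, 105.186),
    "acetic anhydride": (139, 159.054),
}
s = 55.11  # standard deviation from the analysis of variance

# residual = observed - predicted
residuals = {name: obs - pred for name, (obs, pred) in rows.items()}

# Rank compounds by the size of the residual, largest first.
ranked = sorted(residuals, key=lambda name: abs(residuals[name]), reverse=True)

for name in ranked:
    r = residuals[name]
    print(f"{name:18s} residual = {r:9.4f}   residual / s = {r / s:6.2f}")
```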

Plot of predicted versus observed:

[Figure Mlr1.gif: plot of predicted versus observed values]

Plot of predicted versus residuals:

[Figure Mlr2.gif: plot of predicted versus residual values]

The above plot should be a perfect scatter plot (uncorrelated).  However, this plot is not perfect: the negative residuals fall into 2 clear groups, which leads one to suspect there may be a missing variable.  Other things to look for in the plot of residuals are curves and funnel shapes.  These effects indicate that the data should be transformed (e.g. take the reciprocal, add a square term, add a cross-product term, take the log, etc.).

PRESS analysis and cross validation:

Contributions to PRESS (Predictive Residual Sum of Squares):

Compound           Predictive discrepancy
----------------   ----------------------
acetal             3482.369
acetaldehyde       5949.886
acetic acid        141.3421
acetic anhydride   408.8725
acetol             156.2702

...and so on until all compounds have been listed.  The compounds with larger numbers are more influential in determining the model coefficients...

Total PRESS = 929350.7
Sum of squares of response (SSY) = 2027709.70570218
Press/SSY = .458325311363807
The model has failed the cross-validation criterion (PRESS/SSY <0.4).
The model marginally failed the cross-validation test.  The mediocre r squared of 0.56, the high standard deviation, the intercorrelated independent variables and the failed cross-validation test all are indicators that we can do better...
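
The cross-validation test is a simple ratio check, sketched below in Python with the totals printed above. The sketch also compares the square root of each listed PRESS contribution with the ordinary residual from the earlier table. The listed values are consistent with squared leave-one-out prediction errors, but that reading is inferred from the numbers and is not stated in the output; the variable names are illustrative.

```python
import math

# Totals copied from the output above.
press = 929350.7
ssy = 2027709.70570218

ratio = press / ssy
passes = ratio < 0.4  # criterion quoted in the output

print(f"PRESS/SSY = {ratio:.6f}  passes: {passes}")

# PRESS contributions and ordinary residuals copied from the tables above.
press_contributions = {
    "acetal": 3482.369,
    "acetaldehyde": 5949.886,
    "acetic acid": 141.3421,
    "acetic anhydride": 408.8725,
}
ordinary_residuals = {
    "acetal": -58.7105,
    "acetaldehyde": -76.6349,
    "acetic acid": 11.814,
    "acetic anhydride": -20.0538,
}

# Assumption: each contribution is a squared prediction error for a compound
# left out of the fit, so its square root should be slightly larger in size
# than the ordinary residual.
for name, contribution in press_contributions.items():
    predictive_error = math.sqrt(contribution)
    print(f"{name:18s} {predictive_error:8.3f} vs {abs(ordinary_residuals[name]):8.3f}")
```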


Compounds with highest predicted values:

triisononyl trimellitate ( 461.5005)
triisooctyl trimellitate ( 439.8288)
ditridecyl phthalate ( 389.5374)
triethylene glycol oleyl ether ( 376.9923)
dibutyl stearate ( 359.3761)
diisodecyl phthalate ( 353.4654)
diisodecyl phthalate ( 353.4654)
diisononyl adipate ( 352.1996)
diisononyl phthalate ( 339.0171)
diisooctyl phthalate ( 324.5693)
triacetin ( 318.6974)
dioctyl phthalate ( 317.2979)
diisoheptyl phthalate ( 310.1215)
dibutyl sebacate ( 308.8088)
