|
This page shows and explains the output from the
Molecular Modeling Pro Plus multiple linear regression routine. The output contains
several parts:
- correlation matrix, eigenvalues and eigenvectors
- analysis of variance table
- the regression model (y = intercept + a*x(1) + b*x(2) +
c*x(3)...)
- table of observed, predicted and residual values
for all compounds in the database
- plot of predicted versus observed values
- plot of predicted versus residual values
- PRESS analysis for all compounds in the database
and cross-validation
- table of compounds sorted by highest predicted values
Correlation matrix for regression variables:
Note: Log Kow and MR are highly correlated, which may cause problems
with the analysis...
| | Boiling point (C) | Log Kow | MR |
| Boiling point (C) | 1. | 0.29434 | 0.61573 |
| Log Kow | 0.29434 | 1. | 0.864 |
| MR | 0.61573 | 0.864 | 1. |
The Eigenvalues found after 1 rotations are:
W 1 = 1.846697 ; W 2 = .1533031 ;
The proportion of variance of each component is:
W 1 = .9233485 ; W 2 = 7.665153E-02 ;
The corresponding eigenvectors are:
W 1 W 2
Log P .7071068 .7071068
MR .7071068 -.7071068
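The eigenvalue printout above can be checked by hand, because a correlation matrix for two variables has a closed-form decomposition. The following is a minimal sketch, not the routine's own code: for the matrix [[1, r], [r, 1]] the eigenvalues are 1 + r and 1 - r, and the eigenvectors are (1, 1)/sqrt(2) and (1, -1)/sqrt(2). The function name `two_variable_components` is hypothetical, and the value of r is the one implied by the printed W 1 (1.846697 - 1).

```python
import math


def two_variable_components(r):
    """Eigenvalues, variance proportions and eigenvectors of [[1, r], [r, 1]]."""
    eigenvalues = (1.0 + r, 1.0 - r)
    total = sum(eigenvalues)  # equals 2, the number of variables
    proportions = tuple(w / total for w in eigenvalues)
    c = 1.0 / math.sqrt(2.0)
    # Each tuple is one eigenvector, given as (loading on Log P, loading on MR).
    eigenvectors = ((c, c), (c, -c))
    return eigenvalues, proportions, eigenvectors


# r implied by the printed first eigenvalue (assumption: W 1 = 1 + r).
values, proportions, vectors = two_variable_components(0.846697)
print(values, proportions, vectors[0])
```

Each proportion is simply the eigenvalue divided by the number of variables, which is why W 1 = 1.846697 corresponds to about 92% of the variance. Note that the printed eigenvalues imply r of about 0.8467, slightly lower than the 0.864 shown in the correlation matrix; the page does not explain the difference.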
Analysis of variance
| Variation source | df | sum of squares | mean square | Statistics |
| Total (uncorrected) | 296 | 7652412 | | F = 187.32 |
| Mean | 1 | 5624701.82 | | r squared = 0.56114 |
| Total (corrected) | 295 | 2027709.71 | | s = 55.11 |
| Regression | 2 | 1137819.96 | 568909.98 | |
| Residual | 293 | 889889.8 | 3037.17 | |
Note: probability of significant F =<0.0001
The above table shows that 56% of the variance in solvent boiling
points is explained by Log Kow (calculated) and molar refractivity (MR, calculated); we
can say this because r squared is 0.56114. The standard deviation of 55.11 degrees
C may be larger than we can accept.
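The three statistics in the right-hand column of the analysis of variance table follow directly from the regression and residual rows. As a minimal sketch (the function name `anova_summary` is hypothetical, and the formulas are the standard ones for least-squares regression rather than anything confirmed about the program's internals):

```python
import math


def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Derive F, r squared and s from the regression and residual rows."""
    ms_regression = ss_regression / df_regression  # mean square = SS / df
    ms_residual = ss_residual / df_residual
    ss_total_corrected = ss_regression + ss_residual
    return {
        "F": ms_regression / ms_residual,
        "r_squared": ss_regression / ss_total_corrected,
        "s": math.sqrt(ms_residual),  # standard deviation of the residuals
    }


# Values copied from the table above.
stats = anova_summary(1137819.96, 889889.8, 2, 293)
print({name: round(value, 5) for name, value in stats.items()})
```

With the table's values this reproduces F of about 187.32, r squared of about 0.56114 and s of about 55.11 to the precision printed.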
The Model:
Model coefficients and standard errors
| parameter | coefficient | standard error | t | probability |
| intercept | 28.53 | 6.51 | 4.38 | 0.0000162 |
| Log Kow | -27.99 | 2.54 | 11.02 | 6.82 E-24 |
| MR | 4.768 | 0.2679 | 17.80 | 1.21 E-48 |
note: response variable: Literature Boiling Point (C)
The model is: boiling point = 28.53 - 27.99*Log Kow +
4.768*MR (coefficients from column 2). All variables easily meet the criterion that the
probability of their effect being due to chance is less than 0.05 (far right column).
The standard error of MR is fairly low relative to its coefficient; that of Log Kow is
somewhat higher.
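To make the model and the t column concrete, here is a small sketch that evaluates the fitted equation and recomputes each t value as the absolute ratio of coefficient to standard error (the table prints t without a sign). The function names and the descriptor values passed to `predict_boiling_point` are hypothetical and are not taken from the database.

```python
# (coefficient, standard error) pairs copied from the table above.
COEFFICIENTS = {
    "intercept": (28.53, 6.51),
    "Log Kow": (-27.99, 2.54),
    "MR": (4.768, 0.2679),
}


def predict_boiling_point(log_kow, mr):
    """Evaluate boiling point = 28.53 - 27.99*Log Kow + 4.768*MR (degrees C)."""
    return (COEFFICIENTS["intercept"][0]
            + COEFFICIENTS["Log Kow"][0] * log_kow
            + COEFFICIENTS["MR"][0] * mr)


def t_ratio(coefficient, standard_error):
    """Absolute t value: how many standard errors the coefficient is from zero."""
    return abs(coefficient / standard_error)


for name, (coef, se) in COEFFICIENTS.items():
    # Relative standard error shows how precisely each coefficient is estimated.
    print(name, round(t_ratio(coef, se), 2), round(abs(se / coef), 3))

# Hypothetical compound with Log Kow = 1.0 and MR = 20.0.
print(round(predict_boiling_point(1.0, 20.0), 2))
```

The relative standard error comes out near 0.06 for MR and near 0.09 for Log Kow, which is the sense in which the MR coefficient is the better determined of the two.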
Printout of response values, predicted values and residuals:
| | observed | predicted | residual |
| acetal | 102 | 160.71 | -58.7105 |
| acetaldehyde | 21 | 97.6349 | -76.6349 |
| acetic acid | 117 | 105.186 | 11.814 |
| acetic anhydride | 139 | 159.054 | -20.0538 |
... and so on until all 300+ observations are printed out...
Examine the above table to find outliers (compounds poorly
predicted and having large residuals). You may want to take them out and redo the
analysis. You can also find outliers on the next table. In MAP you can click
on a data point on the plot to learn its identity.
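Scanning a long residual table by eye is tedious, so one possible approach is to sort the rows by the size of the residual (observed minus predicted). The sketch below uses only the four rows shown above; the function name `rank_by_residual` is hypothetical and this is not a feature of the program itself.

```python
# (compound, observed, predicted) rows copied from the table above.
ROWS = [
    ("acetal", 102, 160.71),
    ("acetaldehyde", 21, 97.6349),
    ("acetic acid", 117, 105.186),
    ("acetic anhydride", 139, 159.054),
]


def rank_by_residual(rows):
    """Return (compound, residual) pairs, largest absolute residual first."""
    scored = [(name, observed - predicted) for name, observed, predicted in rows]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


for name, residual in rank_by_residual(ROWS):
    print(f"{name:20s} {residual:10.4f}")
```

Among these four compounds acetaldehyde has the largest residual, so it would be the first candidate to inspect as a possible outlier.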
Plot of predicted versus observed:

Plot of predicted versus residuals:

The above plot should be a random scatter (uncorrelated).
However, this plot is not ideal: the negative residuals fall into 2 clear groups.
This leads one to suspect there may be a missing variable. Other things
to look for in the plot of residuals are curves and funnel shapes. These effects
indicate that the data should be transformed (e.g. take the reciprocal, add a square term
or a cross-product term, take the log, etc.).
PRESS analysis and cross validation:
Contributions to PRESS (Predictive Residual Sum of Squares):
Compound Predictive discrepancy
-------------------------- ----------------------
acetal                       3482.369
acetaldehyde                 5949.886
acetic acid                  141.3421
acetic anhydride             408.8725
acetol                       156.2702
...and so on until all compounds
have been listed. The compounds with larger numbers are more influential in
determining the model coefficients...
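This page does not say how the routine computes each contribution. A common definition, assumed here, is leave-one-out: each compound is removed in turn, the model is refitted to the rest, and the squared error of predicting the removed compound is recorded. The sketch below illustrates that idea with a one-variable straight-line fit and made-up data; the function names and the numbers in `xs` and `ys` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press_contributions(xs, ys):
    """Squared leave-one-out prediction error for each observation."""
    contributions = []
    for i in range(len(xs)):
        # Refit without observation i, then predict observation i.
        intercept, slope = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predicted = intercept + slope * xs[i]
        contributions.append((ys[i] - predicted) ** 2)
    return contributions


# Made-up illustration data, not from the solvent database.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
contributions = press_contributions(xs, ys)
print(contributions, sum(contributions))  # the sum is the total PRESS
```

Under this definition a compound's contribution is never smaller than its ordinary squared residual, which is consistent with the listing above: acetal's ordinary residual of -58.7105 squares to about 3447, a little less than its PRESS contribution of 3482.369.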
Total PRESS = 929350.7
Sum of squares of response (SSY) = 2027709.70570218
Press/SSY = .458325311363807
The model has failed the cross-validation criterion (PRESS/SSY <0.4).
The model marginally failed the cross-validation test. The
mediocre r squared of 0.56, the high standard deviation, the intercorrelated independent
variables and the failed cross-validation test all are indicators that we can do better...
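The cross-validation verdict is a simple ratio test, which the following sketch reproduces from the two totals printed above. The function name `cross_validation_check` is hypothetical; the 0.4 threshold is the criterion quoted in the output.

```python
def cross_validation_check(press, ssy, threshold=0.4):
    """Return the PRESS/SSY ratio and whether it passes (ratio < threshold)."""
    ratio = press / ssy
    return ratio, ratio < threshold


# Totals copied from the output above.
ratio, passed = cross_validation_check(929350.7, 2027709.70570218)
print(round(ratio, 6), passed)
```

Here the ratio is about 0.458, above the 0.4 limit, so the check fails. Equivalently, 1 - PRESS/SSY is about 0.54, slightly below the fitted r squared of 0.56, as expected when each compound is predicted from a model that did not include it.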
Compounds with highest predicted values:
triisononyl trimellitate ( 461.5005)
triisooctyl trimellitate ( 439.8288)
ditridecyl phthalate ( 389.5374)
triethylene glycol oleyl ether ( 376.9923)
dibutyl stearate ( 359.3761)
diisodecyl phthalate ( 353.4654)
diisodecyl phthalate ( 353.4654)
diisononyl adipate ( 352.1996)
diisononyl phthalate ( 339.0171)
diisooctyl phthalate ( 324.5693)
triacetin ( 318.6974)
dioctyl phthalate ( 317.2979)
diisoheptyl phthalate ( 310.1215)
dibutyl sebacate ( 308.8088)
|