This lesson assumes that you have just started SYBYL. If you are already in the custom version of SYBYL, Zap all existing molecules and Delete all Backgrounds before beginning this lesson. Here we will open a SYBYL molecular database of halocarbons and model their anaesthetic potency and toxicity with a large set of standard (1D) QSAR descriptors.
From the File pulldown on the menubar, select Molecular Spreadsheet and New.... The rows will represent Molecules. In the DATABASE_FILE dialog box, choose halocarbons.mdb and press Open. Once the spreadsheet has been initialized and appears, import the biological data for columns 1 and 2: from the File menu on the Molecular Spreadsheet, choose the Import... item. On the resulting Import dialog, choose Format: Tripos and enter ad50ld50.tripos in the File: text area. Press Import to load the first 2 columns of the MSS with the halocarbon anaesthetic potency and toxicity data.
From the spreadsheet menubar, select the AutoFill button (and choose a new Column). Select MCONNIDX as the New column type to call the MolconnZ MSS Dialog box. Over 300 descriptors are available, but we will choose a subset to reduce wasted computational effort and to make the results easier to interpret:
block 1: HCHnX HCsats Hother
block 3: sCH3 ssCH2 sssCH ssssC
block 5: sF
block 6: sCl
block 7: sBr sI
Press OK to select these Atom-type E-State descriptors.
Note that if the check box for a category on the Main MSS Dialog is not on, it does not matter which descriptors are activated under that category.
The parameters under the Algorithm Options... button on the MSS Dialog are fine as they are, so press OK to begin filling the Spreadsheet. Occasionally a Sybyl dialog will appear asking whether the Fill method is Cell or Column; use Column, although it doesn't seem to matter. When Molconn-Z has finished, there should be 44 columns in total in the MSS.
From the QSAR pulldown, select Partial Least Squares... to call the Partial Least Squares Analysis dialog box. The Dependent Column should be 1, and Column 2 should be deleted from the List of Columns to Use as it is measured biological data for a parallel analysis. Select Leave-1-Out Validation, Use SAMPLS: off, Column Filtering: off, 10 Components, and Scaling: Autoscale. This run will take less than 1 minute, so you may run it either interactively or in batch. If you run in batch, get a report on the results with QSAR, Report QSAR...; enter qsar.lis as the File name to receive the QSAR report. To review the report, spawn or create a unix window and edit or list the report. If you run it interactively, be sure to choose Yes for Keep this analysis? In this run the optimum number of components is reported as 10 (see below) and the cross-validated r2 is 0.939.
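The two validation settings above, Autoscale and Leave-1-Out, can be illustrated outside SYBYL with a short Python sketch. It is not SYBYL's implementation. A one-descriptor least-squares line stands in for the PLS model, the sample standard deviation is assumed for autoscaling, the cross-validated r2 is assumed to be 1 - PRESS/SS, and the data are made-up toy numbers rather than the halocarbon set. All function names are hypothetical.

```python
from statistics import mean, stdev


def autoscale(column):
    """Centre a column on its mean and scale it to unit (sample) standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def fit_line(xs, ys):
    """Ordinary least-squares line; a simple stand-in for the PLS model."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


def loo_q2(xs, ys, fit=fit_line, predict=predict_line):
    """Leave-one-out cross-validated r2, assumed here to be 1 - PRESS / SS."""
    press = 0.0
    for i in range(len(ys)):
        # Refit the model with compound i left out, then predict compound i.
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - predict(model, xs[i])) ** 2
    y_bar = mean(ys)
    ss = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - press / ss


if __name__ == "__main__":
    # Toy descriptor and activity values, invented for illustration only.
    descriptor = autoscale([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    activity = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
    print("cross-validated r2:", round(loo_q2(descriptor, activity), 3))
```

Each compound is predicted by a model that never saw it, which is why the cross-validated r2 is normally lower than the conventional r2 obtained with No Validation.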
Summary output

Standard Error of Predictions (Crossvalidated)

    Run #          Comp1     Comp2     Comp3     Comp4     Comp5     Comp6     Comp7
    --- -      --------- --------- --------- --------- --------- --------- ---------
      1 1 AD50     0.795     0.460     0.442     0.443     0.425     0.409     0.376

    Run #          Comp8     Comp9    Comp10
    --- -      --------- --------- ---------
      1 1 AD50     0.381     0.383     0.386

    Optimum # of components is 10.
    R squared 0.926
Examination of the output reveals that Sybyl has had a very hard time defining the optimum number of components. The standard error for Comp2 (0.460) is not much different from that for Comp10 (0.386), so two components could equally well have been selected. The generally accepted approach is to take as the optimum number of components the point beyond which adding further components gives little (ca. 2-5%) improvement, so reporting the best model as having two or three components is quite reasonable.
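The rule of thumb above can be applied mechanically to the cross-validated standard errors in the summary output. The Python sketch below is one possible reading of that rule, not anything SYBYL does itself; the function name and the way the threshold is applied are assumptions.

```python
# Cross-validated standard errors of prediction for 1..10 components,
# copied from the summary output above.
SEP = [0.795, 0.460, 0.442, 0.443, 0.425, 0.409, 0.376, 0.381, 0.383, 0.386]


def pick_components(sep, min_gain=0.05):
    """Return the smallest number of components k for which adding
    component k+1 lowers the standard error by less than min_gain
    (expressed as a fraction of the current standard error)."""
    for k in range(1, len(sep)):
        # sep[k - 1] is the error with k components, sep[k] with k + 1.
        gain = (sep[k - 1] - sep[k]) / sep[k - 1]
        if gain < min_gain:
            return k
    return len(sep)


if __name__ == "__main__":
    for threshold in (0.05, 0.02):
        print(f"threshold {threshold:.0%}: {pick_components(SEP, threshold)} components")
```

With a 5% threshold this stops at two components, and with a 2% threshold at three, which matches the range suggested in the text.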
Re-run the PLS analysis with No Validation and 3 Components (the n1= 3 in the report below corresponds to a three-component model). This analysis produces the conventional r2 of 0.916.
    Standard Error of Estimate                  0.372
    R squared                                   0.916
    F values     ( n1= 3, n2=38 )             137.489
    Prob.of R2=0 ( n1= 3, n2=38 )               0.000

Regression Equation(s)

    AD50 = 8.589
         - (0.161) * Xv0_3
         - (0.412) * Xv1_4
         - (0.151) * Xv2_5
         - (0.293) * Xvp3_6
         - (0.482) * Xvp4_7
         - (0.025) * Xvc3_14
         + (0.658) * Xvc4_15
         - (0.117) * Xvpc4_16
         - (0.013) * ka1_25
         - (0.535) * ka2_26
         - (0.086) * ka3_27
         - (0.573) * phia_28
         + (0.026) * SwHBd_29
         + (0.026) * Hmax_30
         + (0.027) * Hmin_32
         + (0.014) * Gmin_33
         - (1.454) * SHCHnX_34
         - (0.131) * SHother_36
         + (0.066) * SsCH3_37
         - (0.210) * SssssC_40
         + (0.011) * SsF_41
         - (0.190) * SsBr_43
         - (0.275) * SsI_44

Summary output

    Standard Error of Estimate                  0.372
    R squared                                   0.916
    F values     ( n1= 3, n2=38 )             137.489
    Prob.of R2=0 ( n1= 3, n2=38 )               0.000
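As a quick consistency check on the report, the F value can be recomputed from R squared and the two degrees of freedom printed beside it. The sketch below assumes the usual regression relationship F = (r2 / n1) / ((1 - r2) / n2). Whether SYBYL computes F in exactly this way is an assumption, and the function name is hypothetical.

```python
def f_from_r2(r2, n1, n2):
    """F statistic implied by r2 with n1 (numerator) and n2 (denominator)
    degrees of freedom, assuming F = (r2 / n1) / ((1 - r2) / n2)."""
    return (r2 / n1) / ((1.0 - r2) / n2)


if __name__ == "__main__":
    # Values copied from the report above.
    reported_f = 137.489
    recomputed = f_from_r2(0.916, 3, 38)
    print(f"recomputed F = {recomputed:.1f}, reported F = {reported_f}")
```

The recomputed value lands within about one unit of the reported 137.489. The small gap is what one would expect from R squared being printed to only three decimals.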