Saturday, September 22, 2007
SAS/SQL
Basic Rules
* A PROC SQL step ends with a "QUIT;" statement.
* Individual SQL statements go between "PROC SQL;" and "QUIT;".
* Relational: <, <=, >, >=, = (equal) and <> (not equal)
* Logical: AND, OR, NOT
Pattern Matching
* var LIKE "b%"; // beginning with "b" (the comparison is case-sensitive)
* var LIKE "%ent"; // ending with "ent"
* var LIKE "%Hun%"; // containing "Hun"
* var LIKE "_______"; // exactly seven characters (each underscore matches one character)
* LIKE: "WHERE name LIKE 'P%'"; "WHERE name LIKE '___k'";
* IN: "WHERE year IN (1987, 1991, 1993)";
* BETWEEN: "WHERE earning BETWEEN 2000 AND 50000";
* NULL: IS NULL; IS NOT NULL;
* IS MISSING: "WHERE employer IS MISSING";
Create and Delete Tables
* CREATE TABLE table ( id char(7), name char(30));
* CREATE TABLE table AS SELECT variables FROM table WHERE expression ORDER BY variables DESC;
* DROP TABLE table;
Select Statement
Select General
* SELECT * FROM tables;
* SELECT variables FROM tables/views WHERE conditions GROUP BY variables HAVING expression ORDER BY variables;
* SELECT * FROM tables INNER JOIN table ON conditions;
* SELECT member.id, member.name, feepayment.year, feepayment.month, feepayment.amount FROM sql.member, sql.feepayment WHERE member.id=feepayment.id;
* SELECT m.id, m.name, f.year, f.month, f.amount FROM sql.member AS m, sql.feepayment AS f WHERE m.id=f.id; /* Using Aliases */
Selecting by Joining
* SELECT [Indiana NPO (Working)].address, followup.address FROM followup INNER JOIN [Indiana NPO (Working)] ON followup.IDS = [Indiana NPO (Working)].IDS WHERE ((([Indiana NPO (Working)].ORGNAME) Is Null));
* SELECT [Indiana NPO (Working)].name, followup.name, [Indiana NPO (Working)].address, followup.address FROM [Indiana NPO (Working)] LEFT JOIN followup ON [Indiana NPO (Working)].id=followup.id WHERE ((([Indiana NPO (Working)].address)<>[followup].[address]));
* SELECT [Indiana NPO (Working)].name, tracking.name, [Indiana NPO (Working)].address, tracking.address, [Indiana NPO (Working)].city, tracking.city, [Indiana NPO (Working)].phone, tracking.phone INTO member FROM [Indiana NPO (Working)] INNER JOIN tracking ON [Indiana NPO (Working)].id=tracking.id; /* Making a separate table with records that meet the conditions */
Joining Tables
Joining General
* INNER: lists only the records whose joined fields are equal on both sides.
* LEFT: lists all records from the primary (left-hand) side, plus the records from the right-hand side whose joined fields are equal.
* RIGHT: lists all records from the right-hand side, plus the records from the left-hand side whose joined fields are equal.
* ... FROM left-hand-side INNER JOIN right-hand-side ON conditions;
* ... FROM left-hand-side AS lhs INNER JOIN right-hand-side AS rhs ON conditions;
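The difference between the INNER and LEFT joins listed above can be checked on a toy example. This is a minimal sketch in Python with the standard-library sqlite3 module, not PROC SQL; the members and publish tables and every row in them are made up for illustration.

```python
import sqlite3

# Hypothetical toy tables; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (id TEXT, name TEXT);
    CREATE TABLE publish (id TEXT, journal TEXT);
    INSERT INTO members VALUES ('001', 'Kim'), ('002', 'Lee'), ('003', 'Park');
    INSERT INTO publish VALUES ('001', 'JASA'), ('003', 'Biometrika'), ('004', 'Annals');
""")

# INNER JOIN: only ids that appear on both sides.
inner_rows = conn.execute(
    "SELECT lhs.name, rhs.journal FROM members AS lhs "
    "INNER JOIN publish AS rhs ON lhs.id = rhs.id ORDER BY lhs.id"
).fetchall()

# LEFT JOIN: every row of members; journal is NULL (None) when unmatched.
left_rows = conn.execute(
    "SELECT lhs.name, rhs.journal FROM members AS lhs "
    "LEFT JOIN publish AS rhs ON lhs.id = rhs.id ORDER BY lhs.id"
).fetchall()
conn.close()

print(inner_rows)
print(left_rows)
```

Member '002' has no publication, so it is dropped by the inner join and kept with an empty journal by the left join; publication '004' has no member and appears in neither result.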
Joining Example
* SELECT lhs.name, rhs.name FROM members AS lhs INNER JOIN publish AS rhs ON lhs.id = rhs.id;
* SELECT [Indiana NPO (Working)].name, followup.name, [Indiana NPO (Working)].address, followup.address FROM [Indiana NPO (Working)] LEFT JOIN followup ON [Indiana NPO (Working)].id=followup.id WHERE ((([Indiana NPO (Working)].address)<>[followup].[address]));
* UPDATE [Indiana NPO (Working)] RIGHT JOIN followup ON [Indiana NPO (Working)].id=followup.id SET [Indiana NPO (Working)].email=followup.email, [Indiana NPO (Working)].webpage=followup.webpage;
* INSERT INTO members SELECT variables FROM [Indiana NPO (Working)] INNER JOIN followup ON [Indiana NPO (Working)].id=followup.id;
Modify
Insert
* INSERT INTO table SET expressions; /* the SET form of INSERT takes no WHERE clause */
* INSERT INTO table SET id='8740031', name='JeeShim';
* INSERT INTO table VALUES ('9101321', 'kucc625');
* INSERT INTO members SELECT variables FROM [Indiana NPO (Working)] INNER JOIN followup ON [Indiana NPO (Working)].id=followup.id;
* INSERT INTO table SELECT variables FROM lhs INNER JOIN rhs ON lhs.id=rhs.id; /* Appending joined records to the table */
Update
* UPDATE table SET expressions WHERE conditions;
* UPDATE tracking SET tracking.state="GA", tracking.city="Atlanta" WHERE tracking.address IS NOT NULL;
* UPDATE [Indiana NPO (Working)] RIGHT JOIN followup ON [Indiana NPO (Working)].id=followup.id SET [Indiana NPO (Working)].email=followup.email, [Indiana NPO (Working)].webpage=followup.webpage;
Delete
* DELETE FROM table WHERE expression;
* DELETE FROM tracking WHERE (((tracking.ADDRESS1) Is Null));
* DELETE tracking.name, tracking.address FROM tracking WHERE (((tracking.state) <>"IN"));
SQL Examples
Computing Frequencies
PROC SQL;
SELECT name AS Names, COUNT(name) AS Frequency
FROM publish
GROUP BY name
HAVING COUNT(name) >= 1;
Inner Join to get from both Tables
SELECT m.id, m.name, p.journal
FROM members AS m INNER JOIN publish AS p ON m.name = p.name;
Left join
CREATE TABLE left AS SELECT m.id, m.name, i.journal
FROM members AS m LEFT JOIN inner AS i ON m.name = i.name;
QUIT;
This step keeps every observation from the primary data set (members) and attaches the matching observations from the secondary data set; journal is missing wherever there is no match.
To Get Unique Observation
DATA final;
SET left;
IF MISSING(journal); /* keep only observations with no match in the secondary data set */
RUN;
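The two ideas in these examples, counting frequencies with GROUP BY and keeping only the unmatched rows of a left join, can be tried out without SAS. This is a rough sketch in Python with the standard-library sqlite3 module; the table contents are invented, and the NULL test in the WHERE clause stands in for the DATA step filter above.

```python
import sqlite3

# Hypothetical data: Kim has two publications, Park one, Lee none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (id TEXT, name TEXT);
    CREATE TABLE publish (name TEXT, journal TEXT);
    INSERT INTO members VALUES ('001', 'Kim'), ('002', 'Lee'), ('003', 'Park');
    INSERT INTO publish VALUES ('Kim', 'JASA'), ('Kim', 'Biometrika'), ('Park', 'Annals');
""")

# Frequency of each name in publish.
freq = conn.execute(
    "SELECT name, COUNT(name) AS frequency FROM publish "
    "GROUP BY name HAVING COUNT(name) >= 1 ORDER BY name"
).fetchall()

# Members that appear only in the primary table: left join, then keep
# the rows whose joined column came back empty.
unmatched = conn.execute(
    "SELECT m.id, m.name FROM members AS m "
    "LEFT JOIN publish AS p ON m.name = p.name "
    "WHERE p.journal IS NULL ORDER BY m.id"
).fetchall()
conn.close()

print(freq)
print(unmatched)
```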
Thursday, August 23, 2007
Martingale
Originally, martingale referred to a class of betting strategies popular in 18th century France. The simplest of these strategies was designed for a game in which the gambler wins his stake if a coin comes up heads and loses it if the coin comes up tails. The strategy had the gambler double his bet after every loss, so that the first win would recover all previous losses plus win a profit equal to the original stake. Because the probability of eventually flipping heads approaches 1 as the gambler's wealth and available time jointly approach infinity, the martingale betting strategy was seen as a sure thing by those who practiced it. In reality, the exponential growth of the bets would eventually bankrupt anyone who used the martingale for a long time.
The concept of martingale in probability theory was introduced by Paul Pierre Lévy, and much of the original development of the theory was done by Joseph Leo Doob. Part of the motivation for that work was to show the impossibility of successful betting strategies.
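The arithmetic behind the doubling strategy can be worked out exactly. The sketch below is in Python and assumes a fair coin, a starting stake of 1, and a gambler who can afford at most max_rounds consecutive losses; the function name and the cap are illustrative assumptions, not part of the historical description.

```python
from fractions import Fraction

def doubling_outcome(max_rounds, stake=1):
    """Exact result of doubling after each loss, stopping at the first
    head or after max_rounds straight tails (assumed bankroll limit)."""
    p_ruin = Fraction(1, 2) ** max_rounds      # every flip is tails
    p_win = 1 - p_ruin                         # a head arrives in time
    loss = stake * (2 ** max_rounds - 1)       # 1 + 2 + 4 + ... stakes lost
    expected = p_win * stake - p_ruin * loss   # expected net gain
    return p_win, loss, expected

for k in (5, 10, 20):
    p_win, loss, expected = doubling_outcome(k)
    print(k, float(p_win), loss, expected)
```

Winning becomes almost certain as the cap grows, but the rare loss grows just as fast, so the expected gain stays at zero for every cap.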
Definitions
A discrete-time martingale is a discrete-time stochastic process (i.e., a sequence of random variables) $X_1, X_2, X_3, \ldots$ that satisfies, for all $n$,
$E(|X_n|) < \infty$
$E(X_{n+1} \mid X_1, \ldots, X_n) = X_n$,
i.e., the conditional expected value of the next observation, given all of the past observations, is equal to the last observation.
Somewhat more generally, a sequence $Y_1, Y_2, Y_3, \ldots$ is said to be a martingale with respect to another sequence $X_1, X_2, X_3, \ldots$ if, for all $n$,
$E(|Y_n|) < \infty$
$E(Y_{n+1} \mid X_1, \ldots, X_n) = Y_n$.
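As a small illustration of the defining property, the sketch below checks it by brute force for a random walk whose steps are +1 or -1. It is written in Python; the function names and the choice of a coin-flip walk are assumptions made for the example. With a fair coin the conditional expectation of the next value equals the current value on every path; with a biased coin it does not.

```python
from fractions import Fraction
from itertools import product

def expected_next(current, p_up):
    """E(X_{n+1} | history) when the next step is +1 with probability p_up
    and -1 otherwise; for this walk it depends only on the current value."""
    return p_up * (current + 1) + (1 - p_up) * (current - 1)

def is_martingale(n_steps, p_up):
    """Check E(X_{n+1} | X_1..X_n) == X_n on every path of length n_steps."""
    for steps in product((-1, 1), repeat=n_steps):
        current = sum(steps)  # X_n is the running total of the steps
        if expected_next(current, p_up) != current:
            return False
    return True

print(is_martingale(6, Fraction(1, 2)))
print(is_martingale(6, Fraction(3, 5)))
```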
Wednesday, August 22, 2007
SAS Output : Proc reg
SAS Annotated Output
Regression Analysis
This page shows an example regression analysis with footnotes explaining the output. These data were collected on 200 high school students and are scores on various tests, including science, math, reading and social studies (socst). The variable female is a dichotomous variable coded 1 if the student was female and 0 if male.
In the code below, the data = option on the proc reg statement tells SAS where to find the SAS data set to be used in the analysis. On the model statement, we specify the regression model that we want to run, with the dependent variable (in this case, science) on the left of the equals sign, and the independent variables on the right-hand side. We use the clb option after the slash on the model statement to get the 95% confidence limits of the parameter estimates. The quit statement is included because proc reg is an interactive procedure, and quit tells SAS not to expect any further statements for this proc reg step.
proc reg data = "d:\dataset";
model science = math female socst read / clb;
run;
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: science science score
Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              4       9543.72074    2385.93019     46.69   <.0001
Error            195       9963.77926      51.09630
Corrected Total  199            19507
Root MSE          7.14817   R-Square   0.4892
Dependent Mean   51.85000   Adj R-Sq   0.4788
Coeff Var        13.78624
Parameter Estimates
Variable    Label                  DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept               1             12.32529          3.19356      3.86     0.0002
math        math score              1              0.38931          0.07412      5.25     <.0001
female                              1             -2.00976          1.02272     -1.97     0.0508
socst       social studies score    1              0.04984          0.06223      0.80     0.4241
read        reading score           1              0.33530          0.07278      4.61     <.0001
Parameter Estimates
Variable    Label                  DF   95% Confidence Limits
Intercept   Intercept               1    6.02694   18.62364
math        math score              1    0.24312    0.53550
female                              1   -4.02677    0.00724
socst       social studies score    1   -0.07289    0.17258
read        reading score           1    0.19177    0.47883
Anova Table
Analysis of Variance
Source(a)        DF(b)   Sum of Squares(c)   Mean Square(d)   F Value(e)   Pr > F(e)
Model                4          9543.72074       2385.93019        46.69      <.0001
Error              195          9963.77926         51.09630
Corrected Total    199               19507
a. Source - This is the source of variance, Model, Residual, and Total. The Total variance is partitioned into the variance which can be explained by the independent variables (Model) and the variance which is not explained by the independent variables (Residual, sometimes called Error). Note that the Sums of Squares for the Model and Residual add up to the Total Variance, reflecting the fact that the Total Variance is partitioned into Model and Residual variance.
b. DF - These are the degrees of freedom associated with the sources of variance. The total variance has N-1 degrees of freedom. In this case, there were N=200 students, so the DF for total is 199. The model degrees of freedom corresponds to the number of predictors minus 1 (K-1). You may think this would be 4-1 (since there were 4 independent variables in the model, math, female, socst and read). But, the intercept is automatically included in the model (unless you explicitly omit the intercept). Including the intercept, there are 5 predictors, so the model has 5-1=4 degrees of freedom. The Residual degrees of freedom is the DF total minus the DF model, 199 - 4 is 195.
c. Sum of Squares - These are the Sum of Squares associated with the three sources of variance, Total, Model and Residual. These can be computed in many ways. Conceptually, these formulas can be expressed as:
SSTotal: The total variability around the mean, $\sum (Y - \bar{Y})^2$.
SSResidual: The sum of squared errors in prediction, $\sum (Y - Y_{predicted})^2$.
SSModel: The improvement in prediction by using the predicted value of Y over just using the mean of Y. Hence, this would be the squared differences between the predicted value of Y and the mean of Y, $\sum (Y_{predicted} - \bar{Y})^2$. Another way to think of this is that SSModel is SSTotal - SSResidual. Note that SSTotal = SSModel + SSResidual. Note that SSModel / SSTotal is equal to .4892, the value of R-Square. This is because R-Square is the proportion of the variance explained by the independent variables, hence it can be computed as SSModel / SSTotal.
d. Mean Square - These are the Mean Squares, the Sum of Squares divided by their respective DF. For the Model, 9543.72074 / 4 = 2385.93019. For the Residual, 9963.77926 / 195 = 51.0963039. These are computed so you can compute the F ratio, dividing the Mean Square Model by the Mean Square Residual to test the significance of the predictors in the model.
e. F Value and Pr > F - The F-value is the Mean Square Model (2385.93019) divided by the Mean Square Residual (51.0963039), yielding F=46.69. The p-value associated with this F value is very small (reported as <.0001). These values are used to answer the question "Do the independent variables reliably predict the dependent variable?". The p-value is compared to your alpha level (typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable". You could say that the group of variables math, female, socst and read can be used to reliably predict science (the dependent variable). If the p-value were greater than 0.05, you would say that the group of independent variables does not show a statistically significant relationship with the dependent variable, or that the group of independent variables does not reliably predict the dependent variable. Note that this is an overall significance test assessing whether the group of independent variables, when used together, reliably predicts the dependent variable; it does not address the ability of any particular independent variable to predict the dependent variable. The ability of each individual independent variable to predict the dependent variable is addressed in the table below, where each of the individual variables is listed.
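The relationships among the columns of the ANOVA table can be rechecked by hand. The following is a small Python sketch, not SAS output, that takes the sums of squares and degrees of freedom printed above and recomputes the mean squares, the F value and R-Square; the variable names are chosen for this example only.

```python
# Values copied from the Analysis of Variance table above.
ss_model, df_model = 9543.72074, 4
ss_error, df_error = 9963.77926, 195

ss_total = ss_model + ss_error        # corrected total sum of squares
ms_model = ss_model / df_model        # Mean Square, Model
ms_error = ss_error / df_error        # Mean Square, Error (Residual)
f_value = ms_model / ms_error         # overall F statistic
r_square = ss_model / ss_total        # share of variance explained

print(f"SS total = {ss_total:.1f}")
print(f"MS model = {ms_model:.5f}, MS error = {ms_error:.5f}")
print(f"F = {f_value:.2f}, R-Square = {r_square:.4f}")
```

The p-value for F is left out because it needs the F distribution, which the Python standard library does not provide.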
Overall Model Fit
Root MSE(f)          7.14817   R-Square(i)   0.4892
Dependent Mean(g)   51.85000   Adj R-Sq(j)   0.4788
Coeff Var(h)        13.78624
f. Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Error).
g. Dependent Mean - This is the mean of the dependent variable.
h. Coeff Var - This is the coefficient of variation, which is a unit-less measure of variation in the data. It is the root MSE divided by the mean of the dependent variable, expressed as a percentage (100 * 7.15 / 51.85 = 13.79).
i. R-Square - R-Square is the proportion of variance in the dependent variable (science) which can be predicted from the independent variables (math, female, socst and read). This value indicates that 48.92% of the variance in science scores can be predicted from the variables math, female, socst and read. Note that this is an overall measure of the strength of association, and does not reflect the extent to which any particular independent variable is associated with the dependent variable.
j. Adj R-Sq - Adjusted R-square. As predictors are added to the model, each predictor will explain some of the variance in the dependent variable simply due to chance. One could continue to add predictors to the model, which would continue to improve the ability of the predictors to explain the dependent variable, although some of this increase in R-square would be simply due to chance variation in that particular sample. The adjusted R-square attempts to yield a more honest value to estimate the R-squared for the population. The value of R-square was .4892, while the value of Adjusted R-square was .4788. Adjusted R-squared is computed using the formula 1 - (1 - Rsq)((N - 1) / (N - k - 1)). From this formula, you can see that when the number of observations is small and the number of predictors is large, there will be a much greater difference between R-square and adjusted R-square (because the ratio (N - 1) / (N - k - 1) will be much greater than 1). By contrast, when the number of observations is very large compared to the number of predictors, the value of R-square and adjusted R-square will be much closer because the ratio (N - 1) / (N - k - 1) will approach 1.
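The three derived fit statistics can be reproduced from numbers already on the page. This is a short Python sketch under the assumption that N = 200 observations and k = 4 predictors, as stated above; the variable names are illustrative.

```python
import math

# Inputs taken from the output above.
n_obs, k_predictors = 200, 4
ms_error = 51.09630                   # Mean Square Error
dependent_mean = 51.85                # mean of science
r_square = 9543.72074 / (9543.72074 + 9963.77926)

root_mse = math.sqrt(ms_error)                    # footnote f
coeff_var = 100 * root_mse / dependent_mean       # footnote h, in percent
adj_r_square = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - k_predictors - 1)  # footnote j

print(f"Root MSE  = {root_mse:.5f}")
print(f"Coeff Var = {coeff_var:.5f}")
print(f"Adj R-Sq  = {adj_r_square:.4f}")
```

Here (N - 1) / (N - k - 1) is 199 / 195, only slightly above 1, which is why the adjusted value is close to the unadjusted one.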
Parameter Estimates
Parameter Estimates
Variable(k)   Label(l)               DF(m)   Parameter Estimate(n)   Standard Error(o)   t Value(p)   Pr > |t|(p)
Intercept     Intercept                  1                12.32529             3.19356         3.86        0.0002
math          math score                 1                 0.38931             0.07412         5.25        <.0001
female                                   1                -2.00976             1.02272        -1.97        0.0508
socst         social studies score       1                 0.04984             0.06223         0.80        0.4241
read          reading score              1                 0.33530             0.07278         4.61        <.0001
Parameter Estimates
Variable(k)   Label(l)               DF(m)   95% Confidence Limits(q)
Intercept     Intercept                  1    6.02694   18.62364
math          math score                 1    0.24312    0.53550
female                                   1   -4.02677    0.00724
socst         social studies score       1   -0.07289    0.17258
read          reading score              1    0.19177    0.47883
k. Variable - This column shows the predictor variables (Intercept, math, female, socst, read). The first row (Intercept) represents the constant, also referred to in textbooks as the Y intercept, the height of the regression line where it crosses the Y axis. In other words, this is the predicted value of science when all predictors are 0.
l. Label - This column gives the label for the variable. Usually, variable labels are added when the data set is created so that it is clear what the variable is (as the name of the variable can sometimes be ambiguous). SAS has labeled the variable Intercept for us by default. Note that this variable is not added to the data set.
m. DF - This column gives the degrees of freedom associated with each independent variable. All continuous variables have one degree of freedom, as do binary variables (such as female).
n. Parameter Estimates - These are the values for the regression equation for predicting the dependent variable from the independent variable. The regression equation is presented in many different ways, for example:
Ypredicted = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
The column of estimates (coefficients or parameter estimates, from here on labeled coefficients) provides the values for b0, b1, b2, b3 and b4 for this equation. Expressed in terms of the variables used in this example, the regression equation is
sciencePredicted = 12.32529 + .3893102*math - 2.009765*female + .0498443*socst + .3352998*read
These estimates tell you about the relationship between the independent variables and the dependent variable. These estimates tell the amount of increase in science scores that would be predicted by a 1 unit increase in the predictor. Note: For the independent variables which are not significant, the coefficients are not significantly different from 0, which should be taken into account when interpreting the coefficients. (See the columns with the t-value and p-value about testing whether the coefficients are significant).
math - The coefficient (parameter estimate) is .3893102. So, for every unit (i.e., point, since this is the metric in which the tests are measured) increase in math, a .3893102 unit increase in science is predicted, holding all other variables constant. (It does not matter at what value you hold the other variables constant, because it is a linear model.) Or, for every increase of one point on the math test, your science score is predicted to be higher by .3893102 points. This is significantly different from 0.
female - For every unit increase in female, there is a 2.009765 unit decrease in the predicted science score, holding all other variables constant. Since female is coded 0/1 (0=male, 1=female) the interpretation can be put more simply: for females the predicted science score would be 2 points lower than for males. The coefficient for female is technically not statistically significantly different from 0, because the p-value (.0508) is greater than .05. However, .051 is so close to .05 that some researchers would still consider it to be statistically significant.
socst - The coefficient for socst is .0498443. This means that for a 1-unit increase in the social studies score, we expect an approximately .05 point increase in the science score. This is not statistically significant; in other words, .0498443 is not significantly different from 0.
read - The coefficient for read is .3352998. Hence, for every unit increase in reading score we expect a .34 point increase in the science score. This is statistically significant.
o. Standard Error - These are the standard errors associated with the coefficients. The standard error is used for testing whether the parameter is significantly different from 0 by dividing the parameter estimate by the standard error to obtain a t-value (see the column with t-values and p-values). The standard errors can also be used to form a confidence interval for the parameter, as shown in the last two columns of this table.
p. t Value and Pr > |t| - These columns provide the t-value and 2-tailed p-value used in testing the null hypothesis that the coefficient/parameter is 0. If you use a 2-tailed test, then you would compare each p-value to your preselected value of alpha. Coefficients having p-values less than alpha are statistically significant. For example, if you chose alpha to be 0.05, coefficients having a p-value of 0.05 or less would be statistically significant (i.e., you can reject the null hypothesis and say that the coefficient is significantly different from 0). If you use a 1-tailed test (i.e., you predict that the parameter will go in a particular direction), then you can divide the p-value by 2 before comparing it to your preselected alpha level. With a 2-tailed test and alpha of 0.05, you cannot reject the null hypothesis that the coefficient for female is equal to 0, because its p-value of 0.0508 is slightly greater than 0.05. Had you predicted in advance that this coefficient would be negative (i.e., a one-tailed test), you could divide the p-value by 2 before comparing it to alpha. This would yield a one-tailed p-value of 0.0254, which is less than 0.05, so you could conclude that this coefficient is less than 0 with a one-tailed alpha of 0.05. It would still not be significant with a one-tailed alpha of 0.01.
The coefficient for math is significantly different from 0 using alpha of 0.05 because its p-value (<.0001) is smaller than 0.05.
The coefficient for socst (0.0498443) is not statistically significantly different from 0 because its p-value (0.4241) is larger than 0.05.
The coefficient for read (0.3353) is statistically significant because its p-value (<.0001) is less than .05.
The intercept is significantly different from 0 at the 0.05 alpha level (p = 0.0002). However, having a significant intercept is seldom interesting.
q. 95% Confidence Limits - This shows a 95% confidence interval for the coefficient. This is very useful as it helps you understand how high and how low the actual population value of the parameter might be. The confidence intervals are related to the p-values such that the coefficient will not be statistically significant at the 0.05 level if the confidence interval includes 0. If you look at the confidence interval for female, you will see that it just includes 0 (-4 to .007). Because .007 is so close to 0, the p-value is close to .05. If the upper confidence limit had been a little smaller, such that the interval did not include 0, the coefficient for female would have been statistically significant. Also, consider the coefficients for female (-2) and read (.34). Immediately you see that the estimate for female is much bigger in absolute terms, but examine the confidence interval for it (-4 to .007). Now examine the confidence interval for read (.19 to .48). Even though female has the bigger coefficient, its interval is wide: the population value could be as low as -4 or as high as just above 0. By contrast, the lower confidence limit for read is .19, which is still above 0. So, even though female has a bigger coefficient, read is significant, and even the smallest value in its confidence interval is still higher than 0. The same cannot be said about the coefficient for socst, whose interval (-.07 to .17) includes 0. Such confidence intervals help you to put the estimate from the coefficient into perspective by seeing how much the value could vary.
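The t values and confidence limits in the table follow directly from the estimates and standard errors. The sketch below is in Python and recomputes them; the critical value 1.9722 is an assumed approximation of the two-sided 95% point of the t distribution with 195 degrees of freedom, typed in by hand because the standard library has no t distribution. Results can differ from the SAS output in the last printed digit because the inputs are rounded.

```python
# (estimate, standard error) copied from the Parameter Estimates table.
estimates = {
    "Intercept": (12.32529, 3.19356),
    "math": (0.38931, 0.07412),
    "female": (-2.00976, 1.02272),
    "socst": (0.04984, 0.06223),
    "read": (0.33530, 0.07278),
}

# Assumption: approximate two-sided 95% critical value of t with 195 df.
T_CRIT = 1.9722

def summarize(estimate, std_err, t_crit=T_CRIT):
    """Return the t value and the lower and upper 95% confidence limits."""
    t_value = estimate / std_err
    half_width = t_crit * std_err
    return t_value, estimate - half_width, estimate + half_width

results = {name: summarize(est, se) for name, (est, se) in estimates.items()}

for name, (t_value, lower, upper) in results.items():
    includes_zero = lower <= 0 <= upper
    print(f"{name:10s} t = {t_value:6.2f}  CI = ({lower:9.5f}, {upper:9.5f})  "
          f"includes 0: {includes_zero}")
```

Only female and socst produce intervals that contain 0, matching the two coefficients whose p-values exceed 0.05.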
SAS : Proc Reg
The REG procedure fits linear regression models by least squares. The following statements are used with the REG procedure:
PROC REG options;
MODEL dependents=regressors / options;
VAR variables;
FREQ variable;
WEIGHT variable;
ID variable;
OUTPUT OUT=SASdataset keyword=names...;
PLOT yvariable*xvariable = symbol ...;
RESTRICT linear_equation,...;
TEST linear_equation,...;
MTEST linear_equation,...;
BY variables;
The PROC REG statement is always accompanied by one or more MODEL statements to specify regression models. One OUTPUT statement may follow each MODEL statement. Several RESTRICT, TEST, and MTEST statements may follow each MODEL. WEIGHT, FREQ, and ID statements are optionally specified once for the entire PROC step. The purposes of the statements are:
* The MODEL statement specifies the dependent and independent variables in the regression model.
* The OUTPUT statement requests an output data set and names the variables to contain predicted values, residuals, and other output values.
* The ID statement names a variable to identify observations in the printout.
* The WEIGHT and FREQ statements declare variables to weight observations.
* The BY statement specifies variables to define subgroups for the analysis. The analysis is repeated for each value of the BY variable.
Proc REG Statement
PROC REG options;
These options may be specified on the PROC REG statement:
DATA=SASdataset
names the SAS data set to be used by PROC REG. If DATA= is not specified, REG uses the most recently created SAS data set.
OUTEST=SASdataset
requests that parameter estimates be output to this data set.
OUTSSCP=SASdataset
requests that the crossproducts matrix be output to this TYPE=SSCP data set.
NOPRINT
suppresses the normal printed output.
SIMPLE
prints the "simple" descriptive statistics for each variable used in REG.
ALL
requests many different printouts.
COVOUT
outputs the covariance matrices for the parameter estimates to the OUTEST data set. This option is valid only if OUTEST= is also specified.
MODEL Statement
label: MODEL dependents = regressors / options;
After the keyword MODEL, the dependent (response) variables are specified, followed by an equal sign and the regressor variables. Variables specified in the MODEL statement must be variables in the data set being analyzed. The label is optional.
* General options:
NOPRINT
suppresses the normal printout of regression results.
NOINT
suppresses the intercept term that is normally included in the model automatically.
ALL
requests all the features of these options: XPX, SS1, SS2, STB, TOL, COVB, CORRB, SEQB, P, R, CLI, CLM.
* Options to request regression calculations:
XPX
prints the X'X crossproducts matrix for the model.
I
prints the (X'X)^-1 matrix, i.e., the inverse of the crossproducts matrix.
* Options for details on the estimates:
SS1
prints the sequential sums of squares (Type I SS) along with the parameter estimates for each term in the model.
SS2
prints the partial sums of squares (Type II SS) along with the parameter estimates for each term in the model.
STB
prints standardized regression coefficients.
TOL
prints tolerance values for the estimates.
VIF
prints variance inflation factors with the parameter estimates. Variance inflation is the reciprocal of tolerance.
COVB
prints the estimated covariance matrix of the estimates.
CORRB
prints the correlation matrix of the estimates.
SEQB
prints a sequence of parameter estimates as each variable is entered into the model.
COLLIN
requests a detailed analysis of collinearity among the regressors.
COLLINOINT
requests the same analysis as the COLLIN option with the intercept variable adjusted out rather than included in the diagnostics.
* Options for predicted values and residuals:
P
calculates predicted values from the input data and the estimated model.
R
requests that the residual be analyzed.
CLM
prints the 95% upper and lower confidence limits for the expected value of the dependent variable (mean) for each observation.
CLI
requests the 95% upper and lower confidence limits for an individual predicted value.
DW
calculates a Durbin-Watson statistic to test whether or not the errors have first-order autocorrelation. (This test is only appropriate for time-series data.)
INFLUENCE
requests a detailed analysis of the influence of each observation on the estimates and the predicted values.
PARTIAL
requests partial regression leverage plots for each regressor.
FREQ Statement
FREQ variable;
If a variable in your data set represents the frequency of occurrence for the other values in the observation, include the variable's name in a FREQ statement. The procedure then treats the data set as if each observation appears n times, where n is the value of the FREQ variable for the observation. The total number of observations will be considered equal to the sum of the FREQ variable when the procedure determines degrees of freedom for significance probabilities.
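The effect described here can be imitated outside SAS. The following Python sketch fits a one-regressor least-squares line twice, once with a frequency count per observation and once on data in which each observation is physically repeated; the data and the fit_line helper are invented for the example, and this is not how PROC REG is implemented.

```python
def fit_line(x, y, freq=None):
    """Least-squares intercept and slope for y = b0 + b1*x.
    freq[i] says how many times observation i occurs (default: once)."""
    if freq is None:
        freq = [1] * len(x)
    n = sum(freq)  # total observations, as counted for degrees of freedom
    x_bar = sum(f * xi for f, xi in zip(freq, x)) / n
    y_bar = sum(f * yi for f, yi in zip(freq, y)) / n
    sxy = sum(f * (xi - x_bar) * (yi - y_bar) for f, xi, yi in zip(freq, x, y))
    sxx = sum(f * (xi - x_bar) ** 2 for f, xi in zip(freq, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, n - 2  # intercept, slope, error df

# Hypothetical data: four distinct observations with frequencies 3, 1, 2, 1.
x, y, freq = [1, 2, 3, 4], [2.0, 2.5, 4.5, 4.0], [3, 1, 2, 1]

# The same data with every observation written out freq times.
x_long = [xi for xi, f in zip(x, freq) for _ in range(f)]
y_long = [yi for yi, f in zip(y, freq) for _ in range(f)]

compact = fit_line(x, y, freq)
expanded = fit_line(x_long, y_long)
print(compact)
print(expanded)
```

Both fits give the same coefficients, and the error degrees of freedom come from the sum of the frequencies (7 - 2 = 5) rather than from the four stored rows.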
WEIGHT Statement
WEIGHT variable;
A WEIGHT statement names a variable on the input data set whose values are relative weights for a weighted least-squares fit. If the weight value is proportional to the reciprocal of the variance for each observation, then the weighted estimates are the best linear unbiased estimates (BLUE).
ID Statement
ID variable;
The ID statement specifies one variable to identify observations as output from the MODEL options P, R, CLM, CLI, and INFLUENCE.
OUTPUT Statement
The OUTPUT statement specifies an output data set to contain statistics calculated for each observation. For each statistic, specify the keyword, an equal sign, and a variable name for the statistic on the output data set. If the MODEL has several dependent variables, then a list of output variable names can be specified after each keyword to correspond to the list of dependent variables.
OUTPUT OUT=SASdataset
PREDICTED=names or P=names
RESIDUAL=names or R=names
L95M=names
U95M=names
L95=names
U95=names
STDP=names
STDR=names
STUDENT=names
COOKD=names
H=names
PRESS=names
RSTUDENT=names
DFFITS=names
COVRATIO=names;
The output data set named with OUT= contains all the variables for which the analysis was performed, including any BY variables, any ID variables, and variables named in the OUTPUT statement that contain statistics.
These statistics may be output to the new data set:
PREDICTED=
P=
predicted values.
RESIDUAL=
R=
residuals, calculated as ACTUAL minus PREDICTED.
L95M=
lower bound of a 95% confidence interval for the expected value (mean) of the dependent variable.
U95M=
upper bound of a 95% confidence interval for the expected value (mean) of the dependent variable.
L95=
lower bound of a 95% confidence interval for an individual prediction. This includes the variance of the error as well as the variance of the parameter estimates.
U95=
upper bound of a 95% confidence interval for an individual prediction.
STDP=
standard error of the mean predicted value.
STDR=
standard error of the residual.
STUDENT=
studentized residuals, the residual divided by its standard error.
COOKD=
Cook's D influence statistic.
H=
leverage.
PRESS=
residual for estimates dropping this observation, which is the residual divided by (1-h) where h is leverage above.
RSTUDENT=
studentized residual defined slightly differently than above.
DFFITS=
standard influence of observation on predicted value.
COVRATIO=
standard influence of observation on covariance of betas, as discussed with INFLUENCE option.
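The link between leverage and the PRESS residual stated above, residual divided by (1 - h), can be verified numerically. This is a minimal Python sketch for a single regressor with made-up data; the helper names are illustrative and the leverage formula used, 1/n + (x - xbar)^2 / Sxx, is the one that applies to simple regression with an intercept.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope

def leverage(x):
    """Leverage h for each observation in simple regression with intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical data; the last point sits far from the others in x.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 9.0]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = leverage(x)
press = [e / (1 - hi) for e, hi in zip(residuals, h)]

# Direct check: refit without observation i, then predict observation i.
leave_one_out = []
for i in range(len(x)):
    a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    leave_one_out.append(y[i] - (a0 + a1 * x[i]))

for hi, p, d in zip(h, press, leave_one_out):
    print(f"h = {hi:.4f}  PRESS = {p:8.4f}  refit residual = {d:8.4f}")
```

The shortcut and the refit agree for every observation, and the point far out in x has by far the largest leverage.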
PLOT Statement
PLOT yvariable*xvariable=symbol / options;
The PLOT statement prints scatter plots of the yvariables on the vertical axis and xvariables on the horizontal axis. It uses the symbol specified to mark the points. The yvariables and xvariables may be any variables in the data set or any of the calculated statistics available in the OUTPUT statement.
TEST Statement
label: TEST equation1,
equation2,
.
.
.
equationk;
label: TEST equation1,..., equationk / options;
The TEST statement, which has the same syntax as the RESTRICT statement except for options, tests hypotheses about the parameters estimated in the preceding MODEL statement. Each equation specifies a linear hypothesis to be tested.
One option may be specified in the TEST statement after a slash (/):
PRINT
prints intermediate calculations.
BY Statement
BY variables;
A BY statement may be used with PROC REG to obtain separate analyses on observations in groups defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If your input data set is not sorted in ascending order, use the SORT procedure with a similar BY statement to sort the data, or, if appropriate, use the BY statement options NOTSORTED or DESCENDING.
Friday, July 27, 2007
SAS plus
The following is the complete SAS code to generate the animated GIF and an HTML file that references it. You should notice the following:
• The GSFNAME= option of the GOPTIONS statement specifies the name of the GIF to be created. In this example, the value of GSFNAME is specified in an associated FILENAME statement. If you want to run this example, then change the value of the FILENAME statement to something that makes sense for you.
• The following statement
goptions gsfmode=append;
is included before the second invocation of PROC GCHART so that the output is appended to the same GIF file.
• A FILE statement specifies the complete path and file name of the HTML file to be created by the PUT statements. If you want to run this example, then change the value to something that makes sense for you.
data prdsummary;
input Year Quarter Country $ Product $ Actual dollar10.2;
label Actual = 'Actual Sales';
format Actual dollar11.;
datalines;
1993 1 CANADA BED $4,337.00
1993 1 CANADA CHAIR $5,115.00
1993 1 CANADA DESK $6,644.00
1993 1 GERMANY BED $5,026.00
1993 1 GERMANY CHAIR $6,276.00
1993 1 GERMANY DESK $4,330.00
1993 2 CANADA BED $2,437.00
1993 2 CANADA CHAIR $3,115.00
1993 2 CANADA DESK $5,654.00
1993 2 GERMANY BED $3,026.00
1993 2 GERMANY CHAIR $2,276.00
1993 2 GERMANY DESK $3,320.00
1993 3 CANADA BED $6,337.00
1993 3 CANADA CHAIR $7,145.00
1993 3 CANADA DESK $7,614.00
1993 3 GERMANY BED $5,026.00
1993 3 GERMANY CHAIR $3,276.00
1993 3 GERMANY DESK $6,340.00
1993 4 CANADA BED $9,337.00
1993 4 CANADA CHAIR $2,115.00
1993 4 CANADA DESK $3,646.00
1993 4 GERMANY BED $6,026.00
1993 4 GERMANY CHAIR $7,276.00
1993 4 GERMANY DESK $8,350.00
1994 1 CANADA BED $3,327.00
1994 1 CANADA CHAIR $5,345.00
1994 1 CANADA DESK $7,624.00
1994 1 GERMANY BED $4,026.00
1994 1 GERMANY CHAIR $3,276.00
1994 1 GERMANY DESK $3,340.00
1994 2 CANADA BED $5,356.00
1994 2 CANADA CHAIR $3,115.00
1994 2 CANADA DESK $7,623.00
1994 2 GERMANY BED $8,026.00
1994 2 GERMANY CHAIR $5,276.00
1994 2 GERMANY DESK $7,321.00
1994 3 CANADA BED $4,321.00
1994 3 CANADA CHAIR $3,115.00
1994 3 CANADA DESK $5,658.00
1994 3 GERMANY BED $6,026.00
1994 3 GERMANY CHAIR $5,276.00
1994 3 GERMANY DESK $6,398.00
1994 4 CANADA BED $5,357.00
1994 4 CANADA CHAIR $4,166.00
1994 4 CANADA DESK $7,662.00
1994 4 GERMANY BED $4,026.00
1994 4 GERMANY CHAIR $5,246.00
1994 4 GERMANY DESK $3,329.00
;
/* delete previously created gsegs before creating new ones */
/* (SAS creates gsegs before creating gifs from them) */
proc greplay igout=work.gseg nofs;
delete _all_;
/* could also specify: delete _1993, _19931, etc. */
run; quit;
/* use filename to specify the output file for the animated gif */
filename myimages 'u:\public_html\Web_output\gifanim.gif';
goptions reset=all device=gifanim gsfname=myimages
gsfmode=replace /* not necessary when using "BY" */
delay=150 /* set delay between images */
border
ftext="Helvetica" ftitle="Helvetica";
title1 '1993 Sales';
proc gchart data=prdsummary(where=(year=1993));
hbar3d country / sumvar=actual subgroup=product sum
shape=hexagon caxis=black cframe=CXb0c1f4;
by quarter;
run;
quit;
goptions gsfmode=append;
title1 '1994 Sales';
proc gchart data=prdsummary(where=(year=1994));
hbar3d country / sumvar=actual subgroup=product sum
shape=hexagon caxis=black cframe=CXb0c1f4;
by quarter;
run;
quit;
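The listing above stops before the step that writes the HTML file mentioned in the introduction. A minimal sketch of such a step is given here, assuming the HTML file sits in the same folder as the GIF. The file name gifanim.html and the page title are hypothetical and should be changed to suit your setup.

```sas
/* Sketch only: write a small HTML page that references the animated GIF */
/* Hypothetical path: adjust to a folder that exists on your system      */
data _null_;
  file 'u:\public_html\Web_output\gifanim.html';
  put '<html>';
  put '<head><title>Sales by Quarter</title></head>';
  put '<body>';
  /* relative reference, so the GIF must be in the same folder as the page */
  put '<img src="gifanim.gif" alt="Animated sales chart">';
  put '</body>';
  put '</html>';
run;
```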
Friday, May 18, 2007
SAS - 3 : General Arithmetic
SAS sample statistic functions
Sample statistics for a single variable across all observations are simple to obtain using, for example, PROC MEANS, PROC UNIVARIATE, etc. The simplest method to obtain similar statistics across several variables within an observation is with a 'sample statistics function'.
For example:
sum_wt=sum(of weight1 weight2 weight3 weight4 weight5);
Note that this is equivalent to
sum_wt=sum(of weight1-weight5);
but is not equivalent to
sum_wt=weight1 + weight2 + weight3 + weight4 + weight5;
since the SUM function returns the sum of non-missing arguments, whereas the '+' operator returns a missing value if any of the arguments are missing.
The following are all valid arguments for the SUM function:
sum(of variable1-variablen) where n is an integer greater than 1
sum(of x y z)
sum(of array-name{*})
sum(of _numeric_)
sum(of x--a) where x precedes a in the PDV order
A comma delimited list is also a valid argument, for example:
sum(x, y, z)
However, I recommend always using an argument preceded by OF, since this minimises the chance that you write something like
sum_wt=sum(weight1-weight5);
which is a valid SAS expression, but evaluates to the difference between weight1 and weight5.
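The difference between the three forms can be checked with a small made-up data set. The data set name weights and the values are assumptions for illustration only.

```sas
/* Sketch only: compare SUM(OF ...), the + operator, and SUM without OF */
data weights;
  input weight1-weight5;
  sum_of   = sum(of weight1-weight5);   /* adds the non-missing values */
  sum_plus = weight1 + weight2 + weight3 + weight4 + weight5; /* missing if any operand is missing */
  diff     = sum(weight1-weight5);      /* no OF: weight1 minus weight5 */
  datalines;
10 20 30 40 50
10 . 30 40 50
;
run;

proc print data=weights;
run;
```

In the second observation weight2 is missing, so sum_of is 130 while sum_plus is missing. In both observations diff is -40, the difference between weight1 and weight5.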
Other useful sample statistic functions are:
MAX(argument,...) returns the largest value
MIN(argument,...) returns the smallest value
MEAN(argument,...) returns the arithmetic mean (average)
N(argument,...) returns the number of nonmissing arguments
NMISS(argument,...) returns the number of missing values
STD(argument,...) returns the standard deviation
STDERR(argument,...) returns the standard error of the mean
VAR(argument,...) returns the variance
Example usage
You may, for example, have collected weekly test scores over a 20 week period and wish to calculate the average score for all observations with the proviso that a maximum of 2 scores may be missing.
if nmiss(of test1-test20) le 2 then
testmean=mean(of test1-test20);
else testmean=.;
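To try the rule above without real data, the following sketch generates three made-up observations with zero, two, and three missing scores. The data set name scores, the variable id, and the score values are all assumptions.

```sas
/* Sketch only: made-up scores to exercise the "at most 2 missing" rule */
data scores;
  array test{20} test1-test20;
  do id = 1 to 3;
    do i = 1 to 20;
      /* for id 2 and 3, the first id scores are left missing */
      if id > 1 and i <= id then test{i} = .;
      else test{i} = 60 + i;
    end;
    if nmiss(of test1-test20) le 2 then
      testmean = mean(of test1-test20);
    else testmean = .;
    output;
  end;
  drop i;
run;

proc print data=scores;
  var id testmean;
run;
```

Only the third observation, which has three missing scores, ends up with a missing testmean.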
Friday, May 11, 2007
SAS - 2 : Proc Print n Freq
Objectives
- Use PROC PRINT
- Use PROC FREQ
Procedure (PROC) statements specify the procedure to be used on the data set you created. The general form of the statements needed to execute a SAS procedure is:
PROC procedure_name options parameters;
Example:
PROC PRINT N;
Beginning SAS users should become familiar with these procedures:
PROC PRINT;
PROC FREQ;
PROC UNIVARIATE;
PROC MEANS;
PROC PLOT;
PROC SORT;
In this Session you will learn to use PROC PRINT and PROC FREQ.
PROC PRINT statement
Use: | Lists data as a table of observations |
Syntax: | PROC PRINT; |
Result: | SAS is asked to print a table of all observations in your data set |
PROC PRINT lists data as a table of observations by variables. The general form of the PRINT procedure is:
PROC PRINT;
VAR NAME AGE SEX Q1 Q2;
VAR is an additional procedure information statement in the PRINT procedure that allows you to pick out specific variables to be printed in a certain order. Without the VAR statement, all variables in the data set would be output in the print listing.
Sample output from the PRINT procedure
PROC FREQ statement
Use: | To produce a frequency table or 2-way table of your data |
Syntax: | PROC FREQ; |
PROC FREQ shows the distribution of variable values through a one-way table or through crosstabulation tables. The general form of the PROC FREQ statement is:
PROC FREQ;
TABLES SEX;
Sample output from the frequency on sex
To produce a two-way crosstabulation table, enter the two variables in the TABLES statement, joined by an asterisk, as follows:
PROC FREQ;
TABLES variable1*variable2;
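As a concrete sketch, the following crosstabulates SEX by Q1 for a few rows of the sample survey. The choice of SEX and Q1 is an assumption, and list input is used here only to keep the example short.

```sas
/* Sketch only: a four-row subset of the sample survey, read with list input */
data survey;
  input name $ sex q1;
  datalines;
JANE 2 2
MICHAEL 1 5
MARIA 2 2
JUAN 1 4
;
run;

/* rows are the values of sex, columns are the values of q1 */
proc freq data=survey;
  tables sex*q1;
run;
```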
Session 3 Exercise
- Using Pico, edit your SAS program file called survey.sas.
- Add a PROC PRINT statement and a PROC FREQ statement to the end of your program by typing:
PROC PRINT;
VAR NAME AGE SEX Q1 Q2;
PROC FREQ;
TABLES AGE SEX Q1 Q2;
Your SAS program should now look like this:
- Now save your changes and submit your program for processing by SAS, at the $ prompt, type:
sas survey.sas
You will know your job is completed when you see the $ prompt again. Now edit the survey.log file and check for errors, warnings, and notes. If your program ran without errors go on to the next session. If your program output shows errors, check the program carefully to make sure it is exactly like the examples given in this tutorial. If you find the errors and correct them, make sure you save your changes and then resubmit your job as above.
SAS - 1 : Data Step
Objectives
- Learn to use the DATA statement
- Learn to use the INPUT statement
- Learn to use the CARDS statement
- Learn how to use the semicolon (;)
- Learn how to include TITLES on your output
- Learn how to run a SAS program
DATA statement
Use: | Names the SAS data set |
Syntax: | DATA SOMENAME; |
Result: | A temporary SAS data set named SOMENAME is created |
The DATA statement signals the beginning of a DATA step. The general form of the SAS DATA statement is:
DATA SOMENAME;
The DATA statement names the data set you are creating. The name can be 1 to 32 characters long (older SAS releases allowed only 8) and must begin with a letter or underscore.
INPUT statement
Use: | Defines names and order of variables to SAS |
Syntax: | INPUT variable variable_type column(s); |
Result: | Input data are defined for SAS |
The INPUT statement specifies the names and order of the variables in your data. Although there are three types of INPUT statements which can be mixed, the beginning SAS user should only be concerned with learning how to use the Column Input style.
The INPUT statement should indicate in which columns each variable may be found. In addition, the INPUT statement should indicate how many lines it takes to record all of the information for each case or observation. The general form of the SAS INPUT statement is:
INPUT NAME $ 1-14 AGE 16-17 SEX 19 Q1 21 Q2 23;
The variable NAME is a character variable as is indicated by the dollar sign ($) after the variable name. All of the other variables are numeric.
If there are multiple lines of data for each observation, use a forward slash ('/') in the INPUT statement to indicate where a new data line begins.
The general form of the SAS INPUT statement with multiple lines of data per observation is:
INPUT NAME $ 1-14 AGE 16-17 / SEX 1 Q1 3 Q2 5;
Note: When describing the second line of input data, you begin with column one again. Each piece of data, or variable, will be read from the same columns for each of your observations. Only one INPUT statement is necessary to describe the data for all of your cases.
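A small runnable sketch of the two-line layout follows, using two records from the sample survey. The data set name survey2 is an assumption.

```sas
/* Sketch only: each observation spans two data lines */
data survey2;
  /* line 1: NAME and AGE; the slash moves to line 2, where columns restart at 1 */
  input name $ 1-14 age 16-17 / sex 1 q1 3 q2 5;
  datalines;
JANE           20
2 2 5
MICHAEL        18
1 5 2
;
run;

proc print data=survey2;
run;
```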
CARDS statement
Use: | Signals that input data will follow |
Syntax: | CARDS; |
Result: | Data can be processed for the SAS data set |
The CARDS statement signals that the data will follow next and immediately precedes your input data lines. The general form of the CARDS statement is:
DATA SURVEY;
INPUT NAME $ 1-14 AGE 16-17 SEX 19 Q1 21 Q2 23;
CARDS;
Note: If the data are contained in an external file, you will use an INFILE statement instead of the CARDS statement to specify where that file resides. (Example: INFILE 'survey.dat';).
Semicolon
Use: | Signals the end of any SAS statement |
Syntax: | A DATA Step or PROCedure statement; |
Result: | SAS is signaled that the statement is complete |
The semicolon (;) is used as a delimiter to indicate the end of SAS statements.
DATA SURVEY;
INPUT NAME $ 1-14 AGE 16-17 SEX 19 Q1 21 Q2 23;
CARDS;
TITLE statement
Use: | Puts TITLES on your output |
Syntax: | TITLE 'some title'; |
Result: | A TITLE is added at the top of each page |
The TITLE statement assigns a title which appears at the top of the output page. The general form of the TITLE statement is:
TITLE 'some title';
How to run a SAS program
Once you have used your editor to type in a SAS program and have saved that program you're ready to run the program. At the Linux ($) prompt, type:
sas programname.sas
When SAS is finished running the program, it will return two files to the current directory: 1) programname.log, which contains a log of the job execution, including errors, warnings and notes, and 2) programname.lst, which contains output from the procedures in the program.
For example, let's say you have used the Pico editor to enter a SAS program named survey.sas and have saved it in your root directory. To run that program from the Linux ($) prompt, type:
sas survey.sas
You will know that SAS has finished running the program when the $ prompt reappears. You will have two new files in your directory, survey.log and survey.lst. Now you may open the .log file and the .lst file to check for warnings, errors, and notes and to look at the output from the procedures.
A SAS Program Example
Session 2 Exercise
In the following exercise you will enter the first five SAS statements in your SAS program and you will enter the data from the sample survey.
- Invoke the Pico editor to create a new file. At the $ prompt, type:
pico survey.sas
- Enter the first five lines of a SAS program. In order for the exercises in this tutorial to work successfully, you must type the program statements exactly as they are presented here.
Type:
TITLE 'Sample SAS Program';
Result:
A title is added to the program
Type:
DATA SURVEY;
Result:
The SAS data set named SURVEY is created
Type:
INPUT NAME $ 1-14 AGE 16-17 SEX 19 Q1 21 Q2 23;
Result:
The format for your data is described to SAS
Type:
CARDS;
Result:
SAS is given the signal that data will follow directly
Type:
JANE           20 2 2 5
MICHAEL        18 1 5 2
MARIA          21 2 2 4
JUAN           26 1 4 3
MILDRED        28 2 3 4
GUNTHER        30 1 5 2
JOSEPH         25 1 4 4
JULIA          19 2 2 2
CODY           27 1 1 1
AARON          29 1 2 2
Result:
Your raw data are entered to be read by your INPUT statement.
Note: Remember to enter the data in the exact columns you have specified in your INPUT statement. For example, AGE must be in columns 16 and 17.
- Now, save this program under the name survey.sas. (In Pico, you will type Ctrl-X, answer yes when asked if you want to save changes, and press Enter when asked if you want to call the program survey.sas.) Your SAS program should look like this:
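The listing below is a sketch of survey.sas at this point, reconstructed from the statements typed in this exercise. The closing line with a single semicolon is an addition that is not among the typed steps; it is the conventional way to mark the end of the data lines.

```sas
TITLE 'Sample SAS Program';
DATA SURVEY;
INPUT NAME $ 1-14 AGE 16-17 SEX 19 Q1 21 Q2 23;
CARDS;
JANE           20 2 2 5
MICHAEL        18 1 5 2
MARIA          21 2 2 4
JUAN           26 1 4 3
MILDRED        28 2 3 4
GUNTHER        30 1 5 2
JOSEPH         25 1 4 4
JULIA          19 2 2 2
CODY           27 1 1 1
AARON          29 1 2 2
;
```

The data lines are padded so that AGE falls in columns 16 and 17, SEX in column 19, Q1 in column 21, and Q2 in column 23, as the INPUT statement requires.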