2007年12月30日 星期日

ODS的應用2-繪圖

還記得上一篇ODS的應用怎麼輸出HTM報表嗎

這次多加了一點花樣來玩玩~~~

也就是如何增加報表中特殊計算圖形輸出
只要再輸入一下語法將要輸出的報表包起來即可


ODS file-type ;
ODS GRAPHICS ON;

proc print;
run;

ODS GRAPHICS OFF;
ODS file-type CLOSE;



輸出後會如下圖增加一些額外繪圖,此為ROC curve繪製方法之一可參考應用SAS繪製ROC curve圖


而要目前SAS有支援的語法有:ANOVA, ARIMA, AUTOREG, CORR, ENTROPY, EXPAND, GAM, GLM, HPF, KDE, LIFETEST, LOESS, LOGISTIC, MI, MIXED, MODEL, PHREG, PRINCOMP, PRINQUAL, REG, ROBUSTREG, SYSLIN, TIMESERIES, UCM, VARMAX.等

此外所輸出的圖層也都不相同,用處也不同,所以剩下報表判讀就是使用者自己的問題了


2007年12月26日 星期三

ODS的應用

總是嫌SAS只能輸出醜醜的報表嗎?
利用ODS語法吧
剛開始先知道ODS的特性"包裝",也就是將分析步驟PROC step原本輸出在output視窗改為輸出成其他格式( file-type),如網頁格式或是直接輸出為word檔

以下為ODS基本語法,尚需指定格式 file-type

ODS file-type ;

proc print;
run;

ODS file-type CLOSE ;




輸出前指定格式 file-type可為HTML,rtf

ODS file-type ;

proc print;
run;

ODS file-type CLOSE ;



以HTML為例

ODS HTML ;

proc print;
run;

ODS HTML CLOSE ;



輸出如下圖


可將檔案輸出格式:HTML,rtf......可參考SAS說明

還是嫌ODS匯出很醜嗎,當然支援報表的客製化但寫語法還是太繁雜所以直接套用其他輸出模組不是更快嗎,以下提供一輔助語法 style=,後面接模組名稱如預設值defult

ODS HTML style= ;

proc print;
run;

ODS HTML CLOSE ;



也可改為d3d,analysis,banker,sasweb,sasweb2,torn......內建超級多

如下圖為style=torn的報表




但問題來了程式寫一堆要輸出分析的報表,但利用ODS來包裝要跳著框起來還是太麻煩了 Orz

那乾脆把整著都輸出算了,所以可以把程式重頭到尾都包起來,不過也可以利用設定SAS預設值的方式來表達


下圖還有選項可以順便選擇style


之後每次輸出報表時皆會輸出HTML格式報表,但相對的會使電腦增加一點處理

2007年12月25日 星期二

資料的行列互換

通常上下直向為一個變項,而左右橫向代表一個觀察值,但對於不常建立與使用資料的人而言,常常不了解資料的變項與觀察值的位置關係,固然在鍵入資料總是位置錯亂,而使的分析人員頭痛,不過以下提供一簡易方法


proc transpose ;
run;
proc print;
run;



原始資料


轉置後資料


此方法像翻書一樣自左下角往右上角翻過去,但在轉置仍有很多風險存在
而在轉換後因會自動將變項作命名此時看不順眼可以增加以下語法

proc transpose label=DESC name=VAR prefix=VAL ;
run;
proc print;
run;


label 即是用來替 _LABEL_ 重新命名,name 則是用來替 _NAME_ 從新命名,而 prefix 則是將 COL1 到 COL11 開頭的 COL 重新命名

重新命名後的資料呈現如下圖:


若只想轉換特定幾個變項時可以增加VAR statement

proc transpose label=DESC name=VAR prefix=VAL ;
var x1 x2;
run;
proc print;
run;



結果如下:


然而轉換利用SAS的命名方式 prefix 覺得比較無意議,則可以改用ID statement

proc transpose label=DESC name=VAR prefix=VAL ;
ID name;
run;
proc print;
run;



轉換前:

轉換後:





2007年12月24日 星期一

應用SAS繪製ROC curve圖

一般要繪製ROC curve 除了要先具備基礎觀念敏感度與特異度之外,並且知道ROC curve繪製過程,也就是變動閾值(threshold)獲得所有可能的敏感度與特異度,並將其收集為兩個變項軸x,y,其中的x軸為特異度(scentivity),y軸為1-特異度(1-specificity),並將其做圖,就是建立ROC曲線圖.
如下圖:



然而要利用SAS計算並且繪製似乎在SAS程式內未提供詳細說明與做法,吾在此提供兩種方法,在繪製前提尚需了解SAS如何處理資料,當欲考量兩族群(y=1or0)在自變數(x)的變化下對於該自變數能否有效鑑別出兩族群,一般而言會考量使用Logistic regression方法,以下為Logistic regression機率函數



並針對兩族群對於自變項設定model如下



此為Logistic regression 方程式模式化Modeling,並且SAS係利用反應變數x轉換為預測機率值將其計算該模式的特異度(scentivity),1-特異度(1-specificity)在將其繪製,自變項與預測機率關係是一對一


PROC logistic descending ;
MODEL y = x / OUTROC=roc;
output out=pred xbeta=x p=p;
RUN;



接著就是如何利用這樣的結果資料輸出為圖形,吾蒐集資料後提供吾所知道的方法,方法如下

計算與繪製:

1.ODS語法:此為SAS內建ODS語法

2.Nonparametric comparison of areas under correlated ROC curves:此為SAS macro語法,由SAS support,用來針對變項做計算與比較

3.Plot ROC curve with labelled points for a binary-response model:此為SAS macro語法,由SAS support,用來繪製ROC curve圖形,可與上述語法並用

4.簡易應用:將macro語法簡化使用

1.ODS語法
ODS html;
ODS graphics ON;


PROC logistic descending ;
MODEL y = x / OUTROC=roc;
RUN;

ODS graphics off;
ODS html CLOSE;



此方法是利用SAS內建ODS輸出語法將計算後的預測機率報表中的ROC資料加以繪製圖形,但此方法能力還是有限,僅能提供單一曲線以及該曲線的面積(AUC)


2.Nonparametric comparison of areas under correlated ROC curves

%roc(DATA= , VAR= , RESPONSE= , CONTRAST= , ALPHA= , DETAILS= ) ;

此方法為SAS提供的巨集語法macro,此語法特性是先將以下巨集語法先讀入,在設定上述語法方能計算出該資料的兩族群與自變項的曲線下面積(AUC)與該曲線95%C.I.,此法優點在於可比較多組自變項在兩族群間的效用並做檢定

巨集語法%roc


%macro roc(version, data=, var=, response=, contrast=, details=no,
alpha=.05);

%let _version=1.6;
%if &version ne %then %put ROC macro Version &_version;

%let opts = %sysfunc(getoption(notes))
_last_=%sysfunc(getoption(_last_));
options nonotes;

/* Check for newer version */
%if %sysevalf(&sysver >= 8.2) %then %do;
filename _ver url 'http://ftp.sas.com/techsup/download/stat/versions.dat' termstr=crlf;
data _null_;
infile _ver;
input name:$15. ver;
if upcase(name)="&sysmacroname" then call symput("_newver",ver);
run;
%if &syserr ne 0 %then
%put ROC: Unable to check for newer version;
%else %if %sysevalf(&_newver > &_version) %then %do;
%put ROC: A newer version of the ROC macro is available.;
%put %str( ) You can get the newer version at this location:;
%put %str( ) http://support.sas.com/ctx/samples/index.jsp;
%end;
%end;

title "The ROC Macro";
title2 " ";

%let error=0;

/* Verify that DATA= option is specified */
%if &data= %then %do;
%put ERROR: Specify DATA= containing the OUT= data sets of models to be compared;
%goto exit;
%end;

/* Verify that VAR= option is specified */
%if &var= %then %do;
%put ERROR: Specify predictor or XBETA variables in the VAR= argument;
%goto exit;
%end;

/* Verify that RESPONSE= option is specified */
%if &response= %then %do;
%put ERROR: Specify response variable in the RESPONSE= argument;
%goto exit;
%end;

%let i=1;
%do %while (%scan(&data,&i) ne %str() );
%let data&i=%scan(&data,&i);
%let i=%eval(&i+1);
%end;
%let ndata=%eval(&i-1);

data _comp(keep = &var &response);
%if &data=%str() or &ndata=1 %then set;
%else merge;
&data;
if &response not in (0,1) then call symput('error',1);
run;
%if &error=1 %then %do;
%put ERROR: Response must have values 0 or 1 only.;
%goto exit;
%end;

/* Original SAS/IML code from author follows */
proc iml;
start mwcomp(psi,z);
*;
* program to compute the mann-whitney components ;
* z is (nn by 2);
* z[,1] is the column of data values;
* z[,2] is the column of indicator variables;
* z[i,2]=1 if the observation is from the x population;
* z[i,2]=0 if the observation is from the y population;
*
* psi is the returned vector of u-statistic components;

rz = ranktie( z[,1] ); * average ranks;
nx = sum( z[,2] ); * num. of Xs ;
ny = nrow(z)-nx; * num of Ys ;
loc = loc( z[,2]=1 ); * x indexes ;
psi = j(nrow(z),1,0);
psi[loc] = (rz[loc] - ranktie(z[loc,1]))/ny; * x components ;
loc = loc( z[,2]=0 ); * y indexes ;
psi[loc] = ( nx+ranktie(z[loc,1])-rz[loc])/nx; * y components ;
free rz loc nx ny; * free space ;
finish;

start mwvar(t,v,nx,ny,z);
*;
* compute mann-whitney statistics and variance;
* input z, n by (k+1);
* z[,1:k] are the different variables;
* z[,k+1] are indicator values,
* 1 if the observation is from population x and ;
* 0 if the observation is from population y;
* t is the k by k vector of estimated statistics;
* the (i,j) entry is the MannWhitney statistic for the
* i-th column when used with the j-th column. The only
* observations with nonmissing values in each column are
* used. The diagonal elements are, hence, based only on the
* single column of values.
* v is the k by k estimated variance matrix;
* nx is the matrix of x population counts on a pairwise basis;
* ny is the matrix of y population counts on a pairwise basis;

k = ncol(z)-1;
ind = z[,k+1];
v = j(k,k,0); t=v; nx=v; ny=v;

* The following computes components after pairwise deletion of
* observations with missing values. If either there are no missing
* values or it is desired to use the components without doing
* pairwise deletion first, the nested do loops could be evaded.
*;
do i=1 to k;
do j=1 to i;
who = loc( (z[,i]^=.)#(z[,j]^=.) ); * nonmissing pairs;
run mwcomp(psii,(z[,i]ind)[who,]); * components;
run mwcomp(psij,(z[,j]ind)[who,]);
inow = ind[who,]; * Xs and Ys;
m = inow[+]; * current Xs;
n = nrow(psii)-m; * current Ys;
nx[i,j] = m; ny[i,j] = n;
mi = (psii#inow)[+] / m; * means;
mj = (psij#inow)[+] / m;
t[i,j] = mi; t[j,i] = mj;
psii = psii-mi; psij = psij-mj; * center;
v[i,j] = (psii#psij#inow)[+] / (m#(m-1))
+ (psii#psij#(1-inow))[+] / (n#(n-1));
v[j,i] = v[i,j];
end;
end;
free psii psij inow ind who;
finish;

/* start of execution of the IML program */
use _comp var {&var &response};
read all into data [colname=names];
run mwvar(t,v,nx,ny,data); * estimates and variances;
vname = names[1:(ncol(names)-1)];
manwhit = vecdiag(t);

/* omit: 0 for intercept-only model; not needed for further
computations
c=sqrt( vecdiag(v) ); c=v / (c@c`);
%if %upcase(%substr(&details,1,1)) ne N %then %do;
print c [label='Estimated Correlations' colname=vname rowname=vname];
%end;
*/

%if &contrast= %then %do;
%put ROC: No contrast specified. Pairwise contrasts of all;
%put %str( ) curves will be generated.;
call symput('col',char(ncol(data)-1));
%if &col=1 %then %str(l=1;); %else %do;
l=(j(&col-1,1)-i(&col-1))
%do i=&col-2 %to 1 %by -1;
//(j(&i,&col-&i-1,0)j(&i,1)-i(&i))
%end;
;
%end;
%end;
%else %do;
l = { &contrast };
%end;

lt=l*manwhit;
lv=l*v*l`;
c = ginv(lv);
chisq = lt`*c*lt;
df = trace(c*lv);
p = 1 - probchi( chisq, df );
/* Original SAS/IML code by author ends */

/* Individual area stderrs and CIs */
stderr=sqrt(vecdiag(v));
arealcl=manwhit-probit(1-&alpha/2)*stderr;
areaucl=manwhit+probit(1-&alpha/2)*stderr;
areastab=putn(manwhitstderrarealclareaucl,'7.4');

/* Pairwise difference stderrs and CIs */
sediff=sqrt(vecdiag(lv));
difflcl=lt-probit(1-&alpha/2)*sediff;
diffucl=lt+probit(1-&alpha/2)*sediff;
diffchi=(lt##2)/vecdiag(lv);
diffp=1-probchi(diffchi,1);

%if %upcase(%substr(&details,1,1)) ne N %then %do;
print t [label='Pairwise Deletion Mann-Whitney Statistics' colname=vname
rowname=vname];
%end;

print areastab [label=
"ROC Curve Areas and %sysevalf(100*(1-&alpha))% Confidence Intervals"
rowname=vname colname={'ROC Area' 'Std Error' 'Confidence' 'Limits'}];

call symput('maxrow',char(comb(max(nrow(l),2),2)));
rname='Row1':"Row&maxrow";
%if %upcase(%substr(&details,1,1)) ne N %then %do;
print v [label='Estimated Variance Matrix' colname=vname rowname=vname];
print nx [label='X populations sample sizes' colname=vname rowname=vname];
print ny [label='Y populations sample sizes' colname=vname rowname=vname];
print lv [label='Variance Estimates of Contrast' rowname=rname
colname=rname];
%end;
print l [label='Contrast Coefficients' rowname=rname colname=vname];

fdiffchi=putn(diffchi,'9.4');
fdiffp=putn(diffp,'pvalue.');
diffs=putn(ltsediffdifflcldiffucl,'7.4');
diffstab=diffsfdiffchifdiffp;
print diffstab [label=
"Tests and %sysevalf(100*(1-&alpha))% Confidence Intervals for Contrast Rows"
rowname=rname colname={'Estimate' 'Std Error' 'Confidence' 'Limits'
'Chi-square' 'Pr > ChiSq'}];

c2=putn(chisq,'9.4');
df2=putn(df,'3.');
p2=putn(p,'pvalue.');
ctest=c2df2p2;
print ctest [label='Contrast Test Results'
colname={'Chi-Square' ' DF' 'Pr > ChiSq'}];

/* Make overall p-value available */
%global pvalue;
call symput('pvalue',p2);

quit;

%exit:
options &opts;
title; title2;
%mend;



3.Plot ROC curve with labelled points for a binary-response model
%rocplot(out= , outroc= , p= , id= , plottype= , roffset= , font= , size= , color= , position= , plotchar= );

此方法為SAS提供的巨集語法macro,此語法特性是先將以下巨集語法先讀入,在設定上述語法方能繪出該自變項與兩族群間的ROC曲線,另外也能同時繪製多個自變項在同一張圖層上,並也能配合方法二

巨集語法%rocplot


%macro rocplot ( version, outroc=, out=, p=, id=, plottype=high, font=swissb,
size=2, position=F, color=black, plotchar=dot,
roffset=4, round=1e-8 );

%if &version ne %then %put ROCPLOT macro Version 1.0;
options nonotes;
%let nomatch=0;

/* Verify ID= is specified */
%if %quote(&id)= %then %do;
%put ERROR: The ID= option is required.;
%goto exit;
%end;

/* Verify P= is specified */
%if %quote(&p)= %then %do;
%put ERROR: The P= option is required.;
%goto exit;
%end;

/* Verify OUTROC= is specified and the data set exists */
%if %quote(&outroc) ne %then %do;
%if %sysfunc(exist(&outroc)) ne 1 %then %do;
%put ERROR: OUTROC= data set not found.;
%goto exit;
%end;
%end;
%else %do;
%put ERROR: The OUTROC= option is required.;
%goto exit;
%end;

/* Verify OUT= is specified and the data set exists */
%if %quote(&out) ne %then %do;
%if %sysfunc(exist(&out)) ne 1 %then %do;
%put ERROR: OUT= data set not found.;
%goto exit;
%end;
%end;
%else %do;
%put ERROR: The OUT= option is required.;
%goto exit;
%end;

data _outroc;
set &outroc;
_prob_=round(_prob_,&round);
run;

data _out;
set &out;
_prob_=round(&p , &round);
length _id $ 200;

/* Create single label variable */
_id=trim(left(%scan(&id,1)))
%let i=2;
%do %while (%scan(&id,&i) ne %str() );
'/'trim(left(%scan(&id,&i)))
%let i=%eval(&i+1);
%end;
;
run;

proc sort data=_out nodupkey;
by _prob_ _id;
run;

proc sort data=_outroc nodupkey;
by _prob_;
run;

data _rocplot;
_inout=0; _inroc=0;
merge _outroc(in=_inroc) _out(in=_inout);
by _prob_;
if not(_inout and _inroc) then do;
call symput('nomatch',1);
delete;
end;
run;

%if &nomatch=1 %then %do;
%put ROCPLOT: Some predicted values in OUT= did not match predicted values;
%put %str( in OUTROC=. Verify that you used the ROCEPS=0 option in);
%put %str( PROC LOGISTIC.);
%end;

%if %upcase(%substr(&plottype,1,1))=L %then %do;
footnote "Point labels are values of &id";
proc plot data=_rocplot;
plot _sensit_*_1mspec_ $ _id /
haxis=0 to 1 by .1 vaxis=0 to 1 by .1;
run; quit;
%end;

%if %upcase(%substr(&plottype,1,1))=H %then %do;
data _anno;
length function style color $ 8 position $ 1 text $ 200;
retain function 'label' xsys ysys '2' hsys '3'
size &size position "&position" style "&font"
color "&color";
set _rocplot(keep=_sensit_ _1mspec_ _id) end=eof;
x=_1mspec_; y=_sensit_; text=trim(left(_id)); output;

/* Draw (0,0) to (1,1) reference line */
if eof then do;
x=0; y=0; function='move'; output;
x=1; y=1; function='draw'; line=1; hsys='1'; size=0.25; output;
end;
run;

symbol1 i=join v=&plotchar c=blue l=1;
footnote "Point labels are values of &id";
axis1 offset=(1,&roffset)pct order=(0 to 1 by .1);
proc gplot data=_rocplot;
plot _sensit_*_1mspec_=1 / vaxis=0 to 1 by .1
haxis=axis1 annotate=_anno;
run;
quit;
%end;

footnote;
%exit:
options notes;
%mend;




4.
簡易應用

吾將語法簡化使用方式如下:
利用方法1可以簡單快速計算與繪製,也可獨立使用方法2與3

PROC logistic descending ;
MODEL y = x / OUTROC=roc;
output out=pred xbeta=x p=p;
RUN;


上述為前置動作是將依變項y與自變項x先計算出預測機率
接著先將方法2與3的巨集語法先讀入在執行下述語法


曲線下面積(AUC):
%roc(DATA=pred , VAR=p , RESPONSE=y );

ROC曲線圖:
%rocplot(out=pred , outroc=roc , p=p , id=x , color=white );



利用巨集語法計算曲線下面積(AUC)的計算也可略過Logisitc regression分析直接計算

曲線下面積(AUC):
%roc(DATA=資料 , VAR=自變項 , RESPONSE=兩族群(1,0) );
此用法前提為無遺漏值





References


SAS support
1.Nonparametric comparison of areas under correlated ROC curves
2.Plot ROC curve with labelled points for a binary-response model

Wikipedia
Receiver operating characteristic