If you are a beginning SAS programmer, then the following tips may not be particularly helpful, but the books suggested in the middle of this page may be. BU students can obtain a free license for SAS to install on their own computers if it is required for a course or research project; both uses require an email from an adviser. SAS is also available on various computers in the economics department computer labs.
I have also created a separate page of Ellis SAS tips for new SAS programmers.
I do a lot of SAS programming on large datasets, and thought it would be productive to share some of my SAS programming tips in one place. Here, "large data" means a dataset so large that it cannot be stored in the available memory. (My largest data file to date is 1.7 terabytes.)
Suggestions and corrections welcome!
Use SAS macro language whenever possible;
It is so much easier to work with short strings than long lists, especially with repeated models and datasteps;
%let rhs = Age Sex HCC001-HCC394;
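As a sketch of how this pays off, one %let can drive several later steps, so a single edit updates every model at once (the dataset and outcome names here are hypothetical):

```sas
%let rhs = Age Sex HCC001-HCC394;

* one edit to &rhs updates every model that uses it;
proc reg data = sample;
  model spend = &rhs;
run;

proc logistic data = sample;
  model anyspend = &rhs;
run;
```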
Design your programs for efficient reading and writing of files, and minimize temporary datasets.
SAS programs on large data are generally constrained by IO (input/output: reading from and writing to your hard drives), not by CPU (actual calculations) or memory (storage that disappears once your SAS program ends). I have found that some computers with high-speed CPUs and multiple cores are slower than simpler computers because they are not optimized for speedy hard drives. Large memory really helps, but for really huge files it will almost always be exceeded, and then your hard drive speeds will really matter. Even just reading in and writing out files, the hard drive speeds will be your limiting factor.
The implication of this is that you should do variable creation in as few DATA steps as possible, and minimize sorts, since reading and saving datasets takes a lot of time. This requires a real change in thinking from STATA, which is designed for changing one variable at a time on a rectangular file. Recall that STATA can do this efficiently because it usually starts by bringing the full dataset into memory before making any changes. SAS does not, which for large data is one of its strengths.
Learning to use DATA steps and PROC SQL is the central advantage of an experienced SAS programmer. Invest, and you will save time waiting for your programs to run.
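For instance, rather than one step per new variable (the STATA habit), create all the new variables in a single pass through the data (a hypothetical example; the dataset and variables are placeholders):

```sas
* one pass through the data creates all three variables at once;
data master;
  set master;
  logspend = log(max(spend, 1));
  old      = (age >= 65);
  anyspend = (spend > 0);
run;
```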
Clean up your main hard drive if at all possible.
Otherwise you risk SAS crashing when your hard drive gets full. If it does, cancel the job and be sure to delete the temporary SAS datasets that may have been created before you crashed. The SAS default for storing temporary files is something like
C:\Users\"your_user_name".AD\AppData\Local\Temp\SAS Temporary Files
Unless you have SAS currently open, you can safely delete all of the files stored in that directory. Ideally, there should be none since SAS deletes them when it closes normally. It is the abnormal endings of SAS that cause temporary files to be saved. Delete them, since they can be large!
Change the default hard drive for temporary files and sorting
If you have a large internal secondary hard drive with lots of space, then change the SAS settings so that it uses temp space on that drive for all work files and sorting operations.
To change this default location to a different internal hard drive, find your sasv9.cfg configuration file.
Find the lines in the config file that start with -WORK and -UTILLOC and change them to your own locations for the temporary files (mine are on drives j and k), such as:
-WORK "k:\data\temp\SAS Temporary Files"
-UTILLOC "j:\data\temp\SAS Temporary Files"
The first line sets where SAS stores its temporary work files, such as WORK.ONE, which you create with a statement like DATA ONE;
The second line sets where SAS stores its own utility files, such as when sorting a file or saving residuals.
The reason to put the WORK and UTIL files on different drives is that SAS is then generally reading in from one drive and writing out to a different one, rather than reading and writing on the same drive. Try to avoid the latter. Do some tests on your own computer to see how much time you can save by reading from one drive and writing to another instead of using only one drive.
Use only internal hard drives for routine programming
Very large files may require storage or backup on external hard drives, but these are incredibly slow: three to ten times slower than an internal hard drive. Try to minimize their use for actual project work. Instead, buy more internal drives if possible. You can purchase an additional internal hard drive with 2 TB of space for under $100. You will save that much in time the first day!
Always try to write large datasets to a different disk drive than you read them in from.
Do some tests copying large files from C: to C: and from C: to F:. You may not notice any difference until the file sizes get truly huge, greater than your memory size.
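One simple way to run such a test is to log the datetime before and after copying a dataset between two libraries (the paths and file name below are placeholders for your own drives and data):

```sas
libname in  "C:\data";
libname out "F:\data";

* record the start time, copy, then report elapsed seconds in the log;
%let t0 = %sysfunc(datetime());
data out.bigcopy;
  set in.bigfile;
run;
%put NOTE: copy took %sysevalf(%sysfunc(datetime()) - &t0) seconds;
```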
Consider using binary compression to save space and time if you have a lot of binary variables.
By default, SAS stores datasets in a fixed rectangular layout that leaves lots of empty space when you use integers instead of real numbers. Although I have long been a fan of using OPTIONS COMPRESS=YES to save space and run time (but not CPU time), I only recently discovered that
OPTIONS COMPRESS=BINARY;
is even better for integers and binary flags when they outnumber real numbers. For some large datasets with lots of zero/one dummies it has reduced my file size by as much as 97%! Standard numeric variables are stored as 8 bytes, which is 64 bits, so in principle you could store 64 binary flags in the space of one real number. Try saving some files under different compression settings and see if your run times and storage space improve. Note: compression can INCREASE file size for real numbers! It seems that compression saves space when binary flags outnumber real numbers or integers.
Try various permutations of the following on your computer with your actual data to see what saves time and space:
data real;
  retain x1-x100 1234567.89101112;
  do i = 1 to 100000; output; end;
run;
proc means data=real; run;
data dummies;
  retain d1-d100 1;
  do i = 1 to 100000; output; end;
run;
proc means data=dummies; run;
*try various data steps with this, using the same or different drives. Bump up the obs to see how times change;
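To see the compression effect directly, save the same file under each setting and compare the sizes and compression notes reported in the log (a sketch; OUT is a hypothetical libname pointing at one of your drives):

```sas
* same data, three compression settings; the log reports the size reduction;
data out.dummies_none (compress=no);     set dummies; run;
data out.dummies_char (compress=yes);    set dummies; run; * CHAR compression;
data out.dummies_bin  (compress=binary); set dummies; run; * best with many flags;
```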
Create a macro file where you store macros that you want to have available anytime you need them. Do the same with your formats;
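One common way to do this is the autocall facility: keep one macro per file in a folder and point SASAUTOS at it, and %include a shared formats file at the top of each program (the paths below are placeholders for your own):

```sas
* make any macro saved in k:\sas\macros available by name in this program;
options mautosource sasautos = ("k:\sas\macros", sasautos);

* compile your shared formats;
%include "k:\sas\macros\formats.sas";
```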
Be aware of which SAS procs create large, intermediate files
Some but not all procs create huge temporary datasets.
Consider: PROC REG and PROC GLM generate all of their results in one pass through the data unless you have an OUTPUT statement; with one, they create large, uncompressed, temporary files that can be a multiple of your original file size. PROC SURVEYREG and PROC MIXED create large intermediate files even without an OUTPUT statement. Plan accordingly.
Consider using OUTEST=BETA to more efficiently create residuals together with PROC SCORE.
Compare two ways of making residuals;
*make test dataset with ten million obs, but a trivial model;
data test;
  retain junk1-junk100 12345; * it is carrying along all these extra variables that slows SAS down;
  do i = 1 to 10000000;
    x = rannor(234567);
    y = x + rannor(12345);
    output;
  end;
run; * 30.2 seconds;
*Straightforward way. Times on my computer shown following each step;
proc reg data = test;
y: model y = x;
output out=resid (keep=resid) residual=resid;
run; *25 seconds;
proc means data = resid;
run; *.3 seconds;
*total of the above two steps is 25.6 seconds;
proc reg data = test outest=beta ;
resid: model y = x;
run; *3.9 seconds;
proc print data = beta;
run; *take a look at beta that is created;
proc score data=test score=beta type=parms
out=resid (keep=resid) residual;
run; *6 seconds!;
proc means data = resid;
run; *.3 seconds;
*total from the second method is 10.3 seconds versus 25.6 for the direct approach, PLUS no temporary files needed to be created that might crash the system;
If the model statement in both regressions is
y: model y = x junk1-junk100; *note that all of the junk variables have coefficients of zero, but SAS does not know this going in;
then the two times are
Direct approach: 1:25.84
Scoring approach: 1:12.46 on the regression plus 9.01 seconds on the score = 1:21.47, which is a smaller savings
On very large files the time savings are even greater because of the reduced IO; SAS is still able to do this without writing to the hard drive in this "small" sample on my computer. But the real savings is on temporary storage space.
Use a bell!
My latest addition to my macro list is the following bell macro, which makes sounds.
Use %bell; at the end of your SAS program that you run batch and you may notice when the program has finished running.
%macro bell;
*plays the trumpet call, useful to put at end of batch program to know when the batch file has ended;
*Randy Ellis and Wenjia Zhu November 18 2014;
data _null_;
  call sound(392.00,70); *first argument is frequency, second is duration;
  *further CALL SOUND statements for the rest of the trumpet call go here;
run;
%mend bell;
Purchase essential SAS programming guides.
I gave up on purchasing the paper copy of SAS manuals, because they take up more than two feet of shelf space, and are still not complete or up to date. I find the SAS help menus useful but clunky. I recommend the following if you are going to do serious SAS programming. Buy them used on Amazon or whatever. I would get an older edition, and it will cost less than $10 each. Really.
The Little SAS Book: A Primer, Fifth Edition (or an earlier one)
Nov 7, 2012
Beginners introduction to SAS. Probably the best single book to buy when learning SAS.
Professional SAS Programmer’s Pocket Reference Paperback
By Rick Aster
Wonderful, concise summary of all of the main SAS commands, although you will have to already know SAS to find it useful. I use it to look up specific functions, macro commands, and options on various procs because it is faster than using the help menus. But I am old style…
Professional SAS Programming Shortcuts: Over 1,000 ways to improve your SAS programs Paperback
By Rick Aster
I don’t use this as much as the above, but if I had time, and were learning SAS instead of trying to rediscover things I already know, I would read through this carefully.
Get in the habit of deleting most intermediate permanent files
Delete files if either
1. You won’t need them again or
2. You can easily recreate them again. *this latter point is usually true;
Beginner programmers tend to save too many intermediate files. Usually it is easier to rerun the entire program than to save the intermediate files. Give your final file of interest a name like MASTER or FULL_DATA, then keep modifying it by adding variables, instead of using names like SORTED, STANDARDIZED, RESIDUAL, or FITTED.
Consider a macro that helps make it easy to delete files.
%macro delete(library=work, data=temp, nolist=);
  proc datasets library=&library &nolist;
    delete &data;
  run;
  quit;
%mend delete;
*sample macro calls
%delete (data=temp); *for temporary WORK files. You can also list multiple file names, but these disappear anyway at the end of your run;
%delete (library=out, data=one two three); *for two-level names on files in the library OUT;
%delete (library=out, data=one, nolist=nolist); *the NOLIST option suppresses the directory listing in the output;