Fundamentals of Programming in SAS. James Blum

       Note the different file extension. This data is similar to the data in Program 2.9.2, but in a tab-delimited file.

       The DLM= option uses the hexadecimal representation of a tab, 09, along with the hexadecimal literal modifier, x.

       No change is necessary in the LIST statement, regardless of what, if any, delimiter is used in the file.

      Because of the increase in information present in the log, Log 2.9.3 only shows the results for the first two records. For each record, the contents from the input buffer now occupy three lines in the log.

      Log 2.9.3: LIST Statement Results with Non-Printable Characters

      RULE:  ----+----1----+----2----+----3----+----4

      1 CHAR 439.12/11/2000.LAX.20.137 25

      ZONE 3330332332333304450330333

      NUMR 439912F11F20009C189209137

       

      2 CHAR 921.12/11/2000.DFW.20.131 25

      ZONE 3330332332333304450330333

      NUMR 921912F11F200094679209131

       The CHAR line represents the printable data from the input buffer. It displays non-printable characters as periods.

       The ZONE and NUMR rows give the first and second digits, respectively, of each character's two-digit hexadecimal code.

       Note the fourth column in the input buffer appears to be a period. However, combining the information from the ZONE and NUMR lines indicates the hexadecimal value is 09—a tab.

      Because SAS converts all non-printable characters to a period when writing the input buffer to the log, the ZONE and NUMR lines provide crucial information to determine the actual value stored in the input buffer. In particular, they provide a way to differentiate a period that was in the original data (hexadecimal code 2E) from a period that appears as a result of a non-printable character (for example, a tab with the hexadecimal code 09).
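
The same distinction can be checked directly in code. The following is a minimal sketch, not one of the book's numbered programs, that uses the $HEX2. format to write the hexadecimal code of a true period and of a tab. The variable names Ch and HexCode are chosen only for illustration, and the codes shown assume an ASCII-based system.

```sas
/* Illustrative sketch: show the hexadecimal codes behind two   */
/* characters that both appear as a period on the CHAR line.    */
data _null_;
  length Ch $ 1 HexCode $ 2;
  do Ch = '2E'x, '09'x;          /* a true period, then a tab   */
    HexCode = put(Ch, $hex2.);   /* two-digit hexadecimal code  */
    putlog HexCode=;
  end;
run;
```

On an ASCII-based system the log should show HexCode=2E followed by HexCode=09. In each case the first digit corresponds to the ZONE row and the second digit to the NUMR row in Log 2.9.3.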

      When debugging, two other useful statements are the PUT and PUTLOG statements. Both PUT and PUTLOG statements provide a way for SAS to write out information from the PDV along with other user-defined messages. The statements differ only in their destination—the PUTLOG statement can only write to the SAS log, while the PUT statement can write to the log or any file destination specified in the DATA step. The PUT statement is covered in more detail in Chapter 7; this section focuses on the PUTLOG statement. Program 2.9.4 uses Input Data 2.9.1 to demonstrate various uses of the PUTLOG statement, with Log 2.9.4 showing the results for the first record.

      Program 2.9.4: Demonstrating the PUTLOG Statement

      data work.flights;
        /* RawData is assumed to be a fileref for the folder that holds flights.prn */
        infile RawData('flights.prn');
        input FlightNum Date $ Destination $ FirstClass EconClass;
        putlog 'NOTE: It is easy to write the PDV to the log';
        putlog _all_;
        putlog 'NOTE: Selecting individual variables is also easy';
        putlog 'NOTE: ' FlightNum= Date;
        putlog 'WARNING: Without the equals sign variable names are omitted';
      run;

       This PUTLOG statement writes the quoted string to the log once for every record. If the string starts with the string ‘NOTE:’, then SAS color-codes the statement just like a system-generated note, and it is indexed in SAS University Edition with other system notes.

       The _ALL_ keyword selects every variable from the PDV, including the automatic variables _N_ and _ERROR_.

       The PUTLOG statements accept a mix of quoted strings and variable names or keywords.

       A variable name followed by the equals sign prints both the name and current value of the variable.

       Omitting the equals sign prints only the value of the variable, with a potentially adverse effect on readability.

       Beginning the string with ‘WARNING:’ or ‘ERROR:’ also ensures SAS color-codes the messages to match the formatting of the system-generated warnings and errors, and indexes them in SAS University Edition. If this behavior is not desired, use alternate terms such as QC_NOTE, QC_WARNING, and QC_ERROR to differentiate these user-generated quality control messages from automatically generated messages.
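
As a hedged illustration of the alternate-prefix idea, the sketch below writes a user-defined quality control message only when a value looks suspicious. The check on FirstClass is a hypothetical rule chosen for illustration, and RawData is assumed to be the same fileref used in Program 2.9.4.

```sas
data work.flights;
  infile RawData('flights.prn');
  input FlightNum Date $ Destination $ FirstClass EconClass;
  /* Hypothetical check: flag records with a missing FirstClass count */
  if missing(FirstClass) then
    putlog 'QC_WARNING: FirstClass is missing. ' _N_= FlightNum=;
run;
```

Because the message begins with QC_WARNING rather than WARNING:, it stays distinguishable from system-generated warnings, and searching the log for the QC_ prefix locates every user-generated message.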

      Log 2.9.4: Demonstrating the PUTLOG Statement

      NOTE: It is easy to write the PDV to the log

      FlightNum=439 Date=12/11/20 Destination=LAX FirstClass=20 EconClass=137 _ERROR_=0 _N_=1

      NOTE: Selecting individual variables is also easy

      FlightNum=439 12/11/20

      WARNING: Without the equals sign variable names are omitted

      When used together in a single DATA step, the PUTLOG and LIST statements allow for easy access to both the PDV and input buffer contents, providing a simple way to track down the source of data errors. In these examples, the PUTLOG results shown in Log 2.9.4 reveal the truncation in the Date values in the PDV, while the LIST results from Log 2.9.2 or 2.9.3 show the full date is present in the input buffer. Taken together, these results make it clear that the issue with Date values is not present in the raw data and must relate to how the INPUT statement has parsed those values from the input buffer.
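
A minimal sketch of this combined approach follows. It is not one of the book's numbered programs; it reuses the names from Program 2.9.4, and the limit of two records is an arbitrary choice to keep the log short.

```sas
data work.flights;
  infile RawData('flights.prn');
  input FlightNum Date $ Destination $ FirstClass EconClass;
  /* Limit the diagnostics to the first two records */
  if _N_ le 2 then do;
    putlog 'QC_NOTE: PDV contents for record ' _N_;
    putlog _all_;   /* values as stored in the PDV              */
    list;           /* raw record as held in the input buffer   */
  end;
run;
```

Comparing the Date value written by PUTLOG against the record written by LIST shows, for the same record, whether a value was already wrong in the raw file or was altered when it was read.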

      One additional debugging tool, the ERROR statement, acts exactly as the PUTLOG statement does, while also setting the automatic variable _ERROR_ to one.
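
The sketch below shows one way the ERROR statement might be used, again with the names from Program 2.9.4. The rule that EconClass cannot be negative is a hypothetical check introduced only for illustration.

```sas
data work.flights;
  infile RawData('flights.prn');
  input FlightNum Date $ Destination $ FirstClass EconClass;
  /* Hypothetical rule: a negative seat count is treated as a data error. */
  /* The ERROR statement writes the message and sets _ERROR_ to 1.        */
  if EconClass lt 0 then
    error 'QC_ERROR: Negative EconClass value. ' FlightNum= EconClass=;
run;
```

Because _ERROR_ is set to one for the flagged record, SAS treats that iteration as it would any other iteration with a data error.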

      The term validation has many definitions in the context of programming. It can refer to ensuring software is correctly performing the intended actions, for example, confirming that PROC MEANS is accurately calculating statistics on the provided data set. Validation can also refer to the processes of self-validation, independent validation, or both. Self-validation occurs when a programmer performs checks to ensure that any data sets read in accurately reflect the source data, that data set manipulations are correct, or that any analyses accurately represent the input data sets. Independent validation occurs when two programmers develop code separately to achieve the same results. Once both programmers have produced their intended results, those results (such as listings of data sets, tables of summary statistics, or graphics) are compared and are considered validated when no substantive differences appear between the two sets of results. SAS provides a variety of tools to use for validation, including PROC COMPARE, which is an essential tool for independent validation of data sets.

      The COMPARE procedure compares the contents of two SAS data sets, selected variables across different data sets, or selected variables within a single data set. When two data sets are used, one is specified as the base data set and the other as the comparison data set. Program 2.10.1 shows the options used to specify these data sets, BASE= and COMPARE=, respectively.

      Program 2.10.1: Comparing the Contents of Two Data Sets

      proc compare base = sashelp.fish compare = sashelp.heart;

      run;

      If no statements beyond those shown in Program 2.10.1 are included, the complete contents of the two data sets are compared, along with meta-information such as types, lengths, labels, and formats. PROC COMPARE only compares attributes and values for variables with common names across the two data sets. Thus, even though the full data sets are specified for comparison, it is possible for individual variables in one data set to not be compared against any variable in the other data set. Program 2.10.1 compares two data sets from the Sashelp library: Fish and Heart. These data sets are not intended to match, so submitting Program 2.10.1 produces a summary of the mismatches to demonstrate the types of output available from the COMPARE procedure.
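
For contrast, the following sketch compares two data sets that are intended to match except for one deliberate change. It is an illustration rather than one of the book's programs: Work.ClassCopy is a hypothetical data set name, and Sashelp.Class is assumed to be available in the Sashelp library.

```sas
/* Make a copy of Sashelp.Class with one deliberately altered value */
data work.classCopy;
  set sashelp.class;
  if Name = 'Alfred' then Height = Height + 1;
run;

/* Compare the original (base) against the altered copy (comparison) */
proc compare base = sashelp.class compare = work.classCopy;
run;
```

Since the two data sets share all variable names, every variable is compared, and the results should flag a difference in Height for a single observation while reporting the remaining values as equal.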

      Output 2.10.1: Comparing the Contents of Two Data Sets

