Reading Compressed Raw Data Files Under Linux

Reading Compressed Raw Data Files Under Stata for Linux

This example uses Stata to read selected variables from a compressed raw data file. The file comes from the ICPSR data holdings: it is the Person data file from the National Health Interview Survey, 1984 (ICPSR 8659).

The example uses a do-file that first decompresses the data file with zcat, writing it from its archival storage location to /scratch, a public directory for temporary files. Naming the scratch file with an abbreviation of your own name (bobj.txt in this example) keeps it from colliding with other users' files. This step is the key to working with compressed data: the decompressed copy needs to exist only long enough to subset the desired variables. The selected variables are then read by the infile command [see the Stata online help for infile2 for details], using a dictionary that contains the specifications for the variables of interest and a reference to the location of the decompressed file. After infile finishes, the decompressed file is deleted from the scratch location. An annotated copy of the do-file is listed below.

* A sample do file to illustrate the use of an ICPSR compressed,
* raw data file and a Stata dictionary to read selected variables

version 9

* Decompress raw data file and write to scratch space - zcat is
* an external shell command, so is preceded by !
!zcat /opt/archive/icpsr/s8659/08659-0002-Data.txt.gz > /scratch/bobj.txt

* Read the raw data file using dictionary file s8659-part2.dct
infile using s8659-part2

* Delete the scratch data set
rm /scratch/bobj.txt
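
If infile stops with an error, a do-file like the one above halts before it reaches the rm line, and the decompressed copy stays behind in /scratch. One way to guard against that, offered here as a sketch rather than as part of the original do-file, is to run infile under capture noisily, delete the scratch file regardless of the outcome, and then pass along any error code. The local macro name rc is arbitrary.

```stata
* Sketch: always remove the scratch copy, even when infile fails
!zcat /opt/archive/icpsr/s8659/08659-0002-Data.txt.gz > /scratch/bobj.txt

* capture traps any error; noisily still displays infile's output
capture noisily infile using s8659-part2
local rc = _rc

* Delete the scratch data set whether or not infile succeeded
rm /scratch/bobj.txt

* Stop the do-file with infile's return code if it was nonzero
if `rc' != 0 {
    exit `rc'
}
```

The return code is copied into a local macro right away so that it is still available after the cleanup step.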

The dictionary file (s8659-part2.dct) illustrates the technique for reading raw data from fixed column locations. The dictionary statement names the physical file to read and opens a block of input specifications enclosed in braces. The _lrecl specification gives the length of the physical records, all of which are 260 characters long. Each _column specification gives the starting column location, followed by the storage type, the variable name, the input format, and a variable label. All of the variables here are integers, stored as small (byte), medium (int), or large (long) types:

the byte data type stores integers from -127 to 100
the int data type stores integers from -32,767 to 32,740
the long data type stores integers from -2,147,483,647 to 2,147,483,620

dictionary using /scratch/bobj.txt {
        _lrecl(260)
        _column(1)   byte  rectype    %2f "Record type"
        _column(13)  byte  hnum       %2f "Household number"
        _column(15)  byte  pnum       %2f "Person number"
        _column(22)  byte  sex        %1f "Sex of respondent"
        _column(23)  byte  age        %2f "Age of respondent"
        _column(31)  int   yob        %4f "Year of birth"
        _column(36)  byte  race       %1f "Race recode 1"
        _column(41)  byte  marital    %1f "Marital status"
        _column(44)  byte  educ       %2f "Education of respondent"
        _column(65)  byte  health     %1f "Health status"
        _column(203) long  drvisit    %7f "Dr visits last 12 mons"
}
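
Each input format in the dictionary sets the width of the field being read, so every entry corresponds to a span of columns: _column(31) with %4f covers columns 31 through 34, and _column(203) with %7f covers columns 203 through 209. As a cross-check of that arithmetic, the same spans can be written out explicitly with Stata's infix command. This is an alternative sketch, not the method used in this example. It attaches no variable labels, and it assumes the records in the scratch file are separated by end-of-line characters, which this page does not state.

```stata
* Sketch: explicit column ranges matching the dictionary entries
* (assumes newline-delimited records in the scratch file)
infix byte rectype  1-2     ///
      byte hnum     13-14   ///
      byte pnum     15-16   ///
      byte sex      22      ///
      byte age      23-24   ///
      int  yob      31-34   ///
      byte race     36      ///
      byte marital  41      ///
      byte educ     44-45   ///
      byte health   65      ///
      long drvisit  203-209 ///
      using /scratch/bobj.txt
```

The dictionary approach keeps the variable labels and the record length together in one reusable file, which is why it suits archival data read repeatedly.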

The Stata log below shows the do-file running, followed by verification of the results with the describe command and a listing of the first 10 observations.

. do nhis_example.do

. * A sample do file to illustrate the use of an ICPSR compressed,
. * raw data file and a Stata dictionary to read selected variables
.
. version 9

.
. * Decompress raw data file and write to scratch space - zcat is
. * an external shell command, so is preceded by !
. !zcat /opt/archive/icpsr/s8659/08659-0002-Data.txt.gz > /scratch/bobj.txt


.
. * Read the raw data file using dictionary file s8659-part2.dct
. infile using s8659-part2

dictionary using /scratch/bobj.txt {
        _lrecl(260)
        _column(1)   byte  rectype    %2f "Record type"
        _column(13)  byte  hnum       %2f "Household number"
        _column(15)  byte  pnum       %2f "Person number"
        _column(22)  byte  sex        %1f "Sex of respondent"
        _column(23)  byte  age        %2f "Age of respondent"
        _column(31)  int   yob        %4f "Year of birth"
        _column(36)  byte  race       %1f "Race recode 1"
        _column(41)  byte  marital    %1f "Marital status"
        _column(44)  byte  educ       %2f "Education of respondent"
        _column(65)  byte  health     %1f "Health status"
        _column(203) long  drvisit    %7f "Dr visits last 12 mons"
}

(105290 observations read)

.
. * Delete the scratch data set
. rm /scratch/bobj.txt

.
end of do-file

. describe

Contains data
  obs:       105,290
 vars:            11
 size:     2,000,510 (80.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
rectype         byte   %8.0g                  Record type
hnum            byte   %8.0g                  Household number
pnum            byte   %8.0g                  Person number
sex             byte   %8.0g                  Sex of respondent
age             byte   %8.0g                  Age of respondent
yob             int    %8.0g                  Year of birth
race            byte   %8.0g                  Race recode 1
marital         byte   %8.0g                  Marital status
educ            byte   %8.0g                  Education of respondent
health          byte   %8.0g                  Health status
drvisit         long   %12.0g                 Dr visits last 12 mons
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. list in 1/10

     +-------------------------------------------------------------------------------------+
     | rectype   hnum   pnum   sex   age    yob   race   marital   educ   health   drvisit |
     |-------------------------------------------------------------------------------------|
  1. |      20      1      1     2    21   1962      1         4     10        3      8489 |
  2. |      20      1      2     1     2   1981      1         0     20        1     24189 |
  3. |      20      2      1     2    70   1913      1         3      5        2     16286 |
  4. |      20      3      1     1    31   1952      1         1     12        3         0 |
  5. |      20      3      2     2    35   1948      1         1     12        3     15864 |
     |-------------------------------------------------------------------------------------|
  6. |      20      3      3     1    10   1973      1         0      4        5    477480 |
  7. |      20      3      4     1     4   1979      1         0     20        3    241890 |
  8. |      20      4      1     1    58   1925      1         1     17        3     41005 |
  9. |      20      4      2     2    57   1926      1         1     12        2     81200 |
 10. |      20      5      1     1    68   1915      1         1      8        5    162880 |
     +-------------------------------------------------------------------------------------+

.
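
The byte, int, and long types each reserve their highest values for missing-value codes, so a useful follow-up is to confirm that no variable picked up unexpected missing values or implausible extremes during the read. The sketch below uses only standard commands (summarize, count, compress, save). The output file name nhis84_person is hypothetical, and whether missing values are legitimate for any of these variables depends on the NHIS codebook, which is not reproduced here.

```stata
* Sketch: inspect ranges, then store the subset for later sessions
summarize

* Count observations where the widest field came in as missing
count if missing(drvisit)

* Let Stata shrink storage types where the data allow it
compress

* Save the extract; the file name here is hypothetical
save nhis84_person, replace
```

Saving the extract means the compressed archive file does not have to be decompressed again in later sessions.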
