Reading Compressed Raw Data Files Under Stata for Linux
This example reads selected variables from a compressed raw data file into Stata. The file is drawn from the ICPSR data holdings and is the Person data file from the National Health Interview Survey, 1984 (ICPSR 8659).
The example uses a do-file that first decompresses the data file with zcat, writing it from its archival storage location to /scratch, a public directory for temporary files. Naming the scratch file with an abbreviation of your own name (bobj.txt here) helps keep it from colliding with other users' files. This simple step is the key to using compressed data: the file stays decompressed only long enough to read the desired variables. The selected variables are read by the infile facility [see Stata online help infile2 for more details] using a dictionary that contains the specifications for the variables of interest and a reference to the location of the decompressed file. After infile finishes, the decompressed file is deleted from the scratch location. An annotated copy of the do-file is listed below.
* A sample do file to illustrate the use of an ICPSR compressed,
* raw data file and a Stata dictionary to read selected variables

version 9

* Decompress raw data file and write to scratch space - zcat is
* an external shell command, so is preceded by !
!zcat /opt/archive/icpsr/s8659/08659-0002-Data.txt.gz > /scratch/bobj.txt

* Read the raw data file using dictionary file s8659-part2.dct
infile using s8659-part2

* Delete the scratch data set
rm /scratch/bobj.txt
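If infile stops with an error, the do-file above halts before the rm step and the decompressed copy stays in /scratch. The following is a minimal sketch of a more defensive variant, not part of the original example. It assumes the same paths and dictionary and relies on Stata's confirm file, capture, and erase commands (rm is a synonym for erase on Unix systems).

```stata
* Sketch only: same paths and dictionary file as the example above
version 9

* Decompress to scratch space (zcat is an external shell command)
!zcat /opt/archive/icpsr/s8659/08659-0002-Data.txt.gz > /scratch/bobj.txt

* Stop if the scratch file does not exist (for example, /scratch
* is not writable). This does not detect a failed zcat, because the
* shell redirect can still leave an empty file behind.
confirm file /scratch/bobj.txt

* Read the data, but hold any error code instead of halting
capture noisily infile using s8659-part2
local rc = _rc

* Always delete the scratch copy, then re-raise any infile error
erase /scratch/bobj.txt
if `rc' != 0 {
    exit `rc'
}
```

The design choice is simply to postpone the error exit until after the cleanup step, so the scratch directory is left tidy whether or not the read succeeds.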
The dictionary file (s8659-part2.dct) illustrates the technique for reading raw data from fixed column locations. The dictionary statement names the physical file to read and opens a stanza, enclosed in braces, of input specifications. The _lrecl specification gives the length of the physical records, all of which are 260 characters long. Each _column specification gives the starting column, the storage type, the variable name, the input format, and a variable label. For example, _column(13) with the format %2f reads the two characters in columns 13 and 14. All of the variables here are integers, stored as small (byte), medium (int), or large (long) types:
the byte data type stores integers from -127 to 100
the int data type stores integers from -32,767 to 32,740
the long data type stores integers from -2,147,483,647 to 2,147,483,620
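The upper limits are not round powers of two because Stata reserves the values above them for its missing-value codes. As a quick check, the limits can be displayed from within Stata. This sketch assumes a Stata release that provides the c(minbyte), c(maxbyte), and related system values listed by creturn list.

```stata
* Display the storage limits Stata reports for each integer type
display "byte: " c(minbyte) " to " c(maxbyte)
display "int:  " c(minint)  " to " c(maxint)
display "long: " c(minlong) " to " c(maxlong)
```

The displayed limits should agree with the ranges listed above. A field whose values can exceed a type's range needs the next larger type in the dictionary.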
dictionary using /scratch/bobj.txt {
    _lrecl(260)
    _column(1)    byte  rectype  %2f  "Record type"
    _column(13)   byte  hnum     %2f  "Household number"
    _column(15)   byte  pnum     %2f  "Person number"
    _column(22)   byte  sex      %1f  "Sex of respondent"
    _column(23)   byte  age      %2f  "Age of respondent"
    _column(31)   int   yob      %4f  "Year of birth"
    _column(36)   byte  race     %1f  "Race recode 1"
    _column(41)   byte  marital  %1f  "Marital status"
    _column(44)   byte  educ     %2f  "Education of respondent"
    _column(65)   byte  health   %1f  "Health status"
    _column(203)  long  drvisit  %7f  "Dr visits last 12 mons"
}
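For comparison, the same fixed-column layout can be written inline with Stata's infix command instead of a dictionary file. This is an alternative technique, not the method the example uses. The sketch below covers only the first six variables, with column ranges derived from the _column() starting positions and %#f widths in the dictionary above. It would have to run while the decompressed file still exists in /scratch.

```stata
* Alternative sketch: read a few of the same columns with infix.
* Each range is the dictionary's starting column plus the format
* width, e.g. _column(31) with %4f becomes 31-34.
infix byte rectype 1-2   ///
      byte hnum    13-14 ///
      byte pnum    15-16 ///
      byte sex     22    ///
      byte age     23-24 ///
      int  yob     31-34 ///
      using /scratch/bobj.txt

* Inline infix specifications carry no labels, so attach them afterwards
label variable rectype "Record type"
label variable hnum    "Household number"
label variable pnum    "Person number"
label variable sex     "Sex of respondent"
label variable age     "Age of respondent"
label variable yob     "Year of birth"
```

The dictionary approach keeps the layout and labels together in one reusable file, which is likely why it suits archival data with many variables. The inline form can be handy for a quick look at a few columns.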
The Stata log below illustrates the execution of the do-file and verification of the results with a describe command and a list of the first 10 observations.
. do nhis_example.do

. * A sample do file to illustrate the use of an ICPSR compressed,
. * raw data file and a Stata dictionary to read selected variables
.
. version 9
.
. * Decompress raw data file and write to scratch space - zcat is
. * an external shell command, so is preceded by !
. !zcat /opt/archive/icpsr/s8659/08659-0002-Data.txt.gz > /scratch/bobj.txt

.
. * Read the raw data file using dictionary file s8659-part2.dct
. infile using s8659-part2

dictionary using /scratch/bobj.txt {
    _lrecl(260)
    _column(1)    byte  rectype  %2f  "Record type"
    _column(13)   byte  hnum     %2f  "Household number"
    _column(15)   byte  pnum     %2f  "Person number"
    _column(22)   byte  sex      %1f  "Sex of respondent"
    _column(23)   byte  age      %2f  "Age of respondent"
    _column(31)   int   yob      %4f  "Year of birth"
    _column(36)   byte  race     %1f  "Race recode 1"
    _column(41)   byte  marital  %1f  "Marital status"
    _column(44)   byte  educ     %2f  "Education of respondent"
    _column(65)   byte  health   %1f  "Health status"
    _column(203)  long  drvisit  %7f  "Dr visits last 12 mons"
}

(105290 observations read)

.
. * Delete the scratch data set
. rm /scratch/bobj.txt

.
end of do-file

. describe

Contains data
  obs:       105,290
 vars:            11
 size:     2,000,510 (80.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
rectype         byte   %8.0g                  Record type
hnum            byte   %8.0g                  Household number
pnum            byte   %8.0g                  Person number
sex             byte   %8.0g                  Sex of respondent
age             byte   %8.0g                  Age of respondent
yob             int    %8.0g                  Year of birth
race            byte   %8.0g                  Race recode 1
marital         byte   %8.0g                  Marital status
educ            byte   %8.0g                  Education of respondent
health          byte   %8.0g                  Health status
drvisit         long   %12.0g                 Dr visits last 12 mons
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. list in 1/10

     +-------------------------------------------------------------------------------------+
     | rectype   hnum   pnum   sex   age    yob   race   marital   educ   health   drvisit |
     |-------------------------------------------------------------------------------------|
  1. |      20      1      1     2    21   1962      1         4     10        3      8489 |
  2. |      20      1      2     1     2   1981      1         0     20        1     24189 |
  3. |      20      2      1     2    70   1913      1         3      5        2     16286 |
  4. |      20      3      1     1    31   1952      1         1     12        3         0 |
  5. |      20      3      2     2    35   1948      1         1     12        3     15864 |
     |-------------------------------------------------------------------------------------|
  6. |      20      3      3     1    10   1973      1         0      4        5    477480 |
  7. |      20      3      4     1     4   1979      1         0     20        3    241890 |
  8. |      20      4      1     1    58   1925      1         1     17        3     41005 |
  9. |      20      4      2     2    57   1926      1         1     12        2     81200 |
 10. |      20      5      1     1    68   1915      1         1      8        5    162880 |
     +-------------------------------------------------------------------------------------+

.
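The describe output notes that the dataset has changed since it was last saved, so a natural follow-up is to check the extract and write it to disk. The sketch below is one possible set of closing steps, not part of the original log. The file name nhis84_person is hypothetical, and the expected count of 105290 is taken from the infile message shown above.

```stata
* Confirm the observation count matches the infile message in the log
count
assert _N == 105290

* Summary statistics give a quick range check on every variable
summarize

* Let Stata demote variables to smaller storage types where possible
compress

* Save the extract as a Stata dataset (hypothetical file name)
save nhis84_person, replace
```

Reviewing the minimum and maximum of each variable against the study codebook is a simple way to catch a column position or width that was mistyped in the dictionary.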