Stat 301 – Working with a large data file

 

In Investigation 2.1, we want to explore birth weights of newborns in the United States. The CDC’s Vital Statistics Online Data Portal (www.cdc.gov/nchs/data_access/vitalstatsonline.htm) gives you access to ALL births in a year. 

 

(a) Open this website and follow the Downloadable Data Files link for Birth. How large is the zip file for the U.S. data?

 

(b) Open the User’s Guide for 2020.  How many births were recorded in the U.S. in 2020?

 

 

(c) According to this documentation file, how many total “positions” are there in the data file? What information is in the first 8 positions?  What information is in the next 4 positions?  How would you determine the birth month?

 

 

If you were to download this file, it will take 1-2 minutes.

(d) Unzip the contents of the .zip file.

·         Right click on the .zip file and select Extract All. 

 

The unzipped file is 5GB! This can be problematic to work with, especially if you were most interested in just a few of the variables.  It can also be cumbersome to deal with all of the rows, especially if you just want to see general patterns.  Most packages can now work with datafiles this large, but we are still pushing the limit a bit. JMP and R still take several minutes to read in the data, and each row is still a single string (not separated into columns or tabs).One approach is to preprocess the data a bit first.

 

To read a “fixed width” data file, one approach is to use the “awk” language

·         Download the awk.exe file and the GetMonth.bat file into the same folder as the data file.

·         Right click on getMonth.bat and select Edit (should open NotePad)

o   The goal of this program is to extract all the records for one month of the year.  This happens by extracting out the rows where strings in positions 13 and 14 are equal to 01. The results are stored in Jan.txt.

 

(e) What does substr($0, 13,2) ==\"01\" do? (Hint: I just told you, so what do the 2 and 01 do?)

 

 

·         To run this batch file, double click on it (getMonth.bat). (A “Command Prompt window” should open. You may have to give your ok.) When it’s done, you should see a new file: Jan.txt (Press any key to continue.)

 

This is still a pretty large file and each row is one long string (“fixed width”).  If you open this file in JMP or Excel, they at least try to separate the columns. You can use read.fwf in R to specify about the positions/variable, but you need to know all of the widths, e.g.,

w <- c(7, 11, 10, 5, 22, 10, 9, 9, 9, 9, 12)

read.fwf("myfile.dat", w)

 

(f) Where is information on how much weight the mother gained during pregnancy?

 


 

(g) How many positions are between birth month and mom’s weight gain? Hint: What is the “length” between these two variables?

 

 

(h) Where is information on the “5 minute Apgar score”?  What is “apgar score”?

 

 

(i) How many positions between mom weight gain and 5 min Apgar score?

 

 

(j) Find OEGest_R3. Where is it and what is it? How many positions between?

 

 

(k) Which positions contain the baby birth weight? What are the measurement units?

 

 

So now we want to extract data on birth weight, whether the baby was full term (gestation over 36 weeks), the 5 minute apgar score (an immediate measure of the infant’s health), and the amount of weight gained by the mother during pregnancy (in lbs).

 

(k) Confirm the following widths

            14 to include positions 1-14

            488 to include positions 15-502

            1 to include OEGest_R3 (503)

            4 to include birthweight (504-507)

            823 to include 508-1330

> w = c(14, -488, 1, 4, -823)

> JanSub = read.fwf("C:/Users/bchance/Downloads/Nat2020us/Jan.txt", w)
> names(JanSub) = c("date", "OEG", "birthweight")

 

This should take 2-3 minutes. Let’s extract less data instead.

·         In the text editor of the getMonth.bat file, type rem in front of the existing line.

·         Copy/type the following as the next line (before pause)

awk "{ if (substr($0, 13,2) ==\"01\") print (substr($0, 504, 4) \",\" substr($0, 503, 1)  ) }" < Nat2020PublicUS.c20210506.r20210812.txt > Jan.txt

pause

This prints out positions 503, comma separated, positions 504-507. Run the batch file.

 

(l) Read these data into R or JMP (use a file open dialog and navigate to this folder)

RStudio:  Import Dataset >  From Text (base)

> names(Jan) = c("birthweight”, "OEGest")

JMP: File > Open (probably then “Data, using best guess), and name the columns

 

(m) Now extract the birth weights, mother’s weight gain, Apgar score, and OEGest_R3 for your assigned month. For example, for February:

awk "{ if (substr($0, 13,2) ==\"02\") print (substr($0, 504, 4) \",\", substr($0, 304, 2) \",\" substr($0, 444, 2) \",\" substr($0, 503, 1)  ) }" < Nat2020PublicUS.txt > Feb2.csv