Stat 301 – Working with a large data
file
In Investigation 2.1, we want to explore
birth weights of newborns in the United States. The CDC’s Vital Statistics
Online Data Portal (www.cdc.gov/nchs/data_access/vitalstatsonline.htm) gives you access to ALL births in a
year.
(a) Open this website and follow the
Downloadable Data Files link for Birth. How large is the zip file for the U.S.
data?
(b) Open the User’s Guide for 2020. How many births were recorded in the U.S. in 2020?
(c) According to this documentation
file, how many total “positions” are there in the data file? What information
is in the first 8 positions? What
information is in the next 4 positions?
How would you determine the birth month?
If you were to download this file, it will
take 1-2 minutes.
(d) Unzip the contents of the .zip file.
·
Right
click on the .zip file and select Extract All.
The unzipped file is 5GB! This can be
problematic to work with, especially if you were most interested in just a few
of the variables. It can also be
cumbersome to deal with all of the rows, especially if
you just want to see general patterns.
Most packages can now work with datafiles this large, but we are still
pushing the limit a bit. JMP and R still take several minutes to read in the
data, and each row is still a single string (not separated into columns or tabs).One
approach is to preprocess the data a bit first.
To read a “fixed width” data file, one
approach is to use the “awk” language
·
Download
the awk.exe file and the GetMonth.bat file into the same folder as the data
file.
·
Right
click on getMonth.bat and select Edit (should open NotePad)
o
The
goal of this program is to extract all the records for one month of the
year. This happens by extracting out the
rows where strings in positions 13 and 14 are equal to 01. The results are
stored in Jan.txt.
(e) What does substr($0,
13,2) ==\"01\" do? (Hint: I
just told you, so what do the 2 and 01 do?)
·
To
run this batch file, double click on it (getMonth.bat).
(A “Command Prompt window” should open. You may have to give your ok.) When it’s done, you should see a new file: Jan.txt (Press any key to continue.)
This is still a pretty large file and each row is one long string (“fixed width”). If you open this file in JMP or Excel, they at
least try to separate the columns. You can use read.fwf in
R to specify about the positions/variable, but you need to know all of the widths, e.g.,
w <- c(7, 11, 10, 5, 22, 10, 9, 9, 9, 9, 12)
read.fwf("myfile.dat", w)
(f) Where is information on how much
weight the mother gained during pregnancy?
(g) How many positions are between birth
month and mom’s weight gain? Hint:
What is the “length” between these two variables?
(h) Where is information on the “5 minute Apgar score”?
What is “apgar score”?
(i) How many positions between mom
weight gain and 5 min Apgar score?
(j) Find OEGest_R3. Where is it and what is it? How many
positions between?
(k) Which positions contain the baby birth weight? What are the
measurement units?
So now we want to extract data on birth weight, whether the baby
was full term (gestation over 36 weeks), the 5 minute apgar score (an immediate measure of the infant’s health), and the amount of weight
gained by the mother during pregnancy (in lbs).
(k) Confirm the following widths
14
to include positions 1-14
488
to include positions 15-502
1
to include OEGest_R3 (503)
4
to include birthweight (504-507)
823
to include 508-1330
> w = c(14, -488, 1, 4, -823)
> JanSub = read.fwf("C:/Users/bchance/Downloads/Nat2020us/Jan.txt", w)
> names(JanSub) = c("date", "OEG", "birthweight")
This should take 2-3 minutes. Let’s
extract less data instead.
·
In
the text editor of the getMonth.bat file, type rem in front of the existing line.
·
Copy/type
the following as the next line (before pause)
awk "{ if (substr($0, 13,2) ==\"01\") print (substr($0, 504, 4) \",\" substr($0,
503, 1) ) }" < Nat2020PublicUS.c20210506.r20210812.txt
> Jan.txt
pause
This prints out positions 503, comma
separated, positions 504-507. Run the batch file.
(l) Read these data into R or JMP (use a
file open dialog and navigate to this folder)
RStudio:
Import Dataset > From Text
(base)
> names(Jan) = c("birthweight”, "OEGest")
JMP: File > Open (probably then
“Data, using best guess), and name the columns
(m) Now extract the birth weights,
mother’s weight gain, Apgar score, and OEGest_R3 for your assigned month. For example, for February:
awk "{ if (substr($0, 13,2) ==\"02\") print (substr($0, 504, 4) \",\", substr($0,
304, 2) \",\" substr($0, 444, 2) \",\"
substr($0, 503, 1)
) }" < Nat2020PublicUS.txt > Feb2.csv