Wednesday, March 30, 2016

Labels for ACS PUMS CSV download - Data Ferrett to the rescue

The US Census Bureau conducts the American Community Survey (ACS) annually. One of the data products made available to the public is the Public Use Microdata Sample (PUMS), which researchers can use to derive statistical results.

The PUMS data are made available in two formats: SAS and CSV. The SAS file format is proprietary, but it contains the long text that gives meaning to the shortened mnemonics used for variables and categorical values. The CSV file format, on the other hand, is supported by almost all software but does not contain this descriptive text. The US Census Bureau does provide documents in PDF or text format that describe the mnemonics used. However, PDF and plain text aren't the easiest formats for statistical software to consume.

Personally, I believe XML is the ideal format for storing the descriptions of the mnemonics, and I have suggested this approach to the ACS help group. In the meantime, I discovered that the Data Ferrett application provided by the Census Bureau can serve as an alternative source for such an XML file. By selecting all variables in Data Ferrett and saving the session to a local file, you obtain an XML file that contains both the mnemonics and their descriptions.

Using XSLT, you can then easily generate statements that, when run by statistical software, assign the descriptions to the mnemonics.
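As a rough illustration of that idea, here is a minimal sketch that uses Python's xml.etree.ElementTree in place of XSLT. The XML layout (the `variables`, `variable`, and `value` elements and their `name`, `code`, and `label` attributes) is purely hypothetical, since the actual structure of a Data Ferrett session file is not shown here, and the Stata-style `label` statements are just one possible output; adjust both to the real file and to your own statistical software.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout for illustration only; a real Data Ferrett
# session file will use different element and attribute names.
SAMPLE_XML = """\
<variables>
  <variable name="SEX" label="Sex">
    <value code="1" label="Male"/>
    <value code="2" label="Female"/>
  </variable>
  <variable name="AGEP" label="Age"/>
</variables>
"""


def quote(text):
    """Wrap text in double quotes, swapping inner double quotes for single ones."""
    return '"' + text.replace('"', "'") + '"'


def label_statements(xml_text):
    """Return Stata-style labelling statements for every variable in the XML."""
    root = ET.fromstring(xml_text)
    lines = []
    for var in root.iter("variable"):
        name = var.get("name")
        # Describe the variable itself.
        lines.append(f"label variable {name} {quote(var.get('label', ''))}")
        # Describe its categorical values, if it has any.
        values = var.findall("value")
        if values:
            pairs = " ".join(
                f"{v.get('code')} {quote(v.get('label', ''))}" for v in values
            )
            lines.append(f"label define {name}_lbl {pairs}")
            lines.append(f"label values {name} {name}_lbl")
    return lines


if __name__ == "__main__":
    print("\n".join(label_statements(SAMPLE_XML)))
```

The same traversal maps naturally onto an XSLT stylesheet: one template per variable element, emitting text output instead of XML.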

* A side note: at this point, the XML file generated by Data Ferrett isn't perfect. The & sign isn't correctly encoded according to the XML standard, and some XSLT software may complain about the file. This, however, can easily be corrected by a search and replace on the & sign.
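A plain search and replace works if the file contains no correctly encoded entities, but it would double-encode any that are already there. A slightly safer sketch, assuming the only problem is bare & characters, uses a regular expression that skips anything already shaped like an entity reference (the function name and the commented file names are made up for illustration):

```python
import re

# Match an ampersand that does NOT start something shaped like an entity
# reference: a named one (&amp;), a decimal one (&#38;) or a hex one (&#x26;).
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text):
    """Replace every bare '&' with '&amp;', leaving existing entities alone."""
    return BARE_AMPERSAND.sub("&amp;", text)


if __name__ == "__main__":
    # Hypothetical file names; point these at your own saved session file.
    # with open("session.xml", encoding="utf-8") as src:
    #     fixed = escape_bare_ampersands(src.read())
    # with open("session_fixed.xml", "w", encoding="utf-8") as dst:
    #     dst.write(fixed)
    print(escape_bare_ampersands('<value label="Food & Drink"/>'))
```

Text that merely looks like an entity (an ampersand followed by letters and a semicolon) is left untouched, so it is worth checking the result with an XML parser afterwards.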