Reading .dta with value labels #74

@pdeffebach

As you know, Stata basically stores value-labeled data as two pieces: a vector of integers or doubles (not necessarily an ordered sequence starting at 1) and a Dict going from Int => String that maps each code to its label.

Accessing the string values, which we generally care the most about, is hard with ReadStat. You have to

  1. Use ReadStat, not StatFiles, to access the internal fields of the Stata file
  2. Construct the DataFrame from the data and header fields
  3. Use the value_label_dict field to perform the replacement
  4. Use get on the DataValue elements of the array

This is not the most user friendly thing.
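
The replacement step in the list above can be sketched in plain Julia. This is only a rough illustration, not ReadStat's API: `codes`, `labels`, and `label_or_code` are hypothetical names standing in for one column, its value-label dictionary, and a lookup helper, and the fallback to the code's string form for unlabeled values is an assumption about what a user would want.

```julia
# Hypothetical stand-ins for one Stata column and its value-label dictionary;
# in practice these would come from the fields of the file read by ReadStat.
codes = Union{Int,Missing}[1, 5, missing, 9, 2]
labels = Dict(1 => "low", 5 => "medium", 9 => "high")

# Look a code up in the dictionary. Missing values stay missing, and codes
# with no label fall back to their string form (here, the code 2).
label_or_code(c) = ismissing(c) ? missing : get(labels, c, string(c))

# Broadcast the lookup over the whole column.
strings = label_or_code.(codes)
```

Note that the integer codes are gone from `strings`, which is exactly the loss of information discussed next.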

There isn't a great solution for this in Julia, as we don't have a CategoricalArray equivalent where the base dict maps arbitrary types to strings. So converting to a CategoricalArray will drop the underlying integers, which are useful to keep for interoperability.
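
To make the idea concrete, here is a minimal sketch of what such a type could look like. `LabeledVector` and `valuelabels` are hypothetical names, not part of ReadStat, StatFiles, or CategoricalArrays; the sketch only shows that the raw codes and the labels can travel together.

```julia
# Hypothetical type, not part of any existing package: a vector that keeps
# the raw codes and carries the code => label dictionary alongside them.
struct LabeledVector{T} <: AbstractVector{T}
    codes::Vector{T}
    labels::Dict{T,String}
end

# Minimal AbstractVector interface: indexing returns the raw code.
Base.size(v::LabeledVector) = size(v.codes)
Base.IndexStyle(::Type{<:LabeledVector}) = IndexLinear()
Base.getindex(v::LabeledVector, i::Int) = v.codes[i]

# Return the label for each element, falling back to the code as a string.
valuelabels(v::LabeledVector) = [get(v.labels, c, string(c)) for c in v.codes]

v = LabeledVector([1.0, 2.5, 1.0], Dict(1.0 => "no", 2.5 => "yes"))
total = sum(v)              # arithmetic still works on the underlying codes
display_strings = valuelabels(v)  # strings for display
```

Because the element type is a parameter, the codes can be integers or doubles, and nothing requires them to be a sequence starting at 1.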

haven in R recently changed how this is handled, using the <dbl+lbl> vector type, though working with it is a bit of a pain, see here.

I can email a dataset to someone, along with an MWE, for more information.
