Frequently Asked Questions
* For any questions relating to Individual and/or Organisational Licences, please refer to the DSS Fact Sheets.
Using the Data
- How do I navigate through all the documentation in HILDA?
- What is the distinction between "employee of own business" and "employer/self-employed"?
- Which weight should I use?
- Should I weight an unbalanced panel?
- What weight should I use if I pool sample across waves?
- How do I match people across waves?
- How do I match people within households?
- How do I match couples together?
- How do I match children to their parents?
- Why are some weights zero?
- How do I reference HILDA?
- How do I calculate if someone is retired in wave 4?
1. How do I navigate through all the documentation in HILDA?
The online HILDA User Manual is the best place to start. It covers: missing data conventions; the derived variables; how to match the wave data files together and create longitudinal files; income imputation; expenditure imputation; industry and occupation variables in both Australian and derived international coding schemes; the data documentation; data quality issues; how to use the weights; and the survey design and data collection procedures. It also answers many other frequently asked questions.
The Documentation zip file contains:
a. Coding frameworks (pdfs) for all variables, plus a cross wave index, a master file coding framework and a longitudinal weights file coding framework.
b. Marked-up questionnaires and showcards (pdfs) showing the associated variable names, excluding derived and history variables.
c. Frequencies for each wave. String variables (ids and timestamps) are usually excluded from the frequencies.
To find variable names quickly, use the manual in conjunction with the cross-wave index (Cross Wave Index i90c.pdf). The index can be searched by question number, keyword or variable name (excluding the first-character wave identifier). It shows which waves a variable is available in and its source questionnaire (or DV: or History: for derived variables).
2. What is the distinction between "employee of own business" and "employer/self-employed"?
The HILDA Survey mostly adopts standard ABS definitions of labour market variables. We are, however, not comfortable with the ABS treatment of the self-employed.
To quote from ABS, Labour Statistics: Concepts, Sources and Methods, Aug 2006 (cat.6102.0.55.001), their definition of employee is "a person who works for a public or private employer and receives remuneration in wages, salary, a retainer fee from their employer while working on a commission basis, tips, piece-rates, or payment in kind; or a person who operates his or her own incorporated enterprise with or without hiring employees".
In other words, their definition of employee includes owner managers who operate their own incorporated businesses (they are treated as "employees of their own business"). In contrast, a person who operates their own unincorporated business is treated as an "own account worker" (i.e., self-employed).
We believe that for many research purposes this distinction is misleading, so in our data release we provide all the information researchers need to construct their own definitions of employees and the self-employed. If you wish to adopt the ABS definition of "employee", take the variable _esempst and combine the two groups "employee" (1) and "employee of own business" (2). (Alternatively, you can simply use the variable _es, which is a derived variable that reproduces the ABS definition of employment status.)
Whether you combine "employee of own business" and "employer/self-employed" into one group depends on your research question. If you wish to conform to ABS definitions you would never combine them (you would combine "employee" and "employee of own business"). In Mark Wooden's own research on labour market behaviour, for example, he almost always discards the ABS definition and combines "employee of own business" with the "employer/self-employed" group.
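The two groupings described above can be sketched in a few lines of Python. This is only an illustration: codes 1 and 2 come from the text above, but the code used here for "employer/self-employed" (3) and the function names are assumptions that should be checked against the coding framework.

```python
# Codes 1 and 2 are taken from the FAQ text. Code 3 is an ASSUMPTION
# made for this sketch; check the coding framework for _esempst.
EMPLOYEE = 1
EMPLOYEE_OF_OWN_BUSINESS = 2
EMPLOYER_SELF_EMPLOYED = 3  # assumed code


def abs_status(esempst):
    """ABS-style grouping: employees of their own business count as employees."""
    if esempst in (EMPLOYEE, EMPLOYEE_OF_OWN_BUSINESS):
        return "employee"
    if esempst == EMPLOYER_SELF_EMPLOYED:
        return "self-employed"
    return None  # any other or missing code is left unclassified


def alternative_status(esempst):
    """Alternative grouping: employees of their own business join the self-employed."""
    if esempst == EMPLOYEE:
        return "employee"
    if esempst in (EMPLOYEE_OF_OWN_BUSINESS, EMPLOYER_SELF_EMPLOYED):
        return "self-employed"
    return None
```

The only difference between the two functions is where code 2 is placed, which is exactly the choice your research question has to settle.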
3. Which weight should I use?
You use weights to make inferences from the sample to the population. Which weight you use depends on the question you are answering. The HILDA User Manual provides some guidance on which weight to use in which circumstances.
4. Should I weight an unbalanced panel?
Maybe. When you construct an unbalanced panel of responding persons, you take all of the responding persons from each wave and stack them into a long file that has one record per person per wave. The weight that could be used to weight this sample is the cross-sectional responding person weight from each wave. That is, in their wave 1 observation the person would be weighted by their wave 1 cross-sectional responding person weight, their wave 2 observation would be weighted by their wave 2 cross-sectional responding person weight, and so on. Similarly, if you are constructing an unbalanced panel of enumerated persons, then you could use the cross-sectional enumerated person weight.
If you pool, say, 5 waves of data together, the sum of the weights will be around 100 million (that is, 5 times the average population size between 2001 and 2005). Therefore, you may wish to rescale the weights by dividing by the number of waves that you have included in the unbalanced panel.
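As a rough sketch of the stacking and rescaling just described, the Python below divides each cross-sectional weight by the number of waves in the pooled file. The records, identifiers and weight values are made up for illustration, and the function name is an assumption.

```python
# Hypothetical stacked file: one record per person per wave, holding
# (xwaveid, wave number, cross-sectional responding person weight for that wave).
stacked = [
    ("0100001", 1, 1200.0),
    ("0100002", 1, 800.0),
    ("0100001", 2, 1300.0),
    ("0100003", 2, 700.0),
]


def rescale_pooled_weights(records):
    """Divide every weight by the number of distinct waves in the pooled file."""
    n_waves = len({wave for _, wave, _ in records})
    return [(pid, wave, weight / n_waves) for pid, wave, weight in records]


rescaled = rescale_pooled_weights(stacked)
total_before = sum(weight for _, _, weight in stacked)
total_after = sum(weight for _, _, weight in rescaled)
```

With two waves pooled, the rescaled weights sum to half the original total, so the weighted sample again represents roughly one population rather than two.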
Whether weighting the sample in this way makes sense depends on the type of analysis you are doing on the unbalanced panel. For example:
- If your analysis is of uncommon events and you are effectively taking a pooled sample, then the weighting strategy suggested above should be fine.
- If your analysis requires at least two observations on the same individual, then you will be dropping those people who are only interviewed once. The cross-sectional weights will, therefore, not be appropriate (nor will the longitudinal weights).
5. What weight should I use if I pool sample across waves?
When you are analysing an uncommon event (for example, divorce) you can pool the sample across waves. As the sample is subject to attrition that is not random, you will need to weight your pooled sample.
If you have pooled responding persons across waves, you should use the cross-sectional responding person weight for the wave from which the case has been contributed.
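A minimal sketch of a weighted estimate on a pooled sample follows. Each record carries an event flag and the cross-sectional responding person weight for the wave it came from; the data and the function name are hypothetical.

```python
def weighted_rate(records):
    """Weighted share of records where the event occurred.

    records: iterable of (event_occurred, weight) pairs, where weight is the
    cross-sectional weight for the wave that contributed the record.
    """
    records = list(records)
    total_weight = sum(weight for _, weight in records)
    if total_weight == 0:
        raise ValueError("all weights are zero")
    event_weight = sum(weight for occurred, weight in records if occurred)
    return event_weight / total_weight


# Hypothetical pooled person-wave records: (event occurred this wave?, weight).
pooled = [
    (False, 1000.0),
    (True, 500.0),
    (False, 1500.0),
    (False, 1000.0),
]
rate = weighted_rate(pooled)
```

Because this is a ratio, rescaling every weight by the same constant (for example, dividing by the number of waves) does not change the estimate.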
6. How do I match people across waves?
Use the cross wave identifier xwaveid to match people across waves.
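As an illustration, the sketch below joins two wave files on xwaveid using plain Python dictionaries. The variable names aincome and bincome, the identifier values and the function name are hypothetical placeholders, not actual HILDA variables.

```python
# Hypothetical wave files keyed by xwaveid. The variable names "aincome"
# and "bincome" are placeholders invented for this sketch.
wave_a = {
    "0100001": {"aincome": 40000},
    "0100002": {"aincome": 52000},
}
wave_b = {
    "0100001": {"bincome": 42000},
    "0100003": {"bincome": 31000},
}


def match_across_waves(first, second):
    """Keep people present in both waves and combine their records."""
    common_ids = first.keys() & second.keys()
    return {xid: {**first[xid], **second[xid]} for xid in common_ids}


matched = match_across_waves(wave_a, wave_b)
```

This keeps only people observed in both waves (a balanced two-wave panel); an unbalanced panel would keep every identifier from either file instead.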
7. How do I match people within households?
People within the same household have the same household identifier _hhrhid (replace the underscore with the appropriate letter for the wave, where 'a' corresponds to wave 1, etc). The household identifier will change from wave to wave. You can only match people over time via their cross wave identifier xwaveid.
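The wave-letter convention and the household grouping can be sketched as follows. The helper names and the sample identifiers are assumptions made for illustration only.

```python
import string
from collections import defaultdict


def wave_var(wave, stem):
    """Build a wave-specific variable name, e.g. wave 1 and 'hhrhid' give 'ahhrhid'."""
    if not 1 <= wave <= 26:
        raise ValueError("wave must be between 1 and 26")
    return string.ascii_lowercase[wave - 1] + stem


def group_by_household(persons, wave):
    """Map each household identifier in the given wave to its members' xwaveids."""
    hh_var = wave_var(wave, "hhrhid")
    households = defaultdict(list)
    for person in persons:
        households[person[hh_var]].append(person["xwaveid"])
    return dict(households)


# Hypothetical wave 1 person records.
persons_wave1 = [
    {"xwaveid": "0100001", "ahhrhid": "10001"},
    {"xwaveid": "0100002", "ahhrhid": "10001"},
    {"xwaveid": "0100003", "ahhrhid": "10002"},
]
households = group_by_household(persons_wave1, wave=1)
```

Because the household identifier changes from wave to wave, this grouping has to be rebuilt for each wave separately.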
8. How do I match couples together?
People who are married or in a de facto relationship can be matched to their partner via either:
- _hhpxid, the partner's cross wave identifier; or
- _hhprtrid, the partner's two-digit person number, which can be concatenated to the end of the household identifier _hhrhid to create the partner's identifier for that wave.
A partner identifier is only available for partners living in the same household. Same sex couples will have a partner identifier.
Note: Replace the underscore with the appropriate letter for the wave, where 'a' corresponds to wave 1, etc.
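Both routes to a partner can be sketched in Python. The records below are hypothetical, a missing partner is represented by None as a simplification, and the function names are assumptions rather than anything supplied with the data.

```python
# Hypothetical wave 1 records; None stands in for "no partner in the household".
persons = [
    {"xwaveid": "0100001", "ahhrhid": "10001", "ahhpxid": "0100002", "ahhprtrid": "02"},
    {"xwaveid": "0100002", "ahhrhid": "10001", "ahhpxid": "0100001", "ahhprtrid": "01"},
    {"xwaveid": "0100003", "ahhrhid": "10002", "ahhpxid": None, "ahhprtrid": None},
]


def match_couples(records, wave_letter="a"):
    """Pair people via the partner's cross wave identifier (_hhpxid)."""
    by_id = {person["xwaveid"]: person for person in records}
    couples = set()
    for person in records:
        partner_id = person.get(wave_letter + "hhpxid")
        if partner_id in by_id:
            # Sort the pair so each couple is recorded only once.
            couples.add(tuple(sorted((person["xwaveid"], partner_id))))
    return sorted(couples)


def partner_wave_id(person, wave_letter="a"):
    """Concatenate the household identifier and the partner's two-digit person number."""
    number = person.get(wave_letter + "hhprtrid")
    if number is None:
        return None
    return f"{person[wave_letter + 'hhrhid']}{int(number):02d}"


couples = match_couples(persons)
```

Sorting each pair means a couple appears once, even though both partners point at each other.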
9. How do I match children to their parents?
A child can be matched to their mother or father via either:
- _hhfxid and _hhmxid, the father's and mother's cross wave identifiers; or
- _hhfid and _hhmid, the father's and mother's two-digit person numbers, which can be concatenated to the end of the household identifier _hhrhid to create the father's and mother's identifiers for that wave.
Mother and father identifiers are only available for people whose parent(s) live in the same household.
Note: Replace the underscore with the appropriate letter for the wave, where 'a' corresponds to wave 1, etc.
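A minimal sketch of linking children to co-resident parents through the cross wave identifiers follows. The records are hypothetical, None stands in for a parent who is not in the household, and the function name is an assumption.

```python
# Hypothetical wave 1 records; None means the parent is not in the household.
family = [
    {"xwaveid": "0100001", "ahhfxid": None, "ahhmxid": None},            # father
    {"xwaveid": "0100002", "ahhfxid": None, "ahhmxid": None},            # mother
    {"xwaveid": "0100004", "ahhfxid": "0100001", "ahhmxid": "0100002"},  # child
    {"xwaveid": "0100005", "ahhfxid": None, "ahhmxid": "0100002"},       # child
]


def link_parents(records, wave_letter="a"):
    """Return {child xwaveid: {'father': id or None, 'mother': id or None}}."""
    known_ids = {person["xwaveid"] for person in records}
    links = {}
    for person in records:
        father = person.get(wave_letter + "hhfxid")
        mother = person.get(wave_letter + "hhmxid")
        # Keep only parent identifiers that point at a record in the file.
        father = father if father in known_ids else None
        mother = mother if mother in known_ids else None
        if father or mother:
            links[person["xwaveid"]] = {"father": father, "mother": mother}
    return links


parent_links = link_parents(family)
```

People with neither parent in the household simply do not appear in the result, which mirrors the availability rule stated above.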
10. Why are some weights zero?
Zero weights can occur for two reasons.
- The HILDA sample in wave 1 excluded people living in institutions (such as hospitals and other health care institutions, military and police installations, correctional and penal institutions, convents and monasteries) and other non-private dwellings (such as hotels and motels). As a result, the HILDA sample is not representative of people living in these non-private dwellings. People who move into these dwellings after wave 1 are given zero cross-sectional weights, and zero longitudinal weights for the balanced panel starting in the wave they were in a non-private dwelling.
- The HILDA sample also excluded people living in remote and sparsely populated areas. Some of these areas are excluded from the population benchmarks provided by the Australian Bureau of Statistics (ABS), which are used in the weighting process. For Releases 1 to 4, the benchmarks only excluded remote and sparsely populated areas in the Northern Territory. The ABS then revised the areas considered remote and sparsely populated; they are now defined as the very remote parts of New South Wales, Queensland, South Australia, Western Australia and the Northern Territory, as determined by the Remoteness Area classification (that is, areas with a value greater than 10.53 on the Accessibility/Remoteness Index of Australia). As a result, from Release 5, a small number of sample members living in these areas are given zero cross-sectional weights and zero longitudinal weights.
11. How do I reference HILDA?
A requirement of the Deed of Licence or Deed of Confidentiality that you signed to obtain the HILDA data is that you MUST include the following paragraph in any written work that uses the HILDA data:
- This paper uses unit record data from the Household, Income and Labour Dynamics in Australia (HILDA) Survey. The HILDA Project was initiated and is funded by the Australian Government Department of Families, Housing, Community Services and Indigenous Affairs (FaHCSIA) and is managed by the Melbourne Institute of Applied Economic and Social Research (Melbourne Institute). The findings and views reported in this paper, however, are those of the author and should not be attributed to either FaHCSIA or the Melbourne Institute.
The following reference is also suggested if you wish to refer to the study design:
- Wooden, M. and Watson, N. (2007), The HILDA Survey and its Contribution to Economic and Social Research (So Far), The Economic Record, vol. 83, no. 261, pp. 208-231.
12. How do I calculate if someone is retired in wave 4?
There was an oversight in Wave 4, when questions on retirement status contained in the Wave 2 continuing person questionnaire were not reinstated. The questions were removed in Wave 3 because of a more comprehensive set of retirement-related questions, included as part of a retirement module. Removal of this retirement module for Wave 4 should have been accompanied by re-inclusion of the original retirement questions, but this was overlooked and not rectified until Wave 5. Retirement status in Wave 4 is problematic. You can define it solely based on age and labour force status, but to be consistent across waves you would need to apply the same criteria across all waves. The other alternative is to exclude Wave 4.
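If you take the age and labour force status route, the key point is that a single rule is applied to every wave. The sketch below shows one way to organise that. The rule itself (aged 55 or over and not in the labour force) is an arbitrary placeholder and not a HILDA definition, and the field and function names are hypothetical.

```python
# PLACEHOLDER threshold, chosen only for illustration. Pick your own
# criterion and justify it for your research question.
MIN_RETIREMENT_AGE = 55


def retired_proxy(age, in_labour_force, min_age=MIN_RETIREMENT_AGE):
    """Proxy flag built from age and labour force status only."""
    return age >= min_age and not in_labour_force


def flag_all_waves(person_waves, min_age=MIN_RETIREMENT_AGE):
    """Apply the same proxy rule to every person-wave record, including wave 4."""
    return [
        {**record, "retired_proxy": retired_proxy(record["age"], record["in_labour_force"], min_age)}
        for record in person_waves
    ]


# Hypothetical person-wave records for one person.
history = [
    {"xwaveid": "0100001", "wave": 3, "age": 63, "in_labour_force": True},
    {"xwaveid": "0100001", "wave": 4, "age": 64, "in_labour_force": False},
    {"xwaveid": "0100001", "wave": 5, "age": 65, "in_labour_force": False},
]
flagged = flag_all_waves(history)
```

Using the proxy in every wave, rather than only in Wave 4, avoids mixing questionnaire-based and proxy-based definitions in the same series.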