Notes for data users – Full dataset

Data Management

Important information for data users can be found on the data documentation page. That page describes the surveys, gives the responses for each of the items asked in the surveys (Data Books) and explains how variables have been derived (Data Dictionary and Data Dictionary Supplement). Click on the links to learn more. Please read the following before getting started:

Variable names, labels and formats for each cohort and survey can be accessed by clicking on the ‘Survey Variables’ option or directly by following this link.

You can weight for area of residence at Survey 1 (y1wtarea, m1wtarea, o1wtarea) in all crosstabs, frequencies and analyses to adjust for the initial deliberate oversampling in rural and remote areas. This is not required when running models that include area of residence.

Check the data map, the Data Dictionary and the Data Dictionary Supplement for further information about survey items and derived variables. They are available here.

Data must be downloaded and stored onto a secured environment as soon as it is received. Analysis undertaken must only be in accordance with the approved EOI. Changes to the nature of the analysis must be approved by the ALSWH Data Access Committee.

All publications must include the appropriate acknowledgments. More information can be found here.

How to read in a SAS dataset with formats already attached

Linking Administrative dataset to ALSWH

If your project includes linkage of datasets, use the ‘IDproj’ key variable to join the datasets.

You should have the list of women who opted out of the data linkage project(s). These women will not be in the linked data, and this should be taken into account in your analysis.

When merging your survey and administrative datasets, ensure that only consented women are included in the combined outcome file. Note: in some cases all participants are required for the analysis, but this should be confirmed by your ALSWH liaison.
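The merge described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the records, the `claims` field and the opt-out list are made up, and the sketch assumes the administrative data have already been reduced to one row per participant. Only the key name IDproj comes from these notes.

```python
# Hypothetical records; real files carry many more variables.
survey = [
    {"IDproj": 101, "y1area": "urban"},
    {"IDproj": 102, "y1area": "rural"},
    {"IDproj": 103, "y1area": "remote"},
]
# Assumption: one administrative row per participant (e.g. after aggregation).
admin = [
    {"IDproj": 101, "claims": 4},
    {"IDproj": 102, "claims": 1},
    {"IDproj": 103, "claims": 7},
]
opted_out = {102}  # hypothetical IDs taken from the opt-out list


def link_consented(survey_rows, admin_rows, opt_out_ids, key="IDproj"):
    """Inner-join the two sources on `key`, skipping opted-out participants."""
    admin_by_id = {row[key]: row for row in admin_rows}
    linked = []
    for row in survey_rows:
        pid = row[key]
        if pid in opt_out_ids or pid not in admin_by_id:
            continue  # not consented, or no administrative record to link
        linked.append({**row, **admin_by_id[pid]})
    return linked


linked = link_consented(survey, admin, opted_out)
```

Checking the number of linked records against the number of consented participants is a quick way to confirm that no opted-out woman has slipped into the combined file.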

Useful notes for data users linking the PBS and MBS data may be found in Tech Report 38 – December 2015 page 118 and Tech Report 39 page 71.

Medicare variable formats may be found on the ALSWH website at https://alswh.org.au/how-to-access-the-data/external-linked-datasets

Dummy PBS and MBS data are available for testing and development here. Information regarding these data is available here.

Statistical Analysis

Useful programming code

Reliable programming code to join multiple files may be found for SAS and Stata programmers at the following webpages:

Useful SAS code clearly explained by Wieczkowski, Michael J. Alternatives to Merging SAS Data Sets. But Be Careful. IMS HEALTH, Plymouth Meeting, PA (http://stats.idre.ucla.edu/wp-content/uploads/2016/02/bt150.pdf ). Also see http://support.sas.com/resources/papers/proceedings09/036-2009.pdf .

Stata code: http://stats.idre.ucla.edu/stata/modules/combining-data/

Included on our website is the stripping program to change variable names – making wide to long transformation easier: http://stats.idre.ucla.edu/sas/faq/how-can-i-reshape-pairs-data-long-to-wide/

Information on enduring conditions is in Tech Report #29 here https://alswh.org.au/2007_technical-report_29#page=95

(Datasets with these variables may be requested by following the link https://alswh.org.au/who-is-involved/staff)

Derived Variables

Questions related to the Dietary Questionnaires may be found here.

Be careful that you do not inappropriately analyse single items from a scale. For example, the 36 items in the SF-36 should not be considered as separate items, other than the first self-rated health item. The Data Dictionary Supplement has details about which scales have been included in the surveys.

Commonly used data variable cut-points may be found in the Data Dictionary Supplement or here for the following:

Physical activity in Report 21 page 104
Mental health cut-points for possible psychosocial distress in Report 16 pages 48 and 66
Notes regarding methods of standardising life events may be found in Report 31 page 106

Sampling scheme for the 1921-26, 1946-51, 1973-78 and 1989-95 cohorts

Selection of the sample 1973-78, 1946-51 and 1921-26 cohorts

The study sample was selected by Medicare Australia (previously known as the Health Insurance Commission) from three zones (urban, rural and remote) defined according to the Australian Standard Geographical Classification RRMA scheme. Urban includes Capital City and Other Metropolitan Centres; rural includes Large Rural Centres, Small Rural Centres and Other Rural Areas; and remote includes Remote Centres and Other Remote Areas.

The age groups sampled from the Medicare database in April 1996 were 18-22 years, 45-49 years and 70-74 years. By the time the invitations to participate were mailed later in 1996, some women at the upper limit of the age groups had had their birthday and were a year older. Hence, some women recruited were 23, 50 and 75 years old and so the cohort age ranges in the study are: 18-23, 45-50, and 70-75 years (although there are relatively fewer women in the oldest year of each cohort). The cohorts are now referred to by their years of birth but some study material may refer to them as ‘Young’, ‘Mid-aged’ and ‘Older’ and datasets use ‘y’, ‘m’, and ‘o’ (further information below).

Sampling from the population was random within each age group, except that women from rural and remote areas were selected in twice the proportions of the Australian population living in these areas. Women from capital cities and other metropolitan areas made up the balance of the samples.

A small number of women who were outside the cohort birth years (by a year or two) were also invited to participate, possibly due to errors in date of birth in the Medicare database. The survey data for these women have been retained. We recommend that when using the data, these women are either excluded or their age is set to the nearest valid age.
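The two options recommended above (excluding out-of-range women, or setting their age to the nearest valid age) might look like this in Python. It is a minimal sketch: the age ranges are the Survey 1 cohort ranges given earlier in this section, while the function names are hypothetical.

```python
# Cohort age ranges at Survey 1, as stated in this section.
VALID_AGE_RANGE = {"y": (18, 23), "m": (45, 50), "o": (70, 75)}


def is_in_range(age, cohort):
    """True if `age` lies inside the valid range for the cohort letter."""
    low, high = VALID_AGE_RANGE[cohort]
    return low <= age <= high


def nearest_valid_age(age, cohort):
    """Move an out-of-range age to the nearest bound of the cohort's range."""
    low, high = VALID_AGE_RANGE[cohort]
    return max(low, min(age, high))
```

For example, `nearest_valid_age(51, "m")` returns 50, while an in-range age is returned unchanged.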

Selection of the sample 1989-95 cohort

Please note that some variables in Surveys 1 and 2 of the 1989-95 cohort were renamed for consistency in April 2016.

See: Renaming of Variables in the Surveys 1 and 2 for the 1989-95 cohort.

Recruitment for the 1989-95 cohort differed from that for the other cohorts. A variety of recruitment strategies were used (see Report I, section 3). A brief summary is given here.

For inclusion in the 1989-95 cohort, respondents needed to:

meet the eligibility criteria of being female, aged 18 to 23 and having a Medicare number;
answer at least some survey questions; and
meet the requirements for data linkage.

A total of 17,567 women met the above inclusion criteria. To establish a pilot study group for the cohort, the first 498 young women that met the above criteria were removed from the main cohort. As a result, the pilot study group included all women recruited in October 2012 who were verified by the Department of Human Services. Of the remaining sample, 17,069 participants were verified by the Department of Human Services.

Some participants in this cohort were later found to be ineligible because their birth years were out of range, and they have been removed from the cohort. In April 2018, there were 17,010 participants in the cohort.

Calculation of the sample weights

1973-78, 1946-51 and 1921-26 cohorts
The women were selected based on their postcodes recorded by Medicare. The variable in the datasets called ‘inarea’ reflects the area from which the women were sampled (urban, rural, remote). However, by the time the survey was mailed, some women, particularly in the younger age group, had moved. The variable ‘y1area’ reflects their actual area of residence when completing the survey. The number of respondents who lived in urban, rural and remote areas at the time of completing the first survey in 1996 (wave 1 area) was used to create the sample weights for each age group for each area (urban, rural, remote), by comparing these numbers of respondents to 1996 Census figures. The sample weights appear in the datasets and are labelled y1wtarea, m1wtarea and o1wtarea.

1989-95 cohort
Sample weights were calculated for the 1989-95 cohort based on the women’s ages and areas of residence (urban, rural and remote). The 2011 Census was used as the best available measure of Australia’s population of women aged 18 to 23.

Weights for women in the sample of age $x$ (at baseline) residing in geographical region $z$:
$$ w(x, z) = \frac{P(x, z)/P}{N(x, z)/N} $$

Where $N$ is the total number of women in the sample and $N(x, z)$ is the number of women aged $x$ years residing in geographical region $z$ in the sample. Similarly, $P$ is the total number of women aged 18 to 23 in the Australian population, and $P(x, z)$ is the number of women in the Australian population aged $x$ years residing in geographical region $z$.
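As an illustration of this calculation, the sketch below computes the weight for each age-by-region cell as the population share of the cell divided by its sample share. The counts are invented for the example (they are not ALSWH or Census figures) and the function name is hypothetical.

```python
def sample_weights(sample_counts, population_counts):
    """Weight per (age, region) cell: population share divided by sample share."""
    n_total = sum(sample_counts.values())      # N: all women in the sample
    p_total = sum(population_counts.values())  # P: all women in the population
    weights = {}
    for cell, n_cell in sample_counts.items():
        p_cell = population_counts[cell]
        weights[cell] = (p_cell / p_total) / (n_cell / n_total)
    return weights


# Made-up counts for two ages and two regions.
sample = {(18, "urban"): 40, (18, "rural"): 60,
          (19, "urban"): 50, (19, "rural"): 50}
population = {(18, "urban"): 700, (18, "rural"): 300,
              (19, "urban"): 600, (19, "rural"): 400}
weights = sample_weights(sample, population)
```

In this toy example the oversampled rural cells receive weights below 1 and the urban cells weights above 1, and the weighted cell counts still add up to the sample size.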

Representativeness and attrition

These papers explain representativeness and attrition:

Lee C, Dobson AJ, Brown WJ, Bryson L, Byles J, Warner-Smith P, Young AF. (2005) Cohort Profile: The Australian Longitudinal Study on Women’s Health. International Journal of Epidemiology; 34: 987-991.
Young AF, Powers JR, Bell SL. Attrition in longitudinal studies: who do you lose? Australian and New Zealand Journal of Public Health. 2006 Aug;
Brilleman SL, Pachana NA, Dobson AJ. The impact of attrition on the representativeness of cohort studies of older people. BMC Medical Research Methodology. 2010 Aug;10.
Powers J, Loxton D. The Impact of Attrition in an 11-Year Prospective Longitudinal Study of Younger Women. Annals of Epidemiology 2010; 20(4):318-21.

For representativeness for the 1989-95 cohort see:

Health and wellbeing of women aged 18 to 23 in 2013 and 1996: Findings from the Australian Longitudinal Study on Women’s Health. Mishra G, Loxton D, Anderson A, Hockey R, Powers J, Brown W, Dobson A, Duffy L, Graves A, Harris M, Harris S, Lucke J, McLaughlin D, Mooney R, Pachana N, Pease S, Tavener M, Thomson C, Tooth L, Townsend N, Tuckerman R and Byles J. Report prepared for the Australian Government Department of Health, June 2014. (Section 4).

Longitudinal analysis

When doing longitudinal analyses with the cohorts beginning in 1996, remember to weight for area of residence at Survey 1 (y1wtarea, m1wtarea, o1wtarea) in all crosstabs, frequencies and analyses to adjust for the initial deliberate oversampling in rural and remote areas. This weighting may not be required in models that include a geographic area of residence variable. For information on geographic area of residence, see below in Notes about specific variables.
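A weighted frequency table of the kind described above can be sketched in a few lines of Python. The weight name y1wtarea comes from these notes; the `smoker` variable, the records and the weight values are invented for illustration, and a statistical package's own weighting facilities would normally be used instead.

```python
from collections import defaultdict


def weighted_proportions(records, variable, weight):
    """Return {category: weighted share} for `variable`, skipping missing values."""
    totals = defaultdict(float)
    for rec in records:
        value = rec.get(variable)
        if value is None:
            continue  # leave missing values out rather than recoding them
        totals[value] += rec[weight]
    grand_total = sum(totals.values())
    return {category: total / grand_total for category, total in totals.items()}


# Hypothetical records: urban women carry larger weights than oversampled rural women.
records = [
    {"smoker": "yes", "y1wtarea": 1.5},
    {"smoker": "no", "y1wtarea": 1.5},
    {"smoker": "no", "y1wtarea": 0.5},
    {"smoker": "no", "y1wtarea": 0.5},
]
props = weighted_proportions(records, "smoker", "y1wtarea")
```

Here the unweighted share of "yes" would be 1 in 4, whereas the weighted share is 1.5 out of a total weight of 4.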

Key Longitudinal Variable Datasets

Available for each of the four ALSWH cohorts, these datasets contain key longitudinal variables that have been harmonised across survey waves to save data users time by reducing duplication of work and programming errors.

There is one longitudinal KLV dataset per cohort. As of April 2023, the following survey waves are included: 1989-95 Waves 1-6, 1973-78 Waves 1-9, 1946-51 Waves 1-9, and 1921-26 Waves 1-6. The included variables are presented in the linked document. The raw survey variables used to derive each longitudinal variable can be identified by viewing the source code.

Missing data

Some participants completed a short survey instead of the full survey, which accounts for some missing data. This occurred in Survey 2 for the three original cohorts and Survey 3 for the 1921-26 and 1946-51 cohorts. The variable ‘**survey’ (where ** is the cohort and survey prefix) has the value 2 for a short survey and 1 otherwise. The type of survey completed is identified with variables such as y2survey for Survey 2 of the 1973-78 cohort.

Other known gaps in the data:

Survey 2 of the 1946-51 cohort: Q70 on income is missing the first category ($1-$119).
There are large amounts of missing data in some income questions.
Surveys 2, 3 and 4 of the 1946-51 cohort are missing the question about being admitted to hospital.
Survey 2 of the 1973-78 cohort is missing the question about ability to manage on income.
Survey 2 of the 1946-51 cohort: Q67 is unreliable because the instruction was incorrectly stated as “mark one only” rather than “mark all that apply”. Many participants realised that this was an error and answered the question as it should have been answered. Others may not have done so.

The first survey of the 1989-95 cohort has 167 records whose data are almost all missing. These records are identified by the allmissing variable, which has the value 1 for those records and 0 otherwise. They represent eligible respondents who completed the first survey but whose data were unfortunately lost. They are kept in the dataset so that the first wave’s dataset contains the whole sample.
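Where an analysis should not include the almost-empty records, they can be set aside using the allmissing flag. The sketch below assumes each record is a dictionary holding that flag; the function name and the sample records are hypothetical, and whether to drop these records depends on the analysis.

```python
def drop_allmissing(records):
    """Keep only records that are not flagged as almost entirely missing."""
    return [rec for rec in records if rec.get("allmissing") != 1]


# Hypothetical records using the IDalias and allmissing variables.
records = [
    {"IDalias": 1, "allmissing": 0},
    {"IDalias": 2, "allmissing": 1},
    {"IDalias": 3, "allmissing": 0},
]
analysis_set = drop_allmissing(records)
n_dropped = len(records) - len(analysis_set)
```

Reporting `n_dropped` alongside the results keeps the denominator transparent.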

Notes about data files

The quantitative survey data are available as SAS, STATA and SPSS data files, or as tab delimited text files. The dataset files include almost all survey items as well as all derived and calculated variables.

Naming conventions for datasets

The analysis datasets without formats and labels attached are named WHAsurveycohortB

Where ‘survey’ is the survey wave number

Where ‘cohort’ is the three-letter cohort abbreviation:

yng (1973-78 cohort), mid (1946-51 cohort), old (1921-26 cohort), nyc (1989-95 cohort)

B = level B data (identifying information removed). For example, wha1yngB.txt is the text dataset for Survey 1 of the 1973-78 cohort.

The analysis datasets with formats and labels attached are named WsurveycohortBF

Where ‘survey’ is the survey wave number

Where ‘cohort’ is the one-letter cohort abbreviation:
y (1973-78 cohort), m (1946-51 cohort), o (1921-26 cohort), z (1989-95 cohort)

B = level B data (identifying information removed) and F refers to formats and labels attached. For example, w2mBf.sas is the SAS dataset with formats for Survey 2 of the 1946-51 cohort.
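The naming rules above can be captured in a small helper, sketched here in Python. The letter case follows the two examples given (wha1yngB and w2mBf), which is an assumption about how the files are actually cased; the function name is hypothetical and file extensions are left out.

```python
# Cohort -> (three-letter abbreviation, one-letter abbreviation), as listed above.
COHORT_CODES = {
    "1973-78": ("yng", "y"),
    "1946-51": ("mid", "m"),
    "1921-26": ("old", "o"),
    "1989-95": ("nyc", "z"),
}


def dataset_name(cohort, survey, formatted=False):
    """Build a level B dataset name for a cohort and survey wave number."""
    three_letter, one_letter = COHORT_CODES[cohort]
    if formatted:
        # Formats and labels attached: W + survey + cohort letter + BF
        return f"w{survey}{one_letter}Bf"
    # No formats or labels: WHA + survey + cohort abbreviation + B
    return f"wha{survey}{three_letter}B"
```

For instance, `dataset_name("1946-51", 2, formatted=True)` gives the name used in the example for Survey 2 of the 1946-51 cohort.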

Naming conventions for variables

The variables in the three original cohorts are named with a two-letter prefix, e.g. ‘m1’ that identifies the cohort and survey wave.

The letters are y (1973-78 cohort), m (1946-51 cohort), and o (1921-26 cohort)

The 1989-95 cohort, also referred to as the New Young Cohort, or NYC, has been allocated the one-letter abbreviation ‘z’ because it follows on from the first young cohort, which used ‘y’. However, the variable names in the 1989-95 cohort data do not use the prefixes that are used in the other cohorts.
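To illustrate the prefix convention, the following Python sketch splits a variable name from the three original cohorts into cohort, wave and stem. It assumes the prefix is exactly one lowercase cohort letter followed by one digit, as in ‘m1’; the function name is hypothetical. Names from the 1989-95 cohort carry no such prefix and are reported as unprefixed.

```python
import re

COHORT_BY_LETTER = {"y": "1973-78", "m": "1946-51", "o": "1921-26"}
# Assumption: one lowercase cohort letter, one wave digit, then the rest of the name.
PREFIX_PATTERN = re.compile(r"^([ymo])(\d)(.+)$")


def split_variable_name(name):
    """Return (cohort, wave, stem) for a prefixed name, or None if unprefixed."""
    match = PREFIX_PATTERN.match(name)
    if match is None:
        return None
    letter, wave, stem = match.groups()
    return COHORT_BY_LETTER[letter], int(wave), stem
```

Stripping the prefix in this way yields a common stem across waves, which can help when stacking wide data into long format.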

Renaming of Variables in the Surveys 1 and 2 for 1989-95 Cohort

Some variables in the first two waves of the 1989-95 cohort have been renamed to achieve consistency with the Data Dictionary and within all the 1989-95 surveys. This was done in April 2016. Some of the variable names in this cohort had become inconsistent with the Data Dictionary Index Numbers, which are the standard reference for variables, and also between the various waves of the surveys.

The variables in this cohort differ from the other cohorts’ variables in that they do not have paper questionnaire names. In the 1973-78 cohort, for example, y6q18 is the variable for the 18th question in the sixth wave of this (Young) cohort. The 1989-95 cohort data instead have variable names taken from the Data Dictionary Index Numbers. For example, the question ‘In general, would you say your health is …’ has Index Number SF36-001 and the variable is named SF36001. However, the first two surveys, as initially released, had some variables whose names did not come from the Data Dictionary Index Numbers, so they have been renamed to be consistent with the Data Dictionary.

Examples of inconsistent naming in Survey 1

Variable G1_HSRV201 changed to HSRV201

(“Where do you get information about your health? Other”)

ALCS032 changed to ALCS033

(“Have you ever drunk alcohol?”)

The first example above removes the prefix ‘G1_’ because it has no meaning in the analysis data set and removing it matches the variable with the Data Dictionary Index Number. The second example had a misleading name since the Data Dictionary Index Number was ALCS-033 but the variable was labelled ALCS032 – not a good name since the Data Dictionary Index number ALCS-032 is for another variable altogether.

After this variable name change, all the questionnaire variables are now named the same as their Data Dictionary Index name. This is not necessarily true for the derived variables, that is, those not on the questionnaire. The derived variables have names that are designed for easy reference. For example, the BMI variable is called ‘BMI’ on the data set, but its Index Number on the Data Dictionary is WTSH-088.

Amendment from the first version of this document

Note that the variables ending in ‘TEXT’ with a number, e.g. ‘TEXT2’, have all had their final number removed.
In Survey 2 some renamed variables in the Composite Abuse Scale were not included in an earlier version of this document. In these cases a name was both vacated by one variable and reused for another. For example, CASC128 is the new name for what was called CASC119, and the variable that was called CASC128 has been renamed CASC140. These variables are now all in the lists below.

Survey 1 1989-95 cohort renamed variables

This table has all the variables that were renamed in Survey 1 of the 1989-95 cohort.

Earlier Variable Name	New Variable Name
ALCS032	ALCS033
CASC119	CASC128
CASC120	CASC129
CASC100	CASC132
CASC123	CASC133
CASC124	CASC134
CASC125	CASC135
CASC117	CASC136
CASC106	CASC137
CASC095	CASC138
CASC118	CASC139
CPRB305	CPRB181
CPRB304	CPRB230
DEMO06__NO	DEMO062
DEMO06__A	DEMO063
DEMO06__TSI	DEMO064
G6_DEMO156	DEMO156
G6_DEMO157	DEMO157
G6_DEMO158	DEMO158
G6_DEMO159	DEMO159
G6_DEMO160	DEMO160
G6_DEMO161	DEMO161
DEMO152	DEMO168
G6_DEMO162	DEMO169
G6_DEMO162_TEXT	DEMO169_TEXT
EMPL087	EMPL093
EMPL088	EMPL094
G1_HSRV201	HSRV201
G1_HSRV202	HSRV202
G1_HSRV203	HSRV203
G1_HSRV203_TEXT2	HSRV203_TEXT
G1_HSRV204	HSRV204
G1_HSRV205	HSRV205
G1_HSRV206	HSRV206
G1_HSRV207	HSRV207
G1_HSRV208	HSRV208
G1_HSRV209	HSRV209
G1_HSRV210	HSRV210
G1_HSRV210_TEXT	HSRV210_TEXT
G1_HSRV211	HSRV211
REPH217	HSRV217
LFEVPGSK	LFEV283
LFEVUNSEX	LFEV284
LFEVBULLY	LFEV285
G2_MEDH375	MEDH375
G2_MEDH376	MEDH376
G2_MEDH377	MEDH377
G2_MEDH378	MEDH378
G2_MEDH379	MEDH379
G2_MEDH380	MEDH380
G2_MEDH381	MEDH381
G2_MEDH382	MEDH382
G2_MEDH383	MEDH383
G2_MEDH384	MEDH384
G2_MEDH385	MEDH385
G2_MEDH385_TEXT	MEDH385_TEXT
G2_MEDH386	MEDH386
G2_MEDH386_TEXT2	MEDH386_TEXT
G2_MEDH388	MEDH466
G3_MEDH389	MEDH389
G3_MEDH390	MEDH390
G3_MEDH391	MEDH391
G3_MEDH392	MEDH392
G3_MEDH393	MEDH393
G3_MEDH394	MEDH394
G3_MEDH395	MEDH395
G3_MEDH395_TEXT4	MEDH395_TEXT
G4_MEDH396	MEDH396
G4_MEDH397	MEDH397
G4_MEDH398	MEDH398
G4_MEDH388	MEDH388
G4_MEDH398_TEXT3	MEDH398_TEXT
G2_MEDH374	MEDH419
G2_MEDH387	MEDH420
G2_MEDH387_TEXT5	MEDH420_TEXT
G3_MEDH388	MEDH452
PWEL001	PWEL005
PWEL002	PWEL006
REPH215	REPH028
REPH218	REPH040
REPH220	REPH041
REPH226	REPH160
REPH234	REPH242
REPH236	REPH243
REPH228	REPH245
REPH230	REPH246
REPH232	REPH247
REPH216	REPH271
REPH219	REPH272
G5_REPH221	REPH273
G5_REPH222	REPH274
G5_REPH237	REPH275
G5_REPH238	REPH276
G5_REPH225	REPH277
G5_REPH225_TEXT	REPH277_TEXT
G5_REPH226	REPH278
REPH225	REPH279
REPH227	REPH280
REPH229	REPH281
REPH231	REPH282
REPH235	REPH283
SMOK034	SMOK038
SMOK035	SMOK039
K10001	KTEN001
K10002	KTEN002
K10003	KTEN003
K10004	KTEN004
K10005	KTEN005
K10006	KTEN006
K10007	KTEN007
K10008	KTEN008
K10009	KTEN009
K10010	KTEN010
DEMO155_TEXT4	DEMO155_TEXT

Survey 2 1989-95 cohort renamed variables

This table has all the variables that were renamed in Survey 2 of the 1989-95 cohort.

Earlier Variable Name	New Variable Name
CASC119	CASC128
CASC120	CASC129
CASC100	CASC132
CASC123	CASC133
CASC124	CASC134
CASC125	CASC135
CASC117	CASC136
CASC106	CASC137
CASC095	CASC138
CASC118	CASC139

CASC128	CASC140 Repeated
CASC129	CASC141 Repeated
CASC132	CASC142 Repeated
CASC133	CASC143 Repeated
CASC134	CASC144 Repeated
CASC135	CASC145 Repeated

CPRB305	CPRB181
CPRB304	CPRB230
DEMO06__NO	DEMO062
DEMO06__A	DEMO063
DEMO06__TSI	DEMO064
G6_DEMO156	DEMO156
G6_DEMO157	DEMO157
G6_DEMO158	DEMO158
G6_DEMO159	DEMO159
G6_DEMO160	DEMO160
G6_DEMO161	DEMO161
G7_EATS032	EATS032
G7_EATS033	EATS033
G7_EATS034	EATS034
G7_EATS040	EATS040
G7_EATS064	EATS064
G7_EATS065	EATS065
EMPL087	EMPL093
EMPL088	EMPL094
G1_HSRV201	HSRV201
G1_HSRV202	HSRV202
G1_HSRV203	HSRV203
G1_HSRV204	HSRV204
G1_HSRV205	HSRV205
G1_HSRV206	HSRV206
G1_HSRV207	HSRV207
G1_HSRV208	HSRV208
G1_HSRV209	HSRV209
G1_HSRV211	HSRV211
G1_HSRV213	HSRV213
G1_HSRV214	HSRV214
REPH217	HSRV217
G2_MEDH375	MEDH375
G2_MEDH376	MEDH376
G2_MEDH377	MEDH377
G2_MEDH378	MEDH378
G2_MEDH379	MEDH379
G2_MEDH380	MEDH380
G2_MEDH381	MEDH381
G2_MEDH382	MEDH382
G2_MEDH383	MEDH383
G2_MEDH384	MEDH384
G2_MEDH386	MEDH386
G2_MEDH388	MEDH466
G4_MEDH388	MEDH388
G3_MEDH389	MEDH389
G3_MEDH390	MEDH390
G3_MEDH391	MEDH391
G3_MEDH392	MEDH392
G3_MEDH394	MEDH394
G4_MEDH396	MEDH396
G4_MEDH397	MEDH397
G4_MEDH398	MEDH398
G4_MEDH413	MEDH413
G4_MEDH414	MEDH414
G4_MEDH415	MEDH415
G4_MEDH416	MEDH416
G3_MEDH417	MEDH417
G3_MEDH418	MEDH418
G2_MEDH374	MEDH419
G3_MEDH388	MEDH452
G4_MEDH421	MEDH454
G4_MEDH419	MEDH455
G4_MEDH420	MEDH456
PWEL001	PWEL005
PWEL002	PWEL006
REPH215	REPH028
REPH216	REPH271
REPH219	REPH272
G5_REPH221	REPH273
G5_REPH222	REPH274
G5_REPH237	REPH275
G5_REPH238	REPH276
G5_REPH225	REPH277
G5_REPH226	REPH278
SMOK018	SMOK029
SMOK038	SMOK043
G2_MEDH386_TEXT2	MEDH386_TEXT
G4_MEDH398_TEXT	MEDH398_TEXT
K10001	KTEN001
K10002	KTEN002
K10003	KTEN003
K10004	KTEN004
K10005	KTEN005
K10006	KTEN006
K10007	KTEN007
K10008	KTEN008
K10009	KTEN009
K10010	KTEN010
REPH244	REPH160
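Because some earlier names reappear as new names in the Survey 2 table (for example CASC119 becomes CASC128 while the earlier CASC128 becomes CASC140), code written against the earlier names needs to apply the mapping in a single pass. Renaming one variable at a time could overwrite a variable or rename it twice. The Python sketch below shows one way to do this; the mapping is a small excerpt of the table, and the record values and function name are made up.

```python
# Excerpt of the Survey 2 table above (earlier name -> new name).
RENAMES = {
    "CASC119": "CASC128",
    "CASC128": "CASC140",
    "G1_HSRV201": "HSRV201",
    "K10001": "KTEN001",
}


def rename_variables(record, renames):
    """Rename keys in one pass so that chained names are not renamed twice."""
    renamed = {renames.get(name, name): value for name, value in record.items()}
    if len(renamed) != len(record):
        raise ValueError("renaming produced duplicate variable names")
    return renamed


# Hypothetical record that still uses the earlier variable names.
old_record = {"IDalias": 7, "CASC119": 1, "CASC128": 0, "G1_HSRV201": 1, "K10001": 3}
new_record = rename_variables(old_record, RENAMES)
```

Each key is looked up against the original names only, so the value that started under CASC119 ends up under CASC128 and the value that started under CASC128 ends up under CASC140.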

Associated documentation files

Label files allocate meanings to variables. E.g., m1q1=‘How is your health now?’

Format files allocate meanings to the values of variables. E.g., 1=very good, 2=good etc.

Other data files

As well as the survey datasets, there are some supplementary datasets that have been created. Information about dates of deaths and withdrawal of participants is available in the participant status file.

The qualitative data recorded on the back page are also available for analyses. For further information, refer to the Qualitative processing protocols here.

Birth Events
There is a Birth Events dataset for the 1973-78 cohort and another one for the 1989-95 cohort. These were referred to as the ‘Child’ datasets before November 2022. These datasets contain information on birth deliveries, birth complications, and some information about the child. The data are from all the relevant survey waves for the cohort. They are structured so there is a record for each child. Each record is unique based on the mother’s ID, the date of birth, and the multiple birth count variable. The Birth Events datasets are updated with each new survey that has relevant information.

Medications datasets
The fourth survey of the 1921-26 cohort, the fifth and sixth of the 1946-51 cohort and the fifth and sixth of the 1973-78 cohort have data on self-reported medications the respondents are taking. These data are available on separate datasets. Where possible, the medications are given by name and ATC code.

Participant Status and Cause of Death files
For a detailed description of Participant Status and Cause of Death files please see section 8 of the Data Dictionary Supplement page.

Extra resources to support data analysis

The Data Dictionary is a Microsoft Access database that gives a detailed description of the questions used in the survey, their source and how they are used, as well as information on the derived and calculated variables. The Data Dictionary is constantly updated and is available here. (The table is over 1,000 pages long so do not try to print it).

The Data Dictionary Supplement is a series of documents that accompanies the Data Dictionary. The Data Dictionary Supplement contains information about scales and other measures used in the ALSWH surveys. Before using any summary or scale score included in an ALSWH dataset, the appropriate section of the Data Dictionary Supplement should be reviewed. The Data Dictionary and Data Dictionary Supplement are available here.

Check the survey data books if unsure about response frequencies. Electronic copies of the surveys and data books are available here.

Notes about analysing the data

In general, it is the responsibility of the analyst to become familiar with and carefully examine all data before proceeding with data analysis.

There are different naming conventions for survey items and derived items. IDalias is a unique de-identified participant number, present in all data files. This participant number can be used to merge data files across surveys. The survey questions and method used in the calculation of the derived variables are listed in the Data Dictionary. A few survey items at Survey 1 (birth date, country of birth, language spoken at home) were removed or aggregated into groups, as these were considered potentially able to make participants identifiable.

It is not recommended to arbitrarily replace missing values with the null value or any other value. Questions involving “mark all that apply” responses have been coded to 0 (no response) or 1 (yes response). In general, a “none of the above” response option was offered at the end of each set of “mark all that apply” questions. If responses to all sections of a specific question were missing, including the null option (“none of the above”), all responses were set to missing.
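The coding rule for “mark all that apply” questions can be expressed as a short Python sketch. It assumes each option, including “none of the above”, is held as True when the box was marked and None when it was left blank; the function name and option labels are hypothetical.

```python
def code_mark_all(responses):
    """Code one 'mark all that apply' question.

    `responses` maps each option (including 'none of the above') to True if
    the box was marked and to None if it was left blank.
    """
    if all(value is None for value in responses.values()):
        # Nothing marked at all, not even the null option: treat as missing.
        return {option: None for option in responses}
    return {option: 1 if value else 0 for option, value in responses.items()}
```

A respondent who marked only the first option is coded 1 for that option and 0 for the rest, whereas a respondent who marked nothing at all stays missing on every option.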

Notes about specific variables

Scales
Regarding items that form part of a scale, be careful that you do not inappropriately analyse single items from a scale. For example, the 36 items in the SF-36 should not be considered as separate items, other than the first self-rated health item. The Data Dictionary Supplement has details about which scales have been included in the surveys.

Counting symptoms
When looking at symptoms, the general rule is to count the number of women who had the symptom “sometimes” or “often”.
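A minimal Python sketch of this rule follows. It assumes responses are stored as text labels; apart from “sometimes” and “often”, the labels shown are invented, and the actual coding should be checked in the Data Books.

```python
# Responses that count as having the symptom, per the general rule above.
HAS_SYMPTOM = {"sometimes", "often"}


def count_with_symptom(responses):
    """Count respondents who reported the symptom 'sometimes' or 'often'."""
    return sum(1 for response in responses if response in HAS_SYMPTOM)


# Hypothetical responses; None stands for a missing answer.
responses = ["never", "rarely", "sometimes", "often", "often", None]
n_with_symptom = count_with_symptom(responses)
```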

Measure of depressive symptoms
The 10-item CES-D scale has an extra item at the end (“I felt terrific”) which is not included in the calculation of the CES-D score. The CES-D score is available in the datasets.

Menopause
The menopause status variable was calculated at each survey incorporating previous surveys’ information for the 1946-51 cohort during the time the women were experiencing menopause.

Measures of physical activity
The physical activity questions were changed after Survey 1. The new physical activity measures from Survey 2 are not comparable to Survey 1 in longitudinal analysis. Refer to the Data Dictionary Supplement for more information.

Summary variables
There are a few “standard” ways to collapse some of the main categorical variables we collect. For example, education (highest qualification) can be dichotomised as “school only” or “post school”, or grouped into three categories (“no formal qualifications”, “school qualifications”, “trade/tertiary qualifications”), and so on. Several variables have been created to summarise sets of items in the surveys (e.g. the illicit drug use items), and it is important that data analysts become familiar with these new variables (see the Data Dictionary Supplement).

Area of residence
The recommended measures are ARIA+, present on all surveys, and the Modified Monash Model (MMM), only present on surveys after 2012. ARIA+ is an index of accessibility/remoteness based on the distance to the nearest service centre. The scores range from 0 to 15 and the ABS has defined 5 categories for remoteness: major cities of Australia, inner regional Australia, outer regional Australia, remote, and very remote. Only a few of the study’s women live in very remote areas, so the fourth and fifth categories are often grouped together. ARIA+ and MMM are recommended over the previously used RRMA area classification. For more information see https://www.adelaide.edu.au/hugo-centre/news/list/2018/11/21/accessibilityremoteness-index-of-australia-plus-aria-2016. For the Modified Monash Model, see the Data Dictionary Supplement section.

ATSI status
Asked at Survey 1 in all age groups. This variable can be used in statistical models but results should not be reported separately by ATSI status in any reports. See Indigenous data policy for more information.

Short surveys

Shorter questionnaires have been used for some respondents in Women’s Health Australia: women who had not responded and were contacted late were offered a short survey to complete. The short surveys were only offered in the second surveys of the 1921-26, 1946-51 and 1973-78 cohorts, and the third survey of the 1946-51 cohort.

The short surveys only contained those questions that were considered particularly important. These questions are listed in the Short Surveys document. The researcher can identify which respondents completed the short survey because their ‘survey’ variable has the value 2 rather than 1. These records will have many variables that are entirely missing, namely the variables that were not included in the short survey.
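Separating short-survey respondents from full-survey respondents might be sketched as follows in Python. The flag name is built on the pattern of y2survey (Survey 2 of the 1973-78 cohort) mentioned earlier in these notes; the function names and records are hypothetical.

```python
def survey_flag_name(cohort_letter, wave):
    """Build the flag name, e.g. 'y2survey' for Survey 2 of the 1973-78 cohort."""
    return f"{cohort_letter}{wave}survey"


def split_by_survey_type(records, cohort_letter, wave):
    """Return (full, short) lists: flag value 1 is a full survey, 2 a short one."""
    flag = survey_flag_name(cohort_letter, wave)
    full = [rec for rec in records if rec.get(flag) == 1]
    short = [rec for rec in records if rec.get(flag) == 2]
    return full, short


# Hypothetical Survey 2 records for the 1973-78 cohort.
records = [
    {"IDalias": 1, "y2survey": 1},
    {"IDalias": 2, "y2survey": 2},
    {"IDalias": 3, "y2survey": 1},
]
full, short = split_by_survey_type(records, "y", 2)
```

Keeping the two groups separate makes it easier to see which missing values are structural (the question was never asked) rather than item non-response.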

Resources to help you get started

This link has good examples of data analysis using SAS, STATA, R and SPSS https://stats.idre.ucla.edu/other/dae/

More information

For more information about using study data and applying to the Data Access Committee for access to the data please refer to the how to access the data page.
