India’s first nationally representative time use survey (TUS) was conducted by the National Statistical Office (NSO) from January 2019 to December 2019. Summary findings and notes on method are included in a report issued by the NSO, and the unit-level data was released in September 2020. This note is meant to help researchers interested in using this data set get a basic understanding of working with the unit-level data in STATA. It collates notes based on our experience working with the data set, including reading the data, merging files, and converting the time variables. We also include some STATA code that we hope will be helpful to researchers.
The basics: the number of complete records
1. The NSS TUS is provided in 5 files. Files 1 and 3 contain only household-level data, while files 2, 4 and 5 contain person-level data. File 2 contains demographic data at the individual level, file 4 contains some preliminary data on the time use survey, and file 5 contains the actual time spent per recorded activity.
2. File 2 contains all persons, including children under 6. File 5, containing the time use data, only covers persons aged 6 years and above. Thus there are 71,494 individuals under age 6 for whom we have demographic details, but for whom all the time use data is missing.
3. An additional 1,951 individuals aged 6 and above also have missing time use data within file 5, even before any merging. Once again, we have demographic details but all time use data is missing.
4. Thus in the final merged file we have time use data for a total of 445,299 individuals and 9,436,777 records (this is the number of records that NSS specifies for file 5 with time use data, so we know that we did not mis-read the data). The remaining 73,445 persons (71,494 + 1,951) have missing time use data. This means that all file 4 and 5 variables, including activity number, show up as missing for these 73,445 individuals in the fully merged file.
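The counts in the list above can be checked on the merged file along the following lines. This is only a minimal sketch on toy data: the variable names person_id, age and activity_no are our own assumptions, not names from the NSS layout, so substitute whatever you called them when reading in the files.

```stata
// Toy stand-in for the fully merged file (all variable names are hypothetical)
clear
input str6 person_id age activity_no
"H1P1" 34 1
"H1P1" 34 2
"H1P1" 34 3
"H1P2"  3 .
"H2P1" 61 .
"H2P2" 12 1
"H2P2" 12 2
end

// Flag exactly one record per individual
egen byte one_per_person = tag(person_id)

// Individuals with time use data: each has one record with activity number 1
count if activity_no == 1
local n_with_tus = r(N)

// Individuals without time use data: activity number is missing
count if one_per_person & missing(activity_no)
local n_without_tus = r(N)

// Split the second group into under-6 and 6-and-above
count if one_per_person & missing(activity_no) & age < 6
local n_under6 = r(N)
count if one_per_person & missing(activity_no) & age >= 6 & !missing(age)
local n_6plus = r(N)

display "with time use data: `n_with_tus'"
display "without: `n_without_tus' (under 6: `n_under6'; 6 and above: `n_6plus')"
```

Run on the real merged file (without the toy input block), the displayed counts should match the figures given in the list above, and _N should equal the total number of rows in the merged file.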
The basics: reading in the data
1. The multiplier data is the same across all 5 files.
2. There is an error in the original NSS TUS file 5 whereby NSC and MLT are read into the same variable as per the layout file. Either separate NSC and MLT before reading in the file, or use the already read-in file 5 that Ashoka University has very helpfully made available.
3. Another way of going about this: if you have downloaded NSS TUS file 5 yourself, then after merging it with file 2, drop the multipliers that came from file 5 and use the multipliers of file 2.
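As an illustration of the last point, here is a minimal sketch of merging file 2 with file 5 while keeping only file 2's multiplier. The toy data and the names hhid, person_no and mlt are assumptions made for this example; the real key variables depend on how you read in the files. Here the file 5 multiplier is dropped just before the merge, which has the same effect as dropping it afterwards.

```stata
// Toy stand-in for file 2: one row per person (names are hypothetical)
clear
input str4 hhid person_no age mlt
"H1" 1 34 1500
"H1" 2  3 1500
"H2" 1 12 2200
end
tempfile file2
save `file2'

// Toy stand-in for file 5: one row per activity, with an unreliable multiplier
clear
input str4 hhid person_no activity_no mlt
"H1" 1 1 99
"H1" 1 2 99
"H2" 1 1 99
end
// Drop file 5's multiplier so that file 2's is the one that survives the merge
drop mlt
tempfile file5
save `file5'

// Each person in file 2 matches many activity rows in file 5
use `file2', clear
merge 1:m hhid person_no using `file5'

// _merge==1: persons with no time use records; _merge==3: matched records
tabulate _merge
```

On the real files, a nonzero count for _merge==2 (activity rows with no matching person) would suggest that the key variables were not read in correctly.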
Notes on using the data (all code included is for STATA)
1. Accounting for multiple records per respondent
For the 445,299 individuals with complete time use data, note that there will be multiple records per individual, each record corresponding to a different recorded activity by that person. Thus individual 1 may have 20 records: all 20 records will repeat the same demographic information, but the time use data will vary because each record represents a new activity with a different activity number, activity code etc.
In the fully merged file, if you are looking to analyse any data by individual, and you wish to avoid multiple records for a single individual, you must use activity number==1 as a condition to ensure only a single record for each individual. Note, however, that setting activity number==1 will leave out all individuals with missing time use data.
To give you an example of the implications of this particular characteristic of the dataset, here is what happened when trying to compute the total number of children under 5 in each household.
This required grouping the records by hhid so as to count every household member under age 5. Grouping by hhid alone would count each person several times (because of the multiple activity records per person). On the other hand, restricting to activity number==1 excluded all children under 5, for whom data on time use, and thus data on activity number, is missing.
One solution for cases where you want to count/analyse children is to create a new variable called activity_counter = 0 if (activity_no==1 | activity_no==.).
You can then condition on activity_counter==0 rather than on activity number==1, which keeps exactly one record per person, and successfully implement your code for counting the number of children under 5.
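A minimal sketch of the activity_counter approach, on toy data. The variable names hhid, person_id, age and activity_no, and the output name n_children_u5, are assumptions made for illustration.

```stata
// Toy merged file: H1 has two children under 5, both without time use records
clear
input str4 hhid str6 person_id age activity_no
"H1" "H1P1" 30 1
"H1" "H1P1" 30 2
"H1" "H1P2"  2 .
"H1" "H1P3"  4 .
"H2" "H2P1" 45 1
"H2" "H2P1" 45 2
"H2" "H2P2"  7 1
end

// Exactly one record per person gets activity_counter == 0
gen byte activity_counter = 0 if activity_no == 1 | missing(activity_no)

// Count children under 5 in each household, using one record per person
bysort hhid: egen n_children_u5 = total(age < 5 & activity_counter == 0)

list hhid person_id age activity_no n_children_u5, sepby(hhid)
```

The household total is written to every record of the household, so it can be used later alongside the activity-level data.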
2. Conversion of time variables, and using time data.
The variables ‘time from’ and ‘time to’ are in string format. We can go about calculating the time spent on a particular activity in two possible ways. One is calculating the time spent in hh:mm format (using double and format commands) and another way is calculating the time spent in hour format only. For the former, if time spent on one activity is half an hour, it will be read as 00:30 and for the latter as 0.5.
Here is some sample code that reads the hh:mm strings as clock times and computes the length of each activity in minutes:
//Calculating activity length for each activity (note that activity times are string variables and need to be read as time lengths)
gen double activitytime_from = clock(time_from, "hm")
gen double activitytime_to = clock(time_to, "hm")
gen length_timeslot=((activitytime_to - activitytime_from))/60000 if activitytime_to>=activitytime_from
// slots that cross midnight: 86340000 ms is 23:59, and the +1 adds back the final minute of the day
replace length_timeslot=((((86340000- activitytime_from) + activitytime_to))/60000)+1 if activitytime_to<activitytime_from
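For the hours-only alternative mentioned above, one possible sketch is the following. It assumes that time_from and time_to are always five-character strings of the form "HH:MM"; the names from_hr, to_hr and length_hours are our own.

```stata
// Toy activity slots, including one that crosses midnight
clear
input str5 time_from str5 time_to
"16:00" "16:30"
"23:30" "00:30"
"04:00" "05:00"
end

// Convert "HH:MM" strings to hours since midnight
gen double from_hr = real(substr(time_from, 1, 2)) + real(substr(time_from, 4, 2)) / 60
gen double to_hr   = real(substr(time_to, 1, 2))   + real(substr(time_to, 4, 2)) / 60

// Slot length in hours; add 24 when the slot runs past midnight
gen double length_hours = to_hr - from_hr
replace length_hours = length_hours + 24 if length_hours < 0

list
```

Here a half-hour activity shows up as 0.5. If the strings in your file are not zero-padded, the substr() positions will need adjusting.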
If the analysis requires the time use distribution of 24 hours for any category of persons, drop the minor activities and take the major activities only. Note, however, that classifications of ‘major’ and ‘minor’ are based upon self-reporting by the respondent and might thus be subject to social biases. Thus, as feminists would point out, activities considered to be “women’s work” might be reported as minor, even if they require considerable effort/skill and have a high use value.
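A minimal sketch of restricting to major activities and checking that each person's day adds up to 24 hours. The variable name major_activity and its coding (1 = major, 2 = minor) are assumptions made for this toy example, so check the layout file for the actual variable and codes.

```stata
// Toy records for one person: activities 2 and 3 share the same 60-minute slot
clear
input str6 person_id activity_no major_activity length_timeslot
"H1P1" 1 1 600
"H1P1" 2 1 60
"H1P1" 3 2 60
"H1P1" 4 1 780
end

// Keep only the activities reported as major (assumed coding: 1 = major)
keep if major_activity == 1

// Total minutes per person; should be 1440 if each slot has one major activity
bysort person_id: egen double day_minutes = total(length_timeslot)
count if day_minutes != 1440
```

A nonzero count on the real data would point to time slots with no major activity, or with more than one, and those records would need a closer look.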
3. On adjusting activity lengths for multiple activities
Simultaneous and multiple activities are not coded very helpfully in the TUS unit-level data. For example, in a slot of 16:00 to 17:00, let’s say a person watches TV, talks to someone, and drinks something. Wherever there are multiple activities, only the first activity is coded as ‘multiple’ and then ‘simultaneous’/non-simultaneous. For any other activities in that same time slot the variable ‘multiple activity’ is recorded as ‘missing’. In this example, only watching TV is coded as multiple and simultaneous. The variable for ‘multiple’ is coded as missing for the next two activities of talking and drinking. This is inconsistent, as other variables like major/minor activities, paid/unpaid status are coded for every single activity.
Nevertheless, we have to adjust activity lengths for the fact that three activities are being performed in the same time slot – the ITUS records a maximum of three such activities in any time slot.
The ITUS report is based upon equally dividing the time period by the number of activities in that time slot. Note that since there can be either two or three multiple activities recorded, you need to:
- Determine how many multiple activities are recorded in that particular slot, and then
- Divide the total length of the slot by the number of activities
- Compute the sum of all occurrences of a particular activity across the person’s TUS to find the total time they spend on that activity.
Here is code that does this and is able to reproduce the time use numbers in the ITUS report.
a. Identifying how many multiple activities are performed in a particular time period.
The code below uses the fact that variable B5_v07, renamed here as ‘multiple_activity’, takes the value 1 if, in that particular time slot, multiple activities were performed, and 2 if only a single activity was recorded. As noted above, this is the case only for the first of the multiple activities in the time slot; the variable has missing values for any second or third multiple activities.
Thus, the code first sorts the file by person id and activity no., and then counts the number of missing value rows for the ‘multiple activity’ variable that follow for the same time slot, for the same person.
//Code to count the number of multiple activities in the same slot
sort person_id activity_no
gen number_multiple=2 if activity_no!=. & activity_no[_n+1]!=. & multiple_activity==1 & multiple_activity[_n+1]==.
replace number_multiple=3 if activity_no!=. & activity_no[_n+2]!=. & multiple_activity==1 & multiple_activity[_n+2]==.
replace number_multiple=number_multiple[_n-1] if activity_no!=. & multiple_activity==. & person_id==person_id[_n-1]
b. Then adjust the activity length by dividing the time spent equally by the number of multiple activities.
gen activity_length=length_timeslot if multiple_activity==2 & activity_no!=.
replace activity_length=length_timeslot/number_multiple if activity_no!=. & multiple_activity!=2
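As a cross-check on steps a and b, the number of activities in a slot can also be obtained by grouping records on person and time slot. This is a sketch of an alternative, not the code we used above; it assumes that all activities in a shared slot carry identical time_from and time_to strings, and the names n_in_slot, activity_length_alt and day_minutes are our own.

```stata
// Toy records: three activities share the 16:00 to 17:00 slot
clear
input str6 person_id activity_no str5 time_from str5 time_to multiple_activity length_timeslot
"H1P1" 1 "04:00" "16:00" 2 720
"H1P1" 2 "16:00" "17:00" 1 60
"H1P1" 3 "16:00" "17:00" . 60
"H1P1" 4 "16:00" "17:00" . 60
"H1P1" 5 "17:00" "04:00" 2 660
end

// Number of activities recorded for the same person in the same slot
bysort person_id time_from time_to: gen byte n_in_slot = _N

// Split the slot equally across its activities
gen double activity_length_alt = length_timeslot / n_in_slot

// Adjusted lengths should add up to 24 hours (1440 minutes) per person
bysort person_id: egen double day_minutes = total(activity_length_alt)
count if abs(day_minutes - 1440) > 0.01
```

If activity_length_alt disagrees with activity_length for some records, those records are worth inspecting before computing totals.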
c. Create code to summarize the total occurrence of any particular work type you are interested in, so that you can compute the total time spent by a person on that activity/work type in the 24 hour period.
Thus, for example, to compute the total time spent on all employment and related activities by each respondent here is the code used (if you have one that is more efficient, please do share!)
// code to compute total time spent on employment and related activities, for each respondent.
gen employment_activities=(activity_code>100 & activity_code<200)
sort person_id activity_no
gen i=1 if activity_no==1
replace i=999 if activity_no!=1
gen sum_employment_time=0
replace sum_employment_time=activity_length if activity_no==1 & employment_activities==1
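// counting all possible occurrences of employment activities – the maximum number of activities recorded for a person in this dataset is 70.
// note: "while i<70" tests the value of i in the first observation only, so the first record after sorting must have activity_no==1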
while i<70 {
replace sum_employment_time=sum_employment_time+activity_length[_n+i] if employment_activities[_n+i]==1 & person_id==person_id[_n+i] & activity_no==1
replace i=i+1
}
replace sum_employment_time=sum_employment_time[_n-1] if person_id==person_id[_n-1] & activity_no!=1
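One possibly more compact way to get the same per-person total is egen's total() function within person groups. This is a sketch on toy data rather than a tested replacement for the loop above: the activity codes in the toy data are only illustrative, and sum_employment_time_alt is our own name.

```stata
// Toy records: H1P1 has two employment activities; H2P2 has no time use data
clear
input str6 person_id activity_no activity_code activity_length
"H1P1" 1 911 600
"H1P1" 2 110 480
"H1P1" 3 181 60
"H1P1" 4 411 300
"H2P1" 1 911 1440
"H2P2" . .   .
end

// Same flag as above: 1 for employment and related activities, 0 otherwise
gen byte employment_activities = activity_code > 100 & activity_code < 200

// Per-person total, written to every record of that person
bysort person_id: egen double sum_employment_time_alt = total(activity_length * employment_activities)

// total() returns 0 for persons with no time use records; set those to missing
replace sum_employment_time_alt = . if missing(activity_no)
```

Drop the last line if you would rather keep zeros for persons without time use data. Comparing sum_employment_time_alt with sum_employment_time on a subsample is a quick way to confirm that the two approaches agree on your file.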
We hope this was helpful, and please don’t hesitate to contact us if you have comments or corrections.
The authors may be contacted using these email ids: srao[at]assumption[dot]edu and vijayamba1201[at]gmail[dot]com
Please email us to add/edit/correct these notes, all suggestions are welcome!