The sample surveys conducted by India’s National Sample Survey (NSS) office, formerly known as the National Sample Survey Office and, before that, the National Sample Survey Organisation (both abbreviated NSSO), under the Ministry of Statistics and Programme Implementation (MoSPI), are a valuable resource for social science researchers. Since its inception in 1950, the NSSO has been conducting large-scale sample surveys to generate data and statistical indicators on diverse socio-economic aspects.

In 2019, as part of the 77th round of the NSS, Schedule 33.1, called the Land and Livestock Holdings of Households and Situation Assessment of Agricultural Households (2018-19), or SAS 2019, was canvassed. SAS 2019 covers a wide range of information on India’s rural households that engage in agricultural activities, such as the incomes and land holdings of farmer households. A report and unit-level data from this survey were released in 2021. The Foundation has looked at this data closely, and you can find several blogs that examine various aspects of this survey here. Previously, two rounds of SAS surveys were conducted by the NSS, during the agricultural years 2002–03 and 2012–13. (It is useful to first have an overview of the current and previous rounds of SAS before delving into unit-level analysis of the latest data.)

This blog is an introduction to handling the unit-level data from SAS 2019 using the statistical programming language R. [1] It assumes an elementary understanding of R and gives a broad overview of the processes involved, with links to more detailed scripts.

Understanding Data and Documentation

The unit-level data are usually made available by the NSSO as fixed-width text files. A fixed-width text file is a file format commonly used to store large sample survey data. In this format, each field of data is given a fixed width, and each record (or line) in the file consists of a fixed number of these fields.
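
For illustration (these records are made up and are not from SAS 2019), suppose the first field in a file is two characters wide and the second is five characters wide. A file with two records would then look like this, with no separators between the fields:

0112345
0267890

Here, characters 1–2 of each line hold the first field and characters 3–7 hold the second; the layout documentation is what tells us where one field ends and the next begins.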

Along with the data, NSSO provides detailed documentation that can be used to extract (identify what the text data stands for) and analyse (such as estimating the national- or state-level aggregates) the data. The documentation accompanying unit-level data usually includes:

  1. A Readme document, which contains information regarding the number of records, the formula for calculating the weight of the observations, the formula for creating a primary key across data sets, the list of documents, and other relevant information;
  2. A spreadsheet detailing the layout of the data;
  3. Instructions to Field Staff (which contains concepts, definitions, design, and coverage);
  4. A document detailing the estimation procedure; and
  5. A published report (which includes the survey schedule).

The most important aspect of handling unit-level data is familiarity with the documentation. A careful reading of the documentation is essential to understand the nature of the data and, accordingly, engage with it. Owing to the large size of the data, we need statistical tools to handle it. Our choice, R, is a capable, open, and free/libre project that runs on a wide variety of platforms. There are alternatives, but they are often not open or free to use.

First, we need to understand the layout of the data. The documentation shows that the data for SAS 2019 comprise household-level information on a sample of the population of India, collected over two visits by enumerators in 2019. The NSSO has provided the data on different aspects, such as demographic information, cost of cultivation, and so on, in different blocks of the questionnaire. These blocks are, in turn, assembled into multiple “levels” (data files) for each visit.

We can use the “Instructions to Field Staff” to understand the definitions of the variables contained in the data. By using the survey schedule along with the spreadsheet containing the layout information, we can identify the variables we need and their locations (widths).

Data Extraction

The next step is to extract the data from the fixed-width files into a more legible format. In this exercise, we will read the unit-level data into data frames (a data frame in R is a table-like data structure that stores data in rows and columns). To do this, we first need the fixed widths of the variables and their names. We copy this information from the documentation (the spreadsheet with the layout information) into a new spreadsheet. We call this spreadsheet List_Level_Codes.xlsx and populate each sheet with the variable names from each level and their fixed widths. We then use the read_fwf() function from the readr library in R to read the fixed-width files into data frames. Here is an example:

# Read the spreadsheet into a data frame
library(readxl)
Level3Codes <- read_excel("List_Level_Codes.xlsx", sheet = "Level3")

# Read the raw data using this information
library(readr)
L3_V1 <- read_fwf("Raw data/r77s331v1L03.txt",
                  fwf_widths(widths = Level3Codes$Length,
                             col_names = make.names(Level3Codes$Name)),
                  col_types = cols(.default = col_number()))

Here, data from a fixed-width file are read into a data frame. The fixed-width file is called r77s331v1L03.txt, indicating that it contains Level 3 data from Visit 1 of the survey, for Schedule 33.1, canvassed as part of Round 77 of the NSS. This data is read into a data frame called L3_V1 using the widths and column names taken from the sheet Level3 of the spreadsheet List_Level_Codes.xlsx (stored in Level3Codes). All column types default to numeric.

Once extraction is completed, we verify that the number of records matches the number provided in the Readme file.
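
For example, a quick check in R (the expected count comes from the Readme and is not shown here):

nrow(L3_V1)  # should equal the record count listed in the Readme for Level 3, Visit 1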

Data Estimation

Since SAS 2019 is a sample survey, we estimate the characteristics of the population based on the data obtained from the sample. We do this by applying “weights” to the sample values, so that they reflect the true composition of the population being studied. These sample weights are calculated based on the probability of each unit in the sample being selected, along with some other relevant factors. In simple terms, we use the sample data to infer the characteristics of the entire population.

The formula for calculating weights for SAS 2019 is provided in the documentation. Accordingly, we create a new variable called Weights_V1 using the following code:

L3_V1$Weights_V1 <- L3_V1$Multiplier/100

In order to estimate a value, we apply these weights. As an example, let us estimate the number of rural households in India. From the documentation, we can see that the variable called value.of.agricultural.production.from.self.employment.activities.during.the.last.365.days.code. takes the value 1 for non-agricultural households and 2 for agricultural households. We can use the following code to estimate the number of rural households:

library(dplyr)

# Count sample households and sum their weights within each household-type code
Rural_Households <- L3_V1 %>%
  group_by(value.of.agricultural.production.from.self.employment.activities.during.the.last.365.days.code.) %>%
  summarise(Sample = n(),
            Estimated_number = sum(Weights_V1))

This creates a data frame called Rural_Households, with one row for each household-type code and columns for the sample count (Sample) and the estimated number of households (Estimated_number).

This data frame shows an estimated 93.09 million rural agricultural households and 79.35 million rural non-agricultural households at the national level. The report published by the NSSO confirms these figures (page ii of the report). By validating the result of this analysis against the published report, we confirm that our data extraction process was successful.

Estimating Household Incomes

Let us now attempt something more complex: estimating the monthly incomes of agricultural households in India during 2018-19. Household income is a complex variable, particularly in rural areas, where a household may have multiple sources of income. Agricultural households, in particular, often have diverse streams of income, including from crops, animals, wage work, non-farm businesses, remittances, and so on. Additionally, there are various ways to define and calculate household income: factoring in paid-out costs alone is one method, accounting for imputed costs as well leads to another, and so on. For instance, the Foundation has published a note on the methodology it employs to calculate household incomes in its surveys.

One way to calculate household incomes from this data is to separately calculate the different components of household income and then add them together. The different components include income from farming, income from wages, and so on. In this approach, we first load the appropriate fixed-width files, extract the relevant variables, do the necessary calculations, and estimate the income from each component. After this, we merge the components together, as sketched below, to calculate and estimate the monthly household income of agricultural households.
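
Here is a minimal sketch of the merging step. It assumes, purely for illustration, two hypothetical data frames, Farm_Income and Wage_Income, each already aggregated to one row per household and carrying a household identifier HHID built using the primary-key formula in the Readme; the column names are likewise hypothetical and not actual SAS 2019 variable names:

library(dplyr)

# Combine per-household income components; a household missing from one
# component gets a zero for that component instead of being dropped.
Household_Income <- Farm_Income %>%
  full_join(Wage_Income, by = "HHID") %>%
  mutate(Farm_Income_Monthly = coalesce(Farm_Income_Monthly, 0),
         Wage_Income_Monthly = coalesce(Wage_Income_Monthly, 0),
         Total_Monthly_Income = Farm_Income_Monthly + Wage_Income_Monthly)

Using full_join() rather than left_join() ensures that a household reporting only one type of income is still retained in the combined table.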

Such processes require close examination of the documentation, to gain clarity on both the nature of the data and the concepts. Otherwise, important details required for correct calculation can be missed. For example, the documentation shows that the farm income information is collected for a period of six months prior to the survey date, while the animal income information is collected only for the thirty days prior to the survey date. Unless this difference in reference periods is factored in, our estimates of total household income will be inaccurate.
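
As a rough illustration of such an adjustment (the data frame and column names here are hypothetical), incomes reported over different reference periods must be brought to a common monthly basis before being added together:

# Farm income is reported for the six months preceding the survey,
# animal income for the preceding thirty days; convert both to monthly.
Crop_Income$Monthly_Income <- Crop_Income$Net_Receipts_6m / 6
Animal_Income$Monthly_Income <- Animal_Income$Net_Receipts_30d  # thirty days is already roughly one month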

Scripts to Estimate Household Incomes

This publicly available git repository contains R scripts I wrote to estimate what we discussed above, along with detailed commentary and instructions. The repository documentation, along with the comments in the code, discusses in detail the particularities involved in estimating the monthly household incomes of agricultural households. The code in the repository is free and released under a very permissive license. The inspiration for this work was a previous attempt on the same topic by Deepak Johnson.

Further analyses and interpretation of the data are important aspects that, however, lie beyond the scope of this blog. Also, even though we specifically examined data from one survey, much of what is dealt with here is applicable to the NSSO’s other surveys as well. A similar exercise with Time-Use Survey data from the NSSO can be found in this blog.

To conclude, I would like to stress the importance of an open research culture. There are three aspects of this culture that I want to highlight – resource and knowledge sharing, doing so in an accessible manner (for instance, by leveraging free/libre software), and communicating the work widely. Embracing this culture has at least two benefits. First, it becomes easier for new researchers to enter and contribute to any area of research. Second, researchers can create a transparent, reliable, and growing body of work. Collaboration and building on previous research are critical for advancing scientific understanding, and open research practices make this process more accessible and effective. By working together, sharing the work, and prioritizing open research practices, we can create a more inclusive and collaborative research community that benefits everyone.

 

[1] Using RStudio (an environment for R) is highly recommended. Here is a quick and basic introduction to R. Here are some cheat sheets from RStudio.

About the author

C. A. Sethu is a Senior Research Assistant at the Foundation for Agrarian Studies.