9 Update microdata in the P drive

Before explaining how to update the P drive, make sure you understand what the P drive is, as explained in section 2.1.

The welfare data, most of it microdata but also bin and group data, reside in the folder P:/01.PovcalNet/01.Vintage_control. Inside, there are two kinds of folders: [1] country folders and [2] auxiliary folders. The former are identified by the three-letter ISO3 code of each country, whereas the latter start with an underscore ("_"). Never create a folder inside "01.Vintage_control" unless its name starts with an underscore. Otherwise, you might break the functionality of the many scripts that rely on this structure.

The internal structure of the country folders is the same and follows the [International Household Survey Network](http://ihsn.org/) standards. It looks something like this,

fs::dir_tree("p:/01.PovcalNet/01.Vintage_control/COL/", 
             regexp = "COL/COL_2015_GEIH")
#> p:/01.PovcalNet/01.Vintage_control/COL/
#> \-- COL_2015_GEIH
#>     +-- COL_2015_GEIH_V01_M_V02_A_GMD
#>     |   \-- Data
#>     |       +-- COL_2015_GEIH_V01_M_V02_A_GMD_GPWG.dta
#>     |       +-- COL_2015_GEIH_V01_M_V02_A_GMD_GPWG.dtasig
#>     |       +-- COL_2015_GEIH_v01_M_v02_A_GMD_PCN.dta
#>     |       +-- COL_2015_GEIH_v01_M_v02_A_GMD_PCN.dtasig
#>     |       \-- COL_2015_GEIH_v01_M_v02_A_GMD_PCN.txt
#>     +-- COL_2015_GEIH_V01_M_V03_A_GMD
#>     |   \-- Data
#>     |       +-- COL_2015_GEIH_V01_M_V03_A_GMD_GPWG.dta
#>     |       +-- COL_2015_GEIH_V01_M_V03_A_GMD_GPWG.dtasig
#>     |       +-- COL_2015_GEIH_v01_M_v03_A_GMD_PCN.dta
#>     |       +-- COL_2015_GEIH_v01_M_v03_A_GMD_PCN.dtasig
#>     |       \-- COL_2015_GEIH_v01_M_v03_A_GMD_PCN.txt
#>     \-- COL_2015_GEIH_V01_M_V04_A_GMD
#>         \-- Data
#>             +-- COL_2015_GEIH_V01_M_V04_A_GMD_GPWG.dta
#>             +-- COL_2015_GEIH_V01_M_V04_A_GMD_GPWG.dtasig
#>             +-- COL_2015_GEIH_V01_M_V04_A_GMD_PCN.dta
#>             \-- COL_2015_GEIH_V01_M_V04_A_GMD_PCN.dtasig

Within the Colombian surveys (COL), there are as many folders as surveys available for this country. Each folder name has three components, CCC_YYYY_SSSS, where CCC stands for the country code, YYYY for the year, and SSSS for the survey acronym. In this case, it is COL_2015_GEIH. Inside this folder we have the different versions available. These are represented by the convention CCC_YYYY_SSSS_Vmm_M_Vaa_A_TTT, where mm stands for the version of the master data (i.e., the one released by the government of the country), and aa stands for the version of the adaptation of the survey. Finally, TTT refers to the collection of the adaptation; in PovcalNet we only have the GMD collection. For example, COL_2015_GEIH_V01_M_V03_A_GMD is the third adaptation, within the GMD collection, of the first version of the master data of the 2015 GEIH survey of Colombia. Inside each vintage-control folder you will find the Data folder, and within it the different modules of the survey. The module is represented by the last acronym of the file name, say GPWG or PCN. These two modules are explained further below.
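If you ever need to check which vintages of a survey exist directly from Stata, a minimal sketch like the following lists the vintage-control folders of the Colombian example above, using Stata's dir extended macro function (the path is just the example; adjust it to your survey):

* List the vintage-control folders of one survey (illustrative sketch)
local survdir "p:/01.PovcalNet/01.Vintage_control/COL/COL_2015_GEIH"
local vintages : dir "`survdir'" dirs "*", respectcase
foreach v of local vintages {
    display "`v'"
}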

9.1 Get PovcalNet inventory up to date

Once everything has been approved in PRIMUS (see section 8.4), we can download the most recent data into our system using the directive pcn download gpwg. The process of downloading the data can be done in several ways, but we suggest two.

9.1.1 Update all the years of selected countries

This method is easy but highly inefficient: it could be the case that only one year of a particular country was updated or added, yet the code below will check all the years of that country as well. You may think that it does not make sense to download the data in this way, and you might be right. However, downloading only the data that you need is not as straightforward as you may think, so that code is a little more complex than usual. We recommend this inefficient way because it is easy to implement, and pcn is smart enough to verify whether the microdata has changed in a particular year; if it has not changed, pcn won't update the microdata file in the P drive. It is just slow because it needs to load the microdata from datalibweb.

* query all approved transactions in PRIMUS
primus query, overallstatus(approved)

local filtdate "2020-04-01"   // change this date YYYY-MM-DD
keep if  date_modified >= clock("`filtdate'", "YMD")
levelsof country, local(countries) clean  // code of countries that changed

* download the GPWG data of the countries that changed
pcn download gpwg, countr(`countries')

The code above is simple. First, you query all the approved transactions in PRIMUS using the primus command and keep only those transactions that were modified on or after the date in the local filtdate. Then, you save in the local countries the codes of the countries that changed and pass it into the pcn call. From there, pcn will take care of the update and will provide you with a file summarizing what data was updated, added, skipped, or failed.

9.1.2 Update file by file

The code below may seem complex, but it is actually very simple. The Mata function simply reads observation i, inside the while loop, from the already-filtered data obtained with primus query. You can see that it is very similar to the code above; the main difference is that it goes file by file instead of checking all the files of one country.

//========================================================
// Mata function
//========================================================
cap mata: mata drop get_ind()
mata:
    void get_ind(string matrix R) {
        i = strtoreal(st_local("i"))
        vars = tokens(st_local("varlist"))
        for (j =1; j<=cols(vars); j++) {
            st_local(vars[j], R[i,j] )
        }
    } // end of IDs variables
    
end

//========================================================
// PRIMUS query
//========================================================

primus query, overallstatus(approved)

local filtdate "2020-04-01"   // change this date
keep if  date_modified >= clock("`filtdate'", "YMD")
levelsof country, local(countries) clean  // code of countries that changed
tostring _all, replace

local varlist = "country year"
mata: R = st_sdata(.,tokens(st_local("varlist")))
local n = _N

//========================================================
// Loop over surveys
//========================================================

local i = 0
while (`i' < `n') {
    local ++i
    
    mata: get_ind(R)
    disp "`country' - `year'"
    pcn download gpwg, countr(`country')  year(`year')
    
}

Keep in mind the following,

  1. This code works fine if you copy and paste it into your do-file editor and then run it directly in Stata. If you save it and then execute it from a different file using either do or run, it will fail because the end command that closes the Mata code will terminate the process.
  2. This code could be implemented as another subcommand of pcn without running into the problem above. However, we have not had time to make this change. The Mata function already exists in the pcn repository and is used by other pcn subroutines like pcn_download_gpwg. If you want to contribute to pcn by adding this feature, you're more than welcome! 😉
  3. The only problem with this approach is that, at the end of the process, you won't get a nice file summarizing what happened with each file. This could also be fixed if this routine were included as part of pcn.

9.2 Create the _PCN files

The final step before having the microdata ready to be converted to .pcb files and ingested by the PovcalNet system is to create the “PCN” files. These files are just an adaptation of the GPWG microdata in the P drive. They are called “PCN” because that is their suffix in the naming convention. For instance, the “_PCN” file of COL_2017_GEIH_V01_M_V01_A_GMD_GPWG.dta is COL_2017_GEIH_v01_M_v01_A_GMD_PCN.dta. This conversion is done through the pcn create directive. In the case above, for instance, you only need to execute the directive,

pcn create, countries(COL) year(2017) clear

Also, you can create the “PCN” files of all surveys by just typing,

pcn create, countries(ALL)

You may need to use the option replace in the directive above, in case you want to replace an existing “PCN” file. In a similar fashion to pcn download gpwg, pcn create compares against the current data in the P drive and does not replace anything unless you make this option explicit.
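For instance, to force the re-creation of the Colombian file from the example above, the call would simply add that option:

* Force re-creation of an existing PCN file (illustrative example)
pcn create, countries(COL) year(2017) clear replace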

Also, if you want to understand what the pcn create directive does, you can look at the file pcn_create.ado, but in general it loads the data in the P drive and standardizes the output to make it ready for PovcalNet. For instance, it drops negative or missing values of the welfare variable and non-positive or missing values of the weight variable, like this,

* drop missing values
drop if welfare < 0 | welfare == .
drop if weight <= 0 | weight == .

Or it makes sure the weight variable has the same name in all the files,

cap confirm var weight, exact
if (_rc) {
    cap confirm var weight_p, exact
    if (_rc == 0) rename weight_p weight
    else {
        cap confirm var weight_h, exact
        if (_rc == 0) rename weight_h weight
        else {
            noi disp in red "no weight variable found for country(`country') year(`year') veralt(`veralt') "
            local status "error. cleaning"
            local dlwnote "no weight variable found for country(`country') year(`year') "
            mata: P = pcn_info(P)
            noi _dots `i' 1
            continue
        }
    }
}

This command also makes sure to do the proper adjustments to the India and Indonesia datasets.

Daniel, could you please provide a short explanation of the IND and IDN adjustment?

Make sure that, for all the new GPWG files you download from datalibweb based on the PRIMUS catalog, you create all the corresponding PCN files.
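For example, if the local countries from section 9.1.1 still holds the codes of the countries whose GPWG files changed, a sketch like the following would create the corresponding PCN files (this assumes pcn create accepts a list of codes in countries(), as pcn download gpwg does in countr()):

* Sketch: create the PCN files of every country with new or updated GPWG data
* (assumes countries() accepts a list of ISO3 codes)
pcn create, countries(`countries') clear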

9.3 China synthetic files

For the early years of Indonesia and India, and a large part of China’s series, we calculate the national inequality measures using separate urban and rural grouped data. These data are synthetic, created following a simple process:

  1. Query the fitted Lorenz curve parameters for these country-years.

For this kind of data (grouped data), PovcalNet fits a Lorenz curve using two functional forms: the General Quadratic (GQ) and the Beta Lorenz curve. By querying PovcalNet, one can recover the critical parameters used to fit each curve to the data. These estimated parameters and the poverty estimates of each country-year are provided for both urban and rural areas through a simple query:

http://iresearch.worldbank.org/PovcalNet/Detail.aspx?Format=Detail&C0=CHN_2&PPP0=3.0392219&PL0=1.90&Y0=1981&NumOfCountries=1

Checking the query in some detail shows that it is composed of a few key components:

  • The server root: This is the first part of the query. In the example, it corresponds to “http://iresearch.worldbank.org/PovcalNet.” This root indicates the PovcalNet server to be used to query the data. Keep in mind that there are a few servers: the production one and the dev (or internal) one. The example uses the production server; in case you intend to use the dev server, change the root to “http://wbgmsrech001/povcalnet.”

  • The country: After the server definition and “Format=Detail,” the query details begin. The first parameter to appear is “C0,” which refers to the country. To query a given country, add the country’s three-letter code after the “=” symbol. In our case, add CHN to query data for China.

  • The coverage level: Just after the country code and the "_" follows a number. This number indicates the coverage level of the data, rural or urban. If the desired level is rural, this number must be one (1); if the sought level is urban, the number should be set to two (2). The example queries data at the urban level.

  • The PPP: The next component is the PPP. The PPP is set after “&PPP0=.” In the example, the PPP to be used is 3.0392219.

  • The poverty line: The poverty line is set following the characters “&PL0=.” In the example, the line is set to the extreme poverty line of 1.90.

  • The year: It is set using “&Y0=.” In the example query, the requested year is 1981.
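Putting these components together, the example query can be reconstructed piece by piece. The sketch below only rebuilds the URL shown above; the values are the ones from the example, not defaults:

* Rebuild the example query URL from its components (illustrative only)
local root  "http://iresearch.worldbank.org/PovcalNet"   // production server
local ccode "CHN"          // three-letter country code
local cov   "2"            // 1 = rural, 2 = urban
local ppp   "3.0392219"    // PPP to be used
local pline "1.90"         // poverty line
local year  "1981"         // requested year

local query "`root'/Detail.aspx?Format=Detail&C0=`ccode'_`cov'&PPP0=`ppp'&PL0=`pline'&Y0=`year'&NumOfCountries=1"
display "`query'"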

Once a given observation is queried from PovcalNet, the estimated parameters are recovered and stored. An auxiliary ado-file, “lorenz_query.ado,” carries all this out. The extraction is close to a web-scraping process: using regular expressions, each value of interest is stored and returned as a scalar. For example, the piece of code that recovers the Lorenz estimates is as follows:

foreach sec in GQ beta final {

        if regexm(`sec'," A[ ]+([0-9|\.|-]+)") scalar `sec'c_A = regexs(1)
        if regexm(`sec'," B[ ]+([0-9|\.|-]+)") scalar `sec'c_B = regexs(1)
        if regexm(`sec'," C[ ]+([0-9|\.|-]+)") scalar `sec'c_C = regexs(1)

        if ("`sec'" == "beta"){
            if regexm(`sec'," Theta:[ ]+([0-9|\.|-]+)") scalar `sec'_theta = regexs(1)
            if regexm(`sec'," Gamma:[ ]+([0-9|\.|-]+)") scalar `sec'_gamma = regexs(1)
            if regexm(`sec'," Delta:[ ]+([0-9|\.|-]+)") scalar `sec'_delta = regexs(1)
        }
}
  2. Estimate the slope parameters.

The following code estimates the slope parameters:

foreach cv of local cover{
        loc cvl = substr("`cv'", 1,1) // level_sufix
        
        // - Create parameters to calculate the slope of the GQ lorenz curve 
        scalar e_`cvl' = -(`=scalar(GQcoeffA_`cvl')' + `=scalar(GQcoeffB_`cvl')' + `=scalar(GQcoeffC_`cvl')' + 1)
        scalar m_`cvl' = (`=scalar(GQcoeffB_`cvl')')^2 - 4*`=scalar(GQcoeffA_`cvl')'
        scalar n_`cvl' =2*`=scalar(GQcoeffB_`cvl')'*`=scalar(e_`cvl')'-4*`=scalar(GQcoeffC_`cvl')'
        scalar r_`cvl' =((`=scalar(n_`cvl')')^2-4*`=scalar(m_`cvl')'*((`=scalar(e_`cvl')')^2))^(0.5)
        
        // - Convert GQ mean from monthly to daily 
        scalar GQmean_`cvl' = `=scalar(GQmean_`cvl')' * 12 / 365
        scalar betamean_`cvl' = `=scalar(betamean_`cvl')' * 12 / 365
    }

Notice that these parameters are created using the parameters delivered by the API, namely all the scalars whose names start with GQcoeff in the previous chunk of code.

  3. Simulate the data.

For each coverage level, urban and rural, 100,000 observations are simulated, and the observations’ weights are simply the population by coverage level.

clear
set type double
set seed 12345

scalar nobs  = 100000
scalar first = 1/(2*`=nobs')
scalar last  = 1-(1/(2*`=nobs'))
set obs `=nobs'
range _F  `=first' `=last'

Now, using the slope parameters of the Lorenz curve calculated above, the simulated incomes are computed in the following lines:

** Calculate the slope of GQ lorenz curve and income (= mu * slope of LC)
        gen double x_F_GQ = `=scalar(GQmean_`cvl')'*(-(`=scalar(GQcoeffB_`cvl')')/2 -(2*`=scalar(m_`cvl')'*(_F)+`=scalar(n_`cvl')')*(`=scalar(m_`cvl')'*(_F)^2+`=scalar(n_`cvl')'*(_F)+(`=scalar(e_`cvl')')^2)^(-0.5)/4)
        
** Calculate the slope of beta lorenz curve and income (= mu * slope of LC)
        gen double x_F_beta = `=scalar(betamean_`cvl')'*(1-`=scalar(betatheta_`cvl')'*((_F)^`=scalar(betagamma_`cvl')')*((1-(_F))^`=scalar(betadelta_`cvl')')*((`=scalar(betagamma_`cvl')'/(_F)) - (`=scalar(betadelta_`cvl')'/(1-(_F)))))

The simulated incomes are just the mean times the slope of the Lorenz curve, for both the GQ and the Beta functional forms.
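Reading the two gen lines directly, what they compute is the income implied by each fitted Lorenz curve at the percentile grid _F, that is, the mean times the derivative of the Lorenz curve:

$$
x_{GQ}(F) = \mu_{GQ}\left[-\frac{B}{2} - \frac{2mF + n}{4\sqrt{mF^{2} + nF + e^{2}}}\right],
\qquad
x_{\beta}(F) = \mu_{\beta}\left[1 - \theta\, F^{\gamma}(1-F)^{\delta}\left(\frac{\gamma}{F} - \frac{\delta}{1-F}\right)\right]
$$

where B is the GQ coefficient, e, m, and n are the slope parameters computed in step 2, and θ, γ, and δ are the Beta Lorenz parameters.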

Once this is calculated, only the weights (population) and the simulated income values are kept in memory. That is it; that is the simulated data we use.
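As a rough sketch of that final step, assuming a hypothetical scalar pop_`cvl' that holds the population of the current coverage level (the actual routine may organize this differently):

* Sketch of the final step: keep only the simulated incomes and their weights
* (pop_`cvl' is a hypothetical scalar with the population of this coverage level)
gen double weight = `=scalar(pop_`cvl')' / `=scalar(nobs)'
keep _F x_F_GQ x_F_beta weight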