9 Update microdata in the P drive
Before explaining how to update the P drive, make sure you understand what the P drive is, as explained in section 2.1.
The welfare data, most of it microdata but also bin and group data, reside in the folder P:/01.PovcalNet/01.Vintage_control. Inside, there are two kinds of folders, [1] country folders and [2] auxiliary folders. The former are identified by the three-letter ISO3 code of each country, whereas the latter start with an underscore "_". You must be careful never to create a folder inside "01.Vintage_control" unless it starts with an underscore. Otherwise, you might break the functionality of many scripts that rely on this structure.
The internal structure of the country folders is the same and it follows the [International Household Survey Network](http://ihsn.org/) standards. It looks something like this,
fs::dir_tree("p:/01.PovcalNet/01.Vintage_control/COL/",
regexp = "COL/COL_2015_GEIH")
#> p:/01.PovcalNet/01.Vintage_control/COL/
#> \-- COL_2015_GEIH
#> +-- COL_2015_GEIH_V01_M_V02_A_GMD
#> | \-- Data
#> | +-- COL_2015_GEIH_V01_M_V02_A_GMD_GPWG.dta
#> | +-- COL_2015_GEIH_V01_M_V02_A_GMD_GPWG.dtasig
#> | +-- COL_2015_GEIH_v01_M_v02_A_GMD_PCN.dta
#> | +-- COL_2015_GEIH_v01_M_v02_A_GMD_PCN.dtasig
#> | \-- COL_2015_GEIH_v01_M_v02_A_GMD_PCN.txt
#> +-- COL_2015_GEIH_V01_M_V03_A_GMD
#> | \-- Data
#> | +-- COL_2015_GEIH_V01_M_V03_A_GMD_GPWG.dta
#> | +-- COL_2015_GEIH_V01_M_V03_A_GMD_GPWG.dtasig
#> | +-- COL_2015_GEIH_v01_M_v03_A_GMD_PCN.dta
#> | +-- COL_2015_GEIH_v01_M_v03_A_GMD_PCN.dtasig
#> | \-- COL_2015_GEIH_v01_M_v03_A_GMD_PCN.txt
#> \-- COL_2015_GEIH_V01_M_V04_A_GMD
#> \-- Data
#> +-- COL_2015_GEIH_V01_M_V04_A_GMD_GPWG.dta
#> +-- COL_2015_GEIH_V01_M_V04_A_GMD_GPWG.dtasig
#> +-- COL_2015_GEIH_V01_M_V04_A_GMD_PCN.dta
#> \-- COL_2015_GEIH_V01_M_V04_A_GMD_PCN.dtasig
Within the Colombian surveys (COL), there are as many folders as surveys available for this country. Each folder name has three components, CCC_YYYY_SSSS, where CCC stands for the country code, YYYY for the year, and SSSS for the survey acronym. In this case, it is COL_2015_GEIH. Inside this folder we have the different versions available. Each version is represented by the convention CCC_YYYY_SSSS_Vmm_M_Vaa_A_TTT, where mm stands for the version of the master data (i.e., the one released by the government of the country), and aa stands for the version of the adaptation of the survey. Finally, TTT refers to the collection of the adaptation. In PovcalNet we only have the GMD collection.
Finally, inside each of the vintage control folders, you will find the Data folder and, within it, the different modules of the survey. The module is identified by the last acronym in the file name, say GPWG or PCN. These two modules are explained further below.
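To make the convention concrete, below is a minimal Stata sketch (illustration only; the file name comes from the tree above and the regular expression is not part of any pcn routine) that decomposes such a file name into its components.
* Illustrative only: decompose a file name that follows the
* CCC_YYYY_SSSS_Vmm_M_Vaa_A_TTT convention plus the module suffix
local fname "COL_2015_GEIH_V01_M_V02_A_GMD_GPWG"
if regexm("`fname'", "^([A-Z]+)_([0-9]+)_([A-Z]+)_[Vv]([0-9]+)_M_[Vv]([0-9]+)_A_([A-Z]+)_([A-Z]+)$") {
    disp "country:            " regexs(1)    // CCC  -> COL
    disp "year:               " regexs(2)    // YYYY -> 2015
    disp "survey acronym:     " regexs(3)    // SSSS -> GEIH
    disp "master version:     " regexs(4)    // mm   -> 01
    disp "adaptation version: " regexs(5)    // aa   -> 02
    disp "collection:         " regexs(6)    // TTT  -> GMD
    disp "module:             " regexs(7)    // GPWG or PCN
}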
9.1 Get PovcalNet inventory up to date
Once everything has been approved in PRIMUS (see section 8.4),
we can download the most recent data into our system using the directive pcn download gpwg. The process of downloading the data can be done in several ways, but we suggest two.
9.1.1 Update all the years of selected countries
This method is easy but highly inefficient because it could be the case that only one year of a particular country was updated or added, yet the code below will check whether all the years of that country have changed as well. You may think that it does not make sense to download the data in this way, and you might be right. However, downloading only the data that you need is not as straightforward as you may think, and thus the code is a little more complex than usual. We recommend this inefficient way because it is easy to implement, and pcn is smart enough to verify whether or not the microdata has changed in a particular year; if it has not changed, pcn won't update the microdata file in the P drive. It is just slow because it needs to load the microdata from datalibweb.
primus query, overallstatus(approved)
local filtdate "2020-04-01"                 // change this date YYYY-MM-DD
keep if date_modified >= clock("`filtdate'", "YMD")
levelsof country, local(countries) clean    // code of countries that changed
pcn download gpwg, countr(`countries')
The code above is simple. First, you query all the approved transactions in PRIMUS using the primus command and keep only those transactions that were created after the date in the local filtdate. Then, you save in the local countries the codes of the countries that changed and parse it into the pcn call. From there, pcn will take care of the update and will provide you with a file summarizing what data was updated, added, skipped, or failed.
9.1.2 Update file by file
The code below may seem complex, but it is actually very simple. The Mata function only reads the observation i in the while loop of the already-filtered data obtained with primus query. You can see that it is very similar to the code above; the main difference is that it goes file by file instead of checking all the files of one country.
//========================================================
// Mata function
//========================================================
cap mata: mata drop get_ind()
mata:
void get_ind(string matrix R) {
    i    = strtoreal(st_local("i"))
    vars = tokens(st_local("varlist"))
    for (j = 1; j <= cols(vars); j++) {
        st_local(vars[j], R[i, j])
    }   // end of IDs variables
}
end

//========================================================
// PRIMUS query
//========================================================
primus query, overallstatus(approved)
local filtdate "2020-04-01"                 // change this date
keep if date_modified >= clock("`filtdate'", "YMD")
levelsof country, local(countries) clean    // code of countries that changed

tostring _all, replace
local varlist = "country year"
mata: R = st_sdata(., tokens(st_local("varlist")))
local n = _N

//========================================================
// Loop over surveys
//========================================================
local i = 0
while (`i' < `n') {
    local ++i
    mata: get_ind(R)

    disp "`country' - `year'"
    pcn download gpwg, countr(`country') year(`year')
}
Keep in mind the following,
- This code works fine if you copy it and paste it in your do-file editor and then run it directly in Stata. If you save it and then execute it from a different file using either do or run, it will fail because the end command at the end of the Mata code will finish the process.
- This code could be implemented in another subcommand of pcn without running into the problem above. However, we don't have time to make this change. The Mata function already exists in the pcn repository and it is being used by other pcn subroutines like pcn_download_gpwg. If you want to contribute to pcn by adding this feature, you're more than welcome! 😉
- The only problem with this approach is that you won't get, at the end of the process, a nice file summarizing what happened with each file. This could also be fixed if this routine is included as part of pcn.
9.2 Create the _PCN files
The final step before having the microdata ready to be converted to .pcb files
and ingested by the PovcalNet system is to create the “PCN” files. These files
are just an adaptation of the GPWG microdata in the P drive. They are called
“PCN” because that is their suffix in the naming convention. For instance, the
“_PCN” file of COL_2017_GEIH_V01_M_V01_A_GMD_GPWG.dta is
COL_2017_GEIH_v01_M_v01_A_GMD_PCN.dta. This conversion is done
through the pcn create
directive. In the case above, for instance, you only
need to execute the directive,
pcn create, countries(COL) year(2017) clear
Also, you can create the “PCN” files of all surveys, by just typing,
pcn create, countries(ALL)
You may need to use the option replace in the directive above, in case you want to replace an existing "PCN" file. In a similar fashion to pcn download gpwg, pcn create compares the current data in the P drive and does not replace anything unless you make this option explicit.
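For instance, to overwrite an existing "PCN" file for the Colombian case above, you would make the option explicit,
pcn create, countries(COL) year(2017) clear replace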
Also, if you want to understand what the pcn create directive does, you can look at the file pcn_create.ado, but in general it makes sure to load the data in the P drive and standardize the output to make it ready for PovcalNet. For instance, it drops observations with negative or missing welfare values and with non-positive or missing weights, like this,
* drop missing values
drop if welfare < 0 | welfare == .
drop if weight <= 0 | weight == .
It also makes sure that the weight variable has exactly the same name in all the files,
cap confirm var weight, exact
if (_rc) {
    cap confirm var weight_p, exact
    if (_rc == 0) rename weight_p weight
    else {
        cap confirm var weight_h, exact
        if (_rc == 0) rename weight_h weight
        else {
            noi disp in red "no weight variable found for country(`country') year(`year') veralt(`veralt') "
            local status "error. cleaning"
            local dlwnote "no weight variable found for country(`country') year(`year') "
            mata: P = pcn_info(P)
            noi _dots `i' 1
            continue
        }
    }
}
This command also makes sure to do the proper adjustments to the India and Indonesia datasets.
Make sure that, for all the new GPWG files you download from datalibweb based on the PRIMUS catalog, you create all the corresponding PCN files.
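A minimal sketch of how the two steps can be chained is below. It assumes the local countries created in section 9.1.1 is still in memory and that pcn create without the year() option goes through all the available years of each country.
* Sketch: create the PCN files for the countries whose GPWG files were
* just downloaded (assumes the local `countries' from section 9.1.1)
foreach cty of local countries {
    pcn create, countries(`cty') clear replace
}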
9.3 China synthetic files
For the early years of Indonesia and India, and a large part of China's, we calculate the national inequality measures using separate urban and rural grouped data. These data are synthetic, created following a simple process:
- Query the fitted Lorenz curve parameters for these country-years.
For this kind of data (grouped data), PovcalNet fits a Lorenz curve using two functional forms: the General Quadratic (GQ) and the Beta. By querying PovcalNet, one can recover the critical parameters used to fit the curve over the data. These estimated parameters and the poverty estimates of each country-year are provided for both urban and rural areas through a simple query.
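For reference, a sketch of the standard parameterizations of these two curves (they correspond to the GQcoeff and beta scalars that appear in the code below), where p is the cumulative population share and L the cumulative welfare share:

$$\text{GQ:}\quad L(1-L) = A\,(p^{2}-L) + B\,L\,(p-1) + C\,(p-L)$$

$$\text{Beta:}\quad L(p) = p - \theta\, p^{\gamma}\,(1-p)^{\delta}$$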
Checking the query in some detail shows that it is composed of a few key components:
- The server root: This is the first part of the query. In the example, it corresponds to "http://iresearch.worldbank.org/PovcalNet". This root indicates the PovcalNet server to be used to query the data. Keep in mind that there are a few servers: the production one and the dev (or internal) one. The example uses the production server; in case you intend to use the dev server, change the root to "http://wbgmsrech001/povcalnet".
- The country: Following the server definition, the API query goes into the details, just after "Format=Detail". The first component to appear is "C0", which refers to the country. To query a given country, add the country's three-letter code after the "=" symbol. In our case, add CHN to query data for China.
- The coverage level: Just after the country code and the "_" follows a number. This number indicates the data coverage level, urban or rural. If the desired level is rural, this number must be one (1); if the sought level is urban, the number should be set to two (2). The example queries data at the urban level.
- The PPP: The next component is the PPP, set after "&PPP0=". In the example, the PPP to be used is 3.0392219.
- The poverty line: The poverty line is set following the characters "&PL0". In the example, the line is set to the extreme poverty line of 1.90.
- The year: It is set using "&Y0=". In the example query, the requested year is 1981.
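Putting these components together, an illustrative query for urban China in 1981 could be assembled as below. This is only a sketch based on the components just described; the endpoint segment ("Detail.aspx" here) is an assumption, so copy the exact format from an existing call in lorenz_query.ado.
* Illustrative only: assemble a query from the components described above.
* The "Detail.aspx" segment is an assumption; check lorenz_query.ado for
* the exact endpoint used in production.
local root  "http://iresearch.worldbank.org/PovcalNet"
local query "`root'/Detail.aspx?Format=Detail&C0=CHN_2&PPP0=3.0392219&PL0=1.9&Y0=1981"
copy "`query'" "chn_1981_urban.txt", replace    // download the detail page to a local file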
Once a given observation is queried from PovcalNet, the estimated parameters are recovered and stored. An auxiliary ado file, "lorenz_query.ado", carries out all of this. The extraction is close to a web scraping process. Using regular expressions, each of the values of interest is stored and returned as a scalar. For example, the piece of code that recovers the Lorenz estimates is as follows:
foreach sec in GQ beta final {
    if regexm(`sec'," A[ ]+([0-9|\.|-]+)") scalar `sec'c_A = regexs(1)
    if regexm(`sec'," B[ ]+([0-9|\.|-]+)") scalar `sec'c_B = regexs(1)
    if regexm(`sec'," C[ ]+([0-9|\.|-]+)") scalar `sec'c_C = regexs(1)
    if ("`sec'" == "beta") {
        if regexm(`sec'," Theta:[ ]+([0-9|\.|-]+)") scalar `sec'_theta = regexs(1)
        if regexm(`sec'," Gamma:[ ]+([0-9|\.|-]+)") scalar `sec'_gamma = regexs(1)
        if regexm(`sec'," Delta:[ ]+([0-9|\.|-]+)") scalar `sec'_delta = regexs(1)
    }
}
- Estimate the slope parameters. The following code estimates the slope parameters:
foreach cv of local cover {
    loc cvl = substr("`cv'", 1, 1)    // level_sufix
    // - Create parameters to calculate the slope of the GQ lorenz curve
    scalar e_`cvl' = -(`=scalar(GQcoeffA_`cvl')' + `=scalar(GQcoeffB_`cvl')' + `=scalar(GQcoeffC_`cvl')' + 1)
    scalar m_`cvl' = (`=scalar(GQcoeffB_`cvl')')^2 - 4*`=scalar(GQcoeffA_`cvl')'
    scalar n_`cvl' = 2*`=scalar(GQcoeffB_`cvl')'*`=scalar(e_`cvl')' - 4*`=scalar(GQcoeffC_`cvl')'
    scalar r_`cvl' = ((`=scalar(n_`cvl')')^2 - 4*`=scalar(m_`cvl')'*((`=scalar(e_`cvl')')^2))^(0.5)
    // - Convert GQ mean from monthly to daily
    scalar GQmean_`cvl'   = `=scalar(GQmean_`cvl')' * 12 / 365
    scalar betamean_`cvl' = `=scalar(betamean_`cvl')' * 12 / 365
}
Notice that these parameters are created using the parameters delivered by the API, namely, all the scalars whose names start with GQcoeff in the previous chunk of code.
- Simulate the data. For each coverage level, urban and rural, 100,000 observations are simulated, and the observations' weights are simply the population by coverage level.
clear
set type double
set seed 12345
scalar nobs=100000
scalar first=1/(2*`=nobs')
scalar last=1-(1/(2*`=nobs'))
set obs `=nobs'
range _F `=first' `=last'
Now, using the previously calculated Lorenz curve slope parameters, the simulated incomes are computed in the following lines:
** Calculate the slope of GQ lorenz curve and income (= mu * slope of LC)
gen double x_F_GQ = `=scalar(GQmean_`cvl')'*(-(`=scalar(GQcoeffB_`cvl')')/2 -(2*`=scalar(m_`cvl')'*(_F)+`=scalar(n_`cvl')')*(`=scalar(m_`cvl')'*(_F)^2+`=scalar(n_`cvl')'*(_F)+(`=scalar(e_`cvl')')^2)^(-0.5)/4)

** Calculate the slope of beta lorenz curve and income (= mu * slope of LC)
gen double x_F_beta = `=scalar(betamean_`cvl')'*(1-`=scalar(betatheta_`cvl')'*((_F)^`=scalar(betagamma_`cvl')')*((1-(_F))^`=scalar(betadelta_`cvl')')*((`=scalar(betagamma_`cvl')'/(_F)) - (`=scalar(betadelta_`cvl')'/(1-(_F)))))
The simulated incomes are just the mean times the slope of the Lorenz curve, for both the GQ and the Beta functional forms.
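In formula terms, the two gen commands above compute the income at cumulative population share p (the variable _F) as the mean times the slope of the fitted Lorenz curve:

$$x_{GQ}(p) = \mu_{GQ}\left[-\frac{B}{2} - \frac{2mp + n}{4\sqrt{mp^{2} + np + e^{2}}}\right]$$

$$x_{\beta}(p) = \mu_{\beta}\left[1 - \theta\, p^{\gamma}(1-p)^{\delta}\left(\frac{\gamma}{p} - \frac{\delta}{1-p}\right)\right]$$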
Once this is calculated, only the weights (population) and the simulated income values are kept in memory. That is the simulated data we use.