11 Create .PCB files

As explained in the introduction of this part of the book, creating the .PCB files is the last step of the process before uploading everything to the PovcalNet system.

11.1 What the heck is a PCB file?!!

PCB files are only relevant for micro-data. They basically store the micro-data welfare and weight vectors, as well as pre-computed statistics. They are custom binary files (data is stored as a bunch of 0s and 1s) that were created to support the computations in PovcalNet. It is not a standard file format like .csv or .dta.

They are used because they:

  • Take less space
  • Can be read more efficiently

PCB files contain four pieces of information:

  • Metadata

    • The PCB ID number. There are actually two types of .PCB files, an old and a new format. The ID number is a way to differentiate them.
    • The number of microdata records
    • The number of points on the Lorenz curve
  • Lorenz Curve

  • Pre-computed statistics
    Statistics that need to be computed only once (i.e. not on-the-fly, since they are not sensitive to the poverty line)

  • Microdata

PCB file represention

11.2 The povcalnet_update repository

Once you have updated all the sheets in the master file–besides the “SurveyMean” sheet–and have updated the microdata in the P drive, the next step is to create the .pcb files and update the “SurveyMean” sheet. All of this is done with the PovcalNet-Team/povcalnet_update repository. Make sure you clone the repo and open it as a project in Rstudio. You will find the project has three .R files only. If everything goes as expected, you’ll only need to use the file 00.master.R.

In rare cases, you will need to modify the functions in the other two files. The utils.R file has generic functions such as loading the master file into the system, or creating survey IDs. These functions are used all along the process. The process_functions.R file contains functions for specific parts of the projects which are executed usually once along the whole process. Basically, the 00.master.R calls these functions in order in the same way that a master do-file calls other do-files that do specific things.

In this chapter, we break down the 00.master.R file, so you understand how to run it, the logic behind, and what to do in case it needs to be fixed. Let’s start by installing the minimum necessary packages. The 00.master.R file assumes you already have them installed, so you should run the code below before you start.

# pkg_load <- knitr::combine_words(pkgs, before = "`")
pkgs <- c("janitor", "data.table", "tidyverse", "writexl", "readxl", "here", "devtools")
no_installed <- pkgs[!(pkgs  %in%  installed.packages())]
installed.packages(no_installed)

11.3 Generate the .pcb files

11.3.1 Directories

The first section includes the directories in which you’re going to be working. As of today (2020-11-20), the datadir directory is for 2020_JUL as it was the last release of PovcalNet. However, make sure you create a new folder with the year and month of the “tentative” release.

datadir   <- "p:/01.PovcalNet/02.Production/2020_JUL/"
cpi_path  <- "p:/01.PovcalNet/01.Vintage_control/_aux/price_framework/price_framework.dta"
sheet     <- "SurveyMean"
mdir      <- "p:/01.PovcalNet/00.Master/"

11.3.2 Surveys that have changed

Now, we have to specify what countries/years have been added or changed to the PovcalNet repository. We recommend you do this country by country.

#--------- to modify in each round ---------
countries <- "CHN"
years     <- NULL

If you leave the argument years equal to NULL, the code will update all the years for the country select. In this case, CHN. However, you could specify what years to update for that particular country, like,

countries <- "IND"
years     <- c(1993, 2004, 2009, 2012)

It is important to note that unless you want to update all the years available in more than one country, you should not include in the countries variable more than one country.

11.3.3 Prepare metadata

In the next part we prepare the data. Function pcn_datafind finds the directory path and filenames of the corresponding countries in variable countries. It returns a list with two objects, fail and pcn. Object fail lets you know if there is any particular data that could not be loaded. Object pcn contains a data frame with the metadata of the countries and years selected above.

# Find countries
tmp       <- pcn_datafind(country = countries)
pcn_fails <- tmp$fail
pcn       <- as.data.table(tmp$pcn)

# Fix metadata
pcn       <- fix_metadata(pcn)

# Get reference year
pcn       <- get_ref_year(pcn, cpi_path)

Then pcn object is then passed to the fix_metadata function in which some columns like welfare type and survey coverage are added. Finally, function get_ref_year merges the price framework data from datalibweb to assign the right reference year to each survey. Keep in mind that the get_ref_year() function makes some hard-coded adjustments that could not be identified programmatically. That is, they follow ad-hoc rules to be included into (or excluded from) the final Povcalnet inventory. One of the most problematic rules is the correct assignment of the reference year to the survey-ID year. Check those those cases if the resulting reference year is incorrect.

11.3.4 Generate .pcb files

Now that the metadata is ready, we can create the .pcb file. This is done with this code,

# -------------------- Create PCB --------------------
replace_file <- TRUE
pcb_status <- 
  generate_pcb_files(df          = pcn, 
                    countries    = countries,
                    years        = years, 
                    replace_file = replace_file,
                    datadir      = datadir

The function generate_pcb_files takes the directory paths of the microdata in the pcn object, loads the microdata, and, inside the datadir directory, creates the .pcb file into the /01.pcb/ subdirectory and and .rds file (R readable) into the /02.rds/ sub-directory. The creation of the .rds is for convenience. It allows you to check the data in the .pcb in an easy way. The .pcb file, in contrast, is harder to read directly in R. Except for the file format, two files are identical.

One feature of the generate_pcb_files function is that you can add additional filters by country and year. By default, only the years objects defined above is being used until this point. In fact, you could create another object, say countries2, and parse it into the argument countries of the generate_pcb_files function.

Up to this point, the generation of the .pcb files is concluded, but the PovcalNet system requires two types of inputs, welfare data and the master file. We still need to update the “SurveyMean” sheet of the Master file.

11.4 Update the Master file

Updating the Master file is the most challenging part of the whole process because we need to make sure that,

  1. whatever is correct must remain correct.
  2. whatever is wrong should be fixed
  3. whatever is not necessary should be removed
  4. whatever is missing should be added
  5. whatever is duplicated must be unified.

Thus, we recommend that you run this sections one by one and check the results in between. This is specially important for countries with urban/rural coverage like China, India, or Indonesia; for countries with lagging reference years like EU-SILC countries, or for tricky countries like … (Macedonia?).

Updating the “SurveyMean” sheet

The first step is to extract some important metadata information from the .rds files generated in the previous step. This is done with the following code,

lf <- 
  update_master_info(df       = pcn, 
                    countries = countries,
                    years     = years)
st <- lf$s
df <- lf$df

table(st$status)
filter(st, status != "OK")

The object lf is a list with two objects, st and df. Object st is merely the status of each survey in the update_master_info process, whereas df is the actual metadata of the new data. Now, the following code loads the data in the most recent version of the Master file,

lmf     <- load_masterfile()
mf      <- lmf$data$SurveyMean

reg_ctry <- lmf$data$CountryList %>% 
  select(
    Region      = WBRegionCode, 
    CountryCode = CountryCode, 
    countryName = CountryName
  )

The function load_masterfile returns a list that is then bound to the name lmf (this function takes a while to run especially if you’re working remotely). The main object of lmf is another list, data, that contains a data frame per sheet. So, in the code above, you’re creating object mf with the “SurveyMean” and red_ctry, with the “CountryList” sheet. The reason why we load the whole master file is that, when we create a new version we want that version to include all the sheets, even those that were not modified. Function writexl::write_xls, which is the function that saves the new version of the master file, uses a list with all the sheets to save the file.

Now, we need to organize the new information into the “SurveyMean” format. This is done with function survey_mean_info. However, and this one of the tricky parts, we need to make sure that the five objectives above are met when we include the information of the new data. The survey_mean_info function handles all the generic cases, but there are some cases that need to be removed manually. This is why we have the object condition that goes directly as one of the arguments of survey_mean_info. If it is necessary to manage any special cases, this object should be set to an empty string, condition <- "". Let’s see some examples of real cases in which we had to use the object condition.

condition <- '!(countrycode == "IND" & year == 2012)'

In this case, we needed to remove the observation for India 2012, because even though the CPI_Time variable in the master file is 2012, the SurveyTime variable is 2011.5.

condition <- '!(countrycode == "CHN" & grepl("A$", module))'

Here we needed to remove all the observations of China for which the module finished in a letter A.

condition <- '!(countrycode == "CHN" & year >= 1990 & welfaretype3 == "y")'

Here, there was a problem in the metadata and we needed to remove all the observations for China after 1990 for which the welfare type was coded as income, when in reality it was consumption.

condition <- paste0('!(countrycode  %in% ', 
                    deparse(countries),
                    ' & year  %in%  ', 
                    deparse(years),')'
                    )

This final example is a general form in which we remove old observations for all the countries and years for which there is new data. This is very useful if we want to start a country from scratch.

Now, we just need to identify the data that remains unchanged in the “SurveyMean” sheet using the function unchanged_data() and then append together the new data, dfn, and the unchanged data. Finally, we update the Master file using the function update_master_file, which receives four arguments, lmf, the current version of the master file with all its sheets; vintage, the vintage control sheet which is loaded separately and it is useful only for institutional-memory purposes; new_mf, which is the new “SurveyMean” sheet with unchanged and new data; and mdir, which is the directory of the master file.

You should be fine if you execute all these steps one by one, while checking the intermediate outputs.