Basic unit-level models

The basic unit-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by $y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2)$ where $$j$$ runs from 1 to $$n$$, the number of unit-level observations, $$\beta$$ is a vector of regression coefficients for given covariates $$x_j$$, and $$v_i$$ are random area intercepts.

We use the api dataset included in packages survey.

library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname)) The apipop data frame contains the complete population whereas apisrs is a simple random sample from it. The variable cname is the county name, and we will be interested in estimation at the county level. Not all counties in the population are sampled. In order to be able to make predictions for out-of-sample areas we make sure that the levels of the sample’s cname variable match those of its population counterpart. The basic unit-level model with county random area effects is fit as follows library(mcmcsae) mod <- api00 ~ reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") + gen(factor = ~ iid(cname), name="v") sampler <- create_sampler(mod, data=apisrs) sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE) (summary(sim)) ## llh_ : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## llh_ -1090 4.47 -244 0.12 -1097 -1089 -1083 1400 1 ## ## sigma_ : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## sigma_ 56.2 3.09 18.2 0.0666 51.4 56.1 61.4 2161 1 ## ## beta : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## (Intercept) 802.649 25.161 31.90 0.49578 761.570 802.762 843.6561 2576 0.999 ## ell -1.957 0.307 -6.37 0.00580 -2.477 -1.959 -1.4546 2805 1.001 ## meals -1.966 0.284 -6.92 0.00542 -2.436 -1.967 -1.5031 2751 1.000 ## stypeH -106.923 13.391 -7.98 0.24736 -128.806 -106.825 -84.3755 2931 0.999 ## stypeM -60.304 11.562 -5.22 0.21109 -79.320 -60.454 -40.9672 3000 1.000 ## hsg -0.616 0.412 -1.49 0.00752 -1.290 -0.611 0.0538 3000 1.000 ## col.grad 0.581 0.508 1.14 0.01007 -0.234 0.576 1.4234 2542 0.999 ## grad.sch 2.124 0.481 4.42 0.00936 1.343 2.108 2.9256 2642 1.000 ## ## v_sigma : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## v_sigma 23.3 6.28 3.72 0.203 13.7 22.9 34.2 956 1 ## ## v : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## Alameda -25.8712 15.2 -1.69748 0.330 -51.7 -2.54e+01 -1.11 2131 1 ## Amador 0.0396 24.2 0.00164 0.450 -40.2 -5.30e-02 39.76 2890 1 ## Butte -0.4771 23.6 -0.02020 0.453 -39.0 -1.97e-01 37.69 2721 1 ## Calaveras 8.2061 21.9 0.37429 0.435 -26.5 7.59e+00 44.85 2538 1 ## Colusa 0.5383 24.5 0.02198 0.453 -40.4 8.97e-01 40.15 2918 1 ## Contra Costa -10.1490 15.3 -0.66427 0.279 -35.4 -9.50e+00 13.53 3000 1 ## Del Norte 0.9160 23.8 0.03841 0.435 -38.3 2.72e-01 40.80 3000 1 ## El Dorado 0.5042 24.3 0.02077 0.456 -38.3 8.19e-01 39.16 2836 1 ## Fresno 0.2572 15.5 0.01656 0.283 -25.7 4.45e-01 25.70 3000 1 ## Glenn -0.6469 23.9 -0.02711 0.436 -39.9 8.09e-04 37.51 3000 1 ## ... 47 elements suppressed ... We wish to estimate the area population means $\theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,,$ where $$U_i$$ is the set of units in area $$i$$ of size $$N_i$$. The MCMC output in variable sim can be used to obtain draws from the posterior distribution for $$\theta_i$$. The $$r$$th draw can be expressed as $\theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i} - n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i} \epsilon_{j;r} \right)\,,$ where $$\bar{y}_i$$ is the sample mean of $$y$$ in area $$i$$ and $$t_{x;i}$$ is a vector of population totals for area $$i$$. N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds)  # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N, show.progress=FALSE) (summ <- summary(res)) ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## Alameda 678 14.60 46.4 0.315 654 678 701 2144 1.000 ## Amador 739 31.81 23.2 0.581 686 738 791 3000 1.000 ## Butte 679 26.37 25.7 0.499 637 679 722 2791 0.999 ## Calaveras 744 27.43 27.1 0.520 700 743 790 2781 1.000 ## Colusa 553 31.97 17.3 0.590 499 553 605 2939 1.000 ## Contra Costa 727 14.53 50.0 0.266 702 727 750 2984 1.000 ## Del Norte 679 32.12 21.2 0.586 626 680 731 3000 1.000 ## El Dorado 754 26.78 28.2 0.496 712 754 797 2910 0.999 ## Fresno 595 15.18 39.2 0.277 570 595 620 3000 1.000 ## Glenn 625 31.84 19.6 0.581 572 625 677 3000 1.000 ## Humboldt 701 27.48 25.5 0.505 657 701 747 2960 1.001 ## Imperial 558 25.27 22.1 0.492 518 557 601 2638 0.999 ## Inyo 707 33.61 21.0 0.614 651 708 760 3000 1.000 ## Kern 596 15.16 39.3 0.315 571 596 621 2318 0.999 ## Kings 594 22.95 25.9 0.490 554 595 631 2191 1.001 ## Lake 663 25.31 26.2 0.462 622 663 704 3000 0.999 ## Lassen 706 26.55 26.6 0.485 661 707 750 3000 1.001 ## Los Angeles 636 8.21 77.4 0.150 622 636 649 3000 0.999 ## Madera 615 20.21 30.4 0.369 583 615 649 3000 0.999 ## Marin 817 20.36 40.1 0.380 785 817 852 2869 1.000 ## Mariposa 723 35.66 20.3 0.667 665 723 781 2859 0.999 ## Mendocino 662 27.57 24.0 0.513 618 662 708 2888 1.001 ## Merced 579 23.81 24.3 0.435 541 579 619 3000 1.000 ## Modoc 672 29.46 22.8 0.597 624 671 721 2433 1.000 ## Mono 704 40.57 17.3 0.769 636 704 770 2782 1.000 ## Monterey 629 18.26 34.4 0.364 599 629 659 2515 1.000 ## Napa 702 21.29 33.0 0.395 668 703 737 2909 0.999 ## Nevada 799 29.83 26.8 0.545 749 799 847 3000 0.999 ## Orange 693 15.12 45.8 0.300 669 693 718 2543 1.000 ## Placer 772 23.99 32.2 0.459 736 772 812 2726 0.999 ## Plumas 690 32.38 21.3 0.615 638 690 744 2772 1.000 ## Riverside 636 14.24 44.6 0.274 612 636 659 2697 1.001 ## Sacramento 669 15.38 43.5 0.281 644 669 695 3000 1.000 ## San Benito 710 30.37 23.4 0.572 659 710 761 2816 1.000 ## San Bernardino 647 13.10 49.4 0.239 625 647 668 3000 1.000 ## San Diego 705 15.07 46.8 0.321 680 705 730 2197 1.000 ## San Francisco 638 19.73 32.4 0.360 606 638 672 3000 0.999 ## San Joaquin 630 17.56 35.9 0.376 600 630 658 2177 1.000 ## San Luis Obispo 753 23.99 31.4 0.438 714 753 792 3000 1.000 ## San Mateo 735 21.63 34.0 0.407 700 735 772 2821 1.001 ## Santa Barbara 679 19.62 34.6 0.358 646 679 711 3000 1.002 ## Santa Clara 733 15.79 46.4 0.295 706 733 758 2859 1.000 ## Santa Cruz 680 20.05 33.9 0.416 646 680 712 2328 0.999 ## Shasta 696 22.19 31.4 0.418 661 695 734 2821 1.002 ## Sierra 708 42.19 16.8 0.794 639 708 777 2826 1.000 ## Siskiyou 698 25.45 27.4 0.470 656 697 739 2938 0.999 ## Solano 712 20.14 35.3 0.374 679 712 745 2900 1.000 ## Sonoma 725 22.45 32.3 0.415 687 725 760 2924 1.000 ## Stanislaus 661 19.82 33.4 0.374 628 661 694 2812 1.000 ## Sutter 653 24.53 26.6 0.455 612 653 692 2900 1.001 ## Tehama 660 27.93 23.6 0.554 614 660 705 2546 1.000 ## Trinity 650 38.22 17.0 0.698 587 649 714 3000 0.999 ## Tulare 583 21.75 26.8 0.402 546 583 618 2927 1.001 ## Tuolumne 719 30.82 23.3 0.569 670 720 769 2930 1.001 ## Ventura 709 17.44 40.7 0.332 681 709 739 2752 1.000 ## Yolo 687 24.24 28.3 0.443 647 687 726 2998 1.000 ## Yuba 627 28.68 21.9 0.528 581 627 674 2951 1.000 theta <- c(tapply(apipop$api00, apipop$cname, mean)) # true population quantities plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30) Binomial Unit-Level Model A model with binomial likelihood can also be fit. We now model the target variable sch.wide, a binary variable indicating whether a school-wide growth target has been met. We use the same mean model structure as above for the linear model, but now using a logistic link function, $y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\ \mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2)$ apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes") sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs) sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE) summary(sim) ## llh_ : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## llh_ -83.3 2.5 -33.3 0.0616 -87.7 -83.2 -79.6 1651 1 ## ## beta : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## (Intercept) 4.56473 1.4712 3.1028 0.053839 2.2341 4.5078 7.11085 747 1.002 ## ell -0.03652 0.0148 -2.4708 0.000397 -0.0615 -0.0362 -0.01266 1389 1.000 ## meals 0.00124 0.0139 0.0889 0.000360 -0.0217 0.0013 0.02444 1495 1.001 ## stypeH -2.83181 0.6298 -4.4966 0.018468 -3.8958 -2.8147 -1.80091 1163 1.000 ## stypeM -1.66401 0.5866 -2.8368 0.018267 -2.6488 -1.6688 -0.69783 1031 1.000 ## hsg -0.02638 0.0222 -1.1890 0.000655 -0.0630 -0.0261 0.01058 1147 1.001 ## col.grad -0.03601 0.0267 -1.3478 0.000850 -0.0783 -0.0362 0.00844 988 1.000 ## grad.sch 0.01267 0.0327 0.3868 0.001280 -0.0400 0.0118 0.06877 655 0.999 ## ## v_sigma : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## v_sigma 0.372 0.267 1.39 0.0117 0.0336 0.325 0.868 525 1.01 ## ## v : ## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat ## Alameda -0.19361 0.428 -0.45204 0.01245 -1.041 -0.082856 0.331 1184 1.001 ## Amador 0.01073 0.456 0.02353 0.00835 -0.704 0.002050 0.749 2983 0.999 ## Butte 0.00163 0.451 0.00362 0.00824 -0.697 0.004466 0.692 2996 1.000 ## Calaveras 0.04540 0.454 0.10000 0.00978 -0.613 0.004618 0.818 2157 1.001 ## Colusa -0.00671 0.455 -0.01474 0.00831 -0.729 -0.004081 0.711 3000 0.999 ## Contra Costa -0.00822 0.396 -0.02073 0.00873 -0.666 0.000834 0.620 2061 1.000 ## Del Norte -0.00788 0.471 -0.01674 0.00884 -0.761 -0.002417 0.712 2839 1.000 ## El Dorado -0.00810 0.454 -0.01785 0.00933 -0.755 0.000714 0.707 2368 1.000 ## Fresno -0.04653 0.365 -0.12749 0.00714 -0.704 -0.015266 0.517 2612 1.000 ## Glenn 0.01084 0.464 0.02336 0.00847 -0.687 0.002663 0.745 3000 1.000 ## ... 47 elements suppressed ... To predict the population fractions of schools that meet the growth target by county, samplesums <- tapply(apisrs$target.met, apisrs$cname, sum) samplesums[is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas res <- predict(sim, newdata=apipop, labels=names(N), fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
show.progress=FALSE)
(summ <- summary(res))
##                  Mean     SD t-value     MCSE q0.05  q0.5 q0.95 n_eff R_hat
## Alameda         0.784 0.0603   13.00 0.001665 0.674 0.796 0.864  1312 1.002
## Amador          0.845 0.1156    7.32 0.002174 0.600 0.900 1.000  2826 0.999
## Butte           0.838 0.0745   11.25 0.001509 0.708 0.854 0.938  2440 1.000
## Calaveras       0.878 0.0928    9.46 0.001834 0.700 0.900 1.000  2562 1.000
## Colusa          0.643 0.1598    4.02 0.003052 0.333 0.667 0.889  2741 1.000
## Contra Costa    0.816 0.0572   14.26 0.001321 0.715 0.821 0.899  1876 1.001
## Del Norte       0.881 0.1080    8.16 0.001972 0.750 0.875 1.000  3000 1.000
## El Dorado       0.850 0.0744   11.42 0.001713 0.725 0.850 0.950  1886 1.000
## Fresno          0.782 0.0592   13.20 0.001232 0.677 0.785 0.871  2314 0.999
## Glenn           0.798 0.1388    5.75 0.002569 0.556 0.778 1.000  2919 1.000
## Humboldt        0.854 0.0710   12.03 0.001489 0.725 0.850 0.950  2275 1.000
## Imperial        0.698 0.1045    6.68 0.002128 0.525 0.700 0.850  2412 1.000
## Inyo            0.777 0.1541    5.04 0.003090 0.571 0.857 1.000  2488 1.000
## Kern            0.806 0.0522   15.44 0.001136 0.711 0.811 0.878  2109 1.000
## Kings           0.851 0.0803   10.60 0.001618 0.720 0.840 0.960  2460 1.000
## Lake            0.828 0.0953    8.69 0.001822 0.682 0.818 0.955  2739 1.000
## Lassen          0.836 0.1105    7.56 0.002173 0.636 0.818 1.000  2586 0.999
## Los Angeles     0.799 0.0427   18.70 0.001183 0.731 0.799 0.872  1306 1.002
## Madera          0.816 0.0757   10.78 0.001520 0.677 0.839 0.935  2482 1.001
## Marin           0.839 0.0688   12.21 0.001600 0.720 0.840 0.940  1847 1.000
## Mariposa        0.863 0.1473    5.86 0.002882 0.600 0.800 1.000  2611 1.000
## Mendocino       0.817 0.0898    9.10 0.001709 0.680 0.840 0.960  2760 1.000
## Merced          0.791 0.0790   10.02 0.001553 0.651 0.794 0.905  2590 1.001
## Modoc           0.836 0.1569    5.33 0.002864 0.600 0.800 1.000  3000 0.999
## Mono            0.752 0.2432    3.09 0.004440 0.333 0.667 1.000  3000 1.000
## Monterey        0.802 0.0688   11.66 0.001543 0.687 0.807 0.916  1989 1.000
## Napa            0.850 0.0787   10.80 0.001588 0.704 0.852 0.963  2457 1.001
## Nevada          0.871 0.0940    9.26 0.001848 0.714 0.857 1.000  2588 1.000
## Orange          0.774 0.0611   12.67 0.001340 0.670 0.778 0.871  2078 1.000
## Placer          0.817 0.0740   11.05 0.001755 0.682 0.833 0.924  1777 1.002
## Plumas          0.743 0.1509    4.93 0.002975 0.444 0.778 1.000  2573 1.000
## Riverside       0.824 0.0533   15.47 0.001230 0.722 0.831 0.895  1875 0.999
## Sacramento      0.848 0.0484   17.53 0.001240 0.764 0.847 0.927  1521 1.003
## San Benito      0.825 0.1227    6.73 0.002539 0.636 0.818 1.000  2335 1.003
## San Bernardino  0.861 0.0424   20.33 0.001116 0.790 0.862 0.928  1441 1.000
## San Diego       0.849 0.0426   19.91 0.000968 0.778 0.852 0.913  1942 1.001
## San Francisco   0.761 0.0752   10.12 0.001544 0.630 0.770 0.870  2372 1.000
## San Joaquin     0.829 0.0580   14.29 0.001271 0.726 0.839 0.911  2085 0.999
## San Luis Obispo 0.856 0.0699   12.25 0.001505 0.749 0.850 0.950  2158 1.002
## San Mateo       0.801 0.0731   10.96 0.001744 0.678 0.811 0.895  1756 1.002
## Santa Barbara   0.830 0.0641   12.95 0.001326 0.728 0.840 0.926  2337 1.002
## Santa Clara     0.785 0.0685   11.45 0.001975 0.652 0.796 0.875  1203 1.002
## Santa Cruz      0.789 0.0744   10.61 0.001521 0.653 0.796 0.898  2393 1.001
## Shasta          0.817 0.0765   10.68 0.001579 0.675 0.825 0.925  2347 1.000
## Sierra          0.733 0.2346    3.12 0.004376 0.333 0.667 1.000  2875 1.000
## Siskiyou        0.846 0.0977    8.65 0.001925 0.667 0.867 1.000  2577 1.001
## Solano          0.865 0.0634   13.64 0.001670 0.750 0.867 0.967  1442 1.001
## Sonoma          0.846 0.0622   13.60 0.001276 0.741 0.852 0.935  2379 1.001
## Stanislaus      0.812 0.0723   11.24 0.001798 0.681 0.824 0.901  1614 1.000
## Sutter          0.832 0.0919    9.05 0.001805 0.650 0.850 0.950  2590 1.000
## Tehama          0.842 0.0992    8.49 0.001840 0.647 0.882 1.000  2905 1.000
## Trinity         0.759 0.2016    3.76 0.003843 0.500 0.750 1.000  2752 1.000
## Tulare          0.831 0.0633   13.12 0.001422 0.718 0.836 0.927  1985 0.999
## Tuolumne        0.883 0.0932    9.48 0.001790 0.750 0.917 1.000  2709 1.000
## Ventura         0.838 0.0511   16.38 0.001071 0.745 0.845 0.913  2281 1.000
## Yolo            0.850 0.0752   11.30 0.001456 0.714 0.857 0.971  2670 1.000
## Yuba            0.794 0.1019    7.79 0.001984 0.632 0.789 0.947  2640 1.000
theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

References

Battese, G. E., R. M. Harter, and W. A. Fuller. 1988. “An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data.” Journal of the American Statistical Association 83 (401): 28–36.
Rao, J. N. K., and I. Molina. 2015. Small Area Estimation. John Wiley & Sons.