Pay for performance for prenatal care and newborn health: Evidence from a developing country

Abstract Empirical literature analyzing the effect of pay-for-performance programs (P4P) for healthcare providers on maternal care and newborn health outcomes is scarce. In 2008, Uruguay’s Ministry of Public Health implemented a P4P called Metas Asistenciales (Healthcare Goals), a country-wide program that grants healthcare providers an economic incentive for complying with certain maternal and newborn healthcare goals. Health organizations use these funds to provide maternal and child health services. Using administrative records and a difference-in-difference methodology, we evaluate the effect of the Metas Asistenciales program on maternal and newborn health outcomes. We find that in the institutions affected by the program, the number of women receiving an adequate number of prenatal controls increased by 10 percentage points and pregnancy detection in the first trimester improved by 4.5 percentage points. We also found better results among newborns for indicators related to birth weight, premature births, and stillbirths. In sum, the program had a positive, significant impact on the rate of pregnant women’s utilization of health services and on newborn health outcomes. This study thus provides evidence supporting the idea that economic incentives are a promising tool for incentivizing healthcare providers to achieve better health services in developing countries.


Introduction
Investment in childhood development is a fundamental factor in achieving better outcomes in individual lives. Evidence suggests that early childhood interventions are more cost effective and achieve better returns than interventions at later stages in life (Heckman, 1995, 2000). Specifically, newborn health and the reproductive behavior of mothers are essential variables that explain children's and adults' well-being (Heckman, 2007; Di Cesare and Sabates, 2017; Cunha et al., 2008; Heckman et al., 2013). Poor health conditions among newborns are associated with weaker cognitive skills, health problems, poor educational performance, and lower income later in life (Boardman et al., 2002; Behrman and Rosenzweig, 2004; Black et al., 2007; Currie and Moretti, 2007; Royer, 2009). Health deficiencies among newborns lead to large healthcare burdens throughout life, which translate to higher costs for society. Health programs can generate direct improvements in pregnant women's health and can have an important positive effect on the health of newborns (Gertler, 2014). Guaranteeing universal access to quality maternal and child health services has thus become a fundamental tool for countries worldwide.
The costs and benefits associated with improving maternal and newborn health in developing countries have led governments to try a range of policies to achieve better outcomes. Pay-for-performance (P4P) schemes are one such attempt. Governments in developed countries have introduced them as incentives to increase the quality of healthcare service. The main rationale behind P4P is that allocating resources to the health sector without any incentive mechanism to improve performance will not result in better health outcomes (Toonen et al. 2009). The prospect of inducing healthcare providers to improve thus makes these types of programs an attractive tool for governments.
Several studies have shown that inducing healthcare providers to adopt new practices depends on a variety of determinants, including the characteristics of the institutions, their management, the costs of adopting new practices, and psychological factors (Berwick, 2003; Cabana et al., 1999). Celhay et al. (2019) discuss the role of incentives resulting from P4P programs. They state that institutions that expect a new practice to be beneficial in the future are more willing to use the financial incentives to cover the cost of implementing it, and vice versa. In terms of the principal-agent model, P4P designs offer incentives for achieving results and thus reduce the need to monitor the agents' behavior (Eldrige and Palmer, 2008).
Creating financial incentives for improving maternal and newborn health leads healthcare institutions to develop strategies to encourage maternal care. The literature on the topic has shown that the introduction of different incentive schemes rewarding performance has had positive effects. Celhay et al. (2019) found that a P4P implemented in Argentina (Plan Nacer), which aimed to increase early initiation of care during pregnancy, increased the number of women who came for checkups during the first trimester of pregnancy and encouraged healthcare institutions to adopt new strategies to attract the targeted population. They also found that the program had no effect on newborn health. However, increased checkups for mothers persisted for at least 24 months after the financial incentives were removed. Messen et al. (2007) highlight the important role of P4P systems in improving the poor service provided by healthcare facilities in Rwanda. Olken et al. (2013) find that a performance-based financing system provided by the local government in Indonesia had positive effects on its intended outcomes, which were to improve health services and achieve better educational results. Regalia and Castro (2007) analyze a program in Nicaragua that grants an incentive to the managers of healthcare institutions to promote neonatal health services and makes conditional cash transfers to the children who are beneficiaries of the program; they find increased use of health services and higher vaccination rates. Experimental evidence for Rwanda shows that public healthcare providers reacted positively to a P4P program created to improve the quality of the services provided for neonatal and infant health (Basinga et al., 2011; Gertler and Vermeersch, 2013; Walque, 2013).
Although the evidence in these particular cases is strong, evidence for developing countries remains scarce and is mostly limited to small-scale experiments (Gertler, 2014; Gertler, 2015; Hullieri and Seban, 2014).
This article seeks to analyze the effects of a larger, national P4P scheme called Metas Asistenciales, which was implemented in the health sector in Uruguay in 2008. The program's main objective was to improve the quality and accessibility of the health services provided to mothers and children. The government offered an economic incentive to healthcare providers who complied with certain care goals. This program is similar to Plan Nacer in Argentina, but the latter was for uninsured pregnant women and children under six. To the best of our knowledge, the program is the first countrywide P4P to be introduced in the health sector in a developing country.
Even though we do not observe experimental variation in the introduction of the P4P scheme, its design allows us to estimate the causal effects and the lower bounds for its effects due to certain administrative peculiarities of the intervention that we explain later in the article. The identification strategy exploits the gap in the implementation of the P4P program within a range of public and private healthcare providers and the fact that within the private sector there was a disparity in the degree of adherence.
We compare different health indicators among treated and untreated mothers and newborns, before and after implementation, using data from the Perinatal Information System (SIP), a rich administrative dataset that accounts for all births in the country between 2002 and 2013. Our data allow us to control for mothers' characteristics, institutional characteristics, regional features, and any other fixed differences between treated and untreated groups.
We explore the channels that can affect newborn health outcomes and suggest that increasing the number of prenatal checkups and improving early detection of pregnancy and gestational age are fundamental factors that explain the changes in newborns' indicators. An extensive body of medical and economics literature identifies early detection of pregnancy and maternal care as key determinants of birth outcomes. Early detection and prenatal care make it possible to identify diverse risk factors (e.g., diseases, infections, and anemia) that may affect gestational length, intrauterine life conditions, and birth weight (Carroli et al., 2001; Evans and Lien, 2005; Kramer, 1996; Rosenzweig and Schultz, 1983; Slattery and Morrison, 2002). Furthermore, according to Almond et al. (2008), other channels could generate negative effects on measured newborn health. On the one hand, if improvements in intrauterine conditions lead to fewer fetal deaths, there could be a composition effect that lowers average birth weight, because marginal fetuses that would otherwise not have survived are now born alive. On the other hand, if the program affects the fertility rates of women from more disadvantaged backgrounds, that could imply a negative impact on neonatal health.
Overall, our results show that Uruguay's P4P program increased, on average, the number of prenatal checkups and the number of times pregnancy was detected early.
We also observe modest improvements in newborn health outcomes, particularly in birth weight and neonatal mortality. We observe a 1.2 percentage point reduction in the probability of low birth weight, a 0.36 percentage point reduction in premature births, and a reduction in the stillbirth rate of 2.4 per 1,000 live births. The average weight of newborns rose by 10.92 grams. The observed effects are smaller than those reported in other papers (see Amarante et al., 2016 and Harris et al., 2015).
The article contributes to the small amount of research that has so far been published on P4Ps and their effects in developing countries. To produce our results, we use a rich source of information that includes data on the institutions affected by the program and information on the pregnant women and their children. The use of a large nationwide administrative dataset allows us to estimate the causal effect of the program for nearly all births nationwide.
The article is organized as follows: section 2 provides a brief description of the Uruguayan health system and the P4P program. Section 3 describes the methodology and the data. Section 4 presents the results as well as robustness checks. Finally, sections 5 and 6 discuss the results and conclude.

The health system and the P4P scheme in Uruguay
Uruguay is a small South American country with a population of 3.4 million. In the early 2000s, its poverty level was over 30% and its inequality level measured by the Gini Index was 0.471, according to data provided by the National Institute of Statistics. The healthcare reform implemented in 2008 created the Integrated National Health System (SNIS). Under the principle of universal care, the SNIS sought to guarantee coverage to the entire population and to improve access to quality health services. Hence, all public and private providers became part of Uruguay's Health System. This system was financed through the contributions of workers in the formal sector and extended insurance access to the spouses of formal-sector workers, their children under 18 years old, and retired workers (Bergolo and Cruces, 2011). Funds were transferred to the National Health Fund (FONASA) and, from there, the healthcare institutions where the users registered received a per capita variable payment if they met certain requirements.
In this way, the funds financed a minimum set of integral benefits exhaustively defined by the Ministry of Public Health (MSP, 2010).
The implementation of the SNIS generated an overall increase in healthcare coverage.
Today, close to 100% of Uruguayans are registered with one of the comprehensive healthcare providers, either through social security, individual membership in the private sector, or through the public sector. Uruguay's health insurance system is thus now divided into these three sectors (public, private, and health insurance companies). Table 1 shows the relative percentages of affiliation with each insurance sector, including people registered through the social security system along with the rest of the users. The 2008 reform also generated a change in Uruguay's model of care by seeking to strengthen the country's Primary Healthcare (APS) strategy. In more recent years, new benefits have been added and coverage has been extended to mental health, sexual and reproductive health, and oral health services. To reinforce the model-of-care reform and improve service quality, the state implemented a series of health policies, the P4P program being the most important. P4P grants monetary transfers to healthcare institutions when they comply with prioritized services for specific sub-populations of interest. Among the sub-populations prioritized are mothers and newborns, young adolescents, and the elderly (MSP, 2010).

The pay-for-performance program and maternal and neonatal health services in Uruguay
Both the public and private sectors provide maternal and child healthcare coverage.
Since 2008, prenatal care has been provided to patients for free. The program's novelty lies in its P4P strategy, which grants healthcare providers funds if they comply with specific goals determined by the Ministry of Public Health (MSP). The prioritized services, which were established in a series of guidelines, seek to improve coverage and health outcomes (National Health Board - JUNASA, 2007). The maternal care goals that healthcare providers have had to meet are: increasing the number of prenatal visits (establishing a minimum of six prenatal care visits per pregnant woman), increasing the number of pregnancies detected during their first trimester (early detection of pregnancy), improving perinatal medical history records, increasing syphilis and HIV testing during the first and third trimesters, and improving dental care (MSP, 2010).
Initially, the program was going to be implemented across all public and private care providers at once, but it ended up targeting only private healthcare providers. This difference in implementation stemmed primarily from administrative rules under which public sector institutions would receive a reduced budget credit if they received extra-budgetary resources. The public sector was thus unable to benefit from the economic incentive established by this P4P (Law 18.211). This created a gap in implementation that serves as a source of exogenous variation. Furthermore, within the private sector there was a disparity in the degree of adherence, which we also use as a source of identification.
The financing of the program is structured as follows. For each SNIS member registered with a given healthcare provider, USD 2.50 is allocated to that institution to the extent to which it complies with the established goals. This represented approximately 6% of FONASA's health quota expenditures (MSP, 2010). The amount of the incentives given to the institutions thus varies with the number of affiliated people registered with them through the social security system. The program seeks to use this mechanism to improve health outcomes for mothers and newborns nationwide and to improve the quality of the services provided.
The Ministry of Public Health measured compliance with the goals using healthcare providers' reports. The threshold was determined by the Ministry in agreement with the healthcare providers that are part of the SNIS. Performance pay is divided equally between the various objectives to be met. An institution would need to comply with 70% of the established components to meet the Ministry-defined goals and thus receive the full payment; otherwise, a proportional payment would be transferred in accordance with the proportion of the goal achieved. The program randomly audits healthcare providers to verify the information they report (MSP, 2010).
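The payment rule just described can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not the Ministry's actual formula: the function name and parameters are hypothetical, the period to which the USD 2.50 applies is not stated in the text, and "proportional payment" is read here as the maximum payment multiplied by the share of components met.

```python
def p4p_payment(members: int, share_met: float,
                rate_usd: float = 2.50,
                full_threshold: float = 0.70) -> float:
    """Incentive paid to one provider under an assumed reading of the rule.

    members        -- SNIS members registered with the provider
    share_met      -- share of the established components complied with (0 to 1)
    rate_usd       -- amount per member (USD 2.50 in the text)
    full_threshold -- share needed for the full payment (70% in the text)
    """
    if members < 0:
        raise ValueError("members must be non-negative")
    if not 0.0 <= share_met <= 1.0:
        raise ValueError("share_met must be between 0 and 1")

    max_payment = members * rate_usd
    if share_met >= full_threshold:
        # Threshold reached: the provider receives the full amount.
        return max_payment
    # Assumption: below the threshold, payment scales with the share achieved.
    return max_payment * share_met
```

Under this reading, a provider with 1,000 registered members receives the full amount once it meets 70% of the components, and half of that amount if it meets 50% of them. The absolute incentive grows with the number of registered members, which is the feature the text links to differences in providers' responses.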
An additional issue to consider is that the economic incentive is temporary, with an average duration of two years. After that period, some components of the P4P scheme change and the incentives are reallocated to other areas in which improvements are sought. Table 2 shows the different components that have been the target of financial incentives through the years. This incentive structure is consistent with the economic literature on P4P schemes, which seek to induce beneficiaries to adopt practices that may be costly to implement but that are economical in the long term (Celhay et al., 2019; Gertler, 2014). The program has produced a large institutional adaptation in terms of the availability of human resources, the incorporation of certain practices that did not exist until that moment, and institutional reorganization. It is important to clarify that these incentives can be used to finance the expenses of implementing the program, but cannot be used to finance salaries (MSP, 2010).
[...] hours. Other strategies implemented by the providers were the diffusion of maternal health services, advertisement campaigns, and telephone calls to recruit pregnant women, especially for the early detection of pregnancy. The organizations' managers also monitored the execution of the guidelines more intensely. Not all the institutions reacted in the same way to the incentives. Institutions with a high number of users registered through the social security system, for whom the economic incentive in absolute terms is greater, were more likely to adopt new strategies to fulfill the goals established by the program (MSP, 2012). In sum, the factors that might have led to changes in the analyzed outcomes included (1) the amount of the incentives offered to the healthcare providers, (2) the use of the incentives to pay for the costs of the program (human resources, advertisement campaigns, telephone calls, etc.), and (3) stricter monitoring of the provider organizations.

Methodology and data
Our analysis is focused on all births in both public and private sectors, nationwide. We attempt to identify the causal effect of Uruguay's P4P on children's health outcomes by exploiting the difference in the implementation of the P4P in the public and private sectors. Due to the heterogeneity of the individuals treated in both sectors, we engage in a comprehensive exploratory analysis to balance the groups in terms of the observable variables of interest, during the pre-implementation period.

Data Sources
The main source of information is the aforementioned SIP, which provides individualized information on births and data on mothers, as reported by health centers.
We also used data from the National System of Information (SINADI) and System of Control and Analysis of Human Resources (SCARH) from the Ministry of Public Health to obtain information on the health centers. Finally, we used data provided by the Ministry's Division of Health Economics to measure the degree to which each health center adhered to policy guidelines. Data on prenatal care during pregnancy and newborn health outcomes is extracted from the SIP. When compared with birth certificates, this covers, on average, 80% of all births nationwide. Healthcare providers record several key indicators at the time of birth. Our analysis was performed for the period from 2002 to 2013, in order to include several years before and after the implementation of the policy. We are able to link individual information to each health center. We identified the nature of the sector (public or private), the province where the mother received prenatal services, and where she gave birth.
Several variables are used to measure maternal outcomes and newborn health in our analysis. For prenatal care, we look at whether mothers met the standards set by the Ministry of Public Health or the World Health Organization (WHO) (6 and 9 prenatal checkups during pregnancy, respectively), and whether pregnant women were seen by their healthcare provider during the first trimester of pregnancy.
Among newborn health outcomes, we include birth weight, number of weeks of gestation, and stillbirths. To account for birth weight, we used the percentage of children born with low birth weight (defined as weighing less than 2,500 grams) and the average birth weight. The variables related to premature birth include the number of weeks of gestation and the percentage of premature births (a birth is considered premature when it occurs before 37 weeks of gestation). Finally, regarding fetal mortality, the number of stillbirths per 1,000 live births is used.
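The outcome definitions above translate directly into indicator variables. The following is a minimal sketch of that construction; the record layout and field names are hypothetical (they are not the SIP's actual variable names), the checkup thresholds are read as "at least 6" and "at least 9", and the week used to close the first trimester is an assumption, since the text does not state it.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Birth:
    """One birth record (hypothetical layout, not the SIP schema)."""
    weight_g: float         # birth weight in grams
    gestation_weeks: int    # weeks of gestation at birth
    prenatal_visits: int    # number of prenatal checkups
    first_visit_week: int   # gestational week of the first checkup
    stillbirth: bool        # True if the child was stillborn


def low_birth_weight(b: Birth) -> bool:
    return b.weight_g < 2500            # less than 2,500 grams


def premature(b: Birth) -> bool:
    return b.gestation_weeks < 37       # fewer than 37 weeks of gestation


def adequate_msp(b: Birth) -> bool:
    return b.prenatal_visits >= 6       # Ministry of Public Health standard


def adequate_who(b: Birth) -> bool:
    return b.prenatal_visits >= 9       # WHO standard as cited in the text


def early_detection(b: Birth, first_trimester_end_week: int = 13) -> bool:
    # Assumption: the first trimester ends at week 13.
    return b.first_visit_week <= first_trimester_end_week


def stillbirths_per_1000_live_births(births: Iterable[Birth]) -> float:
    births = list(births)
    stillborn = sum(b.stillbirth for b in births)
    live = len(births) - stillborn
    if live == 0:
        raise ValueError("no live births in the sample")
    return 1000 * stillborn / live
```

Keeping each indicator as a separate function makes it easy to swap a threshold (for example, a different first-trimester cutoff) without touching the rest of the construction.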
It is important to note that many policy reforms were implemented in Uruguay at the same time as this one, and previous impact evaluations have shown that they affected newborn outcomes. For example, Amarante et al. (2016) estimate, based on program and social security administrative microdata matched to longitudinal vital statistics, that participation in the PANES program led to a sizable reduction in the incidence of low birth weight in the same period. Along the same lines, Harris et al. (2015) consider the impact of different types of tobacco control policies that were implemented in the same period and find that quitting smoking during pregnancy increased birth weight significantly. To account for the possible effects of programs run simultaneously with the P4P, we include several control variables in our model. First, we incorporated the percentage of households receiving the PANES and the CCT program by health sector (private and public), by province, and by year. Second, we included a dummy that reflects whether the pregnant woman was a smoker. Table 2 displays the details about the outcome and control variables used in each specification.

Identification strategy
As mentioned above, according to the literature, prenatal care is closely related to newborn health outcomes. Mothers with fewer checkups during pregnancy are more likely to have premature babies, and their children are more likely to experience intrauterine growth delays and low birth weight. Nevertheless, a degree of endogeneity in this relationship is caused by omitted variables or selection bias: mothers with certain socioeconomic characteristics have a higher (lower) propensity to access and use health services, which affects the health of their newborns. This means that the ordinary least squares estimator is not consistent.
To address this problem, we used an identification strategy that relies on a source of exogenous variation in the use of health services by pregnant mothers. As explained above, we used the Ministry of Public Health's P4P program, which targeted mothers and children in 2008, as a source of plausible exogenous variation in prenatal care.
Specifically, there are two sources of variation in the implementation of the P4P. On the one hand, there was a difference between the policy's implementation in the private and public health sectors, which was caused by issues unrelated to the outcome variables of interest. At first, the P4P was supposed to start nationwide, but there were administrative problems in the public sector, which established that the public sector would receive a reduced budget credit if they received extra-budgetary resources. Thus, the public sector could not benefit from the economic incentive established by the P4P (Law 18.211). In order to extend the P4P to the public sector, it is necessary to change existing laws. This created a gap in implementation that serves as a source of exogenous variation. On the other hand, within the private sector, there was a disparity in the degree of policy adherence, wherein some institutions reached the thresholds set by the Ministry of Public Health faster, and others did not. We also used this second source of variation to evaluate the effect of the treatment.
To sum up, the program design and its implementation created several degrees of exposure to treatment among treated units (health facilities belonging to SNIS). We exploited the heterogeneity observed in implementation for our main analysis and for the robustness controls. The estimations we performed allow us to recover causal estimates.
Counterintuitively, the first group affected were low-risk mothers. Since the program affected maternal care directly, and this variable could impact newborn health indicators, we estimated a difference-in-difference (DiD) model to evaluate maternal and child outcomes.
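The logic of the DiD comparison can be sketched in its simplest two-group, two-period form. This is an illustration only: it computes the raw difference of differences in group means and omits the covariates and fixed effects used in the actual estimations. The function name and the input layout are hypothetical.

```python
from statistics import mean
from typing import Iterable, Tuple


def did_estimate(rows: Iterable[Tuple[bool, bool, float]]) -> float:
    """Raw 2x2 difference-in-difference of mean outcomes.

    Each row is (treated, post, outcome), where `treated` marks the
    treatment group (e.g. private sector) and `post` marks observations
    after the policy was implemented.
    """
    cells = {(t, p): [] for t in (False, True) for p in (False, True)}
    for treated, post, outcome in rows:
        cells[(bool(treated), bool(post))].append(outcome)

    if any(len(values) == 0 for values in cells.values()):
        raise ValueError("every group-period cell needs at least one observation")

    m = {key: mean(values) for key, values in cells.items()}
    change_treated = m[(True, True)] - m[(True, False)]
    change_control = m[(False, True)] - m[(False, False)]
    # The control group's change proxies for what would have happened
    # to the treated group without the policy (parallel trends).
    return change_treated - change_control
```

With a binary outcome such as "received an adequate number of checkups", the returned value is a difference in proportions and can be read in percentage points after multiplying by 100.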
For the first analysis, since utilization by private and public sectors was very different, we performed an exploratory analysis to balance the groups before engaging in further empirical evaluation. Following a series of tests, we excluded teenage mothers (those under 18 years of age) from our analysis, for several reasons. First, they are the highest risk group of mothers among pregnant women. Second, they are predominantly in the public sector. Third, one of the main features of the 2008 healthcare reform was the inclusion of children (under 18 years of age) of formal-sector workers into the health system. The latter would affect our identification strategy by way of a compositional effect generated by a public-private crowding out.
Before the policy implementation, 74% of teenage mothers were in the public health sector, while this number went down to 65% after the policy was implemented. Table 5 displays the percentage of teenage mothers in the public and private health sectors before and after the policy. Tables 6 and 7 show the average of the covariates for the treatment and control groups before and after the policy, as well as the results of the DiD estimator for each of these variables, first including all mothers and then excluding teenage mothers. These results reveal the existence of a composition effect when teenage mothers are included. Table 6 shows that, when all mothers are considered, there is a change in the groups' composition before and after the policy. When teenage mothers are excluded, the effect vanishes, with no significant differences in most of the covariates between the groups for the pre- and post-policy periods. It is worth pointing out that excluding teenage mothers, one of the groups most at risk during pregnancy, may result in an underestimation of the possible effects of the policy (Williamson, 2013). Given this possibility, it is important to be careful when extrapolating our results to all pregnant women and births: the excluded population has specific characteristics, and a policy like this P4P could have differential effects on it.
We thus pursue two different strategies. In our first strategy, our treatment group consists of pregnant women in the private health sector, without teenage mothers, while our comparison group is made up of pregnant women in the public sector, also excluding teenage mothers. In our second strategy, we exploit the variability in the degree to which the private health sector was exposed to the policy. Because institutions may have adhered to the treatment according to the mothers' characteristics, which creates an endogeneity problem, we construct the treatment group by narrowing our sample to pregnant women living in provinces that, on average, exceeded the compliance threshold set by the Ministry of Public Health by a deviation of 0.05. It is important to note that private healthcare providers are distributed unequally across the national territory. At the provincial level, we can say the institutions are distributed randomly: some institutions exceeded the threshold and others did not. Analyzing results at the provincial level can therefore reduce selection bias. Table 8 reflects the number of public and private institutions by province that exceeded the threshold and the number that did not. Hence, we used this second source of variation to evaluate the effect of the treatment and its intensity. In that context, the treatment group in our specification is composed of pregnant women [...]
We took advantage of microdata made available from the Perinatal Information System.
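The province-level assignment used in the second strategy can be sketched as follows. This is a hedged illustration: the function name and input layout are hypothetical, the value of the Ministry's threshold is passed in as a parameter because the text does not report it here, and "exceeded the threshold by a deviation of 0.05" is read as an average compliance above the threshold plus 0.05, with all remaining provinces forming the comparison group.

```python
from statistics import mean
from typing import Dict, List


def assign_provinces(compliance_by_province: Dict[str, List[float]],
                     threshold: float,
                     deviation: float = 0.05) -> Dict[str, bool]:
    """Map each province to True (treatment) or False (comparison).

    compliance_by_province -- compliance shares (0 to 1) of the private
                              institutions located in each province
    threshold              -- compliance threshold set by the Ministry
                              (value supplied by the user; an assumption)
    deviation              -- margin above the threshold (0.05 in the text)
    """
    assignment = {}
    for province, shares in compliance_by_province.items():
        if not shares:
            raise ValueError(f"no institutions recorded for {province!r}")
        # Assumed reading: treated if average compliance exceeds
        # the threshold by more than the stated deviation.
        assignment[province] = mean(shares) > threshold + deviation
    return assignment
```

Pregnant women would then inherit the status of the province in which they live, so that treatment is defined by geography rather than by the individual institution they chose.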
We performed an analysis of the 2002-2013 period, which allows us to consider the implementation of the program for private healthcare providers in 2008 and analyze the effect some years after the policy implementation. Our sample records 430,851 births nationwide during this period, excluding children born to teenage mothers. We thus analyzed 83% of national births recorded in the SIP during the period under review.
It is important to highlight some of the limitations of SIP data: total births are underreported, and some reports are incomplete. To test the consistency of our data, we compared SIP records with data from the Certificate of Live Births (CNV) for the period under consideration. Figure 1 shows how records changed over time throughout the period for both sectors, for all pregnant women and for non-teen mothers.
Table 9 presents descriptive statistics of the variables of interest for the treatment and control groups, taking account of both strategies. In panel A of Table 9, we observe that, on average, the treatment group (women in the private sector) presented better newborn health outcomes, a higher percentage of women with an adequate number of checkups, and early detection of pregnancy according to the epidemiological literature, as compared to the control group (women in the public sector). We can also observe that both maternal care indicators and newborn health outcomes improve during this period.
In panel B, when the groups are compared according to level of adherence within the private sector, we observe that for the variables related to maternal care, the control group has better results on average than the treatment group in both periods. When newborn health outcomes are considered, the pattern is the opposite.
[Figure 1 legend: public sector records, excluding adolescents; private sector records, excluding adolescents; total public sector records; total private sector records.]
As previously mentioned, the validity of the identification strategy rests on the assumption that in the absence of the policy, the treatment and control groups would have evolved similarly. We could assume that, if both groups evolved similarly and there was then a change at the point in time when the policy was implemented, this change must have been generated by the policy implementation. Graphical analysis supports this assumption by demonstrating that the groups had similar trajectories prior to treatment.
In Charts A.1 to A.4 in the Annex, we observe that, although the sectors show distinct levels, the variables related to newborn health and the use of health services for both groups seem to present a similar trajectory prior to the implementation of the program. As mentioned, the degree to which health facilities adhered to the policy varied somewhat, as did the behavior of the mothers at those facilities. We therefore distinguish between individuals affected by the program and untreated individuals and consider the differences in the degree of compliance of private healthcare providers where the mothers received medical care. This strategy is known as an intention-to-treat estimate; it considers the average effect of the program on all those exposed to it, regardless of the degree of compliance.
As a natural experiment, we exploited the differences in healthcare providers' implementation and degree of exposure to the P4P program. The period that is considered for this estimation spans from 2002 to 2013. Because we expect a delay in timing between treatment and outcomes, we evaluate the results in t+1.
We estimate the following equation: [...]
We also estimate the equations for the second strategy, in which we exploited the variability in the degree to which the private health sector was exposed to the policy. In this strategy, our treatment group is composed of pregnant women living in provinces that, on average, exceeded the compliance threshold set by the Ministry of Public Health by a deviation of 0.05. Our comparison group is made up of pregnant women living in provinces that, on average, did not exceed, by the same deviation, the compliance threshold set by the Ministry. It is important to emphasize that this program may be a channel for improving newborn health, through changes in the number of adequate prenatal checkups and through early detection of pregnancy.
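The estimating equation for the first strategy is not reproduced in this text. As a hedged illustration only, a standard DiD specification consistent with the description given here (treatment by sector, outcomes evaluated in t+1, controls for mothers' characteristics and concurrent programs, and fixed effects by sector, region, and year) could be written as below. Every symbol is an assumption introduced for illustration, not notation taken from the source.

```latex
% Illustrative DiD specification (assumed notation):
%   Y_{i,s,r,t+1} : outcome for birth i, sector s, region r, evaluated in t+1
%   T_s           : 1 if the sector is treated (private), 0 otherwise (public)
%   Post_t        : 1 for years from 2008 onward, 0 otherwise
%   X_{i,s,r,t}   : controls (mother's characteristics, PANES/CCT coverage, smoking)
%   \mu_s, \theta_r, \lambda_t : sector, region, and year fixed effects
\[
  Y_{i,s,r,t+1} = \alpha
    + \beta \,\bigl(T_s \times \mathit{Post}_t\bigr)
    + X_{i,s,r,t}'\,\delta
    + \mu_s + \theta_r + \lambda_t
    + \varepsilon_{i,s,r,t}
\]
```

In a specification of this form, the coefficient on the interaction term, written here as β, would be the DiD estimate of the program's effect; the main effects of T_s and Post_t are absorbed by the sector and year fixed effects.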
Our results are mainly short-term results, since the program is too recent to assess its medium- and long-term economic effects. Future studies should assess whether the program generated medium- and long-term benefits and whether it has contributed to reducing healthcare costs.
As mentioned above, one issue to consider is the coexistence of the P4P with other programs that could affect our outcomes. The P4P was implemented at the same time as the healthcare reform, the PANES program, the CCT program, and a tobacco control policy. This could make it difficult to isolate the effect of the P4P. To control for this, we analyze different sets of treatment and control groups, exploiting the different sources of variation the program allows. We also exclude pregnant teenagers, because they were the group most affected by the healthcare reform. This allows us to minimize composition-effect problems. Last, we include as control variables the percentage of households receiving the PANES and the CCT program by health sector (private and public), province, and year, as well as a dummy that reflects whether or not the pregnant woman is a smoker.
Another shortcoming of our analysis is that, even if the different healthcare institutions received similar incentives related to the fulfilment of the goals, they differed in the strategies they used to achieve those goals. Several institutions conducted phone campaigns, reaching out to mothers to prompt them to get an adequate number of checkups. Others advertised inside the institutions, while still others may not have conducted any campaign because they had already achieved the goals (MSP, 2012).
Table 10 presents the results of the estimation when we consider women in the private sector as the treatment group and women in the public sector as the comparison group. We evaluate the effect of the policy in terms of (1) the utilization of services by pregnant women, excluding teenagers, and (2) newborn health outcomes, considering fixed effects by sector, region, and year, and various other specifications. The variables that relate to maternal care are adequate prenatal checkups (according to WHO guidelines), adequate prenatal checkups (according to Ministry of Public Health guidelines), and early detection of pregnancy. The outcomes related to newborn health are newborn weight, low birth weight, stillbirths, number of weeks of gestation, and prematurity. The second row shows the impact of the P4P on the outcome variables we consider. As noted, in excluding teenage mothers, the analysis leaves out a group of mothers that is among the most at-risk. The results in panel A show a significant increase in the number of prenatal checkups and in rates of early detection of pregnancy. We can presume that this increase in prenatal care due to the P4P translates into an improvement in newborn health outcomes, with significant decreases, on average, in measures of low birth weight, premature births, and stillbirths. 6
We can evaluate the effectiveness of the policy in terms of the use of and access to health services. The program increased the share of women who received an adequate number of prenatal checkups according to WHO guidelines (nine checkups) by 10 percentage points. According to Ministry of Public Health guidelines, the share rose by around 6 percentage points. The share of women checked during their first trimester of pregnancy rose by around 4 percentage points.

Results
Regarding newborn health outcomes, we observe that average newborn weight increased by 11 grams. Furthermore, the percentage of babies born with low birth weight fell by 1 percentage point. We observe a decrease in the number of weeks of gestation, although we also see a decline in premature births of almost 0.35 percentage points. Neonatal mortality fell by 2 stillbirths per 1,000 live births.
Panel B shows the estimates produced by adding different covariates. When different control covariates are used, we see that the results are quite similar in terms of sign and magnitude of the coefficients.
The following table shows the results for the case in which the treatment group is pregnant women living in provinces that, on average, exceeded the compliance threshold set by the Ministry of Public Health by a deviation of 0.05, and the comparison group is pregnant women living in provinces that, on average, did not exceed the threshold by the same deviation. In this case we observe a similar pattern as before: the results in panels A and B of Table 11 show a significant increase in the number of prenatal checkups and in rates of early detection of pregnancy. The share of women receiving an adequate number of prenatal checkups according to WHO guidelines (nine checkups) rose by 3 percentage points. According to Ministry of Public Health guidelines, the share increased by around 4 percentage points. The share of women checked during their first trimester of pregnancy rose by around 4 percentage points.
6 To test for this link between improved prenatal care and newborn outcomes, as suggested by one of the anonymous reviewers, we conducted some additional estimations. We ran several instrumental variables (IV) estimations, instrumenting maternal care and early detection with several specifications of our treatment. The results are positive and statistically significant for all the specifications. We do not include them here due to space limitations, but they are available upon request.
Regarding newborn health outcomes, we observe that average newborn weight increased by 39 grams. Furthermore, the percentage of babies born with low birth weight fell by 0.5 percentage points. Although the effect on the number of weeks of gestation is not statistically significant, we observe a decrease of almost 2.5 percentage points in premature births. Neonatal mortality fell by 2.8 stillbirths per 1,000 live births. In panel B, when we control for different covariates, we observe that the number of weeks of gestation increased on average and that the rest of the results are quite similar.
The next table shows the intensity of the treatment in relation to the percentage of compliance of the different healthcare providers. In this case, again, the results are along the same lines as the previous analysis.
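A treatment-intensity model of this kind is commonly written by replacing the binary treatment indicator with a continuous exposure measure. The sketch below assumes a hypothetical variable Compliance_j, the percentage of compliance of provider j; the symbol names and the functional form are assumptions for illustration and are not taken from the paper.

```latex
\begin{equation}
Y_{i,j,r,t+1} = \alpha + \beta\,(\mathit{Compliance}_{j} \times \mathit{Post}_{t})
  + \gamma_{j} + \delta_{r} + \lambda_{t} + X_{i,j,r,t}'\theta + \varepsilon_{i,j,r,t+1}
\end{equation}
```

Under this form, \beta would be read as the change in the outcome associated with a one-unit increase in compliance once the program is in place.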

Robustness checks
The DiD method requires certain assumptions to be satisfied in order to claim that our estimates are causal. The model uses changes in the results of the control group to estimate the counterfactual that shows how the treatment group would have evolved if the program had not been implemented. The most important identification assumption, therefore, is that changes in the control group's results serve as a valid counterfactual of what would have happened to the treatment group if the policy had not been applied.
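The counterfactual logic can be illustrated with a simple two-group, two-period calculation. The sketch below uses the descriptive percentages that the text attributes to Table 9 (adequate checkups under Ministry guidelines: 44% and 68% for the treatment group, 23% and 35% for the control group); the function name `did_2x2` is hypothetical. This raw comparison of means is not the paper's regression and should not be expected to match the reported estimates, which include fixed effects and covariates.

```python
def did_2x2(treat_pre, treat_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences on group means.

    Returns the counterfactual post-period level for the treatment group
    (its pre-period level plus the control group's change) and the DiD
    estimate (observed post-period level minus that counterfactual).
    """
    control_change = control_post - control_pre
    counterfactual = treat_pre + control_change
    effect = treat_post - counterfactual
    return counterfactual, effect


# Descriptive percentages quoted in the text (attributed to Table 9).
counterfactual, effect = did_2x2(
    treat_pre=44.0, treat_post=68.0, control_pre=23.0, control_post=35.0
)
print(f"Counterfactual treated level: {counterfactual:.1f}%")
print(f"Raw DiD estimate: {effect:.1f} percentage points")
```

The identification assumption discussed above corresponds to the line that computes `counterfactual`: the treatment group is assumed to have changed by the same amount as the control group had the policy not been applied.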
We need to address several sources of confounding in our results. To ensure the validity of the outcomes, we run several robustness checks. First, we compare the private and public health sectors outside the capital, Montevideo. We selected sectors outside Montevideo for two reasons: (1) the beneficiaries of the health system outside the capital resemble each other more closely than do the beneficiaries in Montevideo; and (2) the choice between public and private health centers is less associated with population characteristics in the countryside, owing to the higher quality of the public sector outside Montevideo. Second, as we mentioned before, at the beginning of the program (the first trimester) the institutions implemented a range of measures to reach the aims. In general, in this first step the institutions established goals close to the percentage of pregnant women with proper care. When the program began, therefore, some institutions in the treatment group became part of the control group once the ministry established the goals, and some in the comparison group became part of the treatment group. We exploit this initial variation in policy adherence. Third, we consider the intensity of the treatment in terms of the percentage of private clinics in the provinces. Last, we consider only the private healthcare providers, taking as the comparison group those institutions that had better health outcomes in the pre-P4P period (assuming that the program did not affect them) and as the treatment group those institutions with poorer results.
In our first robustness check, in which we consider the private and public healthcare providers outside the capital, we observe positive and significant effects on the outcome variables considered, with a magnitude comparable to that found for the different specifications (Table A.1 of the statistical annex). This suggests that Montevideo does not exhibit a differential pattern in terms of visits to health facilities and that our results are not driven by the results observed in Montevideo. In our second check, in which different healthcare providers implemented different measures to reach the program's objectives, we observe a smaller effect than the one found in our main results.
One of the main threats to our identification strategy is that the people who use the private sector are different from the people who use public hospitals. There are two possible effects to consider in this regard. The first is that private-sector users might be more responsive to the policy, since they are better educated and more affluent, which would bias our estimates upward. The second is that this group had better health outcomes than the control group in the pre-P4P period, which would lead us to expect a weaker response to the policy. In that case, our estimates would be biased downward. As previously mentioned, Table 9 shows that the percentage of women in the treatment group with an adequate number of checkups, according to Ministry requirements, was 44% and 68% in the pre- and post-policy periods, respectively. The percentages in the control group were 23% and 35%.
In this context, the third exercise reinforces our estimates and helps us better understand the effect generated by the heterogeneity of the groups. Table A.3 shows the intensity of the treatment in relation to the percentage of private clinics in the provinces. 7 The results are quite similar to the main outcomes. However, the coefficients for maternal care are larger when the percentage-of-private-clinics variable is considered, which suggests that women in private clinics are more responsive than women in public ones.
Finally, Table A.4 considers only the private healthcare providers, taking as the comparison group those institutions that had better health outcomes in the pre-P4P period and as the treatment group those institutions with poorer results. The estimated effects are smaller than in our main results. This result leads us to expect lower levels of response to the policy, so that our estimates would be a lower bound.
Table 13 summarizes the results of the program. In the first column, we present the results with pregnant women in the private sector as the treatment group and women in the public sector as the control group. The third column displays the average of the control group for each outcome variable. The fourth column shows the results when compliance within the private sector is considered. Finally, the last column shows the average for the control group.
For the maternal care outcomes, the first specification implies that adequate prenatal checkups, as laid out in WHO guidelines, rise from 34% to 45%. For prenatal checkups as established in Ministry of Public Health guidelines, it implies an increase of 6 percentage points (from 49% to 55%). Finally, early detection of pregnancy increases from 54% to 58.5%. In the second specification, the effects on maternal care outcomes are smaller. The effect on average newborn weight is larger for the second specification, but the effect on low birth weight is smaller: the rate decreases from 9% to 8% in the first specification and from 5% to 4.5% in the second. Last, the impact on stillbirths and prematurity is larger in the second case, when differences within the private sector are considered.
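To make these magnitudes easier to compare, the implied levels quoted above can be converted into absolute and relative changes. The short sketch below uses only the baseline and post-program percentages stated in the text; the function name `changes` and the dictionary labels are hypothetical.

```python
def changes(baseline, post):
    """Return (absolute change in percentage points, relative change in %)."""
    abs_change = post - baseline
    rel_change = 100.0 * abs_change / baseline
    return abs_change, rel_change


# (baseline %, implied post-program %) as quoted in the text.
levels = {
    "adequate checkups (WHO)": (34.0, 45.0),
    "adequate checkups (Ministry)": (49.0, 55.0),
    "early detection": (54.0, 58.5),
    "low birth weight, first specification": (9.0, 8.0),
    "low birth weight, second specification": (5.0, 4.5),
}

results = {name: changes(base, post) for name, (base, post) in levels.items()}

for name, (abs_change, rel_change) in results.items():
    print(f"{name}: {abs_change:+.1f} pp ({rel_change:+.1f}% relative to baseline)")
```

Expressing the changes relative to baseline shows why a seemingly small percentage-point effect on a rare outcome such as low birth weight can still be sizeable in proportional terms.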

Discussion of the results
Overall, the results are positive, encouraging, and consistent with what is observed in some other studies. The magnitude of the increase in birth weight, for example, is notable. To put the improvement in perspective, albeit in a completely different context, Bozzoli and Quintana-Domeque (2014) found a decrease in average birth weight of 30 g as a consequence of a deep economic crisis in Argentina, in which GDP contracted by more than 10%.
Our results are in line with previous studies in some respects but differ in others. Balsa and Triunfo (2015) analyzed the relationship between the prenatal care of low-income pregnant women and newborn outcomes in Uruguay. They found that an adequate number of prenatal checkups (at least nine), along with early detection of pregnancy, reduced the likelihood of low birth weight by 6 percentage points and the probability of births occurring before 37 weeks of gestation by 11 percentage points. They also found an increase in newborn weight of 149 grams. Our results are smaller in magnitude.
Celhay et al. (2019) analyzed how temporary economic incentives offered to healthcare providers through Argentina's Plan Nacer to encourage the early detection of pregnancy affected this variable and related health outcomes. In line with our results, they showed that the program increased the early detection of pregnancy by 34 percent in the treatment period and that the effect persisted 24 months after the program ended. Their results for health outcomes are not consistent with ours, since they did not find any effects. However, Celhay et al. (2019) point out that they may lack the statistical power to detect effects on some of the health outcomes.
Amarante et al. (2016), who estimated the effect of a cash transfer program (PANES) on birth outcomes, found a decline in low birth weight of 1.9 to 2.4 percentage points but no evidence of an increase in prenatal care. In sum, our estimated effects are smaller than those reported in previous studies.
Although the estimated causal effects have internal validity, we cannot confirm their external validity because of the specific characteristics of the program. Given this, we should be cautious when extrapolating conclusions from these results. The program may have worked well in a small country such as Uruguay, which enjoys a very low level of corruption compared to other developing countries.
These types of programs might be an attractive tool for governments seeking to induce healthcare providers to improve the quality of prenatal care services and to achieve better newborn health outcomes. As mentioned before, there is very limited evidence in Latin America on programs introducing incentives at the country level and for the whole population. Plan Nacer, implemented in Argentina, has a different population coverage and also showed positive results in terms of maternal care indicators.

Final considerations
Uruguay's Metas Asistenciales is a nationwide P4P program that seeks to improve prenatal care and, through it, newborn health outcomes, by providing monetary incentives to healthcare providers. The program is part of the healthcare reform implemented in 2008, which attempts to move health services towards a model of primary healthcare coverage. Since its implementation in 2008, the program has continued to reinforce policies that improve access to health services for pregnant women and children. The novelty of the program is the way health centers are financed, which is an innovation in terms of health service delivery in developing countries. The program offers a pay-for-performance (P4P) incentive, which provides funds to health facilities that adhere to compliance standards set by the Ministry of Public Health. Health facilities receive funding for complying with certain indicators measured in the target population. If the standards are met, the centers are paid a certain amount per patient.
This incentive may represent a fundamental channel for improving children's health outcomes.
We evaluated the effect of the incentive scheme on the utilization rates of maternal care services and newborn health outcomes. Given the non-experimental nature of the policy, we performed a DiD estimation to compare the outcomes of variables of interest among populations treated in health centers against those that were not treated. We use two main sources of variation to identify possible causal effects. The first is the gap between private and public health sectors in the implementation of incentives. This difference was unintended. It was caused by a standing regulation stating that the financial incentive granted to public sector institutions had to be discounted from their allocated budgets.
The second is the difference in compliance within the private sector.
We observe a significant improvement in prenatal care: the share of women with an adequate number of checkups increased by 10 percentage points, while early detection of pregnancy increased by 4.5 percentage points. Women receiving services in a treated institution had a lower likelihood of having a newborn with low birth weight (1.2 percentage points). Premature births decreased by 0.36 percentage points. In turn, we observed a decrease in neonatal mortality of 2.4 per 1,000 births. Although the estimated causal effects have internal validity, we cannot confirm their external validity because of the specific characteristics of the program. Given this, we should be cautious when extrapolating conclusions from these results. Finally, the robustness checks support these findings.
This study thus provides relevant evidence of the potential benefits of a P4P scheme put in place to improve maternal and birth outcomes. Our results are, to the best of our knowledge, the first to use national data and administrative statistics for a nationwide policy implementing P4P schemes in an effort to improve newborn health in a developing country. We must stress, however, that the policy should be fully implemented in the public health sector in order to serve the most vulnerable population. Our estimates represent a lower bound and we can conjecture they will be of greater magnitude when considering at-risk populations such as teenage mothers who are normally served by the public health sector. It will be important to examine whether greater compliance generates a differential effect.
Ultimately, however, we can conclude that this Uruguayan P4P program had a positive and statistically significant impact on prenatal care and on newborn health outcomes.
This finding provides evidence in support of the idea that monetary incentives are a promising model that should be considered for regional and international expansion.