# Diffusion effects {#Diffusion}
Up until now, we have assumed that the treatment received by one unit in the population did not have any impact on any other unit.
We have not encoded this assumption formally, but we have implicitly made it all along, starting with our encoding of Rubin Causal Model in Chapter \@ref(FPCI).
In this chapter, we are going to relax that assumption, and learn how to deal with the more general cases that then appear.
We are going to cover a host of very important applications, ranging from identifying contagion effects to finding the optimal proportion of individuals to treat at independent locations.
We are going to start by introducing an extended Rubin Causal Model allowing for diffusion effects and introducing ways to discipline this model so that it becomes estimable.
We are then going to look at various ways to estimate this model and the precision of the resulting estimates, using RCTs, DID, and both parametric and nonparametric approaches.
Most of these developments are fairly recent and will quickly bring us in touch with the research frontier.
## Allowing for diffusion effects in Rubin Causal Model
In this section, we are going to detail how to encode causality in the presence of diffusion effects.
We are going to start with potential outcomes and a general framework, before considering two very important special cases: the case where diffusion effects are absent and the case where they take a specific form.
### Potential outcomes and treatment effects with diffusion effects
The main starting point for an extended Rubin Causal Model is to acknowledge that the treatment status of the $N^*$ observations in the population (with $N^*$ possibly infinite) might influence the observed outcome for individual $i$.
Let $\mathbf{d}=\left\{d_1,\dots,d_{N^*}\right\}$, with $d_j\in\left\{0,1\right\}$, $\forall j\in\left\{1,\dots,N^*\right\}$.
We can therefore write the generalized potential outcome for individual $i$ as $Y_i^{\mathbf{d}}$.
If we write $\mathbf{D}=\left\{D_1,\dots,D_{N^*}\right\}$, we can then write the observed outcome for individual $i$ as $Y_i^{\mathbf{D}}$.
The average effect of the treatment becomes:
\begin{align*}
\Delta^Y_{ATE}(\mathbf{D}) & = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}}\\
& = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=1}\Pr(D_i=1)+\esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=0}\Pr(D_i=0)\\
& = \Delta^Y_{TT}(\mathbf{D})\Pr(D_i=1)+\Delta^Y_{TUT}(\mathbf{D})\Pr(D_i=0)\\
\end{align*}
where $\mathbf{0}$ is the null vector of length $N^*$.
Note that the average effect of the treatment is equal to a weighted average of the effect on the treated and the effect on the untreated.
Note also that these effects differ from the ones we defined in Chapters \@ref(FPCI) and \@ref(FPSI): they depend on the whole vector of treatment assignments.
Indeed, the effect on the untreated is not the one we defined in Section \@ref(BiasOLS): it is not the difference between taking the treatment and not taking the treatment for those who do not take it.
The TUT we have defined here is the difference in outcomes for the ones who do not take the treatment between a case where the treated individuals in the population receive the treatment and a case where no one receives the treatment.
The only effect of the treatment on the untreated is indirect: it is the effect that transits through diffusion of the treatment effects from the treated to the untreated.
This happens, for example, when farmers adopt technologies after seeing their treated neighbors adopt them, or when people contract fewer diseases because their neighbors are vaccinated.
These effects can also be negative, for example when untreated job seekers are crowded out of a job by the job counselling received by the treated.
In general, I like to call these effects **contagion** effects, to insist on the fact that they are indirect.
Note that the effect on the treated also is different and depends on the whole treatment vector.
In that case, we allow for the effect on the treated to depend on whether or not some or all of their neighbors are treated.
The effect of a vaccine might for example be higher when more people around us are vaccinated.
Or a technology might be more likely to be adopted if more neighbors are informed that it exists and encourage its adoption.
I call these types of effects **amplification** effects, to denote the fact that whether the treated react a lot or not to the treatment might depend on whether their neighbors are also treated.
These effects might also be negative, for example when more job seekers receive counselling, the effectiveness of counselling on the treated might very well decrease.
### Encoding the absence of diffusion effects
In this section, we are going to state the assumption of absence of diffusion effects, that is required for all our previous estimators to work.
This assumption, called the Stable Unit Treatment Value Assumption, is stated as follows:
```{hypothesis,SUTVA,name="Stable Unit Treatment Value Assumption"}
We assume that the effect of the treatment on individual $i$ only depends on whether $i$ receives the treatment or not, and not on whether other individuals in the population receive the treatment as well: $\forall i$, $D_i=D'_i\Rightarrow Y_i^{\mathbf{D}}=Y_i^{\mathbf{D'}}$, $\forall\mathbf{D}\neq\mathbf{D'}$.
```
```{remark}
SUTVA has been coined by Don Rubin in a series of papers: Rubin ([1978](https://doi.org/10.1214/aos/1176344064), [1980](https://doi.org/10.2307/2287653), [1990](https://doi.org/10.1214/ss/1177012032)).
```
SUTVA implies the version of Rubin Causal Model that we have introduced in Chapter \@ref(FPCI).
Indeed, SUTVA implies that the only treatment status that matters for the potential outcomes of individual $i$ is the treatment status of individual $i$.
As a consequence, we have the following results:
```{theorem,RCMSUTVA,name="Rubin Causal Model and Treatment Effects Under SUTVA"}
Under Assumption \@ref(hyp:SUTVA), the potential outcome of individual $i$ only depends on its treatment status: $\forall i$, $Y_i^{\mathbf{D}}=Y_i^{D_i}$.
As a consequence:
\begin{align*}
\Delta^Y_{TT}(\mathbf{D}) & = \Delta^Y_{TT}\\
\Delta^Y_{TUT}(\mathbf{D}) & = 0.
\end{align*}
```
```{proof}
The proof of the first result that $Y_i^{\mathbf{D}}=Y_i^{D_i}$ is straightforward from Assumption \@ref(hyp:SUTVA).
We therefore have
\begin{align*}
\Delta^Y_{TT}(\mathbf{D}) & = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=1}\\
& = \esp{Y_i^{D_i}-Y_i^{0}|D_i=1}\\
& = \esp{Y_i^{1}-Y_i^{0}|D_i=1}\\
& = \Delta^Y_{TT}\\
\Delta^Y_{TUT}(\mathbf{D})& = \esp{Y_i^{\mathbf{D}}-Y_i^{\mathbf{0}}|D_i=0}\\
& = \esp{Y_i^{D_i}-Y_i^{0}|D_i=0}\\
& = \esp{Y_i^{0}-Y_i^{0}|D_i=0}\\
& = 0.
\end{align*}
```
### Treatment exposure {#TreatmentExposure}
In general, it is going to prove extremely difficult to do econometric analysis using the very general setting we have defined so far, with potential outcomes depending on the whole treatment vector in the population.
A useful simplifying assumption that we often have to resort to is to specify an exposure mapping, that relates the whole treatment vector to the specifications relevant for the outcomes of interest.
In order to specify the exposure mapping, we are going to assume that all units in the population are part of a network.
This network is summarized by an $N^*\times N^*$ contiguity matrix $A$ where each element $a_{j,i}$ (with $j$ denoting the line and $i$ the column) measures the strength of the relationship between $j$ and $i$.
For example, if $j$ mentions $i$ as a friend, $a_{j,i}=1$, whereas $a_{i,j}=1$ whenever $i$ mentions $j$ as a friend.
We can enforce the graph to be symmetric, that is $a_{j,i}=a_{i,j}$, $\forall (i,j)$, but it does not have to be the case.
For example, water quality at some point $i$ along a river stream depends on whether water is treated at a point $j$ upstream, but water quality in $j$ does not depend on treatments in a downstream point $i$.
Because water flows in one direction, the network is not symmetric.
Equipped with a network of links, and denoting $\mathbf{\Omega}=\left\{0,1\right\}^{N^*}$ the set of possible treatment allocations, and $\mathbf{\Theta}$ the set of parameters $\theta_i$ relevant for the value of treatment exposure of unit $i$ (possibly containing features of the $A$ matrix), we can define treatment exposure as a mapping $f$ from $\mathbf{\Omega}\times\mathbf{\Theta}$ to $\mathbf{\Delta}$, the set of possible treatment exposures: $\Delta_i=f(\mathbf{D},\theta_i)$.
A key assumption we are going to make is that the exposure mapping is properly specified, that is, that it perfectly captures the intricacies of the effects of the various treatment vectors:
```{hypothesis,PropSpecifyExpMap,name="Properly specified exposure mapping"}
$\forall i$, $\forall\mathbf{D}\neq\mathbf{D'}\in\mathbf{\Omega}$, $\forall \theta_i\in\mathbf{\Theta}$, $f(\mathbf{D},\theta_i)=f(\mathbf{D'},\theta_i)\Rightarrow Y_i^{\mathbf{D}}=Y_i^{\mathbf{D'}}$.
```
Under Assumption \@ref(hyp:PropSpecifyExpMap), the potential outcomes can be written as functions of treatment exposure only: $Y_i^{\Delta_i}$.
As a consequence, we can now define the average effect of the treatment on the treated as follows:
\begin{align*}
\Delta^Y_{TT}(\mathbf{d}) & = \esp{Y_i^{\mathbf{d}}-Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}},
\end{align*}
where $\Delta^Y_{TT}(\mathbf{d})$ measures the impact of treatment exposure $\mathbf{d}$ on those who have received it.
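As a concrete illustration, here is a minimal sketch of one natural exposure mapping, recording a unit's own treatment status together with the share of its treated neighbors (the adjacency matrix and treatment vector below are made up for illustration):

```r
# Hypothetical exposure mapping: own treatment status and share of treated neighbors.
A <- matrix(c(0, 1, 1, 0,
              1, 0, 0, 0,
              1, 0, 0, 1,
              0, 0, 1, 0),
            nrow = 4, byrow = TRUE)   # contiguity matrix (illustrative)
D <- c(1, 0, 1, 0)                    # treatment vector (illustrative)
# Share of treated neighbors (pmax guards against isolated units)
share_treated_neighbors <- as.vector(A %*% D) / pmax(rowSums(A), 1)
exposure <- cbind(own = D, neighbors = share_treated_neighbors)
exposure
```

Under Assumption \@ref(hyp:PropSpecifyExpMap), two treatment vectors that generate the same row of `exposure` for unit $i$ would yield the same potential outcome for that unit.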
```{remark}
The framework based on the use of an exposure mapping has been developped by [Manski (2013)](https://doi.org/10.1111/j.1368-423X.2012.00368.x) and [Aronow and Samii (2017)](https://doi.org/10.1214/16-AOAS1005).
```
```{remark}
I use the term ``average treatment effect on the treated'' because $\Delta^Y_{TT}(\mathbf{d})$ measures the effect of receiving treatment exposure $\mathbf{d}$ on those who receive it.
```
We are now equipped with tools that enable us to define treatment effects in the presence of diffusion effects, and to identify various types of diffusion effects.
The key concept that we are going to have to specify is treatment exposure: how does it change across applications and how do we go about identifying it in specific cases?
What can we also do to test for features of treatment exposure without completely specifying it?
This is what we are going to see in what follows, first in the case of Randomized Controlled Trials, and then in the case of Difference in Differences.
We are going to proceed step by step, starting with simpler networks, that I call coarse networks, before looking at what we can do with more complex networks.
### Fundamental problem of causal inference for diffusion effects
With diffusion effects and treatment exposure, the Fundamental Problem of Causal Inference strikes again.
Let us state the problem using our more general framework of treatment exposure:
```{theorem,FPCIDiff,name='Fundamental problem of causal inference with diffusion effects'}
It is impossible to observe $\Delta^Y_{TT}(\mathbf{d})$, $\forall \mathbf{d}\in\mathbf{\Delta}$, either in the population or in the sample.
```
```{proof}
For the population TT:
\begin{align*}
\Delta^Y_{TT}(\mathbf{d}) & = \esp{Y_i^{\mathbf{d}}-Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}} \\
& = \esp{Y_i^{\mathbf{d}}|\Delta_i=\mathbf{d}}-\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}\\
& = \esp{Y_i|\Delta_i=\mathbf{d}}-\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}.
\end{align*}
$\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}$ is unobserved, and so is $\Delta^Y_{TT}(\mathbf{d})$.
A similar reasoning holds for the sample average treatment effect.
```
We also have a novel formulation of the bias of intuitive methods.
For example, selection bias now depends on $\mathbf{d}$:
\begin{align*}
\Delta^Y_{SB}(\mathbf{d}) & = \Delta^Y_{WW}(\mathbf{d})-\Delta^Y_{TT}(\mathbf{d}) \\
& = \esp{Y_i|\Delta_i=\mathbf{d}}-\esp{Y_i|\Delta_i=\mathbf{0}}-\esp{Y_i^{\mathbf{d}}-Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}\\
& = \esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{d}}-\esp{Y_i^{\mathbf{0}}|\Delta_i=\mathbf{0}}.
\end{align*}
```{remark}
Why is the with/without comparison of individuals with treatment exposure $\Delta_i=\mathbf{d}$ and those with treatment exposure $\Delta_i=\mathbf{0}$ biased for average treatment effect on the treated $\Delta^Y_{TT}(\mathbf{d})$?
This is because treatment exposure might be correlated with unobserved confounders: individuals with higher treatment exposure might be systematically different from those with the reference level of treatment exposure (here $\mathbf{0}$).
```
### Bias of one-step randomized controlled trials
In a one-step randomized controlled trial, the treatment is randomized at the unit level within a connected population.
Control units then remain connected to treated units and are themselves exposed to the treatment through diffusion, so that the treatment-control comparison recovers the direct effect of the treatment net of the indirect effect on the controls.
With positive contagion effects, the outcomes of the control units are shifted upwards and the comparison underestimates the effect of the treatment relative to a fully untreated population; with negative contagion effects, such as crowding out on the labor market, the comparison overestimates it.
## Diffusion effects with coarse networks
Coarse networks are networks where we do not have a lot of information on the connections between individuals: we only know whether they belong to the same influence group or not.
This type of network characterizes, for example, a group of villages, municipalities, or classrooms, for which we do not know which links individuals have with each other beyond the fact that they belong to the same group.
More formally, coarse networks can be characterized by the following property:
```{hypothesis,CoarseNetwork,name="Coarse network"}
We say that our population is characterized by a coarse network if the observed matrix of connections $A$ is block diagonal and we do not know which nodes are activated within each block.
```
```{remark}
A block diagonal influence matrix is composed of a set of groups or clusters within which observations influence each other and across which we assume all influences are muted.
This is of course a simplification: some units within a cluster might not really be connected, while some units might be connected to units in another group.
Also, not all units might be equivalent within a group, with some being more central (*e.g.* connected) than others.
In a coarse network, we are assuming these differences away.
```
```{remark}
Another way of framing coarse networks is to say that there is unknown interference within clusters (and no interference across).
This is [Viviano (2023)](http://arxiv.org/abs/2011.08174)'s definition.
With Viviano's approach to coarse networks, we do not know which units interfere within each network and how they do.
```
With a coarse network approach, under Assumption \@ref(hyp:CoarseNetwork), we might specialize the exposure mapping to things we might know, that is whether the unit itself is treated or not and the proportion of units that are treated in a given cluster $c$, $p_c$, or, more generally, the proportion of units with characteristics $X_i=x$ that are treated within clusters with characteristics $Z_c=z$: $p(x,z)$.
As a consequence, we might write potential outcomes as $Y_i^{D_i,p_c}$ or, more generally, $Y_i^{D_i,\left\{p(x,Z_c)\right\}_{x\in\mathcal{X}}}$, with $\mathcal{X}$ the support of $X_i$.
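Computing the coarse-network exposure is then straightforward; here is a minimal sketch with made-up treatment statuses and cluster labels:

```r
# Proportion of treated units within each cluster (illustrative data).
D <- c(1, 0, 1, 1, 0, 0)                 # unit-level treatment
cluster <- c("a", "a", "a", "b", "b", "b")
p_c <- ave(D, cluster, FUN = mean)       # cluster-level share, repeated unit by unit
data.frame(cluster, D, p_c)
```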
```{remark}
Lemma 2.1 in [Viviano (2023)](http://arxiv.org/abs/2011.08174) shows an example of assumptions under which we can simplify the exposure mapping and obtain potential outcomes as a function of the proportion of treated units and of cluster and unit characteristics.
```
Under Assumption \@ref(hyp:CoarseNetwork), we can write the average effect of treating a cluster with a proportion of treated $p_c=p$ as follows:
\begin{align*}
\Delta^Y_{TT}(p) & = \esp{Y_i^{D_i,p}-Y_i^{0,0}|p_c=p}\\
& = \esp{Y_i^{1,p}-Y_i^{0,0}|D_i=1,p_c=p}\Pr(D_i=1|p_c=p)\\
& \phantom{=} +\esp{Y_i^{0,p}-Y_i^{0,0}|D_i=0,p_c=p}\Pr(D_i=0|p_c=p)\\
& = \Delta^Y_{TDT}(p)\Pr(D_i=1|p_c=p)+\Delta^Y_{TIT}(p)\Pr(D_i=0|p_c=p),
\end{align*}
where $\Delta^Y_{TDT}(p)$ is the Average Treatment Effect on the Directly Treated and $\Delta^Y_{TIT}(p)$ is the Average Treatment Effect on the Indirectly Treated.
The main question under Assumption \@ref(hyp:CoarseNetwork) is to find the allocation of treated units that maximizes some objective function.
We are going to make a distinction between two different cases:
1. In the first case, we have a pre-specified budget for treatment effort (in terms of number of treated units) and we have to choose how to spend it optimally.
This often happens in practical policy applications where the budget has been pre-approved but you do not know how to spend it optimally.
2. In the second case, we already have an existing policy in place, and we would like to know whether it is optimal, and in which direction we should take it if we happen to have some additional budget.
### Optimal treatment allocation under monotone response
I develop results on this setting in my own ongoing research.
In order to fix ideas, we are going to start with a simple network with two clusters.
We will then look at what happens with a more general network.
Finally, we will look at how we can use two-step clustered RCTs to estimate the required parameters and decide on the optimal allocation.
#### A simple model
Let's start with a very simple example of a network with two clusters $1$ and $2$.
Let's also consider only the case of a discrete outcome (such as participation in a program, getting vaccinated, contracting a disease, adopting a technology, etc.).
For simplicity, we are also going to write that potential outcomes are realizations of a continuous utility variable crossing a threshold:
\begin{align*}
Y_{i}^{0,P_c} & = \uns{\underbrace{\alpha_0 + \beta_0 P_{c} -\epsilon_{i,0}}_{Y^*_{i,0}}\geq0}\\
Y_{i}^{1,P_c} & = \uns{\underbrace{\alpha_1 + \beta_1 P_{c} -\epsilon_{i,1}}_{Y^*_{i,1}}\geq0}.
\end{align*}
What this model tells us is that, when no one else in the cluster is treated ($P_c=0$), the individual-level effect of being treated is equal to $\Delta^Y_i=\uns{\epsilon_{i,1}\leq\alpha_1}-\uns{\epsilon_{i,0}\leq\alpha_0}$.
When some units start receiving the treatment, we have two indirect effects:
* Increasing the proportion of treated units impacts the outcomes of untreated units, through $\beta_0$.
This is what I call a **contagion** effect, in which untreated units are somehow contaminated by the treatment received by the treated individuals in the same cluster.
Contagion might refer to receiving information about the existence of a program and eventually deciding to enroll, or to being protected by the fact that some neighbors are taking a treatment (in that case, contagion effects might actually prevent some untreated units from being contaminated).
Contagion effects might be negative, if for example treated individuals who receive job training or job search assistance end up finding jobs that would have been allocated to some of the untreated individuals in the absence of the treatment.
* Increasing the proportion of treated units impacts the outcomes of treated units, through $\beta_1$.
This is what I call an **amplification** effect.
There is amplification each time a treated unit increases its likelihood of a positive outcome because more units are treated.
This might happen when technological adoption occurs only after most individuals in the cluster have been exposed to it and convinced to make a change.
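With $\epsilon_{i,0}$ and $\epsilon_{i,1}$ uniform on $[0,1]$, so that their CDF is the identity on $[0,1]$, the cluster-level response to a share $p$ of treated units has a simple closed form that we can trace out in code; the parameter values below are arbitrary illustrations chosen so that all arguments stay in $[0,1]$:

```r
# Cluster-level response A(p) implied by the threshold model with uniform errors.
# Parameter values are arbitrary illustrations (contagion dominates: beta0 > beta1).
A_fun <- function(p, alpha0 = 0.2, alpha1 = 0.4, beta0 = 0.3, beta1 = 0.1) {
  # Pr(Y = 1 | D = d, p) = alpha_d + beta_d * p when the argument lies in [0, 1]
  p * (alpha1 + beta1 * p) + (1 - p) * (alpha0 + beta0 * p)
}
p_grid <- seq(0, 1, by = 0.1)
round(A_fun(p_grid), 3)
# Second differences reveal the curvature: here A'' = 2 * (beta1 - beta0) < 0.
all(diff(diff(A_fun(p_grid))) < 0)  # TRUE: concave response
```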
In order to get even more intuition on this problem, we are going to specialize it even further by making the following set of assumptions:
```{hypothesis,SimpleAllocHyp,name="Simplified Allocation Problem"}
We assume that the allocation problem is characterized as follows:
* There are only two nodes $c=1$ and $c=2$.
* A mass of $1$ units reside at each node.
* We can only treat a mass of $1$ units.
* We assume the constraint is saturated so that we use all available treatments: $p_1+p_2=1$.
* $\epsilon_1$ and $\epsilon_0$ are uniform on $\left[0,1\right]$.
```
Under Assumption \@ref(hyp:SimpleAllocHyp), we can set $p_1=p$ and $p_2=1-p$.
Let's assume our goal is to maximize the total number of people with $Y_i=1$.
Under Assumption \@ref(hyp:SimpleAllocHyp), this is equivalent to maximizing the sum of the adoption rates at both nodes.
Using the fact that the constraint is saturated, we can write the objective function we aim to maximize as follows:
\begin{align*}
W(p) & = \underbrace{pF_{\epsilon,1}(\alpha_1 + \beta_1p)+(1-p)F_{\epsilon,0}(\alpha_0 + \beta_0p)}_{A(p)}\\
& \phantom{=}+\underbrace{(1-p)F_{\epsilon,1}(\alpha_1 + \beta_1(1-p))+pF_{\epsilon,0}(\alpha_0 + \beta_0(1-p))}_{A(1-p)}
\end{align*}
The $A$ function measures how much the probability of observing the favorable outcome $Y_i=1$ in a given cluster increases with the proportion of treated individuals in the cluster, $p$.
It turns out that the properties of the $A$ function are key to determine the optimal allocation of treatment effort across nodes in the general case with more than two nodes.
For now, in the two-node case and under substantial simplifications, we have the following result:
```{theorem,SimpleAlloc,name="Optimal allocation of treatment effort with two nodes"}
Under Assumptions \@ref(hyp:PropSpecifyExpMap), \@ref(hyp:CoarseNetwork) and \@ref(hyp:SimpleAllocHyp), we have three possible cases for the optimal allocation of treatment effort:
* When amplification effects dominate ($\beta_1>\beta_0$): either $p^*=1$ or $p^*=0$
* When contagion effects dominate ($\beta_0>\beta_1$): $p^*=\frac{1}{2}$
* When amplification and contagion effects are of the same size ($\beta_0=\beta_1$): any $p^*\in\left[0,1\right]$ is optimal.
```
```{proof}
Under Assumption \@ref(hyp:SimpleAllocHyp), we have:
\begin{align*}
W(p) & = p(\alpha_1 + \beta_1p)+(1-p)(\alpha_0 + \beta_0p)+(1-p)(\alpha_1 + \beta_1(1-p))+p(\alpha_0 + \beta_0(1-p))\\
& = \alpha_0+\alpha_1+\beta_1+2(\beta_0-\beta_1)p(1-p),
\end{align*}
where the second line follows after some algebra.
The problem $\max_{p\in\left[0,1\right]}W(p)$ has the following first order condition: $W'(p)=2(\beta_0-\beta_1)(1-2p)=0$ and the following second order condition: $W''(p)=-4(\beta_0-\beta_1)$.
When $\beta_0>\beta_1$, $W''(p)<0$, and the interior solution $p^*=\frac{1}{2}$ maximizes $W$.
When $\beta_0<\beta_1$, $W''(p)>0$, and the interior solution $p^*=\frac{1}{2}$ minimizes $W$.
In that case, the optimal solution is at a corner, either at $p^*=1$ or at $p^*=0$.
Since $W(1)=W(0)$, they are both maxima.
When $\beta_0=\beta_1$, $W$ is constant and any value in $\left[0,1\right]$ maximizes $W$.
```
```{remark}
Theorem \@ref(thm:SimpleAlloc) shows that when amplification effects dominate, it is optimal to focus all treatment effort on one of the two nodes (for example the first, but they are interchangeable).
This is because returns are increasing in this case: the $A$ function is convex, with more people responding to the treatment as more of them receive the treatment.
When contagion effects dominate, it is optimal to treat both nodes, with half of the observations receiving the treatment.
This is because in that case, the $A$ function is concave, and the marginal returns are decreasing when we treat more people.
When contagion and amplification effects are equal, every allocation is optimal: any $p\in\left[0,1\right]$ yields the same welfare.
```
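The proof can also be checked numerically; a quick grid search on $W$, with arbitrary parameter values, reproduces the cases of Theorem \@ref(thm:SimpleAlloc):

```r
# Objective function W(p) of the two-node problem with uniform errors.
W <- function(p, alpha0 = 0.2, alpha1 = 0.4, beta0, beta1) {
  p * (alpha1 + beta1 * p) + (1 - p) * (alpha0 + beta0 * p) +
    (1 - p) * (alpha1 + beta1 * (1 - p)) + p * (alpha0 + beta0 * (1 - p))
}
p_grid <- seq(0, 1, by = 0.01)
# Contagion dominates (beta0 > beta1): interior maximum at p = 1/2.
p_grid[which.max(W(p_grid, beta0 = 0.3, beta1 = 0.1))]
# Amplification dominates (beta1 > beta0): maximum at a corner (0 and 1 are tied).
p_grid[which.max(W(p_grid, beta0 = 0.1, beta1 = 0.3))]
```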
#### A general model
One open question is whether we can generalize the result in Theorem \@ref(thm:SimpleAlloc) to a much more general setting with several nodes and more general functional forms.
It turns out that we can.
Let us now formulate a more general setting:
```{hypothesis,SymAllocHyp,name="Symmetric Allocation Problem"}
We assume that the allocation problem is characterized as follows:
* $K$ nodes indexed from $1$ to $K$, and each node has size $n_k$.
* At each node, we can choose to treat $r_k$ individuals.
* The total number of individuals on the network is $N=\sum_{k=1}^Kn_k$.
* The total number of treated individuals is $R=\sum_{k=1}^Kr_k$.
* We cannot treat more than $\bar{R}$ individuals.
* We cannot treat everyone: $\bar{R}<N$.
* The expected outcome at each node (or response function) is a function only of the proportion of treated units at the node, $p_k=\frac{r_k}{n_k}$, which we denote $A(p_k)$, with $A'>0$.
```
```{remark}
Note a slight abuse of notation: the symbol $A$ now denotes the response function, whereas it denoted the contiguity matrix in Section \@ref(TreatmentExposure).
The context should make clear which object is meant.
```
```{remark}
Assumption \@ref(hyp:SymAllocHyp) is mainly restrictive in making the problem symmetric: all nodes are treated in the same way.
The only thing that distinguishes nodes is their respective size.
Apart from that, they all respond in the same (average) way to the treatment.
We do not try to distinguish between nodes based on observed characteristics of the nodes.
We also do not try to vary the identity of treated units based on their observed characteristics.
Another restriction is that $A'>0$: we only consider treatments for which the response is always strictly increasing in $p$ (and not weakly).
```
Under Assumptions \@ref(hyp:PropSpecifyExpMap), \@ref(hyp:CoarseNetwork) and \@ref(hyp:SymAllocHyp), we can cast our optimization problem as follows:
\begin{align*}
\max_{\left\{r_k\right\}_{k=1}^K} & \sum_{k=1}^K n_kA(\frac{r_k}{n_k})\label{eqn:MainProbMax}\\
& \text{under the constraints} \nonumber\\
R & =\sum_{k=1}^Kr_k \leq \bar{R} \label{eqn:MainProbR}\\
r_k & \leq n_k\text{, }\forall k\label{eqn:MainProbn}\\
r_k & \geq 0\text{, }\forall k.\label{eqn:MainProbr}
\end{align*}
In my work, I have been able to solve this problem for a response function $A$ whose second derivative has constant sign, in the following sense:
```{hypothesis,SmoothResponseHyp,name="Monotone Response Function"}
We assume that the second derivative of the response function $A$ has constant sign on its full support: either $A''(p)>0$, $\forall p\in\left[0,1\right]$, or $A''(p)<0$, $\forall p\in\left[0,1\right]$.
```
We can indeed prove the following result:
```{theorem,SmoothSymAlloc,name="Optimal allocation under monotone response with $K$ symmetric nodes"}
Under Assumptions \@ref(hyp:PropSpecifyExpMap), \@ref(hyp:CoarseNetwork), \@ref(hyp:SymAllocHyp) and \@ref(hyp:SmoothResponseHyp), the optimal allocation of treatment across nodes is as follows:
\begin{align*}
\frac{r^*_k}{n_k} & =
\begin{cases}
\frac{\bar{R}}{N}\text{, }\forall k & \text{ if }A''<0\\
\begin{cases}
0 & \text{ for a set of nodes } \mathcal{J} \text{ such that } \sum_{j\in\mathcal{J}}n_j=N-\bar{R},\\
1 & \text{ for a set of nodes } \mathcal{L} \text{ such that } \sum_{l\in\mathcal{L}}n_l=\bar{R},
\end{cases}
& \text{ if } A''>0.\\
\end{cases}
\end{align*}
```
```{proof}
See Section \@ref(proofSmoothSymAlloc).
```
```{remark}
Theorem \@ref(thm:SmoothSymAlloc) shows that the very simple intuition that we obtained in the two-node problem carries over to more complex settings.
The optimal allocation depends on the sign of the second derivative.
When returns are decreasing, we treat each node symmetrically with the same share $p^*=\frac{\bar{R}}{N}$ of the treatment effort.
When returns are increasing, we treat a share $\frac{\bar{R}}{N}$ of the nodes with $p^*=1$ and a share $1-\frac{\bar{R}}{N}$ with $p^*=0$.
```
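The allocation rule of Theorem \@ref(thm:SmoothSymAlloc) can be sketched in code; the node sizes and budget below are made up, and the `concave` flag stands in for the sign of $A''$:

```r
# Optimal treatment allocation across K symmetric nodes (sketch).
allocate <- function(n, R_bar, concave) {
  N <- sum(n)
  stopifnot(R_bar < N)
  if (concave) {
    # Decreasing returns (A'' < 0): treat each node at the common rate R_bar / N.
    return(n * R_bar / N)
  }
  # Increasing returns (A'' > 0): saturate nodes until the budget is exhausted.
  # Nodes are interchangeable here; when sizes do not split the budget exactly,
  # one node ends up partially treated, a relaxation of the theorem's partition.
  r <- numeric(length(n))
  left <- R_bar
  for (k in seq_along(n)) {
    r[k] <- min(n[k], left)
    left <- left - r[k]
  }
  r
}
n <- c(100, 50, 50)
allocate(n, R_bar = 100, concave = TRUE)   # 50 25 25
allocate(n, R_bar = 100, concave = FALSE)  # 100  0  0
```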
```{remark}
There are several open questions on this research front.
To list but a few:
* Can we relax Assumption \@ref(hyp:SmoothResponseHyp)?
For example, we know that $A''$ does not have a constant sign when the error terms are normal in the model with two nodes, but we still have an optimal solution that has the same shape.
* Can we relax Assumption \@ref(hyp:SymAllocHyp)?
Especially, can we allow for responses that vary as a function of node characteristics and can we allow for treatment allocation based on unit characteristics?
```
#### Using two-step clustered randomized controlled trials to find the optimal treatment allocation
In this section, we are going to see that conducting a two-step clustered randomized controlled trial enables us to identify the optimal treatment allocation under Assumptions \@ref(hyp:PropSpecifyExpMap), \@ref(hyp:CoarseNetwork) and \@ref(hyp:SymAllocHyp).
A two-step clustered randomized controlled trial works as follows:
* In a first step, we randomly select three sets of nodes, $ST$, $PT$, and $SC$, with $K_{ST}+K_{PT}+K_{SC}=\tilde{K}$ and $\tilde{K}\leq K$.
When $\tilde{K}< K$, $\tilde{K}$ is a random subset of the $K$ nodes.
+ Nodes that belong to $ST$, the set of nodes of size $K_{ST}$, are called **Super Treated** nodes.
The proportion of treated units is $p^R_c=1$, $\forall c \in ST$.
+ Nodes that belong to $PT$, the set of nodes of size $K_{PT}$, are called **Partially Treated** nodes.
The proportion of treated units is $p^R_c=\frac{\bar{R}}{N}\equiv p^*$, $\forall c \in PT$.
+ Nodes that belong to $SC$, the set of nodes of size $K_{SC}$, are called **Super Control** nodes.
The proportion of treated units is $p^R_c=0$, $\forall c \in SC$.
* In a second step, we randomly select $N^1_c=\frac{\bar{R}}{N}N_c$ units to be treated (with $R_i=1$) and $N^0_c=N_c-N^1_c$ to be in the control group ($R_i=0$), $\forall c \in PT$, with $N_c$ the number of units in node $c$.
When implementing the treatment, all units in $ST$ are treated, only $N^1_c$ units are treated in $PT$ and no unit is treated in $SC$.
```{remark}
Note that rigorously, we should have $N_c^1=\lfloor\frac{\bar{R}}{N}N_c\rfloor$, but we disregard the complexities brought about by the fact that the number of units has to be an integer.
```
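To make the design concrete, here is a minimal Python sketch of the two-step randomization (the function and variable names are ours, and, as in the remark above, we simply truncate $p^*N_c$ to an integer):

```python
import random

def two_step_assignment(clusters, k_st, k_pt, k_sc, p_star, seed=0):
    """Randomly split clusters into Super Treated (ST), Partially Treated (PT)
    and Super Control (SC) sets, then draw individual treatments R_i within PT.
    `clusters` maps a cluster id to its list of unit ids."""
    rng = random.Random(seed)
    ids = list(clusters)
    rng.shuffle(ids)
    st = ids[:k_st]
    pt = ids[k_st:k_st + k_pt]
    sc = ids[k_st + k_pt:k_st + k_pt + k_sc]
    treated = set()
    for c in st:                        # p_c = 1: every unit treated
        treated.update(clusters[c])
    for c in pt:                        # p_c = p*: a share p* of units treated
        units = clusters[c]
        n1 = int(p_star * len(units))   # truncation to an integer
        treated.update(rng.sample(units, n1))
    # p_c = 0 in SC clusters: nobody treated
    return st, pt, sc, treated
```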
The only remaining quantity we need to identify in order to apply Theorem \@ref(thm:SmoothSymAlloc) is the sign of the second derivative of the $A$ function.
We are going to show that the sign of $A''$ can be identified in a two-step clustered randomized controlled trial.
Before that, we are going to encode the validity of the two-step clustered randomized controlled trial:
```{hypothesis,independence2StepCluster,name="Independence in a two-step clustered design"}
We assume that the allocation of the proportion of neighbors treated and of the individual treatment level are independent of potential outcomes:
\begin{align*}
(R_i,p^R_c)\Ind\left(\left\{Y_i^{0,p},Y_i^{1,p}\right\}_{p\in\left[0,1\right]}\right).
\end{align*}
```
We also assume that the randomized allocation does not interfere with how units respond to the treatment:
```{hypothesis,TwoStepClusterValidity,name="Validity of the 2-step clustered design"}
We assume that the randomized allocation of the program does not interfere with how potential outcomes are generated:
\begin{align*}
Y_i & =
\begin{cases}
Y_i^{1,p} & \text{ if } R_i=1 \text{ and } p^R_c=p\\
Y_i^{0,p} & \text{ if } R_i=0 \text{ and } p^R_c=p
\end{cases}
\end{align*}
with $Y_i^{1,p}$ and $Y_i^{0,p}$ the same potential outcomes as defined with a routine allocation of the treatment.
```
We are now equipped to prove the identification of $A''$:
```{theorem,IdentSmoothSymAlloc,name="Identification of $A''$ in a 2-step clustered randomized controlled trial"}
Under Assumptions \@ref(hyp:PropSpecifyExpMap), \@ref(hyp:CoarseNetwork), \@ref(hyp:SymAllocHyp), \@ref(hyp:SmoothResponseHyp), \@ref(hyp:independence2StepCluster), and \@ref(hyp:TwoStepClusterValidity), the sign of $A''$ is identified by the following quantity:
\begin{align*}
\text{sign}(A'') & = \text{sign}\left(\frac{\esp{Y_i|p^R_c=1}-\esp{Y_i|p^R_c=p^*}}{1-p^*}-\frac{\esp{Y_i|p^R_c=p^*}-\esp{Y_i|p^R_c=0}}{p^*}\right).
\end{align*}
```
```{proof}
See Section \@ref(proofIdentSmoothSymAlloc).
```
One thing that is pretty amazing is that we can relate the sign of $A''$ to the relative size of contagion and amplification effects:
```{theorem,ContagionDiffusionSmoothSymAlloc,name="The sign of $A''$ depends on the relative size of contagion vs amplification effects"}
Under Assumptions \@ref(hyp:PropSpecifyExpMap), \@ref(hyp:CoarseNetwork), \@ref(hyp:SymAllocHyp), \@ref(hyp:SmoothResponseHyp), \@ref(hyp:independence2StepCluster), and \@ref(hyp:TwoStepClusterValidity), we have:
\begin{align*}
\text{sign}(A'') & = \text{sign}\left(\frac{\esp{Y^{1,1}_i-Y^{1,p^*}_i}}{1-p^*} -\frac{\esp{Y^{0,p^*}_i-Y^{0,0}_i}}{p^*}\right),
\end{align*}
where $\esp{Y^{1,1}_i-Y^{1,p^*}_i}$ measures the strength of amplification effects and $\esp{Y^{0,p^*}_i-Y^{0,0}_i}$ measures the strength of contagion effects.
```
```{proof}
See Section \@ref(proofContagionDiffusionSmoothSymAlloc).
```
Theorem \@ref(thm:ContagionDiffusionSmoothSymAlloc) suggests an alternative identification strategy for the sign of $A''$:
```{theorem,IdentContagionDiffusionSmoothSymAlloc,name="Identifying the sign of $A''$ from the relative size of contagion and amplification effects"}
Under Assumptions \@ref(hyp:PropSpecifyExpMap), \@ref(hyp:CoarseNetwork), \@ref(hyp:SymAllocHyp), \@ref(hyp:SmoothResponseHyp), \@ref(hyp:independence2StepCluster), and \@ref(hyp:TwoStepClusterValidity), we have:
\begin{align*}
\text{sign}(A'') & = \text{sign}\left(\frac{\esp{Y_i|R_i=1,p^R_c=1}-\esp{Y_i|R_i=1,p^R_c=p^*}}{1-p^*}\right.\\
& \phantom{=\text{sign}\left(\right.}\left.-\frac{\esp{Y_i|R_i=0,p^R_c=p^*}-\esp{Y_i|R_i=0,p^R_c=0}}{p^*}\right)
\end{align*}
```
```{proof}
The proof is immediate using Theorem \@ref(thm:ContagionDiffusionSmoothSymAlloc) and Assumptions \@ref(hyp:independence2StepCluster) and \@ref(hyp:TwoStepClusterValidity).
```
Thanks to Theorems \@ref(thm:IdentSmoothSymAlloc) and \@ref(thm:IdentContagionDiffusionSmoothSymAlloc), we therefore have two ways to estimate the sign of $A''$: either by comparing the overall changes in expected outcomes when moving from $0$ to $p^*$ and from $p^*$ to $1$, or by comparing the relative size of amplification and contagion effects.
As a consequence, we can form two with/without estimators of $A''$:
\begin{align*}
\hat{A}''_{All}(\frac{1}{2}) & = \frac{\frac{\sum_{i\in\mathcal{I}_{ST}}Y_i}{N_{ST}}-\frac{\sum_{i\in\mathcal{I}_{PT}}Y_i}{N_{PT}}}{1-p^*}-
\frac{\frac{\sum_{i\in\mathcal{I}_{PT}}Y_i}{N_{PT}}-\frac{\sum_{i\in\mathcal{I}_{SC}}Y_i}{N_{SC}}}{p^*}\\
\hat{A}''_{Diff}(\frac{1}{2}) & = \frac{\frac{\sum_{i\in\mathcal{I}_{ST}}Y_i}{N_{ST}}-\frac{\sum_{i\in\mathcal{I}^1_{PT}}Y_i}{N^1_{PT}}}{1-p^*}-
\frac{\frac{\sum_{i\in\mathcal{I}^0_{PT}}Y_i}{N^0_{PT}}-\frac{\sum_{i\in\mathcal{I}_{SC}}Y_i}{N_{SC}}}{p^*},
\end{align*}
with $\mathcal{I}_{T}$, $T\in\left\{ST,SC,PT\right\}$, the set of units $i$ that belong to a cluster of type $T$, $\mathcal{I}^d_{PT}$, $d\in\left\{0,1\right\}$, the set of units that belong to a cluster of type $PT$ and have $R_i=d$, $N_{T}$, $T\in\left\{ST,SC,PT\right\}$, the number of units belonging to clusters of type $T$, and $N^d_{PT}$, $d\in\left\{0,1\right\}$, the number of units that belong to clusters of type $PT$ and have $R_i=d$.
Following usual arguments, these estimators are both unbiased and consistent (as the number of clusters goes to infinity) for $A''(\frac{1}{2})$.
Their components can both be estimated separately by using OLS with a linear model on separate subsamples.
The covariance of each separate with/without comparison can be obtained by estimating both components jointly, for example using the following models estimated by OLS:
\begin{align*}
Y_i & = \alpha^{All} + \beta^{All}_{PT}\uns{i\in\mathcal{I}_{PT}} + \beta^{All}_{ST}\uns{i\in\mathcal{I}_{ST}} + \epsilon_i^{All}\\
Y_i & = \alpha^{Diff} + \alpha^{Diff}_{1}\uns{R_i=1} + \beta^{Diff}_{0}\uns{i\in\mathcal{I}^0_{PT}}+ \beta^{Diff}_{1}\uns{i\in\mathcal{I}_{ST}} + \epsilon_i^{Diff}.
\end{align*}
With these parameter estimates, we have:
\begin{align*}
\hat{A}''_{All}(\frac{1}{2}) & = \frac{\hat\beta^{All}_{ST}-\hat\beta^{All}_{PT}}{1-p^*}-\frac{\hat\beta^{All}_{PT}}{p^*}\\
\hat{A}''_{Diff}(\frac{1}{2}) & = \frac{\hat\beta^{Diff}_{1}}{1-p^*}-\frac{\hat\beta^{Diff}_{0}}{p^*}.
\end{align*}
To estimate the precision of each of the parameters, one has to use standard errors clustered at the cluster level.
To obtain the precision of $\hat{A}''(\frac{1}{2})$, one can simply use the Delta Method.
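As an illustration, the $\hat{A}''_{All}$ comparison can be sketched in a few lines of Python (function names are ours; in practice one would use the OLS formulation above with clustered standard errors):

```python
def a_second_all(outcomes, p_star):
    """With/without estimator of the sign of A'': compares average outcomes
    across Super Treated (ST), Partially Treated (PT) and Super Control (SC)
    clusters, as in the formula for A''_All above.
    `outcomes` maps 'ST'/'PT'/'SC' to the pooled list of unit outcomes."""
    mean = lambda ys: sum(ys) / len(ys)
    y_st, y_pt, y_sc = (mean(outcomes[g]) for g in ('ST', 'PT', 'SC'))
    return (y_st - y_pt) / (1 - p_star) - (y_pt - y_sc) / p_star
```

With a convex response to the share of treated (increasing returns), the estimator comes out positive; with a linear response it is centered on zero.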
```{remark}
Note that in practice, the actual proportion of treated in each cluster of type $PT$ is going to differ from $p^*$.
Does this affect consistency and unbiasedness of both estimators?
Could we estimate $\hat p^*$ and try to use it to get access to a wider share of the $A$ function, or at least to an average effect?
See Davide's discussion of that issue.
```
### Identifying optimal treatment levels
In the previous section, we discussed ways of identifying diffusion effects, and we focused on the task of finding the optimal treatment allocation when total treatment capacity was fixed to a limited number of treatments.
In that scenario, what turned out to be super important was the shape of the returns to treatment effort (convex or concave), and it turned out to be related to whether contagion or amplification effects dominated.
Though this scenario of constant treatment effort sometimes happens in real life, in other situations, policymakers might want to decide whether to increase or decrease their treatment effort, and to find the optimal treatment level, taking into account diffusion effects.
This is the goal of this section, which is fully based on Davide [Viviano (2023)](http://arxiv.org/abs/2011.08174)'s recent working paper on the topic.
#### Setup and assumptions
Davide considers a setting with $K$ clusters of equal size $N$.
Researchers sample a proportion $\lambda\in\left]0,1\right]$ of the $N$ units in each cluster at each period $t$ and they have access to the following information for the sampled observations in each cluster: $\left(Y^{(k)}_{i,t},X^{(k)}_{i},D^{(k)}_{i,t}\right)_{i=1}^n$ where $n=\lambda N$ and $X^{(k)}_{i}$ are baseline characteristics.
There are $T$ periods.
Although the data may be a panel or a repeated cross section, we denote observations as if there were repeated sampling.
Potential outcomes are denoted $Y^{(k)}_{i,t}(\mathbf{D}_1^{(k)},\dots,\mathbf{D}_t^{(k)})$, where $\mathbf{D}_s^{(k)}\in\left\{0,1\right\}^N$, and $s\leq t$.
We denote $Y^{(k)}(.)$ and $X^{(k)}$ the vectors of potential outcomes and covariates in cluster $k$.
The key policy parameter that we are going to be after is a treatment rule, $\pi(.;\beta):\mathcal{X}\rightarrow\left[0,1\right]$, indexed by a (possibly vector-valued) parameter $\beta$ which lies in a compact set.
The treatment rule selects the probability of allocating an agent with characteristics $x$ at date $t$ in cluster $k$ to the treatment.
We would like to choose $\pi$ so that we maximize an objective function, for example total program returns net of program implementation costs.
In order to determine this optimal function, we are going to run two-step clustered experiments.
These experiments are as follows:
```{hypothesis,TreatAssignDavide,name='Treatment Assignment in the experiment'}
For $\beta_{k,t}\Ind\left(X^{(k)},Y^{(k)}(.)\right)$,
\begin{align*}
D^{(k)}_{i,t}|X^{(k)},Y^{(k)}(.),\beta_{k,t}\sim_{i.n.i.d.}\mathcal{B}(\pi(X^{(k)}_{i};\beta_{k,t})).
\end{align*}
```
Assumption \@ref(hyp:TreatAssignDavide) implies that the allocation of treatment follows a Bernoulli distribution indexed by parameters $\beta_{k,t}$, and can be different for individuals with different baseline characteristics.
```{example}
Examples of experimental allocation rules are the equal probability rule: $\pi(.;\beta)=\beta\in\left[0,1\right]$ or targeted treatments $\pi(x;\beta)=\beta_x$.
The treatment rules can also be made conditional on cluster characteristics.
```
Davide now needs another assumption about the data generating process:
```{hypothesis,DGPDavide,name='Data generating process'}
For any $(i,t,k)$, we assume that:
\begin{enumerate}[(i)]
\item $Y^{(k)}_{i,t}(\mathbf{D}_1^{(k)},\dots,\mathbf{D}_t^{(k)})$ is constant in $\mathbf{D}_1^{(k)},\dots,\mathbf{D}_{t-1}^{(k)}$ and $X_i^{(k)}\sim F_{X}$,\label{it:noCarryOver}
\item Under an assignment with parameter $\beta_{k,t}$, we have:\label{it:FunctForm}
\begin{align*}
\espsub{Y^{(k)}_{i,t}|D^{(k)}_{i,t}=d,X_i^{k}=x}{\beta_{k,t}} & = m(d,x,\beta_{k,t}) + \alpha_t + \tau_k,
\end{align*}
\item $Y_{i,t}^{(k)}\Ind\left\{Y_{j,t}^{(k)}\right\}_{j\notin\mathcal{I}_i^{(k)}}|\beta_{k,t}$, with $\left|\mathcal{I}_i^{(k)}\right|\leq 2\gamma_N$ and $\gamma_N\geq 1$.\label{it:degreeLim}
\end{enumerate}
```
Assumption \@ref(hyp:DGPDavide) says that there are no carryover effects of the treatment beyond the period in which it is assigned (\@ref(it:noCarryOver)), that there are no interactions between the treatment and time and cluster fixed effects (\@ref(it:FunctForm)), and finally that outcomes depend on at most $\gamma_N$ other outcomes in the same cluster.
We are now equipped to define welfare as a function of the parameters of the allocation rule:
```{definition,WelfareDavide,name='Welfare'}
For treatments as in Assumption \@ref(hyp:TreatAssignDavide), and under the assumptions on the d.g.p. in Assumption \@ref(hyp:DGPDavide), we can define welfare as $W(\beta)=\int y(x,\beta)dF_X(x)$, with $y(x,\beta)=\pi(x;\beta)m(1,x,\beta)+(1-\pi(x;\beta))m(0,x,\beta)$,
```
with $y(x,\beta)$ the outcome net of costs.
Equipped with this definition, and assuming all functions are differentiable, we can define the direct effect of the treatment ($\Delta(x,\beta)$), the marginal spillover effect ($S(d,x,\beta)$), the marginal policy effect ($M(\beta)$) and the welfare optimizing policy ($\beta^*$) as follows:
\begin{align*}
\Delta(x,\beta) & = m(1,x,\beta)-m(0,x,\beta)\\
S(d,x,\beta) & = \partder{m(d,x,\beta)}{\beta}\\
M(\beta) & = \partder{W(\beta)}{\beta}\\
& = \int\left[S(0,x,\beta)+\pi(x;\beta)(S(1,x,\beta)-S(0,x,\beta))\right.\\
& \phantom{=\int\left[\right.}\left.+\partder{\pi(x,\beta)}{\beta} \Delta(x,\beta)\right]dF_X(x)\\
\beta^* & = \arg\sup_{\beta}W(\beta).
\end{align*}
```{example}
A first example that Davide gives is the case of positive externalities with decreasing returns from neighbours' treatments.
We pose $D^{(k)}_{i,t}\sim_{i.i.d.}\mathcal{B}(\beta)$, and $\mathcal{N}_i$ is the set of neighbours of individual $i$.
We let
\begin{align*}
Y^{(k)}_{i,t} & = \alpha_t + D^{(k)}_{i,t}\phi_1 + \frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\phi_2
-\left(\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\right)^2\phi_3 + \nu_{i,t}
\end{align*}
In that case, assuming that $\left|\mathcal{N}_i\right|\sim\mathcal{D}_N$, we have:
\begin{align*}
\espsub{Y^{(k)}_{i,t}|\alpha_t,D^{(k)}_{i,t}=d}{\beta}
& = \alpha_t + d\phi_1
+ \espsub{\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}}{\beta}\phi_2
-\espsub{\left(\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\right)^2}{\beta}\phi_3 \\
& = \alpha_t + d\phi_1
+ \esp{\espsub{\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{n}}{\beta}|\left|\mathcal{N}_i\right|}\phi_2
-\esp{\espsub{\left(\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{n}\right)^2}{\beta}|\left|\mathcal{N}_i\right|}\phi_3 \\
& = \alpha_t + d\phi_1
+ \esp{\frac{\left|\mathcal{N}_i\right|\beta}{\left|\mathcal{N}_i\right|}|\left|\mathcal{N}_i\right|}\phi_2
-\esp{\frac{\left|\mathcal{N}_i\right|\beta(1-\beta)+\left|\mathcal{N}_i\right|^2\beta^2}{\left|\mathcal{N}_i\right|^2}|\left|\mathcal{N}_i\right|}\phi_3 \\
& = \alpha_t + d\phi_1 + \beta\phi_2-\beta\phi_3\left(\beta+(1-\beta)\esp{\frac{1}{\left|\mathcal{N}_i\right|}}\right)
\end{align*}
As a consequence, we have:
\begin{align*}
m(d,x,\beta) & = d\phi_1 + \beta\phi_2-\beta\phi_3\left(\beta+(1-\beta)\esp{\frac{1}{\left|\mathcal{N}_i\right|}}\right)\\
\Delta(x,\beta) & = \phi_1\\
S(d,x,\beta) & = \phi_2-\phi_3(2\beta+(1-2\beta)\esp{\frac{1}{\left|\mathcal{N}_i\right|}})\\
y(x,\beta) & = \beta\phi_2-\beta\phi_3\left(\beta+(1-\beta)\esp{\frac{1}{\left|\mathcal{N}_i\right|}}\right)+\beta(\phi_1-c)\\
M(\beta) & = \phi_2-\phi_3(2\beta+(1-2\beta)\esp{\frac{1}{\left|\mathcal{N}_i\right|}})+\phi_1-c.
\end{align*}
```
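Under the closed form above, an interior optimum solves $M(\beta^*)=0$. A small Python sketch makes this explicit (the parameter values in the test are illustrative choices of ours; we write $\kappa$ for $\esp{\frac{1}{\left|\mathcal{N}_i\right|}}$):

```python
def marginal_policy_effect(beta, phi1, phi2, phi3, c, kappa):
    """M(beta) in the decreasing-returns example; kappa = E[1/|N_i|]."""
    return phi2 - phi3 * (2 * beta + (1 - 2 * beta) * kappa) + phi1 - c

def beta_star(phi1, phi2, phi3, c, kappa):
    """Interior optimum solving M(beta*) = 0, valid when it lands in [0, 1]."""
    return (phi1 + phi2 - c - phi3 * kappa) / (2 * phi3 * (1 - kappa))
```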
```{example}
Let us now look at an example with negative externalities.
We similarly have $D^{(k)}_{i,t}\sim_{i.i.d.}\mathcal{B}(\beta)$, but now outcomes are negatively affected by the proportion of treated, and all the more so if they are treated themselves:
\begin{align*}
Y^{(k)}_{i,t} & = \alpha_t + D^{(k)}_{i,t}\phi_1 - \frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\phi_2
-D^{(k)}_{i,t}\frac{\sum_{j\in\mathcal{N}_i}D^{(k)}_{j,t}}{\left|\mathcal{N}_i\right|}\phi_3 + \nu_{i,t}
\end{align*}
In that case, after similar manipulations, we have $y(x,\beta)=\beta(\phi_1-c-\phi_2-\phi_3\beta)$.
```
```{remark}
One open question is whether Assumption \@ref(hyp:DGPDavide) is compatible with any real-looking network.
Davide formalizes a nice proposition that shows that indeed this assumption can be rationalized by an actual network formation model.
Let units be spaced on a latent space, and each unit can interact with at most the $\sqrt{\gamma_N}$ closest units.
Let $\uns{i_k\leftrightarrow j_{k}}$ denote whether or not $i$ and $j$ can be connected in the resulting latent network.
Let $\mathcal{I}_k$ be the matrix of these potential connections in cluster $k$.
Davide's first assumption is:
```
```{hypothesis,iidConnect,name='Network'}
Actual connections are generated as follows:
\begin{align*}
A_{i,j}^{(k)} & = l(X^{(k)}_{i},X^{(k)}_{j},U^{(k)}_{i},U^{(k)}_{j})\uns{i_k\leftrightarrow j_{k}},
\end{align*}
for some function $l$ and unobservables $U^{(k)}_{i}$ with $\left(X^{(k)}_{i},U^{(k)}_{i}\right)|\mathcal{I}_k\sim F_{U|X}F_{X}$ and with $\sum_{j=1}^N\uns{i_k\leftrightarrow j_{k}}=\sqrt{\gamma_N}$.
```
The second assumption Davide makes is on how potential outcomes are generated:
```{hypothesis,PoNetwork,name='Potential outcomes'}
Potential outcomes are generated as follows:
\begin{align*}
Y^{(k)}_{i,t}(\mathbf{D}_1^{(k)},\dots,\mathbf{D}_t^{(k)}) & = r(D_{i,t}^{(k)},\mathbf{D}_{\mathcal{N}_i^{k},t}^{(k)},X^{(k)}_{i},X^{(k)}_{\mathcal{N}_i^{k},t},U^{(k)}_{i},U^{(k)}_{\mathcal{N}_i^{k},t},A^{(k)}_{i,.},|\mathcal{N}_i^{k}|,\nu^{(k)}_{i,t})+\tau_k+\alpha_t,
\end{align*}
for some function $r$ which attains the same value for any permutations of the entries of $A^{(k)}_{i,.}$, with $A^{(k)}_{i,.}$ the vector of connections of $i$ in $(k)$, and unobservables $\nu^{(k)}_{i,t}|\left(X^{(k)}_{i},U^{(k)}_{i}\right)\sim F_{\nu}$, and where $\mathcal{N}_i^{k}=\left\{j:A^{(k)}_{i,j}>0\right\}$.
```
Davide can then prove that this setting implies Assumption \@ref(hyp:DGPDavide):
```{proposition,DavideEquiv,name='Microfoundation of d.g.p.'}
With a treatment assigned following Assumption \@ref(hyp:TreatAssignDavide), if Assumptions \@ref(hyp:iidConnect) and \@ref(hyp:PoNetwork) hold, then Assumption \@ref(hyp:DGPDavide) holds.
```
```{proof}
See [Viviano (2023)](http://arxiv.org/abs/2011.08174), Section B.1.2.
```
#### Identifying and estimating the marginal policy effect with a one-wave experiment
Davide proposes a one-wave experiment to get at the marginal policy effect.
Here is the algorithm he proposes, with $p_1=1$:
1. Organize clusters into $G=\frac{K}{2}$ pairs with consecutive indexes $\left\{k,k+1\right\}$
2. At $t=0$, either nobody receives the treatment, or treatment is assigned using rule $\pi(.;\beta)$.
Collect baseline outcomes and observe $(Y_{i,0}^{(h)},X_{i}^{(h)})_{i=1}^N$, for $h\in\left\{1,\dots,K\right\}$.
3. At $t=1$, start the experiment:
* For each pair $g=\left\{k,k+1\right\}$, randomize
\begin{align*}
D_{i,1}^{(h)}|\beta,X_i^{(h)}=x & \sim \begin{cases}
\mathcal{B}(\pi(x,\beta+\eta_n\underline{e}_1)) & \text{ if } h=k\\
\mathcal{B}(\pi(x,\beta-\eta_n\underline{e}_1)) & \text{ if } h=k+1
\end{cases}
\end{align*}
with $\bar{C}n^{-\frac{1}{2}}<\eta_n<\bar{C}n^{-\frac{1}{4}}$, and $\underline{e}_j=\left[0,\dots,0,1,0,\dots,0\right]$, where $\underline{e}_j\in\left\{0,1\right\}^p$, and $\underline{e}_j[j]=1$.
* For $n$ units in cluster $h$ observe $Y_{i,1}^{(h)}$
* Estimate the marginal effect as follows:
\begin{align*}
\bar{M}_n(\beta) & = \frac{1}{G}\sum_{g=1}^G\widehat{M}_g(\beta) \\
\widehat{M}_g(\beta) & =
\frac{1}{2\eta_n}\left[\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,1}-\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,0}\right]
-\frac{1}{2\eta_n}\left[\frac{1}{n}\sum_{i=1}^nY^{(k+1)}_{i,1}-\frac{1}{n}\sum_{i=1}^nY^{(k+1)}_{i,0}\right]
\end{align*}
4. Construct the following test statistic
\begin{align*}
\mathcal{T}_n & = \sqrt{G}\frac{\bar{M}_n(\beta)}{\sqrt{\frac{1}{G-1}\sum_{g=1}^G(\widehat{M}_g(\beta)-\bar{M}_n(\beta))^2}}
\end{align*}
to test whether the current allocation is optimal.
Indeed, if $\beta^*=\arg\max_{\beta} W(\beta)$ is an interior point, then $W(\beta)=W(\beta^*)\Rightarrow M(\beta)[j]=0$, $\forall j\in\left\{1,\dots,p_1\right\}$, with $p_1\leq p$.
5. Construct tests $\uns{\left|\mathcal{T}_n\right|> \text{cv}_{G}(\alpha)}$, with size $\alpha$ and critical values obtained by permuting the signs of the estimated marginal effects.
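The paired estimator and randomization test of steps 3 to 5 can be sketched in a few lines of Python (names are ours; the sign-permutation test enumerates all $2^G$ sign flips, which is only feasible for small $G$):

```python
import itertools

def pair_marginal_effects(y1, y0, eta):
    """One marginal-effect estimate per pair: difference-in-differences of
    cluster means, clusters (2g, 2g+1) having been assigned beta + eta and
    beta - eta. y1[h], y0[h]: post- and baseline mean outcomes of cluster h."""
    return [((y1[k] - y0[k]) - (y1[k + 1] - y0[k + 1])) / (2 * eta)
            for k in range(0, len(y1), 2)]

def permutation_pvalue(m_hats):
    """Exact sign-permutation p-value for H0: M(beta) = 0."""
    G = len(m_hats)
    def tstat(ms):
        mbar = sum(ms) / G
        s2 = sum((m - mbar) ** 2 for m in ms) / (G - 1)
        return abs(G ** 0.5 * mbar / s2 ** 0.5)
    t_obs = tstat(m_hats)
    flipped = [tstat([s * m for s, m in zip(signs, m_hats)])
               for signs in itertools.product((1, -1), repeat=G)]
    return sum(t >= t_obs for t in flipped) / len(flipped)
```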
```{remark}
Davide's approach runs a pairwise randomized controlled trial similar to the ones we studied in Section \@ref(PairRCT), but at the cluster level.
Each cluster within a pair is allocated a slightly different value of the allocation parameter, with a perturbation around the current allocation level (or around a $\beta$ of interest).
Davide then proposes a DID estimator to get rid of the time and cluster fixed effects (mostly for precision, since they do not affect consistency, at least they do not seem to).
He estimates a test statistic for whether the average marginal effect is zero across clusters, which is a necessary condition for being at the optimum (and a sufficient one if we assume sufficiency).
```
```{remark}
Davide's approach also recovers several other important treatment effects:
```
\begin{align*}
\bar{W}_n(\beta) & = \frac{1}{K}\sum_{k=1}^K\left(\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,1}-\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,0}\right) \\
\bar{\Delta}_n(\beta) & = \frac{1}{G}\sum_{g=1}^G\hat{\Delta}_g(\beta) \\
\hat{\Delta}_{g}(\beta) & = \frac{1}{2n}\sum_{h\in\left\{k,k+1\right\}}\sum_{i=1}^n
\left[\frac{D^{(h)}_{i,1}Y^{(h)}_{i,1}}{\pi(X_i^{(h)};\beta+\eta_n\nu_h\underline{e}_1)}
-\frac{(1-D^{(h)}_{i,1})Y^{(h)}_{i,1}}{1-\pi(X_i^{(h)};\beta+\eta_n\nu_h\underline{e}_1)}\right]\\
\bar{S}_n(1,\beta) & = \frac{1}{G}\sum_{g=1}^G\hat{S}_{g}(1,\beta) \\
\hat{S}_{g}(1,\beta) & = \frac{1}{2n}\sum_{h\in\left\{k,k+1\right\}}\frac{\nu_h}{\eta_n}\sum_{i=1}^n
\left[\frac{D^{(h)}_{i,1}Y^{(h)}_{i,1}}{\pi(X_i^{(h)};\beta+\eta_n\nu_h\underline{e}_1)}
-\frac{1}{n}\sum_{i=1}^nY^{(h)}_{i,0}\right]\\
\bar{S}_n(0,\beta) & = \frac{1}{G}\sum_{g=1}^G\hat{S}_{g}(0,\beta) \\
\hat{S}_{g}(0,\beta) & = \frac{1}{2n}\sum_{h\in\left\{k,k+1\right\}}\frac{\nu_h}{\eta_n}\sum_{i=1}^n
\left[\frac{(1-D^{(h)}_{i,1})Y^{(h)}_{i,1}}{1-\pi(X_i^{(h)};\beta+\eta_n\nu_h\underline{e}_1)}
-\frac{1}{n}\sum_{i=1}^nY^{(h)}_{i,0}\right]\\
\nu_h & = \begin{cases} 1 & \text{ if } h=k \\ -1 & \text{ if } h=k+1 \end{cases} \\
\end{align*}
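For instance, the inverse-propensity-weighted direct-effect estimator $\hat{\Delta}_g(\beta)$ can be sketched as follows (a toy Python version, with names of our choosing):

```python
def direct_effect_pair(data, pi, eta):
    """IPW estimator of the direct effect for one pair of clusters.
    data[h]: list of (d_i, y_i, x_i) tuples for cluster h in {0, 1},
    where cluster 0 received beta + eta and cluster 1 beta - eta;
    pi(x, shift) is the assignment probability pi(x; beta + shift)."""
    total, n = 0.0, len(data[0])
    for h, nu in ((0, 1), (1, -1)):
        for d, y, x in data[h]:
            p = pi(x, nu * eta)
            # treated units weighted by 1/p, controls by 1/(1-p)
            total += d * y / p - (1 - d) * y / (1 - p)
    return total / (2 * n)
```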
```{remark}
Note that Davide's approach is politically much more palatable than a 2-step clustered design with super controls and super treated clusters.
```
```{remark}
At the same time, note that Davide's approach only gives the direction in which to change $\beta$ but does not deliver an optimal $\beta^*$, unless we are already there.
```
```{remark}
Davide also proves several theorems ensuring that the estimators above are consistent under some reasonable assumptions and asymptotically normal.
The randomization tests also have correct coverage.
```
#### Identifying and estimating the optimal treatment allocation
Davide also proposes an algorithm to sequentially converge to the optimal treatment level.
Here is Davide's proposed algorithm for $p_1=1$, with $\beta\in\left[\underline{\beta},\overline{\beta}\right]$:
1. Organize clusters into pairs $\left\{k,k+1\right\}$, with $k\in\left\{1,3,\dots,K-1\right\}$;
2. At $t=0$, treatment is assigned using rule $D_{i,0}^{(h)}|X_i^{(h)}=x\sim\mathcal{B}(\pi(x;\beta_0))$, $\forall h\in\left\{1,\dots,K\right\}$.
Collect baseline outcomes and observe $(Y_{i,0}^{(h)},X_{i}^{(h)})_{i=1}^N$, for $h\in\left\{1,\dots,K\right\}$.
Initialize $\widehat{M}_{k,0}=0$, $\tilde{\beta}_k^0=\beta_0$.
3. while $1\leq t\leq T$, do:
* Define
\begin{align*}
\tilde{\beta}_{h}^{(t)} & = \mathcal{P}_{\underline{\beta},\overline{\beta}-\eta_n}(\tilde{\beta}_h^{(t-1)}+\alpha_{h+2,t}\widehat{M}_{h+2,t-1})
\end{align*}
with the convention that cluster indices wrap around (replace $h+2$ by $h+2-K$ whenever $h+2>K$), $\alpha_{k,t}$ the learning rate and $\mathcal{P}_{a,b}(x)=\arg\min_{x'\in\left[a,b\right]^p}||x-x'||^2$.
* Assign treatments as (for $\bar{C}n^{-\frac{1}{2}}<\eta_n<\bar{C}n^{-\frac{1}{4}}$):
\begin{align*}
D_{i,t}^{(h)}|X_i^{(h)}=x & \sim\mathcal{B}(\pi(x;\beta_{h,t}))\\
\beta_{h,t} & = \begin{cases}
\tilde{\beta}_{h}^{(t)}+\eta_n & \text{ if } h \text{ is odd}\\
\tilde{\beta}_{h}^{(t)}-\eta_n & \text{ if } h \text{ is even}
\end{cases}
\end{align*}
* For $n$ units in cluster $h$ observe $Y_{i,t}^{(h)}$
* For each pair $\left\{k,k+1\right\}$, estimate the marginal effect as follows:
\begin{align*}
\widehat{M}_{k,t}=\widehat{M}_{k+1,t} & =
\frac{1}{2\eta_n}\left[\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,t}-\frac{1}{n}\sum_{i=1}^nY^{(k)}_{i,0}\right]
-\frac{1}{2\eta_n}\left[\frac{1}{n}\sum_{i=1}^nY^{(k+1)}_{i,t}-\frac{1}{n}\sum_{i=1}^nY^{(k+1)}_{i,0}\right]
\end{align*}
4. End while.
5. Return $\hat{\beta}^*=\frac{1}{K}\sum_{k=1}^K\tilde{\beta}^T_{k}$
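To fix ideas, here is a toy Python sketch of the full sequential design (our own stylized version: `run_wave` stands for fielding one experimental wave and returning the change in mean outcomes of each cluster, and the learning rate `lr` is passed as a function):

```python
def adaptive_experiment(run_wave, K, T, beta0, eta, lr, lo=0.0, hi=1.0):
    """Sketch of the sequential design: each pair perturbs its current beta by
    +/- eta and updates it with the marginal effect estimated in the *next*
    pair (indices wrap around), keeping updates independent of own outcomes."""
    betas = [beta0] * K
    m_hat = [0.0] * K
    for t in range(1, T + 1):
        # gradient-ascent update using the neighbouring pair's estimate,
        # projected on the feasible interval
        betas = [min(max(betas[h] + lr(h, t) * m_hat[(h + 2) % K], lo), hi - eta)
                 for h in range(K)]
        perturbed = [b + eta if h % 2 == 0 else b - eta
                     for h, b in enumerate(betas)]
        dy = run_wave(perturbed)               # observed outcome changes
        for k in range(0, K, 2):               # one estimate per pair
            m = (dy[k] - dy[k + 1]) / (2 * eta)
            m_hat[k] = m_hat[k + 1] = m
    return sum(betas) / K
```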
```{remark}
The algorithm simply mimics gradient ascent on the welfare function.
One twist is that it uses, as an estimate of the gradient, the marginal treatment effect estimated in another pair of clusters.
This ensures that there will be no overfitting: the choices of treatment level at each stage remain independent of the potential outcomes and covariates in the cluster.
Using all pairs but the treated pair would not work either (see Appendix B.1.4 in [Davide's paper](http://arxiv.org/abs/2011.08174)).
```
```{remark}
When $p_1>1$, the algorithm is split in $\frac{T}{p_1}$ sub-waves of length $p_1$, where we move each coordinate sequentially before moving to the next wave.
```
```{remark}
How to choose the optimal learning rate $\alpha_{k,t}$?
Under strong concavity of the objective function, the learning rate should be of order $\frac{J}{t}$, with for example $J\in\left[0.1,0.2\right]$ when $\beta$ is a proportion.
A more robust choice with moderate to large $T$ is:
```
\begin{align*}
\alpha_{k,t} & = \begin{cases}
\frac{J}{T^{\frac{1-\nu}{2}}||\widehat{M}_{k,t}||} & \text{ if }||\widehat{M}_{k,t}||_2^2>\frac{\kappa}{T^{1-\nu}}-\epsilon_n,\\
0 & \text{ otherwise }
\end{cases}
\end{align*}
for $\epsilon_n>0$, $\epsilon_n\rightarrow 0$, and small constants $\nu\leq 1$, $J$, $\kappa>0$.
This approach of dividing the estimated marginal effect by its norm is called gradient norm rescaling and guarantees control of out-of-sample regret under strict quasi-concavity.
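This learning rate rule can be sketched as follows (scalar case, with constants of our choosing):

```python
def rescaled_learning_rate(m_hat, T, J=0.15, kappa=0.05, nu=0.5, eps=1e-3):
    """Gradient-norm rescaling: step of order J / T^((1 - nu) / 2) in the
    direction of the estimated marginal effect, set to zero once the
    estimated gradient is small enough (we are close to the optimum)."""
    norm = abs(m_hat)  # Euclidean norm, here in the scalar case
    if norm ** 2 > kappa / T ** (1 - nu) - eps:
        return J / (T ** ((1 - nu) / 2) * norm)
    return 0.0
```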
```{remark}
Under the assumption that $W(\beta)$ is $\sigma$-strongly concave, for some strictly positive $\sigma$, and under additional technical assumptions, the distance between $\hat{\beta}^*$ and the optimal $\beta^*$ is arbitrarily small, as well as the distance between $W(\beta^*)$ and $W(\hat{\beta}^*)$.
Davide also shows that regret can be made arbitrarily small in his approach, both in- and out-of-sample.
```
## Diffusion effects with detailed networks
We are now going to study what happens when we have access to detailed network information.
We observe the contiguity matrix $A$, or at least all the relevant links for each member of our sample and the treatment status of each peer of our sample members.
The analysis of such data is going to closely follow the treatment by [Michael Leung (2020)](https://doi.org/10.1162/rest_a_00818).
### Setting
We consider a network of total size $n$, where connections are represented by the matrix $A$, which is such that there are no self-links ($A_{i,i}=0$, $\forall i\in\left\{1,\dots,n\right\}$).
For each unit $i$ we observe $D_i$, $Y_i$, $\gamma_i=\sum_{j=1}^nA_{i,j}$ ($i$'s *degree*, or number of neighbors), and $T_i=\sum_{j=1}^nA_{i,j}D_j$ (the number of $i$'s neighbors that are treated).
We posit a treatment response function that is as follows:
\begin{align*}
Y_i & = r(D_i,T_i,\gamma_i,\epsilon_i),
\end{align*}
where $\epsilon_i\in\mathbb{R}^{d_{\epsilon}}$ are unobserved influences to outcomes, and $r$ is a function.
One specification of $r$ is the linear first-degree influence model:
\begin{equation}
Y_i = \beta_1 + \beta_2D_i + \beta_3\frac{T_i}{\gamma_i} + \epsilon_i
(\#eq:linearnetworkmodel)
\end{equation}
where outcomes depend linearly on the proportion of neighbors treated.
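To see how this model can be taken to data, here is a self-contained Python simulation (our own illustrative sketch: an Erdős–Rényi graph, zero noise so the OLS fit is exact, and a hand-rolled solver for the normal equations to avoid dependencies):

```python
import random

def simulate_and_fit(n=200, beta=(1.0, 2.0, 1.5), p=0.5, seed=3):
    """Simulate Y_i = b1 + b2 D_i + b3 T_i / gamma_i on a random graph and
    recover (b1, b2, b3) by OLS."""
    rng = random.Random(seed)
    A = [[0] * n for _ in range(n)]          # symmetric adjacency, no self-links
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < 0.05:
                A[i][j] = A[j][i] = 1
    D = [int(rng.random() < p) for _ in range(n)]
    rows, ys = [], []
    for i in range(n):
        g = sum(A[i])                        # degree gamma_i
        if g == 0:
            continue                         # T_i / gamma_i undefined
        t = sum(A[i][j] * D[j] for j in range(n))
        rows.append((1.0, float(D[i]), t / g))
        ys.append(beta[0] + beta[1] * D[i] + beta[2] * t / g)
    # solve the 3x3 normal equations X'X b = X'y by Gauss-Jordan elimination
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(3)] for a in range(3)]
    Xty = [sum(r[a] * y for r, y in zip(rows, ys)) for a in range(3)]
    M = [XtX[a] + [Xty[a]] for a in range(3)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(3):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[a][3] / M[a][a] for a in range(3)]
```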
```{remark}
Michael's model implicitly imposes that diffusion effects can only stem from direct connections.
Connections further away on the network (friends of friends) have no direct effect on $i$'s outcome.
This assumption can be relaxed, as long as there is a maximum distance $K$ on the network beyond which neighbors' treatments have no effect on $i$'s outcome.
In our formulation so far, $K=1$.
```
```{remark}
We collect data either by observing the full network or, more often, by conducting snowball sampling of $1$-neighborhoods.
In this sampling strategy, we first randomly select a set of $\tilde{n}\leq n$ focal units on which we collect $\left(Y_i,D_i\right)$ and the identity of their neighbors, which gives us $\gamma_i$ and $A_{i,j}$.
We then collect the treatment status of each of the neighbors, which gives us $T_i$.
We therefore have as data: $\left(Y_i,D_i,T_i,\gamma_i\right)_{i=1}^{\tilde{n}}$ as well as $\tilde{A}$, the set of sampled links.
```
### Identification of causal effects
We define conditional causal effects as follows:
\begin{align*}
\Delta^Y_{TDT}(t,\gamma) & = \esp{r(1,t,\gamma,\epsilon_i)-r(0,t,\gamma,\epsilon_i)|D_i=1,T_i=t,\gamma_i=\gamma}\\
\Delta^Y_{TIT}(d,\gamma) & = \esp{r(d,t,\gamma,\epsilon_i)-r(d,t',\gamma,\epsilon_i)|D_i=d,T_i=t,\gamma_i=\gamma},
\end{align*}
with $\Delta^Y_{TDT}(t,\gamma)$ the average treatment effect on the directly treated, keeping the indirect level of treatment and the degree of each individual constant; and $\Delta^Y_{TIT}(d,\gamma)$ the average treatment effect on the indirectly treated, keeping the direct level of treatment and the degree of each individual constant.
```{remark}
[Leung (2020)](https://doi.org/10.1162/rest_a_00818) also allows for the identification of effects on functions of the outcome, $h(Y_i)$.
```
To state identification results for both treatment effects, we are going to make several assumptions on $\tilde{D}=\left\{D_i\right\}_{i=1}^{\tilde{n}}$, $\tilde{\epsilon}=\left\{\epsilon_i\right\}_{i=1}^{\tilde{n}}$ and $\tilde{A}$:
```{hypothesis,Exo,name='Treatment exogeneity'}
We assume that *(a)* $\tilde{D}\Ind\left(\tilde{A},\tilde{\epsilon}\right)$ and *(b)* $\forall i\in\left\{1,\dots,\tilde{n}\right\}$, $\epsilon_i\Ind\tilde{A}|\gamma_i$.
```
```{remark}
Assumption \@ref(hyp:Exo) imposes that the treatment does not alter links between units across the network, and, furthermore, since treatment is assumed i.i.d., it also imposes that the treatment is not allocated with respect to network characteristics.
This can be relaxed, for example by conducting an experiment stratified on network characteristics.
```
```{remark}
Assumption \@ref(hyp:Exo) imposes full independence between treatment and unobservables, which is satisfied mostly when the treatment is randomly allocated across units.
```
```{remark}
Part *(b)* of Assumption \@ref(hyp:Exo) imposes that links are independent from error terms, conditional on degree.
This assumption rules out unobserved homophily (individuals forming links based on unobserved determinants of outcomes), a key open issue in the literature.
```
We also need one technical assumption:
```{hypothesis,Support,name='Support'}
We assume that *(a)* $\Pr(D_i=1)\in]0,1[$ and *(b)* there exists $P$: $\N\rightarrow\left[0,1\right]$ such that, $\forall\gamma\in\N$, $\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\uns{\gamma_i=\gamma}\probconv P(\gamma)$ and $\Gamma=\left\{\gamma:P(\gamma)>0\right\}$ is not empty and is different from $\left\{0\right\}$.
```
```{remark}
Assumption \@ref(hyp:Support) imposes that at least some sampled units have at least one link, and that some of them are treated and some are not.
```
We can now prove identification of our effects of interest:
```{theorem,IdentLeung,name="Identification of treatment effects on a network"}
Under Assumptions \@ref(hyp:Exo) and \@ref(hyp:Support), $\Delta^Y_{TDT}(t,\gamma)$ and $\Delta^Y_{TIT}(d,\gamma)$ are identified, $\forall d\in\left\{0,1\right\}$, $\forall t \leq \gamma$, $\forall\gamma\in\Gamma$:
\begin{align*}
\Delta^Y_{TDT}(t,\gamma) & = \esp{Y_i|D_i=1,T_i=t,\gamma_i=\gamma}-\esp{Y_i|D_i=0,T_i=t,\gamma_i=\gamma},\\
\Delta^Y_{TIT}(d,\gamma) & = \esp{Y_i|D_i=d,T_i=t,\gamma_i=\gamma}-\esp{Y_i|D_i=d,T_i=t',\gamma_i=\gamma}.
\end{align*}
```
```{proof}
Under Assumption \@ref(hyp:Support), $\esp{Y_i|D_i=d,T_i=t,\gamma_i=\gamma}$ is well defined $\forall d\in\left\{0,1\right\}$, $\forall t \leq \gamma$, $\forall\gamma\in\Gamma$.
We have, $\forall d\in\left\{0,1\right\}$, $\forall t,t' \leq \gamma$:
\begin{align*}
\esp{Y_i|D_i=d,T_i=t,\gamma_i=\gamma} & = \esp{r(d,t,\gamma,\epsilon_i)|D_i=d,T_i=t,\gamma_i=\gamma}\\
& = \esp{\esp{r(d,t,\gamma,\epsilon_i)|\tilde{A},D_i=d,T_i=t,\gamma_i=\gamma}|D_i=d,T_i=t,\gamma_i=\gamma}\\
& = \esp{\esp{r(d,t,\gamma,\epsilon_i)|\gamma_i=\gamma}|\gamma_i=\gamma}\\
& = \esp{r(d,t,\gamma,\epsilon_i)|\gamma_i=\gamma}\\
& = \esp{r(d,t,\gamma,\epsilon_i)|D_i=d',T_i=t',\gamma_i=\gamma},
\end{align*}
where the first equality is by definition, the second equality uses the Law of Iterated Expectations, and the third equality uses Assumption \@ref(hyp:Exo).
Indeed, Assumption \@ref(hyp:Exo) implies that $(D_i,T_i)\Ind\epsilon_i|(\tilde{A},\gamma_i)$, which enables us to undo the conditioning on $(D_i=d,T_i=t)$ in the inner expectation, and part *(b)* of Assumption \@ref(hyp:Exo) then enables us to undo the conditioning on $\tilde{A}$.
Since the inner expectation then depends only on $\gamma_i$, so does the outer expectation, by the Law of Iterated Expectations, which undoes the conditioning on $(D_i=d,T_i=t)$ in the outer expectation.
The same reasoning applied in reverse gives the last equality.
This proves the result.
```
### Estimation of causal effects
Michael explores two estimators of the causal effects, one nonparametric and one parametric.
The nonparametric estimators are as follows:
\begin{align*}
\hat{\Delta}^{Y^{np}}_{TDT}(t,\gamma) & = \hat{\mu}(1,t,\gamma)-\hat{\mu}(0,t,\gamma)\\
\hat{\Delta}^{Y^{np}}_{TIT}(d,\gamma) & = \hat{\mu}(d,t,\gamma)-\hat{\mu}(d,t',\gamma)\\
\hat{\mu}(d,t,\gamma) & = \frac{\sum_{i=1}^{\tilde{n}}Y_i\unsi{i}{d,t,\gamma}}{\sum_{i=1}^{\tilde{n}}\unsi{i}{d,t,\gamma}} \\
\unsi{i}{d,t,\gamma} & = \uns{D_i=d,T_i=t,\gamma_i=\gamma}.
\end{align*}
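To fix ideas, here is a minimal R sketch of these nonparametric estimators on simulated data; the data generating process (degrees, treatment probability, outcome model and effect sizes) is entirely hypothetical and chosen only for illustration:

```{r NPEstimSketch}
set.seed(1234)
n.sim <- 10^4
# hypothetical DGP: degrees between 1 and 4, i.i.d. treatment with probability 0.5
gamma.sim <- sample(1:4, n.sim, replace = TRUE)
D.sim <- rbinom(n.sim, 1, 0.5)
# number of treated neighbors: binomial since neighbors' treatments are i.i.d.
T.sim <- rbinom(n.sim, gamma.sim, 0.5)
# hypothetical response function: direct effect of 1, plus 0.5 per treated neighbor
Y.sim <- D.sim + 0.5 * T.sim + rnorm(n.sim)
# cell-mean estimator mu.hat(d, t, gamma)
mu.hat <- function(d, t, g) {
  cell <- (D.sim == d & T.sim == t & gamma.sim == g)
  mean(Y.sim[cell])
}
# nonparametric estimators of the treatment effects
Delta.TDT.np <- function(t, g) mu.hat(1, t, g) - mu.hat(0, t, g)
Delta.TIT.np <- function(d, t, t.prime, g) mu.hat(d, t, g) - mu.hat(d, t.prime, g)
```

With $10^4$ simulated units, `Delta.TDT.np(1, 2)` is close to the true direct effect of 1 and `Delta.TIT.np(0, 2, 1, 2)` is close to 0.5.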
The parametric estimators are:
\begin{align*}
\hat{\Delta}^{Y^{p}}_{TDT}(t,\gamma) & = \hat{\beta}^{OLS}_2\\
\hat{\Delta}^{Y^{p}}_{TIT}(d,\gamma) & = \hat{\beta}^{OLS}_3\left(\frac{t-t'}{\gamma}\right),
\end{align*}
using the OLS estimates of Equation \@ref(eq:linearnetworkmodel).
We need several assumptions to ensure that our estimators converge to the actual treatment effect when the network size grows large.
Let $\max_kA_{ik}A_{jk}$ denote whether $i$ and $j$ are indirectly linked through at least one common neighbor $k$.
```{hypothesis,IndepShocks,name="Independent Shocks"}
For any pair $(i,j)\in\left\{1,\dots,n\right\}^2$, *(a)* $\epsilon_i\Ind\epsilon_j|\tilde{A},A_{ij}=0,\max_kA_{ik}A_{jk}=0$ and *(b)* $(\epsilon_i,\epsilon_j)\Ind\tilde{A}|A_{ij},\gamma_i,\gamma_j,\sum_kA_{ik}A_{jk}$.
```
Part (a) of Assumption \@ref(hyp:IndepShocks) imposes that $\epsilon_i$ and $\epsilon_j$ can only be correlated if $i$ and $j$ are neighbors or share a common neighbor.
Part (b) of Assumption \@ref(hyp:IndepShocks) imposes that $\epsilon_i$ and $\epsilon_j$ depend on the network only through $i$ and $j$'s own connection, their own degrees, and their number of common connections.
We are now going to define a set of properties on the connections between units that will enable us to apply a CLT with non-independent data.
Under Assumption \@ref(hyp:IndepShocks), we know that the outcomes of observations are correlated across the network only when they are direct neighbors or have a neighbor in common.
Let us encode these connections within a new matrix, $G$, such that each entry measures whether the outcome of two observations are correlated or not: $G_{ij}=\uns{A_{ij}+\max_kA_{ik}A_{jk}+\uns{i=j}>0}$.
Let $\mathbf{N}_i=\left\{j:G_{ij}=1\right\}$ be the set of units whose outcomes are correlated with that of $i$, and $|\mathbf{N}_i|$ the cardinality of this set (*i.e.* the number of units whose outcomes are correlated with $i$'s outcome).
Let $G^3$ be the third matrix power of $G$.
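Constructing $G$ from the adjacency matrix is straightforward: since $(A^2)_{ij}=\sum_kA_{ik}A_{jk}$ counts the common neighbors of $i$ and $j$, $G$ can be obtained by thresholding $A+A^2+I$. The R sketch below illustrates this on a small hypothetical network (a path over four units, not taken from the paper):

```{r GMatrixSketch}
# hypothetical adjacency matrix: a path 1-2-3-4
A.ex <- matrix(c(0, 1, 0, 0,
                 1, 0, 1, 0,
                 0, 1, 0, 1,
                 0, 0, 1, 0), nrow = 4, byrow = TRUE)
n.ex <- nrow(A.ex)
# G_{ij} = 1{A_{ij} + max_k A_{ik}A_{jk} + 1{i=j} > 0}
G.ex <- 1 * ((A.ex + A.ex %*% A.ex + diag(n.ex)) > 0)
# sizes |N_i| of the dependency neighborhoods
N.sizes <- rowSums(G.ex)
# third matrix power of G
G3.ex <- G.ex %*% G.ex %*% G.ex
```

In this example `N.sizes` equals `c(3, 4, 4, 3)`: the outcome of unit 2, for instance, is correlated with that of every unit, since it is a neighbor of units 1 and 3 and shares its neighbor 3 with unit 4.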
```{hypothesis,DegreeDistribution,name="Degree Distribution"}
$\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}|\mathbf{N}_i|^3$ and $\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\sum_{j\neq i}(G^3)_{ij}$ are bounded in probability.
```
Assumption \@ref(hyp:DegreeDistribution) imposes that the number of links in the network is small enough for the $G$ matrix to be sparse.
Assumption \@ref(hyp:DegreeDistribution) implies that the average degree is bounded asymptotically, so that the average degree is substantially smaller than the sample size.
There is no direct way to test for this, but one convenient diagnostic is to compute the density of $G$ (its proportion of linked pairs), $\frac{1}{\binom{\tilde{n}}{2}}\sum_{i<j}G_{ij}=\frac{2\sum_{i<j}G_{ij}}{\tilde{n}(\tilde{n}-1)}$.
In sparse enough networks, this density is of the order of 10\% or less.
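As an illustration, the density of $G$ and sample analogues of the two quantities bounded in Assumption \@ref(hyp:DegreeDistribution) can be computed as follows (the small path network used here is hypothetical, and far too small for the density diagnostic to be meaningful):

```{r GDensitySketch}
# hypothetical adjacency matrix (a path 1-2-3-4) and its dependency matrix G
A.ex <- matrix(c(0, 1, 0, 0,
                 1, 0, 1, 0,
                 0, 1, 0, 1,
                 0, 0, 1, 0), nrow = 4, byrow = TRUE)
n.ex <- nrow(A.ex)
G.ex <- 1 * ((A.ex + A.ex %*% A.ex + diag(n.ex)) > 0)
# density of G: share of linked pairs among the choose(n, 2) distinct pairs
density.G <- sum(G.ex[upper.tri(G.ex)]) / choose(n.ex, 2)
# sample analogues of the two quantities bounded in the assumption
mean.N.cubed <- mean(rowSums(G.ex)^3)
G3.ex <- G.ex %*% G.ex %*% G.ex
mean.G3.offdiag <- mean(rowSums(G3.ex) - diag(G3.ex))
```

On this four-unit example the density is $5/6$, far above any sparsity benchmark, as expected for such a tiny network.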