What are efficient pooling designs for RT-PCR tests?

Question

I realize this is long, but hopefully I think it may be worth the reading for people interested in combinatorics and it might prove important to Covid-19 testing. Slightly reduced in edit.

The starting point of this question is this important article by Mutesa et al. where a hypercube $\{0,1,2\}^n$ is used for pooling takings for Covid-19 testing. This pooling design is only usable at low prevalence, the main questions are whether it can be improved in its prevalence range and whether one can find good pooling designs usable at higher prevalence.

I have written a draft sketching some possible research directions, and I would like to share here the main point and ask here what seem to me to be the main questions. It might be better to set up a Polymath project, but I don't feel I have the skills (I am not a combinatorician myself) nor the proper network to make it work.

We will model pooled PCR testing for e.g. Covid-19 by a hypergraph, i.e. a pair $(V,E)$ where $V$ is a set (whose elements are called vertices and represent patients) and $E$ is a set of non-empty subsets of $V$ (whose elements are called edges and represent pools). Recall that $v=\lvert V\rvert$ is the order of the hypergraph and $e=\lvert E\rvert$ its size; $v$ is the number of takings analyzed in a batch, and $e$ the number of tests to be run in parallel.

Definition Given a vertex $x\in V$, let $x^*$ be the set of edges containing $x$. Given a subset $X\subset V$ of vertices, let $X^*=\{e\in E \mid \exists x\in X, x\in e\}$ be the set of all edges incident to some element of $X$. Let us define a pooling design as a hypergraph $(V,E)$ satisfying the following property: $$\forall x\in V, \forall X\subset V, \quad x^* = X^* \implies X=\{x\}$$

This condition ensures that, whenever there is at most one positive taking, its uniqueness is guaranteed by the tests and it can be identified.

Given a pooling design $(V,E)$, we define its compression rate $$r=\frac{e}{v}$$ (the smaller, the better), and its detection capacity, i.e. the maximal number of positive taking that can be guaranteed and identified. Formally, letting $\mathcal{P}_{\le n}(V)$ be the set of subsets of $V$ with at most $n$ elements, we set $$c = \max \big\{n\colon \forall X,Y\in \mathcal{P}_{\le n}(V), X^*=Y^*\implies X=Y \big\}.$$ The definition of a pooling design ensures $c\ge 1$, but larger is better.

Proposition. Let $(V,E)$ be a pooling design of order $v$, size $e$ and detection capacity $c$. Then the compression rate satisfies $$r \ge H\big(\frac{c}{v}\big) - o_{v\to\infty}(1) $$

The proof is straightfoward, and sketched in the draft.

Example 1. The individual testing consist in taking $V$ the set of all $N$ takings, and $E=\big\{\{x\} \colon x\in V\big\}$: each edge is a single vertex. We call this the trivial pooling design of order $v$; it has \begin{align*} v &= e = N & r &= 1 & c &= N \end{align*}

Example 2. The hypercube design of (Mutesa et al. 2020) with dimension $D\ge2$ consist in taking $V=\{1,2,3\}^D$ and $E$ the set of coordinate slices, i.e. $$E=\bigcup_{k=1}^D \big\{\{1,2,3\}^{k-1}\times \{i\}\times\{1,2,3\}^{D-k} \colon i\in\{1,2,3\}\big\}.$$ It has \begin{align*} v &= 3^D & e &= 3D & r &= \frac{D}{3^{D-1}} & c &= 1 \end{align*}

Comparing $H(c/v)$ and the actual compression rate for the hypercube design with various values of $D$ show some limited room for improvement (see the draft): the hypercube is off by only a factor less than $2$; these pooling designs are thus not too far from optimal in their prevalence regime.

Example 3. The complete quadrilateral can be described with $V=\{1,2,3,4,5,6\}$ and $E=\big\{ \{1,2,3\}, \{3,4,5\}, \{5,6,2\}, \{1,4,6\} \big\}$. It has \begin{align*} v &= 6 & e &= 4 & r &= \frac23 & c &= 1 \end{align*} For comparison, we note that $H(c/v) \simeq 0.65$, very close to the compression rate: this pooling design is close to optimal in its prevalence regime.

Other examples from incidence geometry are given in the draft.

Example 4. Let $p$ be a prime number (or a primitive number) and $\mathbb{F}_p$ be the Field with $p$ elements. Choose a dimension $D\ge 2$ and a parameter $k\ge D$. We set $V=\mathbb{F}_p^D$ (for $p=3$, we thus have the same vertex set than in the hypercube design). Let $(\phi_1,\dots,\phi_k)$ be linear forms such that any $D$ of them are linearly independent. Without loss of generality, we can assume $(\phi_1,\dots,\phi_D)$ is the canonical dual basis (i.e. $\phi_i(x_1,\dots,x_D) = x_i$). Last, we let $E$ be the set of all levels of all the $\phi_i$: $$ E = \big\{\phi_i^{-1}(y) \colon i\in\{1,\dots, k\}, y\in\mathbb{F}_p \big\}.$$ Let us call the pooling design $(V,E)$ the generalized hybercube design of parameters $(p,D,k)$. It has \begin{align*} v &= p^D & e &= kp & r &= \frac{k}{p^{D-1}} \end{align*} and the remaining question is how large can be $c$.

General Question Which values of $v,r,c$ are realized by a pooling design?

Question 1. Determine $c$ for the generalized hypercube design (it might be that $c$ depends on the specific linear form chosen, although I would bet a low stake that it does not). Given $v_0$, what choice of $p,D,k$ such that $v\simeq v_0$ minimizes $\frac{r}{H(c/v)}$? Given a prevalence, what is the best value of $r$ that can be achieved with a generalized hypercube for which detection capacity is exceeded with probability less than $5\%$?

Question 2. Does there exist pooling designs with $v\gg 1$, $c/v \simeq 1/6$ and compression rate $\simeq2/3$?

Question 3. For small values of $v$, give all pooling designs that are optimal in the sense that no other pooling design with the same order has both better compression rate and better detection capability.

Question 4. Are any of the above question made simpler if we generalize the definitions, and replace the detection capacity $c$ by the set $\mathcal{D}$ of $X\subset V$ such $X^*=Y^* \implies X=Y$ for all $Y\subset V$? (Then the performance of the pooling at prevalence $p$ would be the probability that the set of positives takings is in $\mathcal{D}$, assuming the takings are IID random variables with Bernoulli laws of parameter $p$).

Some discussion related to the choice of tags in chat: chat.stackexchange.com/transcript/10243/2020/11/15 (As the OP, you're probably in the best position to judge which of those suggestions seem reasonable.) — Martin Sleziak, Nov 16, 2020 at 8:12
Thanks. I added combinatorial designs without being sure it fits (symmetry plays no role here, and designs have usually much more edges than vertices, but maybe this can still leads to interesting examples by some duality and more importantly it should attract the people with the right knowledge and skills. — Benoît Kloeckner, Nov 16, 2020 at 8:16
Somewhat related question: mathoverflow.net/questions/363457/… — Gro-Tsen, Nov 19, 2020 at 10:40
@Balazs: yes, see the answer by the author Endre Csoka below. — Benoît Kloeckner, Nov 19, 2020 at 15:05

Benoît Kloeckner · Accepted Answer · 2020-11-16 08:59:58Z

Let me get started with a small take at question 3, by proving that for $v\le 6$, the complete quadrilateral is optimal.

First, for $v\in\{1,2,3\}$ it is clear that no pooling design can have compression rate $r<1$ (so trivial is optimal). For example for $v=3$, we need to distinguish at least $5$ situations (no positives, at least $2$ positives, and $3$ possible single positives), so $2$ bits of information cannot be enough and we must have $e\ge 3$.

Thus $v=4$ is the first case where the trivial bound does not preclude a pooling design of interest (we need to distinguish $6$ situations, leading to the bound $e\ge3$). However:

Proposition. There are no pooling design with $v=4$ and $r<1$.

Proof. Assume $(V,E)$ is a pooling design with $V=\{1,2,3,4\}$ and $e=3$. If an element of $E$ is a singleton, then removing it from $E$ and its element from $V$ would give a pooling design with $v=3$ and $e=2$, which is impossible. If two elements $p,q$ of $E$ are contained one in the other, $p\subset q$, then replacing $q$ with $q\setminus p$ gives a pooling design (more information is carried by the results of $(p,q\setminus p)$ than by the results of $(p,q)$).

We can thus assume that no element of $E$ is a singleton, and no element of $E$ contains another (these are general arguments than can be used more widely).

In particular, all elements of $E$ have $2$ or $3$ elements.

No vertex can belong to all edges, since otherwise the positivity of this vertex would entail positivity of all edges, an event that cannot be distinguished from all vertices being positives.

No vertex $a$ can be contained in only one edge, otherwise the positivity of another vertex $b$ of this edge could not be distinguished from the positivity of $a$ and $b$.

It follows that all vertices must have degree exactly $2$. The total degree is thus $8$, and we must have two elements of $E$ of cardinal $3$ and the last one of cardinal $2$. But then the two largest edges must have two elements in common, which thus have the same link, a contradiction. $\square$

The same arguments lead to:

Proposition. A pooling design with $v=5$ must have $e\ge 4$.

Note that $(v,e) = (5,4)$ can be realized by removing a vertex from the complete quadrilateral.

Proof. Assume that $(V,E)$ is a pooling design with $v=5$ and $e=3$. Then its edges have cardinal $2,3$ or $4$ and its vertices all have degree $2$. The total degree is $10$, which can be achieved in two ways.

First, the decomposition $10=4+4+2$, i.e. two edges have $4$ elements each. But then these edges have two elements in common, which cannot be distinguished since they have degree $2$.

Second, the decomposition $10=4+3+3$. Then letting $V=\{1,2,3,4,5\}$ and $E=\{p,q,r\}$ with $p=\{1,2,3,4\}$, we must have $5^* = \{q,r\}$. Each of $q$ and $r$ have $3$ elements, including $5$. Therefore, up to symmetry, $q=\{1,2,5\}$ and $r=\{3,4,5\}$. Then $1^*=2^*$ and $3^*=4^*$, impossible. $\square$

Corollary. The complete quadrilateral is optimal for order $6$. For order $v< 6$, the only other pooling design with compression rate $r<1$ is obtained by removing one vertex from the complete quadrilateral.

Louis D · Accepted Answer · 2020-11-20 18:52:41Z

This isn't a full answer, but too long for a comment. I suppose it comes closest to trying to answer Question 3 or the general question of whether the hypercube design can be improved.

Definition Given a hypergraph $G=(\{v_1, \dots, v_n\}, E)$, the dual of $G$ is the hypergraph $H$ with $V(H)=E(G)$ and $E(H)=\{\{e\in E(G): v_i\in e\}: i\in [k]\}$ (in other words, each edge of $H$ is a maximal collection of edges from $G$ which are incident with a single vertex).

Let $H_{n,k}$ be the dual of $K_n^{k}$, the complete $k$-regular hypergraph on $n$ vertices. Note that the dual of $H_{n,k}$ is isomorphic to $K_n^k$.

(It seems to me that this hypergraph must have been studied before, but I couldn't find any reference to it. One possible lead is that $H_{4,2}$ is what you call the complete quadrilateral.)

Claim 1. $H_{n,k}$ is a $\binom{n-1}{k-1}$-uniform $k$-regular hypergraph with $\binom{n}{k}$ vertices and $n$ edges.

Proof. In $K_n^k$, every vertex is incident with $\binom{n-1}{k-1}$ edges, every edge has order $k$, there are $\binom{n}{k}$ edges, and $n$ vertices.$\square$

Claim 2. $H_{n,k}$ is a pooling design.

Proof. Every vertex in $H_{n,k}$ is incident with $k$ edges, so $|x^*|=k$. If $X$ is a set of vertices with $|X|>1$ (which corresponds to a set of more than one edge in $K_n^k$, which spans more than $k$ vertices in $K_n^k$) then $|X^*|>k$. So $x^*\neq X^*$ if $|X|>1$.$\square$

The compression rate of $H_{n,k}$ is $\frac{n}{\binom{n}{k}}$ which is minimized when $k=\lfloor{n/2}\rfloor$. Also note that ratio of the uniformity to the number of vertices is $\binom{n-1}{k-1}/\binom{n}{k}=k/n$. So there is a tradeoff when minimizing the compression rate, since the uniformity and degree increase when we increase $k$.

Some more examples: $H_{5,2}$ is 4-uniform with 10 vertices and 5 edges giving a compression ratio of $1/2$. $H_{6,3}$ is 10-uniform with 20 vertices and 6 edges, giving a compression ratio of $3/10$. $H_{7,3}$ is 15-uniform with 35 vertices and 7 edges, giving a compression ratio of $1/5$. Note that the hypercube design with $D=3$ is 9-regular with 27 vertices and 9 edges and thus a compression ratio of 1/3, so $H_{6,3}$ and $H_{7,3}$ compare favorably in this case.

Update 1. (It seems best to update my previous answer rather than write a new one.)

After thinking it over some more, I think I have an alternate characterization of pooling designs which both makes it easier to check if $H$ is a pooling design and elucidates some features of pooling designs. In particular, this gives a simple proof of the Propositions in your answer.

Claim 3 $H$ is a pooling design if and only if $x^*\not\subseteq y^*$ for all distinct $x,y\in V(H)$.

Proof. ($\Rightarrow$) Suppose there exists distinct $x,y\in V(H)$ such that $x^*\subseteq y^*$. Then $y^*=\{x,y\}^*$ and thus $H$ is not a pooling design.

($\Leftarrow$) Suppose $H$ is not a pooling design; that is, suppose there exists $y\in V(H)$ and $Y\subseteq V(H)$ with $Y\neq \{y\}$ such that $y^*=Y^*$. Since $Y\neq \{y\}$, there exists $x\in Y$ such that $x\neq y$. Since $x\in Y$, we have $x^*\subseteq Y^*=y^*$. $\square$

Corollary 1 Let $H$ be a hypergraph and let $G$ be the dual of $H$. $H$ is a pooling design if and only if $e\not\subseteq f$ for all distinct $e,f\in E(G)$.

Proof. ($\Rightarrow$) Suppose $H$ is a pooling design. Choose distinct $e,f\in E(G)$ which correspond to distinct $x, y\in V(H)$ respectively. Since $x^*\not\subseteq y^*$, we have $e\not\subseteq f$.

($\Leftarrow$) Suppose $e\not\subseteq f$ for all distinct $e,f\in E(G)$. Choose distinct $x,y\in V(H)$ which correspond to distinct $e,f\in E(G)$. Since $e\not\subseteq f$, we have $x^*\not\subseteq y^*$. $\square$

Corollary 2 Let $H$ be a hypergraph with $e$ edges and $n$ vertices such that $\binom{e}{\lfloor{e/2}\rfloor}<n$. Then $H$ is not a pooling design.

Proof. Let $G$ be the dual of $H$ and note that $G$ has $e$ vertices and $n$ edges. Since $|E(G)|=n>\binom{e}{\lfloor{e/2}\rfloor}=\binom{|V(G)|}{\lfloor{|V(G)|/2}\rfloor}$, Sperner's theorem implies that there exists distinct $e,f\in E(G)$ such that $e\subseteq f$. Thus $H$ is not a pooling design by Corollary 1. $\square$

In particular, this proves that every pooling design on $4\leq n\leq 6$ vertices has at least 4 edges, every pooling design on $7\leq n\leq 10$ vertices has at least 5 edges, etc.

Update 2.

Again, after considering some more, I now think it's clearer to just remain in the setting of the hypergraph $G$ and forget about taking the dual.

For example, let's compare the $K_8$-design to the hypercube design with $D=3$. In the $K_8$-design, each edge is a sample (there are 28), each vertex is a test pooling the samples which are incident with that vertex (there are 8), each test pools 7 samples (since the degree of each vertex is 7), and each sample will be used twice (since $K_8$ is 2-uniform). As I mentioned in a comment, this is better than the $D=3$ hypercube design in every parameter. Also you can see that if exactly one sample is infected, say the edge $\{i,j\}$, then exactly two tests (test $i$ and test $j$) will come back positive.

For another example, let's compare the $K_{13}$-design to the hypercube design with $D=4$. The $D=4$ hypercube design handles 81 samples using 12 tests each of which has size 27 and each sample is used 4 times. The $K_{13}$-design handles 78 samples using 13 tests, but each test has size 12 and each sample is only used 2 times.

For a final example, let's compare the $K_{9,9}$-design (that is, a complete bipartite graph with 9 vertices in each part) to the $D=4$ hypercube design. The $K_{9,9}$-design handles 81 samples using 18 tests, each of which has size 9 and each sample is used 2 times; however, this design has the additional feature that if three tests come back positive, then we will know exactly which two samples are infected. Neither the $K_{13}$-design, nor the $D=4$ hypercube design have that property.

Update 3

Given this alternate way of thinking about pooling designs, the detection capacity of $G$ can be defined as the largest integer $c$ such that no edge $e\in E(G)$ is contained in the union of at most $c$ edges of $E(G)\setminus \{e\}$. So if we want a pooling design with testing capacity $c$ which uses $t$ tests, we want a hypergraph on $t$ vertices with as many edges as possible such that no edge $e\in E(G)$ is contained in the union of at most $c$ edges of $E(G)\setminus \{e\}$. It turns out that this problem was studied in Erdős, Paul; Frankl, P.; Füredi, Z., Families of finite sets in which no set is covered by the union of (r) others, Isr. J. Math. 51, 79-89 (1985). ZBL0587.05021.

Thanks, this is very nice! The compression rate has to be compared with $H(c/v)$, but assuming $c=1$ in the above example, $H_{6,3}$ is only a factor $\simeq 1.05$ off and $H_{7,3}$ only $\simeq 1.07$ compared to a factor $\simeq1.46$ for the hypercube of dimension $3$, so these examples are very good! (They have $c=1$, otherwise they would probably beat the lower bound.) — Benoît Kloeckner, Nov 19, 2020 at 8:25
I updated my answer. Now I want to think about the case of larger values of c. — Louis D, Nov 19, 2020 at 16:38
I'll just make a comment until I have time to work out the details. In order to generalize this to higher detection capacity, say $c=2$, it seems that we need a generalization of Sperner's theorem which gives the maximum number of edges in a hypergraph $H$ such that for distinct $e,f\in E(H)$ and (not-necessarily-distinct) $g_1,g_2\in E(H)$, $e\cup g_1\not\subseteq f\cup g_2$. The dual of such a hypergraph would have a detection capacity of at least 2. As far as I can tell, the dual of the complete $k$-regular hypergraph on $2k$ vertices has a detection capacity of 1. — Louis D, Nov 20, 2020 at 1:37
Another thing I just realized is that the "correct" comparison to the hypercube design with $D=3$ is $H_{8,2}$. $H_{8,2}$ is 7-uniform, 2-regular, has 28 vertices, 8 edges giving a compression ratio of 2/7. So it can test more people (28 instead of 27), with fewer tests (8 instead of 9), the size of each test is smaller (7 instead of 9), and the samples need to be split a fewer number of times (2 instead of 3). — Louis D, Nov 20, 2020 at 12:12
I updated my answer yet again to include more detailed information about the case of larger testing capacity. — Louis D, Nov 20, 2020 at 18:53

Endre Csóka · Accepted Answer · 2020-11-19 10:07:01Z

4

If you are thinking about the realistic problem for COVID-19, then it is different from your mathematical question. I tried to make a summary about the real question: https://arxiv.org/pdf/2005.02388.pdf

answered Nov 19, 2020 at 10:07

Endre Csóka

411 bronze badge

$\begingroup$ You should read the paper by Mutesa et al. mentioned. They do actually implement their pooling methods in practice, and insist that one should limit the number or successive rounds. Your paper seems to under-estimate the importance of parallel testing, from what I see. $\endgroup$
– Benoît Kloeckner
Nov 19, 2020 at 14:20
2

$\begingroup$ After further reading, there seems to be much in common between your article and the above setup, I do not see that they oppose one to the other. $\endgroup$
– Benoît Kloeckner
Nov 19, 2020 at 15:06

Add a comment |

Benoît Kloeckner · Accepted Answer · 2021-01-14 09:04:25Z

I add this answer so as to be able to mark this question as answered. As I should have guessed, these problems have been studied for more than 70 years, and the questions I asked are probably either solved or known to be open, up to minor changes. One reference relevant to the questions I asked here (pertaining to "combinatorial Group testing") is

Du, D., Hwang, F. K., & Hwang, F. (2000). Combinatorial group testing and its applications (Vol. 12). World Scientific.

(thanks to Louis D for pointing this reference to me.)

The practical problem, however, is rather the Probabilistic Group Testing with "trivial two-stage algorithms" (more stages are impractical and most importantly too long to provide the results, purely non-adaptative algorithm leave errors which are usually not acceptable). Optimal performance is known in the limit of large volume and zero prevalence, see

Mézard, M., & Toninelli, C. (2011). Group testing with random pools: Optimal two-stage algorithms. IEEE Transactions on Information Theory, 57(3), 1736-1745.

An impressive point of this paper is that two-stage algorithms achieve (in the limit) the information theoretic bound up to a rather modest (and proved to be optimal) constant.

A recent survey is

Aldridge, M., Johnson, O., & Scarlett, J. (2019). Group testing: an information theory perspective. arXiv preprint arXiv:1902.06002.

All this seems to leaves some important practical questions open, e.g. identifying what are best (or close to optimal) two-stage algorithms in the case of fixed prevalence.

Benoît Kloeckner · Accepted Answer · 2020-11-22 17:45:27Z

An interesting direction, uncovered by @LouisD's answer mentioning [EFF] (Erdős, Paul; Frankl, P.; Füredi, Z., Families of finite sets in which no set is covered by the union of (r) others, Isr. J. Math. 51, 79-89 (1985). ZBL0587.05021), is to find a family $V$ of $k$-subsets of a $n$-set $E$, such that no two elements in the family intersect in more than $t$ points. Then associating each subset to a taking, and each element of $E$ to a pool, we get a pooling design with detection capacity at least $\lceil \frac k t\rceil-1$ since it needs at least $\lceil \frac k t\rceil$ elements of the family to cover any other elements.

For this, one can use finite fields in a number of way, using for example the fact that two lines of a projective space over $\mathbb{F}_q$ intersect in at most $1$ points (this can be generalized to other dimensions).

Among the pretty effective pooling designs one can get this way, let us mention two that are not equivalent to previously described in the other anwsers.

1.1. Consider $E=\mathbb{F}_3^3$ and $V$ the set of its affine lines. Then we get $v=117$, $e=27$ and $c=2$.

1.2 Consider $E=\mathbb{P}^3\mathbb{F}_3^4$ and $V$ the set of its (projective) lines. Then we have $v=130$, $e=40$ and $c=2$.

Very high compression rates can be achieved with $2$-planes in $4$-dimensional spaces, but the detection capacity stays moderate and this seems only applicable in low prevalence. Low compression rates but high detection capacity are achieved by taking large $q$ and working in dimension $2$.

Edit. Removed another method, whose computations where way wrong.

Stack Exchange Network

What are efficient pooling designs for RT-PCR tests?

5 Answers 5

Your Answer

Not the answer you're looking for? Browse other questions tagged
co.combinatorics
graph-theory
discrete-geometry
applied-mathematics
combinatorial-designs
or ask your own question.

Linked

What are efficient pooling designs for RT-PCR tests?

5 Answers 5

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged co.combinatoricsgraph-theorydiscrete-geometryapplied-mathematicscombinatorial-designs or ask your own question.

Linked

Related

Not the answer you're looking for? Browse other questions tagged
co.combinatorics
graph-theory
discrete-geometry
applied-mathematics
combinatorial-designs
or ask your own question.