Pointwise Information Decomposition Using the Specificity and Ambiguity Lattices

What are the distinct ways in which a set of predictor variables can provide information about a target variable? When does a variable provide unique information, when do variables share redundant information, and when do variables combine to provide complementary information? The redundancy lattice from the partial information decomposition of Williams and Beer provided a promising glimpse at the answer to these questions; however, this structure was constructed using a much criticised measure of redundant information. Despite much research effort, no satisfactory replacement measure has been proposed. This paper takes a different approach, applying the axiomatic derivation of the redundancy lattice to a single realisation from the set of variables. In order to do this, one must overcome the difficulties associated with signed pointwise mutual information. This is done by applying the decomposition separately to the non-negative entropic components of the pointwise mutual information, which we refer to as the specificity and the ambiguity. Then, based upon an operational interpretation of redundancy, measures of redundant specificity and ambiguity are defined. It is shown that the decomposed specificity and ambiguity can be recombined to yield the sought-after information decomposition. The decomposition is applied to canonical examples from the literature and its various properties are discussed. In particular, the pointwise decomposition using specificity and ambiguity satisfies a chain rule over target variables, which provides new insights into interpreting the well-known two-bit copy example.


I. INTRODUCTION
The aim of information decomposition is to divide the total amount of information provided by a set of predictor variables, about a target variable, into atoms of partial information contributed either individually or jointly by the various subsets of the predictors. Suppose that we are trying to predict a target variable $T$, with finite state space $\mathcal{T}$, from a pair of predictor variables $S_1$ and $S_2$, with finite state spaces $\mathcal{S}_1$ and $\mathcal{S}_2$. The mutual information $I(S_1; T)$ quantifies the information $S_1$ individually provides about $T$. Similarly, the mutual information $I(S_2; T)$ quantifies the information $S_2$ individually provides about $T$. Now consider the joint variable $S_{1,2}$, with finite state space $\mathcal{S}_1 \times \mathcal{S}_2$. The joint mutual information $I(S_{1,2}; T)$ quantifies the total information $S_1$ and $S_2$ jointly provide about $T$. Although Shannon's information theory provides the above three measures of information, there are four possible ways $S_1$ and $S_2$ could contribute information about $T$: the predictor $S_1$ could uniquely provide information about $T$; the predictor $S_2$ could uniquely provide information about $T$; $S_1$ and $S_2$ could both individually, but redundantly, provide the same information about $T$; or the predictors $S_1$ and $S_2$ could together provide information about $T$ which is not available in either predictor individually. Hence, we have the following underdetermined set of equations,
$$
\begin{aligned}
I(S_1; T) &= R(S_1, S_2 \to T) + U(S_1 \setminus S_2 \to T), \\
I(S_2; T) &= R(S_1, S_2 \to T) + U(S_2 \setminus S_1 \to T), \\
I(S_{1,2}; T) &= R(S_1, S_2 \to T) + U(S_1 \setminus S_2 \to T) + U(S_2 \setminus S_1 \to T) + C(S_1, S_2 \to T),
\end{aligned} \tag{1}
$$
where $U(S_1 \setminus S_2 \to T)$ and $U(S_2 \setminus S_1 \to T)$ are the unique information provided by $S_1$ and $S_2$ respectively, $R(S_1, S_2 \to T)$ is the redundant information, and $C(S_1, S_2 \to T)$ is the complementary information. (The directed notation is utilised here to emphasise the privileged role of the variable $T$.) This is the bivariate information decomposition. The problem is to define one of the unique, redundant or complementary information (something not provided by Shannon's information theory) in order to complete the decomposition.
Now suppose that we are trying to predict a target variable $T$ from a set consisting of $n$ finite state predictor variables $\boldsymbol{S} = \{S_1, \ldots, S_n\}$. Again, the aim of information decomposition is to divide the total amount of information $I(S_1, \ldots, S_n; T)$ into atoms of partial information contributed either individually or jointly by the various subsets of $\boldsymbol{S}$. But what are the distinct ways in which these subsets of predictors might contribute information about the target? Multivariate information decomposition is more involved than bivariate information decomposition because the structure of multivariate information is non-trivial: it is not immediately obvious how many atoms of information one needs to consider, nor is it clear how these atoms should relate to each other. Hence, the general problem of information decomposition is to provide both a structure for multivariate information which is consistent with the bivariate decomposition, and a way to uniquely define the atoms in this general structure.
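The underdetermined nature of the bivariate decomposition can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical AND-gate distribution (not an example taken from this section) and hypothetical helper names `mutual_information` and `atoms_given_redundancy`. It evaluates the three Shannon quantities and shows that each candidate value of the redundancy fixes the remaining three atoms, so the three equations alone cannot select a decomposition.

```python
from collections import defaultdict
from math import log2

def mutual_information(joint, src, tgt):
    """Return I(src; tgt) in bits.

    joint maps outcome tuples to probabilities; src and tgt are tuples of
    positions within each outcome tuple.
    """
    p_s, p_t, p_st = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        s = tuple(outcome[i] for i in src)
        t = tuple(outcome[i] for i in tgt)
        p_s[s] += p
        p_t[t] += p
        p_st[(s, t)] += p
    return sum(p * log2(p / (p_s[s] * p_t[t]))
               for (s, t), p in p_st.items() if p > 0)

# Hypothetical example: T = S1 AND S2 with independent uniform bits.
# Outcomes are ordered (s1, s2, t).
and_gate = {(a, b, a & b): 0.25 for a in (0, 1) for b in (0, 1)}
i1 = mutual_information(and_gate, (0,), (2,))     # I(S1; T)
i2 = mutual_information(and_gate, (1,), (2,))     # I(S2; T)
i12 = mutual_information(and_gate, (0, 1), (2,))  # I(S1,2; T)

def atoms_given_redundancy(r):
    """Solve the three equations for (U1, U2, C) once R = r is chosen."""
    return i1 - r, i2 - r, i12 - i1 - i2 + r

# Every choice of R yields atoms that satisfy all three equations.
for r in (0.0, 0.1, 0.3):
    u1, u2, c = atoms_given_redundancy(r)
    print(f"R={r:.1f}  U1={u1:.3f}  U2={u2:.3f}  C={c:.3f}")
```

Whichever value of `r` is supplied, the returned atoms reproduce the three mutual informations, which is why a separate definition of one atom is required.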
The remainder of Section I will introduce an intriguing framework called partial information decomposition (PID), which aims to address the general problem of information decomposition; however, it will also highlight some of the criticisms and weaknesses of this framework. Section II will then consider the underappreciated pointwise nature of information and discuss the relevance of this to the question of information decomposition. In particular, a modified, pointwise partial information decomposition (PPID) is proposed, although this approach is quickly repudiated due to complications associated with attempting to decompose the pointwise mutual information. Section III aims to circumvent this issue by examining information on a fundamental level. The conclusion of this enquiry is that one must address the problem of information decomposition by considering not merely the pointwise mutual information, but rather the unsigned entropic components of the pointwise mutual information, which are referred to as the specificity and the ambiguity. Based upon this, PPID using the specificity and ambiguity is introduced in Section IV, which is the main section of this article. Section V will apply this framework to a number of canonical examples from the PID literature, and discuss some of the key properties of the decomposition. Section VI concludes the main body of the article.

A. Notation
The following notational conventions are observed throughout this article:
• $T$, $\mathcal{T}$, $t$, $t^c$ denote the target variable, event space, event and complementary event respectively;
• $S$, $\mathcal{S}$, $s$, $s^c$ denote the predictor variable, event space, event and complementary event respectively;
• $\boldsymbol{S}$, $\boldsymbol{s}$ represent the set of $n$ predictor variables $\{S_1, \ldots, S_n\}$ and events $\{s_1, \ldots, s_n\}$ respectively;
• $\mathcal{T}^t$, $\mathcal{S}^s$ denote the two-event partitions of the event space, i.e. $\mathcal{T}^t = \{t, t^c\}$ and $\mathcal{S}^s = \{s, s^c\}$;
• $H(T)$, $I(S; T)$: uppercase function names are used for average information-theoretic measures;
• $h(t)$, $i(s; t)$: lowercase function names are used for pointwise information-theoretic measures.
Finally, to be discussed in more detail when appropriate, consider the following:
• $A_1, \ldots, A_k$: sources are sets of predictor variables, i.e. $A_i \in \mathcal{P}_1(\boldsymbol{S})$ where $\mathcal{P}_1$ is the power set without $\emptyset$;
• $a_1, \ldots, a_k$: source events are sets of predictor events, i.e. $a_i \in \mathcal{P}_1(\boldsymbol{s})$.

B. Partial Information Decomposition
The partial information decomposition (PID) of Williams and Beer [1,2] was introduced to address the problem of multivariate information decomposition. The approach taken is appealing as, rather than speculating about the structure of multivariate information, Williams and Beer took a more principled, axiomatic approach. First they consider potentially overlapping subsets of $\boldsymbol{S}$ called sources, denoted $A_1, \ldots, A_k$. Then they examine the various ways these sources might contain the same information. Formally, they introduce three axioms which "any reasonable measure for redundant information [$I_\cap$] should fulfil" [3, p. 3502].¹
W&B Axiom 1 (Commutativity). Redundant information is invariant under any permutation $\sigma$ of sources,
$$ I_\cap(A_1, \ldots, A_k \to T) = I_\cap(\sigma(A_1), \ldots, \sigma(A_k) \to T). $$
W&B Axiom 2 (Monotonicity). Redundant information decreases monotonically as more sources are included,
$$ I_\cap(A_1, \ldots, A_{k-1} \to T) \geq I_\cap(A_1, \ldots, A_k \to T). $$
W&B Axiom 3 (Self-redundancy). Redundant information for a single source $A_i$ equals the mutual information,
$$ I_\cap(A_i \to T) = I(A_i; T). $$
Axioms 1 and 2 are based upon the intuition that redundancy should be analogous to the notion of intersection from set theory (which is both commutative and monotonically decreasing), while Axiom 3 aims to tie this notion of redundancy to Shannon's information theory. In addition to these three axioms, there is a fourth (implicit) axiom being assumed here known as local positivity [5], which is the requirement that all atoms be non-negative. Williams and Beer [1,2] then show how these axioms reduce the number of sources to the collection of sources such that no source is a superset of any other. These remaining sources are called partial information atoms (PI atoms). Each PI atom corresponds to a distinct way the set of predictors $\boldsymbol{S}$ can contribute information about the target $T$. Furthermore, Williams and Beer show that these PI atoms are partially ordered, and hence form a lattice which they call the redundancy lattice. (Figure 3 depicts the redundancy lattices for bivariate and trivariate cases.) For the bivariate case, the redundancy lattice recovers the decomposition (1), while in the multivariate case it provides a meaningful structure for decomposition of the total information provided by an arbitrary number of predictor variables.
While the redundancy lattice of PID provides a structure for multivariate information decomposition, it does not uniquely determine the value of the PI atoms in the lattice. To do so requires actually defining a measure of redundant information which satisfies the above axioms. Hence, in order to complete the PID framework, Williams and Beer simultaneously introduced a measure of redundant information called $I_{\min}$, which quantifies redundancy as the minimum information that any source provides about a target event $t$, averaged over all possible events from $T$. However, not long after its introduction $I_{\min}$ was heavily criticised. Firstly, $I_{\min}$ does not distinguish between "whether different random variables carry the same information or just the same amount of information" [5, p. 269] (see also [6,7]). Secondly, $I_{\min}$ does not possess the target chain rule introduced by Bertschinger et al. [5] (under the name left chain rule). This latter point is problematic as the target chain rule is a natural generalisation of the chain rule of mutual information, i.e. one of the fundamental, and indeed characterising, properties of information in Shannon's theory [8,9].
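The first criticism can be made concrete with a short computation. The following is a minimal sketch, assuming the usual form of the specific information, $I(A; T{=}t) = \sum_a p(a|t) \log [p(t|a)/p(t)]$, and a hypothetical two-bit copy distribution in which the target is simply the pair of independent predictor bits. The function names `specific_information` and `i_min` are illustrative only.

```python
from collections import defaultdict
from math import log2

def specific_information(joint, src, t):
    """I(A; T=t) in bits, where outcomes are tuples ordered (s1, ..., sn, t)
    and src is a tuple of predictor positions forming the source A."""
    p_t = sum(p for o, p in joint.items() if o[-1] == t)
    p_a, p_at = defaultdict(float), defaultdict(float)
    for o, p in joint.items():
        a = tuple(o[i] for i in src)
        p_a[a] += p
        if o[-1] == t:
            p_at[a] += p
    # sum over a of p(a|t) * log2( p(t|a) / p(t) )
    return sum((p / p_t) * log2((p / p_a[a]) / p_t)
               for a, p in p_at.items() if p > 0)

def i_min(joint, sources):
    """Minimum specific information over the sources, averaged over t."""
    p_t = defaultdict(float)
    for o, p in joint.items():
        p_t[o[-1]] += p
    return sum(p * min(specific_information(joint, a, t) for a in sources)
               for t, p in p_t.items())

# Hypothetical two-bit copy: S1 and S2 are independent uniform bits, T = (S1, S2).
copy = {(a, b, (a, b)): 0.25 for a in (0, 1) for b in (0, 1)}
redundancy = i_min(copy, [(0,), (1,)])
print(f"I_min redundancy for two-bit copy: {redundancy:.3f} bit")
```

Under these assumptions the sketch reports one bit of redundancy even though the two predictors are independent: each predictor provides the same amount of information about every target event, which the minimum cannot tell apart from providing the same information.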
These issues with $I_{\min}$ prompted much research attempting to find a suitable replacement measure compatible with the PID framework. Using the methods of information geometry, Harder et al. [6] focused on a definition of redundant information called $I_{\text{red}}$ (see also [10]). Bertschinger et al. [11] defined a measure of unique information $UI$ based upon the notion that if one variable contains unique information then there must be some way to exploit that information in a decision problem. Griffith and Koch [12] used an entirely different motivation to define a measure of synergistic information $S_{\text{VK}}$ whose decomposition transpired to be equivalent to that of $UI$ [11]. Despite this effort, none of these proposed measures is entirely satisfactory. Firstly, just as for $I_{\min}$, none of these proposed measures possesses the target chain rule. Secondly, these measures are not compatible with the PID framework in general, but rather are only compatible with PID for the special case of bivariate predictors, i.e. the decomposition (1). This is because they all aim to simultaneously satisfy the Williams and Beer axioms, local positivity, and the identity property introduced by Harder et al. [6]. In particular, Rauh et al. [13] proved that no measure satisfying the identity property and the Williams and Beer Axioms 1-3 can yield a non-negative information decomposition beyond the bivariate case of two predictor variables. In addition to these proposed replacements for $I_{\min}$, there is also a substantial body of literature discussing either PID, similar attempts to decompose multivariate information, or the problem of information decomposition in general [3-5, 7, 10, 13-27]. Furthermore, there have also been several attempts to apply the current proposals [28-31]. Nevertheless (to date), there is no generally accepted measure of redundant information which is entirely compatible with the PID framework, nor has any other well-accepted multivariate information decomposition emerged.
To summarise the problem, we are seeking a meaningful decomposition of the information provided by an arbitrarily large set of predictor variables about a target variable, into atoms of partial information contributed either individually or jointly by the various subsets of the predictors. Crucially, the redundant information must capture when two predictor variables are carrying the same information about the target, not merely the same amount of information. Finally, any proposed measure of redundant information should satisfy the target chain rule so that net redundant information can be consistently computed for a chain of (potentially related) target events.

II. POINTWISE INFORMATION THEORY
Although underappreciated in the current reference texts on information theory [32,33], both the entropy and mutual information can be derived from first principles as fundamentally pointwise quantities, that is, as measures which quantify the information content of individual events rather than entire variables.² The pointwise entropy $h(t) = -\log p(t)$ quantifies the information content of a single event $t$, while the pointwise mutual information,
$$ i(s; t) = \log \frac{p(t|s)}{p(t)} = \log \frac{p(s, t)}{p(s)\, p(t)} = \log \frac{p(s|t)}{p(s)}, \tag{2} $$
quantifies the information provided by $s$ about $t$, or vice versa. The usual (average) entropy and (average) mutual information can be recovered by taking the expectation over all events from the relevant variables, i.e. $H(T) = \langle h(t) \rangle$ and $I(S; T) = \langle i(s; t) \rangle$. To our knowledge, this pointwise notion of information was first considered by Woodward and Davies [34,35], who noted that the average form of Shannon's entropy "tempts one to enquire into other simpler methods of derivation [of the per state entropy]" [34, p. 51]. Indeed, they showed that the (pointwise) entropy and (pointwise) mutual information can both be derived from just two axioms concerning the addition of the information provided by the occurrence of individual events [35]. Fano [9] formalised their idea further by deriving the pointwise mutual information and pointwise entropy from four postulates which "should be satisfied by a useful measure of information" [9, p. 31]. This bottom-up approach of first deriving the pointwise quantities and then taking the expectation over these yields the same quantities as Shannon's top-down method of directly defining the average quantities. Although both approaches arrive at the same (average) quantities, Shannon's treatment obfuscates the pointwise nature of the fundamental quantities, in contrast to Fano's treatment, which makes it manifest.
The relevance of this pointwise nature of information to the problem of information decomposition will be established and discussed in detail in the next section (Section II A). However, before continuing, it is important to note that, in contrast to the (average) mutual information, the pointwise mutual information is not non-negative. Positive pointwise information corresponds to the predictor event $s$ raising the posterior probability $p(t|s)$ relative to the prior probability $p(t)$. Hence when the event $t$ occurs it can be said that the event $s$ was informative about the event $t$. Conversely, negative pointwise information corresponds to the event $s$ lowering the posterior probability $p(t|s)$ relative to the prior probability $p(t)$. Hence when the event $t$ occurs we can say that the event $s$ was misinformative about the event $t$.³
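The sign behaviour described above can be checked with a small numerical sketch. The prior and posterior values below are hypothetical, chosen only so that the same predictor event is informative about one target event and misinformative about the other; `pointwise_mi` is an illustrative helper name.

```python
from math import log2

def pointwise_mi(p_t_given_s, p_t):
    """i(s; t) = log2( p(t|s) / p(t) ) in bits."""
    return log2(p_t_given_s / p_t)

# Hypothetical numbers: a uniform prior over two target events, and an event s
# that shifts the posterior to P(T|s) = (0.2, 0.8).
prior = {"t1": 0.5, "t2": 0.5}
posterior = {"t1": 0.2, "t2": 0.8}

# s lowers p(t1|s) below p(t1) (misinformative) and raises p(t2|s) (informative).
i_by_target = {t: pointwise_mi(posterior[t], prior[t]) for t in prior}

# Averaging over the target events that can accompany s gives a
# Kullback-Leibler divergence, which is never negative.
average = sum(posterior[t] * i_by_target[t] for t in prior)

for t, value in i_by_target.items():
    print(f"i(s; {t}) = {value:+.3f} bit")
print(f"average over t given s = {average:.3f} bit")
```

The individual terms take both signs, while their average over the target events remains non-negative, in line with the remark that an event is never misinformative about the target variable on average.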

A. Pointwise Information Decomposition
Now that the pointwise nature of information has been established, suppose that we have a realisation from the joint event space $\mathcal{T} \times \mathcal{S}_1 \times \mathcal{S}_2$ consisting of the target event $t$ and predictor events $s_1$ and $s_2$. The pointwise mutual information $i(s_1; t)$ quantifies the information provided individually by $s_1$ about $t$, while the pointwise mutual information $i(s_2; t)$ quantifies the information provided individually by $s_2$ about $t$. The pointwise joint mutual information $i(s_{1,2}; t)$ quantifies the total information provided jointly by $s_1$ and $s_2$ about $t$. In correspondence with the (average) bivariate decomposition (1), consider the pointwise bivariate decomposition, first suggested by Lizier et al. [4],
$$
\begin{aligned}
i(s_1; t) &= r(s_1, s_2 \to t) + u(s_1 \setminus s_2 \to t), \\
i(s_2; t) &= r(s_1, s_2 \to t) + u(s_2 \setminus s_1 \to t), \\
i(s_{1,2}; t) &= r(s_1, s_2 \to t) + u(s_1 \setminus s_2 \to t) + u(s_2 \setminus s_1 \to t) + c(s_1, s_2 \to t),
\end{aligned} \tag{3}
$$
where the lower case quantities denote the pointwise equivalent of the corresponding upper case quantities in (1). This decomposition could be considered for every realisation on the support of the joint distribution $P(S_1, S_2, T)$. Hence, consider taking the expectation of these pointwise atoms over all realisations,
$$
\begin{aligned}
U(S_1 \setminus S_2 \to T) &= \langle u(s_1 \setminus s_2 \to t) \rangle, & U(S_2 \setminus S_1 \to T) &= \langle u(s_2 \setminus s_1 \to t) \rangle, \\
R(S_1, S_2 \to T) &= \langle r(s_1, s_2 \to t) \rangle, & C(S_1, S_2 \to T) &= \langle c(s_1, s_2 \to t) \rangle.
\end{aligned} \tag{4}
$$
Since the expectation is a linear operation, this will recover the (average) bivariate decomposition (1). Together, equations (3) for every realisation, (1) and (4) form the bivariate pointwise information decomposition. Just as in (1), these equations are underdetermined, requiring a separate definition of either the pointwise unique, redundant or complementary information for uniqueness. (Defining an average atom is sufficient for a unique bivariate decomposition (1), but still leaves the pointwise decomposition (3) within each realisation underdetermined.)
² The term pointwise mutual information has only recently become typical. Perhaps the term event-wise would provide a more apt description; however, the usage is not typical. Woodward [34] and Fano [9] both referred to it as the mutual information and then explicitly prefixed the average mutual information. Some literature, typically in the context of time-series analysis, refers to it as the local mutual information, e.g. [4,18].
³ The term misinformation should absolutely not be taken to mean disinformation (i.e. it does not mean intentionally misleading information). Furthermore, note that while a source event $s$ may be deemed to be misinformative about a particular target event $t$, a source event $s$ is never misinformative about the target variable $T$ on average. This can be seen by noting that the pointwise mutual information averaged over all target realisations is non-negative [9]. In other words, the information provided by $s$ is on average helpful for predicting $T$; however, in certain instances this typically helpful information is misleading in the sense that it lowers $p(t|s)$ relative to $p(t)$. Typically helpful information which subsequently turns out to be misleading is misinformation.

B. Pointwise Unique
Now consider applying this pointwise information decomposition to the probability distribution Pointwise Unique (PwUnq) in Table IV. In PwUnq, observing a 0 in either of $S_1$ or $S_2$ provides zero information about the target $T$, while complete information about the outcome of $T$ is obtained by observing a 1 or a 2 in either predictor. The probability distribution is structured such that in each of the four realisations, one predictor provides complete information while the other predictor provides zero information. The two predictors therefore never provide the same information about the target, since one of the two predictors always provides zero pointwise information.
Given that redundancy is supposed to capture the same information, it seems reasonable to assume there must be zero pointwise redundant information for each realisation. This assumption is made without any measure of pointwise redundant information; however, no other possibility seems justifiable. This assertion is used to determine the pointwise redundant information terms in Table IV. Using the pointwise information decomposition (3), we can then evaluate the other pointwise atoms of information in Table IV. Finally, using (4), we get that there is zero (average) redundant information, and 1/2 bit of (average) unique information from each predictor. From the pointwise perspective, the only reasonable conclusion seems to be that the predictors in PwUnq must contain only unique information about the target.
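The calculation described above can be sketched as follows. The four equiprobable realisations used here are an assumption, chosen to match the verbal description of PwUnq (a 0 in a predictor is uninformative, a 1 or a 2 fully determines the target); the helper names `marginal` and `pmi` are illustrative. The sketch imposes zero pointwise redundancy, solves the pointwise decomposition for the remaining atoms in each realisation, and then averages.

```python
from collections import defaultdict
from math import log2

def marginal(joint, idx):
    """Marginal distribution over the outcome positions listed in idx."""
    m = defaultdict(float)
    for o, p in joint.items():
        m[tuple(o[i] for i in idx)] += p
    return m

def pmi(joint, src, tgt, outcome):
    """Pointwise mutual information i(src; tgt) in bits for one realisation."""
    p_s, p_t = marginal(joint, src), marginal(joint, tgt)
    p_st = marginal(joint, src + tgt)
    s = tuple(outcome[i] for i in src)
    t = tuple(outcome[i] for i in tgt)
    return log2(p_st[s + t] / (p_s[s] * p_t[t]))

# Assumed PwUnq distribution; outcomes are ordered (s1, s2, t).
pw_unq = {(0, 1, 1): 0.25, (1, 0, 1): 0.25, (0, 2, 2): 0.25, (2, 0, 2): 0.25}

averages = defaultdict(float)
for outcome, p in pw_unq.items():
    i1 = pmi(pw_unq, (0,), (2,), outcome)
    i2 = pmi(pw_unq, (1,), (2,), outcome)
    i12 = pmi(pw_unq, (0, 1), (2,), outcome)
    r = 0.0                    # assumed: zero pointwise redundancy
    u1, u2 = i1 - r, i2 - r    # pointwise unique information
    c = i12 - i1 - i2 + r      # pointwise complementary information
    for name, value in (("R", r), ("U1", u1), ("U2", u2), ("C", c)):
        averages[name] += p * value  # expectation over realisations

print(dict(averages))
```

Under these assumptions the averages come out as 1/2 bit of unique information from each predictor, with zero redundant and zero complementary information.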
However, in contrast to the above, $I_{\min}$, $I_{\text{red}}$, $UI$, and $S_{\text{VK}}$ all say that the predictors in PwUnq contain no unique information, rather only 1/2 bit of redundant information plus 1/2 bit of complementary information. This problem, which will be referred to as the pointwise unique problem, is a consequence of the fact that these measures all satisfy Assumption ($*$) of Bertschinger et al. [11], which (in effect) states that the unique and redundant information should only depend on the marginal distributions $P(T, S_1)$ and $P(T, S_2)$. In particular, any measure which satisfies Assumption ($*$) will yield zero unique information when $P(T, S_1)$ is isomorphic⁴ to $P(T, S_2)$, as is the case for PwUnq. The problem arises because Assumption ($*$) (and indeed the operational interpretation that led to its introduction) does not respect the pointwise nature of information. This operational view does not take into account the fact that individual events $s_1$ and $s_2$ may provide different information about the event $t$, even if the probability distributions $P(T, S_1)$ and $P(T, S_2)$ are the same. Hence, we contend that for any measure to capture the same information (not merely the same amount), it must respect the pointwise nature of information.

C. Pointwise Partial Information Decomposition
TABLE IV (entries not recoverable from the source). PwUnq. Columns: $p$, $s_1$, $s_2$, $t$, $i(s_1; t)$, $i(s_2; t)$, $i(s_{1,2}; t)$, $u(s_1 \setminus s_2 \to t)$, $u(s_2 \setminus s_1 \to t)$, $r(s_1, s_2 \to t)$, $c(s_1, s_2 \to t)$. For each realisation, the pointwise mutual information provided by the individual and joint predictor events about the target event has been evaluated. Note that one predictor event always provides full information about the target while the other provides zero information. Based on this, it is assumed that there must be zero redundant information. The PPI atoms are then calculated via (3).
With the pointwise unique problem in mind, consider constructing an information decomposition where the pointwise nature of information is an inherent property. Let $a_1, \ldots, a_k$ be potentially intersecting subsets of the predictor events $\boldsymbol{s} = \{s_1, \ldots, s_n\}$, called source events. Now consider rewriting the Williams and Beer axioms in terms of a measure of pointwise redundant information $i_\cap$, where the aim is to derive a pointwise partial information decomposition (PPID).
PPID Axiom 1 (Symmetry). Pointwise redundant information is invariant under any permutation $\sigma$ of source events,
$$ i_\cap(a_1, \ldots, a_k \to t) = i_\cap(\sigma(a_1), \ldots, \sigma(a_k) \to t). $$
PPID Axiom 2 (Monotonicity). Pointwise redundant information decreases monotonically as more source events are included,
$$ i_\cap(a_1, \ldots, a_{k-1} \to t) \geq i_\cap(a_1, \ldots, a_k \to t), $$
with equality if $a_k \supseteq a_i$ for any $a_i \in \{a_1, \ldots, a_{k-1}\}$.
PPID Axiom 3 (Self-redundancy). Pointwise redundant information for a single source event $a_i$ equals the pointwise mutual information,
$$ i_\cap(a_i \to t) = i(a_i; t). $$
It seems that the next step should be to define some measure of pointwise redundant information which is compatible with these PPID axioms; however, there is a problem: the pointwise mutual information is not non-negative. While this would not be an issue for examples like PwUnq, where none of the source events provide negative pointwise information, it is an issue in general (e.g. see RdnErr in Section V D). The problem is that the set-theoretic intuition behind Axiom 2 (monotonicity) makes little sense in the face of negative pointwise information. Indeed, it is for this very reason that local positivity is considered a desirable property for PID.
Given the desire to address the pointwise unique problem, there is a need to overcome this issue. Ince [18] suggested that the set-theoretic intuition is only valid when all source events provide either positive or negative pointwise information. Ince contends that information and misinformation are "fundamentally different" [18, p. 11] and that the set-theoretic intuition should be ignored in the difficult-to-interpret situations where both are present. We, however, will take a different approach: one which aims to deal with these difficult-to-interpret situations whilst preserving the set-theoretic intuition that redundancy corresponds to overlapping information.
By way of a preview, we first consider precisely how an event $s_1$ provides information about an event $t$ by means of two distinct types of probability mass exclusion. We show how considering the process in this way naturally splits the pointwise mutual information into particular entropic components, and how one can consider redundancy on each of these components separately. Splitting the signed pointwise mutual information into these unsigned entropic components circumvents the above issue with Axiom 2 (monotonicity). Crucially, however, by deriving these entropic components from the probability mass exclusions, we retain the set-theoretic intuition of redundancy: redundant information will correspond to overlapping probability mass exclusions in the two-event partition $\mathcal{T}^t = \{t, t^c\}$.

III. PROBABILITY MASS EXCLUSIONS AND THE DIRECTED COMPONENTS OF POINTWISE MUTUAL INFORMATION
By definition, the pointwise information provided by $s$ about $t$ is associated with a change from the prior $p(t)$ to the posterior $p(t|s)$. As explored from first principles in Finn and Lizier [38], this change is a consequence of the exclusion of probability mass in the target distribution $P(T)$ induced by the occurrence of the event $s$ and inferred via the joint distribution $P(T, S)$. To be specific, when the event $s$ occurs, one knows that the complementary event $s^c = \{\mathcal{S} \setminus s\}$ did not occur. Hence one can exclude the probability mass in the joint distribution $P(T, S)$ associated with the complementary event, i.e. exclude $P(T, s^c)$, leaving just the probability mass $P(T, s)$ remaining. The new target distribution $P(T|s)$ is evaluated by normalising this remaining probability mass. In [38], probability mass diagrams were introduced in order to visually explore the exclusion process: Figure 1 is an example of such a diagram. Clearly, this process is merely the definition of conditional probability. However, by viewing the change from the prior to the posterior in this way, focusing explicitly on the exclusions rather than the resultant conditional probability, the vague intuition that redundancy corresponds to overlapping information becomes precise. This point will be elaborated upon in Section III C; however, before that there is a need to discuss the two distinct types of probability mass exclusion in Section III A and then relate these to information-theoretic quantities in Section III B.
FIG. 1. Sample probability mass diagrams, which use length to represent the probability mass of each joint event from $\mathcal{T} \times \mathcal{S}$. Left: the joint distribution $P(T, S)$. Middle: the occurrence of the event $s_1$ leads to the exclusion of the complementary event $s_1^c$, which consists of two elementary events, i.e. $s_1^c = \{s_2, s_3\}$. This leaves the probability mass $P(T, s_1)$ remaining. The exclusion of the probability mass $p(t_1, s_1^c)$ was misinformative since the event $t_1$ did occur. By convention, misinformative exclusions will be indicated with diagonal hatching. On the other hand, the exclusion of the probability mass $p(t_1^c, s_1^c)$ was informative since the complementary event $t_1^c$ did not occur. By convention, informative exclusions will be indicated with horizontal or vertical hatching. Right: this remaining probability mass can be normalised, yielding the conditional distribution $P(T|s_1)$.
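The exclusion process can be sketched in a few lines. The joint distribution below is hypothetical (it is not read off Figure 1), as is the helper name `exclusions`. The sketch removes the probability mass belonging to the complementary event, labels the excluded mass as informative or misinformative relative to the target event that occurred, and normalises what remains.

```python
def exclusions(joint, s, t):
    """Split the mass excluded by observing s, relative to the target event t.

    joint maps (t, s) pairs to probabilities. Returns the informative
    exclusion p(t^c, s^c), the misinformative exclusion p(t, s^c), and the
    posterior P(T|s) obtained by normalising the remaining mass P(T, s).
    """
    informative = sum(p for (t_, s_), p in joint.items() if s_ != s and t_ != t)
    misinformative = sum(p for (t_, s_), p in joint.items() if s_ != s and t_ == t)
    remaining = {t_: p for (t_, s_), p in joint.items() if s_ == s}
    total = sum(remaining.values())
    posterior = {t_: p / total for t_, p in remaining.items()}
    return informative, misinformative, posterior

# Hypothetical joint distribution over T = {t1, t2} and S = {s1, s2, s3}.
joint = {
    ("t1", "s1"): 0.3, ("t2", "s1"): 0.1,
    ("t1", "s2"): 0.1, ("t2", "s2"): 0.2,
    ("t1", "s3"): 0.1, ("t2", "s3"): 0.2,
}

informative, misinformative, posterior = exclusions(joint, "s1", "t1")
print(f"informative exclusion    p(t1^c, s1^c) = {informative:.2f}")
print(f"misinformative exclusion p(t1, s1^c)   = {misinformative:.2f}")
print(f"posterior P(T|s1) = {posterior}")
```

In this hypothetical case the informative exclusion is larger than the misinformative one, and the posterior probability of the target event rises above its prior of 0.5.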

A. Two Distinct Types of Probability Mass Exclusions
In [38], it was shown that there are two distinct types of probability mass exclusions, depending on where the exclusion occurs in the target distribution $P(T)$ and the particular target event $t$ which occurred. Informative exclusions are those which are confined to the probability mass associated with the set of elementary events in the target distribution which did not occur, i.e. exclusions confined to the probability mass of the complementary event $p(t^c)$. They are called such because the pointwise mutual information $i(s; t)$ is a monotonically increasing function of the total size of these exclusions within $p(t^c)$. By convention, informative exclusions are represented on the probability mass diagrams by horizontal or vertical lines. On the other hand, the misinformative exclusion is confined to the probability mass associated with the elementary event in the target distribution which did occur, i.e. an exclusion confined to $p(t)$. It is referred to as such because the pointwise mutual information $i(s; t)$ is a monotonically decreasing function of the size of this type of exclusion within $p(t)$. By convention, misinformative exclusions are represented on the probability mass diagrams by diagonal lines.
Although an event $s$ may exclusively induce either type of exclusion, in general both types of exclusion are present simultaneously. The distinction between the two types of exclusions leads naturally to the following question: can one decompose the pointwise mutual information $i(s; t)$ into a positive informational component associated with the informative exclusions, and a negative informational component associated with the misinformative exclusions? This question is considered in detail in Section III B. However, before moving on, there is a crucial observation to be made about the pointwise mutual information which will have important implications for the measure of redundant information:
Remark 1. The pointwise mutual information $i(s; t)$ depends only on the size of informative and misinformative exclusions. In particular, it does not depend on the apportionment of the informative exclusions across the set of elementary events contained in the complementary event $t^c$.
In other words, whether the event $s$ turns out to be net informative or misinformative about the event $t$ (whether $i(s; t)$ is positive or negative) depends on the size of the two types of exclusions; but, to be explicit, it does not depend on the distribution of the informative exclusion across the set of target events which did not occur. This remark will be crucially important when it comes to providing the operational interpretation of redundant information in Section III C, and is also further discussed in terms of Kelly gambling [39] in Appendix A.
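Remark 1 can be checked numerically. Writing the posterior as $p(t|s) = [p(t) - p(t, s^c)] / [1 - p(t, s^c) - p(t^c, s^c)]$ expresses the pointwise mutual information using only the prior and the two exclusion sizes. The sketch below uses two hypothetical joint distributions that share the same prior and the same exclusion sizes but apportion the informative exclusion differently across the target events that did not occur; all names are illustrative.

```python
from math import log2

def exclusion_sizes(joint, s, t):
    """Return (informative, misinformative) exclusion sizes; keys are (t, s)."""
    informative = sum(p for (t_, s_), p in joint.items() if s_ != s and t_ != t)
    misinformative = sum(p for (t_, s_), p in joint.items() if s_ != s and t_ == t)
    return informative, misinformative

def pmi_from_exclusions(p_t, informative, misinformative):
    """i(s; t) written using only the prior p(t) and the two exclusion sizes."""
    posterior = (p_t - misinformative) / (1 - informative - misinformative)
    return log2(posterior / p_t)

def pmi_direct(joint, s, t):
    """i(s; t) = log2( p(t, s) / (p(t) p(s)) ) computed from the joint."""
    p_t = sum(p for (t_, _), p in joint.items() if t_ == t)
    p_s = sum(p for (_, s_), p in joint.items() if s_ == s)
    return log2(joint[(t, s)] / (p_s * p_t))

# Two hypothetical joints over T = {a, b, c} and S = {x, y}. Both have prior
# P(T) = (0.3, 0.4, 0.3); they differ only in how the mass excluded by
# observing x is spread over the non-occurring target events b and c.
joint_1 = {("a", "x"): 0.2, ("a", "y"): 0.1, ("b", "x"): 0.3,
           ("b", "y"): 0.1, ("c", "x"): 0.1, ("c", "y"): 0.2}
joint_2 = {("a", "x"): 0.2, ("a", "y"): 0.1, ("b", "x"): 0.1,
           ("b", "y"): 0.3, ("c", "x"): 0.3, ("c", "y"): 0.0}

sizes_1 = exclusion_sizes(joint_1, "x", "a")
sizes_2 = exclusion_sizes(joint_2, "x", "a")
pmi_1 = pmi_direct(joint_1, "x", "a")
pmi_2 = pmi_direct(joint_2, "x", "a")
print(sizes_1, sizes_2, pmi_1, pmi_2)
```

Both distributions give the same pointwise mutual information, and `pmi_from_exclusions` reproduces that value from the exclusion sizes alone.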

B. The Directed Components of Pointwise Information: Specificity and Ambiguity
Returning to the idea that one might be able to decompose the pointwise mutual information into a positive and negative component, associated with the informative and misinformative exclusions respectively, Finn and Lizier [38] proposed three postulates for such a decomposition. Before stating the postulates, it is important to note that although there is a "surprising symmetry" [40, p. 23] between the information provided by $s$ about $t$ and the information provided by $t$ about $s$, there is nothing to suggest that the components of the decomposition should be symmetric. Indeed, the intuition behind the decomposition only makes sense when considering the information as being directed. Hence, directed notation will be used to explicitly denote the information provided by $s$ about $t$.
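Postulate 1 (Decomposition). Given an event space $\mathcal{T} \times \mathcal{S}_1$, the pointwise mutual information provided by the event $s_1$ about the event $t$ can be decomposed into two non-negative components,
$$ i(s_1; t) = i^+(s_1 \to t) - i^-(s_1 \to t). $$
Postulate 2 (Monotonicity). The function $i^+(s_1 \to t)$ is a continuous, monotonically increasing function of the size of the informative exclusion $p(s_1^c, t^c)$ for a fixed sized misinformative exclusion $p(s_1^c, t)$. On the other hand, the function $i^-(s_1 \to t)$ is a continuous, monotonically increasing function of the size of the misinformative exclusion $p(s_1^c, t)$ for a fixed sized informative exclusion $p(s_1^c, t^c)$.
Postulate 3 (Chain Rule). The functions $i^+$ and $i^-$ satisfy the following chain rule,
$$
\begin{aligned}
i^+(s_{1,2} \to t) &= i^+(s_1 \to t) + i^+(s_2 \to t \,|\, s_1) = i^+(s_2 \to t) + i^+(s_1 \to t \,|\, s_2), \\
i^-(s_{1,2} \to t) &= i^-(s_1 \to t) + i^-(s_2 \to t \,|\, s_1) = i^-(s_2 \to t) + i^-(s_1 \to t \,|\, s_2).
\end{aligned}
$$
In Finn and Lizier [38], it was proved that these postulates lead to the following forms, which are unique up to the choice of the base of the logarithm:
$$
\begin{aligned}
i^+(s_1 \to t) &= h(s_1) = -\log p(s_1), & i^-(s_1 \to t) &= h(s_1 | t) = -\log p(s_1 | t), \\
i^+(s_1 \to t \,|\, s_2) &= h(s_1 | s_2) = -\log p(s_1 | s_2), & i^-(s_1 \to t \,|\, s_2) &= h(s_1 | t, s_2) = -\log p(s_1 | t, s_2), \\
i^+(s_{1,2} \to t) &= h(s_{1,2}) = -\log p(s_{1,2}), & i^-(s_{1,2} \to t) &= h(s_{1,2} | t) = -\log p(s_{1,2} | t).
\end{aligned}
$$
Hence, the postulates uniquely decompose the pointwise information provided by $s$ about $t$ into the following entropic components,
$$ i(s; t) = i^+(s \to t) - i^-(s \to t) = h(s) - h(s|t). \tag{11} $$
Although the decomposition of mutual information into entropic components is well-known, it is non-trivial that the above postulates, based on the size of the two distinct types of probability mass exclusions, should lead to this particular form as opposed to $i(s; t) = h(t) - h(t|s)$.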
It is important to note that although the original motivation was to decompose the pointwise mutual information into separate components associated with informative and misinformative exclusion, the decomposition (11) does not quite possess this direct correspondence:
• The positive informational component $i^+(s \to t)$ does not depend on t but rather only on s. This can be interpreted as follows: the less likely s is to occur, the more specific it is when it does occur, the greater the total amount of probability mass excluded $p(s^c)$, and the greater the potential for s to inform about t (or indeed any other target realisation).
• The negative informational component $i^-(s \to t)$ depends on both s and t, and can be interpreted as follows: the less likely s is to coincide with the event t, the more uncertainty in s given t, the greater the size of the misinformative probability mass exclusion $p(t, s^c)$, and therefore the greater the potential for s to misinform about t.
Hence, while the negative informational component $i^-(s \to t)$ does correspond directly to the size of the misinformative exclusion $p(t, s^c)$, the positive informational component $i^+(s \to t)$ does not correspond directly to the size of the informative exclusion $p(t^c, s^c)$. Rather, the positive informational component $i^+(s \to t)$ corresponds to the total size of the probability mass exclusions $p(s^c)$, which is the sum of the informative and misinformative exclusions.
For the sake of brevity, the positive informational component $i^+(s \to t)$ will be referred to as the specificity, while the negative informational component $i^-(s \to t)$ will be referred to as the ambiguity.
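As a minimal sketch of these two quantities, assuming the entropic forms specificity = h(s) = -log2 p(s) and ambiguity = h(s|t) = -log2 p(s|t) in bits, the following helper functions evaluate them for a single event. The function names and the example probabilities are illustrative assumptions, not part of the original text.

```python
import math

def specificity(p_s):
    """Specificity i+(s -> t) = h(s) = -log2 p(s); depends only on the predictor event."""
    return -math.log2(p_s)

def ambiguity(p_s_given_t):
    """Ambiguity i-(s -> t) = h(s|t) = -log2 p(s|t); depends on both events."""
    return -math.log2(p_s_given_t)

def pointwise_mi(p_s, p_s_given_t):
    """Pointwise mutual information recombined as specificity minus ambiguity."""
    return specificity(p_s) - ambiguity(p_s_given_t)

# Hypothetical informative event: p(s) = 1/4, p(s|t) = 1/2
informative = pointwise_mi(0.25, 0.5)
# Hypothetical misinformative event: p(s) = 1/2, p(s|t) = 1/4
misinformative = pointwise_mi(0.5, 0.25)
print(informative, misinformative)
```

Both components stay non-negative for any valid probabilities, while their difference can take either sign, which is the point of decomposing before comparing sources.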

C. Operational Interpretation of Redundant Information
Arguing about whether one piece of information differs from another piece of information is nonsensical without some kind of unambiguous definition of what it means for two pieces of information to be the same. Hence, Bertschinger et al. [11] have advocated the need to provide an operational interpretation of what it means for information to be unique or redundant. This section will provide such an operational definition whilst simultaneously aiming to hone the vague intuition that redundancy corresponds to overlapping information.
The operational interpretation of redundancy adopted here is based upon the following idea: since the pointwise information is ultimately derived from probability mass exclusions, the same information must induce the same exclusions.More formally, the information provided by a set of predictor events s 1 , . . ., s k about a target event t must be the same information if each source event induces the same exclusions with respect to the two-event partition T t = {t, t c }.While this statement makes the motivational intuition clear, it is not yet sufficient to serve as an operational interpretation of redundancy: there is no reference to the two distinct types of probability mass exclusions, the specific reference to the pointwise event space T t has not been explained, and there is no reference to the fact the exclusions from each source may differ in size.
Informative exclusions are fundamentally different from misinformative exclusions and hence each type of exclusion should be compared separately: informative exclusions can overlap with informative exclusions, and misinformative exclusions can overlap with misinformative exclusions. In information-theoretic terms, this means comparing the specificity and the ambiguity of the sources separately, i.e. considering a measure of redundant specificity and a separate measure of redundant ambiguity. Crucially, these quantities (being pointwise entropies) are non-negative, meaning that the difficulty associated with Axiom 2 (Monotonicity) in Section II C will no longer be an issue.
The specific reference to the two-event partition $\mathcal{T}^t$ in the above statement is based upon Remark 1 and is crucially important. The pointwise mutual information does not depend on the apportionment of the informative exclusions across the set of events which did not occur, hence the pointwise redundant information should not depend on this apportionment either. In other words, it is immaterial if two predictor events $s_1$ and $s_2$ exclude different elementary events within the target complementary event $t^c$ (assuming the probability mass excluded is equal) since with respect to the realised target event t the difference between the exclusions is only semantic. This then has important implications for the comparison of exclusions from different predictor events. As the pointwise mutual information depends on, and only depends on, the size of the exclusions, the only sensible comparison is a comparison of size. Hence, the common or overlapping exclusion must be the smallest exclusion. Thus, consider the following operational interpretation of redundancy:

Operational Interpretation (Redundant Specificity). The redundant specificity between a set of predictor events $s_1, \ldots, s_n$ is the specificity associated with the source event which induces the smallest total exclusions.
Operational Interpretation (Redundant Ambiguity).The redundant ambiguity between a set of predictor events s 1 , . . ., s n is the ambiguity associated with the source event which induces the smallest misinformative exclusion.
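The two operational interpretations can be sketched directly in terms of exclusion sizes. The sketch below assumes that the specificity of a source event with total exclusion p(s^c) is -log2(1 - p(s^c)), and that the ambiguity of a source event with misinformative exclusion p(t, s^c) is -log2(1 - p(t, s^c) / p(t)); the function names and inputs are hypothetical.

```python
import math

def redundant_specificity(total_exclusions):
    """Specificity of the source event with the smallest total exclusion p(s^c)."""
    smallest = min(total_exclusions)
    return -math.log2(1.0 - smallest)

def redundant_ambiguity(misinformative_exclusions, p_t):
    """Ambiguity of the source event with the smallest misinformative exclusion p(t, s^c)."""
    smallest = min(misinformative_exclusions)
    # p(s|t) = (p(t) - p(t, s^c)) / p(t)
    return -math.log2(1.0 - smallest / p_t)

# Hypothetical pair of source events: one excludes 1/2 of the total mass,
# the other 3/4; their misinformative exclusions are 1/4 and 0, with p(t) = 1/2.
print(redundant_specificity([0.5, 0.75]))
print(redundant_ambiguity([0.25, 0.0], p_t=0.5))
```

Note that the source event supplying the redundant specificity need not be the one supplying the redundant ambiguity, since the two minima are taken separately.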

D. Motivational Example
To motivate the above operational interpretation, and in particular the need to treat the specificity separately to the ambiguity, consider Figure 2. In this pointwise example, two different predictor events provide the same amount of pointwise information since $P(T \mid s_1^1) = P(T \mid s_2^1)$, and yet the information provided by each event is in some way different since each excludes different sections of the target distribution $P(T)$. In particular, $s_1^1$ and $s_2^1$ both preclude the target event $t^2$, while $s_2^1$ additionally excludes probability mass associated with target events $t^1$ and $t^3$. From the perspective of the pointwise mutual information, the events $s_1^1$ and $s_2^1$ seem to be providing the same information about the realised target event $t$, as
$$ i(s_1^1; t) = i(s_2^1; t) = \log_2 4/3 \text{ bit}. $$
However, from the perspective of the specificity and the ambiguity it can be seen that information is being provided in different ways since
$$ i^+(s_1^1 \to t) = \log_2 4/3 \text{ bit}, \quad i^-(s_1^1 \to t) = 0 \text{ bit}, \qquad i^+(s_2^1 \to t) = \log_2 8/3 \text{ bit}, \quad i^-(s_2^1 \to t) = 1 \text{ bit}. $$
Now consider the problem of decomposing information into its unique, redundant and complementary components. Figure 2 shows that the exclusions induced by $s_1^1$ and $s_2^1$ overlap where they both exclude the target event $t^2$, which is an informative exclusion. This is the only exclusion induced by $s_1^1$ and hence all of the information associated with this exclusion must be redundantly provided by the event $s_2^1$. Now, without any formal framework, consider taking the redundant specificity and redundant ambiguity,
$$ r^+(s_1^1, s_2^1 \to t) = \log_2 4/3 \text{ bit}, \qquad r^-(s_1^1, s_2^1 \to t) = 0 \text{ bit}. $$
This would mean that the event $s_2^1$ provides the following unique specificity and unique ambiguity,
$$ u^+(s_2^1 \setminus s_1^1 \to t) = 1 \text{ bit}, \qquad u^-(s_2^1 \setminus s_1^1 \to t) = 1 \text{ bit}. $$
The redundant specificity $\log_2 4/3$ bit accounts for the overlapping informative exclusion of the event $t^2$. The unique specificity and unique ambiguity from $s_2^1$ are associated with its non-overlapping informative and misinformative exclusions; however, both of these are 1 bit and hence, on net, $s_2^1$ is no more informative than $s_1^1$. Although attained without a formal framework, this example highlights a need to consider the specificity and ambiguity rather than merely the pointwise mutual information.

FIG. 2. Sample probability mass diagrams for two predictors S1 and S2 to a given target T. Here events in the two different predictor spaces provide the same amount of pointwise information about the target event, $\log_2 4/3$ bits, since $P(T \mid s_1^1) = P(T \mid s_2^1)$, although each excludes different sections of the target distribution $P(T)$. Since they both provide the same amount of information, is there a way to characterise what information the additional unique exclusions from the event $s_2^1$ are providing?
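The arithmetic of this example can be checked numerically. The probabilities below are not read off Figure 2; they are back-calculated assumptions chosen so that the first event has specificity log2(4/3) bit and no ambiguity, and the second event has 1 bit more of each, as stated in the text.

```python
import math

log2 = math.log2

# Assumed probabilities implied by the quoted values (hypothetical, see lead-in):
p_a, p_a_given_t = 3 / 4, 1.0   # first event: specificity log2(4/3), ambiguity 0
p_b, p_b_given_t = 3 / 8, 1 / 2  # second event: specificity log2(8/3), ambiguity 1

spec = [-log2(p_a), -log2(p_b)]
amb = [-log2(p_a_given_t), -log2(p_b_given_t)]
pmi = [sp - am for sp, am in zip(spec, amb)]

# Redundancy taken as the smallest specificity and the smallest ambiguity
red_spec, red_amb = min(spec), min(amb)
unq_spec_b = spec[1] - red_spec
unq_amb_b = amb[1] - red_amb

print(pmi, red_spec, red_amb, unq_spec_b, unq_amb_b)
```

The unique specificity and unique ambiguity of the second event cancel on recombination, which is why the two events carry the same net pointwise information.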

IV. POINTWISE PARTIAL INFORMATION DECOMPOSITION USING SPECIFICITY AND AMBIGUITY
Based upon the argumentation of Section III, consider the following axioms:

Axiom 1 (Symmetry). Pointwise redundant specificity $i^+_\cap$ and pointwise redundant ambiguity $i^-_\cap$ are invariant under any permutation $\sigma$ of source events,
$$ i^\pm_\cap(a_1, \ldots, a_k \to t) = i^\pm_\cap\big(\sigma(a_1), \ldots, \sigma(a_k) \to t\big). $$

Axiom 2 (Monotonicity). Pointwise redundant specificity $i^+_\cap$ and pointwise redundant ambiguity $i^-_\cap$ decrease monotonically as more source events are included,
$$ i^\pm_\cap(a_1, \ldots, a_{k-1}, a_k \to t) \le i^\pm_\cap(a_1, \ldots, a_{k-1} \to t), $$
with equality if $a_k \supseteq a_i$ for any $a_i \in \{a_1, \ldots, a_{k-1}\}$.

Axiom 3 (Self-redundancy). Pointwise redundant specificity $i^+_\cap$ and pointwise redundant ambiguity $i^-_\cap$ for a single source event $a_i$ equal the specificity and ambiguity respectively,
$$ i^+_\cap(a_i \to t) = i^+(a_i \to t) = h(a_i), \qquad i^-_\cap(a_i \to t) = i^-(a_i \to t) = h(a_i \mid t). $$

As shown in Appendix B 1, Axioms 1-3 induce two lattices, namely the specificity lattice and the ambiguity lattice, which are depicted in Figure 3. Furthermore, each lattice is defined for every realisation from $P(S_1, \ldots, S_n, T)$. The redundancy measures $i^+_\cap$ and $i^-_\cap$ can be thought of as cumulative information functions which integrate the specificity or ambiguity uniquely contributed by each node as one moves up each lattice. Finally, just as in PID, performing a Möbius inversion over each lattice yields the unique contributions of specificity and ambiguity from each source event.
Similarly to PID, the specificity and ambiguity lattices provide a structure for information decomposition: unique evaluation requires a separate definition of redundancy.However, unlike PID (or even PPID), this evaluation requires both a definition of pointwise redundant specificity and pointwise redundant ambiguity.Before continuing on to provide these definitions, it is helpful to first see how the specificity and ambiguity lattices can be used to decompose multivariate information.

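A. Bivariate PPID using the Specificity and Ambiguity

Consider again the bivariate case where the aim is to decompose the information provided by $s_1$ and $s_2$ about $t$. The specificity lattice can be used to decompose the pointwise specificity,
$$ \begin{aligned} i^+(s_1 \to t) &= r^+(s_1, s_2 \to t) + u^+(s_1 \setminus s_2 \to t), \\ i^+(s_2 \to t) &= r^+(s_1, s_2 \to t) + u^+(s_2 \setminus s_1 \to t), \\ i^+(s_{1,2} \to t) &= r^+(s_1, s_2 \to t) + u^+(s_1 \setminus s_2 \to t) + u^+(s_2 \setminus s_1 \to t) + c^+(s_1, s_2 \to t), \end{aligned} \tag{18} $$
while the ambiguity lattice can be used to decompose the pointwise ambiguity,
$$ \begin{aligned} i^-(s_1 \to t) &= r^-(s_1, s_2 \to t) + u^-(s_1 \setminus s_2 \to t), \\ i^-(s_2 \to t) &= r^-(s_1, s_2 \to t) + u^-(s_2 \setminus s_1 \to t), \\ i^-(s_{1,2} \to t) &= r^-(s_1, s_2 \to t) + u^-(s_1 \setminus s_2 \to t) + u^-(s_2 \setminus s_1 \to t) + c^-(s_1, s_2 \to t). \end{aligned} \tag{19} $$
These equations share the same structural form as (3), only now they decompose the specificity and the ambiguity rather than the pointwise mutual information, e.g. $r^+(s_1, s_2 \to t)$ denotes the redundant specificity while $u^-(s_1 \setminus s_2 \to t)$ denotes the unique ambiguity from $s_1$. Just as for (3), this decomposition could be considered for every realisation on the support of the joint distribution $P(S_1, S_2, T)$.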
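There are two ways one can combine these values. Firstly, as per (4), one could take the expectation of the atoms of specificity or the atoms of ambiguity over all realisations, yielding the average PI atoms of specificity and ambiguity,
$$ \begin{aligned} U^\pm(S_1 \setminus S_2 \to T) &= \mathbb{E}\big[u^\pm(s_1 \setminus s_2 \to t)\big], & R^\pm(S_1, S_2 \to T) &= \mathbb{E}\big[r^\pm(s_1, s_2 \to t)\big], \\ U^\pm(S_2 \setminus S_1 \to T) &= \mathbb{E}\big[u^\pm(s_2 \setminus s_1 \to t)\big], & C^\pm(S_1, S_2 \to T) &= \mathbb{E}\big[c^\pm(s_1, s_2 \to t)\big]. \end{aligned} \tag{20} $$
Alternatively, one could subtract the pointwise unique, redundant and complementary ambiguity from the pointwise unique, redundant and complementary specificity, yielding the pointwise unique, redundant and complementary information,
$$ \begin{aligned} u(s_1 \setminus s_2 \to t) &= u^+(s_1 \setminus s_2 \to t) - u^-(s_1 \setminus s_2 \to t), & r(s_1, s_2 \to t) &= r^+(s_1, s_2 \to t) - r^-(s_1, s_2 \to t), \\ u(s_2 \setminus s_1 \to t) &= u^+(s_2 \setminus s_1 \to t) - u^-(s_2 \setminus s_1 \to t), & c(s_1, s_2 \to t) &= c^+(s_1, s_2 \to t) - c^-(s_1, s_2 \to t). \end{aligned} \tag{21} $$
Both (20) and (21) are linear operations, hence one could perform both of these operations (in either order) to attain the average unique, average redundant and average complementary information, i.e. recover (1) from PID,
$$ \begin{aligned} U(S_1 \setminus S_2 \to T) &= U^+(S_1 \setminus S_2 \to T) - U^-(S_1 \setminus S_2 \to T), & R(S_1, S_2 \to T) &= R^+(S_1, S_2 \to T) - R^-(S_1, S_2 \to T), \\ U(S_2 \setminus S_1 \to T) &= U^+(S_2 \setminus S_1 \to T) - U^-(S_2 \setminus S_1 \to T), & C(S_1, S_2 \to T) &= C^+(S_1, S_2 \to T) - C^-(S_1, S_2 \to T). \end{aligned} \tag{22} $$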
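The bivariate bookkeeping can be sketched as follows: given the three pointwise values for a lattice and a redundancy value supplied from elsewhere, the unique and complementary atoms follow by subtraction, and the specificity and ambiguity atoms are then recombined by taking their difference. The function names and the numeric inputs are hypothetical, chosen only to illustrate one realisation.

```python
def bivariate_atoms(i1, i2, i12, r):
    """Solve the bivariate lattice for the unique and complementary atoms.

    i1, i2, i12: pointwise specificity (or ambiguity) of s1, s2 and (s1, s2).
    r: redundant specificity (or ambiguity), which needs its own definition.
    """
    u1 = i1 - r
    u2 = i2 - r
    c = i12 - r - u1 - u2
    return {"r": r, "u1": u1, "u2": u2, "c": c}

def recombine(spec_atoms, amb_atoms):
    """Pointwise information atoms: specificity atom minus ambiguity atom."""
    return {k: spec_atoms[k] - amb_atoms[k] for k in spec_atoms}

# Hypothetical values in bits for a single realisation
spec = bivariate_atoms(i1=1.0, i2=2.0, i12=2.0, r=1.0)
amb = bivariate_atoms(i1=1.0, i2=1.0, i12=1.0, r=1.0)
info = recombine(spec, amb)
print(spec, amb, info)
```

Averaging the resulting dictionaries over all realisations, weighted by their probabilities, would give the average atoms; because both steps are linear, the order of averaging and recombining does not matter.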

B. Redundancy Measures on the Specificity and Ambiguity Lattices
There is now a need to define the pointwise redundant specificity and pointwise redundant ambiguity. However, before attempting to provide such a definition, there is a need to consider Remark 1 and the operational interpretation of Section III C. In particular, the pointwise redundant specificity $i^+_\cap$ and pointwise redundant ambiguity $i^-_\cap$ should only depend on the size of informative and misinformative exclusions. They should not depend on the apportionment of the informative exclusions across the set of elementary events contained in the complementary event $t^c$. Formally, this requirement will be enshrined via the following axiom:

Axiom 4 (Two-event Partition). The pointwise redundant specificity $i^+_\cap$ and pointwise redundant ambiguity $i^-_\cap$ are functions of the probability measures on the two-event partitions $\mathcal{A}_1^{a_1} \times \mathcal{T}^t, \ldots, \mathcal{A}_k^{a_k} \times \mathcal{T}^t$.

Since the pointwise redundant specificity $i^+_\cap$ is the specificity associated with the source event which induces the smallest total exclusions, and the pointwise redundant ambiguity $i^-_\cap$ is the ambiguity associated with the source event which induces the smallest misinformative exclusion, consider the following:

Definition 1. The pointwise redundant specificity is given by
$$ r^+_{\min}(a_1, \ldots, a_k \to t) = \min_{a_i} h(a_i). \tag{23} $$

Definition 2. The pointwise redundant ambiguity is given by
$$ r^-_{\min}(a_1, \ldots, a_k \to t) = \min_{a_j} h(a_j \mid t). \tag{24} $$

See Appendix B 2 for further relevant consideration of these measures. Note, the proof of the following theorems can be found in Appendix B 2.
Theorem 1. The definitions of $r^+_{\min}$ and $r^-_{\min}$ satisfy Axioms 1-4.

Theorem 2. The redundancy measures $r^+_{\min}$ and $r^-_{\min}$ increase monotonically on the lattice $\langle \mathcal{A}(\mathbf{s}), \preceq \rangle$.
Theorem 3. The atoms of partial specificity $\pi^+$ and partial ambiguity $\pi^-$ evaluated using the measures $r^+_{\min}$ and $r^-_{\min}$ on the specificity and ambiguity lattices (respectively), are non-negative.

As in (20), one can take the expectation of either the pointwise redundant specificity $r^+_{\min}$ or the pointwise redundant ambiguity $r^-_{\min}$ to get the average redundant specificity $R^+_{\min}$ or the average redundant ambiguity $R^-_{\min}$. Alternatively, just as in (21), one can recombine the pointwise redundant specificity $r^+_{\min}$ and the pointwise redundant ambiguity $r^-_{\min}$ to get the pointwise redundant information $r_{\min}$. Finally, as per (22), one could perform both of these (linear) operations in either order to obtain the average redundant information $R_{\min}$. Note that while Theorem 3 proves that the atoms of partial specificity $\pi^+$ and partial ambiguity $\pi^-$ are non-negative, it is trivial to see that $r_{\min}$ could be negative since source events can redundantly provide misinformation about a target event. As shown in the following theorem, $R_{\min}$ can also be negative.
Theorem 4. The atoms of partial average information $\Pi$ evaluated by recombining and averaging $\pi^\pm$ are not necessarily non-negative.
This means that the measure R min does not satisfy local positivity.Nonetheless the negativity of R min is readily explainable in terms of the operational interpretation of Section III C, as will be discussed further in Section V D. However, failing to satisfy local positivity does mean that r min and R min do not satisfy the target monotonicity property first discussed in Bertschinger et al. [5].Despite this, as the following theorem shows, the measures do satisfy the target chain rule.
Theorem 5 (Pointwise Target Chain Rule). Given the joint target realisation $t_{1,2}$, the pointwise redundant information $r_{\min}$ satisfies the following chain rule,
$$ \begin{aligned} r_{\min}(a_1, \ldots, a_k \to t_{1,2}) &= r_{\min}(a_1, \ldots, a_k \to t_1) + r_{\min}(a_1, \ldots, a_k \to t_2 \mid t_1) \\ &= r_{\min}(a_1, \ldots, a_k \to t_2) + r_{\min}(a_1, \ldots, a_k \to t_1 \mid t_2). \end{aligned} $$

The proof of the last theorem is deferred to Appendix B 3. Note that since the expectation is a linear operation, Theorem 5 also holds for the average redundant information $R_{\min}$. Furthermore, as these results apply to any of the source events, the target chain rule will hold for any of the PPI atoms, e.g. (21), and any of the PI atoms, e.g. (22). However, no such rule holds for the pointwise redundant specificity or ambiguity. The specificity depends only on the predictor event, i.e. does not depend on the target events. As such, when an increasing number of target events are considered, the specificity remains unchanged. Hence, a target chain rule cannot hold for the specificity, or the ambiguity.
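A numerical sketch of the target chain rule follows. It assumes that the redundant information is the smallest specificity minus the smallest ambiguity over the source events, and that the conditional term is obtained by conditioning every probability on the first target event; this conditional form is an assumption made for illustration, with the proof deferred to Appendix B 3. The probabilities correspond to a two-bit-copy style realisation with $t_1$ a copy of the first source and $t_2$ a copy of the second.

```python
import math

def h(p):
    """Pointwise entropy in bits."""
    return -math.log2(p)

def r_min(p_a, p_a_given_t):
    """Smallest specificity minus smallest ambiguity over the source events."""
    return min(h(p) for p in p_a) - min(h(p) for p in p_a_given_t)

def r_min_conditional(p_a_given_t1, p_a_given_t12):
    """Assumed conditional form: the same measure with all terms conditioned on t1."""
    return min(h(p) for p in p_a_given_t1) - min(h(p) for p in p_a_given_t12)

# Source events a1 = (S1 = 0), a2 = (S2 = 0); target events t1 = (T1 = 0), t2 = (T2 = 0)
p_a = [0.5, 0.5]            # p(a_i)
p_a_given_t1 = [1.0, 0.5]   # p(a_i | t1), since T1 copies S1
p_a_given_t12 = [1.0, 1.0]  # p(a_i | t1, t2)

joint_target = r_min(p_a, p_a_given_t12)
chained = r_min(p_a, p_a_given_t1) + r_min_conditional(p_a_given_t1, p_a_given_t12)
print(joint_target, chained)
```

Under these assumptions the sum telescopes: the smallest ambiguity with respect to $t_1$ cancels against the smallest conditional specificity, leaving the joint-target value.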
V. DISCUSSION

PPID using the specificity and ambiguity takes the ideas underpinning PID and applies them on a pointwise scale, but also circumvents the monotonicity issue associated with the signed pointwise mutual information. This section will explore the various properties of the decomposition in an example-driven manner and compare the results to the most widely-used measures from the existing PID literature. (Further examples can be found in Appendix C.) The following shorthand notation will be utilised in the figures throughout this section:

A. Comparison to Existing Measures
A similar approach to the decomposition presented in this paper is due to Ince [18], who also sought to define a pointwise information decomposition.Despite the similarity in this regard, I ccs (the name given to the redundancy measure presented in [18]) approaches the pointwise monotonicity problem of Section II C in an entirely different way to the decomposition presented here.Specifically, I ccs attempts to utilise the pointwise co-information to measure "the set-theoretic overlap of the two univariate [pointwise] information values" despite the lack of non-negativity for any of these three quantities.Aware of this issue, Ince defines I ccs such that it "only interpret[s] the [pointwise] co-information as a set-theoretic overlap in the case where all three [pointwise] information terms have the same sign" and hence ignores the other situations "which do not admit a clear interpretation" [18, p. 11].In contrast, the PPID using specificity and ambiguity does not dispose of the set-theoretic intuition in these difficult to interpret situations.Rather, PPID using specificity and ambiguity considers the notion of redundancy in terms of overlapping exclusions-i.e. the underlying, non-negative quantities which are amenable to the set-theoretic interpretation.
The measures of pointwise redundant specificity r + min and pointwise redundant ambiguity r − min are also similar to both the minimum mutual information I mmi [17] and the original PID redundancy measure I min [1] in that, all of these approaches consider the redundant information to be the minimum information provided about a target event t.However, I min applies this idea to the sources A 1 , . . ., A k , i.e. to collections of entire predictor variables from S. In contrast, r ± min applies this notion to the source events a 1 , . . ., a k , i.e. to collections of predictor events from s.In other words, the measure I min can be regarded as being semi-pointwise, since it considers the information provided by the variables S 1 , . . ., S n about an event t.On the other hand, the measures r ± min are fully pointwise since they consider the information provided by the events s 1 , . . ., s n about an event t.This difference in approach can be seen in the probability distribution PwUnq-unlike PID, PPID using the specificity and ambiguity respects the pointwise nature of information (see Section V C).
PPID using specificity and ambiguity also shares certain similarities with the PID induced by the measure U I of Bertschinger et al. [11]. Firstly, Axiom 4 can be considered to be a pointwise adaptation of their Assumption ( * ). That is, the measures r ± min depend only on the marginal distributions P (S 1 , T ) and P (S 2 , T ) with respect to the two-event partitions S s1 1 ×T t and S s2 2 ×T t . Secondly, in PPID using specificity and ambiguity, the only way one can decide if there is complementary information c(s 1 , s 2 → t) is by knowing the joint distribution P (S 1 , S 2 , T ) with respect to the joint two-event partitions S s1 1 ×S s2 2 ×T t . This is (in effect) a pointwise form of Assumption ( * * ). Thirdly, by definition, r ± min are given by the minimum value that any one source event provides. This is the largest possible value that one could take for these quantities whilst still requiring that the unique specificity and ambiguity be non-negative. Hence, within each realisation, r ± min minimise the unique specificity and ambiguity, while maximising the redundant specificity and ambiguity. This is similar to U I which minimises the (average) unique information whilst still satisfying Assumption ( * ). Finally, note that since the measure S VK produces a bivariate decomposition which is equivalent to that of U I [11], the same similarities apply between PPID using specificity and ambiguity, and the decomposition induced by S VK from Griffith and Koch [12].

B. Probability Distribution Xor
Figure 4 shows the canonical example of synergy, exclusive-or (Xor) which considers two independently distributed binary predictor variables S 1 and S 2 and a target variable T = S 1 XOR S 2 .There are several important points to note about the decomposition of Xor.Firstly, despite providing zero pointwise information, an individual predictor event does induce exclusions.The informative and misinformative exclusions are perfectly balanced such that the posterior (conditional) distribution is equal to the prior distribution, e.g.see the red coloured exclusions induced by S 1 = 0.In information-theoretic terms, for each realisation, the pointwise specificity equals 1 bit since half of the total probability mass remains; while the pointwise ambiguity also equals 1 bit since half of the probability mass associated with the event which subsequently occurs, i.e.T = 0, remains.These are perfectly balanced such that when recombined, as per (11), the pointwise mutual information is equal to 0 bit-as one would expect.
Secondly, S 1 = 0 and S 2 = 0 both induce the same exclusions with respect to the target pointwise event space T T =0 .Hence, as per the operational interpretation of redundancy adopted in Section III C, there is 1 bit of pointwise redundant specificity and 1 bit of pointwise redundant ambiguity in each realisation.The presence of (a form of) redundancy in Xor is entirely novel amongst the existing measures in the PID literature.Thirdly, despite the presence of this redundancy, recombining the atoms of pointwise specificity and ambiguity for each realisation, as per (21), leaves only one non-zero PPI atom, namely the pointwise complementary information c(s 1 , s 2 → t) = 1 bit.Furthermore, this is true for every pointwise realisation; hence, by (22), the only non-zero PI atom is the average complementary information C(S 1 , S 2 → T ) = 1 bit-as one would expect.
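The Xor decomposition described above can be reproduced with a short script. This is a sketch under stated assumptions rather than the authors' implementation: it takes the redundant specificity and ambiguity to be the minimum over the two individual source events, solves the bivariate lattice by subtraction, recombines, and averages. All function and variable names are hypothetical.

```python
import math
from itertools import product

# Xor: S1 and S2 are independent uniform bits, T = S1 XOR S2; events are (s1, s2, t)
joint = {(s1, s2, s1 ^ s2): 0.25 for s1, s2 in product((0, 1), repeat=2)}

def prob(pred):
    """Total probability of the realisations satisfying pred."""
    return sum(p for e, p in joint.items() if pred(e))

def h(p):
    return -math.log2(p)

def ppid(s1, s2, t):
    """Pointwise information atoms for one realisation (bivariate case)."""
    p_t = prob(lambda e: e[2] == t)
    sources = {
        "s1": lambda e: e[0] == s1,
        "s2": lambda e: e[1] == s2,
        "s12": lambda e: e[0] == s1 and e[1] == s2,
    }
    spec = {k: h(prob(f)) for k, f in sources.items()}
    amb = {k: h(prob(lambda e, f=f: f(e) and e[2] == t) / p_t)
           for k, f in sources.items()}
    atoms = {}
    for sign, i in (("+", spec), ("-", amb)):
        r = min(i["s1"], i["s2"])  # redundancy as the minimum over source events
        u1, u2 = i["s1"] - r, i["s2"] - r
        atoms[sign] = {"r": r, "u1": u1, "u2": u2, "c": i["s12"] - r - u1 - u2}
    return atoms

def recombined(atoms):
    return {k: atoms["+"][k] - atoms["-"][k] for k in atoms["+"]}

pointwise = ppid(0, 0, 0)
average = {k: sum(p * recombined(ppid(*e))[k] for e, p in joint.items())
           for k in ("r", "u1", "u2", "c")}
print(pointwise)
print(average)
```

For the realisation (0, 0, 0) the specificity lattice holds 1 bit of redundancy and 1 bit of complementary specificity, the ambiguity lattice holds 1 bit of redundancy only, and the recombined average leaves 1 bit of complementary information.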
FIG. 5. Example PwUnq. Top: probability mass diagrams for the realisation (S1 = 0, S2 = 1, T = 1). Middle: for each realisation, the PPID using specificity and ambiguity is evaluated (see Figure 4 for details). Upon recombination as per (21), the PPI decomposition from Table IV is attained. Bottom: as does the average information; the decomposition does not have the pointwise unique problem.

C. Probability Distribution PwUnq
Figure 5 shows the probability distribution PwUnq introduced in Section II B. Recombining the decomposition via (21) yields the pointwise information decomposition proposed in Table IV-unsurprisingly, the explicitly pointwise approach results in a decomposition which does not suffer from the pointwise unique problem.
In each realisation, observing a 0 in either source provides the same balanced informative and misinformative exclusions as in Xor (e.g.see the red colored exclusions induced by S 1 = 0).Observing either a 1 or 2 provides the same misinformative exclusion as observing the 0, but provides a larger informative exclusion than 0. In particular, this leaves only the probability mass associated with the event which subsequently occurs remaining (hence why observing a 1 and 2 is fully informative about the target).Information theoretically, in each realisation the predictor events provide 1 bit of redundant pointwise specificity and 1 bit of redundant pointwise ambiguity; while the fully informative event additionally provides 1 bit of unique specificity.

D. Probability Distribution RdnErr
Figure 6 shows the probability distribution redundant-error (RdnErr) which considers two predictors which are nominally redundant and fully informative about the target, but where one predictor occasionally makes an erroneous prediction.In particular, Figure 6 shows the decomposition of RdnErr where S 2 makes an error with a probability ε = 1 /4.The important feature to note about this probability distribution is that upon recombining the specificity and ambiguity, and then taking the expectation over every realisation, the resultant average unique information from S 2 is U (S 2 \S 1 → T ) = −0.811bit, which is clearly negative.
On first inspection, the result that the average unique information can be negative may seem problematic; however, it is readily explainable in terms of the operational interpretation of Section III C. In RdnErr, a source event always excludes exactly 1/2 of the total probability mass, thus every realisation contains 1 bit of redundant pointwise specificity. The events of the error-free S 1 induce only informative exclusions and as such provide 0 bit of pointwise ambiguity in each realisation. In contrast, the events in the error-prone S 2 always induce a misinformative exclusion, meaning that S 2 provides unique pointwise ambiguity in every realisation. Since S 2 never provides unique specificity, the unique information from S 2 is negative on average.
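The value of -0.811 bit can be checked with a short calculation. The sketch assumes a particular form for RdnErr that is consistent with the description above but is not spelled out in this section: T is a uniform bit, S1 always equals T, and S2 equals T except with probability 1/4, when it is flipped. Redundancy is again taken as the minimum over the two source events.

```python
import math

log2 = math.log2
eps = 0.25  # error probability of S2

# Assumed RdnErr distribution over (s1, s2, t): S1 = T always, S2 flipped w.p. eps
joint = {}
for t in (0, 1):
    joint[(t, t, t)] = 0.5 * (1 - eps)
    joint[(t, 1 - t, t)] = 0.5 * eps

def prob(pred):
    return sum(q for e, q in joint.items() if pred(e))

avg_r = 0.0   # average redundant information
avg_u2 = 0.0  # average unique information from S2
for (s1, s2, t), q in joint.items():
    p_t = prob(lambda e: e[2] == t)
    spec = [-log2(prob(lambda e: e[0] == s1)),
            -log2(prob(lambda e: e[1] == s2))]
    amb = [-log2(prob(lambda e: e[0] == s1 and e[2] == t) / p_t),
           -log2(prob(lambda e: e[1] == s2 and e[2] == t) / p_t)]
    avg_r += q * (min(spec) - min(amb))
    avg_u2 += q * ((spec[1] - min(spec)) - (amb[1] - min(amb)))

print(round(avg_r, 3), round(avg_u2, 3), round(avg_r + avg_u2, 3))
```

Under these assumptions the average unique ambiguity of S2 equals the conditional entropy of S2 given T, so the negative unique term and the 1 bit of redundancy sum to the mutual information between S2 and T.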
Despite the negativity of the average unique information, note that S 2 provides 0.189 bit of information since S 2 also provides 1 bit of average redundant information. Hence, it is not that S 2 provides negative information on average (as this is not possible); rather, it is that not all of the information provided by S 2 (i.e. the specificity) is "useful" [41, p. 21]. This is in contrast to S 1 which only provides useful specificity. That is, it is the unique ambiguity which distinguishes the information provided by variable S 2 from S 1 , and hence why S 2 is deemed to provide negative average unique information. This form of uniqueness can only be distinguished if one allows the average unique information to be negative. Of course, this requires abandoning local positivity as a required property, as per Theorem 4. Few of the existing measures in the PID literature consider dropping this requirement as negative information quantities are typically regarded as being "unfortunate" [32, p. 49]. However, in the context of the pointwise mutual information, negative information values are readily interpretable as being misinformative values. Of course, the average information from each predictor must be non-negative; however, it may be that what distinguishes one predictor from another are precisely the misinformative predictor events, meaning that the unique information is, in actual fact, unique misinformation. Forgoing local positivity makes the PPID using specificity and ambiguity somewhat unique (the other exception is Ince [18]).

FIG. 6. Example RdnErr. Middle: for each realisation, the PPID using specificity and ambiguity is evaluated (see Figure 4 for details). Bottom: the average PI atoms may be negative as the decomposition does not satisfy local positivity.
FIG. 7. Example Tbc. Top: the probability mass diagrams for the realisation (S1 = 0, S2 = 0, T = 00). Middle: for each realisation, the PPID using specificity and ambiguity is evaluated (see Figure 4). Bottom: the decomposition of Xor yields the same result as Imin.

E. Probability Distribution Tbc
Figure 7 shows the probability distribution two-bit-copy (Tbc) which considers two independently distributed binary predictor variables S 1 and S 2 , and a target variable T consisting of a separate elementary event for each joint event S 1,2 . There are several important points to note about the decomposition of Tbc. Firstly, due to the symmetry in the probability distribution, each realisation will have the same pointwise decomposition. Secondly, due to the construction of the target, there is an isomorphism between P (T ) and P (S 1 , S 2 ), and hence the pointwise ambiguity provided by any (individual or joint) predictor event is 0 bit (since given t, one knows s 1 and s 2 ). Thirdly, the individual predictor events s 1 and s 2 each exclude 1/2 of the total probability mass in P (T ) and so each provide 1 bit of pointwise specificity; thus, by (23), there is 1 bit of redundant pointwise specificity in each realisation. Fourthly, the joint predictor event s 1,2 excludes 3/4 of the total probability mass, providing 2 bit of pointwise specificity; hence, by (18), each joint realisation provides 1 bit of pointwise complementary specificity in addition to the 1 bit of redundant pointwise specificity. Finally, putting this together via (22), Tbc consists of 1 bit of average redundant information and 1 bit of average complementary information.
Although "surprising" [5, p. 268], according to the operational interpretation adopted in Section III C, two independently distributed predictor variables can share redundant information.That is, since the exclusions induced by s 1 and s 2 are the same with respect to the two-event partition T t , the information associated with these exclusions is regarded as being the same.Indeed, this probability distribution highlights the significance of specific reference to the two-event partition in Section III C and Axiom 4. (This can be seen in the probability mass diagram in Figure 7, where the events S 1 = 0 and S 2 = 0 exclude different elementary target events within the complementary event 0 c , and yet are considered to be the same exclusion with respect to the two-event partition T 0 .)That these exclusions should be regarded as being the same is discussed further in Appendix A; now however, there is a need to discuss Tbc in terms of Theorem 5 (Target Chain Rule).
Tbc was first considered as a "mechanism" [6, p. 3] where "the wires don't even touch" [12, p. 167], which merely copies or concatenates S 1 and S 2 into a composite target variable T 1,2 = (T 1 , T 2 ) where T 1 = S 1 and T 2 = S 2 . However, using causal mechanisms as a guiding intuition is dubious since different mechanisms can yield isomorphic probability distributions [42, and references therein]. In particular, consider two mechanisms which generate the composite target variables T 1,3 = (T 1 , T 3 ) and T 2,3 = (T 2 , T 3 ) where T 3 = S 1 XOR S 2 . As can be seen in Figure 7, both of these mechanisms generate the same (isomorphic) probability distribution P (S 1 , S 2 , T ) as the mechanism generating T 1,2 . If an information decomposition is to depend only on the probability distribution P (S 1 , S 2 , T ), and no other semantic details such as labelling, then all three mechanisms must yield the same information decomposition; this is not clear from the mechanistic intuition.

TABLE V. (Columns: I(S1,2; T1,3), I(S1,2; T1), I(S1,2; T3|T1), I(S1,2; T3), I(S1,2; T1|T3). First row of measures: U I, I red , SVK.) Shows the decomposition of the quantities in the first row induced by the measures in the first column. For consistency, the decomposition of I(S1,2; T1,3) should equal both the sum of the decomposition of I(S1,2; T1) and I(S1,2; T3|T1), and the sum of the decomposition of I(S1,2; T3) and I(S1,2; T1|T3). Note that the decomposition induced by U I, I red and SVK are not consistent. In contrast, Rmin is consistent due to Theorem 5.
Although the decomposition of the various composite target variables must be the same, there is no requirement that the three systems must yield the same decomposition when analysed in terms of the individual components of the composite target variables.Nonetheless, there ought to be a consistency between the decomposition of the composite target variables, and the decomposition of the component target variables-in other words, there should be a target chain rule.As shown in Theorem 5, the measures r min and R min satisfy the target chain rule, whereas I min , U I, I red and S VK do not [5,7].Failing to satisfy the target chain rule can lead to inconsistencies between the composite and component decompositions, depending on the order in which one considers decomposing the information (this is discussed further in Appendix A 3).In particular, Table V shows how U I, I red and S VK all provide the same inconsistent decomposition for Tbc when considered in terms of the composite target variable T 1,3 .In contrast, R min produces a consistent decomposition of T 1,3 .Finally, based on the above isomorphism, consider the following (the proof is deferred to Appendix B 3): Theorem 6.The target chain rule, identity property and local positivity, cannot be simultaneously satisfied.

F. Summary of Key Properties
The following are the key properties of the PPID using the specificity and ambiguity. Property 1 follows directly from Definitions 1 and 2. Property 2 follows from Theorems 3 and 4. Property 3 follows from the probability distribution Tbc in Section V E. Property 4 was discussed in Section IV B. Property 5 is proved in Theorem 5.
Property 1. When considering the redundancy between the source events $a_1, \ldots, a_k$, at least one source event $a_i$ will provide zero unique specificity, and at least one source event $a_j$ will provide zero unique ambiguity. The events $a_i$ and $a_j$ are not necessarily the same source event.
Property 2. The atoms of partial specificity and partial ambiguity satisfy local positivity, $\pi^\pm \geq 0$. However, upon recombination and averaging, the atoms of partial information do not satisfy local positivity, $\Pi \ngeq 0$.
Property 3. The decomposition does not satisfy the identity property.

VI. CONCLUSION
The partial information decomposition of Williams and Beer [1,2] provided an intriguing framework for the decomposition of multivariate information. However, it was not long before "serious flaws" [11, p. 2163] were identified. Firstly, the measure of redundant information $I_{\min}$ fails to distinguish between whether predictor variables provide the same information or merely the same amount of information. Secondly, $I_{\min}$ fails to satisfy the target chain rule, despite this kind of additivity being one of the defining characteristics of information. Notwithstanding these problems, the axiomatic derivation of the redundancy lattice was too elegant to be abandoned, and hence several alternative measures were proposed, i.e. $I_{\text{red}}$, $UI$ and $S_{\text{VK}}$ [6,11,12]. However, as these measures all satisfy the identity property, they cannot produce a non-negative decomposition for an arbitrary number of variables [13]. Furthermore, none of these measures satisfy the target chain rule. Finally, in spite of satisfying the identity property (which many consider to be desirable), these measures still fail to identify when variables provide the same information, as exemplified by the pointwise unique problem presented in Section II.
This paper took the axiomatic derivation of the redundancy lattice from PID and applied it to the separate entropic components of the pointwise mutual information: the specificity and the ambiguity. Then, based upon an operational interpretation of redundancy, measures of pointwise redundant specificity $r^+_{\min}$ and pointwise redundant ambiguity $r^-_{\min}$ were defined. Together with the specificity and ambiguity lattices, these measures were used to decompose multivariate information for an arbitrary number of variables. Crucially, upon recombination, the measure $r_{\min}$ satisfies the target chain rule. Furthermore, when applied to PwUnq, these measures do not result in the pointwise unique problem.
In our opinion, this demonstrates that the decomposition is indeed correctly identifying redundant information. However, some will likely disagree with this point given that the measure of redundancy does not satisfy the identity property. According to the identity property, independent variables can never provide the same information. In contrast, according to the operational interpretation adopted in this paper, independent variables can provide the same information if they happen to provide the same exclusions with respect to the two-event target distribution.
In any case, the proof of Theorem 6 and the subsequent discussion in Appendix B 3 highlight the difficulties that the identity property introduces when considering the information provided about events in separate target variables.
(See further discussion in Appendix A 3).
Our future work with this decomposition will be both theoretical and empirical. Regarding future theoretical work, given that the aim of information decomposition is to derive measures pertaining to sets of random variables, it would be worthwhile to derive the information decomposition from first principles in terms of measure theory. Indeed, such an approach would surely eliminate the semantic argument about what it means for information to be unique, redundant or complementary, which currently plagues the problem domain. Furthermore, this would certainly be a worthwhile exercise before attempting to generalise the information decomposition to continuous random variables. Regarding future empirical work, there are many rich data sets which could be decomposed using this decomposition, including financial time-series and neural recordings, e.g. [27,30,31].

In Section III C, it was argued that the information provided by a set of predictor events $s_1, \ldots, s_k$ about a target event $t$ is the same information if each source event induces the same exclusions with respect to the two-event partition $\mathcal{T}^t = \{t, t^c\}$. This was based on the fact that pointwise mutual information does not depend on the apportionment of the exclusions across the set of events which did not occur, $t^c$. It was argued that since the pointwise mutual information is independent of these differences, the redundant mutual information should also be independent of these differences. This requirement was then integrated into the operational interpretation of Section III C, and was later enshrined in the form of Axiom 4. This appendix aims to justify this operational interpretation, and in particular show why redundant information in Tbc is not "unreasonably large" [5, p. 269].

Pointwise Side Information and the Kelly Criterion
Consider a set of horses $\mathcal{T}$ running in a race which can be considered a random variable $T$ with distribution $P(T)$. Say that for each $t \in \mathcal{T}$ a bookmaker offers odds of $o(t)$-for-1, i.e. the bookmaker will pay out $o(t)$ dollars on a 1 dollar bet if the horse $t$ wins. Furthermore, say that there is no track take, i.e. $\sum_{t \in \mathcal{T}} 1/o(t) = 1$, and that these odds are fair, i.e. $o(t) = 1/p(t)$ for all $t \in \mathcal{T}$ [39]. Let $b(T)$ be the fraction of a gambler's capital bet on each horse $t \in \mathcal{T}$ and assume that the gambler stakes all of their capital on the race, i.e. $\sum_{t \in \mathcal{T}} b(t) = 1$. Now consider an i.i.d. series of these races $T_1, T_2, \ldots$ such that $P(T_k) = P(T)$ for all $k \in \mathbb{N}$, and let $t_k \in \mathcal{T}$ represent the winner of the $k$-th race. Say that the bookmaker offers the same odds on each race and the gambler bets their entire capital on each race. The gambler's capital after $m$ races, $D_m$, is a random variable which depends on two factors per race: the amount the gambler staked on each race winner $t_k$, and the odds offered on each winner $t_k$. That is,
$$D_m = \prod_{k=1}^{m} b(t_k)\, o(t_k),$$
where monetary units have been chosen such that $D_0 = 1$. Hence, the gambler's wealth grows (or shrinks) exponentially, i.e.
$$D_m = 2^{\,m\, W(b, T)},$$
where
$$W(b, T) = \frac{1}{m} \log D_m = \frac{1}{m} \sum_{k=1}^{m} \log b(t_k)\, o(t_k) = E\big[\log b(T)\, o(T)\big]$$
is the doubling rate of the gambler's wealth using a betting strategy $b(T)$. Here, the last equality is by the weak law of large numbers for large $m$.
Any reasonable gambler would aim to use an optimal strategy $b^*(T)$ which maximises the doubling rate $W(b, T)$. Kelly [32,39] proved that the optimal doubling rate is given by
$$W^*(T) = \max_{b} W(b, T) = \sum_{t \in \mathcal{T}} p(t) \log o(t) - H(T),$$
and is achieved by using the proportional gambling scheme $b^*(T) = P(T)$. When the race $T_k$ occurs and the horse $t_k$ wins, the gambler will receive a payout of $b^*(t_k)\, o(t_k) = 1$, i.e. the gambler receives their stake back regardless of the outcome. In the face of fair odds, the proportional Kelly betting scheme is the optimal strategy; non-terminating repeated betting with any other strategy will result in losses. Now consider a gambler with access to a private wire $S$ which provides (potentially useful) side information about the upcoming race. Say that these messages are selected from the set $\mathcal{S}$, and that the gambler receives the message $s_k$ before the race $T_k$. Kelly [32,39] showed that the optimal doubling rate in the presence of this side information is given by
$$W^*(T|S) = \sum_{t \in \mathcal{T}} p(t) \log o(t) - H(T|S),$$
and is achieved by using the conditional proportional gambling scheme $b^*(T|s_k) = P(T|s_k)$. Both the proportional gambling scheme $b^*(T)$ and the conditional proportional gambling scheme $b^*(T|S)$ are based upon the Kelly criterion, whereby bets are apportioned according to the best estimation of the outcome available. The financial value of the private wire to a gambler can be ascertained by comparing the doubling rate of the gambler with access to the side wire to that of a gambler with no side information, i.e.
$$\Delta W = W^*(T|S) - W^*(T) = H(T) - H(T|S) = I(S; T). \quad \text{(A6)}$$
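As a numerical illustration of Kelly's result, the following sketch computes the doubling rates with and without side information and compares their difference to the mutual information. The joint distribution `p_st` is hypothetical and chosen only for illustration, and the function names are not from the paper; odds are assumed fair.

```python
from math import log2

# Hypothetical joint distribution p(s, t): two wire messages, three horses.
p_st = {(0, 0): 0.30, (0, 1): 0.15, (0, 2): 0.05,
        (1, 0): 0.10, (1, 1): 0.15, (1, 2): 0.25}

messages = sorted({s for s, _ in p_st})
horses = sorted({t for _, t in p_st})
p_s = {s: sum(p_st[s, t] for t in horses) for s in messages}
p_t = {t: sum(p_st[s, t] for s in messages) for t in horses}
odds = {t: 1 / p_t[t] for t in horses}  # fair odds, o(t) = 1/p(t)

def doubling_rate(bets):
    """W(b, T) = sum_t p(t) log2(b(t) o(t)), no side information."""
    return sum(p_t[t] * log2(bets[t] * odds[t]) for t in horses)

def doubling_rate_side(cond_bets):
    """W(b, T|S) = sum_{s,t} p(s,t) log2(b(t|s) o(t))."""
    return sum(q * log2(cond_bets[s][t] * odds[t]) for (s, t), q in p_st.items())

w_kelly = doubling_rate(p_t)  # proportional scheme b*(T) = P(T)
w_uniform = doubling_rate({t: 1 / len(horses) for t in horses})
kelly_side = {s: {t: p_st[s, t] / p_s[s] for t in horses} for s in messages}
w_side = doubling_rate_side(kelly_side)  # conditional scheme b*(T|s) = P(T|s)

mutual_info = sum(q * log2(q / (p_s[s] * p_t[t])) for (s, t), q in p_st.items())
delta_w = w_side - w_kelly
print(f"W*(T)={w_kelly:.4f}  W(uniform)={w_uniform:.4f}  "
      f"dW={delta_w:.4f}  I(S;T)={mutual_info:.4f}")
```

With fair odds the proportional scheme has a doubling rate of zero, the uniform (non-Kelly) scheme has a negative doubling rate, and the gain from the private wire coincides with $I(S;T)$ up to floating-point error.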
This important result due to Kelly [39] equates the increase in the doubling rate $\Delta W$ due to the presence of side information with the mutual information between the private wire $S$ and the horse race $T$. If, on average, the gambler receives 1 bit of information from their private wire, then on average the gambler can expect to double their money per race. Furthermore, as one would expect, independent side information does not increase the doubling rate. With no side information, the Kelly gambler always received their original stake back from the bookmaker. However, this is not true for the Kelly gambler with side information. Although their doubling rate is greater than or equal to that of the gambler with no side information, this is only true on average. Before the race $T_k$, the gambler receives the private wire message $s_k$; then, the horse $t_k$ wins the race. From (A6), one can see that the pointwise doubling rate $\Delta w_k$ for the $k$-th race is given by the pointwise mutual information
$$\Delta w_k = i(s_k; t_k). \quad \text{(A7)}$$
Hence, just like the pointwise mutual information, the pointwise doubling rate can be positive or negative: if it is positive, the gambler will make a profit; if it is negative, the gambler will sustain a loss. Despite the potential for pointwise losses, the average pointwise return (i.e. the doubling rate) is, just like the average mutual information, non-negative, and indeed is optimal. Furthermore, while a Kelly gambler with side information can lose money on any single race, they can never actually go bust. The Kelly gambler with side information $s_k$ still hedges their risk by placing bets on all horses with a non-zero probability of winning according to their side information, i.e. according to $P(T|s_k)$. The only reason they would fail to place a bet on a horse is if their side information completely precludes any possibility of that horse winning. In other words, a Kelly gambler with side information will never fall foul of gambler's ruin.
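The sign behaviour of the pointwise doubling rate can be seen in a small example. The sketch below assumes a hypothetical two-horse race with a noisy private wire that names the winner with probability 0.75; the distribution and variable names are illustrative only.

```python
from math import log2

# Hypothetical binary race: T is uniform and the wire S equals T with probability 0.75.
p_st = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}
p_s = {0: 0.5, 1: 0.5}
p_t = {0: 0.5, 1: 0.5}

pointwise = {}
for (s, t), q in p_st.items():
    bet = q / p_s[s]        # Kelly stake on t given s, b*(t|s) = p(t|s)
    payout = bet / p_t[t]   # fair odds o(t) = 1/p(t)
    pointwise[s, t] = (log2(payout), payout)  # (pointwise doubling rate, wealth factor)

average = sum(p_st[e] * dw for e, (dw, _) in pointwise.items())
for (s, t), (dw, payout) in sorted(pointwise.items()):
    print(f"s={s} t={t}: delta_w={dw:+.3f} bit, wealth factor={payout:.2f}")
print(f"average doubling rate = {average:.4f} bit")
```

When the wire is misleading the gambler halves their capital on that race (a pointwise doubling rate of -1 bit), yet the wealth factor is always strictly positive and the average over races remains positive.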

Justification of Axiom 4 and Redundant Information in Tbc
Consider Tbc semantically described in terms of a horse race. That is, consider a four horse race $T$ where each horse has an equiprobable chance of winning, and consider the binary variables $T_1$, $T_2$ and $T_3$ which represent the following, respectively: the colour of the horse, black 0 or white 1; the sex of the jockey, female 0 or male 1; and the colour of the jockey's jersey, red 0 or green 1. Say that the four horses have the following attributes:
Horse 0: is a black horse $T_1 = 0$, ridden by a female jockey $T_2 = 0$, who is wearing a red jersey $T_3 = 0$.
Horse 1: is a black horse $T_1 = 0$, ridden by a male jockey $T_2 = 1$, who is wearing a green jersey $T_3 = 1$.
Horse 2: is a white horse $T_1 = 1$, ridden by a female jockey $T_2 = 0$, who is wearing a green jersey $T_3 = 1$.
Horse 3: is a white horse $T_1 = 1$, ridden by a male jockey $T_2 = 1$, who is wearing a red jersey $T_3 = 0$.
There are two important points to note. Firstly, the horses in the race $T$ could also be uniquely described in terms of the composite binary variables $T_{1,2}$, $T_{1,3}$ or $T_{2,3}$. Secondly, if one knows $T_1$ and $T_2$, then one knows $T_3$ (which can be represented by the relationship $T_3 = T_1 \text{ XOR } T_2$). Finally, consider private wires $S_1$ and $S_2$ which independently provide the colour of the horse and the sex of the jockey (respectively) before the upcoming race, i.e. $S_1 = T_1$ and $S_2 = T_2$.

Now say a bookmaker offers fair odds of 4-for-1 on each horse in the race $T$. Consider two gamblers who each have access to one of $S_1$ and $S_2$. Before each race, the two gamblers receive their respective private wire messages and place their bets according to the Kelly strategy. This means that each gambler lays half of their stake of, say, \$1 on each of their two respective non-excluded horses: unknowingly, both of the gamblers have placed a bet on the soon-to-be race winner, and each gambler has placed a distinct bet on one of the two soon-to-be losers. The only horse neither has bet upon is also a soon-to-be loser. After the race, the bookmaker pays out \$2 to each gambler, and hence both have doubled their money. This is because both of the gamblers had 1 bit of information about the race, i.e. 1 bit of pointwise mutual information. In particular, both gamblers improved their probability of predicting the eventual race winner. It did not matter, in any way, that the gamblers had each laid distinct bets on one of the three eventual race losers. The fact that they laid different bets on the horses which did not win made no difference to their winnings. The apportionment of the exclusions across the set of events which did not occur makes no difference to the pointwise mutual information. With respect to what occurred (i.e. with respect to which horse won), the fact that they excluded different losers is only semantic. When it came to predicting the would-be winner, both gamblers had the same predictive power; they both had the same freedom of choice with regards to selecting what would turn out to be the eventual race winner, i.e. they had the same information. It is for this reason that this information should be regarded as redundant information, regardless of the independence of the information sources. Hence the introduction of both the operational interpretation of redundancy in Section III C and Axiom 4 in Section IV B.

Now consider a third gambler who has access to both private wires $S_1$ and $S_2$, i.e. $S_{1,2}$. Before the race, this gambler receives both private wire messages which, in total, preclude three of the horses from winning. This gambler then places the entirety of their stake of, say, \$1 on the remaining horse, which is sure to win. After the race, the bookmaker pays out \$4, and hence this gambler has quadrupled their money because they had 2 bit of information about the race, i.e. 2 bit of pointwise mutual information. Having both private wire messages simultaneously gave this gambler a 1 bit informational edge over the two gamblers with a single side wire. While each of the singleton gamblers had 1 bit of independent information, the only way one could profit from the independence of this information is by having both pieces of information simultaneously; this makes this 1 bit of information complementary. Although this may seem "palpably strange" [12, p. 167] at first, it is not so weird when one notes the following: the only way to exploit two pieces of independent information is by having both pieces together simultaneously.
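The three gamblers' bets and payouts can be tabulated directly. The following is a small sketch using the horse attributes listed above; the function names are illustrative, and the Kelly bet is taken to be an even split over the non-excluded horses because the horses are equiprobable.

```python
from fractions import Fraction
from math import log2

# Attributes (T1, T2, T3) of horses 0..3, as listed in the text.
HORSES = {0: (0, 0, 0), 1: (0, 1, 1), 2: (1, 0, 1), 3: (1, 1, 0)}
ODDS = 4  # fair 4-for-1 odds on four equiprobable horses

def kelly_bets(known):
    """Split a unit stake evenly over the horses not excluded by `known`,
    a dict mapping attribute index (0 for T1, 1 for T2) to its wired value."""
    live = [h for h, attrs in HORSES.items()
            if all(attrs[i] == v for i, v in known.items())]
    return {h: Fraction(1, len(live)) for h in live}

def payout(bets, winner):
    return bets.get(winner, Fraction(0)) * ODDS

winner = 0
t1, t2, _ = HORSES[winner]
gamblers = {
    "S1": kelly_bets({0: t1}),            # wire reports the horse colour
    "S2": kelly_bets({1: t2}),            # wire reports the jockey's sex
    "S1,2": kelly_bets({0: t1, 1: t2}),   # both wires
}
returns = {name: payout(bets, winner) for name, bets in gamblers.items()}
bits = {name: log2(r) for name, r in returns.items()}
for name in gamblers:
    print(name, sorted(gamblers[name]), returns[name], bits[name])
```

The two single-wire gamblers back different losers (horse 1 and horse 2) but the same winner, and both double their stake, while the gambler holding both wires quadruples it.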

Accumulator Betting and the Target Chain Rule
Say that in addition to the 4-for-1 odds offered on the race $T$, the bookmaker also offers fair odds of 2-for-1 on each of the binary variables $T_1$, $T_2$ and $T_3$. Now, in addition to being able to directly gamble on the race $T$, one could indirectly gamble on $T$ by taking out so-called accumulator bets on any pair of $T_1$, $T_2$ and $T_3$. An accumulator is a series of chained bets whereby any return from one bet is automatically staked on the next bet: if any bet in the chain is lost, the entire chain is lost. For example, one could place a 4-for-1 bet on horse 0 by taking out the following accumulator: a 2-for-1 bet on a black horse winning, which chains into a 2-for-1 bet on the winning jockey being female (or equivalently, vice versa). In effect, these accumulators enable one to bet on $T$ by instead placing a chained bet on the independent component variables within the (equivalent) joint variables $T_{1,2}$, $T_{1,3}$ and $T_{2,3}$. Now consider again the three gamblers from the prior section, i.e. the two gamblers who each have a private wire $S_1$ and $S_2$, and the third gambler who has access to $S_{1,2}$. Say that they must each place an accumulator bet of, say, \$1 on $T_{1,3}$. What should each gambler do according to the Kelly criterion?
For the sake of clarity, consider only the realisation where the horse $T = 0$ subsequently wins (due to the symmetry, the analysis is equivalent for all realisations). First consider the accumulator whereby the gamblers first bet on the colour of the winning horse $T_1$, which chains into a bet on the colour of the winning jockey's jersey $T_3$. Suppose that the private wire $S_1$ communicates that the winning horse will be black, while the private wire $S_2$ communicates that the winning horse will be ridden by a female jockey, i.e. $S_1 = 0$ and $S_2 = 0$. Following the Kelly strategy, the gambler with access to $S_1 = 0$ takes out two \$0.5 accumulator bets. Both of these accumulators feature the same initial bet on the winning horse being black, since $T_1 = S_1 = 0$. Hence both bets return \$1, which becomes the stake on the next bet in each accumulator. This gambler knows nothing about the colour of the jockey's jersey $T_3$; as such, one accumulator chains into a bet on the winning jersey being red $T_3 = 0$, while the other chains into a bet on it being green $T_3 = 1$. When the horse $T = 0$ wins, the stake bet on the green jersey is lost, while the bet on the red jersey pays out \$2. This gambler doubled their money and had 1 bit of side information. Now consider the gambler with private wire $S_2$, who knows nothing about $T_1$ or $T_3$ individually. However, this gambler knows that the winner must be ridden by a female jockey $T_2 = 0$, and hence this gambler knows that if a black horse $T_1 = 0$ wins, then its jockey must be wearing red $T_3 = 0$; or if a white horse $T_1 = 1$ wins, then its jockey must be wearing green $T_3 = 1$ (i.e. since $T_3 = T_1 \text{ XOR } T_2$). As such, this gambler can utilise the Kelly strategy to place the following two \$0.5 accumulator bets: the first accumulator bets on the winning horse being black $T_1 = 0$, and then chains into a bet on the winner's jersey being red $T_3 = 0$; whereas the second accumulator bets on the winning horse being white $T_1 = 1$, and then chains into a bet on the winner's jersey being green $T_3 = 1$. When the horse $T = 0$ wins, the first accumulator will pay out \$2, while the second accumulator will be lost. Hence, this gambler can also double their money and so also had 1 bit of side information. Finally, consider the gambler with access to both private wires $S_{1,2}$, who can place an accumulator on the black horse $T_1 = 0$ winning, chaining into a bet on the winning jockey wearing red $T_3 = 0$. This gambler can quadruple their stake, and so must possess 2 bit of side information.
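The accumulator strategies just described can be traced leg by leg. The sketch below is illustrative only: the tuple encoding of an accumulator and the function name `run_accumulators` are assumptions made here, and the realisation is fixed to horse 0 winning ($T_1 = 0$, $T_3 = 0$).

```python
from fractions import Fraction

def run_accumulators(accumulators, t1, t3):
    """Each accumulator is (stake, bet_on_T1, bet_on_T3) at fair 2-for-1 odds per leg.
    Returns the total capital after the first leg and after the second leg."""
    after_first = after_second = Fraction(0)
    for stake, bet1, bet3 in accumulators:
        leg1 = stake * 2 if bet1 == t1 else Fraction(0)   # lost chains carry nothing forward
        leg2 = leg1 * 2 if bet3 == t3 else Fraction(0)
        after_first += leg1
        after_second += leg2
    return after_first, after_second

half = Fraction(1, 2)
strategies = {
    "S1": [(half, 0, 0), (half, 0, 1)],   # knows T1 = 0, hedges over T3
    "S2": [(half, 0, 0), (half, 1, 1)],   # knows T2 = 0, hence T3 = T1
    "S1,2": [(Fraction(1), 0, 0)],        # knows both T1 = 0 and T3 = 0
}
paths = {name: run_accumulators(acc, t1=0, t3=0) for name, acc in strategies.items()}
for name, (first, second) in paths.items():
    print(f"{name}: capital after T1 leg = {first}, after T3 leg = {second}")
```

The final returns match those of betting directly on $T$, but the intermediate capital differs: the gambler with $S_1$ doubles on the first leg and gains nothing on the second, whereas the gambler with $S_2$ gains nothing on the first leg and doubles on the second.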
Each of the three gamblers has the same final return regardless of whether the gamblers are betting on the variable $T$, or placing accumulator bets on the variables $T_{1,2}$, $T_{1,3}$ or $T_{2,3}$. However, the paths to the final result differ between the gamblers, reflecting the difference between the information that each gambler had about the sub-variables $T_1$, $T_2$ or $T_3$. Given the result of Kelly, the proposed information decomposition should reflect these differences, yet still arrive at the same result; in other words, the information decomposition should satisfy a target chain rule. This is clear if the Kelly interpretation of information is to remain as a "duality" [32, p. 159] in information theory.

Appendix B: Supporting Proofs and Further Details
This appendix contains many of the important theorems and proofs relating to PPID using specificity and ambiguity.
1. Deriving the Specificity and Ambiguity Lattices from Axioms 1-4
The following section is based directly on the original work of Williams and Beer [1,2]. The key difference is that source events $a_i$ are now used in place of sources $A_i$.
Proposition 1. Both $i^+_\cap$ and $i^-_\cap$ are non-negative.
Proof. Since $\emptyset \subseteq a_i$ for any $a_i$, Axioms 2 and 3 imply
$$i^\pm_\cap(a_1, \ldots, a_k \to t) \geq i^\pm_\cap(a_1, \ldots, a_k, \emptyset \to t) = i^\pm_\cap(\emptyset \to t) = 0.$$
Hence, both $i^+_\cap(a_1, \ldots, a_k \to t)$ and $i^-_\cap(a_1, \ldots, a_k \to t)$ are non-negative.
Proposition 2. Both $i^+_\cap$ and $i^-_\cap$ are bounded from above by the specificity and the ambiguity from any single source event, respectively.
Proof. For any single source event $a_i$, Axioms 2 and 3 yield
$$i^\pm_\cap(a_1, \ldots, a_k \to t) \leq i^\pm_\cap(a_i \to t) = i^\pm(a_i \to t),$$
as required.
In keeping with Williams and Beer's approach [1,2], consider all of the distinct ways in which a collection of source events $\mathbf{a} = \{a_1, \ldots, a_k\}$ could contribute redundant information. Thus far, it has been assumed that the redundancy measure can be applied to any collection of source events, i.e. $\mathcal{P}_1(\mathbf{a})$, where $\mathcal{P}_1$ denotes the power set with the empty set removed. Recall that the source events are themselves collections of predictor events, i.e. $\mathcal{P}_1(\mathbf{s})$. That is, the natural domain of both $i^+_\cap$ and $i^-_\cap$ has been assumed to be $\mathcal{P}_1\big(\mathcal{P}_1(\mathbf{s})\big)$. However, this can be greatly reduced using Axiom 2, which states that if $a_i \subseteq a_j$, then
$$i^\pm_\cap(\ldots, a_i, \ldots, a_j, \ldots \to t) = i^\pm_\cap(\ldots, a_i, \ldots \to t).$$
Hence, one need only consider the collections of source events such that no source event is a superset of any other,
$$\mathcal{A}(\mathbf{s}) = \big\{ \alpha \in \mathcal{P}_1\big(\mathcal{P}_1(\mathbf{s})\big) : \forall\, a_i, a_j \in \alpha,\ a_i \not\subset a_j \big\}.$$
This collection $\mathcal{A}(\mathbf{s})$ captures all the distinct ways in which the source events could provide redundant information.
As per Williams and Beer's PID, this set of source events $\mathcal{A}(\mathbf{s})$ is structured. Consider two sets of source events $\alpha, \beta \in \mathcal{A}(\mathbf{s})$. If for every source event $b \in \beta$ there exists a source event $a \in \alpha$ such that $a \subseteq b$, then all of the redundant specificity and ambiguity shared by $b \in \beta$ must include any redundant specificity and ambiguity shared by $a \in \alpha$. Hence, a partial order $\preceq$ can be defined over the elements of the domain $\mathcal{A}(\mathbf{s})$ such that any collection of predictor event coalitions precedes another if and only if the latter provides any information the former provides,
$$\forall\, \alpha, \beta \in \mathcal{A}(\mathbf{s}), \quad \alpha \preceq \beta \iff \forall\, b \in \beta,\ \exists\, a \in \alpha \text{ such that } a \subseteq b.$$
Applying this partial ordering to the elements of the domain $\mathcal{A}(\mathbf{s})$ produces a lattice which has the same structure as the redundancy lattice from PID, i.e. the structure of the source events here is the same as the structure of the sources in PID. (Figure 3 depicts this structure for the case of 2 and 3 predictor variables.) Finally, applying $i^+_\cap$ to these source events yields a specificity lattice, while applying $i^-_\cap$ yields an ambiguity lattice. Similar to $I_\cap$ in PID, the redundancy measures $i^+_\cap$ and $i^-_\cap$ can be thought of as cumulative information functions which integrate the specificity or ambiguity uniquely contributed by each node as one moves up each lattice. In order to evaluate the unique contribution of specificity and ambiguity from each node in the lattice, consider the Möbius inverse [43,44] of $i^+_\cap$ and $i^-_\cap$. That is, the specificity and ambiguity of a node $\alpha$ is given by
$$i^\pm_\cap(\alpha \to t) = \sum_{\beta \preceq \alpha} i^\pm_\partial(\beta \to t).$$
Thus, the unique contributions of partial specificity $i^+_\partial$ and partial ambiguity $i^-_\partial$ from each node can be calculated recursively from the bottom up, i.e.
$$i^\pm_\partial(\alpha \to t) = i^\pm_\cap(\alpha \to t) - \sum_{\beta \prec \alpha} i^\pm_\partial(\beta \to t).$$
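The construction of the domain, the partial order and the bottom-up recursion can be sketched in a few lines. In the example below, the function names are illustrative, the cumulative function is taken to be a minimum over source events (in the spirit of the minimum-based measures used later in this appendix), and the values in `h` are hypothetical numbers in bits rather than quantities computed from a distribution.

```python
from itertools import combinations

def nonempty_subsets(items):
    items = list(items)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def lattice_nodes(predictors):
    """Collections of source events in which no source is a proper superset of another."""
    sources = nonempty_subsets(predictors)
    return [alpha for alpha in nonempty_subsets(sources)
            if not any(a < b for a in alpha for b in alpha)]

def precedes(alpha, beta):
    """alpha precedes beta iff every b in beta contains some a in alpha."""
    return all(any(a <= b for a in alpha) for b in beta)

def partial_atoms(nodes, cumulative):
    """Bottom-up Moebius inversion: atom(alpha) is cumulative(alpha) minus
    the atoms of all nodes strictly below alpha."""
    # Sorting by down-set size guarantees lower nodes are processed first.
    order = sorted(nodes, key=lambda n: sum(precedes(m, n) for m in nodes))
    atoms = {}
    for alpha in order:
        below = [beta for beta in atoms if precedes(beta, alpha)]
        atoms[alpha] = cumulative(alpha) - sum(atoms[beta] for beta in below)
    return atoms

# Hypothetical values (in bits) for the source events {1}, {2} and {1,2}.
h = {frozenset({1}): 1, frozenset({2}): 2, frozenset({1, 2}): 3}
nodes = lattice_nodes([1, 2])
atoms = partial_atoms(nodes, lambda alpha: min(h[a] for a in alpha))
for alpha, value in atoms.items():
    print(sorted(sorted(a) for a in alpha), value)
```

For two predictors this yields the four nodes {1}{2}, {1}, {2} and {12}, and the atoms sum to the value assigned to the joint source event, as the Möbius inversion requires.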
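Theorem 7. Based on the principle of inclusion-exclusion, we have the following closed-form expression for the partial specificity and partial ambiguity,
$$i^\pm_\partial(\alpha \to t) = i^\pm_\cap(\alpha \to t) - \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{B \subseteq \alpha^- \\ |B| = k}} i^\pm_\cap\Big(\bigwedge B \to t\Big).$$
Proof. The partial term at $\alpha$ is obtained by subtracting from $i^\pm_\cap(\alpha \to t)$ the partial terms of all nodes strictly below $\alpha$, i.e. of all nodes in $\bigcup_{\beta \in \alpha^-} \downarrow\! \beta$, where $\alpha^-$ denotes the set of nodes covered by $\alpha$. By the principle of inclusion-exclusion (e.g. see [44, p. 195]) we get that
$$i^\pm_\partial(\alpha \to t) = i^\pm_\cap(\alpha \to t) - \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{B \subseteq \alpha^- \\ |B| = k}} \ \sum_{\gamma \in \bigcap_{\beta \in B} \downarrow \beta} i^\pm_\partial(\gamma \to t).$$
For any lattice $L$ and $A \subseteq L$, we have that $\bigcap_{a \in A} \downarrow\! a = \downarrow\! \big(\bigwedge A\big)$ (see [45, p. 57]), thus
$$i^\pm_\partial(\alpha \to t) = i^\pm_\cap(\alpha \to t) - \sum_{k=1}^{|\alpha^-|} (-1)^{k-1} \sum_{\substack{B \subseteq \alpha^- \\ |B| = k}} i^\pm_\cap\Big(\bigwedge B \to t\Big),$$
as required.
Similarly to PID, the specificity and ambiguity lattices provide a structure for information decomposition; unique evaluation requires a separate definition of redundancy. However, unlike PID (or even PPID), this evaluation requires both a definition of pointwise redundant specificity and pointwise redundant ambiguity.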

Redundancy Measures on the Lattices
In Section IV B, Definitions 1 and 2 provided the required measures. This section will prove some of the key properties of these measures when they are applied to the lattices derived in the previous section. The correspondence with the approach taken by Williams and Beer [1,2] continues in this section. However, source events $a_i$ are used in place of sources $A_i$, and the measures $r^\pm_{\min}$ are used in place of $I_{\min}$. Note that the basic concepts from lattice theory, and the notation used here, are as per [1, Appendix B].
Theorem 1. The definitions of $r^+_{\min}$ and $r^-_{\min}$ satisfy Axioms 1-4.
Proof. Axioms 1, 3 and 4 follow trivially from the basic properties of the minimum. The main statement of Axiom 2 also immediately follows from the properties of the minimum; however, there is a need to verify the equality condition. As such, consider $a_k$ such that $a_k \supseteq a_i$ for some $a_i \in \{a_1, \ldots, a_{k-1}\}$. From Postulate 3, we have that $h(a_k) \geq h(a_i)$ and hence that $\min_{a_j \in \{a_1, \ldots, a_k\}} h(a_j) = \min_{a_j \in \{a_1, \ldots, a_{k-1}\}} h(a_j)$, as required for $r^+_{\min}$. Mutatis mutandis, the same follows for $r^-_{\min}$.
Theorem 2. The redundancy measures $r^+_{\min}$ and $r^-_{\min}$ increase monotonically on the lattice $\langle \mathcal{A}(\mathbf{s}), \preceq \rangle$.
The proof of this theorem will require the following Lemma.
Lemma 1. The specificity and ambiguity $i^\pm(a \to t)$ are increasing functions on the lattice $\langle \mathcal{P}_1(\mathbf{s}), \subseteq \rangle$.
Proof. Follows trivially from Postulate 3.
Proof of Theorem 2. Assume there exist $\alpha, \beta \in \mathcal{A}(\mathbf{s})$ such that $\alpha \prec \beta$ and $r^\pm_{\min}(\beta \to t) < r^\pm_{\min}(\alpha \to t)$. By definition, i.e. (23) and (24), there exists $b \in \beta$ such that $i^\pm(b \to t) < i^\pm(a \to t)$ for all $a \in \alpha$. Hence, by Lemma 1, there does not exist $a \in \alpha$ such that $a \subseteq b$. However, by assumption $\alpha \prec \beta$, and hence there exists $a \in \alpha$ such that $a \subseteq b$, which is a contradiction.
Theorem 8. When using $r^\pm_{\min}$ in place of the general redundancy measures $i^\pm_\cap$, we have the following closed-form expression for the partial specificity $\pi^+$ and partial ambiguity $\pi^-$,
$$\pi^\pm(\alpha \to t) = r^\pm_{\min}(\alpha \to t) - \max_{\beta \in \alpha^-} \min_{b \in \beta} i^\pm(b \to t).$$
Theorem 3. The atoms of partial specificity $\pi^+$ and partial ambiguity $\pi^-$ evaluated using the measures $r^+_{\min}$ and $r^-_{\min}$ on the specificity and ambiguity lattices (respectively) are non-negative.
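Proof. If $\alpha = \bot$, then $\pi^\pm(\alpha \to t) = r^\pm_{\min}(\alpha \to t) \geq 0$ by the non-negativity of entropy. If $\alpha \neq \bot$, assume there exists $\alpha \in \mathcal{A}(\mathbf{s}) \setminus \{\bot\}$ such that $\pi^\pm(\alpha \to t) < 0$. By Theorem 8,
$$\pi^\pm(\alpha \to t) = \min_{a \in \alpha} i^\pm(a \to t) - \max_{\beta \in \alpha^-} \min_{b \in \beta} i^\pm(b \to t) < 0. \quad \text{(B18)}$$
From this it can be seen that there must exist $\beta \in \alpha^-$ such that for all $b \in \beta$, we have that $i^\pm(a \to t) < i^\pm(b \to t)$ for some $a \in \alpha$. By Postulate 3 there does not exist $b \in \beta$ such that $b \subset a$. However, since by definition $\beta \prec \alpha$, there exists $b \in \beta$ such that $b \subset a$, which is a contradiction.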
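The non-negativity of the partial atoms can be checked numerically in the two-predictor case. The sketch below rests on several assumptions that are stated here rather than taken from this appendix: the specificity of a source event is computed as $-\log p(a)$ and its ambiguity as $-\log p(a|t)$, the redundancy is the minimum over source events, and the joint distribution `p` is hypothetical. On the bivariate lattice the max-min closed form reduces to the four expressions in `bivariate_atoms`.

```python
from math import log2

# Hypothetical joint distribution p(s1, s2, t), chosen only for illustration.
p = {(0, 0, 0): 0.4, (0, 1, 0): 0.1, (1, 0, 1): 0.1, (1, 1, 1): 0.3, (0, 1, 1): 0.1}
NAMES = ("s1", "s2", "t")

def prob(**fixed):
    """Marginal probability of the event fixing some of s1, s2 and t."""
    return sum(q for e, q in p.items()
               if all(e[NAMES.index(k)] == v for k, v in fixed.items()))

def bivariate_atoms(i1, i2, i12):
    """Max-min closed form on the two-predictor lattice with a minimum-based redundancy."""
    return {"{1}{2}": min(i1, i2),
            "{1}": i1 - min(i1, i2),
            "{2}": i2 - min(i1, i2),
            "{12}": i12 - max(i1, i2)}

results = {}
for (s1, s2, t) in p:
    pt = prob(t=t)
    spec = [-log2(prob(s1=s1)), -log2(prob(s2=s2)), -log2(prob(s1=s1, s2=s2))]
    amb = [-log2(prob(s1=s1, t=t) / pt), -log2(prob(s2=s2, t=t) / pt),
           -log2(prob(s1=s1, s2=s2, t=t) / pt)]
    pmi = spec[2] - amb[2]  # pointwise mutual information i(s1, s2; t)
    results[s1, s2, t] = (bivariate_atoms(*spec), bivariate_atoms(*amb), pmi)

tol = 1e-9
all_nonnegative = all(v >= -tol for plus, minus, _ in results.values()
                      for v in list(plus.values()) + list(minus.values()))
recombines = all(abs(sum(plus.values()) - sum(minus.values()) - pmi) < tol
                 for plus, minus, pmi in results.values())
print(all_nonnegative, recombines)
```

For every realisation, both sets of atoms are non-negative, and subtracting the total partial ambiguity from the total partial specificity recovers the joint pointwise mutual information.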
Theorem 4. The atoms of partial average information $\Pi$ evaluated by recombining and averaging $\pi^\pm$ are not non-negative.
Proof. The proof is by the counter-example using RdnErr.
The relationship between the regular forms and the conditional forms of the redundant specificity and redundant ambiguity has some important consequences.
Proof. By (24) and (B20).
Note that specificity itself is not a function of the target event or variable. Hence, all of the target dependency is bound up in the ambiguity. Now consider the following.
Theorem 6. The target chain rule, identity property and local positivity cannot be simultaneously satisfied.
Furthermore, the above theorem can be informally generalised as follows: it is not possible to simultaneously satisfy the target chain rule, the identity property, and have only $C(S_1, S_2 \to T) = 1$ bit in the probability distribution Xor without having negative (average) PI atoms in probability distributions where there is no ambiguity from any source. To see this, again consider decomposing the isomorphic probability distributions $P(T_{1,2})$ and $P(T_{1,3})$. In line with (B27), decomposing $T_{1,2}$ via the identity property yields $C(S_1, S_2 \to T_{1,2}) = 0$ bit. On the other hand, decomposing $T_3$ yields $C(S_1, S_2 \to T_3) = 1$ bit. Since $P(T_{1,2})$ is isomorphic to $P(T_{1,3})$, the target chain rule requires that
$$C(S_1, S_2 \to T_{1,3}) = C(S_1, S_2 \to T_3) + C(S_1, S_2 \to T_1|T_3) = 0 \text{ bit}.$$
That is, one would have to accept the negative (average) PI atom $C(S_1, S_2 \to T_1|T_3) = -1$ bit despite the fact that there are no non-zero pointwise ambiguity terms upon splitting any of $i(s_1; t_1|t_3)$, $i(s_2; t_1|t_3)$ and $i(s_{1,2}; t_1|t_3)$ into specificity and ambiguity. Although this does not constitute a formal proof that the identity property is incompatible with the target chain rule, one would have to accept and find a way to justify $C(S_1, S_2 \to T_1|T_3) = -1$ bit. Since there is no ambiguity in $i(s_1; t_1|t_3)$, $i(s_2; t_1|t_3)$ and $i(s_{1,2}; t_1|t_3)$, this result is not reconcilable within the framework of specificity and ambiguity.

Property 4. The decomposition does not satisfy the target monotonicity property.
Property 5. The decomposition satisfies the target chain rule.
Appendix A: Kelly Gambling, Axiom 4, and Tbc