# Communications for Control in Cyber Physical Systems


## SEQUENTIAL ESTIMATION

For the linear dynamics in Eq. (5.4), system state estimation can be considered as a lossy source coding problem, since the sensor encodes the observations and the decoder output always has errors except in trivial cases. Lossy source coding is traditionally studied using rate-distortion theory, a branch of information theory that characterizes the minimum communication rate needed to keep the distortion at the decoder output below a given bound.

However, the information-theoretic argument is based on the assumption of infinite codeword length and thus infinite delays, which is reasonable for traditional data communications (e.g., one packet encoded as a codeword may have thousands of bits). Obviously such arguments are not suitable for CPSs, since the encoding procedure must have a limited delay. Hence, in contrast to the block structure of traditional data communications, the encoding procedure in a CPS should have a sequential structure, as illustrated in Fig. 5.13. Before we study sequential rate distortion, we provide a review of traditional rate-distortion theory. When the true information symbol is x and the recovered one is y, the discrepancy between them is measured by a distortion function d(x, y).

For a certain information source generating symbols {X(t)}, the rate-distortion function is defined as

R(D) = inf_{Q_{Y|X} : D_Q ≤ D} I(X; Y),

where QY|X is the conditional probability of generating the recovered symbol Y given the true information symbol X, DQ is the distortion incurred by the conditional distribution QY|X, and I is the mutual information. Intuitively, the rate-distortion function gives the minimum average number of bits needed to achieve a distortion less than or equal to D. Meanwhile, the rate-distortion function can also determine whether a communication channel with a certain capacity can support the transmission of an information source within a given distortion bound. Roughly speaking, when transmitting the information of a Gauss-Markov process through a communication channel having capacity C, distortion D can be achieved if R(D) < C, while it cannot be achieved if R(D) > C. This dichotomy is illustrated in Fig. 5.14. Similarly to the traditional definition of the rate-distortion function with infinite delay, Ref. defined the sequential rate-distortion function as follows:
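To make the R(D)-versus-C dichotomy concrete, consider the classical example of an i.i.d. Gaussian source under squared-error distortion, for which R(D) = max{(1/2) log2(σ²/D), 0}. A minimal Python sketch (the function name is ours):

```python
import math

def gaussian_rd(sigma2: float, D: float) -> float:
    """Rate-distortion function (bits/symbol) of an i.i.d. Gaussian source
    with variance sigma2 under squared-error distortion."""
    if D >= sigma2:
        return 0.0  # zero rate suffices: just reproduce the mean
    return 0.5 * math.log2(sigma2 / D)

# Distortion D = 0.25 for a unit-variance source needs 1 bit/symbol,
# so it can be supported by any channel with capacity C > 1 bit/use.
print(gaussian_rd(1.0, 0.25))   # 1.0
```

The returned rate is then compared against the channel capacity C to decide whether distortion D is achievable.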

We notice that the sequential rate-distortion function has the following features:
• The necessary rate may be dependent on the time T.
• Only the causal information is considered in the mutual information, which is due to the requirement of sequential decoding.
• The expected distortion is no larger than D for each time slot, instead of being averaged over all time slots.

The following theorem shows the application of the sequential rate-distortion function in the lossy transmission of information sources. Note that a sufficient condition is much more difficult to find.

Theorem 9. Consider a memoryless communication channel with channel capacity C. A necessary condition for achieving the distortion D causally is R_T^s(D) ≤ C for all T.


## Gauss-Markov source

Since the sequential rate-distortion function provides a lower bound for the channel capacity of the communication channel, it is important to calculate the sequential rate-distortion function for various information sources for the purpose of online system state estimation. We follow the argument in Ref. to discuss the sequential rate-distortion function for a d-dimensional Gauss-Markov source, whose dynamics can be written as

X(t + 1) = AX(t) + W(t),

where A is the system matrix and {W(t)} is white Gaussian noise.

We assume a Gaussian communication channel with dimension d. The following lemma shows the necessary property of the infimizing channel in Eq. (5.68) in the definition of the sequential rate-distortion function.

Lemma 3. The infimizing channel for the Gauss-Markov information source in Eq. (5.70), namely P(dY(t)|x(0 : t − 1), y(0 : t − 1)), is a Gaussian channel of the form P(dY(t)|x(t), y(0 : t − 1)).

## Rate distortion

In the context of causal coding, the rate-distortion function rc(D) is defined to be the minimum average transmission rate such that the average distortion is no larger than D. It is shown in Ref.  that rc(D) is determined as follows:

where d(f) denotes the average distortion caused by the coder f. An interesting conclusion in Theorem 3 of Ref. is that the causal rate-distortion functions in the above coder schemes are all the same when the information source is memoryless. However, it is still an open problem to compare the rate-distortion functions when the information source has memory.

## SOURCE CODING IN A CPS: POINT-TO-POINT CASE

Source coding encodes the original information (either continuous or discrete valued) into discrete information bits. The main purpose is to reduce redundancy and thus transmit fewer bits (thus using fewer communication resources). In many existing CPSs, the measurements (e.g., the frequency and phase measurements at phasor measurement units (PMUs) in smart grids) are quantized with certain precisions and then sent to the controller. However, such a straightforward scheme may waste communication resources for the following reasons:

• Ignoring temporal redundancy: The measurements at different times are usually correlated. For example, in power grids, the frequency and phase measurements are continuous in time and do not change very rapidly. Hence successive measurements are similar to each other, and it is not necessary to fully quantize each measurement in an independent manner. A possibly more efficient approach is to quantize increments in the frequency and phase.
• Ignoring spatial redundancy: When there are multiple sensors, the measurements at different sensors are usually correlated, unless the sensors happen to measure orthogonal subspaces. The redundancies among different sensors can be fully utilized to decrease the source coding rate, even though there are no communications among the sensors (and thus no explicit coordination among them).

In this section we assume a single sensor and consider only temporal redundancy. Spatial redundancy will be considered in the next section. In traditional communication systems, the temporal redundancy can be compressed by carrying out joint source coding for all the source information symbols. However, such a source coding scheme incurs a long latency, since the coding cannot be carried out until all the source information symbols have been gathered. A long communication latency is intolerable for real-time feedback control in a CPS. Hence it is of key importance to study real-time source coding, in which the source information symbols are encoded in a sequential manner.
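The idea of exploiting temporal redundancy by quantizing increments rather than absolute values, mentioned above for PMU measurements, can be sketched in a few lines of Python. The step size and the sample signal below are made-up illustrations, not part of any standard:

```python
def quantize(value, step):
    """Uniform quantization to the nearest multiple of `step`."""
    return round(value / step) * step

def delta_encode(samples, step):
    """Sequentially quantize increments instead of absolute values.
    The encoder tracks the decoder-side reconstruction so that
    quantization errors do not accumulate over time."""
    recon, out = 0.0, []
    for x in samples:
        q = quantize(x - recon, step)   # quantized increment
        out.append(q)
        recon += q                      # decoder-side reconstruction
    return out

def delta_decode(increments):
    recon, out = 0.0, []
    for q in increments:
        recon += q
        out.append(recon)
    return out

sig = [50.00, 50.02, 50.05, 50.04, 50.03]   # slowly varying frequency samples
enc = delta_encode(sig, step=0.01)
dec = delta_decode(enc)
```

Since successive samples differ only slightly, the increments occupy a much smaller range than the raw values and so need fewer bits per sample at the same precision.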

## SYSTEM MODEL

Obviously, the output of the decoder is dependent only on the received bits and is thus nonanticipating.

## STRUCTURE OF CAUSAL CODERS

Sliding block causal coders In this case the output of the decoder at time t is given by

y(t) = f(x(0 : t)),

namely the output may depend on all the information symbols observed so far.

If the coder has finite memory, then we have

y(t) = f(x(t − M : t)),

where M is the memory length.

Block causal coders In such coders, the information is encoded and decoded in a block manner, namely

y(kN : (k + 1)N − 1) = f(x(kN : (k + 1)N − 1)),

where k is the index of the block (or codeword) and N is the block length. Obviously, such a coding scheme will cause an average delay of N/2.

Block stationary coders Block stationary coders can be considered as a combination of sliding block causal coders and block causal coders. The output is given by

y(kN : (k + 1)N − 1) = f(x(0 : (k + 1)N − 1)),

where the decoder output is in blocks and the decoding procedure uses all the previously received bits.

Finite-state causal coders A challenge of sliding block causal coders and block stationary coders is that infinite memory is needed to store all the previously received bits. To alleviate this problem while not discarding the history, a state S can be defined for the coders, such that the decoder output is a function of the state and the newest information symbol, namely

y(t) = f(S(t), x(t)),

and the system state S evolves according to the following law:

S(t + 1) = g(S(t), x(t)).

## SHANNON ENTROPY

Information and communications are triggered by the uncertainty of an information source (which can be represented by a random variable or random process). When the random variable is realized and takes a deterministic value, the more uncertain the random variable was, the more information we obtain. For a random variable X with finite alphabet X and probability distribution P, its Shannon entropy is defined as

H(X) = − Σ_{x∈X} P(x) log P(x),

where the unit is the bit (or nat) when the logarithm base is 2 (or e). In the special case of a uniform distribution, the entropy of X is given by log |X|, where |X| is the cardinality of X. For example, if X has eight possible values, then we obtain 3 bits when X is determined. Similarly, for two random variables X and Y, we can also define the joint entropy as

H(X, Y) = − Σ_{x,y} P(x, y) log P(x, y).

## MUTUAL INFORMATION

For two random variables X and Y, the mutual information measures the amount of randomness they share. It is defined as

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).

When I(X; Y) = H(X), X is completely determined by Y; on the other hand, if I(X; Y) = 0, X and Y are mutually independent. As we will see, the mutual information is of key importance in deriving the capacity of communication channels.
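The entropy and mutual information definitions above are easy to check numerically. A short Python sketch using the identity I(X; Y) = H(X) + H(Y) − H(X, Y) (function names are ours):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a joint pmf {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p    # marginal of X
        py[y] = py.get(y, 0.0) + p    # marginal of Y
    return (entropy(px.values()) + entropy(py.values())
            - entropy(joint.values()))

# A uniform distribution over 8 symbols carries log2(8) = 3 bits.
print(entropy([1/8] * 8))                                # 3.0
# Y a perfect copy of X: I(X; Y) = H(X) = 1 bit.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))    # 1.0
# X and Y independent fair bits: I(X; Y) = 0.
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25,
                          (1, 0): 0.25, (1, 1): 0.25}))  # 0.0
```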

## COMMUNICATION CHANNELS

Below we introduce some typical communication channels and their corresponding characteristics:

• Wireless channels: The information is carried by electromagnetic waves in the open space. There are many typical wireless communication systems such as 4G cellular systems and WiFi (IEEE 802.11) systems, which enable mobile communications. However, wireless channels are often unreliable and have relatively small channel capacities.
• Optical fibers: The information can also be sent through optical fibers in wired communication networks. This communication channel is highly reliable and can support very high communication speeds. It is often used as the backbone network in a large area communication network such as the Internet.
• Underwater channels: Underwater acoustic waves can convey information with very low reliability and very low communication rates, which can be used in underwater communications such as submarine or underwater sensor communications.

A communication channel can be modeled using conditional probabilities. Given an input x, the channel output y is random due to random noise, and is characterized by the corresponding conditional probability P(y|x). Several typical models of communication channels are given below, as illustrated in Fig. 2.2:

• Binary symmetric channel: Both the input and output are binary with alphabet {0, 1}. The input is flipped at the output with probability 1 − p. Such a model can be used to describe the errors in demodulation and decoding, where 1 − p is the error probability of each information bit.
• Erasure channel: The input is binary. A new element, the erasure symbol, which denotes the loss of information, is added to the output alphabet besides the bits 0 and 1. Such a model can depict communication systems with possible packet losses, such as the Internet.
• Gaussian channel: Both the input and output of the channel are real numbers. The output is a superposition of the input and Gaussian noise, i.e., y(t) = x(t) + n(t), where n(t) is the Gaussian noise.

As will be seen, the SNR is of key importance in determining the performance of Gaussian channels.

• Gaussian band-limited channel: This channel is continuous in time. The additive noise has power spectral density N0, and the signal is allowed to transmit within a frequency band with bandwidth W.
• Multiple access channel: When there are multiple (say K) transmitters and a single receiver, the transmitted signals will have an effect on each other. For the discrete case, the channel is characterized by the conditional probability P(y|x1, … , xK) of the output given all the inputs.

An important special case is the Gaussian multiple access channel, in which the received signal is given by

y(t) = Σ_{k=1}^{K} x_k(t) + n(t),

where n(t) is Gaussian noise.

## CHANNEL CAPACITY

In his groundbreaking paper, Shannon proposed the concept of channel capacity C (bits/channel use) for communication channels; i.e., if the transmission rate R (also in bits/channel use) is less than C, then there exists an encoding and decoding scheme making the transmission error arbitrarily small; otherwise, the data cannot be reliably transmitted. For the simplest case of discrete memoryless channels, the channel capacity is given by

C = max_{P(X)} I(X; Y).

• The binary symmetric channel has a channel capacity of 1 − H(p), where H(p) = −p log p − (1 − p) log(1 − p) is the Shannon entropy of a binary distribution with probabilities p and 1 − p.
• The erasure channel has a channel capacity of p, where p is the probability that the transmitted bit is not erased.
• The Gaussian channel has the following channel capacity:

C = (1/2) log2(1 + Ps/σ²), with σ² being the noise variance,

where Ps is the maximum average power, if the constraint is the average power (instead of the peak power).

• The band-limited Gaussian channel has the following channel capacity:

C = W log2(1 + Ps/(N0 W)),

if the constraint is on the average power.

• Multiple access channel: When there are K transmitters, the communication capability is described by the capacity region Ω, within which each point (R1, … , RK) is achievable. For the Gaussian multiple access channel, the capacity region consists of the rate tuples satisfying

Σ_{k∈S} Rk ≤ (1/2) log2(1 + Σ_{k∈S} Pk/σ²) for every subset S ⊆ {1, … , K},

where Pk is the transmit power of transmitter k and σ² is the noise variance.
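The capacity formulas above are easy to evaluate numerically. A small Python sketch (function names are ours; the Gaussian formulas assume the average-power constraint, as in the text):

```python
import math

def h2(p):
    """Binary entropy function H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p_error):
    """Binary symmetric channel with crossover probability p_error."""
    return 1.0 - h2(p_error)

def erasure_capacity(p_keep):
    """Erasure channel; p_keep is the probability a bit is not erased."""
    return p_keep

def gaussian_capacity(snr):
    """Discrete-time Gaussian channel; snr = Ps / noise variance."""
    return 0.5 * math.log2(1 + snr)

def bandlimited_capacity(W, Ps, N0):
    """Band-limited Gaussian channel; capacity in bits per second."""
    return W * math.log2(1 + Ps / (N0 * W))

print(bsc_capacity(0.5))       # 0.0: pure noise, nothing gets through
print(gaussian_capacity(3.0))  # 1.0 bit per channel use
```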

Although the concept of channel capacity has been widely used in the practical design of communication systems, the following points need to be noted in the context of communications in a CPS.

• The channel capacity defined in traditional information theory is based on asymptotic analysis and infinite codeword length. However, in the context of a CPS, due to the real-time requirement of controlling the dynamics, an infinite codeword length, which also implies an infinite communication delay, cannot be tolerated.

• The achievability of channel capacity is usually based on the argument of random coding, which is of theoretical importance but is not practical. Although for simple binary channels low-density parity-check (LDPC) codes almost attain the channel capacity, practical capacity-achieving coding schemes have not been obtained for many typical channel models.


## SOURCE CODING

Source coding means converting from the original information, which could be either discrete or continuous, to a series of information bits. It can be categorized into lossless source coding, in which no information is lost during the source coding, and lossy source coding, in which a portion of the information is lost. For simplicity, we assume that the information symbols are discrete-time and i.i.d.

## LOSSLESS SOURCE CODING

The information source of lossless source coding, which is represented by a random variable X, must have a finite alphabet. Denote by X = {x1, … , xn} the alphabet of the information source symbols. It is assumed that the distribution is {P(X = xj)}j. One simple approach is to use log2 n bits to represent these n symbols. However, we can do better and use a variable-length code, in which the information symbols may be represented by different numbers of bits. To achieve a lower average code length, we should assign fewer bits to more frequent symbols. It has been shown that the minimum average number of bits is given by H(P) (i.e., the entropy of X) and can be achieved by properly assigning the bits to the symbols with known probabilities.
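A standard way to assign fewer bits to more frequent symbols is Huffman coding, which repeatedly merges the two least probable symbols; the sketch below computes only the resulting codeword lengths, which suffice to check the average length against the entropy:

```python
import heapq

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given probabilities."""
    heap = [(p, [i]) for i, p in enumerate(probs)]
    lengths = [0] * len(probs)
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, ids1 = heapq.heappop(heap)   # two least probable groups
        p2, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:
            lengths[i] += 1              # merged symbols gain one bit
        heapq.heappush(heap, (p1 + p2, ids1 + ids2))
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_lengths(probs)
avg = sum(p * l for p, l in zip(probs, lengths))
# For dyadic probabilities the average length equals the entropy exactly:
print(lengths, avg)   # lengths [1, 2, 3, 3], average 1.75 bits
```

For general (non-dyadic) distributions the average Huffman length lies between H(P) and H(P) + 1.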

The above discussion assumes known distributions of the information symbols. In practice, it is possible that the distribution is unknown and we only have a sequence of bits. In the situation of unknown statistics, a technique of universal source coding, such as the Lempel-Ziv coding scheme, can be applied. A surprising conclusion is that, even though the statistical information is lacking, the universal coding can still achieve the optimal compression rate asymptotically (as the length of sequence tends to infinity), given that the information source is stationary and ergodic.

## LOSSY SOURCE CODING

When the information source is continuous in value, it is impossible to encode the source information symbols using finitely many bits. Hence some information has to be lost when digital communications are used. We say that the source coding is lossy in this situation. Note that we can also apply lossy source coding for discrete information sources, if some information loss is tolerable.

Theoretically, lossy source coding can be studied using rate-distortion theory. First a distortion function d(X, X̂) is defined, where X is the information symbol and X̂ is the recovered one. For example, when X is real valued, the distortion can be defined as the mean square error, i.e., d(X, X̂) = E[|X − X̂|²]. Then a rate-distortion function R(D) is defined for each given average distortion D, which gives the minimum source coding rate ensuring that the average distortion is no larger than D. It is well known that R(D) can be obtained from an optimization of the mutual information between X and X̂. In practice, lossy source coding is carried out by a quantizer.

Essentially a quantizer divides the range of the information source symbol X into many regions, each of which corresponds to a bit string. Fig. 2.3 shows a simple scalar quantizer, in which the range of X is the interval [0, 1] and is divided into four subintervals. When X falls in one of the subintervals, the quantizer outputs the corresponding two bits. When X is a vector, a vector quantizer is needed, which is much more complicated. A powerful design algorithm for vector quantizers is Lloyd's algorithm.
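Lloyd's algorithm alternates between assigning samples to the nearest reproduction point and moving each point to the centroid of its assignment region. A one-dimensional sketch (the data and iteration count are illustrative choices of ours):

```python
def lloyd_quantizer(samples, levels, iters=50):
    """One-dimensional Lloyd algorithm: alternately assign samples to the
    nearest reproduction point and move each point to the centroid of its cell."""
    # Initialize reproduction points spread over the data range.
    lo, hi = min(samples), max(samples)
    points = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        cells = [[] for _ in range(levels)]
        for x in samples:
            j = min(range(levels), key=lambda k: (x - points[k]) ** 2)
            cells[j].append(x)
        # Centroid update; keep the old point if a cell is empty.
        points = [sum(c) / len(c) if c else points[j]
                  for j, c in enumerate(cells)]
    return sorted(points)

# Two well-separated clusters: the two reproduction points converge
# to the cluster means (0.1 and 0.9 here).
data = [0.05, 0.1, 0.15, 0.85, 0.9, 0.95]
print(lloyd_quantizer(data, 2))
```

The same alternation, run on vectors with Euclidean distance, is the vector-quantizer design algorithm referred to in the text.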

## MODULATION AND CODING

When receiving a series of information bits, the transmitter will convert them to physical symbols ready for transmission. Usually two steps are needed for this conversion. First the information bits are encoded by adding redundancies in order to combat the communication channel unreliability and the potential errors arising from this. Then the coded bits are converted to physical symbols that are ready to be passed to the hardware for transmission.

## CHANNEL CODING

There are two types of channel coding used to protect the transmission reliability from the channel uncertainty. In error detection coding, mistakes during the transmission may be detected. In error correction coding, the errors may be corrected without retransmitting the message.

Error detection coding

In error detection coding schemes, typical errors during the transmission can be detected. Then the received codeword may be discarded and the failure of the transmission announced, or the receiver will send back a request to the transmitter for retransmission. The simplest scheme of error detection is the parity check bit, namely the modulo-2 sum of the information bits is appended to the codeword. For example, when the original information bits are 1100111, the transmitted bit sequence is 11001111, where the last bit 1 is the parity check bit. If an error occurs in one of the bits and the receiver receives 11001110, then the error can be detected, since the parity check bit is unequal to the modulo-2 sum of the received information bits. Although it is also possible that the parity check bit, or two information bits, are in error, the corresponding probability is much smaller.
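The parity-check example above takes only a few lines to implement:

```python
def add_parity(bits):
    """Append an even-parity bit: the modulo-2 sum of the information bits."""
    return bits + [sum(bits) % 2]

def parity_ok(codeword):
    """Any single bit error makes the overall modulo-2 sum odd."""
    return sum(codeword) % 2 == 0

sent = add_parity([1, 1, 0, 0, 1, 1, 1])     # -> [1,1,0,0,1,1,1,1]
received = sent.copy()
received[3] ^= 1                             # flip one bit in transit
print(parity_ok(sent), parity_ok(received))  # True False
```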

Error correction coding

In the error correction coding scheme, typical errors in transmission can be corrected at the receiver without retransmission. The capability of error correction is assured by adding redundancies to the information bits. Take the simplest repetition code, for example. When we transmit bit 1, we can transmit it three times, thus sending out bit sequence 111.

The decoder uses the majority rule; i.e., the decision is the most frequent bit in the received sequence. Hence if there is one transmission error and the decoder receives bits 011, the decoder can claim that the information bit is actually 1 and thus correct the error. The transmission rate, usually denoted by R, is defined as the ratio between the number of information bits and the number of coded bits. The above example of repetition code has a transmission rate 1/3. The smaller R is, the more redundancies are added and the more reliable the transmission is; however, a smaller R also uses more communication bandwidth since the transmitter needs to send out more coded bits in the same time. A tradeoff is needed between the reliability and resource consumption.
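The rate-1/3 repetition code with majority decoding can be sketched directly:

```python
def repeat_encode(bits, n=3):
    """Repeat every information bit n times (rate 1/n)."""
    return [b for b in bits for _ in range(n)]

def majority_decode(coded, n=3):
    """Majority rule over each block of n received bits."""
    return [1 if sum(coded[i:i + n]) > n // 2 else 0
            for i in range(0, len(coded), n)]

coded = repeat_encode([1, 0, 1])        # [1,1,1, 0,0,0, 1,1,1]
coded[1] ^= 1                           # one error per block is correctable
coded[3] ^= 1
print(majority_decode(coded))           # [1, 0, 1]
```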

A typical error correction coding scheme used in practice is the linear block code, in which the coded bits c are given by c = bG, (2.14) where b represents the information bits (as a row vector) and G is called the generator matrix. The receiver receives a senseword s = c + e, where e is the error vector. The decoder applies a matrix H, which satisfies GH = 0, to calculate d = sH = eH, which is called the syndrome. The error vector e can be found by looking up a table indexed by the syndrome.
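Syndrome decoding can be made concrete with a Hamming(7,4) code; the systematic G and H below are our illustrative choice (any H with GH = 0 mod 2 and distinct nonzero rows works for single-error correction):

```python
import numpy as np

# Systematic Hamming(7,4): generator G = [I4 | P], check matrix H = [P; I3],
# so that GH = 0 (mod 2) and the syndrome d = sH identifies the error position.
P = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
G = np.hstack([np.eye(4, dtype=int), P])
H = np.vstack([P, np.eye(3, dtype=int)])

# Syndrome table: a single error in bit i yields the i-th row of H.
table = {tuple(H[i] % 2): i for i in range(7)}

def encode(b):
    return (b @ G) % 2

def decode(s):
    d = tuple((s @ H) % 2)              # syndrome
    c = s.copy()
    if any(d):                          # nonzero syndrome: flip the indicated bit
        c[table[d]] ^= 1
    return c[:4]                        # systematic code: data is the first 4 bits

b = np.array([1, 0, 1, 1])
s = encode(b)
s[5] ^= 1                               # single-bit channel error
print(decode(s))                        # [1 0 1 1]
```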

A more powerful error correction code is the convolutional code, which does not have a fixed codeword length and is thus different from the linear block code. In the convolutional code, memory is introduced by streaming the input information bits through registers. Take the structure in Fig. 2.4, for example. There are two registers; thus the memory length is 2. For each input information bit b(t), the two coded bits, denoted by c1(t) and c2(t), are generated from b(t) and the previous two information bits b(t − 1) and b(t − 2) by modulo-2 sums.

Then the two registers are occupied by b(t − 1) and b(t). Since one input information bit triggers the generation of two coded bits, the transmission rate is 1/2. By changing the number of simultaneous output coded bits and the number of branches of the input bits, we can achieve any rational transmission rate. As we will see in later chapters, the structure of the convolutional code is actually similar to that of linear dynamical systems, thus facilitating the proposed turbo channel decoding and system state estimation in CPSs. A breakthrough in the area of error correction codes is the turbo code, which was proposed by Berrou et al. in 1993 and achieves transmission rates very close to the Shannon limit of channel capacity (Fig. 2.5). For simplicity, we use the original form of turbo codes given in Ref.
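A rate-1/2, memory-2 convolutional encoder of the kind described above can be sketched as follows; the specific tap connections (the common generators 111 and 101) are our assumption, since they depend on the wiring of the registers in the figure:

```python
def conv_encode(bits):
    """Rate-1/2 convolutional encoder with memory length 2.
    Tap choices are illustrative: c1 = b ^ b1 ^ b2, c2 = b ^ b2."""
    b1 = b2 = 0                          # register contents b(t-1), b(t-2)
    out = []
    for b in bits:
        c1 = b ^ b1 ^ b2                 # first coded bit (modulo-2 sum)
        c2 = b ^ b2                      # second coded bit
        out += [c1, c2]
        b1, b2 = b, b1                   # shift the registers
    return out

print(conv_encode([1, 0, 1, 1]))
```

Each input bit produces two output bits, so the encoder behaves exactly like a small linear dynamical system driven by the bit stream.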

The encoder essentially uses a parallel convolutional code. Two convolutional encoders are used. The raw information bits, and a randomly permuted version of the bits (produced by an interleaver), are fed into the two convolutional encoders. Then the output of the turbo encoder consists of the original information bits and the output bits of the two convolutional encoders. The key novelty of the turbo code is in its decoder, which is illustrated in Fig. 2.6. The received encoded bits at the decoder consist of three parts: m (the original information bits with possible errors), X1 (the output of convolutional encoder 1), and X2 (the output of convolutional encoder 2). m and X1 are fed into a standard decoder, which generates soft decoding output (namely the probability of each bit being 0 or 1), instead of hard decisions.

The soft output is fed into decoder 2 as prior information. Decoder 2 combines m, X2 and the prior information from decoder 1, and also generates soft outputs. Then the outputs of decoder 2 are fed into decoder 1 as prior information. Such a procedure is repeated iteratively (thus named “turbo”) until convergence is achieved. The performance of decoding is improved through this iterative procedure. Such a “turbo” principle of information processing has been applied in many other tasks such as multiuser detection and equalization. As we will see, it can also be used for system state estimation in a CPS.

## MODULATION

When the information bits are encoded into codewords (note that in some situations the channel coding procedure is not necessary), the procedure of modulation maps the abstract information to physical signals. In many typical communication systems, the information is conveyed by sinusoidal carriers, namely

s(t) = A cos(2πft + φ),

where A, f, and φ are the amplitude, frequency, and phase of the sinusoidal signal, respectively. Hence the information can be carried in the amplitude, frequency, or phase, thus resulting in the following types of modulation schemes:

• Pulse amplitude modulation (PAM): There are M possible signal amplitudes, which represent log2 M bits. The simplest case is M = 2, namely the transmitter either transmits or not (A = 0), which is called on-off keying (OOK).
• Frequency shift keying (FSK): The transmitter can choose one of M frequencies for the sinusoidal carrier signal. Hence each switch of the frequency represents log2 M bits.
• Phase shift keying (PSK): The transmitter switches among M possible phases for transmitting log2 M bits.

The above three typical modulation schemes can be combined; e.g., quadrature amplitude modulation (QAM) is a combination of PAM and PSK. To explain the mechanism of QAM, we consider the following signal:

s(t) = A1 cos(2πft) + A2 sin(2πft),

which is decomposed into the in-phase and quadrature terms (A1 and A2 are the corresponding amplitudes). We can consider the carriers cos(2πft) and sin(2πft) as two orthogonal signals, each of which carries one branch of independent information. Hence (A1, A2) can be considered as a point in the signal plane (called a constellation), which is used to convey information. When there are M possible points in the constellation, we call the modulation scheme M-QAM. Fig. 2.7 shows illustrations of 16QAM and 64QAM.
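The mapping of bits to constellation points (A1, A2) and the nearest-point decision at the receiver can be sketched for 16-QAM; the Gray bit map below is our illustrative choice:

```python
# Map pairs of bits to amplitude levels on each axis (a simple Gray map),
# so 4 bits select one of 16 constellation points.
GRAY = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}

def qam16_modulate(bits):
    """Modulate groups of 4 bits to 16-QAM points (A1, A2)."""
    symbols = []
    for i in range(0, len(bits), 4):
        a1 = GRAY[tuple(bits[i:i + 2])]
        a2 = GRAY[tuple(bits[i + 2:i + 4])]
        symbols.append((a1, a2))
    return symbols

def qam16_demodulate(symbols):
    """Nearest-point decision on each axis, then invert the bit map."""
    inv = {v: k for k, v in GRAY.items()}
    bits = []
    for a1, a2 in symbols:
        for a in (a1, a2):
            level = min((-3, -1, 1, 3), key=lambda l: (a - l) ** 2)
            bits += list(inv[level])
    return bits

tx = qam16_modulate([1, 0, 0, 1])            # -> [(3, -1)]
rx = qam16_demodulate([(2.8, -1.3)])         # noisy received point
print(tx, rx)                                # [(3, -1)] [1, 0, 0, 1]
```

The noisy point (2.8, −1.3) still decodes correctly because it remains closest to the transmitted point (3, −1); denser constellations shrink this noise margin, which is exactly the rate-reliability tradeoff discussed below.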

When selecting a suitable modulation scheme, there is a tradeoff between the transmission rate and the transmission reliability. Usually, the symbol rate (i.e., the number of modulation signals transmitted in one second) is fixed. Hence if a denser constellation is used, more coded bits can be transmitted in the same time, thus improving the data transmission rate. However, the points in a dense constellation are close to each other and thus can be confused by noise, resulting in transmission errors. As we will see in subsequent chapters, this tradeoff is important for the modulation selection in CPSs.

## Estimation of the Rate-Distortion Function

Suppose a data string x_1^n := (x_1, x_2, … , x_n) is generated by a stationary memoryless source (X_n; n ≥ 1) with unknown marginal distribution P on a discrete alphabet A. In many theoretical and practical problems arising in a wide variety of scientific contexts, it is desirable, and often important, to obtain accurate estimates of the entropy H(P) of the source based on the observed data x_1^n; see, for example, the references therein.

Perhaps the simplest method is via the so-called plug-in estimator, where the entropy of P is estimated by H(P_{x_1^n}), namely, the entropy of the empirical distribution P_{x_1^n} of x_1^n. The plug-in estimator satisfies the basic statistical requirement of consistency, that is, H(P_{X_1^n}) → H(P) in probability as n → ∞. In fact, it is strongly consistent; the convergence holds with probability one. A natural generalization is the problem of estimating the rate-distortion function R(P, D) of a (not necessarily discrete-valued) source. Motivation for this comes in part from lossy data compression, where we may need an estimate of how well a given data set could potentially be compressed, and also from cases where we want to quantify the "information content" of a particular signal but the data under examination take values in a continuous (or more general) alphabet.
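The plug-in entropy estimator is straightforward to implement and its consistency is easy to observe empirically; the source distribution below is a made-up example:

```python
import math, random
from collections import Counter

def plugin_entropy(data):
    """Entropy (bits) of the empirical distribution of the observed string."""
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Draw from a known source and compare with the true entropy H(P) = 1.5 bits.
random.seed(0)
P = {'a': 0.5, 'b': 0.25, 'c': 0.25}
sample = random.choices(list(P), weights=P.values(), k=100000)
true_H = -sum(p * math.log2(p) for p in P.values())
print(true_H, plugin_entropy(sample))   # both close to 1.5
```

As n grows, the estimate concentrates around H(P), in line with the strong consistency stated above.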

The rate-distortion function estimation question appears to have received little attention in the literature. Here we present some basic results for this problem. First, we consider the simple plug-in estimator R(P_{X_1^n}, D) and determine conditions under which it is strongly consistent, that is, it converges to R(P, D) with probability 1 as n → ∞. We call this the nonparametric estimation problem, for reasons that will become clear below.

At first glance, consistency may seem to be a mere continuity issue: Since the empirical distribution P_{X_1^n} converges, with probability 1, to the true distribution P as n → ∞, a natural approach to proving that R(P_{X_1^n}, D) also converges to R(P, D) would be to try to establish some sort of continuity property of R(P, D) as a function of P.

But, as we shall see, R(P_{X_1^n}, D) turns out to be consistent under rather mild assumptions, which are in fact too mild to ensure continuity in any of the usual topologies; see Section III-E for explicit counterexamples. This also explains our choice of the empirical distribution P_{x_1^n} as an estimate of P: If R(P, D) were continuous in P, then any consistent estimator P̂_n of P could be used to make R(P̂_n, D) a consistent estimator of R(P, D). Some of the subtleties in establishing regularity properties of the rate-distortion function R(P, D) as a function of P are illustrated in the literature.

Another advantage of the plug-in estimator is that P_{x_1^n} has finite support, regardless of the source alphabet. This makes it possible (when the reproduction alphabet is also finite) to actually compute R(P_{x_1^n}, D) by approximation techniques such as the Blahut-Arimoto algorithm. When the reproduction alphabet is continuous, the Blahut-Arimoto algorithm can still be used after discretizing the reproduction alphabet; the discretization can, in part, be justified by the observation that it can be viewed as an instance of the parametric estimation problem described below. Other possibilities for continuous reproduction alphabets are explored in the literature.

The consistency problem can be framed in the following more general setting. As has been observed by several authors recently, the rate-distortion function of a memoryless source admits the decomposition

R(P, D) = inf_Q R(P, Q, D),

where the infimum is over all probability distributions Q on the reproduction alphabet, and R(P, Q, D) is the rate achieved by memoryless random codebooks with distribution Q used to compress the source data to within distortion D. Therefore, R(P, D) is the best rate that can be achieved by this family of codebooks. But in the case where we only have a restricted family of compression algorithms available, indexed, say, by a family of probability distributions {Qθ; θ ∈ Θ} on the reproduction alphabet, the best achievable rate is

RΘ(P, D) = inf_{θ∈Θ} R(P, Qθ, D).


We also consider the parametric estimation problem, namely, that of establishing the strong consistency of the corresponding plug-in estimator RΘ(P_{X_1^n}, D) as an estimator of RΘ(P, D). It is important to note that, when Θ indexes the set of all probability distributions on the reproduction alphabet, the parametric and nonparametric problems are identical, and this allows us to treat both problems in a common framework.
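For finite alphabets, the plug-in estimate of the rate-distortion function can be computed with the Blahut-Arimoto algorithm mentioned earlier. A minimal fixed-slope sketch (the function name, the slope parameter beta, and the iteration count are our choices):

```python
import math

def blahut_arimoto_point(p, dist, beta, iters=200):
    """One point on the rate-distortion curve of a discrete source.

    p[i]: source probabilities; dist[i][j]: distortion d(x_i, y_j);
    beta > 0 selects the operating point on the curve. Returns (D, R)
    with R in bits. Standard iteration:
      q(y) <- sum_x p(x) Q(y|x),   Q(y|x) proportional to q(y) exp(-beta d(x, y)).
    """
    m = len(dist[0])
    q = [1.0 / m] * m                       # output (reproduction) distribution
    for _ in range(iters):
        Q = []                              # conditional distributions Q(y|x)
        for i in range(len(p)):
            w = [q[j] * math.exp(-beta * dist[i][j]) for j in range(m)]
            z = sum(w)
            Q.append([wj / z for wj in w])
        q = [sum(p[i] * Q[i][j] for i in range(len(p))) for j in range(m)]
    D = sum(p[i] * Q[i][j] * dist[i][j]
            for i in range(len(p)) for j in range(m))
    R = sum(p[i] * Q[i][j] * math.log2(Q[i][j] / q[j])
            for i in range(len(p)) for j in range(m) if Q[i][j] > 0)
    return D, R

# Binary source with empirical distribution (0.5, 0.5), Hamming distortion:
# the true curve is R(D) = 1 - H(D), which the iteration reproduces.
p = [0.5, 0.5]
dist = [[0, 1], [1, 0]]
D, R = blahut_arimoto_point(p, dist, beta=2.0)
print(D, R)
```

Running this with p set to the empirical distribution P_{x_1^n} yields exactly the plug-in estimate discussed in this section; sweeping beta traces out the whole estimated curve.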

Our two main results, Theorems 4 and 5 in the following section, give regularity conditions for both the parametric and nonparametric estimation problems under which the plug-in estimator is strongly consistent. It is shown that consistency holds in great generality for all distortion values D such that RΘ(P, D) is continuous from the left. An example illustrating that consistency may actually fail at those points is given in Section III-D.

In particular, for the nonparametric estimation problem we obtain the following three simple corollaries, which cover many practical cases.

Corollary 1: If the reproduction alphabet is finite, then for any source distribution P, R(P_{X_1^n}, D) is strongly consistent for R(P, D) at all distortion levels D ≥ 0, except perhaps at the single value where R(P, D) transitions from being finite to being infinite.

Corollary 2: If the source and reproduction alphabets are both equal to R^d and the distortion measure is squared error, then for any source distribution P and any distortion level D ≥ 0, R(P_{X_1^n}, D) is strongly consistent for R(P, D).

Corollary 3: Assume that the reproduction alphabet is a compact, separable metric space, and that the distortion measure ρ(x, ·) is continuous for each x ∈ A.