# A digression on linear systems thinking

You probably don’t need to read this if you follow my blog at all; but just in case: skip this if you are already familiar with linear systems.

I’ve sometimes been asked to make my blog posts simpler to understand for the typical photographer. All this mathematics about diffraction and photons and such is outside the background of most practicing photographers, whether pros or amateurs. I’ve thought about this, but, as Vincent Versace is fond of quoting Einstein, we can make something as simple as possible, but no simpler. Instead, my aim here is to provide a clear and basic understanding of one of the central underpinnings of all analysis of cameras and images.

This notion is that a camera and its component parts can be assumed to be a linear system. What this implies is that the light falling on the camera can be taken as some sort of input signal and the image that comes out can be taken as an output signal. All of the component parts of the camera, lens, optical filters, sensor, electronics, and so on, operate on this input signal to produce an output signal.

To say that the system is linear implies, among other things, that if you had an input signal I and another input signal J, and you knew the output signals for each of these, call them O and P, so that our camera maps $I \longrightarrow O$ and $J \longrightarrow P$, then our camera will map $I+J \longrightarrow O+P$. Linearity also implies scaling: double the input and you double the output, so $aI \longrightarrow aO$ for any constant $a$.

In its most basic form, this means that we can break an input signal down into distinct component parts, say points of light, and treat each point on its own. We can then consider how the camera will handle each point of light and build the resulting image, or output signal, out of the sum of the responses to each distinct point.
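This superposition property is easy to check numerically. Here’s a minimal sketch — the three-tap blur below is a hypothetical stand-in for a camera, chosen only because it’s obviously linear — showing that the response to the sum of two signals is the sum of the individual responses:

```python
import numpy as np

def blur(signal):
    """A simple linear operator: a 3-tap moving average (illustrative only)."""
    kernel = np.array([0.25, 0.5, 0.25])
    return np.convolve(signal, kernel, mode="same")

# Two arbitrary input "signals": a point of light at two different positions
I = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
J = np.array([0.0, 0.0, 0.0, 2.0, 0.0])

O = blur(I)              # response to I alone
P = blur(J)              # response to J alone
combined = blur(I + J)   # response to both together

# Additivity: the response to I + J is exactly O + P
assert np.allclose(combined, O + P)
```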

This doesn’t always work. Every system has its limits. If our input was a single point of light that just saturated a sensor well, yielding over-exposure, then adding in two of these points of light won’t deliver twice as much over-exposure. Instead, we’re likely to see a non-linear phenomenon called blooming (sometimes described as inter-pixel interference), where the adjacent pixels start to show values due to electrons spilling over into neighboring sensor wells. To stay clear of these limits, we often have to restrict our models to small signals: inputs that can be summed without over-stressing the system. When these non-linearities occur, we get results like this:

$I + J \overset{\text{Camera}}{\longrightarrow} O + P + a O^2 + b P^2 + c\, O P + \text{other stuff}$

where $a$, $b$, $c$, and so on are constants that indicate how much of each kind of non-linearity is present. In sound systems, these higher-order terms are sometimes called harmonic distortion. Aliasing, which I’ve written about a good deal here lately, yields a kind of sub-harmonic distortion. Any time you put in content at one frequency and get out content at another frequency, you’ve found a non-linearity. We’ll come back to why this matters later. But enough of non-linearities for now.
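That frequency signature of non-linearity can be shown with a toy example. The quadratic term below is a made-up stand-in for whatever non-linearity a real system has; feed it a pure tone and energy appears at twice the input frequency — exactly the harmonic distortion just described:

```python
import numpy as np

fs = 1000                        # sample rate, Hz
f0 = 50                          # frequency of the input tone, Hz
t = np.arange(fs) / fs           # one second of samples
x = np.sin(2 * np.pi * f0 * t)   # a pure tone: content at f0 only

# A toy non-linear system: a linear term plus a small quadratic term
y = x + 0.2 * x**2

in_spec = np.abs(np.fft.rfft(x)) / len(x)
out_spec = np.abs(np.fft.rfft(y)) / len(y)

# The input has nothing at 2*f0; the output does -- a second harmonic
assert in_spec[2 * f0] < 1e-9
assert out_spec[2 * f0] > 0.04
```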

If we’re going to represent cameras and scenes as linear systems and signals, then we need some mathematical way of representing the systems as operations and the signals that are operated on. One natural way is in terms of space. We can break the scene down into points of light coming in from some direction. In one limit, this is basically geometric optics and ray-tracing. This doesn’t work very well once we encounter the wave character of light and diffraction, so some other machinery is required. However, a DSLR sensor and its output files are a perfect instrument for considering in this fashion. For each pixel position, the file reports values of red, green, and blue. So, if we number pixel columns from 1 to $N$ and pixel rows from 1 to $M$, then we can describe an image file as RGB(n, m), where n and m are the column and row indices. We can transform these RGB values into L*a*b* or CMYK or whatever else as well.

Let’s take an example of a system. Suppose we take a file and apply a blur to it. Each pixel has 8 neighboring pixels: one above, one below, one to each side, and 4 more on the diagonals. Let’s say that our blur spreads each pixel’s value so that 80% stays in the central pixel, 4% goes into each of the up/down and right/left pixels, and 1% into each of the diagonal pixels. The total sums to 100%, so no signal strength is lost. This is a perfectly linear operation. We could compute the output image one input pixel at a time and add up the results to get the final blurred image. Although the actual parameters differ from what I’ve quoted here, blur functions in Photoshop work exactly like this.
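Here’s a sketch of that blur, using exactly the 80/4/1 weights just described, written as an explicit sum of shifted, weighted copies of the image (numpy only; as noted, a real Photoshop kernel would use different weights):

```python
import numpy as np

# The blur kernel from the text: 80% stays put, 4% to each up/down and
# left/right neighbor, 1% to each diagonal -- weights sum to 1, so no
# signal strength is lost.
kernel = np.array([[0.01, 0.04, 0.01],
                   [0.04, 0.80, 0.04],
                   [0.01, 0.04, 0.01]])

def blur(image):
    """Apply the 3x3 blur by summing shifted, weighted copies of the image."""
    H, W = image.shape
    padded = np.pad(image, 1)            # zero-pad the border
    out = np.zeros_like(image, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += kernel[dy + 1, dx + 1] * padded[1 + dy : 1 + dy + H,
                                                   1 + dx : 1 + dx + W]
    return out
```

Feeding in a single bright pixel returns the kernel itself — a first glimpse of the impulse response idea that comes up later in this post.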

But back to where our overall thinking about lenses and such doesn’t quite fit the geometric optics model. Often the solution is in terms of a trick sponsored by Mr Fourier (Joe to his friends). Let’s consider our output signal in terms of a print. Say we used as an input signal a single point of light, so our print is mainly black with some small white blob in the middle. [We are already in some intellectual difficulty since we have to think of paper white as something and black ink as nothing; but work with me here.] We can give the print some $x, y$ co-ordinates and describe the blob mathematically in terms of the intensities of red, green and blue, or cyan, magenta, yellow and black. Mr Fourier’s trick is to show that he could have created that same white blob by adding up a bunch of sine waves lined up properly in the horizontal and vertical directions. Say we start with horizontal sine waves. We get the first one with one wavelength across the horizontal direction and lined up so its peak value is right where the blob is. Then we get another with two wavelengths and line it up with the location of the blob. And so on. With an infinite number of sine waves. What we’ll find is that right at the blob, these waves all interfere constructively, because we made it that way. Away from the blob, they all interfere destructively and the result is nothing. So, we have a thin vertical line. Then we repeat in the vertical direction, and we get our blob. If we can do this trick with any point, we can do it for any shape made up of points.

Here’s an example for our point, given in just one dimension, up to the first 2 sine waves:

Fourier expansion of a point at the origin, 2 terms

Here’s what we get at 10 terms:

Fourier expansion at 10 terms

And here’s after 1000 terms:

Fourier expansion of a point, 1000 terms
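The partial sums in these figures can be reproduced in a few lines. Below is a sketch (my own construction, not necessarily the exact series behind the plots): cosines that all peak at the origin, so they add constructively there and increasingly cancel everywhere else.

```python
import numpy as np

L = 1.0                               # width of the interval
x = np.linspace(-L / 2, L / 2, 1001)  # sample points across the interval

def partial_sum(n_terms):
    """Sum the first n_terms cosines, each with a peak at x = 0.

    At the origin every term contributes +1, so the spike grows with
    n_terms; away from the origin, the terms interfere destructively."""
    total = np.zeros_like(x)
    for n in range(1, n_terms + 1):
        total += np.cos(2 * np.pi * n * x / L)
    return total

# The spike at the origin sharpens and grows as terms are added
s2, s10, s1000 = partial_sum(2), partial_sum(10), partial_sum(1000)
```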

Here’s what we’d get in a 3-D plot expanding only to the first 3 terms in Mr Fourier’s world:

2-D Fourier expansion of a point, 3 terms

Looks a bit like an Airy disk, no? Back to that later.

Let’s think for a moment about what’s going on here. We began by thinking of our signal in terms of some intensity of light at a point in space, say $I(x,y)$. What this means is that our signal has some given value $I$ at each point $x,y$. But now, we’re saying that we could have made the exact same signal up out of some sine waves by specifying their intensity and phase, $A_x, \phi_x, A_y, \phi_y$, at each spatial frequency $2 \pi n_x/L_x, 2 \pi n_y/L_y$, where the $n_x, n_y$ run from 1 to infinity. When we represent our signal in the form $I(x,y)$, we are using a Cartesian co-ordinate system in $x$ and $y$. When we represent it in Mr Fourier’s system of sine waves, we are just using another Cartesian co-ordinate system in $n_x$ and $n_y$.

What we have done in going into Mr Fourier’s world is to change our co-ordinate system; Mr Fourier’s transformations are nothing more or less than a rotation of co-ordinates. This is why optical engineers are so attached to their resolution charts. They understand that any image can be made up out of these little squiggles of sine waves in the same way that it can be made up out of points of light.

This tells us that a signal is a point in a space of many dimensions. If we think about our DSLR sensor, say from my beautiful Nikon D700, we have 2832 rows and 4256 columns. At each one of these row and column positions, we can have a discrete R or G or B value between 0 and $2^{14}-1 = 16{,}383$. The number of dimensions is equal to the number of pixel positions, 12,052,992. That’s a lot of dimensions. Still, many problems in Fourier analysis have infinite dimensions. When we get to systems like this sensor or its output files, where the number of dimensions is finite and the values in each dimension are quantized, we will use something called the discrete Fourier transform, or DFT. Same basic idea, slightly different math: sums instead of integrals.
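To make “sums instead of integrals” concrete, here’s the DFT written out as the literal sum it is, checked against numpy’s FFT (which computes the same thing, just much faster):

```python
import numpy as np

def dft(x):
    """The discrete Fourier transform as a plain sum:
    X[k] = sum over n of x[n] * exp(-2j * pi * k * n / N)."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N))
                     for k in range(N)])

signal = np.random.default_rng(0).normal(size=64)

# The FFT is just a fast algorithm for this exact sum
assert np.allclose(dft(signal), np.fft.fft(signal))
```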

A central dogma of linear systems thinking is that systems have characteristic functions, called impulse responses, that describe how they react to any impulse of energy delivered to them. The underlying theory is that any linear system dissipates energy in a way unique to its structure. A classic case might be a pendulum. Hit it with a hammer and the pendulum will swing at some characteristic frequency for a while and gradually slow to a stop. In the case of a camera with a nearly circular aperture, we have seen before in other posts that the result is an Airy disk. If the aperture is square, the result is much more like the approximation to a pulse that we just saw using Fourier series. The concept of an impulsive input signal is that it delivers energy into the system very quickly; the way in which the system degrades that energy can depend thereafter only on its structure. In the case of a camera, if we deliver a short, sharp pulse of light, the response is called the point spread function (PSF). If we imagine that any scene is comprised of a huge number of such discrete pulses, then we can further assume that an image is composed of a sum of point spread functions, each weighted by the intensity and color of its point in the scene.

We looked earlier at a blur function in some post processing system like Photoshop. This blur had its own unique impulse response or PSF, just as we gave earlier. If we applied this blur function across every point (or dimension) in the image file, we would be doing an operation called convolution. A mathematical expression for convolution is as follows:

$\text{Output}(n, m) = \sum_{j=0}^{J-1} \sum_{k=0}^{K-1} \text{Input}(j, k)\, \text{ImpulseResponse}(n-j,\ m-k)$

Admittedly, this may look ugly. But it’s just some math to describe what we said: to find the output value at some point $n, m$, we add up all the input values that can reach it, each weighted by the impulse response at the corresponding offset. In many cases, like our blur function, the impulse response is very limited in extent. Summing over the $j$ and $k$ values is just this process of sliding the impulse response around until we’ve collected all the input values that could impact the output position.

There is another way to handle this convolution procedure, in Mr Fourier’s world. The notion is similar to using logarithms to multiply. Anybody remember that? If you want to multiply two numbers, you can take their logarithms and add them, then find the antilogarithm of the result; and that’s your answer. In the Fourier world, the equivalent of convolution is just multiplication. So, if we know the Fourier transform of an input signal and the Fourier transform of a system’s PSF (or, what’s equivalent, its MTF), then we can multiply these two together; and that’s the Fourier transform of the output. (OK, I’m neglecting phase terms here for those of you who understand all this… Why are you here anyway?) If you want the result back in regular space, take the inverse Fourier transform; and voilà, the answer. One of the reasons that folks into digital signal processing love their Fourier transforms so much is that multiplying signals in Fourier’s world is far less demanding of computer resources than convolution in regular space (or time). Even throwing in the computational cost of the Fourier transforms themselves, it’s often much faster.
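The logarithm analogy can be checked directly. This sketch (1-D for brevity, with a made-up three-tap PSF) computes a convolution the slow way, then by transform–multiply–inverse-transform; the two agree. One caveat I’m glossing over: the DFT version gives circular convolution, so real implementations zero-pad to avoid wrap-around at the edges.

```python
import numpy as np

N = 128
rng = np.random.default_rng(1)
signal = rng.normal(size=N)          # some arbitrary input signal
psf = np.zeros(N)
psf[:3] = [0.25, 0.5, 0.25]          # a small, made-up impulse response

# Convolution in regular space: slide, weight, and sum (circular indexing)
direct = np.array([sum(signal[j] * psf[(n - j) % N] for j in range(N))
                   for n in range(N)])

# The same operation in Mr Fourier's world: transform, multiply, invert
via_fourier = np.real(np.fft.ifft(np.fft.fft(signal) * np.fft.fft(psf)))

assert np.allclose(direct, via_fourier)
```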

In the real world, we find that many optical components do not have uniform MTFs over their entire surface. Lenses get soft at the edges. Sensors show vignetting at their boundaries. So, the MTF that we find near the optical axis is not the MTF at the edge. Likewise, we can find that the MTF depends on the orientation of line pairs. Horizontal MTF is not the same as vertical MTF is not the same as tangential MTF. Fourier theory actually accounts for this reasonably well; and would tell us that we need to look separately at horizontal and vertical spatial frequencies. But we should be able to compute tangential MTFs from the combination of horizontal and vertical MTFs; and if that isn’t true, then we have a linearity problem. The dependence of MTF on location is in itself a kind of non-linearity; but one that is easy to work with. We would have a similar situation with a sound system whose frequency response changed as it warmed up. So long as the rate of change of the frequency response is slow, say on the order of several seconds, with respect to the audio frequencies (20 – 20,000 Hz) we have no big problems. In our lenses though, life is a little tougher since we find MTFs changing across spatial dimensions that are of the order of the system’s spatial frequency response. What saves our linear modeling in this case is that it’s the high spatial frequencies that are impacted as we move to the edge. The lower spatial frequencies, representing content that spans the entire image, remain reasonably constant in most cases.

So, our central idea is that we can decompose a signal into its component parts. We can choose those parts for our convenience. A similar situation occurs in, say, your car stereo when you turn up the bass. You’ve set the electronic system to increase the lower frequencies selectively. The same thing can be done for spatial systems as well.

Let’s show a simple example. Here’s a circular aperture, represented in Photoshop as a white disk in the center of a black background.

Circular aperture

And here is the magnitude of its Discrete Fourier Transform (DFT):

DFT of circular aperture

Now, I should probably explain some mathematical niceties in this representation. First, this is just our old friend, the Airy disk, which we first encountered here in discussions of diffraction. Looking at this image, you’ll notice certain symmetries. What you’re seeing is, in effect, the Airy disk sliced into four pieces and put together backwards. This is just a side-effect of the way in which the software I’m using (Wolfram Mathematica) computes the DFT. The low spatial frequencies are clustered in the left-hand corner, where, as the scales show, both the frequencies in the x and y axes are near 0.

Let’s see what happens if we apply a low pass filter to this circular disk. The DFT now looks like this, with those few higher spatial frequency ripples gone:

Airy Disk low-pass filtered

And this is the resulting image:

Circular aperture, smoothed

You can see the obvious smoothing of the sharp edges of the disk. Let’s go the other way. Let’s enhance the higher frequencies and remove the low ones.

High pass filter applied

You might have to look at this for a moment, but what’s happened is that the low frequencies are completely removed and the higher frequency ripples have been amplified. Here’s the resulting image:

circular aperture, high pass filtered

Cool, eh? We just have the edge of the disk left. Just in case you thought this didn’t work: it actually works a hot damn.
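For anyone who wants to try this at home, here’s a sketch of the whole experiment in a few lines of numpy (the disk size and frequency cutoff are arbitrary choices of mine, not the values behind the images above):

```python
import numpy as np

# A white disk on a black background, standing in for the aperture image
N = 128
yy, xx = np.mgrid[:N, :N]
disk = ((xx - N / 2) ** 2 + (yy - N / 2) ** 2 <= 20 ** 2).astype(float)

# Radial spatial frequency for each DFT bin (0 frequency at the corner,
# just as in the DFT images above)
f = np.fft.fftfreq(N)
radius = np.sqrt(f[None, :] ** 2 + f[:, None] ** 2)

F = np.fft.fft2(disk)
cutoff = 0.05                                                  # arbitrary
low = np.real(np.fft.ifft2(np.where(radius < cutoff, F, 0)))   # smoothed disk
high = np.real(np.fft.ifft2(np.where(radius >= cutoff, F, 0))) # edges only

# The two filters split the spectrum cleanly, so the pieces sum to the original
assert np.allclose(low + high, disk, atol=1e-8)
```

The low-pass result is the softened disk; the high-pass result keeps only the sharp edge, since that is where the high spatial frequencies live.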

In another post here, I showed how the human eye is a bandpass system in terms of its contrast sensitivity function (CSF). What is quite astounding about the CSF chart shown there is that our eyes are obviously processing the image on the basis of its spatial frequency content and not just the points of light. The peak values of the points of light are exactly the same along any horizontal line. But we don’t see them as equal. We see them in a way that depends on the local variation in spatial frequency. Our perceptual apparatus, our perceptual system, has a bandpass MTF!

As a thing of some not insignificant beauty, depending on your point of view I guess, the following is the DFT of that CSF chart:

DFT of the CSF chart

Click on the image for the larger view and pay careful attention to the area around 0,0 in the left-hand corner. You should see a bit of a dotted line there indicating the very low frequency values corresponding to the uniform gray of the image. The very fine spatial frequency structure shows the information content of the image; and the notches tell us that many spatial frequencies are missing. Nonetheless, this view of the CSF chart in Mr Fourier’s world is perhaps even more amazing than the regular view we have in our own; don’t you think?