Tapestry: Sharding Convolution Operators
This post develops part of this document: Sharding Convolution Operators.
Let’s now consider a new operation, the application of Convolution Kernels.
```
Y = Conv2D(X, K)
```
Kernel convolution operations tile (or tessellate) a moving input window over the entire space of an input tensor. Neighboring applications of the kernel usually share input data (see sparse convolutions below for the exceptions), so effective modeling of their shard characteristics can dramatically reduce data flow in large computations: sharding by data neighborhoods maximizes local data sharing.
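As a rough illustration of that sharing (a minimal NumPy sketch, not Tapestry's operator definition), adjacent output cells of a dense 2-D convolution re-read most of the same input cells:

```python
import numpy as np

def conv2d(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Naive dense 2-D convolution: no padding, unit stride.

    Output cell [i, j] reads the input window x[i:i+kh, j:j+kw]; adjacent
    output cells therefore re-read almost all of the same input cells.
    """
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return y

x = np.arange(36.0).reshape(6, 6)
k = np.ones((3, 3))
assert conv2d(x, k).shape == (4, 4)
```

Since y[i, j] and y[i, j + 1] share all but one column of their input windows, sharding Y by contiguous spatial blocks keeps most of those repeated reads local to a shard.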
Expanding the sharding theory of convolution operations will require us to:
- define tensor stride view operations, to model sparse convolutions;
- develop stacked affine projections, to work in derived view environments;
- define tensor fusion operations, to reassemble sparse shards.
Consider the following case of a small convolution kernel applied across an input tensor.
If we apply no padding, and shift each neighbor by 1 step, then each output index reads an input window beginning at that index, and adjacent outputs overlap on all but one step of their windows.
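As a minimal sketch of that window mapping, assume a hypothetical width-3 kernel over a short 1-D input (the sizes are illustrative, not taken from the original example):

```python
import numpy as np

x = np.arange(6)    # a short 1-D input
window = 3          # hypothetical kernel width

# With no padding and a 1-step shift, output index i reads x[i : i + window].
for i in range(len(x) - window + 1):
    print(i, x[i:i + window])
# 0 [0 1 2]
# 1 [1 2 3]
# 2 [2 3 4]
# 3 [3 4 5]
```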
The projection function for the input should be chosen such that (a sketch follows this list):
- the origin value should point to the first cell used by the output origin;
- the marginal stride matches the output stride;
- the projection shape size matches the window size.
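A minimal sketch of such a projection, assuming a hypothetical [3, 3] window, unit output stride, and a zero origin (these values are illustrative, not the original example's):

```python
import numpy as np

def input_projection(out_idx, *, origin=(0, 0), stride=(1, 1), window=(3, 3)):
    """Map an output index to the input region it reads.

    - `origin` points at the first input cell used by the output origin;
    - the per-dimension `stride` matches the output stride;
    - the returned region's shape is the `window` size.
    """
    start = np.asarray(origin) + np.asarray(out_idx) * np.asarray(stride)
    return start, np.asarray(window)

start, shape = input_projection((2, 5))
assert tuple(start) == (2, 5) and tuple(shape) == (3, 3)
```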
Convolution operations are frequently applied not to one convolution kernel, but to a stack of them; it's common for a convolution filter to carry an extra dimension indexing many independent kernels, each producing its own output channel.
Additionally, in cases where no padding is used, the output must lose size relative to the input: the first and last positions along each dimension are shifted inward to permit full application of the convolution filters. Padding, and the many questions of how that padding should be generated, will be discussed later.
Consider:
- a 100-batch, 1-channel input X;
- a 128-layer, 1-channel convolution filter K;
- yielding a 100-batch, 128-channel output Y.
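A quick shape check of that example, assuming hypothetical [10, 10] spatial extents for the input and [3, 3] for the filter (the batch and layer counts are from the text; the spatial sizes are not):

```python
batch, in_h, in_w = 100, 10, 10    # 100-batch, 1-channel input X (10x10 spatial size assumed)
layers, k_h, k_w = 128, 3, 3       # 128-layer, 1-channel filter K (3x3 spatial size assumed)

# With no padding and unit stride, each spatial dimension shrinks by (kernel - 1).
out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
print("output:", (batch, layers, out_h, out_w))    # (100, 128, 8, 8)
```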
Sparse Strided Convolution Operators
Consider an operation which is common in convolution, but which our current index projection description has no mechanism for describing: striding. In this example, we wish to apply the convolution kernel not to immediately adjacent input cells, but to cells spaced a fixed stride apart.
This is a common mechanism to add non-local information to a kernel without inflating the size (and complexity) of the kernel filter itself; a good model of it is necessary.
The outcome we'd like to achieve here is the ability to rewrite this operation into dense variants, since doing so permits local neighborhood data reuse.
Consider the following rewrite, into dense convolutions over strided, sliced views of the input X (sketched below).
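A minimal 1-D sketch of that rewrite (the sizes and helper functions are illustrative assumptions, not Tapestry operators): a strided-kernel ("dilated") convolution over X can be computed as dense convolutions over the strided slices X[o::d], whose results are then fused back together by interleaving.

```python
import numpy as np

def dilated_conv1d(x, k, d):
    """Direct strided-kernel (dilated) 1-D convolution: no padding, unit output step."""
    span = d * (len(k) - 1) + 1              # input extent touched by one output cell
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = np.dot(x[i:i + span:d], k)
    return out

def dense_conv1d(x, k):
    """Dense 1-D convolution: no padding, unit stride."""
    out = np.empty(len(x) - len(k) + 1)
    for i in range(len(out)):
        out[i] = np.dot(x[i:i + len(k)], k)
    return out

x = np.arange(12.0)
k = np.array([1.0, 2.0, 1.0])
d = 2                                        # kernel stride

direct = dilated_conv1d(x, k, d)

# Rewrite: run a *dense* convolution over each strided slice x[o::d],
# then fuse the per-slice results back together by interleaving them.
fused = np.empty_like(direct)
for o in range(d):
    fused[o::d] = dense_conv1d(x[o::d], k)

assert np.allclose(direct, fused)
```

Each dense sub-convolution again enjoys local neighborhood reuse within its slice, which is exactly the property the rewrite is meant to recover.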
There are two broad approaches to realize this goal, which will be explored in later sections:
- extending the projection function language with the concept of striding;
- developing strided tensor slice and fusion operations.
In practice, these two approaches are isomorphic to each other, though some problems are easier to express in one than the other. We'll develop both; a brief sketch of the first approach follows.
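As a flavor of the first approach, the earlier projection sketch extends naturally with a per-dimension dilation factor (again an illustrative assumption, not the formal definition developed later):

```python
import numpy as np

def strided_input_projection(out_idx, *, origin=(0, 0), stride=(1, 1),
                             window=(3, 3), dilation=(2, 2)):
    """Map an output index to the (strided) input cells it reads.

    The window start still advances with the output stride, but the cells
    read within the window are now `dilation` steps apart.
    """
    start = np.asarray(origin) + np.asarray(out_idx) * np.asarray(stride)
    offsets = np.stack(
        np.meshgrid(*[np.arange(w) for w in window], indexing="ij"), axis=-1
    )
    return start + offsets * np.asarray(dilation)

cells = strided_input_projection((0, 0))
assert cells.shape == (3, 3, 2) and tuple(cells[1, 1]) == (2, 2)
```

The second approach, slicing and fusing strided views, was sketched in the rewrite above; later sections make both precise.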