Tapestry: Tensor View Selections

This post develops part of this document:

Tensor View Selections

As noted in previous sections, a family of tensor view selection operations can significantly reduce the complexity of representing expression graphs.

Consider a very simple expression; one which indexes solely over a batch dimension, mapping vectors of input features to vectors of output features.

The signature for this expression describes the contract of the attached operation exactly; we do not have analytic information about the internals of the operation (at this level), so to execute this expression we must provide an input tensor shaped as the signature's input requires, and we must expect an output tensor shaped as the signature's output describes.

But suppose the previous step provided a tensor view which was oriented feature-first?

What operation might we use to adjust the data?
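As a concrete illustration, here is a minimal sketch using numpy as a stand-in backend (an assumption; the names are invented for illustration): reorienting a feature-first tensor for a batch-first consumer can be expressed as a zero-copy view rather than a data-rewriting operation.

```python
import numpy as np

# Illustrative only: the upstream step produced a feature-first view of
# shape [features, batch], while the consumer expects [batch, features].
feature_first = np.arange(12.0).reshape(4, 3)   # [features=4, batch=3]

# A permute/transpose Selection reorients the view without copying data;
# only the index math (shape and strides) changes.
batch_first = feature_first.T                   # [batch=3, features=4]

assert np.shares_memory(batch_first, feature_first)
assert batch_first[1, 2] == feature_first[2, 1]
```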

We could of course use further block expressions to rewrite data; the operation family is general and sufficient for any structural rewrite we may wish to perform. But doing so would push the problem back up a recursion level; we’d be scheduling operations which existed solely to reorder data for other operations.

Under sufficiently strong graph rewrite and operator fusion assumptions, such an approach could work efficiently, but it raises the standards needed for an effective optimizer.

So we look for a weaker operator family, which would be simpler to schedule and more amenable to rewrites and optimization.

Additionally, consider that the input for a given operation may need to be collected from multiple disparate tensor shards in a distributed execution environment.

Possibly sharded by batch:

Or by feature group:

Defining some terms:

  • a Tensor View is a logical slice of tensor-structured data; and
  • a Selection is an expression to assemble a Tensor View from other tensors.

Note: as these describe sharding operations, Tensor Views are slices of tensor coordinate space, and a given Tensor View may be indexed at any point in that space.

A feature common to the examined cases is that a Selection routes each cell in its resultant View back to some cell in some source tensor.

We can formalize this as a requirement:

  • a Selection maps each coordinate in a Tensor View's coordinate space to a (source tensor, source coordinate) pair, referencing exactly one source tensor.
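A minimal sketch of that requirement, with hypothetical names that do not come from the Tapestry sources: abstractly, a Selection is a function from a coordinate in the view's space to exactly one (source tensor, source coordinate) pair.

```python
from typing import Callable, Tuple

# Hypothetical types for illustration only.
Coord = Tuple[int, ...]                           # a point in tensor space
Selection = Callable[[Coord], Tuple[str, Coord]]  # view coord -> (source, coord)

def transpose_selection(source: str) -> Selection:
    """A permute/transpose Selection over a 2-d source tensor."""
    def route(coord: Coord) -> Tuple[str, Coord]:
        i, j = coord
        return source, (j, i)
    return route

sel = transpose_selection("tensor:A")
assert sel((1, 2)) == ("tensor:A", (2, 1))
```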

Fundamentally, moving data from the output of one operation to the input of another, which may be on another machine, is an operation centered on copying portions of buffers. By being careful in our restriction of legal Selection operations, we can evaluate them by simply copying different buffer regions; many of these operations will have zero marginal cost over direct data movement.

Affine Selections

Consider the following potential index/stride-manipulation Selection operations:

  • permute/transpose - reordering the dimension indexes of a tensor is free.
  • reverse/flip - flipping a tensor dimension is free.
  • select - selecting a subset of a tensor is free.
  • stride - skipping every k items along a dimension is free.
  • squeeze/unsqueeze - adding/removing a size-1 dimension is free.
  • broadcast* - treat a size 1 dimension as though it were size n, without copying data.

These operations are free on local tensors, because they’re all indexing tricks; and can be implemented using discrete affine projections, the same as the index projection functions.
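To illustrate the indexing-trick claim, here is a sketch using numpy as a stand-in backend (an assumption; Tapestry is not tied to numpy): each listed operation yields a view over the same underlying buffer, with only the shape/stride metadata of the affine index map changed.

```python
import numpy as np

# Each of these produces a view over the same buffer as `t`; only the
# shape/stride (affine index) metadata differs, so no data is copied.
t = np.arange(24.0).reshape(2, 3, 4)

views = {
    "permute":   t.transpose(2, 0, 1),        # reorder dimensions
    "flip":      t[:, ::-1, :],               # reverse a dimension
    "select":    t[:, 1:3, :],                # subset of a dimension
    "stride":    t[:, :, ::2],                # every k-th item
    "unsqueeze": t[:, None, :, :],            # insert a size-1 dimension
    "broadcast": np.broadcast_to(t[:, :1, :], (2, 3, 4)),  # size-1 -> size-n
}

for name, view in views.items():
    assert np.shares_memory(view, t), name
```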

On remote tensors, we can transmit the Selection operation to the tensor holder, evaluate the indexing trick operation where it is free, and transmit back the selected data block.

*broadcast is something we’d much prefer to implement on the local consumer; as implementing broadcast remotely would cause an unnecessary duplication of data. And we see now an operation where a good optimizer may wish to rewrite a Selection cascade for efficiency in some situations.

Composite Selections

A careful reader of the given examples may note that we have a case for both some form of concat (for the example of fusing partial feature results); and of interleave (for the example of fusing the results of a sharded dilated convolution kernel).

  • concat - assemble a tensor view by concatenating multiple tensors along a dimension.
  • interleave - assemble a tensor view by interleaving multiple tensors along a dimension.
  • repeat* - assemble a tensor view by repeating a tensor along a dimension.
  • pad* - supply data outside the selection region with a default value, or a reflection.
  • where, max, min - conditionally construct a view by switching on the data from multiple views.

These operations cannot be implemented using discrete affine projections; they generally perform routing by applying some range limit comparison operation, with or without a modulo, to one dimension of the input, and use that to derive a target tensor and new coordinates.
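A sketch of that routing scheme, with invented helper names: concat routes a coordinate by a running range comparison on one dimension, while interleave routes by a modulo (to pick the source) and a floor-divide (to rescale the index).

```python
from typing import List, Tuple

Coord = Tuple[int, ...]

def concat_route(sizes: List[int], dim: int, coord: Coord) -> Tuple[int, Coord]:
    """Route a view coordinate for a concat along `dim`: a running range
    comparison on one dimension selects the source and rebases the index."""
    idx = coord[dim]
    for source, size in enumerate(sizes):
        if idx < size:
            return source, coord[:dim] + (idx,) + coord[dim + 1:]
        idx -= size
    raise IndexError("coordinate outside the concatenated view")

def interleave_route(count: int, dim: int, coord: Coord) -> Tuple[int, Coord]:
    """Route a view coordinate for an interleave of `count` sources along
    `dim`: modulo picks the source, floor-divide rescales the index."""
    source, idx = coord[dim] % count, coord[dim] // count
    return source, coord[:dim] + (idx,) + coord[dim + 1:]

# Two sources of sizes 3 and 5 concatenated along dim 0:
assert concat_route([3, 5], 0, (4, 7)) == (1, (1, 7))
# Two sources interleaved along dim 0:
assert interleave_route(2, 0, (5, 7)) == (1, (2, 7))
```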

On local machines, concat and interleave are generally implemented as full-copy operations, because supporting them as transparent views would otherwise require a tree of tensor view objects; but as Selection operations, they are still fundamentally performing simple index operations and then deferring to a backing view.

*pad and repeat are Selections we’d also prefer to implement on the local consumer; as the data is either a default value, or a reflection or duplication of data we already have; and these are also good targets for Selection optimization and rewrite.

Atomic Selections

We could define a Selection as an arbitrary tree of the above or similar operations; but each of these operations has a real impact on the cost model of execution through its data-sharing behavior, and we desire to be able to see and optimize through those operations; so we forbid composite Selections:

  • All Selection operations are “simple”, and complex Tensor Views are assembled via trees of chained atomic Selections; not composite Selections.

Under evaluation, it will generally be trivial to fuse Selectors; but for analytic modeling, we keep them separate.
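To make the “trivial to fuse” claim concrete, here is a sketch reusing the illustrative coordinate-routing shape from the earlier sketches: a chain of atomic Selections is evaluated by simple function composition, while remaining visible as separate nodes for cost modeling.

```python
from typing import Callable, Dict, Tuple

Coord = Tuple[int, ...]
# Illustrative only: a Selection routes a view coordinate to a
# (source name, source coordinate) pair.
Selection = Callable[[Coord], Tuple[str, Coord]]

def fuse(selections: Dict[str, Selection], view: str, coord: Coord) -> Tuple[str, Coord]:
    """Follow a chain of atomic Selections until a concrete tensor is reached;
    each hop is a cheap index computation, so fusion is just composition."""
    name = view
    while name in selections:
        name, coord = selections[name](coord)
    return name, coord

# A transpose view over a flip view over the concrete tensor "A":
selections: Dict[str, Selection] = {
    "flipped":    lambda c: ("A", (9 - c[0], c[1])),    # flip dim 0 (size 10)
    "transposed": lambda c: ("flipped", (c[1], c[0])),  # permute dims
}
assert fuse(selections, "transposed", (3, 2)) == ("A", (7, 3))
```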

An Example From Conv

We now have the operations necessary to fully describe the previous dilated convolution example:

Where a dilated convolution input is sharded into dense convolutions, and the dense results are interleaved back into the result we would have seen if the original dilated convolution had been applied.
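As a concrete check, here is a minimal numpy sketch under illustrative assumptions (a hand-rolled 1-d valid-mode convolution; none of these names are Tapestry APIs): strided slices of the input act as affine Selections, the dense convolutions stand in for the sharded block operations, and an interleave reassembles exactly the dilated result.

```python
import numpy as np

def conv1d(x, w, dilation=1):
    """Valid-mode 1-d convolution (cross-correlation) with dilation."""
    k = len(w)
    span = (k - 1) * dilation + 1
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(16.0)
w = np.array([1.0, -2.0, 1.0])
d = 2  # dilation factor

# Reference: the dilated convolution applied directly.
reference = conv1d(x, w, dilation=d)

# Sharded: each shard is a strided view of x (an affine Selection), convolved
# densely; the dense results are then interleaved (a composite Selection).
shards = [conv1d(x[j::d], w, dilation=1) for j in range(d)]
interleaved = np.empty_like(reference)
for j, y in enumerate(shards):
    interleaved[j::d] = y

assert np.allclose(interleaved, reference)
```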

Generators

An additional potentially useful category of Selectors is generators; consider:

  • zeros, ones, full - generate a view full of a given value.
  • rand - generate “random” data.

Generators can be used to create an entire Tensor View on the local machine; as the data is a deterministic function, there’s no data to transmit.
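A sketch of a generator as a pure function of the coordinate, with illustrative names only: nothing needs to be transmitted because any consumer can evaluate the function itself.

```python
from typing import Callable, Tuple

Coord = Tuple[int, ...]

def full(value: float) -> Callable[[Coord], float]:
    """A generator Selection: every coordinate maps to the same value."""
    return lambda coord: value

zeros = full(0.0)
ones = full(1.0)
assert zeros((3, 4)) == 0.0 and ones((0, 0)) == 1.0
```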

Random Generators

It is important to note that rand is an entire sub-category of problems:

  • We require an idempotent, and thus deterministic, random field; in a given evaluation, we must be able to rebuild the same random values every time a tensor view is accessed; so rand needs a seed, and must compute its data as some pure function of the coordinate space (see the sketch following this list).
  • There are many random number distributions to select from, before we even consider that distribution parameters may themselves be functions of the input coordinates.
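A sketch of such a deterministic random field, using a hash as a stand-in for a proper counter-based generator (the helper name and the choice of blake2b are assumptions for illustration, not Tapestry decisions):

```python
import hashlib
import struct
from typing import Tuple

Coord = Tuple[int, ...]

def seeded_uniform(seed: int, coord: Coord) -> float:
    """An idempotent random field: the value at a coordinate is a pure
    function of (seed, coordinate), so any re-evaluation of the view
    reproduces the same data.  A counter-based generator (e.g. a
    Philox-style scheme) would be the production-grade analogue."""
    key = struct.pack(f"<q{len(coord)}q", seed, *coord)
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "little") / 2**64

# The same (seed, coordinate) always yields the same value, regardless of
# which worker evaluates the view or in what order cells are visited.
assert seeded_uniform(42, (3, 7)) == seeded_uniform(42, (3, 7))
assert seeded_uniform(42, (3, 7)) != seeded_uniform(43, (3, 7))
```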

It may prove simpler in practice to model random generators as block expressions than as Selector generators.