This post develops part of this document:

Sharding , and , over the in dimension

In the previous posts on Index Projection Functions and Sharding Linear Operations over Weight Out Dimensions, we developed affine projection sharding over the and dimensions of a tensor-valued operation, assuming dimensions: , , , :

To examine sharding over the dimension, we’ll need to focus on the nature of the matrix multiplication operation, and discuss and operations.

What’s important here is that, while is linearly shardable in its and dimensions, it contains an implicit reduce sum reduction operation in its dimension.

📝 Note: careful readers may note that there exists a large body of work dedicated to the question of how to implement more efficiently. The point of this exercise is to use and as a lens to examine data covariance in sharding block operations; and a naive treatment of is useful to these needs.
In a fully developed tensor expression sharding environment, it could be useful to hoist some operations, such as , to a level at which the compiler is directly aware of them, and could more aggressively use the existing research in those spaces; but doing so is not necessary to develop the foundations of such an environment.

Returning to , we can rewrite as a composition of and :

Applying this re-write would restructure our expression graph from this:

To this:

A block operation sharding solution for on should translate to a solution for on .

We can decompose by distinguishing between the matrix multiplication operator () and the cell-wise product operation (); and generate an intermediate product with shape .

To do this, we need to extend and broadcast and to the combined shape , to produce an intermediate result :

And we need to introduce a new operator which sums along and removes one dim of .

We can now define in terms of this intermediate result, and

This decomposition yields the following expression graph:

In this decomposition, is a well-behaved block operation; but is different: it is not a block operation as we’ve represented them before, but a reduction operation.
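To make the decomposition concrete, here is a minimal PyTorch sketch of it (the tensor names and sizes are chosen for illustration; this is not the Tapestry machinery itself):

import torch

batch, in_dim, out_dim = 10, 2, 3
x = torch.rand((batch, in_dim))
w = torch.rand((in_dim, out_dim))

# The broadcast cell-wise product produces the intermediate [batch, in, out] result.
v = x.unsqueeze(-1) * w.unsqueeze(0)

# Reducing the intermediate along the "in" dimension completes the matrix multiplication.
y = v.sum(dim=1)

assert torch.allclose(y, x @ w)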

Sharding

Consider ; a simple cell-wise multiplication. We expect the output to have the same shape and dimensions as the input:

To achieve this in tensor operations over inputs whose shapes are not initially the same, but can be manipulated to be the same, it’s common to use broadcasting: any dimension which is size 1 for one input, but not for another, is treated as though it were broadcast, or spread, to cover the size of the other:

\begin{eqnarray*}
Prod(A_{[1,n,o]}, B_{[m,1,o]})_{[m,n,o]} :=
\left( \begin{matrix}
  (a_{1,n,o} \cdot b_{m,1,o}) & \cdots \\
  \cdots & \cdots
\end{matrix} \right)
\end{eqnarray*}

It is also common in tensor operations to perform various permutations, transpositions, and reversals to achieve appropriate alignment for broadcasting operations; all tensor libraries have a host of features, some more convenient than others.

>>> import torch
>>> batch = 10
>>> input = 2
>>> output = 3

>>> x = torch.rand((batch, input))
>>> x.shape
torch.Size([10, 2])
>>> x.unsqueeze(-1).shape
torch.Size([10, 2, 1])

>>> w = torch.rand((input, output))
>>> w.shape
torch.Size([2, 3])
>>> w.unsqueeze(0).shape
torch.Size([1, 2, 3])

>>> (x.unsqueeze(-1) * w.unsqueeze(0)).shape
torch.Size([10, 2, 3])

Index projection functions permit working directly in the dimensions of the input and output tensors, provided there is enough space in the dimensionality of the index space to count all points in the block; so we can directly describe the above operation with a simple index space that covers the full shape of the output.
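As a rough sketch of what such an index space looks like for the broadcast product above (the coordinate layout is an assumption for illustration, not the formalism itself): each point (b, i, o) in the index space reads one cell of x and one cell of w, and writes one cell of the intermediate result.

# Hypothetical per-point projections for the broadcast product above.
def prod_projections(b: int, i: int, o: int) -> dict:
    return {
        "x": (b, i),      # x is indexed by (batch, in); "out" is broadcast
        "w": (i, o),      # w is indexed by (in, out); "batch" is broadcast
        "v": (b, i, o),   # the intermediate result covers the full index space
    }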

📝 Note: careful readers may note that this involves the same input data being read by multiple output cells.

Reduction Operations

Reduction operations require sharing information between cells; on their face, they don’t appear shardable. Consider the index projections for a operation over two dimensions:

, as a block operation, cannot be sharded along the dimension.

Additional information about , and about rewrites to which are semantics-preserving, beyond what can be expressed about Block Operations, would permit us to break it apart.

In modeling tensor expression graphs, we’re interested in recurrent classes of operations; a solution specific to might be useful, but a larger class of answers would hold more value.

Suppose we notice that the summation reduction follows the monoid laws (it is associative and commutative), such that we can re-order and regroup it as we see fit:

Any operation with this property, no matter what the implementation is doing, permits us to mechanically rewrite evaluation order.

If we can attest that is a reduction operation along the reduction dimension; then we know we can split the operation into intermediate results.

Suppose we introduced a index dimension, to model partial reductions over blocks of the reduction dimension, producing an intermediate result with an additional dimension; and then applied a second stage to complete the reduction:

When an operation is known to be a monoidal reduction along a given dimension of the input, a broad family of equivalent rewrite schedules become possible; but it complicates representation of the index space, as is no longer a simple countable dimension.
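A minimal PyTorch sketch of that two-stage reduction (the block count and sizes here are arbitrary, for illustration):

import torch

x = torch.rand((10, 8))
blocks = 4

# Stage 1: partial reductions over blocks of the reduction dimension,
# producing an intermediate result with an extra "block" dimension.
partial = x.reshape(10, blocks, -1).sum(dim=-1)

# Stage 2: reduce the block dimension to complete the reduction.
total = partial.sum(dim=-1)

# Associativity and commutativity let us regroup without changing the result.
assert torch.allclose(total, x.sum(dim=-1))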

Rewriting

Returning to the definition of ,

We can now construct from the combination of a block operation and a reduce operation:

Sharding over

Putting this together with the definition of ,

We can now express as a form of high-level reduction operation, over the , , and dimensions:

When sharding is desired over the dimension, expands to the following sub-graph of , , and operations:

And when sharding is desired over the dimension, expands to a graph over the one-step operation, which behaves the way our previous description of behaved:

Being able to express this re-write option, when the dimension is not sharded, will require us to develop high-order meta-operator representation above the index projection function formalism.
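As a toy PyTorch illustration of the in-dimension sharding (a sketch of the arithmetic only, not the graph rewrite machinery): split the input and weights along the in dimension, compute partial products on each shard, and reduce the partials before applying the bias once.

import torch

batch, in_dim, out_dim = 10, 4, 3
x = torch.rand((batch, in_dim))
w = torch.rand((in_dim, out_dim))
b = torch.rand((out_dim,))

# Shard x and w along the "in" dimension.
x0, x1 = x[:, :2], x[:, 2:]
w0, w1 = w[:2, :], w[2:, :]

# Each shard computes a partial product; the partials are reduced,
# and the bias is applied once after the reduction.
y_sharded = (x0 @ w0) + (x1 @ w1) + b

assert torch.allclose(y_sharded, x @ w + b)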

Next

The full decomposition of provides a pathway to sharding potentially large operations, at the cost of decomposing operations which can be represented by highly space and time efficient kernel implementations when they are not decomposed.

Were we able to select between this decomposition, when was large enough to require sharding, and the block representation of , when fit within our execution boundaries, we’d have a flexible mechanism to handle both large and small cases.

Decorating operators with re-write production rules will be developed in future work in this series.

There’s a great deal of conversation at the moment about AI Art Models; and there’s a fair amount of confusion and complex feelings about them. In online discussions, some friends have asked me to take a stab at describing what they are, for people outside the ML community.

This post aims to explain some of what’s going on from a very high view; outside the math, to give people more grounding in the ideas as to what’s happening right now.

De-Noising Images

All modern AI Art Models are built on a concept called “de-noising”, literally, “removing noise”. Denoising models have existed in various forms for a long time, but deep learning has made them powerful enough to be useful for generating new data.

Let’s define some basic ideas; this is a picture of my cat, Eliza:

If I add some random noise to the picture, I might get this:

Seen side-by-side, it’s easy to see the noise:

The noise I added to mess up the image was random; but if I already had the messed up image, and wanted to ask what the difference between the noisy image and the true image was, I’d see something like this, which is clearly not random:

The core idea of a denoising algorithm is, given some data, remove the data that looks like it’s noise, leaving the data that looks consistent with what we expect things like this to look like.

So, given a picture like this:

Can I apply some magic Enhance button to clean up the image, and get a picture like this:

Or side-by-side, can I go from the left, to the right:

Simple, right? Just fix all the bits that don’t look like a cat.

Magic Enhancement Buttons

A magic Enhance Button has long been a trope in sci-fi police procedural dramas. It’s common to see a low resolution image, taken in bad lighting, reflected off of a surface, “cleaned up” by the data techs or AI to recover details (“the killer’s face!”, “the license plate!”) that were difficult or impossible to see in the original image.

And while some information is always present which isn’t easy for the human eye to distinguish, and some amount of information recovery is possible, there are limits: we can only recover information which is there.

This is such a common and silly trope that it has become a point of pride among some media- and tech-savvy fans to point it out when it occurs in fiction.

But … what if we didn’t care if the recovered information was accurate?

What if we only cared if the recovered information were plausible?

Enhance My Cat

If I wished to build software which could take a noisy image of my cat, Eliza, and clean it up, I could now do that.

I’d start by taking a lot of pictures of my cat, on the same carpet, and messing them up to various degrees:

And then I’d take the steps in this degeneration, say Step K and Step K+1, and I’d train a model to go from Step K+1 to Step K.

Teaching it to remove one step of noise from pictures of my cat:

If I trained this with just one picture, I could build a model that would be very good at starting with random noise, and when applied iteratively, would make it look more and more like a picture of my cat:

And if I trained a model on many pictures of my cat against the same background, I could start with random noise and iteratively apply that algorithm to get an image which was plausibly like a picture of my cat against the same background.

I’m leaving a lot out, like How?

This is a big field of research, there’s a lot of math and statistics involved in working out how to imagine pictures of my cat.

But what we’ve described so far is only good if every picture has the same kind of content, it’s limited because we’ve trained one statistical model to remove one kind of noise:

  • Noise that doesn’t look like Eliza on a Carpet.

Denoising Lots of Things

If I want to be able to enhance / denoise pictures of lots of things other than pictures of my cat, Eliza; the simplest way would be to just train a lot of different models.

Let’s name the current model:

  • “Maine coon cat laying on a white and black salt and pepper rug.”

Now, my model of my cat is certainly bigger than one picture, but it’s probably a lot smaller than all the pictures I used to train it; certainly a lot smaller if I’ve used good math to do the job.

If I want to be able to denoise other things, I could train other models, say:

  • “Maine coon cat sitting on a white and black salt and pepper rug.”
  • “Maine coon cat laying on wooden floor.”
  • “Maine coon cat sitting on a wooden floor.”
  • “Planes on a runway.”
  • Etc.

For a good few years, this is what people did. They built very specialized denoising models targeting exactly what they were working with. The models did a good job at enhancing those things with plausible hallucinated details; but failed spectacularly when unexpected details were mixed in.

A number of commercial image editing and enhancement packages had, internally, large libraries of models they could use to fix up images; and they’d first run other models to tell them what they were looking at (“that’s a cat”) before selecting a model from the library to apply.

But the size of these models, and the cost of training them, while individually small, are very large in aggregate. We can’t effectively create a separate model for every kind of image we ever might want to denoise, enhance, or just make up.

So we need to find a way to compress them together.

It would be nice if we could take a description of the kind of image we’re looking at, and turn it into some kind of key so that models can share information.

It would be nice if these two models could share their cat-related information:

  • “Maine coon cat laying on a white and black salt and pepper rug.”
  • “Maine coon cat laying on wooden floor.”

But how do we translate text into a key that models can use?

Description to Key

In 2013, along came something called word2vec.

word2vec is a form of text-vectorization. It turned words (like “cat”) into a vector of numbers that represented a point in a semantic embedding space.

What was interesting about this was that the embedding space was learned in such a way that you could do rough “semantic math” on the results:

  • “Queen” - “Female” + “Male” =~ “King”

And that rough equivalency started to enable text manipulation (Natural Language Processing) models to get pretty powerful.

Later, mechanisms to fuse series of words into vectors, which maintained some notion of their original ordering, were invented.

So, today, we can perform embedding math similar to:

  • “Maine coon cat laying on a white and black salt and pepper rug.”
  • - (offset:7)“white and black salt and pepper rug”
  • + (offset:7)“wooden floor”
  • =~ “Maine coon cat laying on wooden floor.”

Overlapping Diffusion Models

Now we can train a bigger model, much bigger than we would have trained for just one kind of thing:

  • “Maine coon cat laying on a white and black salt and pepper rug.”

But trained on many, many kinds of things. But in addition to training the model on the noisy image, and the slightly-less noisy image; we’ll also show it the embedded description context, as a key.


"Maine coon cat laying on a white and black salt and pepper rug."

"Maine coon cat laying on a white and black salt and pepper rug."

"Maine coon cat laying on a wooden floor."

The embedded key lets the model decide which parts of the model to use to generate the image; and so parts which are shared according to the key between different images will be trained together and shared in the model.

Making More Data

Our models do better when we train them on more data; but we only have so much data, so … we make some up.

There are a lot of entirely mechanical things we can do which (probably) don’t change the original description.

A really common form of data-augmentation is flipping the original image:


"Maine coon cat laying on a white and black salt and pepper rug."

"Maine coon cat laying on a white and black salt and pepper rug."

"Maine coon cat laying on a wooden floor."

"Maine coon cat laying on a wooden floor."

But if we’ve got access to image processing tools, we can augment the data and the description:


"Maine coon cat laying on a white and black salt and pepper rug. Cubism."

"Maine coon cat laying on a white and black salt and pepper rug. Oil painting."

Generating New Images

We can start by generating some random noise:

And giving the model a prompt:

  • “Maine coon cat laying on a carpet.”

Starting from random noise, the denoiser picks local features that are kinda plausible from the prompt, and begins moving the image towards something that has some of those features.

As parts of the image come into focus, they are forced to make sense relative to their neighbors, and the features refine towards more plausible results.


"Maine coon cat laying on a carpet."
... many steps later ...

"Maine coon cat laying on a carpet."
... many steps later ...

"Maine coon cat laying on a carpet."

Novel Combinations

Once we’ve got this big overlapping model, keyed on embedded descriptions; we can start asking it for things we don’t have pictures of:

  • “Maine coon cat laying on a runway on Mars.”

The question of “Does this Work?” is tied up in the math of how good a job we did on the math for the text embeddings; and how good a job we did on the math of how the model uses those embeddings to decide which parts of itself to use for a given problem.

It’s all about sharing the appropriate things; and is a very active field of research. The year-over-year improvements in the basic math are, at this point, astounding; and we currently have no reason to believe that we’ve more than scratched the surface of the math.

  • Everyone expects this math to get MUCH better, quickly.

How does this give us Art?

All AI Art Models are using variations of this technology; and they’re trained upon art, and generated and augmented images.

Many of them are also crowd-sourcing the up and down votes on the new images they produce, and using them to further augment and train the dataset for the models they train.

The more images a model has been trained with, and the better the descriptions of those images, the better the results will tend to be.

There are lots of mathy details, and lots of active research; but that’s the gist.

Is this a copy of the training data?

The very, very short answer is: No.

Examining the data for an open source AI Art Model, Stable Diffusion (so we can look at the sizes of things):

Some numbers there:

  • trained on 2.3 billion 256x256 3-color 24 bit images
  • trained on ~= 450 terabytes
  • model size: 4.2 gigabytes
  • ~= 100,000 / 1 ratio of training data to model size
  • ~= 2 bytes of model data for every training image
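
As a rough sanity check on those figures (approximate arithmetic, using the numbers above):

  • 2.3 billion images × (256 × 256 pixels × 3 bytes) ≈ 450 terabytes of training data
  • 4.2 gigabytes ÷ 2.3 billion images ≈ 1.8 bytes of model per training image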

And this is ignoring the size of the text prompts in the training; and their representation in the model’s embedding spaces.

Technically, mathematically, the model contains at most 2 bytes of information for every training image. Were we to claim it had copied the training data, we’d need to argue that we can compress a 256x256 image down to less than one pixel’s worth of data, just 16 bits; so it seems impossible to argue that the model has copied the data. What the model has done is learned how to describe things like each input image, in terms of and relative to other images it has seen, conditioned on the embedding contexts it was given.

It’s pretty compelling to me to argue that the training data isn’t in the model; the models have learned what high level concepts imply; concepts like:

  • Cat
  • Oil Painting
  • Picasso

Now, some of those concepts are tightly coupled with the style and work of current artists and companies with a stake in IP law and financial survival, which raise other interesting and important questions:

  • is it ok to learn from IP controlled information, even without copying it?
    • what are the limits on this?
    • is this covered by existing law and contracts?
    • do we need new law and contracts?
  • do we want to extend IP to cover style and inspiration?
    • can we do this in a way which doesn’t hurt artists more than it helps them?

Is this just copying by obfuscation?

I want to call out a point here. The above argument (2 bytes per training example can’t be copying) is extremely compelling to me; and I’ve found it to be compelling to others with a formal background in information theory; we see it as wildly too small to be anything like a compression or copy of the source data.

But it isn’t a compelling argument to many people without that background. The feedback I’ve received is that they’re concerned this is a form of shell game, copying by obfuscation. They’re moving the math around to hide the trick, is the argument.

And I don’t know how to counter that argument, without saying “learn the math”; but it feels like a real crux of one of the issues here.

In discussions with people, when it feels like some progress is being made on the question of whether this is copying in some sense, an even more interesting objection arises, which I’ll paraphrase here:

The argument against robotic fair-use:
While this might not be copying, I object to calling this “learning” and I’m unhappy with the idea that this is ok.
Humans have a fair-use right to learn from material, that grants them moral rights to observe and consider things which supersede the rights of the creators of art to control the use of their material. The right to observe is protected above other rights.
Calling this “learning” implies that right; but the computer isn’t a person, and so can’t have that moral right. The information derived from observing art, even if it conclusively isn’t a copy, is still violating rights of the author of the training examples which take precedence because they are not being superseded by fair-use rights to observe.
Humans should have a legal right to look at things without compensation that robots should not have.

I believe that there is a deep cultural divide on this argument, and it’s worth clarifying what the actual disagreement is when moving forward with AI training. I think some people accept this strongly as deeply and morally true, and some people reject it strongly as deeply and morally untrue; and bringing this question forward is more clarifying of that disagreement than an argument about the information content of a model.

Is it legal to train on copyrighted art?

This seems like a simple question, but we can probably break it down into a whole family:

  • Should it be legal to train on copyrighted art?
  • Is it legal to train on copyrighted art in jurisdiction X?
    • If no, is it legal to look at copyrighted art, as an artist?
    • What is the difference?
  • Should being an artist be a financially protected job?

But we can kick this back further:

  • When AI is taught to do a job, what do we owe to the people who are losing work to that AI?

I, and many other people, have strong opinions about these issues, but this isn’t a post about the law or the ethics of applying AI to existing industries.

Expect a lot of lawsuits and new legislation in this space for the rest of ever; this question isn’t going away.

This post develops part of this document:

The Tapestry Plan

Overview

I’m developing a project to define the bottom-up sharding and scheduling of grid-scale tensor expression languages; its name is “Tapestry”, for the way expression value flow graphs weave between execution contexts.

I am of the opinion that this is a project which requires no new computer science; just the careful and methodical application of pieces from a number of sub-fields.

As there are many projects exploring how to take existing evaluation environments and back-fit sharding machinery onto them, and as those projects are continuing to make reasonable progress, I feel that there’s no short-term urgency to solve this; so I’m taking the pure-language design route.

  • We don’t have users, and won’t have them till the whole stack works. We won’t have to worry about maintaining semantics or operational decisions when problems are encountered with them.

  • We will have some trouble acquiring people to help; everything is going to appear very abstract until the functional machinery is in-place.

Stages

Tapestry will be built out in the following stages of work, which correspond to a series of technical embeddings going deeper into the stack, and will remain as rewrite layers.

  • Tensor Valued Block Operation Graph Language
  • Block Operation Index Projection Sharding Graph Language
  • Block Operation Substitution Rewrite Graph Language
  • Block Operation Fusion Rewrite Graph Language
  • Block Operation Rewrite Sharding Optimizer
    • Configurable Execution Cost Model
    • Pareto-Front Stochastic Search Optimizer
  • Block Shard Operational Embedding
  • Block Shard Grid Host
    • Shard Scheduling
    • Shard Metric Instrumentation

When this stack is semantically complete, even in a preview form, we can begin to preview applications written in the block operation graph language.

From this stage onward, development will bifurcate:

  • Applications and extensions written above the block operation language; and
  • Optimization research and operational implementation work done below the block operation language.

The goal is:

Provide an existence proof that a provably shardable formal language is possible (we can prove that it can be made fast); then make it easy to program for, to attract more help; then make it fast.

This post develops part of this document:

Sharding over the out dimension

In the previous post on Index Projection Functions, we developed affine projections for the dimension of a tensor-valued operation, assuming dimensions: , , , :

We’ll now consider the projection functions , , and ; and how we’ll handle batching over out dimensions:

The values of in the out dimension are independent of each other; each out value is computed using one column of and one value in ; and as a result the op can be cleanly and trivially sharded by chunking and :

By extending the space to index the dimension, we can express the index functions , , and coordinates in terms of the indexed coordinate, and the shapes in terms of the out dimension size.

We also cleanly get the property that coherent ranges in the index space correspond to coherent tensor ranges in the mapped coordinate space:
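A minimal PyTorch sketch of the out-dimension sharding described above (the shapes and split point are arbitrary, for illustration): chunk the weights and bias along the out dimension, compute each shard independently, and concatenate the results.

import torch

batch, in_dim, out_dim = 10, 4, 6
x = torch.rand((batch, in_dim))
w = torch.rand((in_dim, out_dim))
b = torch.rand((out_dim,))

# Shard w and b along the "out" dimension; x is shared by both shards.
y0 = x @ w[:, :3] + b[:3]
y1 = x @ w[:, 3:] + b[3:]

# Re-assembling the shards along "out" reproduces the unsharded result.
assert torch.allclose(torch.cat([y0, y1], dim=-1), x @ w + b)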

Sharding over the in dimension

Sharding over the dimension is more complex, as it requires sharding a reduce operation, which breaks our current block model. As a preview for a future post, we can see that this can be rewritten as a reduction:

This post develops part of this document:

This post explores Index Projection Functions; as a pathway to developing a tensor expression evaluation environment:

Restricting to Shardable Operators

Suppose we’ve got a toy tensor expression language :

X, W, b, Z: Tensor
Z = Linear(X, W, b)
Y = ReLU(Z)

And we’re interested in mechanical sharding optimizations of the resultant expression graph:

Shardable Operators

Let be a block-operation, taking tensor-valued inputs, and producing tensor-valued outputs.

As discussed in the previous post, we’re attempting to find a family of such that, for any given , we’ll have additional information:

  • Given the shapes of the parameters, what are the expected shapes of the results?
  • Given the shapes of the parameters, what independent shards are possible which can be fused back into the same results?
  • How do the shards share resources (which sharding choices are more or less expensive)?

Consider the abstract one- flow graph:

We’re interested in families of such that we can shard operations mechanically, and re-assemble the results mechanically, and produce the same value as though the operation had been done in one pass.

Operator Index Counting

Crucially, the goal is to be able to shard:

  • With a strong ability to predict execution costs before evaluation; and
  • Without examining anything about the implementation of .

This can be reframed as a counting problem:

  • Can we enumerate all simple sub-problems of a given call to ?

To make this concrete, let’s reconsider from above. If we add an space to count all sub-problems of :

  • What is the shape of ?
    • How many dimensions does have?
    • What are their sizes?
  • What relationship does the shape of have to the inputs (, , ) and outputs ()?
  • What portions of the inputs and outputs are associated with each point in ?

Given a block , and knowledge about the structural co-variance of its inputs and outputs, we seek an index space, and a collection of projection functions for each input or output , such that we can mechanically enumerate sub-problems and re-assemble the results.

It is important to state that the top-down approach (starting with an , find sharding) is a potentially intractable problem; while the bottom-up approach (starting with sharding, define s) is solvable by construction (but limited to findable constructions):

  • Top-Down: Given this , can I find projection functions ?
  • Bottom-Up: Given a menagerie of known projection functions , what can I construct?

Affine Projection Functions

One design approach for solving the projection design problem is the use of coordinate space (integer, ) affine transforms (linear projections) from the index space to the tensor spaces.

Affine projection functions are an approach I explored in depth working at 3Scan, and an approach that’s also been incorporated into the MLIR project’s Polyhedral Types.

What components make up an affine projection function?:

  • an affine expression mapping points in space to starts in the coordinate space of input/output tensors;
  • a fixed defining the shape of the region selected relative to the mapped point.

The simplest representation of this is a simple affine transform + a shape:

Are affine expressions the right or best solution to the design of projection functions? We don’t know; affine expressions can only be compared to other proposals, not all possible families of functions; there may be better ideas yet to be surfaced. We do know that affine expressions make some common patterns easy to express and to compute the shards of; and make some performance critical patterns tractable to express and compute the shards of.

Affine projection functions have an important non-obvious property: it is generally tractable to arrange them such that coherent range blocks in the index space map to coherent space blocks in the input or output tensors. This property falls out of the fact that affine projection functions have constant marginal delta strides (the incremental change resulting from changing an input by one step is constant). Coherent input/output blocks dramatically simplify processing expectations, particularly in the face of shared input (as with convolution operations).
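As a minimal sketch of that representation (the names and layout here are illustrative assumptions, not the Tapestry implementation): a projection is an integer matrix, an offset, and a fixed slice shape; a point in the index space maps to the start of a slice of that shape.

from dataclasses import dataclass

import numpy as np

@dataclass
class AffineProjection:
    # start = matrix @ index_point + offset; the selected region is
    # the half-open slice [start, start + shape).
    matrix: np.ndarray   # [tensor_ndim, index_ndim] integer matrix
    offset: np.ndarray   # [tensor_ndim] integer offset
    shape: np.ndarray    # fixed shape of the selected region

    def project(self, point: np.ndarray):
        start = self.matrix @ point + self.offset
        return start, start + self.shape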

As with many matrix transform operations, the basic definitions are simple; but some of the implications can be complex to unpack. We’ll explore a few here.

Linear Strides Over a Batch Dimension

Consider 's input tensor; let’s assume a 2D shape .

We’d like to be able to describe an affine projection such that we can describe the following shards:

A very simple linear projection is sufficient to describe the mapping from a point in index space to a batch row of the input .

We also cleanly get the property that coherent ranges in the index space correspond to coherent tensor ranges in the mapped coordinate space:
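Reusing the AffineProjection sketch from above, a hypothetical projection for the batch rows of the input might look like this: the batch coordinate of an index point strides down the rows of the input, the other index coordinates are ignored, and the fixed shape covers a full row.

import numpy as np

in_dim = 4

x_projection = AffineProjection(
    matrix=np.array([[1, 0],
                     [0, 0]]),
    offset=np.array([0, 0]),
    shape=np.array([1, in_dim]),
)

start, end = x_projection.project(np.array([3, 2]))
# start == [3, 0], end == [4, 4]: this index point selects the full row X[3, :],
# and adjacent index points select adjacent, coherent row ranges.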

I’ll continue developing this theme in future posts. More can be read in the tapestry work:

This post develops part of this document:

The Distributed Tensor Expression Problem

The distributed tensor expression “problem”:

  • Given a tensor expression ( ), where the tensors may be arbitrarily large, how do we efficiently schedule the expression over large numbers of GPUs?

Much of the existing work in this space has focused upon scaling programs written in existing tensor expression languages (pytorch, tensorflow, numpy); most of which were modeled upon the stats language R; and none of which were built to permit the ready calculation of operation sharding, or graph optimization.

It’s understandable why the focus has been on extending the semantics and scalability of the languages that so much of the existing AI application stacks have been written in; incremental improvements have direct impact on the ability to train and deploy existing applications.

However, quite a few pieces of the current system pose problems for these smart compilers:

  • the existing APIs have many entry points;
  • the entry points don’t all follow consistent semantics;
  • the apis were not written to enforce a stable co-variance between parameters and results;
  • the tensor APIs are data/shape polymorphic;
  • and python itself is obnoxious to trace symbolically.

If, as an exercise, we drop any notion of compatibility with existing numpy-derived apis, I’m interested in the question: how far can we get?

Expanding a Toy Example

The process of designing new evaluation environments is the process of searching over spaces of functor embeddings to attempt to fit the formal semantics we desire to the operational requirements we’d like to satisfy in evaluation.

Consider a tensor expression in a toy language, call it :

X, W, b, Z: Tensor
Z = Linear(X, W, b)
Y = ReLU(Z)

At this point there are no formal semantics for ; we’re searching design space for formal semantics such that:

  1. Common operations in AI can be represented in the semantics;
  2. can be sharded to a distributed GPU fabric using existing optimization theory.

If we were attempting to shard python+numpy, or python+pytorch, or any number of other existing problem spaces, we’d be forced to find an embedding which permitted hosting the entire semantic surface of those environments.

But since we’ve decided to drop that requirement, we can break the semantics; since is only a sketch towards a language, we can explore restrictions to which simplify embedding.

Consider one functional dependency interpretation of our toy example:

Taking motivation from the toy example; we’d like to be able to shard the node. The operation is intended as a stand-in for the fully-connected linear layer operation from neural networks:

By examining the implementation of , and assuming that has shape , we can show that the operation can be cleanly sharded along any batch dimensions of the input :

By exploiting our knowledge of the implementation of :

We know that we can also re-write expressions upon the batch dimensions:

And finally, seeing and do not escape, we can fuse and into the combined operation, and collapse the shards:

This series of transformations is possible because we know (or assume) details about the structural co-variance of the inputs and outputs to the operations and .
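A toy PyTorch check of the equivalences these rewrites rely on (a sketch of the arithmetic, not the rewrite machinery): sharding the linear layer over the batch dimension, and fusing it with ReLU inside each shard, both reproduce the unsharded result.

import torch

x = torch.rand((10, 4))
w = torch.rand((4, 3))
b = torch.rand((3,))

def linear(x, w, b):
    return x @ w + b

# Unsharded: Y = ReLU(Linear(X, W, b)).
y = torch.relu(linear(x, w, b))

# Sharded over the batch dimension, with Linear and ReLU fused in each shard.
y_sharded = torch.cat(
    [torch.relu(linear(shard, w, b)) for shard in x.split(5, dim=0)],
    dim=0,
)

assert torch.allclose(y, y_sharded)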

Restricting Operators to Known Structural Co-Variance

We cannot assume that any arbitrary operation from a collection of named tensors (the parameters) to a collection of named tensors (the results) will have cleanly explicable structural co-variance (the relationship between the data in the input cells and the data in the output cells); but we can observe that the tractability and explicability of the structural co-variance of operators bears directly upon our ability to design mechanical sharding and graph-rewrite algorithms over expression graphs.

  • If we take as a design requirement the ability to make intelligent sharding choices about operators, and to be able to chain the results of those choices through subsequent layers of the graph, then we can reframe the semantics problem of our toy language as searching for a family of operators with this property.

For any given , we need additional information:

  • Given the shapes of the parameters, what are the expected shapes of the results?
  • Given the shapes of the parameters, what independent shards are possible which can be fused back into the same results?
  • How do the shards share resources (which sharding choices are more or less expensive)?

But we also need to ensure that connective expression language between operators has the same properties.

This is an active field of research for me; I believe that index projection functions are a viable solution to this, and I’ve done a fair amount of background work on large transform environments.

Continue reading:

Suppose you wish to define a proper type-forwarding decorator in python, which supports both the common default call pattern; and the argument override call pattern:

@foo
def some_method(x: int) -> float:
    ...

@foo(a="xyz")
def some_other_method(x: int) -> float:
    ...

The mechanics of this at call time are relatively straightforward in python:

def foo(
    fn = None,
    *,
    a = "xyz",
):
    def decorator(fn):
        setattr(fn, "__foo__", a)
        return fn

    if fn is None:
        return decorator

    else:
        return decorator(fn)

@foo
def some_method(x: int) -> float:
    ...

@foo
def some_other_method(x: int) -> float:
    ...

We can ask mypy the type of the resultant decorated method:

$ mypy -c 'import simple; reveal_type(simple.some_method)'
<string>:1: note: Revealed type is "Any"

But establishing appropriate types, such that the types of the decorated method are well-formed, is a challenge which requires use of TypeVar and the @overload mechanic, and a fair amount of boilerplate:

from typing import (
    Callable,
    Optional,
    overload,
    TypeVar,
    Union,
)

C = TypeVar("C", bound=Callable)

@overload
def foo(fn: C) -> C:
    ...

@overload
def foo(
    *,
    a: Optional[str] = "xyz",
) -> Callable[[C], C]:
    ...

def foo(
    fn: Optional[C] = None,
    *,
    a: Optional[str] = "xyz",
) -> Union[Callable[[C], C], C]:
    def decorator(fn: C) -> C:
        setattr(fn, "__foo__", a)
        return fn

    if fn is None:
        return decorator

    else:
        return decorator(fn)


@foo
def foo_example(x: int, *, y: int) -> float:
    return float(x * y)

We can ask mypy the type of the decorated value:

$ mypy -c 'import example; reveal_type(example.foo_example)'
<string>:1: note: Revealed type is "def (x: builtins.int, *, y: builtins.int) -> builtins.float"

Of note is that most of the actual core of this is very simple; suppose we could say the following:

@typed_decorator
def foo(fn: C, *, a: str = "xyz") -> C:
    setattr(fn, '__foo__', a)
    return fn

What remains an open question to me, and I’ve tried many approaches, is whether it is possible to define @typed_decorator in mypy’s current semantics.

My name is Crutcher Dunnavant, and I’ve been working in the big-computers industry for some time.

Though I’ve used and worked with Open Source software for most of my career, and contributed to some large projects, most of my work has been behind the wall of large companies; with limited visibility outside those spaces.

I’ve generally worked on and in deep compute stacks, building petabyte scale systems on deeply nested toolchains, and working at every level of those toolchains simultaneously to move the needle forward.

A challenge I’ve had in getting projects out into the wider world is that in many cases, the tools to build the tools to build the tools I’ve worked in and focused on are only partially extant outside of shops like Google; and the getting-started costs of establishing presence and reputation to begin to move the needles on issues important to me have been high.

So my plan is to begin producing small library projects, things useful enough to use in other projects; mostly focusing on AI/ML development cycle problems; and fleshing out structural utilities in the AI spaces.
