
Auxiliary Networks

Empowering Injective Networks

Stanford University

We now introduce Auxiliary Networks: additions to the basic Injective Network architecture that enhance its representation power.

# Basic imports
import torch
from torch import nn
import geosimilarity as gs
from NIGnets import NIGnet

from assets.utils import automate_training, plot_curves

Pre-Auxiliary Networks

Closed Condition: A Closer Look

Let’s first try fitting an Injective Network to a square using the PReLU activation. As discussed before, PReLU does not guarantee non-self-intersection, but we use it here to gain insight.

from assets.shapes import square

# Generate target curve points
num_pts = 1000
t = torch.linspace(0, 1, num_pts).reshape(-1, 1)
Xt_square = square(num_pts)

square_net = NIGnet(layer_count = 3, act_fn = nn.PReLU)
automate_training(
    model = square_net, loss_fn = gs.MSELoss(), X_train = t, Y_train = Xt_square,
    learning_rate = 0.1, epochs = 1000, print_cost_every = 200
)

Xp_square = square_net(t)
plot_curves(Xp_square, Xt_square)
Epoch: [   1/1000]. Loss:    1.052965
Epoch: [ 200/1000]. Loss:    0.006894
Epoch: [ 400/1000]. Loss:    0.006118
Epoch: [ 600/1000]. Loss:    0.005084
Epoch: [ 800/1000]. Loss:    0.003844
Epoch: [1000/1000]. Loss:    0.003531
[Plot: predicted curve vs. target square]

We observe that the fit is really bad.

This is because the network first transforms the interval $t \in [0, 1]$ to a circle, and the deeper layers then try to transform that circle into a square. This is not an easy task, as the network has to map the circular arcs onto the four straight edges.

Let’s have a look at the first transformation as discussed in the section Condition 2 - Closed Curves:

$$C(t) = \begin{bmatrix} \cos(2\pi t)\\ \sin(2\pi t) \end{bmatrix}$$

The point $[\cos(2\pi t), \sin(2\pi t)]$ lies on the unit circle centered at the origin. What is happening here is that the interval $t \in [0, 1]$ is transformed to a circle, which ensures that the first and last points coincide; when fed into the network, they therefore lead to the same output point and hence produce closed curves. This is explained visually in Figure 4.

Figure 4: Transforming the line segment $[0, 1]$ to the unit circle centered at the origin.
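To see the closed condition numerically, here is a minimal sketch (the helper `circle_from_t` below is illustrative and not from `assets.shapes`) checking that $t = 0$ and $t = 1$ map to the same point on the unit circle:

def circle_from_t(t: torch.Tensor) -> torch.Tensor:
    # Illustrative sketch (not part of NIGnets): map t in [0, 1] onto the unit circle
    theta = 2 * torch.pi * t.reshape(-1)
    return torch.stack([torch.cos(theta), torch.sin(theta)], dim = 1)

t_ends = torch.tensor([[0.0], [1.0]])
print(circle_from_t(t_ends))  # both rows are approximately [1, 0]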

We could also have achieved the closed condition by transforming to a square instead of a circle. For the above problem of fitting a square this should help immensely, as the Injective Network then only has to learn the identity mapping! This is easy to do when using the PReLU activation.

Code for transforming $[0, 1]$ to the unit square (shapes.py):
def square_from_t(t: torch.Tensor) -> torch.Tensor:
    # Generate theta values corresponding to t
    theta = 2 * torch.pi * t.reshape(-1)
    
    x, y = torch.cos(theta), torch.sin(theta)
    s = torch.maximum(torch.abs(x), torch.abs(y))
    x_sq, y_sq = x/s, y/s

    X = torch.stack([x_sq, y_sq], dim = 1)
    return X
from assets.shapes import square_from_t

class InjectiveNet_SquareClosed(nn.Module):
    def __init__(self, layer_count, act_fn):
        super().__init__()

        # Transform from t on the [0, 1] interval to unit square for closed shapes
        self.closed_transform = square_from_t

        layers = []
        for i in range(layer_count):
            layers.append(nn.Linear(2, 2))
            layers.append(act_fn())
        
        self.linear_act_stack = nn.Sequential(*layers)
    
    def forward(self, t):
        x = self.closed_transform(t)
        x = self.linear_act_stack(x)
        return x
# Generate target curve points
num_pts = 1000
t = torch.linspace(0, 1, num_pts).reshape(-1, 1)
X_t_square = square(num_pts)

square_net = InjectiveNet_SquareClosed(layer_count = 1, act_fn = nn.PReLU)
automate_training(
    model = square_net, loss_fn = gs.MSELoss(), X_train = t, Y_train = X_t_square,
    learning_rate = 0.1, epochs = 1000, print_cost_every = 200
)

X_p_square = square_net(t)
plot_curves(X_p_square, X_t_square)

# Print model parameters after learning
for name, param in square_net.named_parameters():
    print(f"Layer: {name} | Values : {param[:2]} \n")
Epoch: [   1/1000]. Loss:    0.579582
Epoch: [ 200/1000]. Loss:    0.000000
Epoch: [ 400/1000]. Loss:    0.000000
Epoch: [ 600/1000]. Loss:    0.000000
Epoch: [ 800/1000]. Loss:    0.000000
Epoch: [1000/1000]. Loss:    0.000000
[Plot: learned curve vs. target square]
Layer: linear_act_stack.0.weight | Values : tensor([[1.0000e+00, 1.6361e-08],
        [2.4072e-09, 1.0000e+00]], grad_fn=<SliceBackward0>) 

Layer: linear_act_stack.0.bias | Values : tensor([-1.3566e-07, -6.1809e-08], grad_fn=<SliceBackward0>) 

Layer: linear_act_stack.1.weight | Values : tensor([1.0000], grad_fn=<SliceBackward0>) 

Observe above that the network indeed learns the identity mapping: the linear layer’s weight matrix ends up as the identity, its bias ends up at 0, and the PReLU slope parameter ends up at 1, so the activation acts as the identity map $x \mapsto x$.

Pre-Auxiliary Network

Indeed, we could have transformed to any simple closed curve first and then attached the neural network layers after it. But since representing general simple closed curves is what we are trying to achieve in the first place, we can do something simpler. We transform the interval $[0, 1]$ to a closed loop represented in polar coordinates in the form:

$$r = f(\theta), \quad \theta \in [0, 2\pi)$$

where $\theta = 2\pi t$ with $t \in [0, 1]$.

We can then use the vector $[r \cos(\theta), r \sin(\theta)]^{T}$ as the first layer and attach the usual Injective Network layers on top. This transformation from $t$ to $[r \cos(\theta), r \sin(\theta)]^{T}$ is injective and satisfies the condition $F(0) = F(1)$, and thus creates closed curves.

The function $f$ can be represented using a full neural network with no constraints other than that its output has to be positive. We call this network the Pre-Auxiliary Network.

The basic idea is that the Pre-Auxiliary Network will provide a favorable initial shape to the Injective Network and make its learning task simpler.

Using Pre-Auxiliary Networks with NIGnets is really simple! Just create an appropriate network and pass it to the NIGnet constructor. We demonstrate this below.

The power of PreAux nets lies in the fact that a full-scale MLP can be used to represent them. But a few conditions must be met to create a valid PreAux net, as the implementation below shows: the input must first go through a closed transform (so that $t = 0$ and $t = 1$ produce the same radius), and the final activation must keep the output radius non-negative.

class PreAuxNet(nn.Module):
    def __init__(self, layer_count, hidden_dim):
        super().__init__()

        # Pre-Auxiliary net needs closed transform to get same r at theta = 0, 2pi
        self.closed_transform = lambda t: torch.hstack([
            torch.cos(2 * torch.pi * t),
            torch.sin(2 * torch.pi * t)
        ])

        layers = [nn.Linear(2, hidden_dim), nn.ReLU()]
        for i in range(layer_count):
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(nn.ReLU())
        layers.append(nn.Linear(hidden_dim, 1))
        layers.append(nn.ReLU())

        self.forward_stack = nn.Sequential(*layers)
    
    def forward(self, t):
        unit_circle = self.closed_transform(t) # Rows are cos(theta), sin(theta)
        r = self.forward_stack(unit_circle)
        x = r * unit_circle # Each row is now r*cos(theta), r*sin(theta)
        return x
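Before using it, a quick sanity check (an illustrative snippet, not from the original notebook): the PreAuxNet returns the same point for $t = 0$ and $t = 1$, so any network stacked on top of it still produces closed curves.

torch.manual_seed(0)  # fix the random initialization so the check is reproducible
check_net = PreAuxNet(layer_count = 2, hidden_dim = 10)

t_ends = torch.tensor([[0.0], [1.0]])
with torch.no_grad():
    print(check_net(t_ends))  # the two rows coincide (up to floating point error)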
from assets.shapes import square

# Generate target curve points
num_pts = 1000
t = torch.linspace(0, 1, num_pts).reshape(-1, 1)
Xt_square = square(num_pts)

preaux_net = PreAuxNet(layer_count = 3, hidden_dim = 10)
square_net = NIGnet(layer_count = 1, act_fn = nn.PReLU, preaux_net = preaux_net)
automate_training(
    model = square_net, loss_fn = gs.MSELoss(), X_train = t, Y_train = Xt_square,
    learning_rate = 0.1, epochs = 1000, print_cost_every = 200
)

Xp_square = square_net(t)
plot_curves(Xp_square, Xt_square)
Epoch: [   1/1000]. Loss:    0.976244
Epoch: [ 200/1000]. Loss:    0.000039
Epoch: [ 400/1000]. Loss:    0.000025
Epoch: [ 600/1000]. Loss:    0.000031
Epoch: [ 800/1000]. Loss:    0.000035
Epoch: [1000/1000]. Loss:    0.000036
[Plot: learned curve vs. target square]

Great! Using a Pre-Auxiliary Network lets us build an Injective Network on top of a favorable, learned closed transform. This covers both the circle and the square fitting cases without hard-coding a particular initial transform such as the circle or the square. Pre-Auxiliary Networks are therefore a way of adding capacity to Injective Networks using full-scale neural networks.

from assets.shapes import stanford_bunny

# Generate target curve points
num_pts = 1000
t = torch.linspace(0, 1, num_pts).reshape(-1, 1)
Xt_bunny = stanford_bunny(num_pts)

preaux_net = PreAuxNet(layer_count = 2, hidden_dim = 50)
bunny_net = NIGnet(layer_count = 1, act_fn = nn.PReLU, preaux_net = preaux_net)
automate_training(
    model = bunny_net, loss_fn = gs.MSELoss(), X_train = t, Y_train = Xt_bunny,
    learning_rate = 0.1, epochs = 1000, print_cost_every = 200
)

Xp_bunny = bunny_net(t)
plot_curves(Xp_bunny, Xt_bunny)
Epoch: [   1/1000]. Loss:    0.550154
Epoch: [ 200/1000]. Loss:    0.005555
Epoch: [ 400/1000]. Loss:    0.004586
Epoch: [ 600/1000]. Loss:    0.004364
Epoch: [ 800/1000]. Loss:    0.004427
Epoch: [1000/1000]. Loss:    0.004183
[Plot: learned curve vs. target Stanford bunny]

We now look at another technique that allows us to use full-scale neural networks to augment the power of Injective Networks.

Post-Auxiliary Networks

More Representation Power: Augmenting in Polar Coordinates

The Injective Network architecture can represent simple closed curves. But, as we saw, the requirement of network injectivity, needed to impart non-self-intersection to the parameterization, restricts the hidden layer size to 2. Therefore the only thing we can control to increase representation power is the network depth.

There is, however, a way of working in polar coordinates that lets us post-augment the parameterization with a general neural network, giving a boost to representation power. We discuss this technique now, but first we need to discuss the polar version of the Injective Network.

Polar Neural Injective Geometry

Consider first a polar setup of the parameterization, similar to the Cartesian case:

$$\begin{aligned} r &= f(t)\\ \theta &= g(t), \quad t \in [0, 1) \end{aligned}$$

Again as before we look at a vector-valued equivalent:

$$F: t \rightarrow \begin{bmatrix} r\\ \theta \end{bmatrix}$$

We use the same network architecture as before, the only difference being that the output is now interpreted as $[r, \theta]^{T}$:

Figure 2: The network architecture for the polar representation of simple closed curves.

The network architecture for the polar representation looks deceptively similar to the one for Cartesian coordinates, with only the interpretation of the outputs changing from $[x, y]^{T}$ to $[r, \theta]^{T}$. But for the polar representation we need additional constraints to hold to guarantee simple closed curves.

Positive $r$

The polar representation requires that $r$ is positive. The above architecture by default puts no restriction on the values $r$ can take. To generate only positive $r$ values, the activation function used at the last layer should be one whose outputs are always positive. Note that this activation function should still be injective. Valid activation functions include sigmoid, softplus, a modified tanh, or a modified ELU. An example modification of tanh is to add 1 to it and use $\tanh(x) + 1$ as the final activation function; this works since tanh is injective with range $(-1, 1)$.
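For example, such a shifted tanh can be wrapped as a small module (an illustrative sketch; this class is not part of NIGnets) and used as the final activation for the $r$ output:

class ShiftedTanh(nn.Module):
    # Illustrative sketch: injective activation with strictly positive outputs,
    # since tanh(x) + 1 has range (0, 2)
    def forward(self, x):
        return torch.tanh(x) + 1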

Restricting $\theta \in [0, 2\pi)$

Once we have chosen an activation function that only outputs positive values for $r$, we also need to control the range of $\theta$, for two reasons:

  1. The first is related to self-intersection. The mapping $t \to [r, \theta]$ is injective, but the curve it traces out in its current form may still self-intersect. Consider, for example, the points $[1, \pi]$ and $[1, 3\pi]$. These are different outputs and do not violate the injectivity of the network, but they correspond to the same point in the plane!
  2. The second is a representation problem. Say we use $\tanh + 1$ as the last-layer activation. This function has range $(0, 2)$, and therefore these will also be the only values of $\theta$ we can generate. Thus, our curves would all be limited to $\theta \in (0, 2)$. Of course, this is an artifact of using $\tanh + 1$; a different activation function like $ELU + 1$ would not have this issue, but it would suffer from the first issue above.

Therefore the possible activation functions we can use at the last layer must be restricted to those that can further be scaled to have a range within $[0, 2\pi)$. One valid choice is the function $\pi(\tanh(x) + 1)$, which has range $(0, 2\pi)$.

Note: A similar concern applies to the range of $r$: a bounded final activation such as $\tanh(x) + 1$ restricts $r$ to $(0, 2)$, which limits the size of the curves that can be represented.
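Putting the two constraints together, a polar variant of the injective network could look like the sketch below (the class `PolarInjectiveNet` is hypothetical and not part of NIGnets; we use softplus for $r$ and $\pi(\tanh(x) + 1)$ for $\theta$):

class PolarInjectiveNet(nn.Module):
    # Illustrative sketch (not part of NIGnets): 2-wide injective layers whose
    # outputs are interpreted as (r, theta) with the constraints discussed above
    def __init__(self, layer_count, act_fn = nn.PReLU):
        super().__init__()

        # Same closed transform as before so that t = 0 and t = 1 give identical outputs
        self.closed_transform = lambda t: torch.hstack([
            torch.cos(2 * torch.pi * t),
            torch.sin(2 * torch.pi * t)
        ])

        layers = []
        for i in range(layer_count):
            layers.append(nn.Linear(2, 2))
            layers.append(act_fn())
        self.linear_act_stack = nn.Sequential(*layers)

    def forward(self, t):
        z = self.linear_act_stack(self.closed_transform(t))
        r = nn.functional.softplus(z[:, 0:1])           # positive and injective
        theta = torch.pi * (torch.tanh(z[:, 1:2]) + 1)  # injective, range (0, 2*pi)
        return torch.hstack([r, theta])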

Post-Auxiliary Networks

Now consider a polar representation of the form:

$$r = h(\theta), \quad \theta \in [0, 2\pi)$$

As discussed before, this representation is guaranteed to be non-self-intersecting, but it suffers from the general curve representation problem discussed in the section Naive Usage of Polar Coordinates: The Representation Power Problem.

But this can be a very powerful representation, as we can use a full-scale neural network to parameterize $h(\theta)$. We now see a way of combining the injective polar representation with this parameterization.

Note: The final activation function of this neural network should be chosen such that its outputs are always positive, to agree with $r$ being positive.

Probably the simplest thing to do would be to add the outputs of the two networks: add the $r$ value output by the auxiliary network to the $r$ values output by the injective network at any given $\theta$. But performing this directly has a problem. The auxiliary network is defined at all $\theta$ by definition, but the injective network may not output some values of $\theta$, and therefore we cannot add the two everywhere. This is shown in Figure 3.

Figure 3: We cannot add the two networks at $\theta = \pi$ since the injective network does not output all possible $\theta$ as $t$ varies over $[0, 1)$.

But there is a simple way of bypassing this problem. The basic idea is to use the auxiliary $r$ values only at $t$ values that lead to valid $\theta$. The valid $\theta$ are generated as $g(t)$, so the corresponding $r = h(\theta) = h(g(t))$ are the $r$ values generated at the valid $\theta$ that occur as $t$ traverses $[0, 1)$.

Consider a vector-valued mapping associated with the auxiliary network:

$$A: t \rightarrow \begin{bmatrix} h(g(t))\\ 0 \end{bmatrix}$$

Note this requires that we first feed $t$ into the injective network, get the corresponding $\theta$, and then feed that into the auxiliary network to generate the $r$ for the $A$ mapping.

We now have the following two vector-valued mappings:

$$F: t \rightarrow \begin{bmatrix} f(t)\\ g(t) \end{bmatrix} \quad A: t \rightarrow \begin{bmatrix} h(g(t))\\ 0 \end{bmatrix}$$

Consider the addition of these:

$$F + A: t \rightarrow \begin{bmatrix} f(t) + h(g(t))\\ g(t) \end{bmatrix}$$

We now need to prove that this mapping generates simple closed curves. This is quite intuitive and the proof is as follows:

Closed
The curves will be closed if $(F + A)(0) = (F + A)(1)$. But we have $F(0) = F(1)$ since $F$ is the injective network. That is,
$$(f(0), g(0)) = (f(1), g(1))$$
We have $A(0) = (h(g(0)), 0)$ and $A(1) = (h(g(1)), 0)$. Now, since $g(0) = g(1)$, we also have $A(0) = A(1)$, and hence the curves will be closed.
Simple
The curves will be simple if the mapping $F + A$ is injective, which holds if:
$$t_1 \neq t_2 \implies (F + A)(t_1) \neq (F + A)(t_2)$$
We start by assuming that $F + A$ is not injective, that is, for some $t_1 \neq t_2$ the outputs are the same. Then we have:
$$g(t_1) = g(t_2)$$
and
$$\begin{aligned} &f(t_1) + h(g(t_1)) = f(t_2) + h(g(t_2)) \\ \implies &f(t_1) + h(g(t_1)) = f(t_2) + h(g(t_1)) \\ \implies &f(t_1) = f(t_2) \end{aligned}$$
But this means that $F(t_1) = F(t_2)$, which is a contradiction since $F$ is the injective network. Therefore our original assumption that $F + A$ is not injective is false, and we conclude that the mapping $F + A$ is indeed injective.

Figure 4 shows the architecture for the augmented network.

Figure 4: First, $t$ is fed into the injective network to obtain its $r_i$ and $\theta_i$ outputs. This $\theta_i$ is then fed into the auxiliary network to generate $r_a$, which is added to $r_i$ to produce the final value.
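To make this concrete, here is a hedged sketch of how the two networks could be wired together in code. The classes `PostAuxCombined` and `PolarInjectiveNet` (from the earlier sketch) are illustrative assumptions rather than part of the NIGnets API, and the final conversion to Cartesian points is our addition.

class PostAuxCombined(nn.Module):
    # Illustrative sketch (not part of NIGnets): add a post-auxiliary radius,
    # evaluated at the theta produced by the injective network
    def __init__(self, polar_injective_net, hidden_dim = 50):
        super().__init__()
        self.polar_injective_net = polar_injective_net

        # Post-auxiliary MLP h(theta) with a positive final activation for the radius
        self.postaux_net = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Softplus()
        )

    def forward(self, t):
        r_theta = self.polar_injective_net(t)       # F(t) = [f(t), g(t)]
        r_i, theta = r_theta[:, 0:1], r_theta[:, 1:2]
        r_a = self.postaux_net(theta)               # h(g(t)), the radius part of A(t)
        r = r_i + r_a                               # (F + A)(t) = [f(t) + h(g(t)), g(t)]
        # Convert the polar outputs to Cartesian points on the curve (assumed final step)
        return torch.hstack([r * torch.cos(theta), r * torch.sin(theta)])

# Example usage with the PolarInjectiveNet sketch from above
curve_net = PostAuxCombined(PolarInjectiveNet(layer_count = 3), hidden_dim = 50)
points = curve_net(torch.linspace(0, 1, 1000).reshape(-1, 1))  # (1000, 2) curve points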