Asish Mandoi

Hopfield Neural Networks for Combinatorial Optimization


Hopfield Neural Networks

Overview

  • A class of artificial neural networks that falls under the category of Recurrent Neural Networks
  • Well suited for pattern recognition, certain kinds of optimization, and associative memory applications
  • Can be operated with discrete or continuous variables as inputs
  • In the context of this work, we focus on an HNN architecture with binary neurons operating on values $\{-1, 1\}$

Network Parameters & Dynamics

Given a fully connected network of $N$ neurons,
$\text{x}$ : An $N$-dimensional input vector (input signal to the $N$ neurons)
$\text{y}$ : An $N$-dimensional output vector (output signal from the $N$ neurons)
$\text{W}$ : An $N \times N$ weight matrix (synaptic connection strengths between neurons)
$\text{b}$ : An $N$-dimensional bias vector (threshold for activation of neurons)

The state of a neuron at any stage is given by:

$$\text{y}=\text{sign}(\text{W}^{T}\text{x}+\text{b})$$

We use a slightly modified signum function:

$$\text{sign}(f(\text{z}^{(t)}))= \begin{cases} 1 & \text{if } f(\text{z}^{(t-1)})>0 \\ -1 & \text{if } f(\text{z}^{(t-1)})<0 \\ \text{z}^{(t-1)} & \text{if } f(\text{z}^{(t-1)})=0 \end{cases}$$

where $f(\text{z})$ must be a linear transformation of $\text{z}$, and $\text{z}^{(t)}$ denotes the state of the neuron at time $t$.
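
A minimal NumPy sketch of this update rule is given below. The function name is hypothetical (not taken from the accompanying repository); neurons with exactly zero local field keep their previous state, as in the modified signum above.

```python
import numpy as np

def hnn_update(x, W, b):
    """One pass of the update y = sign(W^T x + b) with the modified signum.

    x is the current state vector in {-1, +1}^N, W the symmetric weight
    matrix and b the bias vector. Neurons with exactly zero local field
    keep their previous state.
    """
    z = W.T @ x + b
    return np.where(z > 0, 1, np.where(z < 0, -1, x))
```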

Energy of the Network

Energy of an HNN is given by:

$$\begin{align*} E &= -\frac{1}{2} \left( \text{x}^{T}\text{W}\text{x}+\text{b}^{T}\text{x} \right) \\ &= -\frac{1}{2}\left(\sum_{i=1}^{N}\sum_{j=1}^{N}w_{ij}x_{i}x_{j}+\sum_{i=1}^{N}b_{i}x_{i}\right) \end{align*}$$
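
For reference, this energy can be evaluated directly from the definition (a one-line NumPy helper; the name is mine, not from the accompanying code):

```python
import numpy as np

def hnn_energy(x, W, b):
    """E(x) = -1/2 (x^T W x + b^T x) for a state x in {-1, +1}^N."""
    return -0.5 * (x @ W @ x + b @ x)
```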

For a learning-type problem like pattern recognition:

  • We are given the training data in the form of multiple pairs of input vectors $\text{x}$ and output vectors $\text{y}$.
  • The goal is to come up with a good model $(\text{W}, \text{b})$ whose predictions closely match the true outputs on unseen data.

For an optimization-type problem (on graphs):

  • $\text{W}$ and $\text{b}$ can be obtained by translating the objective (plus constraints) of the problem at hand into the energy function of the network defined above.
  • Given the edge weights and node biases $(\text{W}, \text{b})$, the goal is to find the optimal solution, i.e., an input vector $\text{x}$ for which the HNN energy $E(\text{x})$ is globally minimized.

NOTE

Given the dual nature of the two problem types above, can we say anything about the computational complexity of the former if we know the complexity of the latter?

Natural Energy Minimization in HNN

Consider an update made to the $i^{th}$ neuron.

$$y_{i}=\text{sign} \left( \left( \sum_{j=1}^{N}w_{ij}x_{j} \right) + b_{i} \right)$$

The energy change in the network due to the above single change is:

$$\begin{align*} \Delta E &= E^{f}-E^{i}=E_{i}^{f}-E_{i}^{i} \\ &=-\frac{1}{2} \left[ \sum_{j=1}^{N}w_{ij}(y_{i}-x_{i})x_{j}+b_{i}(y_{i}-x_{i}) \right] \\ &=-\frac{1}{2}(y_{i}-x_{i}) \left[ \sum_{j=1}^{N}w_{ij}x_{j}+b_{i} \right] \end{align*}$$

This is true under the important assumption that $w_{ii}=0\ \forall\ i\in\{1,\dots,N\}$.

Note that the sign of $\sum_{j=1}^{N}w_{ij}x_{j}+b_{i}$ is exactly the output at the $i^{th}$ neuron, $y_{i}$.

Now,

$$\sum_{j=1}^{N}w_{ij}x_{j}+b_{i}>0 \quad\Rightarrow\quad y_{i}=1 \quad\Rightarrow\quad y_{i}\ge x_{i} \quad\Rightarrow\quad \Delta E\le0$$

Similarly,

$$\sum_{j=1}^{N}w_{ij}x_{j}+b_{i}<0 \quad\Rightarrow\quad y_{i}=-1 \quad\Rightarrow\quad y_{i}\le x_{i} \quad\Rightarrow\quad \Delta E\le0$$

and

$$\sum_{j=1}^{N}w_{ij}x_{j}+b_{i}=0 \quad\Rightarrow\quad y_{i}=x_{i} \quad\Rightarrow\quad \Delta E=0$$

Thus, we always have $\Delta E\le0$.

A few things to note:

  • We assumed zero diagonal weights for $\text{W}$. If we relax this assumption, there is an upper limit on the magnitude of $w_{ii}$, i.e., $|w_{ii}|<k\,\max_{i\ne j}(|w_{ij}|)$ for some constant $k$.
  • We update only one neuron at a time. This is called the asynchronous approach. In the synchronous case, multiple neurons are updated in a batch of, say, $k$ selected neurons ($k$ can be equal to $N$). Depending on the weight matrix $\text{W}$, this may or may not converge to the expected solution.
  • Any QUBO (quadratic unconstrained binary optimization) problem can be mapped to a unique HNN.
  • The network is updated for a series of $n_{s}$ steps to allow the energy to decrease, after which the corresponding final output vector $\text{y}$ is read out.
  • The whole update procedure is run for $n_{e}$ epochs, and every epoch starts with a randomly initialized input vector $\text{x}$ (a minimal sketch of this loop follows the list).
  • The epoch with the minimum final energy is taken as the result. However, most update paths may not converge to this solution; the energy that most epochs converge to can instead be regarded as a local minimum.
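
The procedure in the list above can be written compactly in NumPy. This is a minimal illustration rather than the implementation from the accompanying repository; the function name and the random single-neuron (asynchronous) update order are my own choices.

```python
import numpy as np

def run_hnn(W, b, n_steps, n_epochs, rng=None):
    """Minimal sketch of the update procedure described above.

    Each epoch starts from a random state x in {-1, +1}^N and performs
    n_steps single-neuron (asynchronous) updates; the epoch with the
    lowest final energy E = -1/2 (x^T W x + b^T x) is returned.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(b)
    best_x, best_E = None, np.inf
    for _ in range(n_epochs):
        x = rng.choice([-1, 1], size=N)
        for _ in range(n_steps):
            i = rng.integers(N)                     # pick one neuron
            z = W[i] @ x + b[i]                     # local field at neuron i
            x[i] = 1 if z > 0 else (-1 if z < 0 else x[i])
        E = -0.5 * (x @ W @ x + b @ x)
        if E < best_E:
            best_x, best_E = x.copy(), E
    return best_x, best_E
```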

Annealing Techniques

Overview

Annealing techniques are ways of modifying the problem landscape that the neural network maps to, by introducing external parameters (such as temperature or annealing time) in real time as the network searches for a solution.

Depending on the starting conditions and on how these external parameters are gradually varied as the network evolves, such techniques can increase the probability of reaching a solution, helping the network arrive at one faster than applying it naively to the problem.

Stochastic Simulated Annealing

Consider the following update mechanism for the $i^{th}$ neuron:

$$y_{i}= \begin{cases} 1 & \text{with probability } \quad p_{i}^{T}\left(\sum_{j}w_{ij}x_{j}+b_{i}\right) \\ -1 & \text{with probability } \quad 1-p_{i}^{T}\left(\sum_{j}w_{ij}x_{j}+b_{i}\right) \end{cases}$$

where $p_{i}^{T}$ is the probability function, which in this case is taken to be a sigmoid ranging from $0$ to $1$ and centered at $\frac{1}{2}$. For a given input, the closer $p_{i}$ is to $1$, the higher the chance that the output is $1$; conversely, the closer it is to $0$, the higher the chance that the state of the neuron is updated to $-1$. The superscript $T$ denotes the dependence on temperature.

$$p_{i}^{T}(z)=\frac{1}{1+\exp(-z/T)}$$
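
A minimal NumPy sketch of this finite-temperature update is shown below; the function name and the way the temperature is supplied are my own assumptions, and in practice $T$ would be lowered gradually over the course of the run.

```python
import numpy as np

def stochastic_update(x, W, b, i, T, rng):
    """Update neuron i at temperature T using p_i^T(z) = 1 / (1 + exp(-z/T)).

    z is the local field of neuron i; the neuron is set to +1 with
    probability p_i^T(z) and to -1 otherwise. As T -> 0 this approaches
    the deterministic signum update (for nonzero local field).
    """
    z = W[i] @ x + b[i]
    p = 1.0 / (1.0 + np.exp(-z / T))
    x[i] = 1 if rng.random() < p else -1
    return x
```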

Weight Annealing

In combinatorial optimization, to maximize/minimize an objective function, it is common to only vary the binary variables by trial and error following some annealing technique. Although the weight matrix (and bias) is fixed for a problem, we don't necessarily have to start with those values. In weight annealing, we start with an easy problem, say a constant weight matrix (and bias), and gradually move towards that of our required problem.

In the initial stages, the HNN would automatically converge to the lowest-energy solution for the modified (easier) problem, since there is no possibility of getting stuck in a local minimum. Weight annealing relies on the idea that if the lowest-energy solution has been reached in the previous iterations, the network is highly likely to stay in the lowest-energy state for the current problem as well, as long as the weights of the network are changed gradually.

We often have a parameter called the annealing time ($\tau$) that controls how slowly the weights are changed from the constant values to the required values. This idea of gradually changing the network weights might be inspired by quantum annealing.

Some weight annealing approaches:

  • Exponential Annealing
$$d_{\alpha}(i,j) = d\cdot\left(1-\exp(-1/\alpha)\right), \qquad \alpha : \infty \rightarrow 0$$
  • Linear (Semi-exponential) Annealing
$$d_{\alpha} = d\cdot\alpha, \qquad \alpha : 0 \rightarrow 1$$
  • Exponential Smoothing [1]
$$d_{\alpha}(i,j) = \begin{cases} \overline{d} + \dfrac{\alpha}{\exp\left(\frac{\alpha}{d(i,j)-\overline{d}}\right) - 1} & \text{if } d(i,j) \ge \overline{d} \\[2ex] \overline{d} - \dfrac{\alpha}{\exp\left(\frac{\alpha}{\overline{d}-d(i,j)}\right) - 1} & \text{if } d(i,j) < \overline{d} \end{cases} \qquad \alpha : \infty \rightarrow 0$$

where $d_{\alpha \rightarrow 0}(i,j) = d(i,j)$ and $d_{\alpha \rightarrow \infty}(i,j) = \overline{d}$.

  • Power Smoothing [1]
$$d_{\alpha}(i,j) = \begin{cases} (d-\overline{d})^{\alpha} & \text{if } d(i,j) \ge \overline{d} \\ -(d-\overline{d})^{\alpha} & \text{if } d(i,j) < \overline{d} \end{cases} \qquad \alpha : 0 \rightarrow 1$$

$d$ here denotes the overall augmented weight matrix $\left[\text{W}\,|\,\text{b}\right]$ of the network.
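
As an illustration, here is a minimal sketch of the exponential annealing schedule applied to the network weights (NumPy). The function name, the choice of $\alpha$ schedule, and the idea of scaling $\text{W}$ and $\text{b}$ by the same factor are my assumptions, not taken from the accompanying code.

```python
import numpy as np

def exponential_annealing(W_target, b_target, alpha):
    """Scale the target weights by (1 - exp(-1/alpha)), as in d_alpha above.

    For large alpha the factor is close to 0 (an almost flat, easy problem);
    as alpha -> 0 the factor approaches 1 and the true problem is recovered.
    """
    scale = 1.0 - np.exp(-1.0 / alpha)
    return scale * W_target, scale * b_target

# Example schedule: alpha decreasing from a large value towards 0; a few
# HNN update steps would be run at each stage with the annealed weights.
# for alpha in np.geomspace(10.0, 0.01, num=50):
#     W_a, b_a = exponential_annealing(W, b, alpha)
#     ...
```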

Quantum Annealing or Quantum Adiabatic Optimization [2]

Suppose we have a quantum Hamiltonian $H_{P}$ whose ground state encodes the solution to a problem of interest, and another Hamiltonian $H_{0}$, whose ground state is "easy" (both to find and to prepare in an experimental setup). Then, if we prepare a quantum system to be in the ground state of $H_{0}$, and then adiabatically change the Hamiltonian for a time $\tau$ according to

$$H(t)=\left(1-\frac{t}{\tau}\right)H_{0}+\frac{t}{\tau}H_{P}$$

then, provided $\tau$ is large enough and $H_{0}$ and $H_{P}$ do not commute, the quantum system will remain in the ground state at all times, by the adiabatic theorem of quantum mechanics. At time $\tau$, measuring the quantum state will return a solution to our problem.
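
The interpolation can be visualized on a toy instance. The sketch below (NumPy; the choice of a two-spin transverse-field $H_0$ and a ZZ-coupling $H_P$ is purely illustrative) tracks the instantaneous ground-state energy of $H(t)$ as $t$ goes from $0$ to $\tau$.

```python
import numpy as np

# Toy 2-qubit interpolation H(t) = (1 - t/tau) H0 + (t/tau) HP.
# H0: transverse-field term with an easy-to-prepare ground state;
# HP: diagonal Ising (ZZ) term whose ground state encodes the "solution".
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

H0 = -(np.kron(X, I2) + np.kron(I2, X))
HP = np.kron(Z, Z)

tau = 1.0
for t in np.linspace(0.0, tau, 5):
    H = (1 - t / tau) * H0 + (t / tau) * HP
    evals = np.linalg.eigvalsh(H)              # exact diagonalization (4x4)
    print(f"s = {t/tau:.2f}, instantaneous ground-state energy = {evals[0]:.3f}")
```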

Hybrid Approach

A combined approach involving simultaneous weight annealing and stochastic annealing was employed in this work.


Problem Solving

Through an implementation of a Hopfield Neural Network, the goal is to solve certain NP-Hard/NP-Complete problems in combinatorial optimization such as Maximum Cut, Maximum Clique, Minimum Vertex Cover, and Maximum Independent Set. The energy functions for each of the problems below are to be minimized (a small numerical check of the Max-Cut form follows the list).

  • Max-Weighted-Cut
$$\text{min}\left[ \frac{1}{4}\left( 1^T\text{W}1-\text{x}^T\text{W}\text{x} \right) \right]$$
  • Max-Weighted-Clique
$$\text{min}\left[ \frac{1}{8}\left( (\text{x}+1)^T(\text{W}1-\text{b}) - (\text{x}+1)^T\text{W}(\text{x}+1) \right) \right]$$
  • Min-Weighted-Vertex-Cover
$$\text{min}\left[ -\frac{1}{8}\left( (\text{x}+1)^T(\text{W}1+\text{b}) + (\text{x}-1)^T\text{W}(\text{x}-1) \right) \right]$$
  • Max-Weighted-Independent-Set
$$\text{min}\left[ \frac{1}{8}\left( (\text{x}+1)^T(\text{W}1-\text{b}) - (\text{x}+1)^T\text{W}(\text{x}+1) \right) \right]$$
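
To make the Max-Weighted-Cut expression concrete, the following sketch (NumPy; helper names are mine) checks numerically that $\frac{1}{4}\left(1^T\text{W}1-\text{x}^T\text{W}\text{x}\right)$ equals the total weight of the edges crossing the cut defined by $\text{x}$, assuming a symmetric weight matrix with zero diagonal.

```python
import numpy as np

def cut_value(W, x):
    """Total weight of edges (i, j) whose endpoints receive opposite signs."""
    n = len(x)
    return sum(W[i, j] for i in range(n) for j in range(i + 1, n) if x[i] != x[j])

def cut_value_matrix_form(W, x):
    """The same quantity written as (1/4) (1^T W 1 - x^T W x)."""
    ones = np.ones(len(x))
    return 0.25 * (ones @ W @ ones - x @ W @ x)

# Quick check on a random weighted graph and a random +/-1 assignment.
rng = np.random.default_rng(0)
n = 6
W = rng.integers(0, 5, size=(n, n)).astype(float)
W = np.triu(W, 1)
W = W + W.T                                  # symmetric, zero diagonal
x = rng.choice([-1, 1], size=n)
assert np.isclose(cut_value(W, x), cut_value_matrix_form(W, x))
```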

Results

The following is a subset of results obtained from an implementation of the Hopfield Neural Network with respective annealing techniques in the context of solving the Max-Cut problem.

Source of Max-Cut data: biqbin.eu/MaxCut

Baseline

| dataset | optimal (or approx.) solution energy | obtained solution energy | time to solution (s) |
| --- | --- | --- | --- |
| mc_21_199 | 109 | 109 | 3.13 |
| mc_50_156 | 1762 | 1762 | 3.80 |
| mc_63_1729 | 956 | 956 | 4.62 |
| mc_100_242 | 197 | 197 | 5.00 |
| mc_101_5006 | 4322 | 4322 | 4.87 |
| mc_800_4667 | 3044 | 2786 | 170.13 |

Stochastic Annealing

| dataset | optimal (or approx.) solution energy | obtained solution energy | time to solution (s) |
| --- | --- | --- | --- |
| mc_21_199 | 109 | 109 | 12.89 |
| mc_50_156 | 1762 | 1762 | 13.56 |
| mc_63_1729 | 956 | 956 | 13.51 |
| mc_100_242 | 197 | 197 | 15.72 |
| mc_101_5006 | 4322 | 4322 | 14.87 |
| mc_800_4667 | 3044 | 2786* | 346.11 |

Exponential Annealing

| dataset | optimal (or approx.) solution energy | obtained solution energy | time to solution (s) |
| --- | --- | --- | --- |
| mc_21_199 | 109 | 109 | 4.25 |
| mc_50_156 | 1762 | 1762 | 9.98 |
| mc_63_1729 | 956 | 956 | 10.97 |
| mc_100_242 | 197 | 197 | 13.12 |
| mc_101_5006 | 4322 | 4322 | 12.69 |
| mc_800_4667 | 3044 | 2881* | 357.67 |

* denotes that a higher number of steps was used because of the large dataset.
For more details on the simulations and results, refer to the GitHub repository.


Future Scope

  1. Using negative diagonal weights in the Hopfield network

  2. Operating HNN near the critical point

    • Critical point - the edge between order and disorder
    • Critical phenomena occur in a lot of interesting places:
      • phase change
      • spins in lattices
      • brain
    • Critical dynamics are best demonstrated in a simplified system known as the Ising model.
  3. Neural Networks: Identifying the Dual Nature of learning and solving

IMPORTANT

Appendix: Limits of using negative diagonal weights [3]

Diagonal Weights: A common misconception in the literature is that zero diagonals of the weight matrix ($W_{ii}=0$) are necessary for the stable states of the continuous model to coincide with those of the original discrete model. This belief has no doubt evolved due to Hopfield's original simplifying assumption that $W_{ii}=0$. In fact, there are no restrictive conditions on the diagonals of $W$ for the continuous model to converge to a minimum of $E_{c}$. If $W_{ii}<0$ however, that minimum may lie in the interior of the hypercube, due to the convexity of the energy function. In this case, annealing techniques are usually employed to drive the solution trace towards the vertices.

Unlike the continuous model however, non-zero diagonal elements of the discrete Hopfield network do not necessarily allow Liapunov descent to local minima of $E_{d}$. This is because the change in energy $E_{d}$ due to a change in output level $v_{i}$ is:

$$\Delta E_{d}=-(u_{i}\Delta v_{i})-\frac{1}{2}W_{ii}(\Delta v_{i})^{2}$$

Since $u_{i}\ge0$ results in $\Delta v_{i}\ge0$, and $u_{i}<0$ results in $\Delta v_{i}\le0$ under the discrete model, the first term on the right-hand side of the above equation is always negative. The second term is positive however for $W_{ii}<0$. Consequently, $\Delta E_{d}$ is only negative provided

$$\|\Delta v_{i}\|\le2\frac{\|u_{i}\|}{\|W_{ii}\|}$$
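
Filling in the intermediate step (a hedged reconstruction on my part, using that $u_{i}$ and $\Delta v_{i}$ share the same sign, so that $-(u_{i}\Delta v_{i})=-\|u_{i}\|\,\|\Delta v_{i}\|$):

$$\Delta E_{d}\le0 \;\Longleftrightarrow\; \tfrac{1}{2}\|W_{ii}\|\,\|\Delta v_{i}\|^{2}\le\|u_{i}\|\,\|\Delta v_{i}\| \;\Longleftrightarrow\; \|\Delta v_{i}\|\le2\,\frac{\|u_{i}\|}{\|W_{ii}\|}$$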

References

  1. J. Schneider et al., "Search-space smoothing for combinatorial optimization problems", Physica A, 243, pp. 77-112, 1997.

  2. A. Lucas, "Ising formulations of many NP problems", Front. Phys., Vol. 2, 2014.

  3. J. Y. Potvin and K. A. Smith, "Artificial Neural Networks for Combinatorial Optimization", in F. Glover and G. A. Kochenberger (eds.), Handbook of Metaheuristics, International Series in Operations Research & Management Science, Vol. 57, Ch. 15, pp. 429-455, Kluwer Academic Publishers, Boston, MA, 2003.