Chris Kawecki
Addition by Neural Net
April, 1992

Various models are proposed for parallel binary addition. Each is based on software available in McClelland & Rumelhart's 1988 volume. All but one of the models were run using the back-propagation (generalized delta rule) learning paradigm; the one exception was run using the pattern associator. All models were tested with various parameters. In all cases, changing parameters had minimal effect on the outcome of learning; changing parameter values during training was nonetheless avoided, in keeping with the localist spirit of PDP. Unless otherwise mentioned, the results reported here held not only for specific parameter values but for all values within a reasonable range.

The 4x3 Pattern Associator

The first model designed was a pattern associator with four inputs and three outputs, using the delta learning rule and the linear activation rule. The four input units were divided into two pairs, and each pair was divided into a tens unit and a ones unit (base 10) or a twos unit and a ones unit (binary). The target activation values were set for each pattern. Thus, the base 10 problem 10 plus 10 equals 20 would have input and target activations of 1,0,1,0 and 0,2,0. Similarly, the binary problem 1 plus 3 equals 4 would be represented by an input vector of 0,1,1,1 and a corresponding target vector of 1,0,0.

After training for 100 epochs with simple, no-carry, base 10 patterns and a sufficient learning rate, the network had learned the rule: add the first and third digits to get what an outside observer would call the tens digit, and add the second and fourth to get the ones digit. These were the patterns used in learning:

input    desired output    meaning
0000     000               0+0=0
1111     022               11+11=22
0101     002               1+1=2
1010     020               10+10=20

I proceeded to check the following problems without any further learning.

problem       network response
1) 34+21?     (correct- 55)
2) 72+15?     (correct- 87)
3) 96+3?      (correct- 99)
4) 11+0?      (correct- 11)
5) 55+55?     (0) (10) (10)

For problem 5, the network yielded an output set of (0) (10) (10), which is not at all surprising, considering the nature of the network. The weights from the input layer to the output layer clearly illustrated what the network had learned: each ones digit in the input vector contributed, with a weight of 1, to the activation of the ones output unit; each tens unit contributed its exact value to the tens output unit; and the ones digits should not contribute to the tens or hundreds digit, just as the tens digits should not contribute to the ones or hundreds digit.

(picture goes here)

Thus, given only the information that 1+1=2 and 0+0=0, the network generalized that, in each of the ones and tens columns, any two input digits should be added to produce the desired output activation. Clearly, it should not be assumed that a PDP model designed for addition must be trained with at least one example of each pair of numbers (i.e. 1+1=2, 1+2=3, 1+3=4...1+8=9, 1+9=10, 2+2=4, 2+3=5...9+9=18). This result is primarily due to the PDP software's linear activation rule and should therefore not be assumed for all models, or for the brain itself. The model was even able to generalize to adding non-equal integers.

These positive results agree with Rumelhart and McClelland's (1989) suggestion that linear separability determines the ability of a pattern associator to learn. Examining the output set, we see that it is linearly separable so long as no carrying takes place. Once sufficient carries are included in the training regime, the set loses its linear separability. For a spatial analogue, picture the ones digit's output on the vertical axis and the two ones inputs on the two horizontal axes. The output unit's activation grows larger and larger until it reaches 10, at which point it turns into a 0 (and 1 is carried to the tens digit).
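The behaviour of the 4x3 pattern associator described in this section can be reproduced in a few lines. What follows is only a minimal sketch, not the PDP software itself: the learning rate of 0.1, the zero initial weights, the absence of bias terms and the function names are all assumptions made for illustration. The training patterns and test problems are the ones listed above.

```python
# Sketch of a 4-input, 3-output linear pattern associator trained with the
# delta rule. Learning rate, zero initial weights and names are assumptions.

TRAINING = [
    # (tens_a, ones_a, tens_b, ones_b) -> (hundreds, tens, ones)
    ((0, 0, 0, 0), (0, 0, 0)),  # 0 + 0 = 0
    ((1, 1, 1, 1), (0, 2, 2)),  # 11 + 11 = 22
    ((0, 1, 0, 1), (0, 0, 2)),  # 1 + 1 = 2
    ((1, 0, 1, 0), (0, 2, 0)),  # 10 + 10 = 20
]


def output(weights, inputs):
    """Linear activation: each output is the weighted sum of the inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def train(patterns, epochs=100, lrate=0.1):
    """Delta rule: change each weight by lrate * error * input activation."""
    weights = [[0.0] * 4 for _ in range(3)]  # assumed zero starting weights
    for _ in range(epochs):
        for inputs, targets in patterns:
            acts = output(weights, inputs)
            for j in range(3):
                err = targets[j] - acts[j]
                for i in range(4):
                    weights[j][i] += lrate * err * inputs[i]
    return weights


weights = train(TRAINING)

# Test problems from the text, presented without further learning.
for a, b in [(34, 21), (72, 15), (96, 3), (11, 0), (55, 55)]:
    digits = (a // 10, a % 10, b // 10, b % 10)
    print(a, "+", b, "->", [round(v, 2) for v in output(weights, digits)])
```

Under these assumptions the weights settle close to 1 from each tens input to the tens output and from each ones input to the ones output, and close to 0 elsewhere, so 55+55 comes out near 0, 10, 10, as reported. The zero starting weights matter in this sketch: with random starting weights the four training patterns alone would not pin down how unequal digits are combined.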
To complete the picture: if you can imagine two planes between which all, and only, the output values of 0 are located, two more planes for all output values of 1, two for 2, and so on until all the output values have been accounted for, then the training set is linearly separable.

Network With 2 to 3 Single Unit Hidden Layers

Rumelhart and McClelland (1986) propose two models for binary addition: one with two single-unit hidden layers, and one with three single-unit hidden layers. The first, simpler model (4x1x1x3) is identical in input and output units to the 4x3 pattern associator; however, it has two hidden units, one before the other, to act as 'carrying units'. It is reported to have solved the problem half the time. The other half of the time, the first hidden unit took the role of carrying from the twos to the fours digit, and therefore did not receive input from the other hidden unit. To solve this problem, a 4x1x1x1x3 model was introduced, so that, if two of its hidden units ran into the same situation the 4x1x1x3 model did, the third would slowly take over the role of the incorrectly placed unit.

In both cases, it was unclear whether biases were allowed for the output units; therefore, models were made with variable biases on the output units as well as models where the output units' biases were fixed at 0.

The delta rule learning algorithm requires that the activation function be continuously differentiable with respect to net input (the sum, over the units feeding a given unit, of each sending unit's activation times the weight of the connection between the two units), and the PDP models use the continuous sigmoid function (McClelland & Rumelhart, 1988). This activation function also allows a unit to approximate linear threshold activation simply by increasing all of its weights, so that the sigmoid function is horizontally compressed.
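The two limiting behaviours of the sigmoid (steep with large weights, nearly linear with small ones) can be checked numerically. This is a minimal sketch under stated assumptions: it uses the standard logistic function 1/(1+e^-net), and the weight values 50 and 0.01, the sample inputs and the function names are chosen only for illustration.

```python
import math


def sigmoid(net):
    """Logistic function of the net input."""
    return 1.0 / (1.0 + math.exp(-net))


def activation(x, weight, bias=0.0):
    """Activation of a single unit with one input (hypothetical helper)."""
    return sigmoid(weight * x + bias)


# Large weight: the unit behaves almost like a linear threshold unit,
# jumping from about 0 to about 1 as the input crosses zero.
steep = [round(activation(x, weight=50.0), 3) for x in (-1.0, -0.1, 0.1, 1.0)]
print("steep:", steep)

# Small weight: net input stays near 0, where the sigmoid is close to its
# tangent line 0.5 + net/4, so the unit is nearly linear.
gain = 0.01
max_gap = max(
    abs(activation(x, gain) - (0.5 + gain * x / 4.0))
    for x in (-1.0, -0.5, 0.5, 1.0)
)
print("largest deviation from a straight line:", max_gap)
```

Note that in the nearly linear regime the activations cluster around 0.5 rather than spanning the full range from 0 to 1, which is one reason the approximation to a linear unit is only partial.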
To a lesser extent, the sigmoid also allows a unit to approximate the linear activation rule: keeping all weights low stretches the sigmoid horizontally, so that the net inputs of all units (biases included) lie relatively close to 0, where the activation function is nearly a straight line.

The 4x1x1x3 Network

Experiments with the 4x1x1x3 network turned out to be disappointing. With biases on the output units fixed at 0, the network never learned correct weights for its connections; every time, a tss of 1 remained, corresponding to one error out of 16 test patterns. Setting the weights and biases manually put the network on a clear course for correct generalization. This can be explained by the hidden units' functions being correct after the manual setting of weights, so that the connections only had to adjust minimally to account for the continuous activation function. Many more trials would probably have produced at least one set of random initial weights that assigned the correct functions to the hidden units, but twenty to thirty trials did not yield a single correct result.

Allowing variable biases on the output units resulted in one out of ten random weight selections yielding a solution to the problem. Strangely, the functions of the hidden units were almost unbelievable: the first hidden unit was active unless there was a carry from the ones input units; the second hidden unit was active if neither of the twos input units was active, or if only one of the ones units was active and neither of the twos input units was active. Almost always, the network made its hidden units into "anti-carry" units, though nearly always with exceptions that kept them from being simple "anti-carry" units. The network was also very unsuccessful in realizing that only the second hidden unit should contribute to the fours output unit.
Eliminating all other connections to the fours output unit yielded one successful run at generalization as well as numerous unsuccessful attempts.

The 4x1x1x1x3 Net

The flexibility allowed to the 4x1x1x1x3 net beyond what was required to complete the learning task (as demonstrated by successful implementations of the 4x1x1x3 net) was met with an increase in apparent arbitrariness in the weight matrix, most specifically in the roles of the hidden units. Some networks ended up with hidden units that obviously functioned as carrying units; others had hidden units that obviously functioned as anti-carrying units; still others had hidden units whose functions were incomprehensible. To the near-disbelief of this programmer (and contradicting Rumelhart & McClelland's claim), the network occasionally fell into local minima at tss=1.

A fully trained (working) network with biases set and clamped at 0 on its output units generated the weights and activation patterns found in Appendix 2 ("tupshin"). Analysis of the model led to the following conclusions:

1) With one exception (p1), the activation of the first hidden unit was below .5 when a carry to the fours digit was in order and above .5 when no carry was required. Developing in parallel with this tendency was the inhibitory role of the first hidden unit on the fours output unit.

2) The network learned the commutative property. That is, it generated correct answers and nearly equal hidden unit activations for problems represented by a+b=c and b+a=c.

3) Though the network developed positive weights from the ones input units to the ones output unit and from the twos input units to the twos output unit, it also developed inhibitory connections from the twos inputs to the ones output and from the ones inputs to the twos output. This tendency must have developed along with the strange activation patterns of the last two hidden units.
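For reference, the 16 two-bit addition patterns and the tss measure used in this analysis can be written out as follows. This is only a sketch: tss is taken here to be the sum of squared differences between target and output activations over all patterns and output units, and the helper names and the hypothetical "one wrong output" network are assumptions for illustration, not output from the actual runs.

```python
def bits(n, width):
    """Binary digits of n, most significant first."""
    return tuple((n >> k) & 1 for k in reversed(range(width)))


def make_patterns():
    """All 16 problems a+b with a and b between 0 and 3."""
    patterns = []
    for a in range(4):
        for b in range(4):
            inputs = bits(a, 2) + bits(b, 2)  # twos_a, ones_a, twos_b, ones_b
            target = bits(a + b, 3)           # fours, twos, ones
            patterns.append((inputs, target))
    return patterns


def tss(outputs, targets):
    """Total sum of squared errors over all patterns and output units."""
    return sum(
        (t - o) ** 2
        for out, tgt in zip(outputs, targets)
        for o, t in zip(out, tgt)
    )


patterns = make_patterns()
targets = [tgt for _, tgt in patterns]

# Number of problems that require a carry into the fours digit.
carries = sum(1 for tgt in targets if tgt[0] == 1)

# A hypothetical network that is right everywhere except for one output
# unit on one pattern: this is what a residual tss of 1 corresponds to.
outputs = [list(tgt) for tgt in targets]
outputs[5][0] = 1 - outputs[5][0]

print(len(patterns), "patterns,", carries, "with a carry to the fours digit")
print("tss with one wrong output unit:", tss(outputs, targets))
```

The pattern for 1+3=4 comes out as inputs 0,1,1,1 with target 1,0,0, matching the encoding given for the pattern associator.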
A network with variable biases on its output units developed at least one hidden unit with an obvious function. However, that unit was the exception within the solution reached by this set of weights (found in Appendix 2, "tup2").

1) The third hidden unit was clearly a non-carrying unit for the fours unit. Its activations of .9 and 1 match exactly the trials where the target pattern does not have an active fours unit, and its activations of 0 and .1 match the target patterns representing 4 or greater (where there must have been a carry to the fours output).

2) The network did not generalize the commutative property onto its hidden units for the twos units. This can be seen in the differences between the weights from the left twos input (first unit) and the right twos input (third unit). More spectacularly, examining the activation patterns for patterns 15 and 6, we see that the first two hidden units have activations of 1 for pattern 6 and activations of 0 for pattern 15. It seems likely that the freedom to form this strange internal representation was granted by the additional bias terms on the output units, yet this is uncertain, especially since these two units were the only two with such widely differing effects.

Generalization Ability With Half of the Possible Patterns

Experiments with the 4x1x1x1x3 network trained on eight to ten correct patterns (Appendix, pattern set "n") yielded no weight matrices with a tss below 1 and 15 that ended in local minima at tss=1. One more combination of patterns (Appendix, pattern set "m") was attempted with five of the same initial weight matrices, and each of the five times the network arrived at a local minimum at tss=1. For the purposes of comparison, two pattern sets with random target activations were constructed (pattern sets "b" and "c"), and, starting from the same initial weights, the network almost always learned them successfully.
When the network was successful in learning the random patterns, the binary addition problem ("2+2=4": 1 0 1 0 1 0 0) was added to the training regime. The number of epochs listed is for pattern set "b" to reach an ecrit of .05 with lrate .2 and momentum .9, first with only the random patterns, then with the addition of 2+2=4. Biases were allowed on output units as well as hidden units.

Initial weight number    ecrit .05      With extra pattern
(9 initial random weights)
 1                       227            315
 2                       local tss=1
 3                       local tss=1
 4                       294            413
 5                       364            603
 6                       298            855
(8 random weights)
 9                       343            local tss=1
10                       277            733
11                       323            local tss=1
12                       261            556
13                       243            516
14                       255            390
15                       211            975

With patterns from "c": note that the test pattern 1 0 1 0 was initially trained into the network with target activation 110, so that when it was added with its correct target, the best possible result would be tss=.5.

10                       tss=1
11                       tss=1
12                       199            tss=.5
13                       tss=1
14                       253            tss=.5
15                       183            tss=1

Not only can we generalize from these data that a pattern cannot be learned without sufficient representation in the training set, but further that, without sufficient representation, there is a large chance that the pattern will double-cross itself and force the network into local minima.

Conclusion

The message of this research that should be remembered is that connectionist models using back-propagation do not hesitate to form obscure internal representations. Small variations in the initial random weights generate varied weight matrices, allow or disallow symmetry, and even force local minima. The implications of this evidence for neural modeling are still uncertain, mainly for this reason: this dependence of back-propagation's outcome on its starting point may be nonexistent or less noticeable in other learning algorithms. Connectionist models in general, perhaps by using algorithms other than back-propagation, must find ways of being less dependent on initial weights.
Further, the vast number of different weight matrices that resulted in similar output activations makes back-propagation, and perhaps connectionist models generally, potentially a very powerful computational tool, though many runs with different random weights, rather than a few, would be necessary to determine the merits of a particular model. I assume this tendency would be exacerbated with more complex models.

Appendix 1: Location and permission of use of models

All models can be found in files2/users/kawecki/pdp in their respective directories (i.e. the 4x4x4x3 file would be found in the 4x4x4x3 directory, within the pdp directory). Each model has a different name, most of them strange. The names can be seen by listing the contents of a directory ("ls" in Unix); a name that appears with template, startup, and network suffixes in a directory is the one that should be used with the application program. Saved weight matrices (with names like march2, march3, march4, etc.) can be found in each folder, and selections of patterns are in the pdp folder itself, as well as in some of the folders.

Anybody can copy and use these models, under the condition that if they figure anything interesting out, they have to email me and tell me what they found out (ckawecki@hamp.hampshire.edu). I would be happy to show these models to anyone who is unable to figure them out, as long as they are interesting people who are actually interested in the model.

Bibliography

Bechtel, W., and Abrahamsen, A. (1991) Connectionism and the Mind, Padstow, England: Blackwell Publishers.

McClelland, J.L., and Rumelhart, D.E. (1988) Explorations in Parallel Distributed Processing, Cambridge, MA: MIT Press/Bradford Books.

Rumelhart, D.E., McClelland, J.L., and the PDP Research Group (1986) Parallel Distributed Processing, Cambridge, MA: MIT Press/Bradford Books.