Linear algebra alone would only allow approximation of linear functions.
Introducing non-linearities puts us in the right setup: given any approximation error, there exists a neural network of some architecture, composed of both linear and non-linear operations, that meets this error threshold (the universal approximation theorem).
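A minimal sketch, assuming NumPy, of why the non-linearity is essential: stacked linear layers with no activation in between collapse into a single linear map, so depth alone cannot escape the class of linear functions.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

# Two stacked linear layers...
deep_linear = W2 @ (W1 @ x)
# ...equal one linear layer with weight matrix W2 @ W1.
single_linear = (W2 @ W1) @ x
assert np.allclose(deep_linear, single_linear)

# Inserting a non-linearity (here ReLU) breaks the collapse,
# which is what lets the network represent non-linear functions.
deep_nonlinear = W2 @ np.maximum(W1 @ x, 0.0)
print(np.allclose(deep_nonlinear, single_linear))  # almost surely False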
Global optimum = the set of parameter values that minimizes the loss
To approach the global optimum, gradient-based approaches require all of these operations to be differentiable.
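A minimal sketch, assuming PyTorch, of how differentiability enables this: because every operation in the loss has a derivative, autograd can propagate gradients back to the parameter, and repeated gradient steps drive it toward the optimum.

import torch

w = torch.tensor([0.5], requires_grad=True)
x, y = torch.tensor([2.0]), torch.tensor([3.0])

for step in range(100):
    loss = (w * x - y).pow(2).mean()  # differentiable end to end
    loss.backward()                   # gradient flows back to w
    with torch.no_grad():
        w -= 0.05 * w.grad            # one gradient-descent step
        w.grad.zero_()                # reset accumulated gradient

print(w.item())  # converges toward y / x = 1.5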
5 aspects to keep in mind for non-linearities (see the sketch after this list):
need to be differentiable to allow gradient-based optimization
avoid vanishing gradients to prevent slow convergence (saturation is the typical cause: a saturated activation is nearly flat, so its gradient is close to zero)
avoid dead neurons (a neuron whose gradient is always zero stops learning and becomes a wasted parameter)
preserve normalization of activations for easier optimization; however, normalization techniques can tackle this issue
be zero-centered, so that gradients are not restricted to a certain sign
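A small sketch, assuming NumPy, contrasting these aspects on common activations: sigmoid saturates for large |z| (vanishing gradient) and is not zero-centered, ReLU has exactly zero gradient for negative inputs (the dead-neuron risk), and tanh is zero-centered.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(z)

# Saturation: sigmoid'(z) = s * (1 - s) vanishes for large |z|.
print("sigmoid grad:", s * (1 - s))        # ~0 at z = +/-10

# Dead neurons: ReLU's gradient is exactly 0 for z < 0, so a neuron
# stuck in that regime receives no updates.
print("relu grad:   ", (z > 0).astype(float))

# Zero-centering: tanh outputs straddle 0, while sigmoid outputs are
# all positive, biasing downstream gradients toward a single sign.
print("tanh out:    ", np.tanh(z))
print("sigmoid out: ", s)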