The Surprising Science Behind Neural Networks: How They Work and How to Build Your Own (Beginner’s Guide)

Neural networks are brain-inspired models that learn from data by adjusting weights and biases, not by hand-coding every rule.
They consist of layers—input, one or more hidden layers, and an output layer—where each neuron connects to the next layer.
Each connection has a weight and each neuron includes a bias, and these are the learnable parameters adjusted during training.
Activation functions such as sigmoid, ReLU, and tanh introduce non-linearity enabling the network to model complex patterns.
A forward pass moves data from the input layer through hidden layers to the output, computing weighted sums, adding biases, and applying activations.
Training uses a loss function, with Mean Squared Error (MSE) common for regression and cross-entropy for classification, to quantify prediction error.
Backpropagation computes gradients of the loss with respect to weights and biases, and gradient descent (with a learning rate like 0.1 or 0.001) updates them to reduce loss.
Training is performed over many iterations called epochs, often with mini-batches, and requires lots of data to enable generalization, as highlighted by Geoffrey Hinton’s emphasis on data.
Most modern networks are feedforward, though there are recurrent networks and other architectures; the beginner guide centers on feedforward networks.
A simple from-scratch example in the article uses 2 input neurons, a small hidden layer of 2–3 neurons, and 1 output neuron, usually with a sigmoid activation, and can be implemented in Python (with or without NumPy).

What Are Neural Networks? (Mimicking the Brain)

Neural networks are brain-inspired computer programs that learn from data. In simple terms, a neural network is a machine learning model designed to mimic the way the human brain processes information ^[1] ^[2]. Just as the brain has billions of interconnected neurons firing in parallel, artificial neural networks have layers of interconnected nodes (also called neurons) that transmit and transform information. In the early days of AI, many systems were based on hand-crafted logical rules, but neural networks followed “the second route … from biology: [trying] to make computers that can perceive and act and adapt like animals” ^[3], as Geoffrey Hinton (a pioneer of deep learning) explained. In other words, instead of explicitly programming every rule, we design neural nets to learn those rules from examples – much like a brain learning from experience.

Each neuron in a neural network performs a simple calculation, but when many are connected into layers, they can solve complex tasks. In fact, neural networks today power all sorts of AI applications – from recognizing speech and images to making recommendations. A well-known example is Google’s search algorithm, which uses neural network techniques to rapidly classify and rank information ^[4]. What makes neural networks special is that they learn by example: they adjust themselves based on training data rather than following a fixed program. As the model sees more examples and their expected outcomes, it continuously refines its internal parameters to improve accuracy ^[5]. This adaptive, data-driven learning is why neural networks have become so powerful in recent years.

Key Components of a Neural Network

To understand how neural networks work, let’s break down their core components ^[6]:

Neurons (Nodes) and Layers: Just like biological neurons, artificial neurons receive inputs and produce an output. Neurons are organized into layers: an input layer (taking in the raw data features), one or more hidden layers (where intermediate computations happen), and an output layer (producing the final prediction) ^[7]. Each neuron in a layer connects to neurons in the next layer, forming a web of connections (analogous to synapses in the brain). The input layer simply passes the data into the network. The hidden layers perform transformations on the data through weighted connections, enabling the network to learn complex patterns ^[8]. The output layer produces the network’s result – for example, a category label or a numeric prediction.
Weights and Biases: Every connection between neurons has an associated weight – a number that represents the importance or strength of that connection. A neuron multiplies each of its inputs by a weight, sums them up, then adds a special constant called the bias. The bias shifts the neuron’s overall input up or down, much like an intercept term in linear regression. Together, the weights and biases are the primary learnable parameters of the network. Initially these parameters are set randomly, and during learning they are adjusted to improve performance ^[9]. A larger weight means that connection has a stronger influence on the neuron’s output. (Biologically, you can think of weights as the strength of synapses between neurons.)
Activation Functions: After computing the weighted sum of inputs (plus bias), a neuron uses an activation function to determine its output signal. The activation function introduces non-linearity, which allows neural networks to learn complicated relationships. Without it, multiple layers of neurons would collapse into an equivalent single layer (because stacking linear combinations yields another linear combination). Common activation functions include sigmoid, which squashes outputs to a range (0,1); ReLU (Rectified Linear Unit), which outputs 0 for negative inputs and a linear value for positive inputs; and tanh, which ranges between -1 and 1. For example, a sigmoid activation will output a value near 1 if the neuron’s weighted sum is very high (firing strongly) or near 0 if the sum is very low ^[10]. In essence, the activation function decides whether and how much a neuron “fires” given its inputs, loosely analogous to a biological neuron firing an electrical impulse. This mechanism enables the network to model non-linear patterns – a key to solving complex tasks.
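
The three components above can be sketched for a single neuron in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper name neuron and the input, weight, and bias values are assumptions chosen so the weighted sum is easy to check by hand.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation):
    """Hypothetical helper: weighted sum of inputs, plus bias, then activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative values (assumptions): z = 0.8*0.5 + 0.2*(-1.0) + 0.1 = 0.3
inputs = [0.5, -1.0]
weights = [0.8, 0.2]
bias = 0.1

print("sigmoid:", neuron(inputs, weights, bias, sigmoid))
print("relu:   ", neuron(inputs, weights, bias, relu))
print("tanh:   ", neuron(inputs, weights, bias, math.tanh))
```

The same weighted sum produces a different output under each activation, which is the sense in which the activation function decides how strongly the neuron "fires".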

By combining these components – neurons arranged in layers, weighted connections with biases, and non-linear activations – a neural network can represent extremely rich functions. During a forward pass, input data moves through the network layer by layer, with each neuron applying its weights and activation, to produce an output. But how does the network learn the right weights? For that, we need to look at the learning process: training the network using data.

How Neural Networks Learn: Forward Pass, Backpropagation, and Optimization

Learning by example is the crux of neural networks. Rather than being explicitly programmed, a neural net gradually adjusts its own weights and biases to improve at a task, using a process called training. As Yann LeCun (another deep learning pioneer) put it, “Everything that lives can adapt but everything that has a brain can learn” – and he knew early on that “learning was going to be critical to make machines more intelligent” ^[11]. Neural networks learn in two major steps: a forward pass to make a prediction, and backpropagation to update the parameters based on the error of that prediction. This is repeated many times with many examples, gradually honing the network’s performance.

1. Forward Pass (Making a Prediction): In the forward pass, the input data is fed through the network to obtain an output. Each neuron takes the outputs from the previous layer, multiplies each by its respective weight, adds the bias, and then applies the activation function. These outputs become the inputs to the next layer, and so on, feeding forward from the first layer to the last ^[12]. For example, if we input an image to a neural network that classifies images, the input layer neurons each take a pixel value. Those feed into hidden layer neurons which compute weighted sums and activations, then those feed into deeper layers, and finally the output layer produces a prediction (say, a probability distribution over classes). At this stage, the network is simply computing an answer with its current weights – it hasn’t learned anything new yet. Initially, because the weights are usually set randomly, the output will likely be wrong or nonsensical.
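
The layer-by-layer flow described above might be sketched with NumPy as follows. The layer sizes (4 inputs, 3 hidden neurons, 2 outputs), the random seed, and the function name forward are assumptions for illustration. Using sigmoid on every layer is also a simplification: a real classifier would typically apply softmax at the output to get a probability distribution.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, layers):
    """Feed x through each (weights, bias) pair in order."""
    a = x
    for W, b in layers:
        # Weighted sum plus bias, then activation; the result feeds the next layer.
        a = sigmoid(a @ W + b)
    return a


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
layers = [
    (rng.uniform(-1.0, 1.0, size=(4, 3)), np.zeros(3)),  # input -> hidden
    (rng.uniform(-1.0, 1.0, size=(3, 2)), np.zeros(2)),  # hidden -> output
]

x = np.array([0.2, 0.7, 0.1, 0.9])  # e.g. four pixel intensities
y = forward(x, layers)
print(y.shape)  # (2,)
print(y)        # untrained output: arbitrary values between 0 and 1
```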

2. Calculating Loss (Measuring the Error): After the forward pass yields a prediction, we need to measure how good or bad that prediction was. This is done using a loss function (also called a cost function). The loss function computes a numerical error based on the difference between the network’s prediction and the true expected output (the “right answer” from our training data) ^[13] ^[14]. For example, if we’re doing a yes/no prediction, a simple loss could be the squared error: $\text{Loss} = (y_{\text{predicted}} - y_{\text{true}})^2$. For multiple outputs, one common choice is Mean Squared Error (MSE) – the average of squared differences between predictions and actual values ^[15]. Another popular loss for classification is cross-entropy, which works well for probabilities. The key idea is that the loss is high (large positive value) when the network is very wrong, and it would be zero if the network is perfectly correct. The training goal is to minimize this loss.
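
As a rough sketch of the two losses named above, the snippet below computes MSE and the standard binary form of cross-entropy with NumPy. The example predictions and the small clipping constant eps (used only to avoid taking log(0)) are assumptions for illustration.

```python
import numpy as np


def mse(y_pred, y_true):
    """Mean of squared differences between predictions and targets."""
    return np.mean((y_pred - y_true) ** 2)


def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Binary cross-entropy; predictions are clipped to avoid log(0)."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))


y_true = np.array([1.0, 0.0, 1.0])
good = np.array([0.9, 0.1, 0.8])  # close to the targets
bad = np.array([0.2, 0.9, 0.3])   # far from the targets

print("MSE  good vs bad:", mse(good, y_true), mse(bad, y_true))
print("BCE  good vs bad:", binary_cross_entropy(good, y_true),
      binary_cross_entropy(bad, y_true))
```

Both losses are lower for the predictions that sit closer to the targets, and MSE is exactly zero when prediction and target match.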

3. Backpropagation (Learning from Errors): Once we have the loss, the network learns by reducing that loss. This happens in the backward pass, using an algorithm called backpropagation. Backpropagation works by propagating the error backward through the network, determining how each weight contributed to the error ^[16]. In practice, backpropagation involves calculating the gradient of the loss with respect to each weight – essentially, how much a small change in a weight would affect the loss. The network then adjusts the weights in the direction that decreases the loss. This is achieved with a technique called gradient descent (or one of its variants): for each weight $w$, we subtract a small fraction of the gradient $\frac{\partial \text{Loss}}{\partial w}$ from $w$. That fraction is scaled by a learning rate, which is a small positive number (like 0.1 or 0.001) determining the step size of each update. By performing this adjustment for every weight (and bias), the network slightly improves its parameters for the next round ^[17] ^[18]. Conceptually, you can imagine the loss as a hilly landscape and the training process as trying to descend to the lowest valley point (minimum loss) by following the slope (gradient).
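
To make the update rule concrete, here is gradient descent on a one-parameter toy problem in plain Python. The loss (w - 3)^2, the starting point, the learning rate of 0.1, and the step count are all assumptions chosen so the "valley" is easy to see: the minimum sits at w = 3.

```python
def loss(w):
    """Toy loss (assumption): a parabola with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0              # arbitrary starting weight
learning_rate = 0.1  # step size

for step in range(50):
    # Move against the gradient: downhill on the loss landscape.
    w -= learning_rate * grad(w)

print("final w:", w)
print("final loss:", loss(w))
```

Each step shrinks the distance to the minimum by a constant factor here; with a learning rate that is too large (above 1.0 for this particular loss) the updates would overshoot and diverge instead.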

Backpropagation uses the chain rule from calculus to efficiently compute these weight gradients from the output layer back to the input layer. The math can get intricate, but the concept is straightforward: assign blame for the error to each neuron and its weights, and tweak them to reduce the error. An intuitive way to think about it is to imagine each weight asking, “If I were a little higher or lower, would the overall error go up or down?” – backprop gives the answer, and the weight is then nudged in the favorable direction.
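
The chain rule can be checked numerically on the smallest possible case: one sigmoid neuron with one input. The sketch below compares the chain-rule gradient with a finite-difference estimate that literally nudges the weight up and down. The function names, the example values, and the use of a half squared-error loss are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_fn(w, b, x, y_true):
    """Half squared error of a single sigmoid neuron."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y_true) ** 2


def analytic_grad_w(w, b, x, y_true):
    """Chain rule: dLoss/dw = dLoss/da * da/dz * dz/dw."""
    a = sigmoid(w * x + b)
    dloss_da = a - y_true   # derivative of 0.5 * (a - y)^2
    da_dz = a * (1.0 - a)   # derivative of sigmoid
    dz_dw = x               # derivative of w*x + b with respect to w
    return dloss_da * da_dz * dz_dw


def numeric_grad_w(w, b, x, y_true, h=1e-6):
    """Central finite difference: nudge w slightly in both directions."""
    up = loss_fn(w + h, b, x, y_true)
    down = loss_fn(w - h, b, x, y_true)
    return (up - down) / (2.0 * h)


w, b, x, y_true = 0.4, -0.2, 1.5, 1.0  # illustrative values
print("chain rule:       ", analytic_grad_w(w, b, x, y_true))
print("finite difference:", numeric_grad_w(w, b, x, y_true))
```

The two numbers should agree to many decimal places. Comparing them in this way is a common sanity check when writing backpropagation by hand.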

4. Iterating the Process (Training): Updating the weights once is not enough. The network will perform forward passes and backpropagation repeatedly for many examples, often for many cycles (called epochs) through the training dataset. Each cycle makes the network a bit better. Over time, the network’s predictions on the training examples get closer to the true outputs, and the overall loss decreases, ideally converging to a low value. This iterative learning process continues until the model reaches a desired level of accuracy or the improvements become very small. Crucially, the effectiveness of learning depends on having lots of training data. As Geoffrey Hinton famously said, “All you need is lots and lots of data and lots of information about what the right answer is, and you’ll be able to train a big neural net to do what you want” ^[19]. In other words, neural networks achieve their power by training on massive amounts of examples that teach them the correct output for a given input. The combination of big data, computational power, and the backpropagation algorithm is what ignited the deep learning revolution ^[20].

Over many iterations, the network effectively “learns” the task. The layers of neurons develop weights that pick out meaningful features in the data. For instance, in an image classifier, early hidden layers might learn to detect simple shapes or edges, while deeper layers recognize more complex structures (like faces or objects). The end result is a trained neural network model that can generalize to new inputs – meaning it can make predictions on data it hasn’t seen before, by recognizing patterns it learned during training.

(It’s worth noting that most modern neural networks are feedforward as described above – information flows one way, input to output. There are also recurrent networks and other architectures with feedback loops, but for this beginner guide we focus on the basic feedforward network.)

With the concepts of neurons, weights, activations, forward passes, loss, and backpropagation in mind, let’s move from theory to practice. Next, we’ll walk through how you can program a simple neural network from scratch in Python, to solidify your understanding.

Programming a Simple Neural Network from Scratch (in Python)

One of the best ways to fully grasp neural networks is to build a simple one yourself. In real projects, developers use powerful libraries like TensorFlow or PyTorch, but here we’ll use minimal tools (just basic Python, and optionally NumPy for convenience) to code a basic neural network. By doing so, you’ll see how each component – from weight initialization to feedforward to backprop – comes to life in code ^[21] ^[22]. (Even if you’re not a programming expert, following these steps will give you insight into how neural networks operate under the hood.)

As Andrew Ng – a leading AI researcher – once noted from his first experiences, “I thought it was amazing you could write software that would learn by itself and make predictions.” ^[23] Building your own neural network will let you witness that amazing process of a program learning by itself. Ready? Let’s outline the steps to create a simple neural network that learns from data:

Define the Network Architecture: Decide how many inputs and outputs your network will have, and how many hidden neurons to use. For example, if we want to learn a simple function with two inputs and one output, we might use 2 input neurons, a hidden layer with a handful of neurons (say 2 or 3 for simplicity), and 1 output neuron. The input layer size should match the number of features in your data, and the output layer size matches the number of prediction values or classes. (If you’re doing a classification with multiple classes, you’d have one output neuron per class. For a yes/no outcome, one output neuron can suffice, e.g. outputting a probability of “yes”.) Keep it simple – for a first network, a single hidden layer (making it a “shallow” neural network) is enough to demonstrate the concepts.
Initialize Weights and Biases: Before training, we need to start with some initial weights and biases for all the connections and neurons. We typically initialize these to small random numbers ^[24]. Random initialization breaks symmetry (so neurons don’t all produce the same output) and gives the network a starting point to begin learning. For example, in code you might create Python lists or NumPy arrays filled with random values for each weight matrix (e.g., np.random.rand() returns floats in [0, 1), so subtracting 0.5 gives small values centered on zero). Often the biases can start at zero or small random values too. At this stage, the network doesn’t “know” anything – it’s essentially making random guesses.
Choose an Activation Function: Decide on the activation function for the neurons in your hidden layer (and possibly for the output neuron, depending on the task). A safe choice for a simple network is the sigmoid function, which outputs values between 0 and 1. Sigmoid is nice for beginners because it’s easy to work with and was historically used in early neural nets ^[25]. If your output is a probability, sigmoid is appropriate for a single output neuron, or softmax (a related function) can be used for multiple output classes. In code, you can define the sigmoid function as: def sigmoid(x): return 1 / (1 + np.exp(-x)). This will squash any input x into the 0–1 range. Using an activation function means our neuron’s output = activation(weighted_sum + bias). (For reference, modern deep learning often uses ReLU for hidden layers because it mitigates some problems like vanishing gradients, but to keep things simple, we’ll stick with sigmoid or similar here.)
Implement the Forward Pass: Now, use your weights, biases, and activation function to compute the output for a given input. This is the feedforward computation ^[26] ^[27]. For each neuron in the hidden layer, calculate the weighted sum of inputs: $z = w_1 x_1 + w_2 x_2 + \cdots + b$. Then apply the activation: $a = \text{sigmoid}(z)$. That $a$ is the neuron’s output. Do this for all neurons in the hidden layer. Their outputs collectively form the input to the next layer (in our simple network, the next layer might just be the output layer). For the output neuron(s), repeat the process: multiply each hidden neuron’s output by the corresponding output weight, sum them up plus bias, and apply activation (for regression tasks you might use a linear output or for classification a sigmoid/softmax as discussed). After this step, you’ll have the network’s predicted output for the given input. In code, this can be done with matrix multiplication for efficiency. For example, if X is the input vector, W1 is the matrix of weights from input to hidden layer, and W2 is the matrix from hidden to output, you can compute hidden outputs as A1 = sigmoid(np.dot(X, W1) + b1) and then output y_pred = sigmoid(np.dot(A1, W2) + b2). (The dimensions need to line up, but conceptually it’s input -> W1 -> hidden activations -> W2 -> output.)
Calculate the Loss: Compare the network’s prediction to the true target value. Choose a loss function to quantify the error. For simplicity, we can use Mean Squared Error (MSE) as our loss ^[28]. For a single output, MSE is $\frac{1}{2}(y_{\text{pred}} - y_{\text{true}})^2$ (the ½ is optional, it just cancels out a 2 in the derivative later). If using multiple outputs (say, a one-hot encoded vector for classes), you would compute the average of squared differences across all output nodes. In code, if y_pred is the predicted value and y_true is the actual value, MSE can be computed as loss = np.mean((y_pred - y_true) ** 2). This gives a sense of how far off the prediction is. At the start of training, the loss will likely be high (since weights are random). Our goal is to minimize this loss by adjusting weights.
Backpropagate the Error: This is the heart of learning – using the loss to update the weights. Backpropagation will compute the gradients (partial derivatives of the loss with respect to each weight and bias). While the detailed calculus might be beyond a beginner tutorial, the idea is to find out for each weight: should it be increased or decreased, and by how much, to reduce the error? For a network with sigmoid activations, the derivations involve multiplying the error by the sigmoid’s derivative and the input activations (this comes from the chain rule). The end result is an update rule for each weight. For example, a simple form of the update is:
$$\Delta w = -\alpha \frac{\partial \text{Loss}}{\partial w},$$
where $\alpha$ is the learning rate (a small positive number you choose, e.g. 0.1). The negative sign means we move weights in the direction that lowers the loss. Practically, you’d compute the gradient for each layer in reverse order: first find the gradient at the output layer (how the loss changes with output activation, and how that activation changes with each output weight), then propagate that “blame” back to the hidden layer weights, and so on. In code, this might involve matrix operations as well (the expressions here assume X is a 2-D array with one example per row). For instance, if error_out = y_pred - y_true, then the gradient at the output layer (for sigmoid) can be delta_out = error_out * y_pred * (1 - y_pred) (since the derivative of sigmoid is $a(1-a)$). Then the gradient for weights W2 is proportional to np.dot(A1.T, delta_out), where A1 holds the hidden activations. Similarly, the error is propagated to the hidden layer: error_hidden = np.dot(delta_out, W2.T), and delta_hidden = error_hidden * A1 * (1 - A1) for sigmoid in the hidden layer. Then the gradient for W1 is np.dot(X.T, delta_hidden). This is a simplified sketch, but the key point is: each weight gets adjusted by an amount proportional to its contribution to the error ^[29]. Backpropagation thus “nudges” all the weights in the right direction.
Update Weights and Biases (Optimization): Once the gradients are computed, update each weight: new weight = old weight - learning_rate * gradient. For biases, similarly: new bias = old bias - learning_rate * gradient_of_loss_wrt_bias. This step actually performs the learning by applying the calculated adjustments to the network’s parameters ^[30]. After updating, if you feed the same input again, the loss should ideally be a tiny bit lower. Modern neural networks use sophisticated optimization algorithms (like Adam or RMSprop) that adapt the learning rate for each weight, but the basic principle remains gradient descent on the error surface.
Repeat for Many Iterations: Training is an iterative process. You will loop over your training dataset multiple times, each time performing forward passes and backprop updates. Each pass makes the network a little more accurate. For example, you might loop for a fixed number of epochs (passes through the data) or until the loss falls below a threshold. It’s common to shuffle the data and use mini-batches for efficiency (updating on small groups of examples at a time), but for a simple network you can even update one example at a time (called stochastic gradient descent). Throughout training, monitor the loss – it should trend downward. If it plateaus or oscillates, you might need to reduce the learning rate or ensure your implementation of backprop is correct. With enough data and iterations, the network will converge to a set of weights that (hopefully) generalize well. At this point, it has “learned” from the training data.
Test Your Network: After training, it’s important to test the neural network on new data it hasn’t seen (if available) to verify that it learned the general patterns, not just the training examples. You would do a forward pass on test inputs and check if the outputs make sense (e.g. accuracy on labeled test data). If the network performs well on new data, congratulations – you’ve successfully built and trained a neural network from scratch! 🎉
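
The steps above can be pulled together into one short script. Treat it as a minimal sketch rather than a definitive implementation: the XOR dataset, the 2-3-1 layout, the random seed, the learning rate of 0.5, and the epoch count are all assumptions chosen for illustration. Variable names follow the snippets in the steps (W1, b1, W2, b2, A1, delta_out, delta_hidden), and constant factors from the MSE derivative are folded into the learning rate.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Toy dataset (assumption): XOR, one example per row.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_true = np.array([[0.0], [1.0], [1.0], [0.0]])

# Architecture and initialization: 2 inputs -> 3 hidden -> 1 output.
rng = np.random.default_rng(42)
W1 = rng.uniform(-1.0, 1.0, size=(2, 3))
b1 = np.zeros((1, 3))
W2 = rng.uniform(-1.0, 1.0, size=(3, 1))
b2 = np.zeros((1, 1))

learning_rate = 0.5
epochs = 5000
losses = []

for epoch in range(epochs):
    # Forward pass over the whole (tiny) dataset at once.
    A1 = sigmoid(np.dot(X, W1) + b1)        # hidden activations, shape (4, 3)
    y_pred = sigmoid(np.dot(A1, W2) + b2)   # predictions, shape (4, 1)

    # Loss: mean squared error.
    losses.append(np.mean((y_pred - y_true) ** 2))

    # Backpropagation: output layer first, then hidden layer.
    error_out = y_pred - y_true
    delta_out = error_out * y_pred * (1.0 - y_pred)
    error_hidden = np.dot(delta_out, W2.T)
    delta_hidden = error_hidden * A1 * (1.0 - A1)

    grad_W2 = np.dot(A1.T, delta_out)
    grad_b2 = np.sum(delta_out, axis=0, keepdims=True)
    grad_W1 = np.dot(X.T, delta_hidden)
    grad_b1 = np.sum(delta_hidden, axis=0, keepdims=True)

    # Gradient descent update.
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1

# Check the result: one more forward pass with the trained parameters.
A1 = sigmoid(np.dot(X, W1) + b1)
y_pred = sigmoid(np.dot(A1, W2) + b2)
print("first loss:", losses[0])
print("last loss: ", losses[-1])
print("predictions:", y_pred.ravel().round(3))
```

XOR has only four possible inputs, so this script checks predictions on the training examples themselves; with a real dataset you would hold out separate test data as described in the last step. A network this small can also settle on a poor solution for some seeds, so if the loss stalls, trying another seed or a few more hidden neurons is a reasonable first step.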

Throughout this coding process, you can literally see the neural network learn. Initially, its predictions will be essentially random. But as you loop through training, the loss will decrease and the predictions will start to align with the true outputs. The network parameters (weights/biases) are being fit to your data through those small backprop updates. This simple exercise encapsulates the same fundamental process that happens in large-scale deep learning systems, just on a much smaller and manageable scale.

Wrapping Up: From Brain Inspiration to Working Code

By now, you should have a clearer understanding of how neural networks work and how to build one. We started with the core idea that neural networks are inspired by the brain’s network of neurons, and they learn by adjusting connections (weights) through experience (data). We discussed the essential components – neurons organized in layers, weights & biases that get tuned, and activation functions that introduce the flexibility needed to learn complex patterns. We then walked through the learning procedure: the forward pass where a prediction is made, the calculation of a loss to see the error, and backpropagation with gradient descent to tweak the weights in the right direction. This loop of predict-and-correct is how neural nets “learn” from their mistakes, gradually becoming more accurate.

Finally, we outlined how to implement a toy neural network in Python from scratch. While modern AI projects use high-level frameworks, there’s immense educational value in seeing the nuts and bolts yourself – watching numbers flow forward and gradients flow backward. This hands-on approach demystifies the “magic” of neural networks and shows that, at its heart, a neural network is just a lot of simple math operations (weighted sums and nonlinear squashing) guided by an algorithm (backpropagation) to improve those weights. The pioneers of deep learning had the conviction that teaching machines to learn from data was the path to AI. As Yann LeCun’s and Geoffrey Hinton’s experiences showed, even when neural nets were unfashionable, they believed in the power of learning algorithms ^[31] ^[32]. That faith paid off – today’s AI breakthroughs, from self-driving cars to language translators, are built on these neural network principles.

To delve deeper, you can explore resources like Andrew Ng’s Deep Learning courses, the online book Neural Networks and Deep Learning by Michael Nielsen, or MIT’s and Stanford’s free lecture materials – all of which expand on the concepts here. But with the foundational knowledge from this guide, you’re well on your way. Whether you were drawn in by the brain analogy or the promise of coding an intelligent system, remember that neural networks “learn” by example – a powerful paradigm shift from traditional programming. As you experiment with your own neural network code, you’ll likely share the same amazement Andrew Ng had: a few lines of code, inspired by the brain’s design, can learn by itself and start making predictions ^[33]. That is the intriguing beauty of neural networks – simple computational pieces coming together to produce behavior that appears intelligent.

Sources: