Vector and matrix derivatives
These are definitions. They are all intuitive. (I've picked one convention for the definitions. The other convention gives all results transposed.)
Taking the derivative of a scalar by a column vector is just taking the grad:
$$\frac{\mathrm{d}a}{\mathrm{d} \bf x} = \nabla a = \left[\frac{\partial a}{\partial x_1}\ \frac{\partial a}{\partial x_2} \dots \ \frac{\partial a }{\partial x_n}\right]^\top$$
Taking the derivative of a vector by a scalar is componentwise differentiation. The result is a row vector by convention.
$$\frac{\mathrm{d}\bf{x}}{\mathrm{d}a} = \left[\frac{\mathrm{d}x_1}{\mathrm{d}a}\ \frac{\mathrm{d}x_2}{\mathrm{d}a}\ \dots\frac{\mathrm{d}x_n}{\mathrm{d}a}\right]$$
These basic derivatives are easily obtained from the definitions:
$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top b = b$$
$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top \mathbf x = 2\mathbf x$$
$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top B \mathbf x = (B + B^\top)\mathbf x$$
which reduces to $2B\mathbf x$ when $B$ is symmetric (the only case we need below, since we'll apply it with $B = X^\top X$).
Note: taking the derivative of a row vector by a column vector gives a Jacobian matrix. The denominator of the derivative ranges across rows and the numerator ranges across columns. This is consistent with the two definitions above.
$$\frac{\mathrm{d} \mathbf y}{\mathrm{d}\bf x} = \begin{pmatrix}\frac{\mathrm d y_1}{\mathrm d x_1} & \dots & \frac{\mathrm d y_n}{\mathrm d x_1} \\ \vdots & \ddots & \vdots\\ \frac{\mathrm d y_1}{\mathrm d x_n} & \dots & \frac{\mathrm d y_n}{\mathrm d x_n} \end{pmatrix}$$
This gives one more useful derivative, though we won't need it below:
$$\frac{\mathrm{d}}{\mathrm{d}\bf x}\mathbf{x}^\top B = B$$
Also note that $\bf x^\top y = y^\top x$ and these are both scalars. Same goes for $\mathbf{x}^\top A \mathbf{y} = \mathbf{y}^\top A^\top \mathbf{x}$. But $\bf xy^\top \neq yx^\top$; both sides are matrices and the rhs is the transpose of the lhs.
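As a quick numerical sanity check of these identities, here is a minimal NumPy sketch comparing them against central finite differences. The dimension, seed and tolerances are arbitrary choices, not anything from the derivation.

```python
# Finite-difference check of the vector derivative identities above.
# A minimal sketch: dimension, seed and tolerances are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.standard_normal(n)
b = rng.standard_normal(n)
B = rng.standard_normal((n, n))

def grad_fd(f, x, h=1e-6):
    """Central-difference gradient of a scalar function f at x (column-vector convention)."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# d/dx (x^T b) = b
assert np.allclose(grad_fd(lambda v: v @ b, x), b, atol=1e-5)
# d/dx (x^T x) = 2x
assert np.allclose(grad_fd(lambda v: v @ v, x), 2 * x, atol=1e-5)
# d/dx (x^T B x) = (B + B^T) x, which is 2Bx when B is symmetric
assert np.allclose(grad_fd(lambda v: v @ B @ v, x), (B + B.T) @ x, atol=1e-4)
```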
Algebra
We have a matrix of features $X$ whose first column is 1s. We have a column vector of responses $y$. We wish to find a column vector of coefficients $\beta$ (the first entry of which represents the constant term) such that
$$X\beta \approx y$$
Specifically, we wish to minimise the sum of squares of the elements of the vector $y-X\beta$. Fortunately, this sum of squares is simply
$$\begin{align*}(y-X\beta)^\top (y-X\beta) &= y^\top y - y^\top X\beta - \beta^\top X^\top y + \beta^\top X^\top X \beta\\ &= y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta \end{align*}$$
and we wish to minimise this quantity over $\beta$. It is quadratic in $\beta$, and the matrix of the quadratic term, $X^\top X$, is positive definite (assuming $X$ has full column rank), so there is a single critical point and it is a minimum. Taking the derivative w.r.t. $\beta$ using the rules above and setting it equal to 0 gives
$$-2X^\top y + 2X^\top X \beta= 0$$
Since $X^\top X$ is then invertible, rearranging gives the normal equation
$$\beta = (X^\top X)^{-1}X^\top y$$
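Here is a minimal NumPy sketch of this result on made-up data: it solves the normal equation for $\beta$ and compares against NumPy's built-in least-squares solver. The design, true coefficients and noise level are invented for illustration.

```python
# Fit beta by solving the normal equation X^T X beta = X^T y and compare
# against np.linalg.lstsq. All data here is synthetic and made up.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1, x2])              # first column is 1s (intercept)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 * rng.standard_normal(n)

# Solve X^T X beta = X^T y directly rather than forming the explicit inverse
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_normal, beta_lstsq)
print(beta_normal)                                     # close to [2, -1, 0.5]
```

(Numerically it's preferable to solve $X^\top X\beta = X^\top y$ with a linear solver, as above, rather than form $(X^\top X)^{-1}$ explicitly.)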
Geometric derivation
Minimising the sum of squares $\|X \beta - y\|^2$ corresponds geometrically to choosing $\beta$ to minimise the Euclidean distance between two vectors. The $y$ vector is fixed. $\beta$ lets us choose a linear combination of the columns of $X$ as our other vector. This other vector lies in the column space of $X$ (i.e. its image when interpreted as a linear map), and this column space typically won't contain $y$ (if it did, $X$'s features could perfectly recreate the response for every sample).
The point in a subspace closest to a given vector is the orthogonal projection of the vector onto that subspace. Here this point is $X\beta$ (for the optimal $\beta$), and so the residual vector $y-X\beta$ is orthogonal to the column space. This implies that $(y-X\beta) \cdot Xv = 0$ for all vectors $v$ (since every $Xv$ lies in the column space), i.e. $(y-X\beta)^\top Xv = 0$ for all $v$. Since this holds for all $v$, we have $(y-X\beta)^\top X = 0$. This rearranges to give the normal equation.
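A small numerical illustration of the orthogonality condition, on made-up data: the fitted residual is (numerically) orthogonal to every column of $X$.

```python
# The residual of the least-squares fit is orthogonal to the column space of X.
# Minimal sketch on arbitrary synthetic data.
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.standard_normal((100, 2))])
y = rng.standard_normal(100)                   # any response vector will do
beta = np.linalg.solve(X.T @ X, X.T @ y)       # least-squares fit
residual = y - X @ beta
print(X.T @ residual)                          # ~ zero vector, up to floating-point error
```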
$\hat y = X\beta = X(X^\top X)^{-1}X^\top y$ is the projection of $y$ onto the column space of $X$, and more generally the matrix $H = X(X^\top X)^{-1}X^\top$ projects a response vector to its corresponding prediction vector. It is known as the "hat matrix" since it "puts the hat on y", and it is a projection (i.e. idempotent) operator: $H^2 = H$.
Since $Hy=\hat y$, the diagonal entry $H_{ii}$ gives the weighting of the response $y_i$ in the prediction $\hat y_i$, compared to the weighting of the other responses $y_j$. This is known as the leverage of the $i$th observation, and it is a number between 0 and 1 (this follows from the idempotency and symmetry of the hat matrix, see Wikipedia). It tells us: for the prediction on this sample, how much did the sample's own observed response play a role, compared to the responses of the other samples? In a linear regression model, points at extreme $x$ values (which are not necessarily outliers!) have high leverage. If a high-leverage point is also an outlier, it can badly distort the fit. See https://online.stat.psu.edu/stat462/node/170/
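A minimal sketch of the hat matrix and leverages on made-up data, checking idempotency, symmetry, that the leverages lie in $[0,1]$, and that a point at an extreme $x$ value gets much higher leverage than the rest. The design (one deliberately extreme point) is invented for illustration.

```python
# Hat matrix H = X (X^T X)^{-1} X^T and leverages (its diagonal) on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = np.concatenate([rng.normal(0, 1, n - 1), [8.0]])   # one point at an extreme x value
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix
assert np.allclose(H @ H, H)               # idempotent: H^2 = H
assert np.allclose(H, H.T)                 # symmetric

leverage = np.diag(H)
assert np.all((leverage >= 0) & (leverage <= 1 + 1e-12))   # each leverage in [0, 1]
print(leverage.sum())                      # = trace(H) = number of columns of X (full rank)
print(leverage[-1], leverage[:-1].mean())  # the extreme point has far higher leverage
```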
Statistics
First, the variance of a random vector $x$ is defined to be the variance-covariance matrix $\text{Var}(x) = E[(x-E[x])(x-E[x])^\top]$.
Fix our dataset $X$ (i.e. it is a constant, not a random matrix). We suppose our dataset was generated by the model $y = X\beta + \epsilon$, where $y$ is a random vector, $\beta$ is an unknown parameter and $\epsilon \sim N(0,\sigma^2 I_n)$ is a random vector whose components are iid $N(0, \sigma^2)$. This encodes the usual regression assumptions of normal iid (and hence homoscedastic) errors.
Now let $\hat \beta = (X^\top X)^{-1}X^\top y$ be an estimator of $\beta$. We would like to know its bias and its variance.
First, substitute $y = X\beta + \epsilon$ into our expression for $\hat \beta$ and simplify:
$$\hat \beta = (X^\top X)^{-1}X^\top (X\beta + \epsilon) = \beta + (X^\top X)^{-1}X^\top \epsilon$$
The only random variable here is $\epsilon$ which has mean 0, so
$$E[\hat \beta] = \beta$$
meaning our estimator is unbiased. We now compute the variance-covariance matrix of $\hat \beta$, which will be
$$\begin{align*}\text{Var}(\hat \beta) &=E[(\hat \beta - \beta)(\hat \beta - \beta)^\top] \\&= E[\left((X^\top X)^{-1}X^\top \epsilon\right)\left((X^\top X)^{-1} X^\top \epsilon\right)^\top]\\&= E[(X^\top X)^{-1}X^\top \epsilon \epsilon^\top X (X^\top X)^{-1}]\\&= (X^\top X)^{-1}X^\top E[\epsilon \epsilon^\top] X (X^\top X)^{-1}\\\end{align*}$$
Now $E[\epsilon \epsilon^\top]$ is a component-wise expected value of the matrix $\epsilon \epsilon^\top$. For all off-diagonal entries we have $E[\epsilon_i \epsilon_j] = E[\epsilon_i]E[\epsilon_j] = 0$ by independence. On the diagonal we have entries $E[\epsilon_i^2]$. Now $\epsilon_i \sim N(0, \sigma^2)$, therefore $\sigma^2 = \text{Var}(\epsilon_i) = E[\epsilon_i^2] - E[\epsilon_i]^2$, hence $E[\epsilon_i^2] = \sigma^2$ (well isn't that a neat little trick!). So $E[\epsilon \epsilon^\top] = \sigma^2 I_n$, a scalar multiple of the identity, which therefore commutes with everything in the product. [N.B. you can also simply observe that $E[\epsilon \epsilon^\top]$ is the variance-covariance matrix of $\epsilon$, since $\epsilon$ has mean 0, simplifying the argument above.] Continuing,
$$\begin{align*} \text{Var}(\hat \beta) &= (X^\top X)^{-1}X^\top X \sigma^2 (X^\top X)^{-1}\\ &= \sigma^2 (X^\top X)^{-1} \end{align*}$$
What is this saying? The variance of a parameter estimate reflects our uncertainty about the value of that parameter; the parameter itself is a constant. Our estimator is a random vector and this is its covariance structure. All the covariances in the matrix are proportional to $\sigma^2$, which makes intuitive sense: the variance of the slope of the line of best fit is naturally proportional to the variance of the errors $\sigma^2$. But the covariances are also based on the inverse of an expression in $X$. The more samples contained in $X$, the larger the values of $X^\top X$, and so the smaller the inverse term in the covariance. This corresponds to the fact that the more samples we have, the more information (and hence the more certainty) we have about the parameters $\beta$. All the uncertainty in $\beta$ comes from us having to infer its value from our limited data. In the extreme case where the number of samples goes to infinity, there is no uncertainty left, and our $\beta$ can be read straight out of the data.
On the diagonal of this matrix are the variances of the regression coefficients $\hat \beta_i$. These are variances of the sampling distribution of the $\hat \beta_i$. We can take the square root of these to get the standard errors for the coefs.
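A Monte Carlo sketch of both facts, with the design $X$ held fixed and everything else (true $\beta$, $\sigma$, sample and simulation counts) made up: the empirical mean of $\hat\beta$ should be close to $\beta$ (unbiasedness) and its empirical covariance close to $\sigma^2(X^\top X)^{-1}$.

```python
# Simulate y = X beta + eps many times with X fixed, refit beta_hat each time,
# and compare the empirical mean/covariance of beta_hat with the theory above.
import numpy as np

rng = np.random.default_rng(4)
n, n_sims, sigma = 100, 20_000, 0.5
beta_true = np.array([1.0, 2.0, -0.5])
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])   # fixed design

XtX_inv = np.linalg.inv(X.T @ X)
betas = np.empty((n_sims, 3))
for s in range(n_sims):
    eps = rng.normal(0, sigma, n)
    y = X @ beta_true + eps
    betas[s] = XtX_inv @ X.T @ y

print(betas.mean(axis=0))                    # ~ beta_true (unbiasedness)
print(np.cov(betas, rowvar=False))           # ~ sigma^2 (X^T X)^{-1}
print(sigma**2 * XtX_inv)
print(np.sqrt(np.diag(sigma**2 * XtX_inv)))  # standard errors of the coefficients
```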
We might also ask about the consistency of $\hat \beta$, i.e. whether $\hat \beta \to_p \beta$ as the dataset grows. This is the formal posing of the extreme case above. Answering the question is a bit tricky to do in the general case, and it involves reasoning formally about the distribution of $X$, which we have thus far avoided. $\hat \beta$ is indeed consistent, and in short the argument is that in the expression $\hat \beta = \beta + (X^\top X)^{-1}X^\top \epsilon$, the size of the $X^\top \epsilon$ term can be controlled via the law of large numbers, using the independence of the samples and the errors, as the randomly sampled $X$ grows large. See 9.1 of http://cameron.econ.ucdavis.edu/e240a/asymptotic.pdf .
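A rough numerical illustration (not a proof) of consistency: as $n$ grows, $\hat\beta$ gets closer to the true $\beta$. The model and numbers below are made up.

```python
# As the sample size grows, beta_hat approaches the true beta.
import numpy as np

rng = np.random.default_rng(5)
beta_true = np.array([1.0, 2.0])
for n in [10, 100, 1_000, 10_000, 100_000]:
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    y = X @ beta_true + rng.normal(0, 1.0, n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(n, np.abs(beta_hat - beta_true).max())   # worst-case error shrinks with n
```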
Simple OLS
Let $$X = \begin{pmatrix}1 & x_1 \\ \vdots & \vdots\\ 1 & x_n\end{pmatrix}, \quad y=(y_1\ \dots \ y_n)^\top, \quad \beta = (\beta_0 \ \beta_1)^\top$$
and write $\bar x = \frac 1 n \sum_i x_i$ and likewise for other vectors. (the $\beta$s should have hats on them but we are lazy now)
Then $$X^\top X = \begin{pmatrix}n & \sum_i x_i \\ \sum_i x_i &\sum_i x_i^2\end{pmatrix} = \begin{pmatrix}n & n\bar x \\ n \bar x & n\overline{x^2}\end{pmatrix}$$
$$(X^\top X)^{-1} = \frac{1}{n^2 \overline{x^2} - n^2 \bar x ^2} \begin{pmatrix}n\overline{x^2} & -n \bar x \\ -n \bar x & n\end{pmatrix} = \frac{1}{n(\overline{x^2} - \bar x^2)} \begin{pmatrix}\overline{x^2} & -\bar x \\ -\bar x & 1\end{pmatrix}$$
$$X^\top y = \begin{pmatrix}\sum_i y_i \\ \sum_i x_i y_i\end{pmatrix} = \begin{pmatrix}n\bar y \\ n \overline{xy}\end{pmatrix}$$
Thus
$$\beta = \frac{1}{\overline{x^2} - \bar x^2} \begin{pmatrix}\overline{x^2}\bar y - \bar x \overline{xy} \\ -\bar x \bar y + \overline{xy} \end{pmatrix}$$
So the slope is $$\beta_1 = \frac{\overline{xy} - \bar x \bar y}{\overline{x^2} - \bar{x}^2} = \frac{\text{Cov}(x,y)}{\text{Var}(x)}$$
Recalling that $\text{Corr}(x,y) = \frac{\text{Cov}(x,y)}{\sqrt{\text{Var}(x)\text{Var}(y)}}$, and letting $\sigma_x,\sigma_y$ be the standard deviations of $x,y$, we have
$$\beta_1 = \text{Corr}(x,y) \frac{\sigma_y}{\sigma_x}$$
which is an intuitive way of understanding the line of best fit in simple regression. The slope is the correlation between the independent and dependent variables, rescaled by their standard deviations so that it is in the right units: $\sigma_y$ goes in the numerator and $\sigma_x$ in the denominator, matching rise over run on the plot.
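A final minimal check on made-up data: the closed-form slope $(\overline{xy} - \bar x\bar y)/(\overline{x^2} - \bar x^2)$, the correlation-times-ratio-of-standard-deviations form, and the slope from the normal equations all agree.

```python
# Three equivalent expressions for the simple-regression slope, on synthetic data.
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.normal(2.0, 3.0, n)
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x])
beta0, beta1 = np.linalg.solve(X.T @ X, X.T @ y)     # slope from the normal equations

cov_xy = np.mean(x * y) - x.mean() * y.mean()        # 1/n ("population") moments throughout
var_x = np.mean(x**2) - x.mean()**2
var_y = np.mean(y**2) - y.mean()**2
corr = cov_xy / np.sqrt(var_x * var_y)

# np.std defaults to the 1/n convention, matching the moments above
print(beta1, cov_xy / var_x, corr * y.std() / x.std())   # all three agree
```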