Lec-5
Introduction to NumPy
- Normal distribution
```python
from numpy import random

x = random.normal(size=(2, 3))
print(x)
```
Three parameters:
- loc - the mean
- scale - the standard deviation
- size - the shape of the returned array
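A minimal sketch showing all three parameters together (the specific values are arbitrary):

```python
from numpy import random

# 2x3 samples from a normal distribution with mean 5 and standard deviation 2
x = random.normal(loc=5.0, scale=2.0, size=(2, 3))
print(x.mean(), x.std())
```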
Vector operations
- inner
- outer
- dot
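A quick sketch of the three operations on small vectors:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(np.inner(a, b))  # scalar: 1*4 + 2*5 + 3*6 = 32
print(np.outer(a, b))  # 3x3 matrix of all pairwise products
print(np.dot(a, b))    # same as inner for 1-D arrays
```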
sum adds all the elements together
cumsum returns the cumulative sum over the flattened array as a 1-D array; with the axis argument set, it returns an array of the same shape, accumulated along that axis
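For example (values chosen arbitrarily):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(A.sum())           # 21
print(A.cumsum())        # [ 1  3  6 10 15 21]  (flattened, 1-D)
print(A.cumsum(axis=0))  # same shape as A, accumulated down the columns
```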
Slicing arrays
```python
import numpy as np

a = np.random.random((4, 5))
```
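A few representative slices (a sketch; any 4×5 array works):

```python
import numpy as np

a = np.random.random((4, 5))

print(a[1, :])      # second row
print(a[:, 2])      # third column
print(a[1:3, ::2])  # rows 1 and 2, every other column
print(a[-1, -2:])   # last row, last two columns
```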
Iterating over arrays
- Iterating over multidimensional arrays is done with respect to the first axis: for row in A
- Looping over all elements: for element in A.flat
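For instance:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)

for row in A:           # iterates over the first axis: two rows
    print(row)

for element in A.flat:  # iterates over all six scalar elements
    print(element)
```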
Reshaping
- Reshape
  - using reshape; the total size must remain the same
- Resize
  - using resize, which always works: it chops off elements or appends zeros as needed
  - the first dimension has ‘priority’, so beware of unexpected results
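A short sketch of the difference (the array contents are arbitrary):

```python
import numpy as np

a = np.arange(6)                  # [0 1 2 3 4 5]
print(a.reshape(2, 3))            # works: 6 elements -> shape (2, 3)
# a.reshape(4, 2)                 # would fail: total size must stay the same

b = np.arange(6)
b.resize((4, 2), refcheck=False)  # in-place resize always works: pads with zeros
print(b)                          # [[0 1] [2 3] [4 5] [0 0]]
```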
Linear algebra
```python
import numpy.linalg

qr            # computes the QR decomposition
cholesky      # computes the Cholesky decomposition
inv(A)        # inverse of A
solve(A, b)   # solves Ax = b for A of full rank
lstsq(A, b)   # solves argmin_x ||Ax - b||_2
eig(A)        # eigenvalue decomposition
eigh(A)       # eigenvalue decomposition for symmetric or Hermitian matrices
eigvals(A)    # computes eigenvalues
svd(A, full)  # singular value decomposition
pinv(A)       # computes the pseudo-inverse of A
```
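A small example of solve and lstsq (the matrices are chosen arbitrarily):

```python
import numpy as np
import numpy.linalg as la

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = la.solve(A, b)               # exact solution of Ax = b for full-rank A
print(x, np.allclose(A @ x, b))  # [2. 3.] True

# least squares for an overdetermined system
M = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.5])
coef, *_ = la.lstsq(M, y, rcond=None)  # argmin_x ||Mx - y||_2
print(coef)
```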
Fourier transform
```python
import numpy.fft

fft    # 1-dimensional DFT
fft2   # 2-dimensional DFT
fftn   # n-dimensional DFT
ifft   # 1-dimensional inverse DFT
rfft   # real DFT (1-dim)
irfft  # inverse real DFT (1-dim)
```
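A tiny check that the transforms invert each other (the signal is arbitrary):

```python
import numpy as np

x = np.random.randn(8)

X = np.fft.fft(x)                                  # 1-D DFT (complex output)
print(np.allclose(np.fft.ifft(X), x))              # True

Xr = np.fft.rfft(x)                                # real DFT: non-negative frequencies only
print(np.allclose(np.fft.irfft(Xr, n=len(x)), x))  # True
```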
Random sampling
```python
import numpy.random

rand(d0, d1, ..., dn)     # random values in a given shape
randn(d0, d1, ..., dn)    # random standard normal
randint(lo, hi, size)     # random integers in [lo, hi)
choice(a, size, repl, p)  # sample from a
shuffle(a)                # permutation (in-place)
permutation(a)            # permutation (new array)
```
Distributions in random
- beta
- binomial
- chisquare
- exponential
- dirichlet
- gamma
- laplace
- lognormal
- pareto
- power
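Each of these is available as a function in numpy.random; a couple of examples (the parameter values are arbitrary):

```python
import numpy as np

print(np.random.beta(a=2.0, b=5.0, size=3))           # Beta(2, 5) samples
print(np.random.binomial(n=10, p=0.3, size=3))        # Binomial(10, 0.3) samples
print(np.random.exponential(scale=1.5, size=3))       # Exponential with mean 1.5
print(np.random.gamma(shape=2.0, scale=2.0, size=3))  # Gamma(2, 2) samples
```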
SciPy
A library of algorithms and mathematical tools built to work with NumPy arrays.
linear algebra - scipy.linalg
- can be faster than numpy.linalg
- some additional functions
- slightly different in places
statistics - scipy.stats
- Mean, median, mode, variance, kurtosis
- Pearson correlation coefficient
- Hypothesis test
- Gaussian kernel density estimation
optimization - scipy.optimize
- General purpose minimization: CG, BFGS, least-squares
- Constrained minimization; non-negative least-squares
- Minimize using simulated annealing
- Scalar function minimization
- Root finding
- Check gradient function
- Line search
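A minimal sketch of general-purpose minimization with scipy.optimize.minimize (the objective function here is made up):

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    """A simple quadratic bowl with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

res = minimize(f, x0=np.zeros(2), method='BFGS')
print(res.x)        # approximately [ 1. -2.]
print(res.success)  # True
```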
sparse matrices - scipy.sparse
- Sparse matrix classes: CSC, CSR, etc.
- Functions to build sparse matrices
- sparse.linalg module for sparse linear algebra
- sparse.csgraph for sparse graph routines
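For example, building a small CSR matrix and solving a sparse linear system (values arbitrary):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

# a 3x3 sparse matrix in CSR format
A = sparse.csr_matrix(np.array([[4.0, 1.0, 0.0],
                                [1.0, 3.0, 0.0],
                                [0.0, 0.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

print(A.nnz)          # number of stored (non-zero) entries
print(spsolve(A, b))  # solves the sparse linear system Ax = b
```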
signal processing - scipy.signal
- Convolutions
- B-splines
- Filtering
- Continuous-time linear system
- Wavelets
- Peak finding
etc
Matplotlib
- Plotting library for Python
- Works well with Numpy
- Syntax similar to Matlab
- Scatter Plot
Adding titles and labels
Adding multiple lines and a legend
```python
import numpy as np
import matplotlib.pyplot as plt
```
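A sketch of such a plot, with a title, axis labels, two lines, and a legend (the data and file name are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

plt.plot(x, np.sin(x), label='sin(x)')
plt.plot(x, np.cos(x), label='cos(x)')
plt.title('Two lines with a legend')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.savefig('lines.pdf')
```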
Seaborn makes plots pretty
```python
import numpy as np
import seaborn as sns
```
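A minimal sketch: applying seaborn's default style changes the look of subsequent matplotlib plots (the data and file name are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # switch matplotlib to seaborn's default style

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.savefig('seaborn_style.pdf')
```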
Histogram
```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(6, 3))

# histogram (pdf); density=True replaces the deprecated normed=True
ax1.hist(data, bins=30, density=True, color='b')

# empirical cdf
ax2.hist(data, bins=30, density=True, color='r', cumulative=True)

plt.savefig('histogram.pdf')
```
Box Plot
```python
import numpy as np
import matplotlib.pyplot as plt

samp1 = np.random.normal(loc=0.0, scale=1.0, size=100)
samp2 = np.random.normal(loc=1.0, scale=2.0, size=100)
samp3 = np.random.normal(loc=0.3, scale=1.2, size=100)

f, ax = plt.subplots(1, 1, figsize=(5, 4))
ax.boxplot((samp1, samp2, samp3))
ax.set_xticklabels(['sample1', 'sample2', 'sample3'])
plt.savefig('boxplot.pdf')
```
Image Plot
```python
import numpy as np
import matplotlib.pyplot as plt

A = np.random.random((100, 100))

plt.imshow(A)
plt.hot()       # switch to the 'hot' colormap
plt.colorbar()
plt.savefig('imageplot.pdf')
```
Wire Plot
Matplotlib toolkits extend the functionality for other kinds of visualization.
```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d

ax = plt.subplot(111, projection='3d')
X, Y, Z = axes3d.get_test_data(0.1)
ax.plot_wireframe(X, Y, Z, linewidth=0.1)
plt.savefig('wire.pdf')
```
scikit-learn algorithm cheat-sheet
Lec-6
Decision Tree
- the decision tree representation
- the standard top-down approach to learning a tree
- Occam’s razor
- entropy and information gain
- types of decision-tree splits
A decision tree to predict heart disease
Decision tree exercise
- Suppose X1…X5 are Boolean features, and Y is also Boolean
- How would you represent the following with decision trees?
- Y = X2 ∧ X5 (if either is false, then Y is false)
- Y = X2 ∨ X5 (if either is true, then Y is true)
Exercise:
- Y = (X2 ∧ X5) ∨ (X3 ∧ ¬X1)
Lec-7 Decision Tree
Topic
- A general top-down algorithm
- How to do splitting on numeric features
- Occam’s razor
History of decision tree learning
- Decision tree learning algorithms
  - ID3, or Iterative Dichotomizer
    - one property is tested at each node of the tree
    - maximizing information gain
  - CART, or Classification and Regression Trees
    - binary trees
    - splits are selected using the twoing criterion
  - C4.5, an improved version of ID3
Decision tree learning algorithms
Top-down decision tree learning
Step (1): DetermineCandidateSplits
Candidate splits on numeric features
- given a set of training instances D and a specific feature Xi
- sort the values of Xi in D
- evaluate split thresholds in intervals between instances of different classes
- could use midpoint of each considered interval as the threshold
- C4.5 instead picks the largest value of Xi in the entire training set that does not exceed the midpoint
- In more detail
- Example
- Candidate splits
- instead of using k-way splits for k-valued features, we could require binary splits on all discrete features (CART does this)
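A minimal sketch of generating candidate thresholds for one numeric feature, roughly following the procedure above (the toy data and the midpoint rule are illustrative choices):

```python
import numpy as np

# toy data: values of one numeric feature X_i and binary class labels
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 1, 1, 0, 1])

# sort the instances by the feature value
order = np.argsort(values)
v, y = values[order], labels[order]

# candidate thresholds: midpoints of intervals whose endpoints have different classes
candidates = [(v[j] + v[j + 1]) / 2.0
              for j in range(len(v) - 1)
              if y[j] != y[j + 1]]
print(candidates)  # [2.5, 4.5, 5.5]
```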
Step (2): FindBestSplit
Finding the best split
How should we select the best feature to split on at each step?
Key hypothesis: the simplest tree that classifies the training instances accurately will work well on previously unseen instances
Can we find and return the smallest possible decision tree that accurately classifies the training set?
This is an NP-hard problem. Instead, we'll use an information-theoretic heuristic to greedily choose splits.
Occam’s razor
- attributed to the 14th-century William of Ockham
- “when you have two competing theories that make exactly the same predictions, the simpler one is the better”
Occam’s razor and decision trees
- Why is Occam’s razor a reasonable heuristic for decision tree learning?
  - there are fewer short models (i.e. small trees) than long ones
  - a short model is unlikely to fit the training data well by chance
  - a long model is more likely to fit the training data well coincidentally
Recap: Expected Value(Finite Case)
- Let X be a random variable with a finite number of finite outcomes $x_1, x_2, \ldots, x_k$ occurring with probabilities $p_1, p_2, \ldots, p_k$, respectively. The expectation of X is defined as $E[X] = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k = \sum_{i=1}^{k} x_i p_i$
- Expectation is a weighted average
- Example
Lec-8 Decision Tree
Topics
- Entropy and information gain
- Types of decision-tree splits
Recap: Expected Value Example
Information theory background
- consider a problem in which you are using a code to communicate information to receiver
- Example: as bikes go past, you are communicating the manufacturer of each bike
- suppose there are only four types of bikes
- we could use the following code
- expected number of bits we have to communicate
- 2 bits/bike
- we can do better if the bike types aren’t equiprobable
- expected number of bits we have to communicate
- optimal code uses -log2P(y) bits for event with probability P(y)
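As a worked illustration (the specific probabilities are an assumption, since the original table is not in these notes): with four equiprobable bike types each message needs $-\log_2 \tfrac{1}{4} = 2$ bits, while with probabilities $\tfrac12, \tfrac14, \tfrac18, \tfrac18$ and code lengths of 1, 2, 3, 3 bits the expected number of bits is

$$\tfrac12 \cdot 1 + \tfrac14 \cdot 2 + \tfrac18 \cdot 3 + \tfrac18 \cdot 3 = 1.75 \text{ bits/bike}$$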
Entropy
entropy is a measure of the uncertainty associated with a random variable
defined as the expected number of bits required to communicate the value of the variable
- entropy function for a binary variable
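Written out, this is the standard definition:

$$H(Y) = -\sum_{y} P(Y = y)\,\log_2 P(Y = y)$$

For a binary variable with $P(Y = 1) = p$ this becomes $H(Y) = -p \log_2 p - (1 - p)\log_2(1 - p)$, which is 0 when $p \in \{0, 1\}$ and maximal (1 bit) at $p = \tfrac12$.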
Conditional entropy
quantifies the amount of information needed to describe the outcome of a random variable given that the value of another random variable is known.
What’s the entropy of Y if we condition on some other variable X?
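In symbols:

$$H(Y \mid X) = \sum_{x} P(X = x)\, H(Y \mid X = x), \qquad H(Y \mid X = x) = -\sum_{y} P(Y = y \mid X = x)\,\log_2 P(Y = y \mid X = x)$$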
Example
Information gain
choosing splits in ID3: select the split that most reduces the conditional entropy of Y for training set D
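That reduction is the information gain, $\text{InfoGain}(D, S) = H_D(Y) - H_D(Y \mid S)$. A small sketch of computing it from label counts (the toy feature and labels below are made up):

```python
import numpy as np

def entropy(labels):
    """H(Y) = -sum_y P(y) log2 P(y), estimated from a list of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(feature_values, labels):
    """InfoGain = H(Y) - H(Y | split on this feature)."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    cond = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        cond += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - cond

# toy example: a binary feature that is weakly informative about the label
x = ['a', 'a', 'a', 'b', 'b', 'b']
y = [1, 1, 0, 0, 0, 1]
print(info_gain(x, y))  # about 0.08 bits
```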
example
- What’s the information gain of splitting on Humidity?
One limitation of information gain
- information gain is biased towards tests with many outcomes
- e.g. consider a feature that uniquely identifies each training instance
- splitting on this feature would result in many branches, each of which is “pure” (has instances of only one class)
- maximal information gain!
Relations between the concepts
Gain Ratio
to address this limitation, C4.5 uses a splitting criterion called gain ratio
gain ratio normalizes the information gain by the entropy of the split being considered
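Concretely, for a candidate split S that partitions dataset D into subsets $D_k$:

$$\text{GainRatio}(D, S) = \frac{\text{InfoGain}(D, S)}{\text{SplitInfo}(D, S)}, \qquad \text{SplitInfo}(D, S) = -\sum_{k} \frac{|D_k|}{|D|}\,\log_2 \frac{|D_k|}{|D|}$$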
Exercise
- Compute the following:
  - GainRatio(D, Humidity) =
  - GainRatio(D, Wind) =
  - GainRatio(D, Outlook) =
Lec-9 Decision Tree Learning
Topic:
- Stopping criteria of decision trees
- Accuracy of decision trees
- Pruning and validation dataset
- Overfitting
Stopping criteria
- We should form a leaf when
- all of the given subset of instances are of the same class
- we’ve exhausted all of the candidate splits
Accuracy of Decision Tree
Definition of Accuracy and Error
Given a set D of samples and a trained model M, the accuracy is the percentage of correctly labeled samples:

$$\text{Accuracy}(D, M) = \frac{|\{x \in D : M(x) = l_x\}|}{|D|}$$

where $l_x$ is the true label of sample $x$ and $M(x)$ gives the predicted label of $x$ by $M$.

Error is the dual concept of accuracy:

$$\text{Error}(D, M) = 1 - \text{Accuracy}(D, M)$$
Assess the accuracy of a tree
Can we just calculate the fraction of training instances that are correctly classified?
Consider a problem domain in which instances are assigned labels at random with P(Y = t) = 0.5
- how accurate would a learned decision tree be on previously unseen instances?
  - about 0.5 in expectation, since the labels are assigned at random; it can never reach 1.0
- how accurate would it be on its training set?
  - it can be arbitrarily close to, or reach, 1.0 if the model is allowed to be very large
to get an unbiased estimate of a learned model’s accuracy, we must use a set of instances that are held-aside during learning
this is called a test set
Pruning and Validation Dataset
Stopping criteria
- We should form a leaf when
  - all of the given subset of instances are of the same class
  - we've exhausted all of the candidate splits
- Is there a reason to stop earlier, or to prune back the tree?
Pruning in C4.5
- split the given data into training and validation (tuning) sets
  - a validation set (a.k.a. tuning set) is a subset of the training set that is held aside
  - not used for the primary training process (e.g. tree growing)
  - but used to select among models (e.g. trees pruned to varying degrees)
- Grow a complete tree
- do until further pruning is harmful
  - evaluate the impact on tuning-set accuracy of pruning each node
  - greedily remove the one that least reduces tuning-set accuracy