Lec-5
Introduction to NumPy
- Normal distribution
```python
from numpy import random

x = random.normal(size=(2, 3))
print(x)
```
Three parameters:
- loc - the mean
- scale - the standard deviation
- size - the shape of the returned array
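A minimal sketch showing all three parameters together (the specific values are arbitrary):

```python
from numpy import random

# 2x3 samples from a normal distribution with mean 5 and standard deviation 2
x = random.normal(loc=5.0, scale=2.0, size=(2, 3))
print(x.mean(), x.std())
```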
Vector operations
- inner
- outer
- dot
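A quick sketch of the three operations on small vectors:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(np.inner(a, b))  # scalar: 1*4 + 2*5 + 3*6 = 32
print(np.outer(a, b))  # 3x3 matrix of all pairwise products
print(np.dot(a, b))    # same as inner for 1-D arrays
```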
sum adds all the elements together
cumsum returns the cumulative sum over the flattened array as a 1-D array; with the axis argument set, it returns an array of the same shape, accumulated along that axis
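For example (values chosen arbitrarily):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(A.sum())           # 21
print(A.cumsum())        # [ 1  3  6 10 15 21]  (flattened, 1-D)
print(A.cumsum(axis=0))  # same shape as A, accumulated down the columns
```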
Slicing arrays
```python
import numpy as np

a = np.random.random((4, 5))
```
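A few representative slices (a sketch; any 4×5 array works):

```python
import numpy as np

a = np.random.random((4, 5))

print(a[1, :])      # second row
print(a[:, 2])      # third column
print(a[1:3, ::2])  # rows 1 and 2, every other column
print(a[-1, -2:])   # last row, last two columns
```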
Iterating over arrays
- Iterating over multidimensional arrays is done with respect to the first axis: for row in A
- Looping over all elements: for element in A.flat
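For instance:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)

for row in A:           # iterates over the first axis: two rows
    print(row)

for element in A.flat:  # iterates over all six scalar elements
    print(element)
```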
Reshaping
- Reshape
  - using reshape; the total size must remain the same
- Resize
  - using resize, which always works: it chops off elements or appends zeros as needed
  - the first dimension has ‘priority’, so beware of unexpected results
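A short sketch of the difference (the array contents are arbitrary):

```python
import numpy as np

a = np.arange(6)                  # [0 1 2 3 4 5]
print(a.reshape(2, 3))            # works: 6 elements -> shape (2, 3)
# a.reshape(4, 2)                 # would fail: total size must stay the same

b = np.arange(6)
b.resize((4, 2), refcheck=False)  # in-place resize always works: pads with zeros
print(b)                          # [[0 1] [2 3] [4 5] [0 0]]
```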
Linear algebra
```python
import numpy.linalg

qr            # computes the QR decomposition
cholesky      # computes the Cholesky decomposition
inv(A)        # inverse of A
solve(A, b)   # solves Ax = b for A of full rank
lstsq(A, b)   # solves argmin_x ||Ax - b||_2
eig(A)        # eigenvalue decomposition
eigh(A)       # eigenvalue decomposition for symmetric or Hermitian matrices
eigvals(A)    # computes eigenvalues
svd(A, full)  # singular value decomposition
pinv(A)       # computes the pseudo-inverse of A
```
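A small example of solve and lstsq (the matrices are chosen arbitrarily):

```python
import numpy as np
import numpy.linalg as la

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = la.solve(A, b)               # exact solution of Ax = b for full-rank A
print(x, np.allclose(A @ x, b))  # [2. 3.] True

# least squares for an overdetermined system
M = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.5])
coef, *_ = la.lstsq(M, y, rcond=None)  # argmin_x ||Mx - y||_2
print(coef)
```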
Fourier transform
```python
import numpy.fft

fft    # 1-dimensional DFT
fft2   # 2-dimensional DFT
fftn   # n-dimensional DFT
ifft   # 1-dimensional inverse DFT
rfft   # real DFT (1-dim)
irfft  # inverse real DFT (1-dim)
```
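A tiny check that the transforms invert each other (the signal is arbitrary):

```python
import numpy as np

x = np.random.randn(8)

X = np.fft.fft(x)                                  # 1-D DFT (complex output)
print(np.allclose(np.fft.ifft(X), x))              # True

Xr = np.fft.rfft(x)                                # real DFT: non-negative frequencies only
print(np.allclose(np.fft.irfft(Xr, n=len(x)), x))  # True
```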
Random sampling
```python
import numpy.random

rand(d0, d1, ..., dn)     # random values in a given shape
randn(d0, d1, ..., dn)    # random standard normal
randint(lo, hi, size)     # random integers in [lo, hi)
choice(a, size, repl, p)  # sample from a
shuffle(a)                # permutation (in-place)
permutation(a)            # permutation (new array)
```
Distributions in random
- beta
- binomial
- chisquare
- exponential
- dirichlet
- gamma
- laplace
- lognormal
- pareto
- power
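Each of these is available as a function in numpy.random; a couple of examples (the parameter values are arbitrary):

```python
import numpy as np

print(np.random.beta(a=2.0, b=5.0, size=3))           # Beta(2, 5) samples
print(np.random.binomial(n=10, p=0.3, size=3))        # Binomial(10, 0.3) samples
print(np.random.exponential(scale=1.5, size=3))       # Exponential with mean 1.5
print(np.random.gamma(shape=2.0, scale=2.0, size=3))  # Gamma(2, 2) samples
```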
SciPy
A library of algorithms and mathematical tools built to work with NumPy arrays.
linear algebra - scipy.linalg
- can be faster than numpy.linalg
- some additional functions
- slightly different in places
statistics - scipy.stats
- Mean, median, mode, variance, kurtosis
- Pearson correlation coefficient
- Hypothesis test
- Gaussian kernel density estimation
optimization - scipy.optimize
- General purpose minimization: CG, BFGS, least-squares
- Constrained minimization; non-negative least-squares
- Minimize using simulated annealing
- Scalar function minimization
- Root finding
- Check gradient function
- Line search
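A minimal sketch of general-purpose minimization with scipy.optimize.minimize (the objective function here is made up):

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    """A simple quadratic bowl with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

res = minimize(f, x0=np.zeros(2), method='BFGS')
print(res.x)        # approximately [ 1. -2.]
print(res.success)  # True
```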
sparse matrices - scipy.sparse
- Sparse matrix classes: CSC, CSR, etc.
- Functions to build sparse matrices
- sparse.linalg module for sparse linear algebra
- sparse.csgraph for sparse graph routines
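For example, building a small CSR matrix and solving a sparse linear system (values arbitrary):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

# a 3x3 sparse matrix in CSR format
A = sparse.csr_matrix(np.array([[4.0, 1.0, 0.0],
                                [1.0, 3.0, 0.0],
                                [0.0, 0.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

print(A.nnz)          # number of stored (non-zero) entries
print(spsolve(A, b))  # solves the sparse linear system Ax = b
```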
signal processing - scipy.signal
- Convolutions
- B-splines
- Filtering
- Continuous-time linear system
- Wavelets
- Peak finding
etc
Matplotlib
- Plotting library for Python
- Works well with Numpy
- Syntax similar to Matlab
- Scatter Plot
Adding titles and labels
Adding multiple lines and a legend
```python
import numpy as np
import matplotlib.pyplot as plt
```
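A sketch of such a plot, with a title, axis labels, two lines, and a legend (the data and file name are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

plt.plot(x, np.sin(x), label='sin(x)')
plt.plot(x, np.cos(x), label='cos(x)')
plt.title('Two lines with a legend')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.savefig('lines.pdf')
```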
Seaborn makes plots pretty
```python
import numpy as np
import seaborn as sns
```
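A minimal sketch: applying seaborn's default style changes the look of subsequent matplotlib plots (the data and file name are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # switch matplotlib to seaborn's default style

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.savefig('seaborn_style.pdf')
```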
Histogram
```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(6, 3))

# histogram (pdf); density=True replaces the deprecated normed=True
ax1.hist(data, bins=30, density=True, color='b')

# empirical cdf
ax2.hist(data, bins=30, density=True, color='r', cumulative=True)

plt.savefig('histogram.pdf')
```
Box Plot
```python
import numpy as np
import matplotlib.pyplot as plt

samp1 = np.random.normal(loc=0.0, scale=1.0, size=100)
samp2 = np.random.normal(loc=1.0, scale=2.0, size=100)
samp3 = np.random.normal(loc=0.3, scale=1.2, size=100)

f, ax = plt.subplots(1, 1, figsize=(5, 4))
ax.boxplot((samp1, samp2, samp3))
ax.set_xticklabels(['sample1', 'sample2', 'sample3'])
plt.savefig('boxplot.pdf')
```
Image Plot
```python
import numpy as np
import matplotlib.pyplot as plt

A = np.random.random((100, 100))

plt.imshow(A)
plt.hot()       # switch to the 'hot' colormap
plt.colorbar()
plt.savefig('imageplot.pdf')
```
Wire Plot
Matplotlib toolkits extend the functionality for other kinds of visualization.
```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d

ax = plt.subplot(111, projection='3d')
X, Y, Z = axes3d.get_test_data(0.1)
ax.plot_wireframe(X, Y, Z, linewidth=0.1)
plt.savefig('wire.pdf')
```
scikit-learn algorithm cheat-sheet
Lec-6
Decision Tree
- the decision tree representation
- the standard top-down approach to learning a tree
- Occam’s razor
- entropy and information gain
- types of decision-tree splits
A decision tree to predict heart disease
Decision tree exercise
- Suppose X1…X5 are Boolean features, and Y is also Boolean
- How would you represent the following with decision trees?
- Y = X2 ∧ X5 (if either is false, then Y is false)
- Y = X2 ∨ X5 (if either is true, then Y is true)
Exercise:
- Y = (X2 ∧ X5) ∨ (X3 ∧ ¬X1)
Lec-7 Decision Tree
Topic
- A general top-down algorithm
- How to do splitting on numeric features
- Occam’s razor
History of decision tree learning
- Decision tree learning algorithms
  - ID3, or Iterative Dichotomizer
    - one property is tested at each node of the tree
    - maximizing information gain
  - CART, or Classification and Regression Trees
    - binary trees
    - splits are selected using the twoing criterion
  - C4.5, an improved version of ID3
Decision tree learning algorithms
Top-down decision tree learning
Step (1): DetermineCandidateSplits
Candidate splits on numeric features
- given a set of training instances D and a specific feature Xi
- sort the values of Xi in D
- evaluate split thresholds in intervals between instances of different classes
- could use midpoint of each considered interval as the threshold
- C4.5 instead picks the largest value of Xi in the entire training set that does not exceed the midpoint
- In more detail
- Example
- Candidate splits
- instead of using k-way splits for k-valued features, we could require binary splits on all discrete features (CART does this)
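A minimal sketch of generating candidate thresholds for one numeric feature, roughly following the procedure above (the toy data and the midpoint rule are illustrative choices):

```python
import numpy as np

# toy data: values of one numeric feature X_i and binary class labels
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 1, 1, 0, 1])

# sort the instances by the feature value
order = np.argsort(values)
v, y = values[order], labels[order]

# candidate thresholds: midpoints of intervals whose endpoints have different classes
candidates = [(v[j] + v[j + 1]) / 2.0
              for j in range(len(v) - 1)
              if y[j] != y[j + 1]]
print(candidates)  # [2.5, 4.5, 5.5]
```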
Step (2): FindBestSplit
Finding the best split
How should we select the best feature to split on at each step?
Key hypothesis: the simplest tree that classifies the training instances accurately will work well on previously unseen instances
Can we find and return the smallest possible decision tree that accurately classifies the training set?
This is an NP-hard problem. Instead, we'll use an information-theoretic heuristic to greedily choose splits.
Occam’s razor
- attributed to the 14th-century William of Ockham
- “when you have two competing theories that make exactly the same predictions, the simpler one is the better”
Occam’s razor and decision trees
- Why is Occam’s razor a reasonable heuristic for decision tree learning?
  - there are fewer short models (i.e. small trees) than long ones
  - a short model is unlikely to fit the training data well by chance
  - a long model is more likely to fit the training data well coincidentally
Recap: Expected Value(Finite Case)
- Let X be a random variable with a finite number of finite outcomes $x_1, x_2, \ldots, x_k$ occurring with probabilities $p_1, p_2, \ldots, p_k$, respectively. The expectation of X is defined as $E[X] = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k = \sum_{i=1}^{k} x_i p_i$
- Expectation is a weighted average
- Example
Lec-8 Decision Tree
Topics
- Entropy and information gain
- Types of decision-tree splits
Recap: Expected Value Example
Information theory background
- consider a problem in which you are using a code to communicate information to receiver
- Example: as bikes go past, you are communicating the manufacturer of each bike
- suppose there are only four types of bikes
- we could use the following code
- expected number of bits we have to communicate
- 2 bits/bike
- we can do better if the bike types aren’t equiprobable
- expected number of bits we have to communicate
- optimal code uses -log2P(y) bits for event with probability P(y)
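As a worked illustration (the specific probabilities are an assumption, since the original table is not in these notes): with four equiprobable bike types each message needs $-\log_2 \tfrac{1}{4} = 2$ bits, while with probabilities $\tfrac12, \tfrac14, \tfrac18, \tfrac18$ and code lengths of 1, 2, 3, 3 bits the expected number of bits is

$$\tfrac12 \cdot 1 + \tfrac14 \cdot 2 + \tfrac18 \cdot 3 + \tfrac18 \cdot 3 = 1.75 \text{ bits/bike}$$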
Entropy
entropy is a measure of the uncertainty associated with a random variable
defined as the expected number of bits required to communicate the value of the variable
- entropy function for a binary variable
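Written out, this is the standard definition:

$$H(Y) = -\sum_{y} P(Y = y)\,\log_2 P(Y = y)$$

For a binary variable with $P(Y = 1) = p$ this becomes $H(Y) = -p \log_2 p - (1 - p)\log_2(1 - p)$, which is 0 when $p \in \{0, 1\}$ and maximal (1 bit) at $p = \tfrac12$.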
Conditional entropy
quantifies the amount of information needed to describe the outcome of a random variable given that the value of another random variable is known.
What’s the entropy of Y if we condition on some other variable X?
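In symbols:

$$H(Y \mid X) = \sum_{x} P(X = x)\, H(Y \mid X = x), \qquad H(Y \mid X = x) = -\sum_{y} P(Y = y \mid X = x)\,\log_2 P(Y = y \mid X = x)$$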
Example
Information gain
choosing splits in ID3: select the split that most reduces the conditional entropy of Y for training set D
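That reduction is the information gain, $\text{InfoGain}(D, S) = H_D(Y) - H_D(Y \mid S)$. A small sketch of computing it from label counts (the toy feature and labels below are made up):

```python
import numpy as np

def entropy(labels):
    """H(Y) = -sum_y P(y) log2 P(y), estimated from a list of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(feature_values, labels):
    """InfoGain = H(Y) - H(Y | split on this feature)."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)
    cond = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        cond += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - cond

# toy example: a binary feature that is weakly informative about the label
x = ['a', 'a', 'a', 'b', 'b', 'b']
y = [1, 1, 0, 0, 0, 1]
print(info_gain(x, y))  # about 0.08 bits
```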
example
- What’s the information gain of splitting on Humidity?
One limitation of information gain
- information gain is biased towards tests with many outcomes
- e.g. consider a feature that uniquely identifies each training instance
- splitting on this feature would result in many branches, each of which is “pure” (has instances of only one class)
- maximal information gain!
Relations between the concepts
Gain Ratio
to address this limitation, C4.5 uses a splitting criterion called gain ratio
gain ratio normalizes the information gain by the entropy of the split being considered
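Concretely, for a candidate split S that partitions dataset D into subsets $D_k$:

$$\text{GainRatio}(D, S) = \frac{\text{InfoGain}(D, S)}{\text{SplitInfo}(D, S)}, \qquad \text{SplitInfo}(D, S) = -\sum_{k} \frac{|D_k|}{|D|}\,\log_2 \frac{|D_k|}{|D|}$$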
Exercise
- Compute the following:
  - GainRatio(D, Humidity) =
  - GainRatio(D, Wind) =
  - GainRatio(D, Outlook) =
Lec-9 Decision Tree Learning
Topic:
- Stopping criteria of decision trees
- Accuracy of decision trees
- Pruning and validation dataset
- Overfitting
Stopping criteria
- We should form a leaf when
- all of the given subset of instances are of the same class
- we’ve exhausted all of the candidate splits
Accuracy of Decision Tree
Definition of Accuracy and Error
Given a set D of samples and a trained model M, the accuracy is the percentage of correctly labeled samples:

$$\text{Accuracy}(D, M) = \frac{|\{x \in D : M(x) = l_x\}|}{|D|}$$

where $l_x$ is the true label of sample $x$ and $M(x)$ gives the predicted label of $x$ by $M$.

Error is the dual concept of accuracy:

$$\text{Error}(D, M) = 1 - \text{Accuracy}(D, M)$$
Assess the accuracy of a tree
Can we just calculate the fraction of training instances that are correctly classified?
Consider a problem domain in which instances are assigned labels at random with P(Y = t) = 0.5
- how accurate would a learned decision tree be on previously unseen instances?
  - about 0.5 in expectation, since the labels are assigned at random; it can never reach 1.0
- how accurate would it be on its training set?
  - it can be arbitrarily close to, or reach, 1.0 if the model is allowed to be very large
to get an unbiased estimate of a learned model’s accuracy, we must use a set of instances that are held-aside during learning
this is called a test set
Pruning and Validation Dataset
Stopping criteria
- We should form a leaf when
  - all of the given subset of instances are of the same class
  - we've exhausted all of the candidate splits
- Is there a reason to stop earlier, or to prune back the tree?
Pruning in C4.5
- split the given data into training and validation (tuning) sets
  - a validation set (a.k.a. tuning set) is a subset of the training set that is held aside
  - not used for the primary training process (e.g. tree growing)
  - but used to select among models (e.g. trees pruned to varying degrees)
- Grow a complete tree
- do until further pruning is harmful
  - evaluate the impact on tuning-set accuracy of pruning each node
  - greedily remove the one that least reduces tuning-set accuracy