Lec10-Overfitting and K-Nearest Neighbour
Topic
- Overfitting
- K-NN classification
Regression using polynomials
General phenomenon
Overfitting in decision trees
Prevent overfitting
- cause: training error and expected error are different
- there may be noise in the training data
- the training data is of limited size: the smaller the training set, the easier it is to find a hypothesis that fits the difference between the training data and the true distribution
- prevent overfitting:
- cleaner training data helps!
- more training data helps!
- throwing away unnecessary hypotheses helps!
Overfitting in Decision Tree
Overfitting
Consider the error of a model M over
- training data: Error(Dtraining, M)
- entire distribution of Data: Error(Dtrue, M)
Model M $\in$ H overfits the training data if there is an alternative model M’ $\in$ H such that Error(Dtraining, M) < Error(Dtraining, M’) but Error(Dtrue, M) > Error(Dtrue, M’)
Example: overfitting with noisy data
- Suppose
- the target concept is Y = X1 $\bigwedge$ X2
- there is noise in some feature values
- we’re given the following training set
Avoiding overfitting in DT learning
- two general strategies to avoid overfitting
- early stopping: stop if further splitting not justified by a statistical test
- Quinlan’s original approach in ID3
- post-pruning: grow a large tree, then prune back some nodes
- more robust to myopia of greedy tree learning
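As a concrete illustration of post-pruning, here is a minimal sketch of reduced-error pruning, one standard post-pruning method (the lecture may use a different criterion); the `TreeNode` representation and the `predict`/`accuracy` helpers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical tree representation, for illustration only
@dataclass
class TreeNode:
    majority_class: str                          # most common class among training examples at this node
    feature: Optional[str] = None                # None => this node is a leaf
    children: Dict[str, "TreeNode"] = field(default_factory=dict)

def predict(node: TreeNode, x: Dict[str, str]) -> str:
    while node.feature is not None:
        child = node.children.get(x[node.feature])
        if child is None:                        # unseen feature value: fall back to the majority class
            break
        node = child
    return node.majority_class

def accuracy(root: TreeNode, data) -> float:
    # data: list of (feature-dict, class) pairs, e.g. a held-out validation set
    return sum(predict(root, x) == y for x, y in data) / len(data)

def reduced_error_prune(root: TreeNode, node: TreeNode, validation) -> None:
    """Bottom-up: tentatively turn each internal node into a leaf predicting its
    majority class, and keep the change only if validation accuracy does not drop."""
    if node.feature is None:
        return
    for child in node.children.values():
        reduced_error_prune(root, child, validation)
    before = accuracy(root, validation)
    saved = (node.feature, node.children)
    node.feature, node.children = None, {}       # prune: this node becomes a leaf
    if accuracy(root, validation) < before:
        node.feature, node.children = saved      # pruning hurt, so revert
```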
Nearest-neighbor classification
- learning stage
- given a training set (x(1), y(1)), …, (x(m), y(m)), do nothing
- (it’s sometimes called a lazy learner)
- classification stage
- given: an instance x(q) to classify
- find the training-set instance x(i) that is most similar to x(q)
- return the class value y(i)
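A minimal sketch of both stages, assuming numeric features and Euclidean distance (the function name and toy data are illustrative):

```python
import numpy as np

def nearest_neighbor_classify(X_train, y_train, x_q):
    """'Learning' is just storing (X_train, y_train); classification finds the single
    training instance most similar to x_q and returns its class."""
    distances = np.linalg.norm(X_train - x_q, axis=1)   # Euclidean distance to every training instance
    i = np.argmin(distances)                            # index of the most similar instance x(i)
    return y_train[i]                                   # its class value y(i)

# toy usage
X_train = np.array([[1.0, 1.0], [2.0, 2.5], [8.0, 8.0]])
y_train = np.array(["A", "A", "B"])
print(nearest_neighbor_classify(X_train, y_train, np.array([7.5, 8.5])))   # -> B
```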
Nearest Neighbor
- When to consider
- Less than 20 attributes per instance
- Lots of training data
- Advantages
- Training is very fast
- Learn complex target function
- Disadvantages
- Slow at query time
- Easily fooled by irrelevant attributes
- To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class
The decision regions for nearest-neighbor classification
- Voronoi diagram: each polyhedron indicates the region of feature space that is in the nearest neighborhood of each training instance
Lec11-KNN(cont.)
Topic
Definition
- How can we determine similarity/distance
- Standardizing numeric features (leave this to you)
Speeding up k-NN
- edited nearest neighbour
- k-d trees for nearest neighbour identification
Variants of k-NN
- k-NN regression
- Distance-weighted nearest neighbor
- Locally weighted regression to handle irrelevant features
Discussions
- Strengths and limitations of instance-based learning
- Inductive bias
Definition
K-nearest-neighbor classification
- Classification task
- given: an instance x(q) to classify
- find the k training-set instances (x(1), y(1)), …, (x(k), y(k)) that are the most similar to x(q)
- return the class value
- (i.e. return the class that occurs most frequently among the k training instances; see the sketch below)
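A minimal sketch of this classification stage, again assuming numeric features and Euclidean distance; ties between classes are broken arbitrarily by `Counter.most_common`:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=3):
    """Return the majority class among the k training instances most similar to x_q."""
    distances = np.linalg.norm(X_train - x_q, axis=1)   # distance to every training instance
    nearest = np.argsort(distances)[:k]                 # indices of the k closest instances
    votes = Counter(y_train[i] for i in nearest)        # count the neighbours' class labels
    return votes.most_common(1)[0][0]                   # class with the most votes
```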
How can we determine similarity/distance
- suppose all features are categorical
- Hamming distance (or L0 norm): count the number of features for which two instances differ
- Example: X = (Weekday, Happy?, Weather) Y = AttendLecture
How can we determine similarity/distance
![](https://pic.imgdb.cn/item/6197ed042ab3f51d91266b92.png)
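A minimal sketch of the Hamming distance for categorical features; the two instances below (with features Weekday, Happy?, Weather) are made-up values for illustration:

```python
def hamming_distance(x1, x2):
    """Count the number of features on which two instances differ."""
    return sum(a != b for a, b in zip(x1, x2))

# hypothetical instances with features (Weekday, Happy?, Weather)
x_a = ("Mon", "Yes", "Sunny")
x_b = ("Mon", "No",  "Rainy")
print(hamming_distance(x_a, x_b))   # -> 2 (they differ on Happy? and Weather)
```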
How can we determine similarity/distance
- suppose all features are continuous; the standard distance definitions are recalled below
- Euclidean distance
- Manhattan distance
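For reference, the standard definitions for two instances $x, x'$ with $n$ numeric features:

$$
d_{\text{Euclidean}}(x, x') = \sqrt{\sum_{j=1}^{n} (x_j - x'_j)^2}
\qquad\qquad
d_{\text{Manhattan}}(x, x') = \sum_{j=1}^{n} |x_j - x'_j|
$$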
Standardizing numeric features
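The details are left as an exercise (see the topic list), but for reference here is a minimal sketch of one common choice, z-score standardization: each feature is shifted and scaled using the training-set mean and standard deviation so that no single feature dominates the distance computation.

```python
import numpy as np

def standardize(X_train, X_test):
    """Rescale each feature to zero mean and unit variance, using statistics
    estimated on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # avoid division by zero for constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```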
Speeding up k-NN
Issues
- Choosing k
- Increasing k reduces variance but increases bias
- For high-dimensional spaces, the problem is that the nearest neighbor may not be very near at all!
- Memory-based technique. Must make a pass through the data for each classification. This can be prohibitive for large data sets
Nearest neighbour problem
- Given a sample S = ((x1, y1), …, (xm, ym)) and a test point x,
- the task is to find the k nearest neighbours of x
- Note: for these algorithms, the dimensionality N, i.e., the number of features, is crucial
Efficient Indexing: N = 2
- Algorithm
- compute the Voronoi diagram in O(m log m)
- see algorithm …
- use a point-location data structure to determine nearest neighbours
Efficient Indexing: N > 2
- Two general strategies for alleviating this weakness
- don’t retain every training instance (edited nearest neighbor)
- pre-processing: use a smart data structure to look up nearest neighbors (e.g. a k-d tree)
Edited instance-based learning
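The slide details are not reproduced here; as one common instance of the "don't retain every training instance" idea, here is a minimal sketch of Wilson editing, which discards training instances that are misclassified by their own k nearest neighbours (a small `knn_classify` helper is repeated so the snippet is self-contained):

```python
import numpy as np
from collections import Counter

def knn_classify(X, y, x_q, k):
    nearest = np.argsort(np.linalg.norm(X - x_q, axis=1))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

def wilson_edit(X_train, y_train, k=3):
    """Keep only the instances whose label agrees with the vote of their k nearest
    neighbours among the *other* training instances; the reduced set is what the
    final nearest-neighbour classifier stores."""
    keep = []
    for i in range(len(X_train)):
        others = np.delete(np.arange(len(X_train)), i)          # leave instance i out
        if knn_classify(X_train[others], y_train[others], X_train[i], k) == y_train[i]:
            keep.append(i)
    return X_train[keep], y_train[keep]
```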
k-d trees
Construction of k-d tree
Finding nearest neighbors with a k-d tree
- use branch-and-bound search
- priority queue stores
- nodes considered
- lower bound on their distance to query instance
- lower bound given by distance using a single feature
- average case: $O(\log_2 m)$
- worst case: $O(m)$, where m is the size of the training set
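A minimal, self-contained sketch of the idea, using squared Euclidean distance (the example slide uses Manhattan distance, but the bookkeeping is the same): the tree is built by splitting on features round-robin at the median, and the best-first search keeps a priority queue of subtrees ordered by a lower bound on their distance to the query, computed from the single splitting feature.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    point: Tuple[float, ...]            # training instance stored at this node
    axis: int                           # feature this node splits on
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_kdtree(points: List[Tuple[float, ...]], depth: int = 0) -> Optional[Node]:
    """Split on features in round-robin order, at the median along the chosen feature."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1))

def nearest_neighbor(root: Node, query: Tuple[float, ...]) -> Tuple[float, ...]:
    """Best-first branch-and-bound search: the priority queue holds subtrees keyed by a
    lower bound on their (squared) distance to the query; a subtree is skipped when its
    bound is no better than the best distance found so far."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best_point, best_d2 = root.point, d2(root.point, query)
    heap = [(0.0, 0, root)]             # (lower bound, tie-breaker, node)
    tie = 1
    while heap:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d2:            # nothing left in the queue can improve the answer
            break
        dist = d2(node.point, query)
        if dist < best_d2:
            best_point, best_d2 = node.point, dist
        diff = query[node.axis] - node.point[node.axis]          # signed gap on the split feature
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        if near is not None:            # same bound: query may lie inside this half-space
            heapq.heappush(heap, (bound, tie, near)); tie += 1
        if far is not None:             # crossing the split costs at least diff**2
            heapq.heappush(heap, (max(bound, diff * diff), tie, far)); tie += 1
    return best_point

# toy usage
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest_neighbor(tree, (9, 2)))   # -> (8, 1)
```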
k-d tree example (Manhattan distance)
Extended Materials: Voronoi Diagram Generation
- https://en.wikipedia.org/wiki/Voronoi_diagram
- https://courses.cs.washington.edu/courses/cse326/00wi/projects/voronoi.html
Variants of k-NN
k-NN regression
- learning stage
- given a training set (x(1), y(1)), …, (x(m), y(m)), do nothing
- (it’s sometimes called a lazy learner)
- prediction stage
- given: an instance x(q) whose target value we want to predict
- find the k training-set instances (x(1), y(1)), …, (x(k), y(k)) that are most similar to x(q)
- return the value $\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y^{(i)}$, i.e. the mean of the neighbours’ target values (see the sketch below)
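A minimal sketch, again assuming numeric features and Euclidean distance:

```python
import numpy as np

def knn_regress(X_train, y_train, x_q, k=3):
    """Return the mean target value of the k training instances most similar to x_q."""
    nearest = np.argsort(np.linalg.norm(X_train - x_q, axis=1))[:k]
    return y_train[nearest].mean()
```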
Distance-weighted nearest neighbor
- We can have instances contribute to a prediction according to their distance from x(q)
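One common weighting scheme (an illustrative choice, not necessarily the lecture's) uses the inverse squared distance, so that closer neighbours count for more:

$$
w_i = \frac{1}{d(x^{(q)}, x^{(i)})^2},
\qquad
\hat{y}(x^{(q)}) = \frac{\sum_{i=1}^{k} w_i\, y^{(i)}}{\sum_{i=1}^{k} w_i}
$$

For classification, the neighbours' votes are weighted by $w_i$ instead of counting each vote equally; if $x^{(q)}$ coincides exactly with a training instance, that instance's value is returned directly to avoid division by zero.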