Lec10-Overfitting and K-Nearest Neighbour
Topic
- Overfitting
- K-NN classification
Regression using polynomials
General phenomenon
Overfitting in decision trees
Prevent overfitting
- cause: training error and expected error are different
- there may be noise in the training data
- the training data is of limited size: the smaller the training set, the easier it is to find a hypothesis that fits the difference between the training data and the true distribution
- prevent overfitting:
- cleaner training data helps!
- more training data helps!
- throwing away unnecessary hypotheses helps!
Overfitting in Decision Tree
Overfitting
Consider the error of a model M over
- training data: Error(Dtraining, M)
- entire distribution of Data: Error(Dtrue, M)
Model M $\in$ H overfits the training data if there is an alternative model M’ $\in$ H such that Error(Dtraining, M) < Error(Dtraining, M’) but Error(Dtrue, M) > Error(Dtrue, M’)
Example: overfitting with noisy data
- Suppose
- the target concept is Y = X1 $\bigwedge$ X2
- there is noise in some feature values
- we’re given the following training set
Avoiding overfitting in DT learning
- two general strategies to avoid overfitting
- early stopping: stop if further splitting not justified by a statistical test
- Quinlan’s original approach in ID3
- post-pruning: grow a large tree, then prune back some nodes
- more robust to myopia of greedy tree learning
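As a concrete illustration of post-pruning, here is a minimal sketch of reduced-error pruning, one standard post-pruning method (the lecture may use a different criterion); the `TreeNode` representation and the `predict`/`accuracy` helpers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical tree representation, for illustration only
@dataclass
class TreeNode:
    majority_class: str                          # most common class among training examples at this node
    feature: Optional[str] = None                # None => this node is a leaf
    children: Dict[str, "TreeNode"] = field(default_factory=dict)

def predict(node: TreeNode, x: Dict[str, str]) -> str:
    while node.feature is not None:
        child = node.children.get(x[node.feature])
        if child is None:                        # unseen feature value: fall back to the majority class
            break
        node = child
    return node.majority_class

def accuracy(root: TreeNode, data) -> float:
    # data: list of (feature-dict, class) pairs, e.g. a held-out validation set
    return sum(predict(root, x) == y for x, y in data) / len(data)

def reduced_error_prune(root: TreeNode, node: TreeNode, validation) -> None:
    """Bottom-up: tentatively turn each internal node into a leaf predicting its
    majority class, and keep the change only if validation accuracy does not drop."""
    if node.feature is None:
        return
    for child in node.children.values():
        reduced_error_prune(root, child, validation)
    before = accuracy(root, validation)
    saved = (node.feature, node.children)
    node.feature, node.children = None, {}       # prune: this node becomes a leaf
    if accuracy(root, validation) < before:
        node.feature, node.children = saved      # pruning hurt, so revert
```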
Nearest-neighbor classification
- learning stage
- given a training set (x(1), y(1)), …, (x(m), y(m)), do nothing
- (it’s sometimes called a lazy learner)
- classification stage
- given: an instance x(q) to classify
- find the training-set instance x(i) that is most similar to x(q)
- return the class value y(i)
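A minimal sketch of both stages, assuming numeric features and Euclidean distance (the function name and toy data are illustrative):

```python
import numpy as np

def nearest_neighbor_classify(X_train, y_train, x_q):
    """'Learning' is just storing (X_train, y_train); classification finds the single
    training instance most similar to x_q and returns its class."""
    distances = np.linalg.norm(X_train - x_q, axis=1)   # Euclidean distance to every training instance
    i = np.argmin(distances)                            # index of the most similar instance x(i)
    return y_train[i]                                   # its class value y(i)

# toy usage
X_train = np.array([[1.0, 1.0], [2.0, 2.5], [8.0, 8.0]])
y_train = np.array(["A", "A", "B"])
print(nearest_neighbor_classify(X_train, y_train, np.array([7.5, 8.5])))   # -> B
```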
Nearest Neighbor
- When to consider
- Less than 20 attributes per instance
- Lots of training data
- Advantages
- Training is very fast
- Learn complex target function
- Disadvantages
- Slow at query time
- Easily fooled by irrelevant attributes
- To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class
The decision regions for nearest-neighbor classification
- Voronoi diagram: each polyhedron indicates the region of feature space that is in the nearest neighborhood of each training instance
Lec11-KNN(cont.)
Topic
Definition
- How can we determine similarity/distance
- Standardizing numeric features (leave this to you)
Speeding up k-NN
- edited nearest neighbour
- k-d trees for nearest neighbour identification
Variants of k-NN
- k-NN regression
- Distance-weighted nearest neighbor
- Locally weighted regression to handle irrelevant features
Discussions
- Strengths and limitations of instance-based learning
- Inductive bias
Definition
K-nearest-neighbor classification
- Classification task
- given: an instance x(q) to classify
- find the k training-set instances (x(1), y(1)), …, (x(k), y(k)) that are the most similar to x(q)
- return the class value
- (i.e. return the class that occurs most frequently among the k training instances; see the sketch below)
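A minimal sketch of this classification stage, again assuming numeric features and Euclidean distance; ties between classes are broken arbitrarily by `Counter.most_common`:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=3):
    """Return the majority class among the k training instances most similar to x_q."""
    distances = np.linalg.norm(X_train - x_q, axis=1)   # distance to every training instance
    nearest = np.argsort(distances)[:k]                 # indices of the k closest instances
    votes = Counter(y_train[i] for i in nearest)        # count the neighbours' class labels
    return votes.most_common(1)[0][0]                   # class with the most votes
```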
How can we determine similarity/distance
- suppose all features are categorical
- Hamming distance (or L0 norm): count the number of features for which two instances differ
- Example: X = (Weekday, Happy?, Weather) Y = AttendLecture
How can we determine similarity/distance
![](https://pic.imgdb.cn/item/6197ed042ab3f51d91266b92.png)
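A minimal sketch of the Hamming distance for categorical features; the two instances below (with features Weekday, Happy?, Weather) are made-up values for illustration:

```python
def hamming_distance(x1, x2):
    """Count the number of features on which two instances differ."""
    return sum(a != b for a, b in zip(x1, x2))

# hypothetical instances with features (Weekday, Happy?, Weather)
x_a = ("Mon", "Yes", "Sunny")
x_b = ("Mon", "No",  "Rainy")
print(hamming_distance(x_a, x_b))   # -> 2 (they differ on Happy? and Weather)
```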
How can we determine similarity/distance
- suppose all features are continuous; the standard distance definitions are recalled below
- Euclidean distance
- Manhattan distance
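For reference, the standard definitions for two instances $x, x'$ with $n$ numeric features:

$$
d_{\text{Euclidean}}(x, x') = \sqrt{\sum_{j=1}^{n} (x_j - x'_j)^2}
\qquad\qquad
d_{\text{Manhattan}}(x, x') = \sum_{j=1}^{n} |x_j - x'_j|
$$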
Standardizing numeric features
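The details are left as an exercise (see the topic list), but for reference here is a minimal sketch of one common choice, z-score standardization: each feature is shifted and scaled using the training-set mean and standard deviation so that no single feature dominates the distance computation.

```python
import numpy as np

def standardize(X_train, X_test):
    """Rescale each feature to zero mean and unit variance, using statistics
    estimated on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # avoid division by zero for constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```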
Speeding up k-NN
Issues
- Choosing k
- Increasing k reduces variance but increases bias
- For high-dimensional spaces, the problem is that the nearest neighbor may not be very near at all!
- Memory-based technique. Must make a pass through the data for each classification. This can be prohibitive for large data sets
Nearest neighbour problem
- Given a sample S = ((x1, y1), …, (xm, ym)) and a test point x,
- the task is to find the k nearest neighbours of x
- Note: for these algorithms, the dimensionality N, i.e., the number of features, is crucial
Efficient Indexing: N = 2
- Algorithm
- compute the Voronoi diagram in O(m log m)
- see algorithm …
- use a point-location data structure to determine nearest neighbours
Efficient Indexing: N > 2
- Two general strategies for alleviating this weakness
- don’t retain every training instance (edited nearest neighbor)
- pre-processing: use a smart data structure to look up nearest neighbors (e.g. a k-d tree)
Edited instance-based learning
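The slide details are not reproduced here; as one common instance of the "don't retain every training instance" idea, here is a minimal sketch of Wilson editing, which discards training instances that are misclassified by their own k nearest neighbours (a small `knn_classify` helper is repeated so the snippet is self-contained):

```python
import numpy as np
from collections import Counter

def knn_classify(X, y, x_q, k):
    nearest = np.argsort(np.linalg.norm(X - x_q, axis=1))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

def wilson_edit(X_train, y_train, k=3):
    """Keep only the instances whose label agrees with the vote of their k nearest
    neighbours among the *other* training instances; the reduced set is what the
    final nearest-neighbour classifier stores."""
    keep = []
    for i in range(len(X_train)):
        others = np.delete(np.arange(len(X_train)), i)          # leave instance i out
        if knn_classify(X_train[others], y_train[others], X_train[i], k) == y_train[i]:
            keep.append(i)
    return X_train[keep], y_train[keep]
```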
k-d trees
Construction of k-d tree
Finding nearest neighbors with a k-d tree
- use branch-and-bound search
- priority queue stores
- nodes considered
- lower bound on their distance to query instance
- lower bound given by distance using a single feature
- average case: $O(\log_2 m)$
- worst case: $O(m)$, where m is the size of the training set
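A minimal, self-contained sketch of the idea, using squared Euclidean distance (the example slide uses Manhattan distance, but the bookkeeping is the same): the tree is built by splitting on features round-robin at the median, and the best-first search keeps a priority queue of subtrees ordered by a lower bound on their distance to the query, computed from the single splitting feature.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    point: Tuple[float, ...]            # training instance stored at this node
    axis: int                           # feature this node splits on
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_kdtree(points: List[Tuple[float, ...]], depth: int = 0) -> Optional[Node]:
    """Split on features in round-robin order, at the median along the chosen feature."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build_kdtree(points[:mid], depth + 1),
                build_kdtree(points[mid + 1:], depth + 1))

def nearest_neighbor(root: Node, query: Tuple[float, ...]) -> Tuple[float, ...]:
    """Best-first branch-and-bound search: the priority queue holds subtrees keyed by a
    lower bound on their (squared) distance to the query; a subtree is skipped when its
    bound is no better than the best distance found so far."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best_point, best_d2 = root.point, d2(root.point, query)
    heap = [(0.0, 0, root)]             # (lower bound, tie-breaker, node)
    tie = 1
    while heap:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d2:            # nothing left in the queue can improve the answer
            break
        dist = d2(node.point, query)
        if dist < best_d2:
            best_point, best_d2 = node.point, dist
        diff = query[node.axis] - node.point[node.axis]          # signed gap on the split feature
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        if near is not None:            # same bound: query may lie inside this half-space
            heapq.heappush(heap, (bound, tie, near)); tie += 1
        if far is not None:             # crossing the split costs at least diff**2
            heapq.heappush(heap, (max(bound, diff * diff), tie, far)); tie += 1
    return best_point

# toy usage
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest_neighbor(tree, (9, 2)))   # -> (8, 1)
```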
k-d tree example (Manhattan distance)
Extended Materials: Voronoi Diagram Generation
- https://en.wikipedia.org/wiki/Voronoi_diagram
- https://courses.cs.washington.edu/courses/cse326/00wi/projects/voronoi.html
Variants of k-NN
k-NN regression
- learning stage
- given a training set (x(1), y(1)), …, (x(m), y(m)), do nothing
- (it’s sometimes called a lazy learner)
- prediction stage
- given: an instance x(q) whose target value we want to predict
- find the k training-set instances (x(1), y(1)), …, (x(k), y(k)) that are most similar to x(q)
- return the value $\hat{y} = \frac{1}{k}\sum_{i=1}^{k} y^{(i)}$, i.e. the mean of the neighbours’ target values (see the sketch below)
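A minimal sketch, again assuming numeric features and Euclidean distance:

```python
import numpy as np

def knn_regress(X_train, y_train, x_q, k=3):
    """Return the mean target value of the k training instances most similar to x_q."""
    nearest = np.argsort(np.linalg.norm(X_train - x_q, axis=1))[:k]
    return y_train[nearest].mean()
```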
Distance-weighted nearest neighbor
- We can have instances contribute to a prediction according to their distance from x(q)
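One common weighting scheme (an illustrative choice, not necessarily the lecture's) uses the inverse squared distance, so that closer neighbours count for more:

$$
w_i = \frac{1}{d(x^{(q)}, x^{(i)})^2},
\qquad
\hat{y}(x^{(q)}) = \frac{\sum_{i=1}^{k} w_i\, y^{(i)}}{\sum_{i=1}^{k} w_i}
$$

For classification, the neighbours' votes are weighted by $w_i$ instead of counting each vote equally; if $x^{(q)}$ coincides exactly with a training instance, that instance's value is returned directly to avoid division by zero.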