Machine Learning

Chapter 4. Parametric methods

jun1-cs 2025. 11. 2. 20:36

1. What is a parametric model?
Parametric model: we assume the sample is drawn from some known family of distributions, for example, a Gaussian.
-> Advantage: the model is defined by a small number of parameters (e.g. mean and variance).
-> The goal of a parametric approach is to estimate the parameters from the data, plug them into the model, and obtain a prediction model based on those parameters.
=> X -> estimate θ (parameters) -> obtain the prediction model p(x|θ) -> predict y for x
The equation below describes the structure of a parametric prediction model: the predictor g is averaged over the posterior distribution of θ.

y = ∫ g(x|θ) p(θ|X) dθ

 
2. Maximum Likelihood Estimation
We assume an iid sample X = {xt, t = 1..N}.
- xt ~ p(x|θ)
  -> given the parameter θ, each data point xt is a draw from the distribution p.
- likelihood l(θ|X) = p(X|θ) = Π_t p(xt|θ)
  -> p(X|θ) reads this as the probability of the data given θ; the likelihood l reads the same quantity as a function of θ for the observed X, so maximizing l finds the θ that best explains the data.

=> log likelihood L
L(θ|X) = Σ_t log p(xt|θ) : measures how well θ explains all of the observed xt values
-> maximizing L over θ gives the maximum likelihood estimate.
 
There are several common forms for the density p(x|θ): the Bernoulli, multinomial, and Gaussian densities.
(1) Bernoulli density: for classification with a binary output.
p(x) = p^x (1-p)^(1-x)   (x = 0, 1)  -> we must estimate the parameter p; the MLE is p̂ = Σ_t xt / N, the sample proportion of 1s.
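The Bernoulli MLE above is just the sample mean of the 0/1 outcomes; a minimal sketch (the binary sample is made up):

```python
# MLE for the Bernoulli parameter p:
# p_hat = (number of 1s) / N, i.e. the sample mean of the 0/1 outcomes.
def bernoulli_mle(sample):
    return sum(sample) / len(sample)

x = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical binary sample
p_hat = bernoulli_mle(x)      # 5/8 = 0.625
```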
 
(2) Multinomial density: for classification with a multiclass output.
If classifying into K classes, represent each outcome with K 0/1 indicator variables x1..xK, exactly one of which is 1 (only K-1 of the probabilities are free, since they sum to 1).
p(x) = Π_i pi^xi -> MLE: p̂i = (number of cases in class i) / (total number of cases)
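The multinomial MLE is the vector of empirical class frequencies; a minimal sketch (the labels and K = 3 are made up):

```python
from collections import Counter

# MLE for multinomial class probabilities:
# p_i_hat = (count of class i) / (total count).
def multinomial_mle(labels, num_classes):
    counts = Counter(labels)
    n = len(labels)
    return [counts.get(i, 0) / n for i in range(num_classes)]

labels = [0, 2, 1, 0, 2, 2]         # hypothetical class labels, K = 3
p_hat = multinomial_mle(labels, 3)  # [2/6, 1/6, 3/6]
```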
 
(3) Gaussian density: for regressing a numerical output.

Fig 1. Gaussian distribution p(x), parameterized by the mean and the standard deviation

-> we must estimate two parameters: the mean and the standard deviation.
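The Gaussian MLEs are the sample mean and the (biased, 1/N) sample standard deviation; a minimal sketch over a made-up sample:

```python
import math

# MLE for Gaussian parameters: the sample mean and the
# maximum-likelihood (1/N, biased) standard deviation.
def gaussian_mle(sample):
    n = len(sample)
    mean = sum(sample) / n
    var = sum((v - mean) ** 2 for v in sample) / n
    return mean, math.sqrt(var)

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
m, s = gaussian_mle(x)  # m = 5.0, s = 2.0
```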
 
3. Estimators for the parameters: now treat θ itself as a random variable with its own distribution
1) MAP (maximum a posteriori) estimate
: Use this when the full integral above is hard to compute: approximate it by evaluating g(x|θ) at the single point θ <- θMAP, i.e. replace the posterior p(θ|X) with a point mass at θMAP.
θMAP = argmax_θ p(θ|X) = argmax_θ p(X|θ)p(θ)
This means we focus only on θMAP, the single most probable θ, instead of integrating over all possible values of θ. θMAP is the θ that best explains X while also taking the prior distribution p(θ) into account.
 
2) ML (maximum likelihood) estimate
: This is the special case of MAP in which the prior p(θ) is a uniform distribution.
θML = argmax_θ p(X|θ)
This means we pick the θ under which the observed X is most probable, without weighting by p(θ).
 
3) Bayes' estimate
: Use this when we want the expected value of θ given the data X.
θBayes = E[θ|X] = ∫ θ p(θ|X) dθ
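The three estimators can be compared in a simple conjugate case: for a Gaussian likelihood with known variance and a Gaussian prior on the mean, the posterior is Gaussian, so θMAP and θBayes coincide at the precision-weighted average below, and a flat prior recovers θML (the sample mean). A minimal sketch; all numeric values are made up:

```python
# Gaussian likelihood with known variance s2, Gaussian prior N(mu0, s0_2)
# on the mean. The posterior mean (= MAP = Bayes estimate here) is a
# precision-weighted average of the sample mean and the prior mean.
def map_gaussian_mean(sample, s2, mu0, s0_2):
    n = len(sample)
    x_bar = sum(sample) / n                 # the ML estimate
    w = (n / s2) / (n / s2 + 1 / s0_2)      # weight on the data
    return w * x_bar + (1 - w) * mu0        # shrinks toward mu0

x = [4.0, 6.0, 5.0, 5.0]
theta_ml = sum(x) / len(x)                                    # 5.0
theta_map = map_gaussian_mean(x, s2=1.0, mu0=0.0, s0_2=1.0)   # 0.8 * 5.0 = 4.0
```

Note how the stronger the prior (smaller s0_2) or the smaller the sample, the more the MAP estimate is pulled away from the ML estimate toward mu0.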
 
4. Parametric classification
The goal is to classify x into one of the classes Ci (i = 1..K) -> i.e. to find p(Ci|x), the posterior probability of each class.
p(Ci|x) = p(x|Ci) p(Ci) / Σ_k p(x|Ck) p(Ck)
-> Since the denominator is the same for every class, it suffices to compare gi(x) = p(x|Ci) p(Ci) !! gi is a discriminant function that determines which class x should be assigned to: if gm(x) is the largest among gi(x) (i = 1..K), x is classified into class m.
-> We obtain a parametric classifier by modeling each class-conditional density p(x|Ci) parametrically, fitting its parameters to the training data of class i.
 
Ex) p(x|Ci) is a Gaussian
-> gi(x) = log p(x|Ci) + log p(Ci) = C (constants) - (x - mi)^2 / (2 s^2), assuming equal priors and a shared variance s^2
=> x should be classified into the class i that minimizes |x - mi| (equivalently, maximizes gi(x)),
=> i.e. into the class whose mean mi is nearest to x among the mk (k = 1..K).
* a normality test on each class's data is advisable before assuming a Gaussian class-conditional density
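Under these assumptions the whole classifier reduces to a nearest-mean rule; a minimal 1-D sketch (the training data below is made up):

```python
# Gaussian class-conditionals with equal priors and a shared variance
# reduce to: assign x to the class with the nearest estimated mean.
def class_means(samples, labels, num_classes):
    means = []
    for c in range(num_classes):
        vals = [x for x, y in zip(samples, labels) if y == c]
        means.append(sum(vals) / len(vals))   # per-class MLE mean
    return means

def classify(x, means):
    # argmin over classes of |x - m_i|
    return min(range(len(means)), key=lambda c: abs(x - means[c]))

means = class_means([1.0, 2.0, 8.0, 10.0], [0, 0, 1, 1], 2)  # [1.5, 9.0]
classify(3.0, means)   # -> 0 (3.0 is closer to 1.5 than to 9.0)
```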
 
5. Regression
r = f(x|θ) + ε, where ε ~ N(0, s^2) -> the squared-error criterion below follows from this Gaussian noise assumption.
p(xt, rt) = p(xt) p(rt|xt), with p(rt|xt) ~ N(g(xt|θ), s^2)   (g predicts rt from xt given θ; (xt, rt) is the training data)
Regression goal:
-> Maximize L(θ|X) = Σ_t log p(rt|xt) + Σ_t log p(xt)   (the second term does not depend on θ)
-> Maximize Σ_t log p(rt|xt) = log Π_t p(rt|xt), where each p(rt|xt) is Gaussian
-> Maximize -Σ_t [rt - g(xt|θ)]^2 / (2 s^2)   (additive constants dropped)
-> Minimize the error E(θ|X) = 1/2 Σ_t [rt - g(xt|θ)]^2   (the std s is independent of θ)
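For a linear model g(x|w0, w1) = w0 + w1·x, minimizing E(θ|X) is ordinary least squares and has a closed-form solution; a minimal 1-D sketch (the training pairs are made up and noise-free for clarity):

```python
# Minimizing E(theta|X) = 1/2 * sum_t (r_t - g(x_t|theta))^2
# for g(x|w0, w1) = w0 + w1*x: the 1-D normal-equation solution.
def fit_linear(xs, rs):
    n = len(xs)
    mx = sum(xs) / n
    mr = sum(rs) / n
    sxx = sum((x - mx) ** 2 for x in xs)                      # Σ (x - x̄)²
    sxr = sum((x - mx) * (r - mr) for x, r in zip(xs, rs))    # Σ (x - x̄)(r - r̄)
    w1 = sxr / sxx
    w0 = mr - w1 * mx
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]    # hypothetical training inputs
rs = [1.0, 3.0, 5.0, 7.0]    # generated as r = 1 + 2x
w0, w1 = fit_linear(xs, rs)  # (1.0, 2.0)
```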

* To find the optimal point in the bias-variance dilemma:
-> build more complex models to decrease bias
-> average over several complex models (e.g. an ensemble) to obtain a low-variance model

* adjusted error = error on data + gamma * model complexity
-> find the optimal gamma value using hyperparameter optimization and cross validation
-> gamma penalizes model complexity, steering the fit toward simpler models.
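The adjusted-error rule can be sketched as a toy model-selection loop; the candidate errors and the complexity measure (parameter count) below are made up:

```python
# adjusted error = error on data + gamma * model complexity.
# Pick the candidate model minimizing the adjusted error.
def select_model(candidates, gamma):
    scores = [err + gamma * k for err, k in candidates]
    return min(range(len(candidates)), key=lambda i: scores[i])

# (fit error on data, number of parameters) per candidate model
candidates = [(10.0, 1), (4.0, 3), (3.5, 9)]
select_model(candidates, gamma=0.5)  # -> 1: scores are 10.5, 5.5, 8.0
```

With gamma = 0 the most complex model always wins on fit error; a larger gamma trades fit for simplicity, which is why gamma is tuned by cross validation.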
 
 
 
 
 
 
 
