1. What is a parametric model?
Parametric model: we assume the sample is drawn from some known family of distributions, for example Gaussian.
-> Advantage: the model is defined by a small number of parameters (e.g. mean and variance).
-> The goal of a parametric model is to estimate the parameters from the data, plug them into the model, and obtain a prediction model based on those parameters.
=> X -> estimate θ (parameters) -> prediction model p(x|θ) -> predict y for x
The equation below describes the full parametric prediction model, averaging the model g over the posterior of θ:
y = ∫ g(x|θ) p(θ|X) dθ
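The integral above can be approximated by sampling θ from its posterior and averaging the predictions. A minimal sketch, assuming a made-up posterior N(2.0, 0.1²) and a toy model g(x|θ) = θ·x (both are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior p(θ|X) = N(2.0, 0.1^2); toy model g(x|θ) = θ x.
theta_samples = rng.normal(loc=2.0, scale=0.1, size=100_000)

def g(x, theta):
    return theta * x  # toy linear model

x = 3.0
# Monte Carlo estimate of ∫ g(x|θ) p(θ|X) dθ
y = np.mean(g(x, theta_samples))
print(round(y, 2))  # close to 2.0 * 3.0 = 6.0
```

The MAP and ML estimators introduced later replace this average with a single plug-in value of θ.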
2. Maximum Likelihood Estimation
We assume an i.i.d. sample X = {xt, t = 1..N}
- xt ~ p(x|θ)
-> each datapoint xt is drawn from the distribution p governed by the parameter θ
- l(θ|X) = p(X|θ) = Πt p(xt|θ)
-> p(X|θ) reads this quantity as the probability of the data given a fixed θ; the likelihood l reads the same quantity as a function of θ, asking which θ best explains the observed X
=> log likelihood L
L(θ|X) = Σt log p(xt|θ) : the ability of θ to explain all of the xt values
-> the θ that maximizes L is the maximum likelihood estimate
Several density families are commonly assumed: Bernoulli density, multinomial density, Gaussian density.
(1) Bernoulli density: for classifying into a binary output.
p(x) = p^x (1-p)^(1-x), x ∈ {0,1} -> we must estimate the parameter p (ML estimate: p̂ = Σt xt / N)
(2) Multinomial density: for classifying into a multiclass output.
For K classes, use K 0/1 indicator variables xi with exactly one equal to 1 (only K-1 are free, since they sum to 1).
p(x) = Πi pi^xi -> ML estimate: p̂i = (number of cases in class i) / (total number of cases)
(3) Gaussian density: for regressing a numerical output.
p(x) = 1/(√(2π)σ) exp(−(x−m)² / 2σ²)
-> we must estimate two parameters: the mean m and the standard deviation σ (ML estimates: the sample mean and sample variance)
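The ML estimates for all three densities reduce to simple sample statistics. A sketch on made-up toy samples (all data below is illustrative):

```python
import numpy as np

# (1) Bernoulli: p̂ = (# of 1s) / N
bern = np.array([1, 0, 1, 1, 0, 1])
p_hat = bern.mean()

# (2) Multinomial: p̂_i = (# of cases in class i) / N, here K = 3
labels = np.array([0, 1, 1, 2, 2, 2])
p_i_hat = np.bincount(labels, minlength=3) / len(labels)

# (3) Gaussian: sample mean and ML variance (divides by N, not N-1)
x = np.array([1.2, 0.8, 1.0, 1.4, 0.6])
m_hat = x.mean()
s2_hat = x.var()

print(p_hat, p_i_hat, m_hat, s2_hat)
```

Each estimate is the maximizer of the corresponding log likelihood L(θ|X) from above.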
3. Estimators for finding parameters: assume that θ itself follows a distribution (the prior p(θ))
1) MAP (maximum a posteriori) estimate
: Use this when the full integral in the equation above is difficult to compute; approximate it by plugging in a single value, θ <- θMAP:
θMAP = argmax p(θ|X) = argmax p(X|θ)p(θ)
Instead of averaging over all possible values of θ, we focus only on the single most probable one. θMAP is the θ that best explains X while also accounting for the prior distribution p(θ).
2) ML (maximum likelihood) estimate
: The special case of MAP where the prior p(θ) is a uniform distribution, so the prior term drops out:
θML = argmax p(X|θ)
This gives the θ under which the observed X is most probable, without considering p(θ).
3) Bayes' estimate
: Use this when we want the expected value of θ given the data X:
θBayes = E[θ|X] = ∫ θ p(θ|X) dθ
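The three estimators can be compared on a standard conjugate example: a Bernoulli parameter with a Beta(a, b) prior, where all three have closed forms. This setup and the numbers are chosen here for illustration, not taken from the text:

```python
# 7 successes in 10 Bernoulli trials, Beta(2, 2) prior on θ (made-up numbers).
h, n = 7, 10
a, b = 2.0, 2.0

theta_ml = h / n                           # argmax p(X|θ): ignores the prior
theta_map = (h + a - 1) / (n + a + b - 2)  # argmax p(X|θ)p(θ): posterior mode
theta_bayes = (h + a) / (n + a + b)        # E[θ|X]: posterior mean

print(theta_ml, theta_map, theta_bayes)  # 0.7, 0.666..., 0.642...
```

The prior pulls both θMAP and θBayes toward 0.5; with a uniform prior (a = b = 1) the MAP estimate coincides with the ML estimate, as stated above.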
4. Parametric classification
The goal is to classify x into one of the classes Ci -> find p(Ci|x), the posterior probability of each class.
p(Ci|x) = p(x|Ci)p(Ci) / Σk p(x|Ck)p(Ck)
-> since the denominator is the same for every class, it suffices to compare gi(x) = p(x|Ci)p(Ci). gi is a discriminant function that determines which class x should be assigned to: if gm(x) is the largest among gi(x), i = 1..K, then x is classified into class m.
-> We obtain a parametric classifier by fitting a parametric density p(x|Ci) to the observed training samples of each class.
Ex) p(x|Ci) is Gaussian (with equal variances and equal priors across classes):
-> gi(x) = log p(x|Ci) + log p(Ci) = C(constants) − k(x − mi)²
=> x should be classified into the class i that minimizes |x − mi| (i.e. maximizes gi(x)),
=> that is, the class whose mean mi is nearest to x among mk, k = 1..K.
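Under these assumptions the discriminant reduces to nearest-mean classification. A minimal sketch with made-up class means:

```python
import numpy as np

# Class means m_i for three classes (illustrative values).
means = np.array([0.0, 3.0, 6.0])

def classify(x):
    # g_i(x) = C - k (x - m_i)^2  ->  pick the i minimizing |x - m_i|
    return int(np.argmin(np.abs(x - means)))

print(classify(2.4))  # nearest mean is 3.0 -> class index 1
print(classify(5.1))  # nearest mean is 6.0 -> class index 2
```

With unequal priors or variances the constant terms no longer cancel, and the full gi(x) must be compared instead.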
* A normality test on the data is essential before assuming a Gaussian class-conditional density.
5. Regression
r = f(x|θ) + ε, ε ~ N(0, σ²) -> the MSE criterion below follows from assuming the noise ε is Gaussian.
p(xt, rt) = p(xt) p(rt|xt), with p(rt|xt) ~ N(g(xt|θ), σ²) (g predicts r̂t from xt based on θ; (xt, rt) are training data)
Regression goal:
-> Maximize L(θ|X) = Σt log p(rt|xt) + Σt log p(xt); only the first term depends on θ
-> Maximize Σt log p(rt|xt) = log Πt p(rt|xt), where p(rt|xt) is Gaussian
-> Maximize −Σt (rt − g(xt|θ))² / 2σ²
-> Minimize the error E(θ|X) = 1/2 Σt (rt − g(xt|θ))² (the std σ is independent of θ)
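For a linear model g(x|θ) = θ0 + θ1·x, minimizing E(θ|X) has the familiar least-squares solution. A sketch on synthetic data (the true line r = 1 + 2x and the noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
r = 1.0 + 2.0 * x + rng.normal(0, 0.05, size=50)  # r = f(x) + Gaussian noise

# Design matrix [1, x]; lstsq returns the θ minimizing Σ (r_t - g(x_t|θ))^2
A = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(A, r, rcond=None)
print(theta)  # close to [1.0, 2.0]
```

Because the Gaussian noise assumption turned the likelihood into a sum of squares, this least-squares fit is exactly the ML estimate of θ.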
* To find the optimal point in the bias-variance dilemma:
-> build complex models to decrease bias
-> average over complex models to obtain a low-variance model
* adjusted error = error on data + γ · model complexity
-> find the optimal γ value via hyperparameter optimization and cross validation
-> the γ penalty also serves to keep model complexity down
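The adjusted-error criterion can be sketched with polynomial degree as the complexity measure. The data, the value of γ, and the candidate degrees below are all illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 40)
r = x**2 + rng.normal(0, 0.05, size=40)  # true model is quadratic

gamma = 0.01
scores = {}
for degree in range(1, 7):
    coeffs = np.polyfit(x, r, degree)
    mse = np.mean((np.polyval(coeffs, x) - r) ** 2)
    # adjusted error = error on data + gamma * model complexity
    scores[degree] = mse + gamma * degree

best = min(scores, key=scores.get)
print(best)
```

Training error alone keeps decreasing with degree; the γ·degree penalty makes the criterion prefer the simplest model that fits well, here the quadratic.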