Machine Learning

Chapter 4. Parametric methods

jun1-cs 2025. 11. 2. 20:36

1. What is a parametric model?
Parametric model: we assume the sample is drawn from some known family of distributions, for example, a Gaussian.
-> Advantage: the model is defined by a small number of parameters (e.g. mean and variance).
-> The goal of a parametric approach is to estimate the parameters from the data, plug them into the model, and obtain a prediction model based on those parameters.
=> X -> estimate θ (parameters) -> obtain the prediction model p(x|θ) -> predict y for x
The equation below describes the structure of a parametric prediction model: the predictor g is averaged over the posterior distribution of θ.

y = ∫ g(x|θ) p(θ|X) dθ

 
2. Maximum Likelihood Estimation
We assume an iid sample X = {xt, t = 1..N}.
- xt ~ p(x|θ)
  -> given the parameter θ, each data point xt is a draw from the distribution p.
- likelihood l(θ|X) = p(X|θ) = Π_t p(xt|θ)
  -> p(X|θ) reads this as the probability of the data given θ; the likelihood l reads the same quantity as a function of θ for the observed X, so maximizing l finds the θ that best explains the data.

=> log likelihood L
L(θ|X) = Σ_t log p(xt|θ) : measures how well θ explains all of the observed xt values
-> maximizing L over θ gives the maximum likelihood estimate.
 
There are several common forms for the density p(x|θ): the Bernoulli, multinomial, and Gaussian densities.
(1) Bernoulli density: for classification with a binary output.
p(x) = p^x (1-p)^(1-x)   (x = 0, 1)  -> we must estimate the parameter p; the MLE is p̂ = Σ_t xt / N, the sample proportion of 1s.
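The Bernoulli MLE above is just the sample mean of the 0/1 outcomes; a minimal sketch (the binary sample is made up):

```python
# MLE for the Bernoulli parameter p:
# p_hat = (number of 1s) / N, i.e. the sample mean of the 0/1 outcomes.
def bernoulli_mle(sample):
    return sum(sample) / len(sample)

x = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical binary sample
p_hat = bernoulli_mle(x)      # 5/8 = 0.625
```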
 
(2) Multinomial density: for classification with a multiclass output.
If classifying into K classes, represent each outcome with K 0/1 indicator variables x1..xK, exactly one of which is 1 (only K-1 of the probabilities are free, since they sum to 1).
p(x) = Π_i pi^xi -> MLE: p̂i = (number of cases in class i) / (total number of cases)
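The multinomial MLE is the vector of empirical class frequencies; a minimal sketch (the labels and K = 3 are made up):

```python
from collections import Counter

# MLE for multinomial class probabilities:
# p_i_hat = (count of class i) / (total count).
def multinomial_mle(labels, num_classes):
    counts = Counter(labels)
    n = len(labels)
    return [counts.get(i, 0) / n for i in range(num_classes)]

labels = [0, 2, 1, 0, 2, 2]         # hypothetical class labels, K = 3
p_hat = multinomial_mle(labels, 3)  # [2/6, 1/6, 3/6]
```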
 
(3) Gaussian density: for regressing a numerical output.

Fig 1. Gaussian distribution p(x), parameterized by the mean and the standard deviation

-> we must estimate two parameters: the mean and the standard deviation.
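The Gaussian MLEs are the sample mean and the (biased, 1/N) sample standard deviation; a minimal sketch over a made-up sample:

```python
import math

# MLE for Gaussian parameters: the sample mean and the
# maximum-likelihood (1/N, biased) standard deviation.
def gaussian_mle(sample):
    n = len(sample)
    mean = sum(sample) / n
    var = sum((v - mean) ** 2 for v in sample) / n
    return mean, math.sqrt(var)

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
m, s = gaussian_mle(x)  # m = 5.0, s = 2.0
```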
 
3. Estimators for the parameters: now treat θ itself as a random variable with its own distribution
1) MAP (maximum a posteriori) estimate
: Use this when the full integral above is hard to compute: approximate it by evaluating g(x|θ) at the single point θ <- θMAP, i.e. replace the posterior p(θ|X) with a point mass at θMAP.
θMAP = argmax_θ p(θ|X) = argmax_θ p(X|θ)p(θ)
This means we focus only on θMAP, the single most probable θ, instead of integrating over all possible values of θ. θMAP is the θ that best explains X while also taking the prior distribution p(θ) into account.
 
2) ML (maximum likelihood) estimate
: This is the special case of MAP in which the prior p(θ) is a uniform distribution.
θML = argmax_θ p(X|θ)
This means we pick the θ under which the observed X is most probable, without weighting by p(θ).
 
3) Bayes' estimate
: Use this when we want the expected value of θ given the data X.
θBayes = E[θ|X] = ∫ θ p(θ|X) dθ
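The three estimators can be compared in a simple conjugate case: for a Gaussian likelihood with known variance and a Gaussian prior on the mean, the posterior is Gaussian, so θMAP and θBayes coincide at the precision-weighted average below, and a flat prior recovers θML (the sample mean). A minimal sketch; all numeric values are made up:

```python
# Gaussian likelihood with known variance s2, Gaussian prior N(mu0, s0_2)
# on the mean. The posterior mean (= MAP = Bayes estimate here) is a
# precision-weighted average of the sample mean and the prior mean.
def map_gaussian_mean(sample, s2, mu0, s0_2):
    n = len(sample)
    x_bar = sum(sample) / n                 # the ML estimate
    w = (n / s2) / (n / s2 + 1 / s0_2)      # weight on the data
    return w * x_bar + (1 - w) * mu0        # shrinks toward mu0

x = [4.0, 6.0, 5.0, 5.0]
theta_ml = sum(x) / len(x)                                    # 5.0
theta_map = map_gaussian_mean(x, s2=1.0, mu0=0.0, s0_2=1.0)   # 0.8 * 5.0 = 4.0
```

Note how the stronger the prior (smaller s0_2) or the smaller the sample, the more the MAP estimate is pulled away from the ML estimate toward mu0.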
 
4. Parametric classification
The goal is to classify x into one of the classes Ci (i = 1..K) -> i.e. to find p(Ci|x), the posterior probability of each class.
p(Ci|x) = p(x|Ci) p(Ci) / Σ_k p(x|Ck) p(Ck)
-> Since the denominator is the same for every class, it suffices to compare gi(x) = p(x|Ci) p(Ci) !! gi is a discriminant function that determines which class x should be assigned to: if gm(x) is the largest among gi(x) (i = 1..K), x is classified into class m.
-> We obtain a parametric classifier by modeling each class-conditional density p(x|Ci) parametrically, fitting its parameters to the training data of class i.
 
Ex) p(x|Ci) is a Gaussian
-> gi(x) = log p(x|Ci) + log p(Ci) = C (constants) - (x - mi)^2 / (2 s^2), assuming equal priors and a shared variance s^2
=> x should be classified into the class i that minimizes |x - mi| (equivalently, maximizes gi(x)),
=> i.e. into the class whose mean mi is nearest to x among the mk (k = 1..K).
* a normality test on each class's data is advisable before assuming a Gaussian class-conditional density
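Under these assumptions the whole classifier reduces to a nearest-mean rule; a minimal 1-D sketch (the training data below is made up):

```python
# Gaussian class-conditionals with equal priors and a shared variance
# reduce to: assign x to the class with the nearest estimated mean.
def class_means(samples, labels, num_classes):
    means = []
    for c in range(num_classes):
        vals = [x for x, y in zip(samples, labels) if y == c]
        means.append(sum(vals) / len(vals))   # per-class MLE mean
    return means

def classify(x, means):
    # argmin over classes of |x - m_i|
    return min(range(len(means)), key=lambda c: abs(x - means[c]))

means = class_means([1.0, 2.0, 8.0, 10.0], [0, 0, 1, 1], 2)  # [1.5, 9.0]
classify(3.0, means)   # -> 0 (3.0 is closer to 1.5 than to 9.0)
```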
 
5. Regression
r = f(x|θ) + ε, where ε ~ N(0, s^2) -> the squared-error criterion below follows from this Gaussian noise assumption.
p(xt, rt) = p(xt) p(rt|xt), with p(rt|xt) ~ N(g(xt|θ), s^2)   (g predicts rt from xt given θ; (xt, rt) is the training data)
Regression goal:
-> Maximize L(θ|X) = Σ_t log p(rt|xt) + Σ_t log p(xt)   (the second term does not depend on θ)
-> Maximize Σ_t log p(rt|xt) = log Π_t p(rt|xt), where each p(rt|xt) is Gaussian
-> Maximize -Σ_t [rt - g(xt|θ)]^2 / (2 s^2)   (additive constants dropped)
-> Minimize the error E(θ|X) = 1/2 Σ_t [rt - g(xt|θ)]^2   (the std s is independent of θ)
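For a linear model g(x|w0, w1) = w0 + w1·x, minimizing E(θ|X) is ordinary least squares and has a closed-form solution; a minimal 1-D sketch (the training pairs are made up and noise-free for clarity):

```python
# Minimizing E(theta|X) = 1/2 * sum_t (r_t - g(x_t|theta))^2
# for g(x|w0, w1) = w0 + w1*x: the 1-D normal-equation solution.
def fit_linear(xs, rs):
    n = len(xs)
    mx = sum(xs) / n
    mr = sum(rs) / n
    sxx = sum((x - mx) ** 2 for x in xs)                      # Σ (x - x̄)²
    sxr = sum((x - mx) * (r - mr) for x, r in zip(xs, rs))    # Σ (x - x̄)(r - r̄)
    w1 = sxr / sxx
    w0 = mr - w1 * mx
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]    # hypothetical training inputs
rs = [1.0, 3.0, 5.0, 7.0]    # generated as r = 1 + 2x
w0, w1 = fit_linear(xs, rs)  # (1.0, 2.0)
```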

* To find the optimal point in the bias-variance dilemma:
-> build more complex models to decrease bias
-> average over several complex models (e.g. an ensemble) to obtain a low-variance model

* adjusted error = error on data + gamma * model complexity
-> find the optimal gamma value using hyperparameter optimization and cross validation
-> gamma penalizes model complexity, steering the fit toward simpler models.
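The adjusted-error rule can be sketched as a toy model-selection loop; the candidate errors and the complexity measure (parameter count) below are made up:

```python
# adjusted error = error on data + gamma * model complexity.
# Pick the candidate model minimizing the adjusted error.
def select_model(candidates, gamma):
    scores = [err + gamma * k for err, k in candidates]
    return min(range(len(candidates)), key=lambda i: scores[i])

# (fit error on data, number of parameters) per candidate model
candidates = [(10.0, 1), (4.0, 3), (3.5, 9)]
select_model(candidates, gamma=0.5)  # -> 1: scores are 10.5, 5.5, 8.0
```

With gamma = 0 the most complex model always wins on fit error; a larger gamma trades fit for simplicity, which is why gamma is tuned by cross validation.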
 
 
 
 
 
 
 
