WoE transformations in regression models
Task: Import the attached woe_iv.csv file, then add a new variable maturity.g to the imported data frame db. The variable is defined by grouping the values of maturity into 3 groups using the boundaries 4, 11, 14 and 72. Then:
estimate a logistic regression model (dependent variable bo, independent variable maturity.g) using:
a) WoE transformation of the independent variable (WoE coding);
b) transformation of the independent variable into binary variables (dummy coding);
estimate a linear regression model (dependent variable co, independent variable maturity.g) using:
a) WoE transformation of the independent variable (WoE coding);
b) transformation of the independent variable into binary variables (dummy coding).
> # run the following commands if the packages are not already installed
> #install.packages("Hmisc")
> #install.packages("dtplyr")
> #install.packages("dplyr")
> library(Hmisc)
> library(dtplyr)
> library(dplyr)
>
> # import the woe_iv.csv file
> db <- read.csv("woe_iv.csv", header = TRUE)
> str(db)
'data.frame': 10000 obs. of 3 variables:
$ bo : int 0 0 0 0 0 0 0 0 0 0 ...
$ co : num 0.1361 0.0941 0.0847 0.0122 0.0122 ...
$ maturity: int 18 9 12 12 12 10 8 6 18 24 ...
> # bo - good (0) / bad (1) indicator
> table(db$bo)
0 1
9500 500
> # create the loan maturity groups
> db$maturity.g <- cut2(db$maturity, cuts = c(4, 11, 14))
> # create a lazy data.table object (dtplyr)
> dt <- lazy_dt(db)
> dt
Source: local data table [10,000 x 4]
Call: `_DT1`
bo co maturity maturity.g
<int> <dbl> <int> <fct>
1 0 0.136 18 [14,72]
2 0 0.0941 9 [ 4,11)
3 0 0.0847 12 [11,14)
4 0 0.0122 12 [11,14)
5 0 0.0122 12 [11,14)
6 0 0.0122 10 [ 4,11)
# ... with 9,994 more rows
# Use as.data.table()/as.data.frame()/as_tibble() to access results
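The grouping in the maturity.g column can also be reproduced without Hmisc. The following is a minimal base-R sketch, assuming that maturity ranges from 4 to 72 as the task boundaries suggest. The sample vector holds only the first ten values shown by str(db), and the labels differ slightly from cut2 (no padding space).

```r
# Base-R sketch of the maturity grouping (no Hmisc required).
# Assumption: maturity lies between 4 and 72, as the task boundaries imply.
maturity <- c(18, 9, 12, 12, 12, 10, 8, 6, 18, 24)  # first values from str(db)

maturity.g <- cut(maturity,
                  breaks = c(4, 11, 14, 72),
                  right = FALSE,          # intervals closed on the left: [a, b)
                  include.lowest = TRUE)  # closes the last interval: [14, 72]

levels(maturity.g)
table(maturity.g)
```

With right = FALSE, the include.lowest argument closes the last interval on the right, so the value 72 still falls into the third group.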
> # WoE calculation for the binary target bo
> bo.s <- dt %>%
+ group_by(maturity.g) %>%
+ summarise(no = n(),
+ ng = sum(bo%in%0),
+ nb = sum(bo)) %>%
+ mutate(dr = nb / no) %>%
+ ungroup() %>%
+ mutate(dist.g = ng / sum(ng),
+ dist.b = nb / sum(nb),
+ woe = log(dist.g / dist.b))
> bo.s <- as.data.frame(bo.s)
> bo.s
maturity.g no ng nb dr dist.g dist.b woe
1 [ 4,11) 2009 1961 48 0.02389248 0.2064211 0.096 0.7655698
2 [11,14) 2024 1942 82 0.04051383 0.2044211 0.164 0.2203154
3 [14,72] 5967 5597 370 0.06200771 0.5891579 0.740 -0.2279560
> # map each category to its WoE value
> woe.nv <- bo.s$woe
> names(woe.nv) <- bo.s$maturity.g
> # index by label; indexing by the factor itself would use its integer codes
> db$maturity.woe.b <- woe.nv[as.character(db$maturity.g)]
> # check
> table(db$maturity.woe.b, db$maturity.g)
[ 4,11) [11,14) [14,72]
-0.227955965913351 0 0 5967
0.220315422420577 0 2024 0
0.765569836122015 2009 0 0
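When WoE values are mapped through a named vector, it helps to remember that R indexes by a factor's integer codes, not by its labels. The result is only correct if the summary rows happen to be in factor-level order. A minimal sketch with made-up WoE values (the numbers and the object g are hypothetical, chosen only to show the effect):

```r
# Hypothetical WoE values, deliberately stored in a different order
# than the factor levels.
woe.nv <- c("[14,72]" = -0.23, "[4,11)" = 0.77, "[11,14)" = 0.22)
g <- factor(c("[4,11)", "[14,72]"),
            levels = c("[4,11)", "[11,14)", "[14,72]"))

by.code <- unname(woe.nv[g])                # uses integer codes 1 and 3
by.name <- unname(woe.nv[as.character(g)])  # uses the labels

by.code  # -0.23  0.22  (positions 1 and 3, not the intended values)
by.name  #  0.77 -0.23  (values matched by label)
```

Converting the factor with as.character() makes the lookup independent of row order in the summary table.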
> # logistic regression - WoE coding
> lr.woe.b <- glm(bo ~ maturity.woe.b, family = binomial(link = logit), data = db)
> summary(lr.woe.b)
Call:
glm(formula = bo ~ maturity.woe.b, family = binomial(link = logit),
data = db)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.3578 -0.3578 -0.3578 -0.2876 2.7328
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.94444 0.04668 -63.081 < 2e-16 ***
maturity.woe.b -1.00000 0.14450 -6.921 4.5e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3970.3 on 9999 degrees of freedom
Residual deviance: 3913.9 on 9998 degrees of freedom
AIC: 3917.9
Number of Fisher Scoring iterations: 6
> # logistic regression - dummy coding
> lr.dummy.b <- glm(bo ~ maturity.g, family = binomial(link = logit), data = db)
> summary(lr.dummy.b)
Call:
glm(formula = bo ~ maturity.g, family = binomial(link = logit),
data = db)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.3578 -0.3578 -0.3578 -0.2876 2.7328
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.7100 0.1461 -25.395 < 2e-16 ***
maturity.g[11,14) 0.5453 0.1845 2.955 0.00313 **
maturity.g[14,72] 0.9935 0.1556 6.383 1.73e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3970.3 on 9999 degrees of freedom
Residual deviance: 3913.9 on 9997 degrees of freedom
AIC: 3919.9
Number of Fisher Scoring iterations: 6
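Both logistic fits report the same residual deviance (3913.9), and the WoE-coded model has a slope of -1 with an intercept of -2.94444, which equals log(500/9500). With a single grouped predictor this is expected: the log-odds of a bad outcome in each group equal the overall log-odds minus that group's WoE. The sketch below rebuilds a data set from the counts in the bo.s table to illustrate this. The object names sim, fit.woe and fit.dummy are illustrative and not part of the original script.

```r
# Rebuild grouped data from the counts in the bo.s table.
g  <- c("[ 4,11)", "[11,14)", "[14,72]")
ng <- c(1961, 1942, 5597)   # goods per group
nb <- c(48, 82, 370)        # bads per group

# WoE as defined in the script: log(dist.g / dist.b)
woe <- log((ng / sum(ng)) / (nb / sum(nb)))
names(woe) <- g

sim <- data.frame(
  maturity.g = factor(rep(g, times = ng + nb), levels = g),
  # within each group: ng zeros (good) followed by nb ones (bad)
  bo = unlist(Map(function(good, bad) rep(c(0, 1), c(good, bad)), ng, nb))
)
sim$maturity.woe <- unname(woe[as.character(sim$maturity.g)])

fit.woe   <- glm(bo ~ maturity.woe, family = binomial, data = sim)
fit.dummy <- glm(bo ~ maturity.g,   family = binomial, data = sim)

coef(fit.woe)               # intercept close to log(sum(nb) / sum(ng)), slope close to -1
log(sum(nb) / sum(ng))      # overall log-odds of a bad outcome
all.equal(deviance(fit.woe), deviance(fit.dummy))
```

The WoE model uses one parameter fewer than the dummy model, which is why its AIC in the output above is lower by 2 (3917.9 against 3919.9) at the same deviance.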
> # WoE calculation for the continuous target co
> co.s <- db %>%
+ group_by(maturity.g) %>%
+ summarise(no = n(),
+ sy = sum(co)) %>%
+ ungroup() %>%
+ mutate(po = no / sum(no),
+ py = sy / sum(sy),
+ woe = log(py / po))
> co.s <- as.data.frame(co.s)
> co.s
maturity.g no sy po py woe
1 [ 4,11) 2009 78.45591 0.2009 0.1664434 -0.18815216
2 [11,14) 2024 93.44255 0.2024 0.1982373 -0.02078092
3 [14,72] 5967 299.46856 0.5967 0.6353193 0.06271322
> # map each category to its WoE value
> woe.nv <- co.s$woe
> names(woe.nv) <- co.s$maturity.g
> # index by label; indexing by the factor itself would use its integer codes
> db$maturity.woe.c <- woe.nv[as.character(db$maturity.g)]
> # check
> table(db$maturity.woe.c, db$maturity.g)
[ 4,11) [11,14) [14,72]
-0.188152156905125 2009 0 0
-0.0207809184405925 0 2024 0
0.0627132152120221 0 0 5967
> # linear regression - WoE coding
> lr.woe.c <- lm(co ~ maturity.woe.c, data = db)
> summary(lr.woe.c)
Call:
lm(formula = co ~ maturity.woe.c, data = db)
Residuals:
Min 1Q Median 3Q Max
-0.04823 -0.03439 -0.01531 0.02287 0.26299
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0473407 0.0004228 111.96 <2e-16 ***
maturity.woe.c 0.0444954 0.0043277 10.28 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04224 on 9998 degrees of freedom
Multiple R-squared: 0.01046, Adjusted R-squared: 0.01036
F-statistic: 105.7 on 1 and 9998 DF, p-value: < 2.2e-16
> # linear regression - dummy coding
> lr.dummy.c <- lm(co ~ maturity.g, data = db)
> summary(lr.dummy.c)
Call:
lm(formula = co ~ maturity.g, data = db)
Residuals:
Min 1Q Median 3Q Max
-0.04828 -0.03419 -0.01539 0.02281 0.26293
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0390522 0.0009424 41.440 < 2e-16 ***
maturity.g[11,14) 0.0071150 0.0013303 5.349 9.06e-08 ***
maturity.g[14,72] 0.0111352 0.0010895 10.220 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04224 on 9997 degrees of freedom
Multiple R-squared: 0.01047, Adjusted R-squared: 0.01027
F-statistic: 52.89 on 2 and 9997 DF, p-value: < 2.2e-16
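Unlike the logistic case, the two linear fits are close but not identical (R-squared 0.01046 against 0.01047). The dummy-coded model reproduces the group means of co exactly. The WoE used here is the log of each group's mean relative to the overall mean, so a straight line in WoE only approximates those means. The sketch below checks this using only the figures printed in the co.s table and the reported coefficients. The names group.mean, dummy.coef and woe.fit are illustrative.

```r
# Group counts and sums of co, copied from the co.s table.
no <- c(2009, 2024, 5967)
sy <- c(78.45591, 93.44255, 299.46856)

# WoE as defined in the script: log(py / po)
woe <- log((sy / sum(sy)) / (no / sum(no)))

group.mean <- sy / no
# Dummy coding: intercept = mean of the reference group,
# remaining coefficients = differences from the reference mean.
dummy.coef <- c(group.mean[1], group.mean[-1] - group.mean[1])
round(dummy.coef, 7)

# WoE coding: fitted values implied by the reported coefficients.
woe.fit <- 0.0473407 + 0.0444954 * woe
round(cbind(group.mean, woe.fit), 5)
```

The values in dummy.coef match the coefficients of lr.dummy.c up to rounding, while woe.fit deviates slightly from the group means.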