WoE transformations in regression models

Task: Import the attached woe_iv.csv file, then add a new variable maturity.g to the imported data frame db, defined by grouping the values of the variable maturity into 3 groups based on the given boundaries 4, 11, 14 and 72. Then:

  1. estimate a logistic regression model (dependent variable bo, independent variable maturity.g) using: a) WoE transformation of the independent variable (so-called WoE coding); b) transformation of the independent variable into binary variables (so-called dummy coding);

  2. estimate a linear regression model (dependent variable co, independent variable maturity.g) using: a) WoE transformation of the independent variable (so-called WoE coding); b) transformation of the independent variable into binary variables (so-called dummy coding).
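
For reference, the WoE value that the R session below computes for the binary target (`woe = log(dist.g / dist.b)`) can be written as a formula. The symbols g_i and b_i are notation introduced here for illustration: they stand for the number of good (bo = 0) and bad (bo = 1) observations in group i, which the session calls `ng` and `nb`.

```latex
\mathrm{WoE}_i \;=\; \ln\!\left( \frac{g_i \,/\, \sum_j g_j}{b_i \,/\, \sum_j b_j} \right)
```

With this sign convention, a positive WoE marks a group with a lower share of bad cases than the portfolio as a whole.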

woe_iv.csv
> #run the following commands if the packages are not already installed
> #install.packages("Hmisc")
> #install.packages("dtplyr")
> #install.packages("dplyr")
> library(Hmisc)
> library(dtplyr)
> library(dplyr)
> 
> #import the woe_iv.csv file
> db <- read.csv("woe_iv.csv", header = TRUE)
> str(db)
'data.frame':   10000 obs. of  3 variables:
 $ bo      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ co      : num  0.1361 0.0941 0.0847 0.0122 0.0122 ...
 $ maturity: int  18 9 12 12 12 10 8 6 18 24 ...
> #bo - good (0) / bad (1) indicator
> table(db$bo)

   0    1 
9500  500 
> #create loan maturity groups
> db$maturity.g <- cut2(db$maturity, cuts = c(4, 11, 14))
> #create a lazy data.table object (dtplyr)
> dt <- lazy_dt(db)
> dt
Source: local data table [10,000 x 4]
Call:   `_DT1`

     bo     co maturity maturity.g
  <int>  <dbl>    <int> <fct>     
1     0 0.136        18 [14,72]   
2     0 0.0941        9 [ 4,11)   
3     0 0.0847       12 [11,14)   
4     0 0.0122       12 [11,14)   
5     0 0.0122       12 [11,14)   
6     0 0.0122       10 [ 4,11)   
# ... with 9,994 more rows

# Use as.data.table()/as.data.frame()/as_tibble() to access results
> #WoE calculation
> bo.s <- dt %>% 
+   group_by(maturity.g) %>%
+   summarise(no = n(),
+             ng = sum(bo %in% 0),
+             nb = sum(bo)) %>%
+   mutate(dr = nb / no) %>%
+   ungroup() %>%
+   mutate(dist.g = ng / sum(ng),
+          dist.b = nb / sum(nb),
+          woe = log(dist.g / dist.b))
> bo.s <- as.data.frame(bo.s)
> bo.s
  maturity.g   no   ng  nb         dr    dist.g dist.b        woe
1    [ 4,11) 2009 1961  48 0.02389248 0.2064211  0.096  0.7655698
2    [11,14) 2024 1942  82 0.04051383 0.2044211  0.164  0.2203154
3    [14,72] 5967 5597 370 0.06200771 0.5891579  0.740 -0.2279560
> #map each level to its WoE value
> woe.nv <- bo.s$woe
> names(woe.nv) <- bo.s$maturity.g
> db$maturity.woe.b <- woe.nv[as.character(db$maturity.g)]
> #check
> table(db$maturity.woe.b, db$maturity.g)
                    
                     [ 4,11) [11,14) [14,72]
  -0.227955965913351       0       0    5967
  0.220315422420577        0    2024       0
  0.765569836122015     2009       0       0
> #logistic regression - WoE transformation
> lr.woe.b <- glm(bo ~ maturity.woe.b, family = binomial(link = logit), data = db)
> summary(lr.woe.b)

Call:
glm(formula = bo ~ maturity.woe.b, family = binomial(link = logit), 
    data = db)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.3578  -0.3578  -0.3578  -0.2876   2.7328  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -2.94444    0.04668 -63.081  < 2e-16 ***
maturity.woe.b -1.00000    0.14450  -6.921  4.5e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3970.3  on 9999  degrees of freedom
Residual deviance: 3913.9  on 9998  degrees of freedom
AIC: 3917.9

Number of Fisher Scoring iterations: 6

> #logistic regression - dummy transformation
> lr.dummy.b <- glm(bo ~ maturity.g, family = binomial(link = logit), data = db)
> summary(lr.dummy.b)

Call:
glm(formula = bo ~ maturity.g, family = binomial(link = logit), 
    data = db)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.3578  -0.3578  -0.3578  -0.2876   2.7328  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -3.7100     0.1461 -25.395  < 2e-16 ***
maturity.g[11,14)   0.5453     0.1845   2.955  0.00313 ** 
maturity.g[14,72]   0.9935     0.1556   6.383 1.73e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3970.3  on 9999  degrees of freedom
Residual deviance: 3913.9  on 9997  degrees of freedom
AIC: 3919.9

Number of Fisher Scoring iterations: 6
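
The two summaries above describe the same fit: the residual deviance is 3913.9 in both, and the AIC differs by 2 (3917.9 vs 3919.9) because the dummy model estimates one more parameter. The sketch below is a sanity check, not part of the original session. It rebuilds the group counts from the printed `bo.s` table and shows where the coefficients come from, assuming only base R. The names `group.logodds`, `overall.logodds`, `woe.fit`, `dummy.intercept` and `dummy.coefs` are introduced here for illustration.

```r
# Counts copied from the bo.s table printed in the session above
bo.s <- data.frame(
  maturity.g = c("[ 4,11)", "[11,14)", "[14,72]"),
  ng = c(1961, 1942, 5597),  # good cases (bo == 0) per group
  nb = c(48, 82, 370)        # bad cases  (bo == 1) per group
)

# WoE computed the same way as in the session
bo.s$woe <- log((bo.s$ng / sum(bo.s$ng)) / (bo.s$nb / sum(bo.s$nb)))

# Observed log-odds of a bad case, per group and overall
group.logodds   <- log(bo.s$nb / bo.s$ng)
overall.logodds <- log(sum(bo.s$nb) / sum(bo.s$ng))

# WoE coding: intercept = overall log-odds, slope = -1
# reproduces the observed log-odds of every group
woe.fit <- overall.logodds - 1 * bo.s$woe

# Dummy coding: intercept = log-odds of the reference group [ 4,11),
# remaining coefficients = differences from the reference group
dummy.intercept <- group.logodds[1]
dummy.coefs     <- group.logodds[-1] - group.logodds[1]

print(round(overall.logodds, 5))              # compare with (Intercept) of lr.woe.b
print(all.equal(woe.fit, group.logodds))      # WoE model matches group log-odds
print(round(c(dummy.intercept, dummy.coefs), 4))  # compare with lr.dummy.b
```

Because a single WoE variable built from the same groups and the same target already carries the group log-odds, the logistic fit returns a slope of -1 and an intercept equal to the overall log-odds. The dummy coefficients are the differences between the reference group's WoE and each other group's WoE.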
