Development of a decision tree to classify the most accurate tissue-specific tissue to plasma partition coefficient algorithm for a given compound

ORIGINAL PAPER


Yejin Esther Yun • Cecilia A. Cotton • Andrea N. Edginton

Received: 10 July 2013 / Accepted: 7 November 2013 / Published online: 21 November 2013

© Springer Science+Business Media New York 2013

Abstract Physiologically based pharmacokinetic (PBPK) modeling is a tool used in drug discovery and human health risk assessment. PBPK models are mathematical representations of the anatomy, physiology and biochemistry of an organism and are used to predict a drug’s pharmacokinetics in various situations. Tissue to plasma partition coefficients (Kp), key PBPK model parameters, define the steady-state concentration differential between tissue and plasma and are used to predict the volume of distribution. The experimental determination of these parameters once limited the development of PBPK models; however, in silico prediction methods were introduced to overcome this issue. The developed algorithms vary in input parameters and prediction accuracy, and none are considered standard, warranting further research. In this study, a novel decision-tree-based Kp prediction method was developed using six previously published algorithms. The aim of the developed classifier was to identify the most accurate tissue-specific Kp prediction algorithm for a new drug. A dataset consisting of 122 drugs was used to train the classifier and identify the most accurate Kp prediction algorithm for a certain physicochemical space. Three versions of tissue-specific classifiers were developed and were dependent on the necessary inputs. The use of the classifier resulted in a better prediction accuracy than that of any single Kp prediction algorithm for all tissues, the current mode of use in PBPK model building. Because built-in estimation equations for those input parameters are not necessarily available, this Kp prediction tool will provide Kp prediction when only limited input parameters are available. The presented innovative method will improve tissue distribution prediction accuracy, thus enhancing the confidence in PBPK modeling outputs.

Keywords Physiologically based pharmacokinetic model · Tissue to plasma partition coefficient · Decision tree · Random forest

Abbreviations

AAFE Absolute average fold error
AFE Average fold error
B:P Blood-to-plasma ratio
E Extraction ratio
Exp Experimentally derived Kp values
FE Fold error
Fi Fraction of ionized drug
Fup Unbound fraction in plasma
HSA Human serum albumin
Kp Tissue-to-plasma partition coefficient
Kpu Tissue-to-plasma water partition coefficient
KpuBC Unbound compound concentration in blood cells
LogD Logarithmic value of N-octanol–water partition coefficient adjusted for ionization at pH 7.4
LogKvo:w Logarithmic value of vegetable oil–water partitioning adjusted for ionization at pH 7.4
LogP Logarithmic value of N-octanol–water partition coefficient
M The number of variables
MA Membrane affinity
MFE Mean fold error
mtry Optimal value of the number of variables
ntree Number of trees
PBPK Physiologically based pharmacokinetic
PhS Phosphatidylserine
Pred Predicted Kp values
r2 Coefficient of determination
RBCu Red blood cell partitioning data for unbound drugs
RMSE Root mean square error
SPR Surface plasmon resonance
TCB Tissue composition based
Vss Volume of distribution at steady state

Electronic supplementary material The online version of this article (doi:10.1007/s10928-013-9342-0) contains supplementary material, which is available to authorized users.

Y. E. Yun · A. N. Edginton (✉)
School of Pharmacy, University of Waterloo, 200 University Ave W, Waterloo, ON, Canada
e-mail: [email protected]

C. A. Cotton
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada

J Pharmacokinet Pharmacodyn (2014) 41:1–14
DOI 10.1007/s10928-013-9342-0

Introduction

Physiologically based pharmacokinetic (PBPK) models integrate organism- and compound-specific information within a mathematical framework to describe a compound’s pharmacokinetics. The model structure represents the mammalian system of parallel and serial connections between organs and blood pools. Model parameters reflect the anatomical and physiological aspects of the mammalian system and include organ volumes and blood flows. Parameters relating to the compound include protein binding affinity, tissue-to-plasma partition coefficients (Kp) and intrinsic clearance. Combining appropriate model structure with accurate parameter values allows for a prediction of the pharmacokinetics of the compound in the absence of any real in vivo pharmacokinetic data. The extent of compound distribution into an individual organ is expressed by a steady-state Kp, i.e., the ratio of the concentration of a compound in tissue to that in plasma [1]. Kps are used to quantify the extent of a compound’s distribution from the systemic circulation into the tissues at steady state and are key parameters within a PBPK model. The extent of tissue distribution is dependent on tissue partitioning and the binding affinity of a compound to blood cells, proteins and tissue components [2]. Because tissue composition varies, Kps are tissue-specific.

Historically, Kp values have been derived experimentally in vivo. This is a costly and time-consuming endeavor and was once a limitation in the development of PBPK models. As a result, Kp prediction algorithms have been developed to overcome the need for experimental Kp determination. These algorithms predict Kps based on the underlying physiological and behavioral aspects of a compound in the body [1, 3–11]. Kp prediction algorithms are divided into two categories: (i) tissue-composition-based (TCB) algorithms, which are created solely using the physicochemical properties of a compound along with tissue-specific parameters, and (ii) correlation-based algorithms, which are empirically derived using both compound-specific information and information derived in vivo (e.g., muscle Kp). Algorithm outputs are Kp values based on total concentration (Kp) [1, 4, 5, 11] or unbound concentration (Kpu) [8–10] in the case of drug compounds, or tissue:blood partition coefficients [12] based on total concentration for environmental chemicals.

Tissue-composition-based algorithms

TCB algorithms are mechanistic in nature and aim to describe the degree of drug accumulation based on tissue composition (e.g., acidic phospholipid concentration), physicochemistry and plasma protein binding [1, 7, 9–11, 13]. The main assumption of these models is that the distribution of a compound is primarily governed by passive diffusion into tissue compartments and reversible binding to common proteins. Poulin et al. introduced the first TCB models by calculating Kps as a function of a lipophilicity measure, tissue-specific concentration of lipids and fraction unbound in plasma [1, 7, 12–14]. Berezhkovskiy later revised this method by correcting for the ratio of unbound fraction in tissue to that in plasma [3]. Later, the Rodgers and Rowland model [9] accounted for the electrostatic interaction of basic moieties of moderate to strong bases with acidic phospholipids and passive distribution into intra- and/or extracellular tissue water. Rodgers et al. [10] went on to develop a new mechanistic equation for predicting the Kps of neutrals, acids and weak bases by considering compound interactions with proteins. In Schmitt’s model [11], compound binding to phospholipids was explained mechanistically by accounting for the interaction between charged phospholipids and charged molecules and considering the phosphatidylcholine:buffer partition coefficient and the phospholipid:water partition coefficient. These developed algorithms do not require in vivo information because they rely on a mechanistic understanding of the complex interactions occurring between drug and tissue.

Correlation-based algorithms

The relationship between experimentally determined in vivo parameters (e.g., a muscle Kp) and Kps has been utilized to develop predictive regression equations. The work of Bjorkman [4] demonstrated that muscle Kp can be used to predict the Kps of other lean tissues, and Bjorkman developed regression equations to that end. This work was later refined by Jansson et al. [5], who additionally incorporated lipophilicity into the equations. Another algorithm by Poulin and Theil [8] used the relationship between red blood cell partitioning data for unbound compounds (RBCu) and tissue Kps as well as the relationship between muscle Kps and tissue Kps to develop predictive regression equations. RBCu was determined in vitro and used as an indicator of the degree of binding capacity due to the electrostatic interactions of basic compounds with acidic phosphatidylserine (PhS). This model was later refined by taking into account both the pharmacological activity of a compound and compound-specific properties such as pKa and lipophilicity [8, 15]. The most recent correlation-based approach, Yun and Edginton [16], used volume of distribution at steady state (Vss) as a primary predictor of tissue Kps in addition to physicochemical descriptors.

Despite increasing attention and interest in the accurate prediction of compound distribution [17], a standard Kp prediction method has not been agreed upon within the research community. To date, there is no single prediction algorithm that is applicable to all compounds in all tissues (see Table s.1 in supplementary material), and the predictability of any single Kp prediction algorithm may vary depending on the physicochemical properties of the compound and/or the organ being assessed. Furthermore, the experimental determination of the required compound-specific chemical descriptors and in vitro and in vivo input parameters can limit the use of some Kp prediction algorithms. In other words, the availability of these input parameters often determines the usability of an algorithm. This study aims to determine the best performing algorithm in a specific physicochemical space for a single tissue using a statistical classification technique.

Random forest, a decision-tree-based statistical classification method [18], was utilized to identify the best performing algorithms for given input parameters (e.g., LogP, fup). This technique is empirical in nature, whereby training data are used to develop a system of decision trees that are collated within a forest to support decision making. In the random forest analysis, N bootstrap samples are drawn from training data. According to the principle of recursive partitioning [18, 19], each tree is created using a random set of samples from the training data and a random set of input parameters chosen from a library of inputs. Each decision tree created from a bootstrap sample will result in a classification. The classification with the most votes is selected by the forest. A classifier approach will allow the user to harness the best of all algorithms to predict tissue-specific Kps for a new compound.
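The bootstrap-and-vote mechanism described above can be illustrated with a minimal Python sketch. It is not the randomForest implementation used in this study: the function names are hypothetical, and each "tree" is assumed to be any callable that maps a compound's input parameters to the name of a Kp algorithm.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) training rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the label with the most votes.

    Ties go to the label seen first (Counter ordering in Python 3.7+).
    """
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees, features):
    """Collect one classification per tree and let the forest pick the winner.

    `trees` is a list of callables (hypothetical stand-ins for fitted
    decision trees); `features` holds inputs such as LogP and fup.
    """
    return majority_vote([tree(features) for tree in trees])
```

In a full random forest, each tree would be grown on its own bootstrap sample with a random subset of input parameters considered at each split; only the sampling and voting steps are shown here.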

This study aims to develop a predictive decision-tree-based classifier that will choose the most accurate published Kp algorithm for a new compound within a specific tissue using readily available input parameters and will assess the predictive accuracy of the classifier relative to that of previously published algorithms.

Methods

Input parameters for Kp algorithms

Published Kp prediction algorithms use various inputs. Some of the key input parameters are described below. Lipophilicity is one of the most important physicochemical properties affecting compound disposition. Lipophilicity is measured in various media types, with LogP being the most commonly reported. A measure of lipophilicity is usually available in the early stages of compound discovery either through in vitro experimentation or in silico prediction. The fraction of compound unbound in plasma (fup) is a common input parameter within Kp prediction algorithms because of its pronounced influence on the extent of distribution. Common binding proteins include albumin, glycoproteins, lipoproteins and globulins, and binding to these proteins is assessed through in vitro or ex vivo experimentation and is considered commonly available for a novel compound.

The degree of ionization of a compound at a particular pH has important consequences with respect to distribution. Compounds that are weak acids or weak bases exist in solution at equilibrium between the un-ionized and ionized forms. Only un-ionized nonpolar chemicals are hypothesized to cross the cellular membranes, as ionized compounds are less permeable than un-ionized compounds. At equilibrium, the concentrations of the un-ionized compounds are equal in both plasma and tissue. However, the total concentration in one matrix (e.g., a tissue) may vary depending on the degree of ionization of a compound at a tissue-specific physiological pH. For the statistical analyses performed in this study, the ionized fraction of the compound (fi) [16, 20, 21] represents the degree of ionization at a tissue-specific physiological pH (Table 1, Eq. 1). The fi equations are derived from the Henderson-Hasselbalch equation, and fi ranges from 0 to 1, approaching 1 for a compound that is highly ionized at a specific pH.
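The fi calculation in Table 1, Eq. 1 can be sketched in Python as follows. This is an illustrative sketch only: the function name, the class labels and the argument convention for zwitterions (pka1 as the basic pKa, pka2 as the acidic pKa) are assumptions made here, as is the treatment of neutral compounds as having fi = 0.

```python
def fraction_ionized(compound_class, ph_tissue, pka1=None, pka2=None):
    """Fraction ionized (fi) at a tissue pH, following Table 1, Eq. 1.

    For bases pka1 > pka2; for acids pka1 < pka2. For zwitterions,
    pka1 is the basic pKa and pka2 the acidic pKa (assumed convention).
    """
    if compound_class == "neutral":
        return 0.0  # assumption: no ionizable group, so nothing is ionized
    if compound_class == "monoprotic_base":
        terms = 10 ** (pka1 - ph_tissue)
    elif compound_class == "diprotic_base":
        terms = 10 ** (pka1 - ph_tissue) + 10 ** (pka1 + pka2 - 2 * ph_tissue)
    elif compound_class == "monoprotic_acid":
        terms = 10 ** (ph_tissue - pka1)
    elif compound_class == "diprotic_acid":
        terms = 10 ** (ph_tissue - pka1) + 10 ** (2 * ph_tissue - pka1 - pka2)
    elif compound_class == "zwitterion":
        terms = 10 ** (pka1 - ph_tissue) + 10 ** (ph_tissue - pka2)
    else:
        raise ValueError(f"unknown compound class: {compound_class}")
    # fi = 1 - fui, where fui = 1 / (1 + sum of ionized-species terms)
    return 1.0 - 1.0 / (1.0 + terms)
```

For example, a monoprotic base whose pKa equals the tissue pH is half ionized, so the function returns 0.5.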

Other input parameters used for previously published Kp prediction algorithms include RBCu, muscle Kp and Vss. Compound binding to RBCs can be used as an indicator of in vivo distribution because RBCs are rich in acidic phospholipids, which are responsible for the high binding affinity of basic compounds. Only a few algorithms [8, 9] require the input parameter of RBCu. Poulin and Theil [8] demonstrated using both RBCu and muscle Kp that tissue-specific Kp prediction with muscle Kp as an input variable was more accurate than Kp prediction with RBCu as an input variable. Muscle Kp is also an important factor in Kp prediction because muscle is a highly perfused organ and accounts for approximately 40 % of total body mass [22, 23]. In addition, Vss can be used as an input because it is the parameter that represents the overall extent of drug distribution in the body [5, 16, 24].

Table 1 Summary of equations used to estimate an unknown input parameter

Equation 1. Degree of ionization at a tissue pH: calculation of fraction ionized (fi) and fraction unionized (fui) [16, 20, 21]

$$f_{ui} = \frac{1}{1 + 10^{(pK_a - pH_{tissue})}}, \qquad f_i = 1 - f_{ui}$$

Monoprotic bases:
$$f_i = 1 - \left[1 + 10^{pK_a - pH_{tissue}}\right]^{-1}$$

Diprotic bases:
$$f_i = 1 - \left[1 + 10^{pK_{a1} - pH_{tissue}} + 10^{pK_{a1} + pK_{a2} - 2\,pH_{tissue}}\right]^{-1}$$

Monoprotic acids:
$$f_i = 1 - \left[1 + 10^{pH_{tissue} - pK_a}\right]^{-1}$$

Diprotic acids:
$$f_i = 1 - \left[1 + 10^{pH_{tissue} - pK_{a1}} + 10^{2\,pH_{tissue} - pK_{a1} - pK_{a2}}\right]^{-1}$$

Zwitterions:
$$f_i = 1 - \left[1 + 10^{pK_{a,base} - pH_{tissue}} + 10^{pH_{tissue} - pK_{a,acid}}\right]^{-1}$$

where pKa1 > pKa2 for bases, whereas pKa1 < pKa2 for acids.

Equation 2. LogD: partition coefficient of octanol and water at specific pH [7, 15]

Monoprotic base:
$$\mathrm{LogD} = \mathrm{LogP} - \log\left(1 + 10^{pK_{a1} - 7.4}\right)$$

Diprotic base:
$$\mathrm{LogD} = \mathrm{LogP} - \log\left(1 + 10^{pK_{a1} - 7.4} + 10^{pK_{a1} + pK_{a2} - 2 \times 7.4}\right)$$

Monoprotic acid:
$$\mathrm{LogD} = \mathrm{LogP} - \log\left(1 + 10^{7.4 - pK_{a1}}\right)$$

Diprotic acid:
$$\mathrm{LogD} = \mathrm{LogP} - \log\left(1 + 10^{7.4 - pK_{a1}} + 10^{2 \times 7.4 - pK_{a1} - pK_{a2}}\right)$$

Zwitterions:
$$\mathrm{LogD} = \mathrm{LogP} - \log\left(1 + 10^{pK_{a,base} - 7.4} + 10^{7.4 - pK_{a,acid}}\right)$$

where pKa1 > pKa2 for bases, whereas pKa1 < pKa2 for acids.

Equation 3. Kpu_BC (affinity for blood cell): red blood cell to plasma partition coefficient as it relates to unbound compound [26]

$$Kpu_{BC} = \frac{B{:}P - (1 - \mathrm{Hematocrit})}{\mathrm{Hematocrit} \times f_{up}}$$

where B:P is blood to plasma ratio and fup is unbound fraction in plasma.

Equation 4. KpuBC [27]

$$Kpu_{BC} = \frac{X \cdot f_{IW,RBC}}{Y} + \frac{P \cdot f_{NL,RBC} + (0.3P + 0.7) \cdot f_{NP,RBC}}{Y}$$

where fIW = 0.0914, fNL = 0.0017, fNP = 0.0029, P = antilog values of LogP.
For monoprotic bases: $X = 1 + 10^{pK_a - 7.22}$, $Y = 1 + 10^{pK_a - 7.4}$
For monoprotic acids: $X = 1 + 10^{7.22 - pK_a}$, $Y = 1 + 10^{7.4 - pK_a}$

Equation 5. Blood to plasma ratio (B:P) [9]

$$\log(B{:}P) = -0.004282 + 0.067028\,\mathrm{LogP} + 0.214590\,\log(f_{up}) \qquad (n = 28,\ R^2 = 0.40)$$

This equation was obtained using the Rodgers et al. [9] dataset. In the dataset, there were 28 experimentally determined B:P values available. The regression equation was developed and was statistically significant (P < 0.05).

Equation 6. LogMA: logarithmic value of membrane affinity [11]

$$\mathrm{LogMA} = 1.294 + 0.304 \times \mathrm{LogP}$$

This equation was obtained using Schmitt’s dataset. In the dataset, there were 60 LogMA values available. The regression equation was developed and was statistically significant (P < 0.05).

Equation 7. LogHSA: logarithmic value of human serum albumin (HSA) [11]

$$\mathrm{LogHSA} = 0.294 + 0.135 \times \mathrm{LogP}$$

This equation was obtained using Schmitt’s dataset. In the dataset, there were 60 LogHSA values available. The regression equation was developed and was statistically significant (P < 0.05).

These physicochemical and physiological inputs represent key input parameters for Kp prediction algorithms. Some of these input parameters are readily available, such as a measure of lipophilicity or pKa, whereas others are not routinely measured, such as RBCu or muscle Kp. Due to the difficulty in obtaining some of the input parameters, several algorithms have limited utility in tissue-specific Kp prediction for a novel compound.

Data collection

A database of experimentally derived partition coefficients with corresponding compound physicochemical properties was created from the literature using several MEDLINE searches. In vivo parameters such as the fup and Vss were also included in the database. Data were included in the study based on the following criteria: (i) reported experimentally derived Kp values plausibly represent true steady-state distribution/pseudo equilibrium and (ii) fup, pKa and one of the lipophilicity measures (i.e., LogP, LogD, LogKvo:w) are available. When experimental physicochemical parameters (e.g., all lipophilicity measures, pKa) were not available in the literature, the values were obtained from predictions made in ChEMBL [25]. Experimentally determined values were preferred over predicted values. The stereoselectivity of a compound was considered, if applicable, so that R and S enantiomers were considered separately. As shown in Table s.1, decision trees for the pancreas, testes, thymus and RBC were not generated because the number of data points was insufficient for a classification analysis.

Estimation of required inputs

Table 2 presents the required input parameters for each algorithm. In the event that a required input parameter was not available, it was calculated based on the equations presented in Table 1. For example, if only LogP was available but LogD was the necessary input parameter, LogD was calculated using equations based on those derived by Poulin et al. [7, 15] (Table 1, Eq. 2). For some input parameters [e.g., LogMA, LogHSA and blood:plasma ratio (B:P)], a regression equation was derived using the datasets reported in the publications by Rodgers et al. [9] and Schmitt [11].
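As an illustration of how missing inputs can be filled in from LogP and pKa, the following Python sketch implements the LogD correction (Table 1, Eq. 2) and the LogP-based regressions for LogMA, LogHSA and LogKvo:w (Table 1, Eqs. 6 to 8). The function names and class labels are hypothetical, and returning LogP unchanged for neutral compounds is an assumption made for this sketch.

```python
import math

PH_REFERENCE = 7.4  # pH at which LogD is defined in Table 1, Eq. 2


def log_d(log_p, compound_class, pka1=None, pka2=None):
    """LogD at pH 7.4 estimated from LogP and pKa (Table 1, Eq. 2)."""
    if compound_class == "neutral":
        return log_p  # assumption: no ionization correction for neutrals
    if compound_class == "monoprotic_base":
        ionized = 10 ** (pka1 - PH_REFERENCE)
    elif compound_class == "diprotic_base":
        ionized = (10 ** (pka1 - PH_REFERENCE)
                   + 10 ** (pka1 + pka2 - 2 * PH_REFERENCE))
    elif compound_class == "monoprotic_acid":
        ionized = 10 ** (PH_REFERENCE - pka1)
    elif compound_class == "diprotic_acid":
        ionized = (10 ** (PH_REFERENCE - pka1)
                   + 10 ** (2 * PH_REFERENCE - pka1 - pka2))
    elif compound_class == "zwitterion":
        # pka1 = basic pKa, pka2 = acidic pKa (convention assumed here)
        ionized = 10 ** (pka1 - PH_REFERENCE) + 10 ** (PH_REFERENCE - pka2)
    else:
        raise ValueError(f"unknown compound class: {compound_class}")
    return log_p - math.log10(1.0 + ionized)


def log_ma(log_p):
    """Membrane affinity regression (Table 1, Eq. 6)."""
    return 1.294 + 0.304 * log_p


def log_hsa(log_p):
    """Human serum albumin regression (Table 1, Eq. 7)."""
    return 0.294 + 0.135 * log_p


def log_kvow(log_p):
    """Vegetable oil-water partitioning from LogP (Table 1, Eq. 8)."""
    return 1.115 * log_p - 1.34
```

Experimentally determined values, where available, would take precedence over these estimates, in line with the data collection criteria.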

Affinity for blood cells (KpuBC) (i.e., unbound compound concentration in blood cells) is one of the required inputs for the Rodgers et al. [9, 10] algorithms. KpuBC is a function of fup, B:P and hematocrit. KpuBC is estimated using the standard equation (Table 1, Eq. 3) in the Rodgers models [9, 26]. In the absence of an observed B:P value, KpuBC is estimated using Eq. 4 (Table 1) as proposed by Paixao et al. [27]. This equation was derived from Rodgers et al. [10]. The assumptions for the equations are that (i) in erythrocytes, there is no extracellular space and (ii) albumin and lipoproteins are not contained within the space.

Table 2 Summary of Kp prediction algorithms and their main inputs

| Algorithm | Approach | Main inputs |
| --- | --- | --- |
| Bjorkman [4] | Correlation-based | Muscle Kp |
| Berezhkovskiy [3, 7, 13] | Tissue-composition-based | LogP, LogKvo:w, fup |
| Rodgers et al. [9, 10] | Tissue-composition-based | LogP, pKa, fup, B:P |
| Schmitt [11] | Tissue-composition-based | LogP, LogD, LogKvo:w, LogMA, LogHSA, pKa, fup |
| Jansson et al. [5] | Correlation-based | Vss, Muscle Kp, LogP, LogD, LogKvo:w |
| Poulin and Theil [8] | Correlation-based | Muscle Kp or RBCu |
| Yun and Edginton [16] | Correlation-based | Vss, LogP, pKa, fup |

Table 1 continued

Equation 8. LogKvo:w: logarithmic value of partition coefficient between vegetable oil and water [33]

$$\mathrm{LogK}_{vo:w} = 1.115 \times \mathrm{LogP} - 1.34$$

Equation 9. Fut_lean tissue: fraction of unbound compound in lean tissue [1]

$$Fut_{lean} = \frac{1}{1 + \left(\frac{1 - f_{up}}{f_{up}}\right) \times 0.5}$$

Equation 10. Fut_adipose tissue: fraction of unbound compound in adipose tissue [1]

$$Fut_{adipose} = \frac{1}{1 + \left(\frac{1 - f_{up}}{f_{up}}\right) \times 0.15}$$

Equation 11. Muscle Kp [5]

$$V_{ss} = V_{plasma} + \sum_{i=1}^{n} V_{tissue,i} \times 10^{\,a \times \log(Kp_{muscle}) + b \times \log(\mathrm{lipophilicity}) + c}$$

where the coefficients a, b and c are listed in Jansson et al. [5].

In addition to the mechanistic equation described above, a second approach to B:P estimation was followed, namely, the development of a regression equation (Table 1, Eq. 5). Experimentally determined B:P, LogP and fup (n = 28) were obtained from Rodgers et al. [9], and a predictive regression equation was developed based on the dataset. For the linear regression analysis, the statistical software R version 2.12 [28] was used. The estimation equation that yielded the most accurate Kpu prediction when compared to the experimentally derived Kpu values was selected for the calculation of Kps when Rodgers et al. [9, 10] was used in this study.
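The regression route to KpuBC can be sketched in Python by chaining the B:P regression (Table 1, Eq. 5) into the standard KpuBC equation (Table 1, Eq. 3). The function names are hypothetical, and the default hematocrit of 0.45 is an illustrative assumption rather than a value taken from this study.

```python
import math


def blood_to_plasma_ratio(log_p, fup):
    """Regression estimate of B:P (Table 1, Eq. 5; n = 28, R^2 = 0.40)."""
    log_bp = -0.004282 + 0.067028 * log_p + 0.214590 * math.log10(fup)
    return 10 ** log_bp


def kpu_bc(bp, fup, hematocrit=0.45):
    """Affinity for blood cells, KpuBC (Table 1, Eq. 3).

    KpuBC = (B:P - (1 - Hct)) / (Hct * fup). The default hematocrit is
    an assumed placeholder, not a value reported in the text.
    """
    return (bp - (1.0 - hematocrit)) / (hematocrit * fup)
```

When an observed B:P value is available it would be passed to `kpu_bc` directly; otherwise the regression estimate from `blood_to_plasma_ratio` can stand in for it.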

For the calculation according to Schmitt’s algorithm [11], the logarithmic value of the phosphatidylcholine:water partition coefficient at pH 7.4 (LogMA) and the logarithmic value of human serum albumin (LogHSA) must be estimated in the absence of the experimentally determined values. The regression equations for LogMA (Table 1, Eq. 6) and LogHSA (Table 1, Eq. 7) were generated using a dataset provided by Schmitt [11].

Separation of classifier groups

For researchers performing Kp prediction for a novel compound, the availability of input parameters will not be consistent. For example, when in vivo work has not been performed on a compound, researchers are likely to have only physicochemical input parameters and lack any in vivo input parameters such as muscle Kp. Therefore, a decision tree incorporating algorithms that require in vivo inputs will not be useful to the researcher. Thus, several versions of the Classification trees were created and were based on the likely groupings of input parameters researchers may have. Any additional algorithm-specific input parameters that were required were estimated using the equations in Table 1.

The development and evaluation of Classification tree #1 was dependent on compounds for which muscle Kp, one of the lipophilicity measures (e.g., LogP), pKa and fup were available (Table 3). The development and evaluation of Classification tree #2 was dependent on compounds for which Vss, one of the lipophilicity measures, pKa and fup were available. The development and evaluation of Classification tree #3 was dependent on compounds for which one of the lipophilicity measures, pKa and fup were available. The previously published algorithms that were classified in each of the Classification trees are listed in Table 3 along with the number of compounds used in the development and evaluation of each tree.

Kp calculations according to previously published algorithms

Kps were calculated according to each published algorithm using only those input parameters required for Classification trees #1 through #3 and using estimation equations for any remaining required inputs. For the Berezhkovskiy model [3], LogKvo:w, fraction unbound in lean tissue (Fut_lean) and fraction unbound in adipose tissue (Fut_adipose) were calculated using Eqs. 8, 9 and 10, respectively (Table 1). For Rodgers et al.’s method, the Kps of bases with pKa ≥ 7 were calculated according to Rodgers et al. [9]. LogKvo:w and B:P were estimated by Eqs. 8 and 5, respectively (Table 1). The Kps of acids, zwitterions, neutrals and weak bases were calculated according to Rodgers and Rowland [10]. In Jansson’s algorithm [5], Kp prediction equations for bases and neutrals and Kp prediction equations for acids and zwitterions were used separately. Both experimental muscle Kp (Classification tree #1) and muscle Kp, as derived from experimental Vss (Table 1, Eq. 11, Classification tree #2), were used to predict Kps for Jansson’s algorithms [5]. LogD and LogKvo:w were calculated as a function of LogP using Eqs. 2 and 8 (Table 1). In Schmitt’s model [11], compounds were divided into acids, neutrals, bases and zwitterions and Kps were calculated accordingly. LogMA and LogHSA were estimated using the regression equations Eqs. 6 and 7 (Table 1). In the Yun and Edginton algorithm [16], Kps were estimated by using equations for moderate to strong bases and equations for acids, neutrals and zwitterions. The degree of ionization at a specific tissue pH was calculated using Eq. 1 (Table 1). Because Poulin and Theil’s Kp prediction approach [8] aimed to predict Kps for bases, only Kps of bases were estimated. In Bjorkman’s model [4], Kp prediction equations for acids and bases were separately developed and Kps were calculated accordingly.

Table 3 Physicochemical and/or in vivo inputs for a classifier algorithm and included algorithms for each group

| Group | Inputs for classification | Algorithms |
| --- | --- | --- |
| Group 1 (N = 107 compounds) | Muscle Kp, LogP, fi, fup, Class^a | Berezhkovskiy [3]; Bjorkman [4]; Rodgers et al. [9, 10]; Schmitt [11]; Jansson et al. [5]; Poulin and Theil [8] |
| Group 2 (N = 97 compounds) | Vss, LogP, fi, fup, Class^a | Berezhkovskiy [3]; Rodgers et al. [9, 10]; Schmitt [11]; Jansson et al. [5]; Yun and Edginton [16] |
| Group 3 (N = 122 compounds) | LogP, fi, fup, Class^a | Berezhkovskiy [3]; Rodgers et al. [9, 10]; Schmitt [11] |

a Class: acid–base properties of a compound (A: acid, B: base (pKa ≥ 7.4), WB: base (pKa < 7.4), Z: zwitterions)

To ensure that the use of estimated input parameters (as defined in Table 1) led to Kp predictions that were similar to those predicted using existing algorithms, a comparison of outcomes was made. In this study section, Kps were calculated for Rodgers et al. [9, 10], Schmitt [11] and Jansson et al. [5] (those algorithms that presented their predicted Kps based on experimental inputs only) using the inputs that were either experimental (those required by the Classification trees) or, if appropriate, estimated based on equations in Table 1. The predicted Kp values from the publications (ai) were compared with those calculated in this study section (bi). Mean fold error (MFE, Eq. 12), average fold error (AFE, Eq. 13), absolute average fold error (AAFE, Eq. 14) and root mean square error (RMSE, Eq. 15) were used to measure the deviation between the Kps predicted by the published algorithms and the Kps calculated using experimental and estimated inputs.

$$\mathrm{MFE} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{a_i}{b_i}\right) \qquad (12)$$

$$\mathrm{AFE} = 10^{\frac{1}{n}\sum_{i=1}^{n}\log_{10}\left(\frac{a_i}{b_i}\right)} \qquad (13)$$

$$\mathrm{AAFE} = 10^{\frac{1}{n}\sum_{i=1}^{n}\left|\log_{10}\left(\frac{a_i}{b_i}\right)\right|} \qquad (14)$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}\left[\log_{10}(a_i) - \log_{10}(b_i)\right]^2}{n}} \qquad (15)$$
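Equations 12 to 15 translate directly into a few lines of Python. The sketch below is illustrative; the function names are chosen here, and `a` and `b` are assumed to be equal-length sequences of positive Kp values (ai and bi).

```python
import math


def mfe(a, b):
    """Mean fold error (Eq. 12): arithmetic mean of a_i / b_i."""
    return sum(ai / bi for ai, bi in zip(a, b)) / len(a)


def afe(a, b):
    """Average fold error (Eq. 13): geometric mean of a_i / b_i (bias)."""
    mean_log = sum(math.log10(ai / bi) for ai, bi in zip(a, b)) / len(a)
    return 10 ** mean_log


def aafe(a, b):
    """Absolute average fold error (Eq. 14): magnitude of deviation."""
    mean_abs_log = sum(abs(math.log10(ai / bi)) for ai, bi in zip(a, b)) / len(a)
    return 10 ** mean_abs_log


def rmse(a, b):
    """Root mean square error on the log10 scale (Eq. 15)."""
    squared = sum((math.log10(ai) - math.log10(bi)) ** 2 for ai, bi in zip(a, b))
    return math.sqrt(squared / len(a))
```

An over-prediction and an under-prediction of the same fold size cancel in AFE but not in AAFE, which is why AFE is read as a measure of bias and AAFE as a measure of overall deviation.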

Dataset development

Using the relevant experimental and estimated input parameters required for Groups 1, 2 and 3 (Table 3), comparisons between experimentally derived Kps and predicted Kps from each applicable algorithm were made. The Kp prediction algorithm that resulted in a value that was closest to the experimental Kp was selected as the best-predicting algorithm for the compound within the specific tissue. The best-predicting algorithm for the compound was then coded numerically so that the compound could be categorized by the best-predicting model. This coded information was used as the dependent variable in the statistical analysis.
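The labeling step can be sketched as follows. The text does not state how closeness to the experimental Kp was measured, so the absolute log10 fold error used here is an assumption, as are the function names and the first-appearance coding scheme.

```python
import math


def best_algorithm(experimental_kp, predicted_kps):
    """Name of the algorithm whose predicted Kp is closest to experiment.

    `predicted_kps` maps algorithm name -> predicted Kp for one compound
    in one tissue. Closeness is |log10(pred / exp)| (assumed metric).
    """
    return min(
        predicted_kps,
        key=lambda name: abs(math.log10(predicted_kps[name] / experimental_kp)),
    )


def encode_labels(names):
    """Assign an integer code to each algorithm name, in order of first use."""
    codes = {}
    for name in names:
        codes.setdefault(name, len(codes))
    return codes
```

The resulting integer codes would serve as the dependent (class) variable, with one row per compound and tissue.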

Random forest

A random forest was utilized to build a classifier that identified the most accurate Kp prediction algorithm. The classification analysis was performed using the randomForest package (version 4.6-6) for the statistical software R (version 2.12) [28]. Initially, default parameter values were used for the number of trees in a forest (ntree = 500) and the number of variables sampled at each split (mtry = √M, where M is the total number of variables). By using the rfcv function embedded in the randomForest package [18, 28], the optimal mtry that resulted in the smallest cross-validated error was chosen. Consequently, the random forest was tuned using the optimal value of the number of variables (mtry) [29]. A final random forest model was generated by setting the optimized variable of mtry when trees were grown.

Evaluation of the random forest using cross validation

The developed random forests for Classification trees #1, #2 and #3 corresponding to each group in Table 3 were evaluated. The predictive performance of each Classification tree was evaluated with the total dataset by using 20-fold cross-validation [29]. This method assumes that a random forest developed from 95 % (19/20) of a total dataset is reasonably similar to a final random forest that is developed using 100 % of the total dataset.

The 20-fold validation and analysis were performed as follows:

(i) The total dataset was partitioned into 20 subsets.

(ii) A random forest was created using a training set composed of 19 subsets. The developed random forest then predicted the classification for samples in the remaining subset, which served as the test set. The predicted classification (e.g., best algorithm for compound X in tissue Y = Jansson et al. [5]) for the test set was recorded. This step was repeated 20 times so that each subset was used only once as a test set. As a result, each compound was used once as a test compound.

(iii) For the test dataset including all compounds, each compound was associated with a random forest generated best prediction algorithm.

(iv) The rate of correct classification, per tissue, was calculated (Eq. 16) and compared with random permutation rates (Eq. 17).

$$\text{Rate of correct classification} = \frac{1}{n}\sum_{i=1}^{n} I(\mathrm{Exp}_i = \mathrm{Pred}_i) \qquad (16)$$

where Exp_i is the experimentally derived Kp value, Pred_i is the predicted Kp value and n is the number of Kp values for each tissue.

$$\text{Random permutation rate} = \frac{1}{\text{number of classes to be classified}} \qquad (17)$$

(v) Kp was calculated using the algorithm identified as the most accurate during the cross-validation.

Using this method, the predictive performance of previously published algorithms was compared to the random-forest-generated Kps for each tissue using the same dataset as shown in Tables s.2 and s.3.
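The cross-validation procedure in steps (i) to (iv) can be sketched in plain Python. This is a schematic, not the R workflow used in the study: `fit` is a hypothetical callable that trains any classifier and returns a prediction function, and the shuffling seed is arbitrary. The rate of correct classification is computed here by comparing the best-predicting algorithm label of each compound with the label chosen by the classifier.

```python
import random


def k_fold_indices(n_samples, k=20, seed=0):
    """Shuffle sample indices and split them into k disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, fit, k=20, seed=0):
    """Predict every sample exactly once, using a model trained without it.

    `fit(train_features, train_labels)` must return a callable that maps
    one feature row to a predicted label (hypothetical interface).
    """
    predicted = [None] * len(labels)
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in test_idx:
            predicted[i] = model(features[i])
    return predicted


def rate_of_correct_classification(actual, predicted):
    """Fraction of samples whose predicted label matches (Eq. 16)."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def random_permutation_rate(n_classes):
    """Chance-level rate for n_classes candidate algorithms (Eq. 17)."""
    return 1.0 / n_classes
```

A classifier is informative for a tissue only when its rate of correct classification exceeds the random permutation rate for the number of algorithms in its group.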

Model evaluation: comparative prediction accuracy

The prediction accuracy of each Classification tree was compared to each of the previously published algorithms within its group (Table 3) to assess if any one previously published algorithm performed better than the Classification tree. Thus, using inputs required by the Classification tree with all others estimated based on Table 1, the prediction accuracy of the Classification tree (as defined from the cross-validation step) was compared to the prediction accuracy of each algorithm in the group. Prediction accuracy was based on a comparison of the predicted (ai) and experimentally derived (bi) Kps for each algorithm. To assess the overall precision of each algorithm, RMSE was calculated (Eq. 15) as well as the overall percentage within k-fold deviation (k = 1.25, 1.5, 2, 3) (Eq. 18).

$$\%\ \text{within } k\text{-fold error} = \left[\frac{1}{n}\sum_{i=1}^{n} I\!\left(\frac{1}{k} \le \frac{a_i}{b_i} \le k\right)\right] \times 100\,\% \qquad (18)$$

where I(·) is an indicator function and k = 1.25, 1.5, 2 or 3.
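A short sketch of Eq. 18 is given here, with predicted values a_i and experimentally derived values b_i. The function name and the four Kp pairs are hypothetical and serve only to show how the indicator is evaluated for each k.

```python
def percent_within_k_fold(predicted, observed, k):
    """Eq. 18: percentage of predictions with 1/k <= predicted/observed <= k."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for a, b in zip(predicted, observed) if 1.0 / k <= a / b <= k)
    return 100.0 * hits / len(predicted)


# Hypothetical predicted (a_i) and experimentally derived (b_i) Kp values.
predicted = [1.0, 2.4, 0.5, 6.0]
observed = [1.1, 2.0, 1.2, 1.5]

for k in (1.25, 1.5, 2, 3):
    print(k, percent_within_k_fold(predicted, observed, k))
```

Because the bounds are symmetric on a ratio scale, a twofold over-prediction and a twofold under-prediction are treated alike.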

Tissue-specific RMSE was also calculated to compare

the precision of the models with respect to the tissue. As a

measure of bias, AFE (Eq. 13) was calculated. The AAFE

(Eq. 14) was also calculated to quantify the overall mag-

nitude of the deviation between the predicted and the

experimentally derived Kp values.
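The distinction between bias (AFE) and magnitude of deviation (AAFE) can be illustrated with a small sketch. Eqs. 13 to 15 are not reproduced in this section, so the code assumes the commonly used log10-based forms: AFE as 10 raised to the mean log10 ratio, AAFE as 10 raised to the mean absolute log10 ratio, and RMSE computed on log10-transformed Kps. These forms, the function name and the two data points are assumptions made for illustration.

```python
import math


def fold_error_metrics(predicted, observed):
    """Return (AFE, AAFE, RMSE) under the assumed log10-based definitions."""
    logs = [math.log10(a / b) for a, b in zip(predicted, observed)]
    n = len(logs)
    afe = 10 ** (sum(logs) / n)                    # signed: reflects bias
    aafe = 10 ** (sum(abs(x) for x in logs) / n)   # unsigned: reflects spread
    rmse = math.sqrt(sum(x * x for x in logs) / n)
    return afe, aafe, rmse


# One twofold over-prediction and one twofold under-prediction (hypothetical).
afe, aafe, rmse = fold_error_metrics([2.0, 0.5], [1.0, 1.0])
print(f"AFE = {afe:.2f}, AAFE = {aafe:.2f}, RMSE = {rmse:.3f}")
```

In this example the over- and under-prediction cancel in the AFE, which equals 1, while the AAFE of 2 reveals that each prediction is off by a factor of two. An AFE near 1 therefore does not by itself indicate accurate predictions.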

Results

Dataset

The dataset was composed of a total of 122 compounds

with 852 Kps in 11 tissues (Tables s.2, s.3). The dataset

consisted of 29 acids, 70 bases (63 moderate to strong

bases with pKa ≥ 7.4 and 7 weak bases with pKa < 7.4),

12 neutral compounds and 11 zwitterionic compounds.

Kp calculations according to the previously published

algorithms

For Kp calculation according to Rodgers et al. [9], the

prediction accuracies based on the use of the previously

published estimation equation for KpuBC (Table 1, Eq. 4)

and the developed regression equation for B:P (Table 1,

Eq. 5) as used in Eq. 3 were compared. The use of the

developed regression equation resulted in a more accurate

prediction of Kps with lower tissue-specific RMSE values

(Table s.4). As a result, the developed regression equation

(Eq. 5) was used in all subsequent calculations.

With the use of estimated input parameters (e.g. B:P, LogKvo:w), the Kps calculated using the algorithm of Rodgers et al. [9, 10] resulted in an under-prediction with a 6 % decrease (on average) in AFE value compared to the Kps calculated by the author with the experimentally determined parameters (Table s.5). Using estimated input parameters (Table 1, Eqs. 2, 6, 7, 8), the Kps calculated using each algorithm were in agreement with the Kps obtained by both Jansson et al. [5] and Schmitt [11] (Tables s.6, s.7, respectively).

Table 4 Summary of random forest parameter and classification performance

              Classification tree #1      Classification tree #2      Classification tree #3
Tissue      n  mtry   Rate  % 2-fold    n  mtry   Rate  % 2-fold    n  mtry   Rate  % 2-fold
Adipose    66     5  0.359      51.6   65     2  0.384      54.6   69     4  0.638      60.0
Bone       41     5  0.561      73.2   41     5  0.561      75.6   42     2  0.643      50.0
Brain      78     5  0.385      56.4   76     5  0.395      51.3   90     4  0.644      47.8
Gut        68     5  0.368      72.1   65     5  0.446      80.0   68     4  0.618      60.3
Heart      91     5  0.452      83.3   83     5  0.446      80.7   96     4  0.563      60.4
Kidney     89     5  0.341      73.9   86     5  0.386      69.8   94     4  0.684      55.3
Liver      84     5  0.243      64.2   84     5  0.429      63.1   88     4  0.693      51.1
Lung       93     5  0.312      67.8   85     5  0.365      64.7   95     2  0.589      56.8
Muscle    108     5  0.630      78.7   93     5  0.355      79.6  108     4  0.667      80.5
Skin       64     5  0.328      77.4   61     5  0.393      77.1   64     2  0.719      71.9
Spleen     36     5  0.583      61.1   33     2  0.424      63.6   36     4  0.528      58.3

Rate, rate of correct classification; % 2-fold, % within twofold error

Construction of predictive random forest models:

classification trees #1, #2 and #3

For each tissue, three Classification trees (Table 4) were

developed using the random forest method. The number of

samples and the chosen mtry are listed in Table 4. The clas-

sification performance of each Classification tree is indicated

by the rate of correct classification. Classification trees resul-

ted in a higher rate of correct classification than random per-

mutation rates of 1/6, 1/5 and 1/3 based on the probability of a

correct classification when there are n categories (1/n). The

prediction accuracy for each Classification tree was indicated

by the percentage of predicted values within a twofold devi-

ation of the experimentally derived Kps for each tissue

(Eq. 18). Based on Table 4, the relationship between the rate

of correct classification and the Kp prediction accuracy is

tissue specific. The rates of correct classification for Classification trees #1 and #2 were generally lower than the rate for Classification tree #3 because Classification tree #3 had only two or three algorithms to classify, whereas Classification tree #1 had five to six and Classification tree #2 had four to five (Table 3).
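The comparison with chance can be checked directly against the rates reported in Table 4. The sketch below transcribes those rates and compares the lowest rate of each Classification tree with the random permutation rates quoted in the text (1/6, 1/5 and 1/3). The variable names are illustrative assumptions.

```python
tissues = ["adipose", "bone", "brain", "gut", "heart", "kidney",
           "liver", "lung", "muscle", "skin", "spleen"]

# Rates of correct classification transcribed from Table 4, in tissue order.
table4_rates = {
    "tree1": [0.359, 0.561, 0.385, 0.368, 0.452, 0.341,
              0.243, 0.312, 0.630, 0.328, 0.583],
    "tree2": [0.384, 0.561, 0.395, 0.446, 0.446, 0.386,
              0.429, 0.365, 0.355, 0.393, 0.424],
    "tree3": [0.638, 0.643, 0.644, 0.618, 0.563, 0.684,
              0.693, 0.589, 0.667, 0.719, 0.528],
}

# Random permutation rates quoted in the text for the three trees.
chance = {"tree1": 1 / 6, "tree2": 1 / 5, "tree3": 1 / 3}

for tree, rates in table4_rates.items():
    worst = min(rates)
    worst_tissue = tissues[rates.index(worst)]
    print(f"{tree}: lowest rate {worst:.3f} ({worst_tissue}), "
          f"chance {chance[tree]:.3f}")

all_above = all(min(r) > chance[t] for t, r in table4_rates.items())
print("all tissues above chance:", all_above)
```

Even the weakest tissue of each tree (liver for #1, muscle for #2, spleen for #3) lies above the corresponding chance level.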

Comparative assessment of Kp prediction accuracy

of Classification trees and published equations

Comparison of prediction accuracy of Classification tree #1

and published equations

To assess whether Classification tree #1 offered improved

predictive performance over any one relevant published

algorithm alone, the tissue AFE, AAFE and RMSE were

calculated using the same dataset (Tables s.2, s.3). A plot of

percentage within k-fold deviation from experimentally

derived values showed that predictions based on Classification

tree #1 performed well, with 25.6, 49.7 and 68.8 % falling

within 1.25-, 1.5- and 2-fold deviation from the experimentally

derived Kp values, respectively (Fig. 1). The global RMSEs of

the algorithms in Group 1 indicate that the Kp prediction errors

are similar for Jansson et al. [5], Rodgers et al. [9, 10] and

Classification tree #1 (Table 5). However, Rodgers et al. [9,

10] and Classification tree #1 tended to under-predict Kp, with

AFE values of 0.89 and 0.94, respectively. Rodgers et al. [9,

10] under-predicted Kps in bone, kidney and liver. Jansson

et al. [5] showed the smallest RMSE value of 0.43 but appeared

to over-predict Kp with an AFE of 1.27 (Fig. 1; Table 5). The

over-prediction of Kps by Jansson et al. [5] was observed in

kidney, liver and adipose tissue. The overall bias of deviation

between the experimentally derived Kps and those estimated

using Classification tree #1 was the smallest in Group 1, with

an AFE value of 0.94 (Table 5). This result is further supported

by the tissue-specific box and whisker plots (Figure s.1 in

supplementary material), where the boxes for Classification

tree #1 are small and centered around zero and do not show

evidence of serious under- or over-prediction. Tissue-specific

RMSEs showed that the Kp prediction of Jansson et al. [5]

resulted in the smallest error for 7 out of 10 tissues in Group 1

(Table s.8). It was observed that Berezhkovskiy [3], Schmitt

[11] and Bjorkman’s models [4] tended to over-predict Kps

with an AFE value larger than 1 (Table 5). On the other hand,

Rodgers et al. [9, 10] and Poulin and Theil’s [8] models tended

to under-predict the Kps with an AFE value < 1.

Fig. 1 Percentage of predicted Kps from each algorithm in Group 1 (Berezhkovskiy, Rodgers et al., Schmitt, Jansson et al., Bjorkman, Poulin and Theil, and Classification tree #1) within k-fold error (x-axis, 1.00 to 3.00) of the experimentally derived Kps; y-axis: percentage within k-fold error (0 to 100)

Comparison of prediction accuracy of Classification tree #2 and published equations

To compare the predictive performance of the published algorithms and Classification tree #2, tissue AFE, AAFE and RMSE were calculated. Both Classification tree #2 and

Yun and Edginton [16] yielded more accurate Kp predictions

with higher percentages within k-fold deviation from the

experimentally derived Kps (k = 1.25–3) compared to other

algorithms (Fig. 2). The prediction performances of both

Classification tree #2 and Yun and Edginton’s algorithm [16]

were very similar with almost the same AFE, AAFE, global

RMSE and tissue-specific RMSE values (Table 5, Table

s.9). The plot of percentage within k-fold deviation from

experimentally derived values showed that Kp prediction

based on Classification tree #2 performed well, with 31.9 and

50.4 % falling within 1.25- and 1.5-fold deviation from

the experimentally derived Kps, respectively (Fig. 2). In 7

out of 11 tissues, the Yun and Edginton algorithm [16]

resulted in the smallest error associated with Kp estimates

(Table s.9). Jansson et al. [5] showed an over-prediction of

Kps, which was mainly due to the over-prediction in adipose

and liver Kps (Figure s.2). Schmitt’s algorithm [11] tended to

over-predict Kps (Table 5) especially in adipose, brain, heart

and skin (Figure s.2). Although Berezhkovskiy’s [3] algorithm resulted in an AFE value close to 1 (1.02), its AAFE value was 2.92 (Table 5). This finding implies that Kp predictions were less accurate and that under- and over-predictions of Kps were of similar magnitude and offset each other. The box and whisker plots (Figure s.2) show that there was over-prediction in brain and adipose tissue Kps and an under-prediction in gut and lung Kps.

Table 5 Summary of overall prediction performance for Group 1, 2 and 3

Group 1
Algorithm                  AFE    AAFE   RMSE
Berezhkovskiy [3]          1.14   3.21   0.67
Rodgers et al. [9, 10]     0.89   2.34   0.51
Schmitt [11]               1.37   3.36   0.66
Jansson et al. [5]         1.27   1.98   0.43
Bjorkman [4]               1.52   2.81   0.62
Poulin and Theil [8]       0.16   8.34   1.25
Classification tree #1     0.94   2.00   0.49

Group 2
Algorithm                  AFE    AAFE   RMSE
Berezhkovskiy [3]          1.02   2.92   0.60
Rodgers et al. [9, 10]     0.93   2.20   0.45
Schmitt [11]               1.28   3.20   0.64
Jansson et al. [5]         1.21   2.06   0.45
Yun and Edginton [16]      1.01   1.78   0.36
Classification tree #2     1.03   1.82   0.37

Group 3
Algorithm                  AFE    AAFE   RMSE
Berezhkovskiy [3]          1.16   3.18   0.66
Rodgers et al. [9, 10]     0.91   2.33   0.52
Schmitt [11]               1.37   3.27   0.65
Classification tree #3     0.95   2.14   0.45

AFE average fold error, AAFE absolute average fold error, RMSE root mean square error

Fig. 2 Percentage of predicted Kps from each algorithm in Group 2 (Berezhkovskiy, Rodgers et al., Schmitt, Jansson et al., Yun and Edginton, and Classification tree #2) within k-fold error (x-axis, 1.00 to 3.00) of the experimentally derived Kps; y-axis: percentage within k-fold error (0 to 100)

Comparison of prediction accuracy of Classification tree

#3 and published equations

To compare the predictive performance of the published

algorithms and Classification tree #3, tissue AFE, AAFE

and RMSE were calculated. Classification tree #3 resulted

in accurate predictions in Group 3, with the highest per-

centages within k-fold deviation from experimentally

derived Kps (Fig. 3), the smallest global RMSE and AAFE

and an AFE closest to 1 (Table 5). In 9 out of 11 tissues,

Classification tree #3 resulted in the smallest tissue-specific

RMSEs (Table s.10). The Berezhkovskiy [3] and Schmitt

[11] algorithms were less accurate with an AAFE > 3, and

both had a tendency to over-predict the Kps (Table 5).

Rodgers et al. [9, 10] under-predicted the Kps especially in

bone, kidneys, liver and lungs (Figure s.3).

The global RMSE, AFE, and AAFE values for Clas-

sification trees #1, #2 and #3 were comparable. However,

in the case of Classification tree #3, the percentage within

k-fold deviation from the experimentally derived Kps was

lower than those of Classification trees #1 and #2.

Discussion

One study objective was to develop a tool for Kp prediction

when only a limited number of input parameters are

available. For each tissue, Classification trees #1, #2 and

#3, which depended on user-supplied input parameters (i.e.,

LogP, pKa, fup, Vss and muscle Kp) as well as estimated

input parameters that were required but not deemed readily

available, were constructed. An assessment of the validity

of using estimation equations as a replacement for handling

generally unavailable input parameters was made. Rodgers

et al. [9, 10] demonstrated that the use of experimentally

determined inputs such as B:P and LogKvo:w resulted in

more accurate Kp predictions with lower tissue-specific

RMSEs when compared to Kps calculated using estimated

inputs (Tables s.1–3 in supplementary material). However,

the use of estimated inputs for the Rodgers et al. [9, 10]

algorithm resulted in Kpus that were comparable, although

not superior to, Kpus calculated using experimentally

determined inputs. The accuracy metrics such as tissue-

specific RMSEs and AFEs were comparable (Table s.5).

Similarly, in Jansson et al.’s algorithm [5], the use of an

experimentally determined muscle Kp resulted in more

accurate predictions in the heart, kidney, liver and lung

when compared to the prediction accuracy of Jansson et al.

[5], which used a muscle Kp that was estimated from Vss

(Table s.6). As a result, Jansson et al.’s algorithm [5] was

selected as the best-predicting algorithm in Classification

tree #1, which used muscle Kp as an input, more often than

in Classification tree #2, which used Vss as an input.

Overall, Kp predictions made with the estimated input parameters agreed sufficiently with those made with experimentally determined inputs to justify the use of estimated values as inputs to the Classification trees.

Fig. 3 Percentage of predicted Kps from each algorithm in Group 3 (Berezhkovskiy, Rodgers et al., Schmitt, and Classification tree #3) within k-fold error (x-axis, 1.00 to 3.00) of the experimentally derived Kps; y-axis: percentage within k-fold error (0 to 100)

Kp prediction via a Classification tree depends on two important factors. The first factor is the accuracy of each Kp prediction algorithm in each group (e.g., Rodgers et al. [9, 10], Jansson et al. [5]), and the second factor is the classification performance of a classifier (i.e., a random forest). Although poor prediction of Kps and/or poor classification by a classifier can lead to an undesirable

outcome, there is no clear relationship between the accu-

racy of a Kp prediction method and classification perfor-

mance. A higher rate of correct classification will not

always result in an overall lower RMSE. Even when the best-performing algorithm is correctly selected for certain compounds, it is the degree of error from the incorrectly classified compounds that contributes most to the RMSE. This result was observed because the predicted Kp from an algorithm selected by the random forest can deviate substantially from the corresponding experimentally derived Kp (Table 4). Thus, the interplay of these two

factors should be taken into consideration in the interpre-

tation of Kp prediction via Classification trees #1, #2

and #3.
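A toy calculation may make this interplay concrete. In the sketch below, two hypothetical classifiers choose between two hypothetical algorithms, X and Y, for four compounds. All error values are invented for illustration and are expressed as absolute log10 fold errors. Classifier A selects the best algorithm more often, yet its single misclassification carries a large error and dominates the RMSE.

```python
import math

# Hypothetical absolute log10 fold errors of algorithms X and Y per compound.
errors = [
    {"X": 0.05, "Y": 0.10},
    {"X": 0.10, "Y": 0.12},
    {"X": 0.08, "Y": 0.10},
    {"X": 0.05, "Y": 1.50},  # choosing Y here is a costly mistake
]


def evaluate(choices):
    """Return (rate of correct classification, RMSE) for a list of choices."""
    best = [min(e, key=e.get) for e in errors]
    rate = sum(c == b for c, b in zip(choices, best)) / len(errors)
    rmse = math.sqrt(sum(e[c] ** 2 for e, c in zip(errors, choices)) / len(errors))
    return rate, rmse


rate_a, rmse_a = evaluate(["X", "X", "X", "Y"])  # one severe misclassification
rate_b, rmse_b = evaluate(["Y", "Y", "X", "X"])  # two mild misclassifications

print(f"A: rate = {rate_a:.2f}, RMSE = {rmse_a:.3f}")
print(f"B: rate = {rate_b:.2f}, RMSE = {rmse_b:.3f}")
```

Classifier A is correct for three of four compounds and classifier B for two of four, but B has the lower RMSE because its errors occur where the competing algorithms perform similarly.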

When experimentally determined muscle Kp and phys-

icochemical parameters (e.g., LogP, pKa, and fup) are

available, six Kp prediction algorithms can be used,

namely, the algorithms used in Classification tree #1. It was

observed that the use of Classification tree #1 improved the

Kp prediction accuracy and bias over any one of the six

prediction algorithms (Table 5; Fig. 1) with the algorithm

that requires both physico-chemical and in vivo inputs,

Jansson et al. [5], having similar accuracy metrics (but a

higher bias).

Both the Yun and Edginton algorithm [16] and Classi-

fication tree #2 had a high Kp prediction accuracy with a

high percentage within k-fold deviation from the experi-

mentally derived Kps. Notably, both the Jansson et al. [5]

and Yun and Edginton [16] models that used Vss showed

high accuracy and precision in Kp prediction. This result

further implies that the availability of the in vivo parameter

Vss and the use of these correlation models improve Kp

prediction accuracy over TCB algorithms.

TCB models [3, 9–11] require a minimal number of

input parameters such as ex vivo fup and physicochemical

parameters. Classification tree #3 identified the best-pre-

dicting model based on basic inputs (pKa, fup, LogP) and

improved the Kp prediction accuracy over any one TCB

prediction algorithm alone. It is expected that Classifica-

tion tree #3 will be the most applicable in early drug dis-

covery when compared to Classification trees #1 and #2

because the use of the Classification trees #1 and #2 is

limited by the availability of an in vivo parameter (i.e.,

muscle Kp or Vss).
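The dependence of each Classification tree on the available inputs can be expressed as a simple lookup. The sketch below assumes that all three trees need the basic inputs (LogP, pKa, fup), that Classification tree #1 additionally needs an experimentally determined muscle Kp and that Classification tree #2 additionally needs Vss. The function name and input labels are hypothetical, and the exact input sets should be checked against Table 3.

```python
def applicable_trees(available_inputs):
    """Return the Classification trees whose assumed inputs are all available."""
    basic = {"LogP", "pKa", "fup"}
    requirements = {
        "Classification tree #1": basic | {"muscle Kp"},  # assumed input set
        "Classification tree #2": basic | {"Vss"},        # assumed input set
        "Classification tree #3": basic,
    }
    have = set(available_inputs)
    return [name for name, needed in requirements.items() if needed <= have]


# Early discovery: only physicochemical inputs and fup are known.
print(applicable_trees(["LogP", "pKa", "fup"]))

# Later stage: an in vivo Vss has been measured as well.
print(applicable_trees(["LogP", "pKa", "fup", "Vss"]))
```

The function returns every applicable tree rather than a single recommendation, since the choice among them depends on the considerations discussed in the text.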

Classification trees exhibited better prediction performance in most tissues, with little bias toward over- or under-prediction. According to the plots of

the percentage of predicted Kps within 1.25- and 1.5-fold

deviations from the experimentally derived Kps, Classifi-

cation trees #1, #2 and #3 showed higher percentages than

the other algorithms in each group (Figs. 1, 2, 3). Based on

these results, it can be concluded that Classification trees

offer advantages over the use of any single algorithm to

predict tissue-specific Kps for a compound. Further, on

comparison of the performance of algorithms from the

trees, algorithms that combine both physico-chemical

inputs and in vivo inputs perform better than TCB models

and, as a result, Classification trees #1 and #2 performed

better (greater percentage within k-fold error) than Clas-

sification tree #3, which only incorporated physico-chem-

ical inputs. Correlation-based models depend on the dataset

that is used in their derivation. A correlation model may

perform better if the chemical properties of a new com-

pound are similar to those used to develop the corre-

sponding regression equations. However, this is only true if

the chemical properties are the only determinants for the

tissue distribution of the compound. In the case in which

the chemical properties of the new drug are not similar to

the chemical properties that were used for the development

of the regression equations, a TCB model may perform

better than a correlation model because a TCB model is not

empirical but mechanistic. Therefore, the performance of

Kp prediction algorithms should be evaluated using an

external dataset not used for the development of the correlation model because the apparent prediction performance of a regression-based algorithm could be an artifact of the dataset used to derive it. Recently, researchers compared the predictive

performance of Kp algorithms using Vss as an outcome.

Using an independent dataset [17], it was observed that a

correlation model (i.e., Jansson et al. [5]) exhibited better

Kp prediction performance than a TCB model (i.e., Rod-

gers et al. [9, 10]). However, the TCB models do have an

advantage in that they are applicable for any species if the

tissue-specific physiological parameters are available [9].

The accuracy of the TCB method depends on how well

the factors describing the underlying process in tissue

distribution (e.g., compound binding affinity to cell con-

stituents) are formulated. Unreasonable formulation in the

structure or uncertainty in physiological and/or chemical

parameter values can lead to poor prediction of Kp. The

underlying mechanism of a Kp prediction algorithm may

not hold for a compound under certain physicochemical

conditions. For example, a different approach was needed

to overcome the poor Kp prediction accuracy for highly

lipophilic compounds. It is known that the high lipophil-

icity of a compound is associated with a large tissue dis-

tribution (i.e., large Kp, large Vss). Rodgers et al. [30]

demonstrated that Vss increases exponentially when LogP

increases above a LogP of 6. In terms of the currently

available algorithms (e.g., Jansson et al. [5], Rodgers et al.

[9, 10], Yun and Edginton [16]), all equations are designed

such that an increase in lipophilicity leads to an increase in

Kp values. Above a certain LogP value, however, this

relationship between distributional parameters and LogP

may not hold true because Kp and/or Vss may reach a



plateau [31, 32]. Therefore, in Poulin and Haddad’s simplified model [32] for highly lipophilic compounds (LogP > 6), regardless of a compound’s acid–base–neutral properties, compound partitioning into neutral lipids is prevalent [32] and the plateau concept holds true. In the present study, LogP values ranged from -3 to 6. Thus, none of the algorithms included in the Classification trees is appropriate to use with compounds for which LogP > 6. Therefore, user caution is recommended in the Kp prediction of highly lipophilic compounds (LogP > 6). Because drug compounds tend to have LogP values that are < 6, this is not expected to affect the accuracy of Kp prediction for small drug molecules. For environmental contaminants, however, LogP values often exceed 6 and the use of certain algorithms will over-predict Kps.
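A simple guard of the following kind could flag compounds that fall outside the LogP range covered by the dataset. It is only a sketch: the function name and messages are hypothetical, and the warning for LogP values below -3 is an added assumption based on the stated range of the dataset rather than a recommendation made in the text.

```python
LOGP_MIN, LOGP_MAX = -3.0, 6.0  # LogP range of the compounds in the dataset


def logp_caution(logp):
    """Return a caution message if LogP is outside the dataset range, else None."""
    if logp > LOGP_MAX:
        return "caution: LogP > 6, Kp may plateau and be over-predicted"
    if logp < LOGP_MIN:
        # Assumption: extrapolation below the dataset range is also flagged.
        return "caution: LogP below the range of the dataset"
    return None


for value in (2.5, 7.1):
    print(value, logp_caution(value))
```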

In the presence of metabolic elimination and/or transport

carriers, there would be a discrepancy between true and

estimated Kp values under the assumption of no elimina-

tion or carrier-mediated tissue partitioning. The empirical

model for estimating Kps is highly dependent on the

dataset used. If a dataset is composed of numerous com-

pounds for which tissue distribution is affected by elimi-

nation or active transport, observations in the dataset can

be influential in determining the coefficient of an equation,

which can lead to the poor Kp prediction of a new obser-

vation. The relationship between in vivo parameters, the

chemical properties of a compound and tissue Kps is not

currently robust enough to describe tissue partitioning in

the presence of these processes. Thus, user discretion is

recommended in using Kp prediction algorithms for com-

pounds that are significantly affected by metabolic elimi-

nation or carrier-mediated transport. With that being said,

when Kps are used within a PBPK model framework, this

passive diffusion Kp is the desired parameter value. The

effect of extensive metabolism in an eliminating organ or

the effect of transporters in tissue distribution is taken into

account, not through Kp but through the incorporation of

enzymes or transporters.

One of the limitations of classification-tree-based Kp

prediction is that it is mathematically complex. To over-

come this problem, Classification trees #1, #2 and #3 are

available as a web-based program for public consumption

(http://spark.rstudio.com/kprftree/myapp/). This program

features a Classification tree calculator that defines the

best-predicting algorithm as well as a Kp calculator that

calculates Kp values using the best-predicting algorithm.

In conclusion, this study proposed novel Classification

trees for predicting the best-performing Kp prediction

algorithm as a function of tissue and compound. Classifi-

cation-tree-based Kp prediction overcomes the limitations

of any one algorithm by harnessing the best components of

each and by providing Kp prediction for 11 tissues, some of

which are not included in all algorithms (e.g. the Jansson

model does not provide an equation for spleen). The

Classification trees, especially those relying solely on physico-chemical inputs, had better prediction performance than any one algorithm within the group. Further, Classification trees #1 and #2 performed better than Classification tree #3, suggesting that researchers with any relevant

in vivo inputs should use them unless the compound has

features that are expected to be vastly different from those

compounds used in the development of these Classification

trees (see Table s.2). In these cases, Classification tree #3

and/or a TCB model should also be consulted. It is hoped

that an increased prediction performance of Kps will lead

to more appropriate parameterization of PBPK models and

will enhance the predictability of a compound’s

pharmacokinetics.

References

1. Poulin P, Theil FP (2000) A priori prediction of tissue:plasma

partition coefficients of drugs to facilitate the use of physiologi-

cally-based pharmacokinetic models in drug discovery. J Pharm

Sci 89:16–35

2. Peters SA (2012) Pharmacokinetic principles. In: Physiologically-based pharmacokinetic (PBPK) modeling and simulations: principles, methods, and applications in the pharmaceutical industry. Wiley, Hoboken

3. Berezhkovskiy LM (2004) Volume of distribution at steady state

for a linear pharmacokinetic system with peripheral elimination.

J Pharm Sci 93(6):1628–1640

4. Bjorkman S (2002) Prediction of the volume of distribution of a

drug: which tissue–plasma partition coefficients are needed?

J Pharm Pharmacol 54(9):1237–1245

5. Jansson R, Bredberg U, Ashton M (2008) Prediction of drug

tissue to plasma concentration ratios using a measured volume of

distribution in combination with lipophilicity. J Pharm Sci

97(6):2324–2339

6. Peyret T, Poulin P, Krishnan K (2010) A unified algorithm for pre-

dicting partition coefficients for PBPK modeling of drugs and

environmental chemicals. Toxicol Appl Pharmacol 249(3):197–207

7. Poulin P, Schoenlein K, Theil FP (2001) Prediction of adipose

tissue: plasma partition coefficients for structurally unrelated

drugs. J Pharm Sci 90(4):436–447

8. Poulin P, Theil FP (2009) Development of a novel method for

predicting human volume of distribution at steady-state of basic

drugs and comparative assessment with existing methods.

J Pharm Sci 98(12):4941–4961

9. Rodgers T, Leahy D, Rowland M (2005) Physiologically based

pharmacokinetic modeling 1: predicting the tissue distribution of

moderate-to-strong bases. J Pharm Sci 94(6):1259–1276

10. Rodgers T, Rowland M (2006) Physiologically based pharmaco-

kinetic modelling 2: predicting the tissue distribution of acids, very

weak bases, neutrals and zwitterions. J Pharm Sci 95(6):1238–1257

11. Schmitt W (2008) General approach for the calculation of tissue

to plasma partition coefficients. Toxicol In Vitro 22(2):457–467

12. Poulin P, Krishnan K (1995) A biologically-based algorithm for

predicting human tissue: blood partition coefficients of organic

chemicals. Hum Exp Toxicol 14(3):273–280

13. Poulin P, Theil FP (2002) Prediction of pharmacokinetics prior to

in vivo studies. 1. Mechanism-based prediction of volume of

distribution. J Pharm Sci 91(1):129–156



14. Poulin P, Krishnan K (1996) A tissue composition-based algo-

rithm for predicting tissue:air partition coefficients of organic

chemicals. Toxicol Appl Pharmacol 136(1):126–130

15. Poulin P, Ekins S, Theil FP (2011) A hybrid approach to

advancing quantitative prediction of tissue distribution of basic

drugs in human. Toxicol Appl Pharmacol 250(2):194–212

16. Yun YE, Edginton AN (2013) Correlation-based prediction of

tissue-to-plasma partition coefficients using readily available

input parameters. Xenobiotica 43(10):839–852

17. Jones RD, Jones HM, Rowland M, Gibson CR, Yates JW, Chien

JY et al (2011) PhRMA CPCDC initiative on predictive models

of human pharmacokinetics, part 2: comparative assessment of

prediction methods of human volume of distribution. J Pharm Sci

100(10):4074–4089

18. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

19. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classifi-

cation and regression trees. Wadsworth International Group,

Belmont

20. Martin A, Bustamante P, Chun AHC (1993) Physical pharmacy:

physical chemical principles in the pharmaceutical sciences. Lea

& Febiger, Philadelphia, pp 297–298

21. Zhang H (2005) A new approach for the tissue-blood partition

coefficients of neutral and ionized compounds. J Chem Inf Model

45(1):121–127

22. Hinderling PH (1997) Red blood cells: a neglected compartment

in pharmacokinetics and pharmacodynamics. Pharmacol Rev

49(3):279–295

23. Kurz H, Fichtl B (1983) Binding of drugs to tissues. Drug Metab

Rev 14(3):467–510

24. Arundel P (1997) A multi-compartmental model generally

applicable to physiologically-based pharmacokinetics. 3rd IFAC

Symposium: Modelling and control in biomedical systems; 1997

23–26 March; University of Warwick, Coventry UK: AstraZen-

eca, London, UK

25. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey

A et al (2012) ChEMBL: a large-scale bioactivity database for

drug discovery. Nucleic Acids Res 40:D1100–D1107

26. Rowland M, Tozer T (2011) Clinical pharmacokinetics and

pharmacodynamics: concepts and applications, 4th edn. Wolters

Kluwer Health/Lippincott Williams & Wilkins, Philadelphia

27. Paixao P, Gouveia LF, Morais JA (2009) Prediction of drug

distribution within blood. Eur J Pharm Sci 36(2):544–554

28. R Development Core Team (2008) R: a language and environment

for statistical computing. R foundation for statistical computing,

Vienna, Austria, ISBN 3-900051-07-0. http://www.R-project.org

29. Svetnik V, Liaw A, Tong C, Wang T (2004) Application of Brei-

man’s random forest to modeling structure-activity relationships of

pharmaceutical molecules. In: Roli F, Kittler J, Windeatt T (eds)

Multiple classifier systems, fifth international workshop, MCS 2004,

Proceedings, Cagliari, Italy, 9–11 June 2004. Lecture Notes in

Computer Science, vol 3077. Springer, Berlin, pp 334–343

30. Rodgers T, Rowland M (2007) Mechanistic approaches to vol-

ume of distribution predictions: understanding the processes.

Pharm Res 24(5):918–933

31. Haddad S, Poulin P, Krishnan K (2000) Relative lipid content as

the sole mechanistic determinant of the adipose tissue: blood

partition coefficients of highly lipophilic organic chemicals.

Chemosphere 40(8):839–843

32. Poulin P, Haddad S (2012) Advancing prediction of tissue dis-

tribution and volume of distribution of highly lipophilic com-

pounds from a simplified tissue-composition-based model as a

mechanistic animal alternative method. J Pharm Sci 101(6):

2250–2261

33. Leo A, Hansch C, Elkins D (1971) Partition coefficients and their

uses. Chem Rev 71(6):525–616

