Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 565–578, ISSN 1896-7094, © 2007 PIPS
Application of stochastic processes in Internet survey
Elżbieta Getka-Wilczyńska
Warsaw School of Economics, Institute of Econometrics, Division of Mathematical Statistics,
al. Niepodleglosci 164, 02-554 Warsaw, Poland, [email protected]
Abstract. In this paper, Poisson processes and basic methods of reliability theory are proposed for the interpretation, definition and analysis of some stochastic properties of the process of Internet data collection. First, the notion of an uncontrolled sample is introduced and its random size is defined as a counting process. Second, the process of Internet data collection is considered as a life test of the population surveyed. The events which appear in an Internet survey are interpreted as the lifetime, arrival or death of an element of the population, and the basic reliability characteristics of the length of the population lifetime are described, calculated and estimated by using the notions and methods of reliability theory.
Keywords: Internet survey, uncontrolled sample, population, coherent system, estimation, reliability function
1 Introduction
As far as sample selection is concerned, statistical research can currently be divided into representative surveys based on a probability sample and surveys based on a nonprobability sample, e.g. Internet surveys. After the kind of sample selection has been chosen, the next stage of the survey is data gathering. In all surveys, the data are collected using a direct (face-to-face) interview, a telephone interview, mail or, in recent years, an interview over the Internet. The difference between these kinds of research rests on the following aspects. In representative surveys based on a probability sample the sampling frame is known, respondents are drawn into the sample by a statistician in accordance with the sampling design (sampling scheme), the methods of sampling theory are applied to data analysis, and an electronic questionnaire is only one of the modes of data collection. Internet surveys have several advantages, such as the low cost of collecting information, the speed of data transmission and the possibility of monitoring it. The use of an electronic questionnaire in an Internet survey makes the interview more efficient, lowers the workload of the respondents and helps to control the quality of the responses. But in Internet surveys the sampling frame usually does not exist and the populations surveyed are sets of unidentifiable elements (sometimes described only in general terms); drawing a sample is not possible, and respondents are not randomly selected but take part in the survey by their own subjective decision. Moreover, in Internet surveys data sets are collected in an uncontrolled way, and the respondents who took part in the survey form what we call here an uncontrolled sample. The methods of sampling theory cannot be used for data from such samples because the inclusion probabilities are not known, and statistics calculated from Internet data refer only to the population surveyed.
However, despite these drawbacks of Internet data, when certain assumptions are made about the population it is possible to define the random size of the uncontrolled sample [2] and to treat the process of Internet data collection as a random experiment (or a life testing experiment) by using the methods of stochastic processes [4] and reliability theory [1, 5].
We generally assume that the Internet survey begins at the moment $t = 0$, when the electronic questionnaire is put on the website, and that the survey is conducted for the time $T > 0$. A set $\{u_1, u_2, \ldots\}$ denotes the population of potential respondents. For each $n \ge 1$, a population of $n$ units is surveyed and the respondents send the questionnaires independently. By $\tau_j$, $j = 1, 2, \ldots, n$, $n \ge 1$, we denote the moment at which the questionnaire of the respondent $u_k$, $k = 1, 2, \ldots, n$, who took part in the survey as the $j$-th one is recorded on the server after the initial moment $t = 0$. The moment of the questionnaire record is an event that can be interpreted as the moment of arrival or birth of the respondent who took part in the survey as the $j$-th one, when the size of the uncontrolled sample is defined; as the moment of failure or renewal of the $j$-th element of the population, when the population is treated as a coherent system; and as the waiting time for the $j$-th questionnaire record after the initial moment $t = 0$, equal to the length of the lifetime of the $j$-th element of the population until the moment $t \ge 0$, or as the moment of death of the $j$-th element of the population, when the length of the population lifetime is considered. Theoretically, four cases which describe the relation between the duration of the survey, $T > 0$, and the size of the uncontrolled sample can be considered. In the first case, the registering of questionnaires ends at a moment $T > 0$ specified in advance, independently of the number of questionnaires recorded. The size of the uncontrolled sample is then random over the interval $[0, T]$ and depends on the length of time of the survey and on the selection procedure applied in the survey (if one is used). An extreme situation occurs when no data are collected (the set of arrivals is empty, or the questionnaires which arrived were rejected by the selection procedure used in the survey). In the second case the sample size is specified in advance and the survey ends when the assumed number of responses has arrived, independently of the length of time of the survey. An extreme situation occurs when the length of time of the survey is infinite. In the third case, both the length of time of the survey and the sample size are specified in advance and the survey ends at the earlier of the two assumed moments. In the fourth case, the final moment of the survey is not specified in advance; the process of registering questionnaires lasts until the moment when the collected data set meets the demands of the survey organizers.
2 The first conception: random size of an uncontrolled sample
If the process of Internet data collection is considered as a process of registering questionnaires on the server in a fixed interval of time of length $T > 0$ (the duration of the survey), then the size of the uncontrolled sample at the moment $t \ge 0$ equals the total number of arrivals until the moment $t$ and is defined as a counting process: a Bernoulli, Poisson or composed (compound) Poisson process [4].
Case 1: the size of the uncontrolled sample as a Bernoulli process
Definition 2.1. For fixed $n \ge 1$ the size of the uncontrolled sample until the moment $t \ge 0$ is given by
$$N(t) = \mathrm{card}\{1 \le k \le n : \tau_k \in [0, t)\}$$
and is equal to the sum
$$N(t) = N_1(t) + \ldots + N_n(t),$$
where
$$N_k(t) = \begin{cases} 0 & \text{if } \tau_k \ge t, \\ 1 & \text{if } \tau_k < t, \end{cases}$$
$\tau_k$, $k = 1, 2, \ldots, n$, $n \ge 1$, are independent random variables with the uniform distribution over $[0, T]$, $T > 0$, $\tau_{k-1} \le \tau_k$ for $k \ge 1$, $\tau_0 = 0$, and at the initial moment $t = 0$ no arrivals occur.
The value of the random variable $N(t)$, $t \ge 0$, equals the total number of arrivals until the moment $t \ge 0$, and the process $\{N(t), t \ge 0\}$ can be described in the following way. Each of the $n$ respondents, independently of the others, sends exactly one questionnaire, with probability 1, in the interval $[0, T]$, $T > 0$ (the duration of the survey). The probability that a given respondent sends the questionnaire in an interval $\Delta \subset [0, T]$ is equal to the ratio $|\Delta| / T$, where $|\Delta|$ is the length of $\Delta$. In this way, each respondent generates a stream consisting of only one arrival. The summary stream obtained by summing these streams is called a bound Bernoulli stream, that is, it consists of a finite number of events. To complete the definition of the counting process it remains to compute the distribution of $N(t)$ and the joint distribution of $N(t_1), N(t_2), \ldots, N(t_n)$ for any nonnegative $t_0, t_1, \ldots, t_n$.
1) Let $P_k(t) = P\{N(t) = k\}$, $k = 0, 1, \ldots, n$, be the probability of the event that at the moment $t \ge 0$ the total number of arrivals $N(t)$ equals $k$. Since the probability of arrival of a given respondent in the interval $[0, t] \subset [0, T]$ is equal to the ratio $t / T$ and the arrivals come independently, the total number of arrivals $N(t)$ at the moment $t \ge 0$ is a random variable with the Bernoulli (binomial) distribution
$$P_k(t) = \binom{n}{k} \left(\frac{t}{T}\right)^{k} \left(1 - \frac{t}{T}\right)^{n-k}.$$
2) If the intervals $\Delta_1, \ldots, \Delta_n$ are pairwise disjoint and the interval $[0, T] = \Delta_1 \cup \ldots \cup \Delta_n$ is their union, then for any nonnegative integers $k_1, \ldots, k_n$ such that $k_1 + \ldots + k_n = n$,
$$P\{N(\Delta_1) = k_1, \ldots, N(\Delta_n) = k_n\} = \frac{n!}{k_1! \ldots k_n!}\, p_1^{k_1} \ldots p_n^{k_n},$$
where $N(\Delta_i)$ is the number of arrivals which occur in the interval $\Delta_i$, $p_i = |\Delta_i| / T$ for $i = 1, \ldots, n$, and $|\Delta_i| = t_i - t_{i-1}$, $i = 1, \ldots, n$, is the length of the interval $\Delta_i$.
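As an illustration of Case 1, the following minimal Python sketch simulates a Bernoulli stream and evaluates the distribution $P_k(t)$. The function names `binomial_pmf` and `simulate_sample_size` and the numerical values of $n$, $t$ and $T$ are assumptions introduced here for illustration only; they do not come from the paper.

```python
import math
import random

def binomial_pmf(n, k, t, T):
    """P_k(t) = C(n, k) (t/T)^k (1 - t/T)^(n-k) for 0 <= t <= T."""
    p = t / T
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def simulate_sample_size(n, t, T, rng):
    """Draw n independent uniform arrival moments on [0, T]; count those before t."""
    arrivals = [rng.uniform(0.0, T) for _ in range(n)]
    return sum(1 for tau in arrivals if tau < t)

if __name__ == "__main__":
    rng = random.Random(0)           # fixed seed, illustrative only
    n, t, T = 20, 3.0, 10.0          # hypothetical population size and times
    runs = 10_000
    counts = [simulate_sample_size(n, t, T, rng) for _ in range(runs)]
    print("empirical mean of N(t):", sum(counts) / runs)
    print("theoretical mean n*t/T:", n * t / T)
```

The empirical mean of the simulated sample size can be compared with the mean $nt/T$ of the binomial distribution above.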
Case 2: the size of the uncontrolled sample as a Poisson process
Definition 2.2. For fixed $n \ge 1$ the size of the uncontrolled sample until the moment $t \ge 0$ is given by
$$N'(t) = \mathrm{card}\{1 \le k \le n : \tau_k \in [0, t)\} = \max\{n \ge 0 : S_n < t\},$$
where $\tau_1, \tau_2, \ldots$ denote, as before, the successive moments of questionnaire record, $\tau_{k-1} \le \tau_k$ for $k \ge 1$ and $\tau_0 = 0$; $\{z_k\}_{k=1}^{\infty}$ is a sequence of independent and identically distributed random variables with the exponential distribution $G(t) = 1 - e^{-\lambda t}$, $t \ge 0$, $\lambda > 0$, where $z_k = \tau_k - \tau_{k-1}$ for $k \ge 1$ denotes the $k$-th spacing, between the $(k-1)$-th and the $k$-th arrivals; and $S_n = \sum_{k=1}^{n} z_k$ is a random variable with the Erlang distribution given by
$$P(S_n < t) = 1 - \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!} e^{-\lambda t} \quad \text{for } t \ge 0 \text{ and } \lambda > 0.$$
Then
$$P\{N'(t) = n\} = P\{N'(t) < n + 1\} - P\{N'(t) < n\} = P\{S_{n+1} \ge t\} - P\{S_n \ge t\} = \frac{(\lambda t)^n}{n!} e^{-\lambda t},$$
so $N'(t)$, $t \ge 0$, the total number of arrivals until the moment $t \ge 0$, is a random variable with the Poisson distribution with parameter $\lambda t$, $\lambda > 0$, and $\{N'(t) : t \ge 0\}$ is a Poisson process.
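The step from the Erlang distribution of $S_n$ to the Poisson distribution of $N'(t)$ can be checked numerically. The sketch below is only an illustration under assumed values of $\lambda$ and $t$; the helper names `poisson_pmf`, `erlang_survival` and `simulate_count` are hypothetical.

```python
import math
import random

def poisson_pmf(n, lam, t):
    """P{N'(t) = n} = (lam t)^n / n! * exp(-lam t)."""
    return (lam * t)**n / math.factorial(n) * math.exp(-lam * t)

def erlang_survival(n, lam, t):
    """P{S_n >= t} = sum_{i=0}^{n-1} (lam t)^i / i! * exp(-lam t), for t > 0.

    For n = 0 the sum is empty and the value is 0 (S_0 = 0 < t).
    """
    return sum((lam * t)**i / math.factorial(i) for i in range(n)) * math.exp(-lam * t)

def simulate_count(lam, t, rng):
    """Count arrivals before t by summing exponential spacings z_k."""
    s, count = 0.0, 0
    while True:
        s += rng.expovariate(lam)
        if s >= t:
            return count
        count += 1

if __name__ == "__main__":
    lam, t = 2.0, 1.5                # hypothetical intensity and time
    for n in range(4):
        diff = erlang_survival(n + 1, lam, t) - erlang_survival(n, lam, t)
        print(n, diff, poisson_pmf(n, lam, t))
    rng = random.Random(0)
    runs = 10_000
    mean = sum(simulate_count(lam, t, rng) for _ in range(runs)) / runs
    print("empirical mean:", mean, "lam * t:", lam * t)
```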
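Moreover, if $n$ arrivals of the Poisson process (stream) occur in the interval $[0, T]$, $T > 0$, then the process (stream) of arrivals in this interval is the Bernoulli process (stream) [5]. This fact is shown below. If $0 \le t \le T$ and $0 \le k \le n$, then
$$P\{N'(t) = k \mid N'(T) = n\} = \frac{P\{N'(t) = k,\; N'(T) - N'(t) = n - k\}}{P\{N'(T) = n\}} = \frac{P\{N'(t) = k\}\, P\{N'(T - t) = n - k\}}{P\{N'(T) = n\}} = \frac{\dfrac{(\lambda t)^k}{k!} e^{-\lambda t} \cdot \dfrac{(\lambda (T - t))^{n-k}}{(n-k)!} e^{-\lambda (T - t)}}{\dfrac{(\lambda T)^n}{n!} e^{-\lambda T}}.$$
Hence
$$P\{N'(t) = k \mid N'(T) = n\} = \binom{n}{k} \left(\frac{t}{T}\right)^{k} \left(1 - \frac{t}{T}\right)^{n-k}.$$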
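The cancellation of $\lambda$ in the ratio above can be verified numerically. The following sketch, with hypothetical function names and parameter values, evaluates the conditional probability directly from the Poisson probabilities and compares it with the binomial formula.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of k events when the expected number is `mean`."""
    return mean**k / math.factorial(k) * math.exp(-mean)

def conditional_pmf(k, n, t, T, lam):
    """P{N'(t) = k | N'(T) = n} computed from the ratio of Poisson probabilities."""
    numerator = poisson_pmf(k, lam * t) * poisson_pmf(n - k, lam * (T - t))
    return numerator / poisson_pmf(n, lam * T)

def binomial_pmf(n, k, t, T):
    """C(n, k) (t/T)^k (1 - t/T)^(n-k)."""
    p = t / T
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

if __name__ == "__main__":
    n, t, T = 5, 2.0, 8.0            # hypothetical values
    for lam in (0.5, 3.0):           # the result should not depend on lam
        for k in range(n + 1):
            print(lam, k, conditional_pmf(k, n, t, T, lam), binomial_pmf(n, k, t, T))
```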
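If the intervals $\Delta_1, \ldots, \Delta_n$ are pairwise disjoint and $[0, T] = \Delta_1 \cup \ldots \cup \Delta_n$, then for any nonnegative integers $k_1, \ldots, k_n$ such that $k_1 + \ldots + k_n = n$,
$$P\{N'(\Delta_1) = k_1, \ldots, N'(\Delta_n) = k_n \mid N'(T) = n\} = \frac{P\{N'(\Delta_1) = k_1, \ldots, N'(\Delta_n) = k_n\}}{P\{N'(T) = n\}} = \frac{\prod_{i=1}^{n} P\{N'(\Delta_i) = k_i\}}{P\{N'(T) = n\}} = \frac{\prod_{i=1}^{n} \dfrac{(\lambda |\Delta_i|)^{k_i}}{k_i!} e^{-\lambda |\Delta_i|}}{\dfrac{(\lambda T)^n}{n!} e^{-\lambda T}}.$$
Therefore
$$P\{N'(\Delta_1) = k_1, \ldots, N'(\Delta_n) = k_n \mid N'(T) = n\} = \frac{n!}{k_1! \ldots k_n!}\, p_1^{k_1} \ldots p_n^{k_n}.$$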
Case 3: the size of the uncontrolled sample with a selection procedure as a composed Poisson process
In Internet surveys the electronic questionnaire is available to all Internet users, and a part of the registered arrivals comes from respondents who do not necessarily belong to the surveyed population. In this case only the arrivals of those respondents whose questionnaires qualified for the data set on the basis of the selection procedure are included in the sample. Under this assumption and the assumptions made in Case 2, the size of the uncontrolled sample is defined as a composed Poisson process.
Definition 2.3. For fixed $n \ge 1$ the size of the uncontrolled sample until the moment $t \ge 0$ is given by $Y(t) = S_{N'(t)}$, where
$$N'(t) = \mathrm{card}\{1 \le k \le n : \tau_k \in [0, t)\} = \max\{n \ge 0 : S_n < t\},$$
$$S_{N'(t)} = \sum_{j=1}^{N'(t)} U_j,$$
and the sequence $\{U_n\}_{n=1}^{\infty}$ of independent and identically distributed random variables and the Poisson process $\{N'(t) : t \ge 0\}$ are independent. The arrivals $\tau_n$, $n = 1, 2, \ldots$, are selected for the uncontrolled sample in the following way (the sequence of arrivals is thinned): the arrival $\tau_n$ is omitted with probability $p$, $p \in [0, 1]$ (independently of the process taking place), if the respondent does not belong to the population, and it is left with probability $1 - p$ otherwise. The random variable $U_i$ is equal to 1 if the arrival $\tau_i$ remains, and to 0 if the arrival $\tau_i$ is omitted. The probability $p$, $p \in [0, 1]$, is defined by the selection procedure used in the survey and, consequently, the process $\{Y(t), t \ge 0\}$ is a composed Poisson process with the expected number of arrivals per unit of time equal to $\lambda(1 - p)$.
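The thinning of the arrival stream can be sketched as a small simulation. This is an illustration only: the function name `simulate_thinned_count` and the values of $\lambda$, $p$ and $T$ are assumptions, and the selection procedure is reduced here to an independent coin toss with probability $p$ of omission.

```python
import random

def simulate_thinned_count(lam, p, T, rng):
    """Return (N'(T), Y(T)): all arrivals before T and those kept after thinning."""
    s, total, kept = 0.0, 0, 0
    while True:
        s += rng.expovariate(lam)            # next spacing z_k
        if s >= T:
            return total, kept
        total += 1
        u = 0 if rng.random() < p else 1     # U_j = 0 (omitted) with probability p
        kept += u

if __name__ == "__main__":
    rng = random.Random(0)                   # fixed seed, illustrative only
    lam, p, T = 2.0, 0.25, 10.0              # hypothetical values
    runs = 5_000
    mean_kept = sum(simulate_thinned_count(lam, p, T, rng)[1] for _ in range(runs)) / runs
    print("empirical mean of Y(T):", mean_kept)
    print("lam * (1 - p) * T:     ", lam * (1 - p) * T)
```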
3 The second conception: length of the population lifetime
In the remaining part of this paper we assume that the population of $n$ units, $n \ge 1$, is treated as a finite coherent system of $n$ components, and that the Internet survey begins at time $t = 0$ and is conducted for the time $T$, $T > 0$. In this case the process of Internet data collection can be considered as a random experiment or a life testing experiment in which the basic characteristics of the length of the population lifetime are analysed by using the methods of reliability theory [1], [5]. The nonnegative random variables $\tau_k$, $k = 1, 2, \ldots, n$, $n \ge 1$, with distribution function
$$F_k(t) = P(\tau_k < t) \quad \text{for } t \ge 0, \; k = 1, 2, \ldots, n, \; n \ge 1,$$
and probability density function
$$f_k(t) = F_k'(t), \qquad F_k(t) = \int_0^t f_k(x)\,dx,$$
are interpreted as the length of the lifetime of the $k$-th element of the population until the moment $t \ge 0$, or as the moment of death of the $k$-th element of the population until the moment $t \ge 0$. The probability
$$\bar F_k(t) = 1 - F_k(t) \quad \text{for } t \ge 0, \; k = 1, 2, \ldots, n, \; n \ge 1,$$
is called the probability that the lifetime of the $k$-th element is at least $t \ge 0$, or the probability of the event that the $k$-th respondent is in the state of life at least until $t \ge 0$, or the reliability function of the length of the $k$-th element lifetime at the moment $t \ge 0$ (the reliability of the $k$-th element, in short).
The conditional probability density function
$$\lambda_k(t) = \frac{f_k(t)}{\bar F_k(t)} \quad \text{for } t \ge 0, \; k = 1, 2, \ldots, n, \; n \ge 1,$$
is called the rate of change of the $k$-th element of the population.
The elements of the population are not renewed: each record of a questionnaire decreases the size of the population by one, and the element which arrived is not replaced by a new one. This way of selection is called random sampling without replacement. Additionally, we assume that the length of the lifetime of a population of size $n$, $n \ge 1$, at the initial moment $t = 0$ is equal to the sum of the lengths of the lifetimes of the particular elements. We define the length of the population lifetime by using the structure function of the population as follows.
States of elements of the population
The state of the $i$-th element of the population (as a system) is defined by the values of the binary function
$$e_i(t) = \begin{cases} 0 & \text{if the } i\text{-th element is in the state of life (it did not arrive until the moment } t), \\ 1 & \text{if the } i\text{-th element is in the state of death (it arrived until the moment } t). \end{cases}$$
Then the state of all elements of the population of size $n$, $n \ge 1$, is determined by the $n$-dimensional vector $e(t) = [e_1(t), e_2(t), \ldots, e_n(t)]$, and we assume that at the initial moment $t = 0$ all elements of the population are in the state of life. This assumption means that at the moment $t = 0$ no arrivals have occurred.
States of the population
The state of the population of size $n$, $n \ge 1$, at the moment $t \ge 0$ is defined by the values of the binary function
$$\varphi_0(t) = \begin{cases} 0 & \text{if the population is in the state of life (the survey (test) is being conducted at the moment } t), \\ 1 & \text{if the population is in the state of death (the survey (test) ended by the moment } t), \end{cases}$$
and at each moment $t \ge 0$ it depends on the states of the elements through the values of the function $\varphi_0(t) = \varphi[e_1(t), \ldots, e_n(t)] = \varphi(e(t))$.
In the process of Internet data collection treated as a life test of the population of size $n$, $n \ge 1$, the population can be found at the moment $t \ge 0$ in the state of life, that is, during the conducting of the survey, in the following cases. In the first case, at a moment $0 \le t \le T$, where $T$ is the duration of the survey specified in advance, when the number of deaths (arrivals) in the interval of length $t$ is random but less than the size of the population; otherwise, until the moment $T$ specified in advance. In the second case, until the moment $t \ge 0$ at which the number of deaths (arrivals) becomes equal to the sample size specified in advance (which is less than the size of the population), the duration of the survey, $T$, not being specified in advance. In the third case, until the earlier of the end of the survey, $T$, and the moment $t \ge 0$ at which the number of deaths (arrivals) becomes equal to the sample size, where both the length of time of the survey, $T$, and the sample size are specified in advance. In the fourth case, until the moment $t \ge 0$ when the collected data set (it can be a subset of the population or the whole population surveyed) meets the demands of the survey organizers, the final moment of the survey not being specified in advance.
Properties of the structure function
The structure function $\varphi(e)$ is increasing if for any two vectors $e$ and $e'$ the following condition is satisfied:
if $e \le e'$, then $\varphi(e) \le \varphi(e')$,
where $e \le e'$ if $e_i \le e_i'$ for all $i = 1, \ldots, n$.
This property of the structure function introduces a partial order in the set of binary vectors and means that an additional death of an element cannot change the state of the population from the state of death to the state of life.
The function $\varphi(e)$ divides the set $E = \{e\}$ of all $n$-dimensional binary vectors which describe the state of the population into two sets: $E^{+} = \{e : \varphi(e) = 0\}$, the set of states of population life, and $E^{-} = \{e : \varphi(e) = 1\}$, the set of states of population death. If the structure function is increasing, then the division of the set $E$ into the two sets $E^{+}$ and $E^{-}$ is called a monotonic structure.
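A structure function and its monotonicity can be illustrated with a short Python sketch. The three example structures (`phi_series`, `phi_parallel`, `phi_partial`) anticipate the structures discussed in the next section; all function names are hypothetical, and the brute-force check is feasible only for small $n$.

```python
from itertools import product

def phi_series(e):
    """Population dies (1) as soon as any element is dead (e_i = 1)."""
    return 1 if any(e) else 0

def phi_parallel(e):
    """Population dies (1) only when all elements are dead."""
    return 1 if all(e) else 0

def phi_partial(e, m):
    """Population lives (0) while at least m elements are alive (e_i = 0)."""
    alive = sum(1 for x in e if x == 0)
    return 0 if alive >= m else 1

def is_increasing(phi, n):
    """Check e <= e' => phi(e) <= phi(e') by flipping one coordinate 0 -> 1 at a time."""
    for e in product((0, 1), repeat=n):
        for i in range(n):
            if e[i] == 0:
                e_prime = e[:i] + (1,) + e[i + 1:]
                if phi(e) > phi(e_prime):
                    return False
    return True

def split_states(phi, n):
    """Return (E_plus, E_minus): the states of population life and of population death."""
    states = list(product((0, 1), repeat=n))
    return ([e for e in states if phi(e) == 0], [e for e in states if phi(e) == 1])

if __name__ == "__main__":
    print(is_increasing(phi_series, 3), is_increasing(phi_parallel, 3))
    print(split_states(lambda e: phi_partial(e, 2), 3))
```

Checking single-coordinate flips is sufficient because any pair $e \le e'$ can be connected by a chain of such flips.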
Length of the population lifetime
Let us denote by $\tau$ the length of the population lifetime, $\tau = \inf\{t : \varphi[e(t)] = 1\}$. Then $F(t) = P(\tau < t)$ is the probability that the survey (test) ends before the moment $t \ge 0$, or the probability of the event that the population is in the state of death by the moment $t \ge 0$, and $\bar F(t) = P(\tau \ge t)$ is the probability that the survey (test) is conducted at least until $t \ge 0$, that is, the probability of the event that the population is in the state of life at least until $t \ge 0$, or the reliability function of the length of the population lifetime at the moment $t \ge 0$ (the reliability of the population, in short).
4 Calculation and estimation of the reliability of the length of the population lifetime
The formula which expresses the relation between the reliability of the length of the population lifetime and the reliabilities of the elements at the moment $t \ge 0$ is given by
$$\bar F(t) = \sum_{e \in E^{+}} p(e),$$
where
$$p(e) = \prod_{k=1}^{n} \bar F_k^{\,1 - e_k}(t)\, F_k^{\,e_k}(t)$$
(we adopt the convention $0^0 = 1$) is the probability of the event that the population is in the state $e$.
If $\tau_k$, $k = 1, 2, \ldots, n$, $n \ge 1$, are nonnegative independent random variables, the elements are not renewed and the function $\varphi(e)$ is increasing, then the reliability function of the length of the population lifetime $\bar F(t)$ is increasing with respect to each of the element reliabilities $\bar F_k(t)$.
Thus an upper or a lower bound on the reliability of the length of the population lifetime may be obtained from the upper or lower bounds on the reliabilities of the elements. When the number of states is large (the number of all states is equal to $2^n$) and the function $\varphi(e)$ is very complicated, the formula given above is not efficient and other methods of calculation are applied, e.g. the method of path and cut, the recurrence method of Markov chains or, more generally, Markov methods.
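For small $n$ the sum over the states of life can be evaluated by direct enumeration. The sketch below assumes independent elements and takes the values $F_k(t)$ at one fixed moment $t$ as a list `F`; the names `state_probability` and `reliability` and the example structure functions are hypothetical.

```python
from itertools import product

def phi_series(e):
    """Death of the population as soon as any element is dead."""
    return 1 if any(e) else 0

def phi_parallel(e):
    """Death of the population only when all elements are dead."""
    return 1 if all(e) else 0

def state_probability(e, F):
    """p(e) = prod_k (1 - F_k)^(1 - e_k) * F_k^(e_k), with F[k] = F_k(t) at a fixed t."""
    p = 1.0
    for e_k, F_k in zip(e, F):
        p *= F_k if e_k == 1 else (1.0 - F_k)
    return p

def reliability(phi, F):
    """Sum p(e) over all states e with phi(e) = 0 (the set of states of life)."""
    n = len(F)
    return sum(state_probability(e, F)
               for e in product((0, 1), repeat=n) if phi(e) == 0)

if __name__ == "__main__":
    F = [0.1, 0.2, 0.3]              # hypothetical values of F_k(t)
    print("series:  ", reliability(phi_series, F))
    print("parallel:", reliability(phi_parallel, F))
```

The enumeration visits all $2^n$ states, which is exactly the cost that motivates the other methods mentioned in the text.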
Length of the population lifetime for the series structure
The population (as a system) of $n$ elements is called a series structure when the population is in the state of life if and only if each element is in the state of life. In this case, the change of the state of any element causes the change of the population state. The length of the population lifetime is then equal to the waiting time for the first death, and the size of the uncontrolled sample equals zero up to the first death. The basic characteristics of the reliability function of the series structure are given as follows. The probability that the length of the population lifetime (the duration of the survey) is at least $t \ge 0$ is equal to
$$\bar F(t) = \prod_{i=1}^{n} \bar F_i(t).$$
From the inequality [3]
$$1 - \sum_{i=1}^{n} F_i \le \prod_{i=1}^{n} (1 - F_i) \le 1 - \sum_{i=1}^{n} F_i + \sum_{i<j} F_i F_j \le 1 - \sum_{i=1}^{n} F_i + \frac{1}{2} \left( \sum_{i=1}^{n} F_i \right)^{2}$$
it follows that
$$\left| F(t) - \sum_{i=1}^{n} F_i(t) \right| \le \frac{1}{2} \left( \sum_{i=1}^{n} F_i(t) \right)^{2}.$$
The change rate of the population equals the sum of the change rates of the elements, $\lambda(t) = \sum_{i=1}^{n} \lambda_i(t)$.
The expected length of the population lifetime is equal to
$$T = E\tau = \int_0^{\infty} \bar F(t)\,dt.$$
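The chain of inequalities for the series structure can be checked numerically for given values $F_i(t)$. In the sketch below the function names are hypothetical, and the closed form for the expected lifetime relies on an additional assumption not made in the text, namely exponentially distributed element lifetimes with rates $\lambda_i$, for which $\bar F(t) = e^{-(\lambda_1 + \ldots + \lambda_n) t}$.

```python
import math
from itertools import combinations

def series_bounds(F):
    """Return (lower, exact, upper_pairs, upper_square) for prod_i (1 - F_i)."""
    s1 = sum(F)
    s2 = sum(a * b for a, b in combinations(F, 2))
    lower = 1 - s1
    exact = math.prod(1 - f for f in F)
    upper_pairs = 1 - s1 + s2
    upper_square = 1 - s1 + 0.5 * s1**2
    return lower, exact, upper_pairs, upper_square

def series_expected_lifetime(rates):
    """E(tau) = 1 / sum(rates), assuming exponential element lifetimes (an assumption)."""
    return 1.0 / sum(rates)

if __name__ == "__main__":
    F = [0.05, 0.1, 0.02]            # hypothetical values of F_i(t)
    print(series_bounds(F))
    print(series_expected_lifetime([1.0, 2.0, 3.0]))
```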
Length of the population lifetime for the parallel structure
The population (as a system) of $n$ elements is called a parallel structure when the population is in the state of death if and only if all elements are in the state of death. In this case, the change of the state of the population (the death of the population) takes place only if all elements of the population change their state (all elements of the population died), and the size of the uncontrolled sample is equal to the size of the population (all elements of the population arrived). Then the probability that the length of the population lifetime (the duration of the survey) is shorter than $t \ge 0$ is equal to
$$F(t) = \prod_{i=1}^{n} F_i(t), \quad \text{or} \quad F(t) = F_0^{\,n}(t) \text{ when } F_i(t) = F_0(t), \; i = 1, 2, \ldots, n,$$
so that the reliability is $\bar F(t) = 1 - F(t)$, and the expected length of the population lifetime is equal to
$$T = E\tau = \int_0^{\infty} [1 - F(t)]\,dt, \quad \text{or} \quad T = \int_0^{\infty} [1 - F_0^{\,n}(t)]\,dt \text{ when } F_i(t) = F_0(t), \; i = 1, 2, \ldots, n.$$
Length of the population lifetime for the partial structure
The population (as a system) of $n$ elements is called a partial structure if all elements of the population are identical and the population is in the state of life if at least $m$ elements of the population are in the state of life (that is, at most $n - m$ deaths occur), and the size of the uncontrolled sample equals $n - m$. Then the reliability function of the length of the population lifetime is equal to
$$\bar F(t) = \sum_{k=m}^{n} \binom{n}{k} \bar F_0^{\,k}(t)\, F_0^{\,n-k}(t).$$
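The reliability of the partial structure is a binomial tail and can be evaluated directly. In this minimal sketch the function name `partial_reliability` is hypothetical and `F0` stands for the value $F_0(t)$ at one fixed moment $t$.

```python
import math

def partial_reliability(n, m, F0):
    """Sum_{k=m}^{n} C(n, k) * (1 - F0)^k * F0^(n - k): at least m of n elements alive."""
    Fbar0 = 1.0 - F0
    return sum(math.comb(n, k) * Fbar0**k * F0**(n - k) for k in range(m, n + 1))

if __name__ == "__main__":
    n, F0 = 3, 0.1                   # hypothetical values
    for m in range(1, n + 1):
        print(m, partial_reliability(n, m, F0))
```

For $m = n$ the formula reduces to the series structure and for $m = 1$ to the parallel structure with identical elements.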
The method of path and cut
In this method we define the notions of a minimal path and a minimal cut, which are used to estimate the reliability function of the length of the population lifetime [1].
Definition 4.1. A set of elements $A = \{i_1, \ldots, i_k\}$ of the population is called a minimal path if the population is in the state of life (the survey is being conducted) whenever all the elements of this set are in the state of life, and no proper subset of the set $A$ has this property.
From the monotonic property of the structure function, the set $A = \{i_1, \ldots, i_k\}$ is a minimal path if and only if $e \in E^{+}$, where $e$ is the vector in which the coordinates $i_1, \ldots, i_k$ take the value zero and the remaining coordinates take the value one, with every state greater than $e$ belonging to $E^{-}$. Therefore, every minimal path determines a bordering state of the life of the population, in which the death of any element causes a change of the state of the population into the state of death (the ending of the survey (test)). In terms of the size of the uncontrolled sample, this means that the number of arrivals equal to the number of elements of the minimal path is smaller by one than the size of the sample assumed in the survey. Let $\{A_1, A_2, \ldots, A_m\}$ be the set of all minimal paths with the corresponding bordering states $e^{1}, e^{2}, \ldots, e^{m}$, and let $A_s$ also denote the event that all elements of the minimal path $A_s$ are in the state of life. Since $E^{+} = \bigcup_{s=1}^{m} A_s$ [5], the reliability function of the length of the population lifetime is calculated from the formula
$$\bar F(t) = P\left( \bigcup_{s=1}^{m} A_s \right) = \sum_{i=1}^{m} P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) - \ldots + (-1)^{m+1} P(A_1 A_2 \ldots A_m).$$
The number of terms of the sum on the right is equal to $2^m - 1$, and the probability of any event of the form $A_{i_1} A_{i_2} \ldots A_{i_k}$, where $i_1 < i_2 < \ldots < i_k$, is equal to
$$P(A_{i_1} A_{i_2} \ldots A_{i_k}) = \bar F_{s_1}(t)\, \bar F_{s_2}(t) \ldots \bar F_{s_l}(t),$$
where $s_1, s_2, \ldots, s_l$ are the distinct indices of the elements of the minimal paths $A_{i_1}, \ldots, A_{i_k}$ (that is, an element belonging to overlapping parts of different paths is counted only once). In order to lower the number of calculations we introduce the following notions. Two minimal paths are called crossing when they have at least one common element. Two minimal paths are called relevant if there exists a chain of crossing paths which connects them. The relevance relation is an equivalence relation which divides the set of all minimal paths into classes of relevant minimal paths. Let $\{A_1, \ldots, A_{k_1}\}$, $\{A_{k_1+1}, \ldots, A_{k_2}\}, \ldots$ be the successive classes of relevant minimal paths. Because $F(t) = 1 - \bar F(t) = P\left( \bigcap_{i=1}^{m} \bar A_i \right)$ and the events belonging to different classes are independent,
$$F(t) = P\left( \bigcap_{i=1}^{k_1} \bar A_i \right) P\left( \bigcap_{i=k_1+1}^{k_2} \bar A_i \right) \ldots$$
A dual notion to the minimal path is the minimal cut (a critical set).
Definition 4.2. A set of elements $B = \{j_1, j_2, \ldots, j_l\}$ is called a minimal cut if the population is in the state of death (the survey (test) ended) whenever all elements of this set are in the state of death, and no proper subset of the set $B$ has this property.
In this case we are interested in those cut sets in which the number of elements is equal to the size of the uncontrolled sample specified in advance (all elements belonging to the set $B$ arrived).
If we denote by $\{B_1, B_2, \ldots, B_s\}$ the set of all minimal cuts, and by $B_i$ also the event that all elements of the cut $B_i$ are in the state of death, then the probability of the event that the survey (test) ends before the moment $t \ge 0$ is equal to
$$F(t) = P\left( \bigcup_{i=1}^{s} B_i \right) = \sum_{i=1}^{s} P(B_i) - \sum_{i<j} P(B_i B_j) + \sum_{i<j<k} P(B_i B_j B_k) - \ldots + (-1)^{s+1} P(B_1 B_2 \ldots B_s) = S_1 - S_2 + S_3 - \ldots + (-1)^{s+1} S_s.$$
From this formula we can obtain an estimate of the length of the population lifetime (the duration of the survey) as an estimate of the reliability function of the population. In this case the numbers of elements in the successive minimal cuts are interpreted as the possible sample sizes which can be collected in the survey, on condition that the survey ends after collection of a sample of the assumed size.
From the proof of this formula it follows that for noncrossing minimal cuts $S_2 \le S_1^2 / 2$, $S_3 \le S_1^3 / 6$, and so on, and in the case of crossing minimal cuts the partial sums maintain the order $S_k = O(S_1^2)$.
Moreover, for any $k$,
$$S_1 - S_2 + \ldots - S_{2k} \le F(t) \le S_1 - S_2 + \ldots - S_{2k} + S_{2k+1}.$$
Thus it is possible to estimate the length of the population lifetime with an assumed precision, because the partial sums in the last formula are alternately upper and lower bounds on $F(t)$ and hence on the reliability function $\bar F(t) = 1 - F(t)$.
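The alternating bounds can be illustrated by computing the sums $S_1, S_2, \ldots$ over the minimal cuts and their partial sums. The sketch assumes independent elements; the function names, the representation of a cut as a set of element indices and the example (cuts of a system that dies when any two of three elements are dead) are hypothetical.

```python
import math
from itertools import combinations

def cut_sums(cuts, F):
    """Return [S_1, ..., S_s]; S_r sums P(B_i1 ... B_ir) over all r-subsets of minimal cuts.

    F[i] is the probability that element i is dead at a fixed t.
    """
    sums = []
    for r in range(1, len(cuts) + 1):
        s_r = 0.0
        for subset in combinations(cuts, r):
            elements = set().union(*subset)   # shared elements are counted once
            s_r += math.prod(F[i] for i in elements)
        sums.append(s_r)
    return sums

def partial_sums(sums):
    """Return S_1, S_1 - S_2, S_1 - S_2 + S_3, ...; the last entry is F(t) itself."""
    out, acc = [], 0.0
    for r, s in enumerate(sums, start=1):
        acc += s if r % 2 == 1 else -s
        out.append(acc)
    return out

if __name__ == "__main__":
    F = [0.1, 0.1, 0.1]                       # hypothetical values of F_i(t)
    cuts = [{0, 1}, {0, 2}, {1, 2}]
    bounds = partial_sums(cut_sums(cuts, F))
    print("upper, lower, exact:", bounds)
```

Truncating the alternating sum after an odd number of terms gives an upper bound on $F(t)$ and after an even number a lower bound, so the sequence can be stopped once the gap is below the required precision.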
Conclusions
The process of Internet data collection is interpreted and analysed as a life test of the population surveyed by using the notions and methods of reliability theory. The random set of respondents who participate in an Internet survey is called the uncontrolled sample, and its size is defined as a counting process by using Poisson processes. The proposed approach allows some stochastic properties of the process of Internet data collection to be studied, and the basic characteristics to be calculated and estimated under the assumptions made.
References
1. Barlow R. E., Proschan F.: Mathematical theory of reliability, Wiley and Sons, Inc., New York, London, Sydney (1965).
2. Getka-Wilczyńska E.: Random properties of noncontrolled sample, Annals of Collegium of Economic Analysis, 13, pp. 59–69, Warsaw School of Economics, Warsaw (2004) (in Polish).
3. Hardy G. H., Littlewood J. E., Polya G.: Inequalities, Cambridge University Press, Cambridge (1934).
4. Kingman J. F. C.: Poisson processes, Polish Scientific Publishers, Warsaw (2001) (in Polish).
5. Sołowiew A. D.: Analytic methods in reliability theory, Technical Scientific Publishers, Warsaw (1983) (in Polish).