

1. Introduction

This study aims at providing an efficient method to solve the statistical file merging issue: merging two or more files containing distinct datasets, from different sources, where the items of the datasets do not overlap.

The problem is modelled as a network transportation problem and is solved using an adaptive Genetic Algorithm based on a fuzzy logic controller, which dynamically calibrates the algorithm parameters.

5. Computational results

The algorithm was tested on large-scale instances to analyse both its efficiency and scalability:

Computational results of the GA with Greedy initial population versus the GA with random initial population

Best objective function of the GA with Greedy initial population versus the GA with random initial population

6. Conclusions

The objective of this study is to analyse a practical question related to statistical data integration. In order to resolve this problem, a procedure based on a Genetic Algorithm for merging two or more files has been proposed.

A fuzzy enhancement was designed to dynamically calibrate the crossover rates, thereby improving its performance.

The solutions obtained for different problem instances highlight the fact that the proposed algorithm can offer remarkable improvements regarding the quality of the obtained solutions, especially for larger problems, when it is hybridized with a suitable heuristic for seeding the population with some initial good solutions.

Overall, the algorithm performs well in practice and is scalable to datasets of larger size.

2. Statistical file merging: basic concepts

The sample file A contains n1 records, each with information on variables X, Z1, Z2, …, Zk, while the sample file B contains n2 records, each with information on variables Y, Z1, Z2, …, Zk. By merging these two incomplete files of data, the goal is to obtain a single merged dataset C containing all relevant variables X, Y, Z1, Z2, …, Zk. Matching links each record of file A (the recipient file) to the record of file B (the donor file) with the best comparable values on a certain subset of the variables shared by both files, Z1, Z2, …, Zk.

4. The proposed GA with fuzzy logic controller

⧐ Encoding strategy:

chromosome[k], k = 1, …, PS (Population Size), contains the fields:
- array A_weight: A_weight[i] corresponds to the weight a_i, i = 1, …, m
- array B_weight: B_weight[j] corresponds to the weight b_j, j = 1, …, n
- array X: x[i][j] corresponds to the solution of the problem
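As a minimal sketch of this encoding, the three fields can be held in one Python record. The class name `Chromosome`, the methods `is_feasible` and `cost`, and the cost matrix `c` are hypothetical names introduced here for illustration. The feasibility check assumes the transportation constraints of the model: every row of x sums to a_i, every column sums to b_j, and all entries are non-negative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chromosome:
    a_weight: List[float]   # a_weight[i]: weight a_i of record i in file A
    b_weight: List[float]   # b_weight[j]: weight b_j of record j in file B
    x: List[List[float]]    # x[i][j]: weight assigned to the pair (i, j)

    def is_feasible(self, tol: float = 1e-9) -> bool:
        """Check row sums, column sums and non-negativity (assumed constraints)."""
        m, n = len(self.a_weight), len(self.b_weight)
        rows_ok = all(abs(sum(self.x[i]) - self.a_weight[i]) <= tol
                      for i in range(m))
        cols_ok = all(abs(sum(self.x[i][j] for i in range(m)) - self.b_weight[j]) <= tol
                      for j in range(n))
        non_negative = all(v >= 0 for row in self.x for v in row)
        return rows_ok and cols_ok and non_negative

    def cost(self, c: List[List[float]]) -> float:
        """Objective value: sum of c[i][j] * x[i][j] over all pairs."""
        return sum(c[i][j] * self.x[i][j]
                   for i in range(len(self.a_weight))
                   for j in range(len(self.b_weight)))


example = Chromosome([2, 3], [1, 4], [[1, 1], [0, 3]])
print(example.is_feasible(), example.cost([[1, 2], [3, 1]]))
```

Storing the weights next to the solution matrix keeps each individual self-describing, at the price of repeating a_i and b_j in every chromosome.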

⧐ Initial population seeding:

Greedy strategy: successively pairs a record from the first file with an unmatched record in the second file to obtain the lowest distance cost
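One possible reading of the greedy strategy is sketched below. It walks through the records of the first file in order and, while a record still has unassigned weight, pairs it with the cheapest record of the second file that still has unassigned weight. The function name `greedy_seed`, the cost matrix `c`, and the rule for splitting weights across several partners are assumptions; the poster does not spell these details out.

```python
def greedy_seed(a, b, c):
    """Build one initial solution x from weights a, b and costs c (assumed inputs)."""
    m, n = len(a), len(b)
    supply, demand = list(a), list(b)        # weight still to be assigned
    x = [[0.0] * n for _ in range(m)]
    for i in range(m):
        while supply[i] > 0:
            open_records = [j for j in range(n) if demand[j] > 0]
            if not open_records:             # only happens if sum(a) > sum(b)
                break
            j = min(open_records, key=lambda k: c[i][k])   # cheapest open partner
            flow = min(supply[i], demand[j])
            x[i][j] += flow
            supply[i] -= flow                # one of the two reaches exactly zero
            demand[j] -= flow
    return x


print(greedy_seed([2, 3], [1, 4], [[1, 2], [3, 1]]))
```

When the total weights of the two files are equal, the result satisfies the row and column constraints, so it can seed the population without any repair step.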

⧐ Fitness function: inversely proportional to objective function
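A small sketch of what "inversely proportional" can mean in practice, together with the selection probabilities that a proportional (roulette-wheel) scheme would derive from it. The function names and the small constant `eps`, which guards against division by zero, are assumptions made for illustration.

```python
def fitness(objective_value, eps=1e-12):
    """Fitness taken as the reciprocal of the objective (eps avoids 1/0)."""
    return 1.0 / (objective_value + eps)


def selection_probabilities(objective_values):
    """Proportional selection: each probability is fitness / total fitness."""
    fits = [fitness(v) for v in objective_values]
    total = sum(fits)
    return [f / total for f in fits]


probs = selection_probabilities([10.0, 20.0, 40.0])
print(probs)
```

Because the problem is a minimisation, halving the objective value doubles an individual's chance of being selected under this definition.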

⧐ Other details: proportional selection scheme, the BLX-α crossover (with α = 0.5 and pc = 0.8 initially), the alternative mutation scheme
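For reference, a generic BLX-α crossover on two real-valued gene vectors can be sketched as follows. Each child gene is drawn uniformly from the parents' interval, extended on both sides by α times its length. The function name `blx_alpha` and the flat-list representation are assumptions, not the poster's implementation.

```python
import random


def blx_alpha(parent1, parent2, alpha=0.5, rng=random):
    """Blend crossover: sample each gene from the widened parent interval."""
    child = []
    for g1, g2 in zip(parent1, parent2):
        low, high = min(g1, g2), max(g1, g2)
        spread = alpha * (high - low)
        child.append(rng.uniform(low - spread, high + spread))
    return child


rng = random.Random(0)                       # fixed seed for repeatability
print(blx_alpha([1.0, 4.0], [3.0, 4.0], rng=rng))
```

A child produced this way can contain negative entries or break the row and column sums of the transportation model. The poster does not state how feasibility is restored, so any repair step would be a further assumption.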

⧐ Structure of the Fuzzy Logic System

- input variables: the degree of population genotypic diversity, GD, and the degree of population phenotypic diversity, FD

- output variable: crossover probability

[Figure: the membership functions for the inputs]

[Figure: the membership function for the output]

[Table: the fuzzy reasoning table]

- defuzzification method: the weighted average
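The controller's data flow can be illustrated with a short sketch. The membership functions and the reasoning table are not recoverable from the text, so everything in `SETS`, `PC_CENTER` and `RULES` below is a hypothetical placeholder: three triangular sets on [0, 1], three representative crossover probabilities, and an invented rule table. Only the overall structure follows the description: two inputs (GD, FD), one output (crossover probability), and weighted-average defuzzification.

```python
def tri(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Hypothetical fuzzy sets for both inputs (not the poster's membership functions).
SETS = {"low": (-0.5, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.5)}
# Hypothetical representative output values for the crossover probability.
PC_CENTER = {"low": 0.6, "medium": 0.75, "high": 0.9}
# Hypothetical rule table: (GD label, FD label) -> output label.
RULES = {
    ("low", "low"): "high", ("low", "medium"): "high", ("low", "high"): "medium",
    ("medium", "low"): "high", ("medium", "medium"): "medium", ("medium", "high"): "low",
    ("high", "low"): "medium", ("high", "medium"): "low", ("high", "high"): "low",
}


def crossover_probability(gd, fd, default=0.8):
    """Weighted average of rule outputs, weighted by rule firing strength."""
    num = den = 0.0
    for (gd_label, fd_label), out_label in RULES.items():
        strength = min(tri(gd, *SETS[gd_label]), tri(fd, *SETS[fd_label]))
        num += strength * PC_CENTER[out_label]
        den += strength
    return num / den if den > 0 else default   # default: the initial pc


print(crossover_probability(0.3, 0.7))
```

Weighted-average defuzzification needs only one representative value per output set, which keeps the update cheap enough to run every generation.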

Big data integration – an evolutionary perspective

Simona Dinu

Constanta Maritime University, Faculty of Navigation and Naval Transport, Romania

3. The proposed model

The problem is described as a transportation network:
- records to be merged represent the nodes of the transportation network: i = 1, …, m and j = 1, …, n
- weights of the records represent the supply/demand values associated with the nodes: a_i and b_j
- matching connections represent the network arcs (i, j)
- the weight of the merged record i of file A with record j of file B represents the flow on arc (i, j): x_ij

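\[
\min \sum_{i=1}^{m} \sum_{j=1}^{n} f(x_{ij})
\]

subject to

\[
\sum_{i=1}^{m} a_i = \sum_{j=1}^{n} b_j
\]

\[
\sum_{j=1}^{n} x_{ij} = a_i, \quad i = 1, \dots, m
\]

\[
\sum_{i=1}^{m} x_{ij} = b_j, \quad j = 1, \dots, n
\]

\[
x_{ij} \ge 0, \quad i = 1, \dots, m \ \text{and} \ j = 1, \dots, n
\]

where

\[
f(x_{ij}) = c_{ij}\, x_{ij}
\]

\[
x_{ij} =
\begin{cases}
\text{the weight assigned to the composite record,} & \text{if record } i \text{ of file A is combined with record } j \text{ of file B} \\
0, & \text{if the two records are not matched}
\end{cases}
\]

Diversity measures used as inputs of the fuzzy logic system:

\[
GD = \frac{d_{avg} - d_{min}}{d_{max} - d_{min}}
\]

\[
FD = \frac{fitness_{max} - fitness_{avg}}{fitness_{max} - fitness_{min}}
\]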
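The two diversity measures can be computed as in the sketch below, reading them as GD = (d_avg − d_min) / (d_max − d_min) and FD = (fitness_max − fitness_avg) / (fitness_max − fitness_min). The function names are hypothetical, the list `distances` is assumed to hold one genotypic distance per chromosome (the poster does not say how these distances are measured), and returning 0.0 when all values coincide is a convention chosen here to avoid dividing by zero.

```python
def genotypic_diversity(distances):
    """GD = (d_avg - d_min) / (d_max - d_min); 0.0 if all distances are equal."""
    d_min, d_max = min(distances), max(distances)
    d_avg = sum(distances) / len(distances)
    if d_max == d_min:
        return 0.0                      # assumed convention for the degenerate case
    return (d_avg - d_min) / (d_max - d_min)


def phenotypic_diversity(fitnesses):
    """FD = (f_max - f_avg) / (f_max - f_min); 0.0 if all fitnesses are equal."""
    f_min, f_max = min(fitnesses), max(fitnesses)
    f_avg = sum(fitnesses) / len(fitnesses)
    if f_max == f_min:
        return 0.0                      # assumed convention for the degenerate case
    return (f_max - f_avg) / (f_max - f_min)


print(genotypic_diversity([0.0, 1.0, 2.0]), phenotypic_diversity([1.0, 2.0, 3.0, 6.0]))
```

Both ratios lie in [0, 1], so they can be fed to the membership functions without rescaling.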