
1. Introduction

This study aims to provide an efficient method for solving the statistical file merging problem: merging two or more files containing distinct datasets from different sources, where the items of the datasets do not overlap.

The problem is modelled as a network transportation problem and is solved using an adaptive Genetic Algorithm based on a fuzzy logic controller, which dynamically calibrates the algorithm parameters.

5. Computational results

The algorithm was tested on large-scale instances to analyse both its efficiency and scalability:

Computational results of the GA with Greedy initial population versus the GA with random initial population

Best objective function of the GA with Greedy initial population versus the GA with random initial population

6. Conclusions

The objective of this study is to analyse a practical question related to statistical data integration. To address this problem, a procedure based on a Genetic Algorithm for merging two or more files has been proposed. A fuzzy enhancement was designed to dynamically calibrate the crossover rate, thereby improving the algorithm's performance. The solutions obtained for different problem instances show that the proposed algorithm can offer remarkable improvements in the quality of the obtained solutions, especially for larger problems, when it is hybridized with a suitable heuristic for seeding the population with good initial solutions. Overall, the algorithm performs well in practice and scales to larger datasets.

2. Statistical file merging: basic concepts

The sample file A contains n1 records, each with information on variables X, Z1, Z2, …, Zk, while the sample file B contains n2 records, each with information on variables Y, Z1, Z2, …, Zk. By merging these two incomplete files of data, the goal is to obtain a single merged dataset C containing all relevant variables X, Y, Z1, Z2, …, Zk. Matching links each record of file A (the recipient file) to the record of file B (the donor file) with the best comparable values on a certain subset of the variables shared by both files, Z1, Z2, …, Zk.
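A minimal sketch of how "best comparable values" on the shared variables could be turned into a matching cost. The poster does not name the distance measure, so the Euclidean distance and the helper name cost_matrix are assumptions made for illustration only.

```python
import math

def cost_matrix(z_a, z_b):
    """Return c[i][j], the distance between the shared Z values of record i
    in file A and record j in file B.

    Euclidean distance is an assumption; the source only speaks of a
    "distance cost" without naming the measure.
    """
    return [[math.dist(za, zb) for zb in z_b] for za in z_a]

# Hypothetical shared-variable vectors (Z1, Z2) for each record.
z_a = [(1.0, 2.0), (4.0, 6.0)]               # two records of file A
z_b = [(1.0, 2.0), (4.0, 2.0), (0.0, 0.0)]   # three records of file B

c = cost_matrix(z_a, z_b)
```

A smaller c[i][j] means the two records are more comparable on Z1, …, Zk and are therefore better candidates for matching.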

4. The proposed GA with fuzzy logic controller

⧐ Encoding strategy:

chromosome[k], k = 1, …, PS (Population Size), contains the fields:
- array A_weight: A_weight[i] corresponds to the weight ai, i = 1, …, m
- array B_weight: B_weight[j] corresponds to the weight bj, j = 1, …, n
- array X: x[i][j] corresponds to the solution of the problem

⧐ Initial population seeding:

Greedy strategy: successively pairs a record from the first file with an unmatched record in the second file to obtain the lowest distance cost
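A possible reading of the greedy seeding, sketched as a least-cost fill of the flow matrix: arcs are visited in order of increasing cost and each one receives as much weight as the remaining supply and demand allow. This is an illustrative variant, not necessarily the exact procedure used in the study; the function name greedy_seed and the sample data are hypothetical.

```python
def greedy_seed(a, b, c):
    """Build one feasible flow matrix x by visiting arcs in increasing cost order.

    a: weights (supplies) of the records in file A
    b: weights (demands) of the records in file B, with sum(a) == sum(b)
    c: cost matrix, c[i][j] = distance between record i of A and record j of B
    """
    m, n = len(a), len(b)
    supply, demand = list(a), list(b)
    x = [[0] * n for _ in range(m)]
    arcs = sorted((c[i][j], i, j) for i in range(m) for j in range(n))
    for _, i, j in arcs:
        flow = min(supply[i], demand[j])  # weight still available on both sides
        if flow > 0:
            x[i][j] = flow
            supply[i] -= flow
            demand[j] -= flow
    return x

# Hypothetical instance: two records in A, two in B, integer weights.
a = [2, 3]
b = [1, 4]
c = [[1, 5],
     [2, 3]]
x = greedy_seed(a, b, c)  # [[1, 1], [0, 3]]
```

Because the totals of a and b are equal, every unit of weight ends up assigned, so the seed satisfies the supply and demand constraints of the model.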

⧐ Fitness function: inversely proportional to objective function

⧐ Other details: proportional selection scheme, the BLX-α crossover (with α = 0.5 and pc = 0.8 initially), the alternative mutation scheme
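For reference, a minimal sketch of the standard BLX-α operator applied to a flat list of real-valued genes. How the study flattens x[i][j] and how it restores feasibility after crossover is not stated in the poster, so no repair step is shown; the function name blx_alpha and the sample parents are hypothetical.

```python
import random

def blx_alpha(parent1, parent2, alpha=0.5, rng=random):
    """Blend crossover: each child gene is drawn uniformly from the parents'
    interval, extended on both sides by alpha times its width."""
    child = []
    for g1, g2 in zip(parent1, parent2):
        lo, hi = min(g1, g2), max(g1, g2)
        spread = alpha * (hi - lo)
        child.append(rng.uniform(lo - spread, hi + spread))
    return child

# Hypothetical parents; with alpha = 0.5 the first gene is drawn from [-2, 6].
random.seed(0)
child = blx_alpha([0.0, 10.0], [4.0, 10.0], alpha=0.5)
```

Crossover would be applied to a selected pair with probability pc, the value that the fuzzy logic controller adjusts during the run. Note that a child produced this way can contain negative values or violate the supply and demand constraints, so it would need to be repaired before evaluation.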

⧐ Structure of the Fuzzy Logic System

- input variables: the degree of population genotypic diversity, GD, and the degree of population phenotypic diversity, FD

- output variable: crossover probability

[Figure: the membership functions for the inputs]

[Figure: the membership function for the output]

[Table: the fuzzy reasoning table]

- defuzzification method: the weighted average
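The controller structure can be sketched as follows. The membership functions and the reasoning table were given as figures and are not recoverable from the text, so the triangular sets, the rule table RULES and the output levels PC_LEVEL below are placeholders invented for illustration; only the two inputs (GD, FD), the output (crossover probability) and the weighted-average defuzzification come from the source.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b.
    When a == b or b == c the set has a flat shoulder on that side."""
    if x <= a:
        return 1.0 if a == b else 0.0
    if x >= c:
        return 1.0 if b == c else 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Placeholder fuzzy sets on [0, 1], shared by GD and FD (assumption).
SETS = {"low": (0.0, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.0)}

# Placeholder representative crossover probabilities per output set (assumption).
PC_LEVEL = {"low": 0.6, "medium": 0.8, "high": 0.95}

# Placeholder reasoning table: (GD label, FD label) -> output label (assumption).
RULES = {
    ("low", "low"): "high",      ("low", "medium"): "high",      ("low", "high"): "medium",
    ("medium", "low"): "high",   ("medium", "medium"): "medium", ("medium", "high"): "low",
    ("high", "low"): "medium",   ("high", "medium"): "low",      ("high", "high"): "low",
}

def crossover_probability(gd, fd, default=0.8):
    """Weighted-average defuzzification: each rule contributes its output level,
    weighted by its firing strength min(mu_GD, mu_FD)."""
    num = den = 0.0
    for (g_label, f_label), out in RULES.items():
        w = min(tri(gd, *SETS[g_label]), tri(fd, *SETS[f_label]))
        num += w * PC_LEVEL[out]
        den += w
    return num / den if den > 0 else default

pc = crossover_probability(0.5, 0.5)
```

With these placeholder sets, an input pair in the middle of both ranges fires only the (medium, medium) rule, so pc equals the initial value 0.8.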

Big data integration – an evolutionary perspective
Simona Dinu

Constanta Maritime University, Faculty of Navigation and Naval Transport, Romania

3. The proposed model

The problem is described as a transportation network:
- records to be merged represent the nodes of the transportation network: i = 1, …, m and j = 1, …, n
- weights of the records represent the supply/demand values associated with the nodes: ai and bj
- matching connections represent the network arcs (i, j)
- weight of the merged record i of file A with record j of file B represents the flow on arc (i, j): xij

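$$\min \sum_{i=1}^{m} \sum_{j=1}^{n} f(x_{ij})$$

subject to

$$\sum_{i=1}^{m} a_i = \sum_{j=1}^{n} b_j$$

$$\sum_{j=1}^{n} x_{ij} = a_i, \quad i = 1, \ldots, m$$

$$\sum_{i=1}^{m} x_{ij} = b_j, \quad j = 1, \ldots, n$$

$$x_{ij} \ge 0, \quad i = 1, \ldots, m \text{ and } j = 1, \ldots, n$$

where $f(x_{ij}) = c_{ij} x_{ij}$ and

$$x_{ij} = \begin{cases} \text{the weight assigned to the composite record,} & \text{if record } i \text{ of file A is combined with record } j \text{ of file B} \\ 0, & \text{if the two records are not matched} \end{cases}$$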
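A small sketch of how a candidate solution of the transportation model can be evaluated: the objective sums cij·xij over all arcs, and a solution is feasible when its row sums equal ai, its column sums equal bj, and all flows are non-negative. The function names and the sample instance are hypothetical.

```python
def objective(x, c):
    """Total matching cost: sum of c[i][j] * x[i][j] over all arcs (i, j)."""
    return sum(c[i][j] * x[i][j] for i in range(len(x)) for j in range(len(x[0])))

def is_feasible(x, a, b, tol=1e-9):
    """Check the supply, demand and non-negativity constraints of the model."""
    rows_ok = all(abs(sum(row) - ai) <= tol for row, ai in zip(x, a))
    cols_ok = all(
        abs(sum(x[i][j] for i in range(len(x))) - bj) <= tol
        for j, bj in enumerate(b)
    )
    nonneg = all(v >= 0 for row in x for v in row)
    return rows_ok and cols_ok and nonneg

# Hypothetical instance with balanced weights: sum(a) == sum(b) == 5.
a = [2, 3]
b = [1, 4]
c = [[1, 5],
     [2, 3]]

x1 = [[1, 1], [0, 3]]   # cost 1*1 + 5*1 + 2*0 + 3*3 = 15
x2 = [[0, 2], [1, 2]]   # cost 1*0 + 5*2 + 2*1 + 3*2 = 18
```

Both x1 and x2 satisfy the constraints; the minimisation prefers x1 because its total cost is lower.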

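Diversity measures used as inputs of the fuzzy logic controller:

$$GD = \frac{d_{avg} - d_{min}}{d_{max} - d_{min}}$$

$$FD = \frac{fitness_{max} - fitness_{avg}}{fitness_{max} - fitness_{min}}$$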
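The two diversity measures can be computed as sketched below, reading them as GD = (d_avg − d_min) / (d_max − d_min) and FD = (fitness_max − fitness_avg) / (fitness_max − fitness_min). The poster does not say how the distances d are obtained, so they are taken here as a ready-made list; returning 0.0 when all values coincide is also an assumption.

```python
def genotypic_diversity(distances):
    """GD = (d_avg - d_min) / (d_max - d_min), computed from a list of
    per-chromosome distances (how they are measured is not given in the source)."""
    d_min, d_max = min(distances), max(distances)
    d_avg = sum(distances) / len(distances)
    if d_max == d_min:
        return 0.0  # assumption: a fully converged population has zero diversity
    return (d_avg - d_min) / (d_max - d_min)

def phenotypic_diversity(fitnesses):
    """FD = (fitness_max - fitness_avg) / (fitness_max - fitness_min)."""
    f_min, f_max = min(fitnesses), max(fitnesses)
    f_avg = sum(fitnesses) / len(fitnesses)
    if f_max == f_min:
        return 0.0  # assumption: identical fitness values give zero diversity
    return (f_max - f_avg) / (f_max - f_min)

# Hypothetical population statistics.
gd = genotypic_diversity([1.0, 2.0, 3.0])    # (2 - 1) / (3 - 1) = 0.5
fd = phenotypic_diversity([1.0, 2.0, 5.0])   # (5 - 8/3) / (5 - 1) = 7/12
```

Both values lie in [0, 1], which makes them convenient inputs for membership functions defined on a fixed range.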
