G22.3033-010 Data Warehousing and Data Mining
Project Design
Correlations between Financial Structure and Foreign Direct Investment
Cheng-Cheng Ku (N18898117)
Chien-wen Hsu (N16208816)
Li-Heng Chen (N14028026)
May 2, 2005

Table of Contents

1.0 Introduction
1.1 Motives
1.2 Project Description
1.3 High Level Goal
1.4 Hypotheses
1.5 Datasets
2.0 Project Design
3.0 Data Preprocessing
4.0 Characterization
4.1 Generalization
4.2 Analysis of Attribute Relevance
4.3 Attribute Removal
4.4 Attribute Analysis
4.5 Conclusion
5.0 Association
5.1 Select Attribute
5.2 Discretize
5.3 Association
5.4 Conclusion
6.0 Classification and Prediction
6.1 Prediction by Linear Regression
6.2 Prediction by Decision Tree
6.3 Prediction by K-Nearest Neighbor
6.4 Conclusion
7.0 Cluster Analysis
7.1 EM
7.2 SimpleKMeans
7.3 Cobweb
7.4 Farthest First
7.5 Make density based cluster
7.6 Other experiment
7.7 Conclusion
8.0 Resources
8.1 Software
8.2 Hardware
9.0 References

1.0 Introduction

1.1 Motives

It goes without saying that we live in a business-dominated world. Our lives are shaped by fluctuations in the economy: the economic situation can change not only our daily lives but also the future of a country. Without a healthy business environment, the people of a country cannot have a stable life.

Besides a country’s financial development, globalization is another important factor that we cannot ignore when talking about the business world. A hundred years ago, we did not have to care about what happened in other countries, because it made little difference to our lives. Today the story is different. Many of us still remember the financial crises that suddenly emerged in Southeast Asia and Latin America some years ago. At that time, every country watched closely because everyone understood how serious the situation was: if it was not handled carefully, any country could be the next victim.

Because of the importance and complexity of the business world, many professionals and companies have carried out all kinds of research in financial, monetary, and other business areas. People try to uncover the hidden mechanisms behind the business world, to find ways to control them, and even to predict what will happen next.

Recognizing the importance of the global business world, we decided to collect data in this area as our research topic. We also want to see whether we can find patterns that help us make business decisions in the future.

1.2 Project Description

The characteristics and the level of activity of a country’s market attract foreign investors to pour money into that market. The amount of foreign direct investment flowing from one country to another is often related to the characteristics of the receiving country’s economic system.

In this project, we intend to discover the correlations between the Financial Development and Structure of countries and their Foreign Direct Investment. First, we will analyze the inflows, outflows, and position of each country with regard to foreign direct investment, based on the size of its financial and stock markets, GDP per capita growth, and economic index growth. For example, we build a sub-dataset from every selected country’s stock market capitalization to GDP and its foreign direct investment to and from other countries. We then compare the data columns in this sub-dataset across all the countries to find correlations.

Second, the datasets we chose are time series, which gives us a good opportunity to project into the future. We will try to predict a country’s direct investment for the next year based on the data from the past 10 or 20 years. For example, we can predict which developing country has the most promising economic growth and whether its direct investment grows at the same speed.

1.3 High Level Goal

We want to achieve two high-level goals in this project by applying data mining techniques to discover useful and meaningful patterns in our datasets.

1. Analyze the correlations between the Financial Development and Structure of countries and their Foreign Direct Investment.

2. Generate a prediction equation for a country’s Foreign Direct Investment.

1.4 Hypotheses

1. Many factors can occasionally bias the data. For example, the political structure of a country may change suddenly in a specific year, or speculators may attack a financial or stock market by selling large amounts of financial or monetary products. We should take note of these special cases; however, when we build a predictive model, we should not take the biased data into account.

2. We assume that all people and companies in the market are logical and make their decisions by rational judgment. Only under this assumption can we analyze their reasonable reactions with a known model. We cannot predict what an investor will do if he does not want to maximize his profit, since he could act against his own interest in ways that most people would not.

1.5 Datasets

1. World Bank Research Dataset

(http://econ.worldbank.org/view.php?type=18&id=3343)

This dataset of financial development and structure across countries and over time

unites a range of indicators that measure the size, activity, and efficiency of financial

intermediaries and markets. First published in 1999, it improved on previous efforts

by presenting data on the public share of commercial banks, by introducing indicators

of the size and activity of non-bank financial institutions, and by presenting measures

of the size of bond and primary equity markets.

The indicators for each country include Private credit by deposit money banks to GDP, Financial system deposits, Net Interest Margin, Stock market capitalization to GDP, Private bond market capitalization to GDP, and others. The time series run from 1960 to 2001.

2. UNCTAD Databases (United Nations Conference on Trade and Development Database)

UNCTAD provides time series of economic data and development indicators, in some cases going back as far as 1950, in order to keep track of trends in world trade, the global economy, and development. It is possible to view the latest revised figures as well as the full time series.

The Foreign Direct Investment (FDI) database presents inflows, outflows, inward stocks, and outward stocks of foreign direct investment for 196 reporting economies in an interactive format.

These data correspond to the WIR 2004 Annex B tables. According to "Definitions and

Sources" in the above mentioned publication, Foreign direct investment (FDI) is

defined as an investment involving a long-term relationship and reflecting a lasting

interest in and control by a resident entity in one economy (foreign direct investor or

parent enterprise) of an enterprise resident in a different economy (FDI enterprise or

affiliate enterprise or foreign affiliate). This definition is based on the FDI concept as

presented in the IMF Balance of Payments Manual (BPM 5, 1993) and is also a basis

for that adopted in the second edition of the OECD Detailed Benchmark Definition of

FDI. FDI implies that the investor exerts a significant degree of influence on the

management of the enterprise resident in the other economy. Such investment involves

both the initial transaction between the two entities and all subsequent transactions

between them and among foreign affiliates, both incorporated and unincorporated. The

benefits that direct investors expect to derive from a voice in management are different

from those anticipated by portfolio investors, who have no significant influence over

the operations of enterprises. Direct investors are in a position to obtain benefits in

addition to investment income, such as management fees opportunities or similar types

of income (in contrast to portfolio investors, whose primary concerns are capital safety

and returns generated). A direct investment enterprise is defined as an incorporated or

unincorporated enterprise in which the direct investor, resident in another economy,

owns 10 percent or more of the ordinary shares of voting power (or the equivalent).

However, this criterion is not strictly observed by all countries reporting, which may

decide to also include in the FDI figures those investments that do not yield 10 percent

or more of voting power, but are nonetheless judged to give investors a significant

voice in management. Most direct investment enterprises are either branches or

subsidiaries that are wholly or majority owned by non-residents or in which a clear

majority of voting stock is held by a single direct investor or group. The borderline

cases are therefore likely to form a rather small proportion of the whole FDI cluster.

FDI may be undertaken by individuals as well as by business entities. For more detailed

information on concepts presented in this table, please refer to the IMF Balance of

Payments Manual (BPM 5, 1993) and to UNCTAD's World Investment Report 2004:

The Shift Towards Services.

3. International Monetary Fund Dataset (IMF Dataset)

2.0 Project Design

Our project is divided into three phases. First, we perform data preprocessing in order to integrate the different datasets and to handle missing values. We also apply discretization to the data when our analysis tasks require it. In the second phase, we produce high-level descriptions of the data. We use different data characterization techniques to understand what the data distribution looks like and what general information we can obtain before proceeding to more in-depth analysis. Finally, three parallel data mining tasks are conducted, each using a different method to uncover the intrinsic characteristics hidden in the data.

3.0 Data Preprocessing

Our project is based on three datasets: 1. the World Bank Research Dataset; 2. the UNCTAD databases (United Nations Conference on Trade and Development Database); 3. the IMF dataset (International Monetary Fund Dataset). These three datasets have distinct characteristics, so we need to apply a different approach to each of them.

Data Selection

1. Countries covered

The World Bank Research Dataset covers 184 countries with 16 different financial indicators, in Excel file format. The IMF dataset covers 179 countries in total, while the United Nations dataset contains around 200 countries. We use the intersection of the three sets, which leaves 105 selected countries.

2. Year covered

The three datasets cover different time periods, so we choose the years that all of them cover. The World Bank Research Dataset provides financial indicators from 1960 to 2003, the IMF dataset covers 1980 to 2005, and the United Nations dataset covers 1980 to 2003.

We therefore work on the years from 1980 to 2003.
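
The country and year selection described above amounts to intersecting the coverage of the three sources. The following is a minimal sketch of that idea, not the procedure actually used in the project; the function names and the toy country sets are assumptions made for illustration, while the year ranges are the ones stated in the text.

```python
def common_countries(*country_sets):
    """Return the countries that appear in every dataset."""
    return set.intersection(*map(set, country_sets))


def common_years(*ranges):
    """Return the (start, end) overlap of inclusive year ranges, or None."""
    start = max(r[0] for r in ranges)
    end = min(r[1] for r in ranges)
    return (start, end) if start <= end else None


# Toy country lists (hypothetical; the real datasets hold 179 to about 200 countries).
world_bank = {"Thailand", "Korea", "USA", "Brazil"}
imf = {"Thailand", "Korea", "USA", "Chile"}
unctad = {"Thailand", "Korea", "USA", "Brazil", "Chile"}

selected = common_countries(world_bank, imf, unctad)

# Year ranges as stated in the text: World Bank, IMF, United Nations.
period = common_years((1960, 2003), (1980, 2005), (1980, 2003))
```

With the year ranges from the text, the overlap is 1980 to 2003, which matches the period chosen above.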

3. Criteria and Indicator Selected

The World Bank Research Dataset provides 17 different indicators that measure the size, activity, and efficiency of financial intermediaries and markets. From these we select the indicators that we consider relevant to international direct investment, including the following:

Central Bank assets to total financial assets

Deposit Money Bank Assets to total financial assets

Other Financial Institutions Assets to total financial assets

Deposit money bank vs. central bank assets

Liquid liabilities to GDP

Central Bank Assets to GDP

Deposit Money Bank Assets to GDP

Other Financial Institutions Assets to GDP

Private credit by deposit money banks to GDP

Private credit by deposit money banks and other financial institutions to GDP

Bank deposits

Financial system deposits

Concentration

Overhead costs

Net interest margin

Life insurance penetration

Non-life insurance penetration

Stock market capitalization to GDP

Stock market total value traded to GDP

Stock market turnover ratio

Private bond market capitalization to GDP

Public bond market capitalization to GDP

One issue to be addressed is the possibility that indicators that seem unrelated to international direct investment are in fact correlated with it. Thus, we may still need to run some experiments on the indicators that look irrelevant.

Data Cleaning

In this project, several issues will need to be considered and dealt with concerning

data cleaning. These issues are missing values, noisy data, and inconsistent data.

1) Fill in missing values

Method: replace missing values of nominal and numeric attributes with the modes and means of the data (using the ReplaceMissingValues filter in Weka).

2) Identify outliers/smooth out noisy data

Outliers may be found in the dataset. Possible causes include international economic crises, natural catastrophes, and so on. These outliers must be identified when we want to make a prediction based on the time-series property. On the other hand, if we focus on the investment activities of a specific year, an outlier does not necessarily have to be deleted, because it is reasonable for an unusual event such as a tsunami to affect Thailand’s investment behavior.

3) Correct inconsistent data

Any unreasonable data must be identified and dealt with here, such as a

negative value or a value more than 1 in World Bank’s dataset.
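
The first and third cleaning steps above can be sketched in a few lines. This is only an illustration in plain Python, not the Weka ReplaceMissingValues filter itself; the function names are hypothetical, missing values are assumed to be represented as None, and the valid range of 0 to 1 is taken from the text.

```python
from statistics import mean, mode


def fill_missing(values, numeric=True):
    """Replace None with the mean (numeric) or the mode (nominal) of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known) if numeric else mode(known)
    return [fill if v is None else v for v in values]


def out_of_range(values, low=0.0, high=1.0):
    """Return the positions of values outside [low, high], e.g. negative ratios."""
    return [i for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]


ratios = fill_missing([0.25, None, 0.75])
levels = fill_missing(["Developed", None, "Developing", "Developed"], numeric=False)
suspect = out_of_range([0.1, -0.2, 1.5, 0.9])
```

Flagged positions would still need a manual decision (correct, drop, or keep), as the text notes for outliers.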

Data Integration

Since we use three different datasets in this project, we have to deal with data integration. First, the datasets address different topics and indicators. Second, their units of measurement differ as well.

We use Excel and scripts to transpose the different data into an identical format.

Data Transformation

Data generalization can be applied here by generalizing each country to a higher-level concept such as Asia, Europe, or America. In addition, data aggregation can be done by aggregating annual data into 5-year or 10-year values. Other issues, such as smoothing and normalization, are handled in the way described above under data cleaning.
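
As an illustration of the generalization and aggregation just described, the sketch below maps countries to regions and averages annual values over 5-year periods. The region mapping, the sample rows, and the choice of the mean as the aggregate are assumptions for the example; the text does not specify them.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical country-to-region mapping used only for this example.
REGION = {"Thailand": "Asia", "Korea": "Asia", "France": "Europe"}


def aggregate(rows, span=5):
    """rows: (country, year, value) tuples.

    Returns {(region, period_start): mean value}, where period_start is the
    first year of the span-year period containing the row's year.
    """
    buckets = defaultdict(list)
    for country, year, value in rows:
        period_start = year - (year % span)
        buckets[(REGION[country], period_start)].append(value)
    return {key: mean(vals) for key, vals in buckets.items()}


rows = [
    ("Thailand", 1980, 1.0),
    ("Korea", 1984, 3.0),
    ("Thailand", 1985, 5.0),
    ("France", 1981, 2.0),
]
summary = aggregate(rows, span=5)
```

Passing span=10 gives the 10-year aggregation mentioned in the text.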

Combined new dataset

Data Normalization

The values of the different indicators vary drastically. In view of this, we normalize each value to the range 0 to 1.
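
One common way to map values into the range 0 to 1 is min-max scaling. The text does not say which scheme was used, so the sketch below is an assumption; the function name is hypothetical.

```python
def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max([10.0, 20.0, 30.0])
```

Each indicator column would be scaled separately, so that indicators with large absolute values do not dominate the others.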

Data Discretization

We use Weka’s Discretize instance filter, which converts a range of numeric attributes in the dataset into nominal attributes by simple binning.

Discretization is applied for specific classifiers, such as the Decision Tree algorithm.
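
Simple binning can be sketched as equal-width binning, where the value range is cut into intervals of the same size. This is an illustration in plain Python under that assumption, not the Weka filter; the function name and the sample values are hypothetical.

```python
def equal_width_bins(values, bins=10):
    """Assign each value the index (0 .. bins-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0] * len(values)
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


labels = equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], bins=4)
```

The resulting bin indices can be treated as nominal values by algorithms that cannot handle numeric attributes.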

4.0 Characterization

In this phase, we perform generalization and attribute analysis. These two steps help us better understand our data in terms of its distribution and relevance.

4.1 Generalization

For generalization, we intend to use the attribute-oriented induction approach to data generalization and summarization-based characterization. In our case, the attributes of the dataset from the World Bank Group are indicators of a nation’s financial structure. All we need to do is select the countries that we are interested in mining. We may choose countries by region, by the amount of direct investment, or by the size of their financial institutions. We then perform attribute removal and attribute generalization on our initial working relation (i.e., the collection of task-relevant data). Because the time series are short, we may select a relatively conservative generalization threshold so that attributes remain at a rather low abstraction level. Finally, we categorize the countries into three groups: Developed, Developing, and Under-developing countries. This gives us a general view of the distribution of our data with respect to each attribute.

4.2 Analysis of Attribute Relevance

For attribute analysis, we evaluate each attribute in the candidate relation using methods provided by Weka, such as the information gain analysis technique. The attributes are then ranked according to their computed relevance to the data mining task. Attributes that are irrelevant or only weakly relevant to the task are removed. Because we are interested only in particular countries in the dataset, the remaining unselected data can be used as the contrasting class. This step results in an initial target class working relation for the further mining process.

To achieve a more accurate analysis result, selecting the right attributes to include in our data mining tasks is crucial. But how do we know which attributes we should consider? We need an analysis of attribute relevance to give us a reliable guideline.

4.3 Attribute Removal

Before performing attribute analysis, we first have to consider whether an attribute provides information that our analysis needs, and whether it could distort our analysis or confuse our judgment. We find that two types of attributes should be removed before we analyze attribute relevance: nominal attributes and numeric attributes that hold absolute values. Nominal attributes such as CountryName do not provide any useful information for our data mining tasks. Absolute values such as GDPCurrentPriceBaseOnNationalDollar, on the other hand, may vary dramatically in a short time because the value of a country’s currency surges or drops.

4.4 Attribute Analysis

1. Class Selection

The main focus of our project is the relationship between countries’ financial structure and their foreign direct investment. We therefore use the attributes related to foreign direct investment as our classes and look for other attributes relevant to these classes. After this step, we are able to filter out the irrelevant or less relevant attributes, which makes our data mining tasks simpler and more accurate.

The selected classes are:

InflowsAsAPercentageOfGFCF

InwardStockAsAPercentageOfGDP

OutflowsAsAPercentageOfGFCF

OutwardStockAsAPercentageOfGDP

2. Evaluator and Search Method

Weka provides several different evaluators and search methods to facilitate attribute analysis. Some evaluators must be used with certain search methods and vice versa. We use two combinations of evaluator and search method to perform the attribute analysis. After the more relevant attributes have been selected by the two approaches, we take the intersection of the two result sets as our final attributes.

1) CfsSubsetEval + BestFirst:

Here we use CfsSubsetEval to evaluate the attribute subsets explored by the BestFirst search algorithm.

2) InfoGainAttributeEval + Ranker:

The second time we use InfoGainAttributeEval and Ranker. The InfoGainAttributeEval evaluator must be used with the Ranker search algorithm, so the result is listed in ranked order. In addition, the attributes to be evaluated must be nominal. To satisfy this requirement, we perform a data preprocessing task, discretization, which divides the numeric attributes into 10 bins of equal frequency. After discretization, each bin holds approximately the same number of instances.

3) Result

The attributes selected by the two sets of evaluators and search methods are quite different. Some attributes selected by the first set rank rather low in the second set. After performing attribute analysis, the relevant attributes to be included in each aspect of the analysis are listed as follows (attributes that appear in both sets are shown in bold):

a. InflowsAsAPercentageOfGFCF:

0.16407 1 Year

0.16026 27 Inflation

0.12796 21 GDPbasedonPurchasingPowerParityShareofWorldTotal

0.11649 26 GDPDeflator

0.11296 29 PPPUSdollarExchangeRate

0.10125 9 FinancialSystemDeposits

0.09923 23 GDPperCapitaCurrentPrices

0.09613 22 GDPbasedonPurchasingPowerParityValuationofCountryGDP

0.09493 20 GDPbasedonPurchasingPowerParityPerCapitaGDP

0.09306 3 DepositMoneyBankvsCentralBankAssets

0.08968 8 BankDeposits

0.08933 25 GDPCurrentPricesUSDollars

0.08722 6 PrivateCreditbyDepositMoneyBankstoGDP

0.08318 5 DepositMoneyBankAssetstoGDP

0.07852 10 LiquidLiabilitiesToGDP

0.07844 24 GDPProductPerCapitaCurrentPrices

0.07764 7 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP

0.0716 18 CurrentAccountBalance

0.07032 28 InflationAnnualPercentChange

0.0619 4 CentralBankAssetstoGDP

0.0562 19 CurrentAccountBalanceinPercentofGDP

0.03593 2 Development

0.03264 11 LifeInsurancePenetration

0.03153 13 StockMarketCapitalizationToGDP

0.02643 12 Non-lifeInsurancePenetration

0.02124 15 StockMarketTurnoverRatio

0.02027 14 StockMarketTotalValueTradedToGDP

0.00989 17 PublicBondMarketCapitalizationToGDP

0.00684 16 PrivateBondMarketCapitalizationToGDP

b. InwardStockAsAPercentageOfGDP

0.1797 20 GDPbasedonPurchasingPowerParityShareofWorldTotal

0.1774 7 BankDeposits

0.1765 8 FinancialSystemDeposits

0.1695 26 Inflation

0.1637 28 PPPUSdollarExchangeRate

0.1531 22 GDPperCapitaCurrentPrices

0.1492 25 GDPDeflator

0.1446 19 GDPbasedonPurchasingPowerParityPerCapitaGDP

0.1411 2 DepositMoneyBankvsCentralBankAssets

0.1402 9 LiquidLiabilitiesToGDP

0.129 6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP

0.1283 21 GDPbasedonPurchasingPowerParityValuationofCountryGDP

0.1216 5 PrivateCreditbyDepositMoneyBankstoGDP

0.1108 24 GDPCurrentPricesUSDollars

0.1105 3 CentralBankAssetstoGDP

0.108 4 DepositMoneyBankAssetstoGDP

0.1031 27 InflationAnnualPercentChange

0.0959 23 GDPProductPerCapitaCurrentPrices

0.0778 17 CurrentAccountBalance

0.0671 18 CurrentAccountBalanceinPercentofGDP

0.056 10 LifeInsurancePenetration

0.0494 1 Development

0.0445 12 StockMarketCapitalizationToGDP

0.0409 11 Non-lifeInsurancePenetration

0.0318 13 StockMarketTotalValueTradedToGDP

0.0248 14 StockMarketTurnoverRatio

0.0135 15 PrivateBondMarketCapitalizationToGDP

0.0108 16 PublicBondMarketCapitalizationToGDP

c. OutflowsAsAPercentageOfGFCF

0.397 22 GDPperCapitaCurrentPrices

0.3873 19 GDPbasedonPurchasingPowerParityPerCapitaGDP

0.2846 6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP

0.2707 24 GDPCurrentPricesUSDollars

0.2533 5 PrivateCreditbyDepositMoneyBankstoGDP

0.253 4 DepositMoneyBankAssetstoGDP

0.2332 20 GDPbasedonPurchasingPowerParityShareofWorldTotal

0.2186 1 Development

0.2158 17 CurrentAccountBalance

0.214 26 Inflation

0.21 21 GDPbasedonPurchasingPowerParityValuationofCountryGDP

0.2028 2 DepositMoneyBankvsCentralBankAssets

0.1907 25 GDPDeflator

0.178 8 FinancialSystemDeposits

0.1766 7 BankDeposits

0.1759 23 GDPProductPerCapitaCurrentPrices

0.1579 27 InflationAnnualPercentChange

0.1446 10 LifeInsurancePenetration

0.1436 9 LiquidLiabilitiesToGDP

0.1383 28 PPPUSdollarExchangeRate

0.1343 3 CentralBankAssetstoGDP

0.1253 18 CurrentAccountBalanceinPercentofGDP

0.1119 12 StockMarketCapitalizationToGDP

0.1113 13 StockMarketTotalValueTradedToGDP

0.0746 11 Non-lifeInsurancePenetration

0.0564 14 StockMarketTurnoverRatio

0.0268 15 PrivateBondMarketCapitalizationToGDP

0.0175 16 PublicBondMarketCapitalizationToGDP

d. OutwardStockAsAPercentageOfGDP

0.5374 22 GDPperCapitaCurrentPrices

0.526 19 GDPbasedonPurchasingPowerParityPerCapitaGDP

0.3609 6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP

0.3481 4 DepositMoneyBankAssetstoGDP

0.3454 5 PrivateCreditbyDepositMoneyBankstoGDP

0.3269 24 GDPCurrentPricesUSDollars

0.3191 20 GDPbasedonPurchasingPowerParityShareofWorldTotal

0.2794 8 FinancialSystemDeposits

0.2744 2 DepositMoneyBankvsCentralBankAssets

0.2703 21 GDPbasedonPurchasingPowerParityValuationofCountryGDP

0.2666 1 Development

0.2567 7 BankDeposits

0.2557 23 GDPProductPerCapitaCurrentPrices

0.2477 26 Inflation

0.2423 17 CurrentAccountBalance

0.2336 28 PPPUSdollarExchangeRate

0.2104 9 LiquidLiabilitiesToGDP

0.2082 27 InflationAnnualPercentChange

0.1884 25 GDPDeflator

0.1803 10 LifeInsurancePenetration

0.1657 3 CentralBankAssetstoGDP

0.1509 18 CurrentAccountBalanceinPercentofGDP

0.1492 12 StockMarketCapitalizationToGDP

0.1275 13 StockMarketTotalValueTradedToGDP

0.0917 11 Non-lifeInsurancePenetration

0.0608 14 StockMarketTurnoverRatio

0.0285 15 PrivateBondMarketCapitalizationToGDP

0.0231 16 PublicBondMarketCapitalizationToGDP

The results of the two evaluation schemes show that the two method sets differ most for the class OutflowsAsAPercentageOfGFCF and agree most for InwardStockAsAPercentageOfGDP. To filter out the most irrelevant attributes, we keep all the attributes selected by the first set together with the first fifteen attributes listed in the second set. However, attributes selected by the first set whose values in the second set are smaller than 0.1 are eliminated.
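
The ranking step of section 4.4 (equal-frequency discretization followed by information gain) can be sketched as below. This is a plain-Python illustration of the technique, not Weka's InfoGainAttributeEval implementation; the function names and the tiny inputs are assumptions for the example.

```python
from collections import Counter
from math import log2


def equal_frequency_bins(values, bins=10):
    """Assign bin indices so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels


def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def info_gain(attribute, target):
    """Entropy of the class minus its expected entropy after splitting on the attribute."""
    n = len(target)
    groups = {}
    for a, t in zip(attribute, target):
        groups.setdefault(a, []).append(t)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - remainder


binned = equal_frequency_bins([5, 1, 4, 2, 3, 6], bins=3)
perfect = info_gain([0, 0, 1, 1], ["lo", "lo", "hi", "hi"])
useless = info_gain([0, 1, 0, 1], ["lo", "lo", "hi", "hi"])
```

Sorting the attributes by their gain in descending order gives a ranking of the same kind as the lists above, with the most informative attribute first.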

4.5 Conclusion

According to the results of this attribute analysis, the most relevant attributes for each of the Foreign Direct Investment attributes are:

e. InflowsAsAPercentageOfGFCF:

0.16407 1 Year

0.16026 27 Inflation

0.12796 21 GDPbasedonPurchasingPowerParityShareofWorldTotal

0.11649 26 GDPDeflator

0.11296 29 PPPUSdollarExchangeRate

0.10125 9 FinancialSystemDeposits

0.09923 23 GDPperCapitaCurrentPrices

0.09613 22 GDPbasedonPurchasingPowerParityValuationofCountryGDP

f. InwardStockAsAPercentageOfGDP

0.1797 20 GDPbasedonPurchasingPowerParityShareofWorldTotal

0.1774 7 BankDeposits

0.1765 8 FinancialSystemDeposits

0.1695 26 Inflation

0.1637 28 PPPUSdollarExchangeRate

0.1531 22 GDPperCapitaCurrentPrices

0.1492 25 GDPDeflator

0.1446 19 GDPbasedonPurchasingPowerParityPerCapitaGDP

g. OutflowsAsAPercentageOfGFCF

0.397 22 GDPperCapitaCurrentPrices

0.3873 19 GDPbasedonPurchasingPowerParityPerCapitaGDP

0.2846 6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP

0.2707 24 GDPCurrentPricesUSDollars

0.2533 5 PrivateCreditbyDepositMoneyBankstoGDP

0.253 4 DepositMoneyBankAssetstoGDP

0.2332 20 GDPbasedonPurchasingPowerParityShareofWorldTotal

0.2186 1 Development

h. OutwardStockAsAPercentageOfGDP

0.5374 22 GDPperCapitaCurrentPrices

0.526 19 GDPbasedonPurchasingPowerParityPerCapitaGDP

0.3609 6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP

0.3481 4 DepositMoneyBankAssetstoGDP

0.3454 5 PrivateCreditbyDepositMoneyBankstoGDP

0.3269 24 GDPCurrentPricesUSDollars

0.3191 20 GDPbasedonPurchasingPowerParityShareofWorldTotal

0.2794 8 FinancialSystemDeposits

The attributes appearing in all four categories should play more important

roles in our data mining task. These attributes are:

22 GDPperCapitaCurrentPrices

19 GDPbasedonPurchasingPowerParityPerCapitaGDP

6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP

4 DepositMoneyBankAssetstoGDP

24 GDPCurrentPricesUSDollars

20 GDPbasedonPurchasingPowerParityShareofWorldTotal

8 FinancialSystemDeposits

7 BankDeposits

26 Inflation

5.0 Association

As defined in the textbook, the rule A => B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (that is, both A and B). This is taken to be the probability P(A ∪ B). The rule A => B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A).

To mine our association rules, we use the Apriori and Predictive Apriori algorithms. Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules.
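
The two measures defined above can be computed directly from a list of transactions. The sketch below is only meant to make the definitions concrete; it is not the Apriori algorithm, and the function names and the toy transactions (built from hypothetical discretized items) are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(A ∪ B) / support(A), i.e. the estimate of P(B|A).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical transactions: one set of discretized attribute values per record.
transactions = [
    {"GDP=high", "FDI=high"},
    {"GDP=high"},
    {"GDP=high", "FDI=high"},
    {"GDP=low"},
]

s = support(transactions, {"GDP=high", "FDI=high"})
c = confidence(transactions, {"GDP=high"}, {"FDI=high"})
```

In this toy example the rule GDP=high => FDI=high has support 0.5 (two of four transactions) and confidence 2/3 (two of the three transactions containing GDP=high).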

5.1 Step 1 of Association: Select Attributes

Before running our association analysis, we have to choose the attributes to analyze. We tried two ways of selecting attributes:

1. Using the Select Attribute function in Weka

2. Using ratio or percentage attributes only

The first way to select attributes

First, select the attributes by the class “FDI inflows millions of dollars”.

Attribute Evaluator : weka.attributeSelection.CfsSubsetEval

Search Method : weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -S 1

Attribute Selection Mode: Use full training set.

Class: FDIInflowsMillionsOfDollars

Selected attributes: 15,19,20,22,23,26,28,29,31,32,34,35,37,38 : 14

StockMarketTotalValueTradedToGDP

CurrentAccountBalance

CurrentAccountBalanceinPercentofGDP

GDPbasedonPurchasingPowerParityShareofWorldTotal

GDPbasedonPurchasingPowerParityValuationofCountryGDP

GDPCurrentPricesUSDollars

Inflation

InflationAnnualPercentChange

FDIInwardStockMillionsOfDollars

FDIOutwardStockMillionsOfDollars

FDIOutflowsMillionsOfDollars

InflowsAsAPercentageOfGFCF

OutflowsAsAPercentageOfGFCF

OutwardStockAsAPercentageOfGDP

By the definition of attribute selection, these 14 attributes have a relatively high correlation with our target class “FDIInflowsMillionsOfDollars”. Because we want to find association relations for Foreign Direct Investment that cover both inflows and outflows, we also have to select the attributes related to outflows.

Second, select the attributes by the class “FDI outflows millions of dollars”.

Attribute Evaluator : weka.attributeSelection.CfsSubsetEval

Search Method : weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -S 1

Attribute Selection Mode: Use full training set.

Class: FDIOutflowsMillionsOfDollars

Selected attributes: 23,25,31,32,33,37 : 6

GDPbasedonPurchasingPowerParityValuationofCountryGDP

GDPProductPerCapitaCurrentPrices

FDIInwardStockMillionsOfDollars

FDIOutwardStockMillionsOfDollars

FDIInflowsMillionsOfDollars

OutflowsAsAPercentageOfGFCF

As described above, the attributes we use should be related to inflows, outflows, or both, so we take the union of the two selected sets. The union contains 16 attributes, including both class attributes; adding CountryName, Year, and Development gives our final 19 attributes.

This is the first set of attributes that we use for our association analysis:

CountryName
Year
Development
StockMarketTotalValueTradedToGDP
CurrentAccountBalance
CurrentAccountBalanceinPercentofGDP
GDPbasedonPurchasingPowerParityShareofWorldTotal
GDPbasedonPurchasingPowerParityValuationofCountryGDP
GDPProductPerCapitaCurrentPrices
GDPCurrentPricesUSDollars
Inflation
InflationAnnualPercentChange
FDIInwardStockMillionsOfDollars
FDIOutwardStockMillionsOfDollars
FDIInflowsMillionsOfDollars
FDIOutflowsMillionsOfDollars
InflowsAsAPercentageOfGFCF
OutflowsAsAPercentageOfGFCF
OutwardStockAsAPercentageOfGDP
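The count of 19 attributes can be checked with a small set computation. The following is only an illustrative Python sketch and not part of the Weka workflow: the variable names are hypothetical, and the attribute names are copied from the two selection results above. It assumes that the three identifying attributes (CountryName, Year, Development) and the two class attributes are added to the union of the selected sets.

```python
# Attributes selected with FDIInflowsMillionsOfDollars as the class (14 attributes).
inflow_selected = {
    "StockMarketTotalValueTradedToGDP",
    "CurrentAccountBalance",
    "CurrentAccountBalanceinPercentofGDP",
    "GDPbasedonPurchasingPowerParityShareofWorldTotal",
    "GDPbasedonPurchasingPowerParityValuationofCountryGDP",
    "GDPCurrentPricesUSDollars",
    "Inflation",
    "InflationAnnualPercentChange",
    "FDIInwardStockMillionsOfDollars",
    "FDIOutwardStockMillionsOfDollars",
    "FDIOutflowsMillionsOfDollars",
    "InflowsAsAPercentageOfGFCF",
    "OutflowsAsAPercentageOfGFCF",
    "OutwardStockAsAPercentageOfGDP",
}

# Attributes selected with FDIOutflowsMillionsOfDollars as the class (6 attributes).
outflow_selected = {
    "GDPbasedonPurchasingPowerParityValuationofCountryGDP",
    "GDPProductPerCapitaCurrentPrices",
    "FDIInwardStockMillionsOfDollars",
    "FDIOutwardStockMillionsOfDollars",
    "FDIInflowsMillionsOfDollars",
    "OutflowsAsAPercentageOfGFCF",
}

# Assumption: these attributes are kept regardless of the selection result.
class_attributes = {"FDIInflowsMillionsOfDollars", "FDIOutflowsMillionsOfDollars"}
identifying_attributes = {"CountryName", "Year", "Development"}

selected_union = inflow_selected | outflow_selected
final_attributes = selected_union | class_attributes | identifying_attributes

print(len(selected_union))    # 16
print(len(final_attributes))  # 19
```

Four of the six outflow-related attributes already appear in the inflow-related set, which is why the union adds only two new attributes.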

The second way to select attributes

In addition to the selection above, we try another set of attributes to see whether it gives a different result. By directly reviewing the characteristics of all the attributes, we pick only the ratio and percentage attributes. The reason is that absolute values differ too much between countries, which makes association more complicated. For example, the difference in GDP between the USA and Korea is very large, yet ratio attributes, such as the GDP to inflation ratio or the direct investment to bank assets ratio, may be much closer between the two countries. In a mixed attribute set, the attributes with large absolute differences could distort the patterns and blur the similarity between the two countries.

After removing the attributes that are neither ratios nor percentages, 23 attributes are left:

CountryName
Year
Development
DepositMoneyBankvsCentralBankAssets
CentralBankAssetstoGDP
DepositMoneyBankAssetstoGDP
PrivateCreditbyDepositMoneyBankstoGDP
PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP
LiquidLiabilitiesToGDP
LifeInsurancePenetration
Non-lifeInsurancePenetration
StockMarketCapitalizationToGDP
StockMarketTotalValueTradedToGDP
StockMarketTurnoverRatio
PrivateBondMarketCapitalizationToGDP
PublicBondMarketCapitalizationToGDP
CurrentAccountBalanceinPercentofGDP
InflationAnnualPercentChange
PPPUSdollarExchangeRate
InflowsAsAPercentageOfGFCF
InwardStockAsAPercentageOfGDP
OutflowsAsAPercentageOfGFCF
OutwardStockAsAPercentageOfGDP

5.2 Step 2 of association: Discretize

After selecting the attributes, one more step remains before we can run association on our dataset: we have to discretize the data. As described before, the algorithm we use, Apriori, cannot handle purely numerical data. Discretization replaces each original numerical value with one of several distinct bins, such as “between 0 and 1” or “between 1 and 3”.

In Weka, after selecting the attributes in the previous step, we apply the Discretize filter to the dataset. One option is important here: “use equal frequency” under the Discretize filter should be set to True. Otherwise the filter uses equal-width bins, and because our values are unevenly distributed, most of the data falls into a few specific bins. With nearly all the data in the same bins, we cannot get meaningful rules.
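The difference between the two binning modes can be illustrated with a small Python sketch. This is not the Weka Discretize filter itself: the two functions and the sample values are hypothetical, and the equal-frequency version splits ties by rank, which is a simplification.

```python
from collections import Counter


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign values to n_bins bins that hold (nearly) the same number of instances.

    Simplification: equal values may be split across bins by rank.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


# Hypothetical skewed attribute: nine small values and one very large one.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
width_counts = Counter(equal_width_bins(values, 5))
frequency_counts = Counter(equal_frequency_bins(values, 5))

print(sorted(width_counts.items()))      # [(0, 9), (4, 1)]
print(sorted(frequency_counts.items()))  # [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2)]
```

With equal-width bins, nine of the ten values share one bin, so almost every instance would carry the same item. Equal-frequency bins spread the instances evenly.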

5.3 Step 3 of association: Association

1. Association on the first set of 19 attributes with Apriori:

Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0

Instances: 2520

Associator model (full training set)

Apriori

=======

Minimum support: 0.15

Minimum metric <confidence>: 0.9

Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 9

Size of set of large itemsets L(2): 11

Size of set of large itemsets L(3): 7

Size of set of large itemsets L(4): 1

Best rules found:

1. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 480 ==>

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 480 conf:(1)

2. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]'

OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 450 ==> OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]'

450 conf:(1)

3. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 458 ==>

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 457 conf:(1)

4. Development=Developing FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' 396 ==>

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 391 conf:(0.99)

5. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' 544 ==>

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 537 conf:(0.99)

6. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]'

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 457 ==> OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]'

450 conf:(0.98)

7. FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 479 ==>

OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 471 conf:(0.98)

8. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 458 ==>

OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 450

conf:(0.98)

9. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 458 ==>

OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 450 conf:(0.98)

10. FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 621 ==> OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]'

603 conf:(0.97)

What we could get from this result:

The ten best rules all describe strong associations among the outflow-related indicators: FDI outward stock, FDI outflows, outflows as a percentage of GFCF, and outward stock as a percentage of GDP. Every rule has a confidence of at least 0.97, and the minimum support of 0.15 is a reasonable level for this analysis.

The reason all the rules relate to FDI outflows could be that we selected these 19 attributes by their correlation with FDI outflows and FDI inflows.

We do get some patterns from this association analysis, but we want to find more than that. That is why we try different attributes and another algorithm for association.
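The confidence values printed by Weka can be recomputed from the two counts shown in each rule: the number of instances matching the antecedent and the number matching the whole rule. The sketch below is illustrative Python with hypothetical function names; the counts are copied from rules 10 and 3 of the Apriori output above.

```python
N_INSTANCES = 2520  # number of instances reported by Weka


def rule_confidence(antecedent_count, rule_count):
    """Fraction of the instances matching the antecedent that also match the consequent."""
    return rule_count / antecedent_count


def rule_support(rule_count, n_instances=N_INSTANCES):
    """Fraction of all instances that match both sides of the rule."""
    return rule_count / n_instances


# Rule 10: FDIOutflowsMillionsOfDollars bin (621) ==> OutflowsAsAPercentageOfGFCF bin (603)
print(round(rule_confidence(621, 603), 2))  # 0.97
print(round(rule_support(603), 3))          # 0.239

# Rule 3: 458 ==> 457, which rounds to the conf:(1) shown in the output
print(round(rule_confidence(458, 457), 2))  # 1.0
```

The support of rule 10 is well above the 0.15 minimum, and the rounding explains why a rule with 457 of 458 matching instances is reported with a confidence of 1.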

2. Association on the first set of 19 attributes with Predictive Apriori:

Scheme: weka.associations.PredictiveApriori -N 100

Instances: 2520

Associator model (full training set)

PredictiveApriori

===================

Best rules found:

1. GDPProductPerCapitaCurrentPrices='(-inf-0.000001]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 228 ==>

OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 228 acc:(0.99391)

2. GDPProductPerCapitaCurrentPrices='(-inf-0.000001]' FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]'

OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 198 ==> OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]'

198 acc:(0.99347)

3. OutflowsAsAPercentageOfGFCF='(-inf-0.427376]' 191 ==> FDIOutflowsMillionsOfDollars='(-inf-0.030614]' 191

acc:(0.99334)

4. GDPProductPerCapitaCurrentPrices='(-inf-0.000001]' FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]'

FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 190 ==> OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]'

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 190 acc:(0.99332)

5. FDIInwardStockMillionsOfDollars='(-inf-0.000618]' FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' 184 ==>

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 184 acc:(0.99319)

6. Development=Developing GDPProductPerCapitaCurrentPrices='(-inf-0.000001]'

FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' 164 ==>

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 164 acc:(0.9927)

7. GDPbasedonPurchasingPowerParityValuationofCountryGDP='(-inf-0.000206]'

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 163 ==>

FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' 163 acc:(0.99267)

8. Development=Developed GDPbasedonPurchasingPowerParityValuationofCountryGDP='(0.05223-inf)' 162 ==>

GDPCurrentPricesUSDollars='(0.032872-inf)' 162 acc:(0.99264)

9. GDPbasedonPurchasingPowerParityValuationofCountryGDP='(-inf-0.000206]'

FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 155 ==>

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 155 acc:(0.99242)

10. FDIInflowsMillionsOfDollars='(-inf-0.014284]' 243 ==> InflowsAsAPercentageOfGFCF='(-inf-0.538132]' 242

acc:(0.99236)

What we could get from this result:

In brief, these association rules show some more useful correlations between attributes.

For example, the first rule tells us that when GDP per capita falls in a specific range and FDI outflows also fall in a specific range, we can predict the range of outflows as a percentage of gross fixed capital formation (GFCF).

Outward stock as a percentage of GDP is the index that interests us most. According to these rules, we can predict it from GDP per capita, the amount of FDI outward stock, GDP based on purchasing power parity, and so on.

Another index of interest is outflows as a percentage of gross fixed capital formation. To predict it, we should know GDP per capita, FDI outflows, FDI outward stock, and so on.

3. Association on the 23 ratio and percentage attributes with Apriori:

Note: At first, the default settings did not produce any rules. We therefore lowered the minimum confidence from 0.9 to 0.5 in order to generate some rules, and then obtained a few results, as expected.

Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0

Instances: 2520

Associator model (full training set)

Apriori

=======

Minimum support: 0.1

Minimum metric <confidence>: 0.5

Number of cycles performed: 18

Generated sets of large itemsets:

Size of set of large itemsets L(1): 79

Size of set of large itemsets L(2): 5

Size of set of large itemsets L(3): 1

Best rules found:

1. OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 612 ==> OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]'

531 conf:(0.87)

2. Development=Developing OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 464 ==>

OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 393 conf:(0.85)

3. OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 746 ==> Development=Developing 571 conf:(0.77)

4. OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 612 ==> Development=Developing 464 conf:(0.76)

5. OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 531 ==>

Development=Developing 393 conf:(0.74)

6. OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 746 ==> OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]'

531 conf:(0.71)

7. Development=Developing OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 571 ==>

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 393 conf:(0.69)

8. OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 612 ==> Development=Developing

OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 393 conf:(0.64)

9. OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 746 ==> Development=Developing

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 393 conf:(0.53)

10. Development=Developed 576 ==> PPPUSdollarExchangeRate='(0.000001-0.000002]' 303 conf:(0.53)

What we could get from this result:

As the result shows, the confidences in this association run are lower than in the previous runs. However, we do get some more useful information here.

For developing countries, outward stock as a percentage of GDP and outflows as a percentage of GFCF are highly associated. Also, if a country is developed, its PPP US dollar exchange rate is likely to fall into a specific range (rule 10, confidence 0.53).

4. Association on the 23 ratio and percentage attributes with Predictive Apriori:

Scheme: weka.associations.PredictiveApriori -N 100

Relation: finstructure-weka.filters.unsupervised.attribute.Normalize-weka.filters.unsupervised.attribute.Remove-R9-10,19,31-34-weka.filters.unsupervised.attribute.Remove-R18-25-weka.filters.unsupervised.attribute.Discretize-F-B10-M-1.0-Rfirst-last

Instances: 2520

Associator model (full training set)

PredictiveApriori

=======

Best rules found:

1. PPPUSdollarExchangeRate='(0.000313-inf)' 188 ==> Development=Developing 188 acc:(0.99325)

2. DepositMoneyBankAssetstoGDP='(0.549816-inf)'

PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP='(0.525423-inf)'

LiquidLiabilitiesToGDP='(0.413579-inf)' 105 ==> PrivateCreditbyDepositMoneyBankstoGDP='(0.489811-inf)' 105

acc:(0.9898)

3. LiquidLiabilitiesToGDP='(-inf-0.077226]' OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 97 ==>

Development=Developing 97 acc:(0.9891)

4. DepositMoneyBankvsCentralBankAssets='(0.713174-inf)' DepositMoneyBankAssetstoGDP='(0.549816-inf)'

PrivateCreditbyDepositMoneyBankstoGDP='(0.489811-inf)' 97 ==> Development=Developed 97 acc:(0.9891)

5. DepositMoneyBankvsCentralBankAssets='(-inf-0.328866]' DepositMoneyBankAssetstoGDP='(-inf-0.053588]'

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 94 ==> Development=Developing 94 acc:(0.98881)

29. DepositMoneyBankvsCentralBankAssets='(-inf-0.328866]' InwardStockAsAPercentageOfGDP='(-inf-0.55239]' 56 ==>

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 56 acc:(0.9823)

33. StockMarketCapitalizationToGDP='(0.288662-inf)' OutflowsAsAPercentageOfGFCF='(0.461885-inf)' 55 ==>

OutwardStockAsAPercentageOfGDP='(0.240554-inf)' 55 acc:(0.982)

52. Year=1981 OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 46 ==>

OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 46 acc:(0.97879)

56. PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP='(0.15771-0.207416]'

OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 78 ==> OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 77

acc:(0.97557)

75. CountryName=BurkinaFaso 24 ==> Development=Developing InwardStockAsAPercentageOfGDP='(-inf-0.55239]' 24

acc:(0.96134)

89. CountryName=Dominica 24 ==> Development=Developing OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 24

acc:(0.96134)

92. CountryName=Egypt 24 ==> Development=Developing InwardStockAsAPercentageOfGDP='(-inf-0.55239]' 24

acc:(0.96134)

What we could get from this result:

Weka returned 100 rules; we show only a few of them here.

The first rule shows the relation between the PPP US dollar exchange rate and developing countries. If we know a country's purchasing power parity information, we can make a reasonable guess as to whether it is a developing country.

Outward stock as a percentage of GDP is again the index that interests us most. We can predict it from inward stock as a percentage of GDP, whether the country is a developing country, stock market capitalization to GDP, and so on.

Another index of interest is outflows as a percentage of gross fixed capital formation. To predict it, we should know private credit by deposit money banks, outward stock as a percentage of GDP, the country name, and some other attributes that appear in the rules.

5.4 Conclusion:

By definition, rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong. Our association analysis produced several strong rules, but not all of them are useful to us, so we have to choose which rules we need and should apply. For example, many of the rules have nothing to do with FDI inflows, FDI outflows, or the inward and outward stock ratios.

By selecting the useful rules, we can use them for prediction as we did above.

Of course, even if some attributes and rules are not of interest now, we might still need them later. When we want to predict other attributes, we just repeat the same process. We can analyze many relations between attributes and also try to predict data from a few key values.

6.0 Classification and Prediction:

According to our project goal, our task is to predict certain countries' direct investment in the United States by analyzing the correlation between their financial structure and their direct investment in the United States. Countries whose financial structures differ in size may have different international investment behavior and strategy, and other factors, such as proximity or the development status of a country, may affect investment in the United States as well. We can therefore construct a classification model by decision tree induction to classify the data of an unseen country, and then make a prediction according to the class that the unseen country matches. For countries that have already been analyzed in our mining process, we can simply employ the statistical technique of regression to predict continuous values.

Select Attributes

Originally we have 38 attributes in our dataset. After using all of them for the mining task, we found that, because the values of some attributes vary enormously, the resulting prediction is not very satisfactory, even though we normalized the attributes to remove the differences in scale. We therefore filter out the attributes whose values are absolute and vary dramatically from country to country, and keep only those expressed as a percentage or ratio.

Secondly, we remove the country name, year, and development attributes, because our goal is to build a model that can predict a nation's inward or outward investment from the other attributes' values, excluding country, year, and development status. This leaves 20 attributes for the data mining task.

6.1 Prediction by Linear Regression

Using linear regression to predict OutwardStockAsAPercentageOfGDP and InwardStockAsAPercentageOfGDP

=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8

Relation: finstructure-weka.filters.unsupervised.attribute.Normalize-weka.filters.unsupervised.attribute.Remove-R9-10,19,31-34-weka.filters.unsupervised.attribute.Remove-R18-25-weka.filters.unsupervised.attribute.Remove-R1-3-weka.filters.unsupervised.instance.RemoveMisclassified-Wweka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8-C-1-F0-T0.05-I0

Instances: 1915

Attributes: 20

DepositMoneyBankvsCentralBankAssets

CentralBankAssetstoGDP

DepositMoneyBankAssetstoGDP

PrivateCreditbyDepositMoneyBankstoGDP

PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP

LiquidLiabilitiesToGDP

LifeInsurancePenetration

Non-lifeInsurancePenetration

StockMarketCapitalizationToGDP

StockMarketTotalValueTradedToGDP

StockMarketTurnoverRatio

PrivateBondMarketCapitalizationToGDP

PublicBondMarketCapitalizationToGDP

CurrentAccountBalanceinPercentofGDP

InflationAnnualPercentChange

PPPUSdollarExchangeRate

InflowsAsAPercentageOfGFCF

InwardStockAsAPercentageOfGDP

OutflowsAsAPercentageOfGFCF

OutwardStockAsAPercentageOfGDP

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model (Outward Stock as a percentage of GDP)

OutwardStockAsAPercentageOfGDP =

-0.0229 * DepositMoneyBankvsCentralBankAssets +

-0.0361 * CentralBankAssetstoGDP +

0.1014 * DepositMoneyBankAssetstoGDP +

0.0207 * PrivateCreditbyDepositMoneyBankstoGDP +

-0.0107 * PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP +

-0.07 * LiquidLiabilitiesToGDP +

0.0544 * LifeInsurancePenetration +

0.0899 * StockMarketCapitalizationToGDP +

0.2151 * StockMarketTotalValueTradedToGDP +

-0.1165 * StockMarketTurnoverRatio +

0.0169 * PublicBondMarketCapitalizationToGDP +

0.165 * CurrentAccountBalanceinPercentofGDP +

0.1594 * InwardStockAsAPercentageOfGDP +

1.8957 * OutflowsAsAPercentageOfGFCF +

-0.9621

Time taken to build model: 0.2 seconds

=== Cross-validation ===

=== Summary ===

Correlation coefficient 0.9398

Mean absolute error 0.012

Root mean squared error 0.0158

Relative absolute error 45.4532 %

Root relative squared error 34.1512 %

Total Number of Instances 1915

Linear Regression Model ( Inward stock as a percentage of GDP)

InwardStockAsAPercentageOfGDP =

0.0634 * DepositMoneyBankvsCentralBankAssets +

0.0588 * CentralBankAssetstoGDP +

-0.1443 * DepositMoneyBankAssetstoGDP +

0.0851 * PrivateCreditbyDepositMoneyBankstoGDP +

-0.0384 * PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP +

0.0846 * LiquidLiabilitiesToGDP +

-0.03 * LifeInsurancePenetration +

0.0716 * Non-lifeInsurancePenetration +

-0.1021 * StockMarketTotalValueTradedToGDP +

0.0355 * StockMarketTurnoverRatio +

-0.0273 * PrivateBondMarketCapitalizationToGDP +

-0.1699 * CurrentAccountBalanceinPercentofGDP +

15.8396 * InflowsAsAPercentageOfGFCF +

-0.7026 * OutflowsAsAPercentageOfGFCF +

0.4928 * OutwardStockAsAPercentageOfGDP +

-7.5842

Time taken to build model: 0.19 seconds

=== Cross-validation ===

=== Summary ===

Correlation coefficient 0.6141

Mean absolute error 0.0192

Root mean squared error 0.0289

Relative absolute error 75.1824 %

Root relative squared error 79.1605 %

Total Number of Instances 1915

Analysis

Since our task is to predict a continuous value rather than a categorical label, we choose the statistical technique of regression to tackle this problem.

The prediction result for outward stock as a percentage of GDP is acceptable (correlation coefficient 0.9398), while that for inward stock as a percentage of GDP is not (correlation coefficient 0.6141, relative absolute error 75.18%). Either linear regression is not suitable for the model we want to build, or the attributes in the dataset are only weakly correlated with inward stock. Since the model is not good enough to predict inward stock as a percentage of GDP, we focus on the model for predicting outward stock as a percentage of GDP.
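To show how the fitted equation is applied, the following Python sketch evaluates the outward stock model for one instance. The coefficients and intercept are copied from the Weka output above. The function name and the example instance are hypothetical, inputs are assumed to be on the normalized scale produced by the Normalize filter named in the relation string, and treating a missing attribute as 0 is a simplifying assumption of this sketch.

```python
# Coefficients of the outward stock model, copied from the Weka output.
OUTWARD_COEFFICIENTS = {
    "DepositMoneyBankvsCentralBankAssets": -0.0229,
    "CentralBankAssetstoGDP": -0.0361,
    "DepositMoneyBankAssetstoGDP": 0.1014,
    "PrivateCreditbyDepositMoneyBankstoGDP": 0.0207,
    "PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP": -0.0107,
    "LiquidLiabilitiesToGDP": -0.07,
    "LifeInsurancePenetration": 0.0544,
    "StockMarketCapitalizationToGDP": 0.0899,
    "StockMarketTotalValueTradedToGDP": 0.2151,
    "StockMarketTurnoverRatio": -0.1165,
    "PublicBondMarketCapitalizationToGDP": 0.0169,
    "CurrentAccountBalanceinPercentofGDP": 0.165,
    "InwardStockAsAPercentageOfGDP": 0.1594,
    "OutflowsAsAPercentageOfGFCF": 1.8957,
}
OUTWARD_INTERCEPT = -0.9621


def predict_outward_stock(instance):
    """Evaluate the linear model; attributes missing from `instance` count as 0 (assumption)."""
    return OUTWARD_INTERCEPT + sum(
        coefficient * instance.get(name, 0.0)
        for name, coefficient in OUTWARD_COEFFICIENTS.items()
    )


# Hypothetical normalized instance with a single non-zero attribute.
example = {"OutflowsAsAPercentageOfGFCF": 0.5}
print(round(predict_outward_stock(example), 5))  # -0.01425
```

The coefficient of OutflowsAsAPercentageOfGFCF (1.8957) is by far the largest in the equation, so this attribute dominates the prediction.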

At the beginning, the regression equation involved only 12 attributes and its correlation coefficient was 0.8074, but the relative absolute error and root relative squared error were too high, probably because of outliers. We therefore used the “RemoveMisclassified” filter to remove outliers, which cut the instances from 2520 to 1915. The resulting model is better: its correlation coefficient is 0.9398, and the relative absolute error and root relative squared error are both below 50% (45.45% and 34.15%).
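The five figures in the cross-validation summary can be reproduced from a list of actual and predicted values. The sketch below is illustrative Python with a hypothetical function name and made-up sample data. As a simplification, it takes the baseline for the two relative errors to be a predictor that always outputs the mean of the actual values; Weka's own baseline may be computed slightly differently.

```python
import math


def regression_summary(actual, predicted):
    """Return correlation, MAE, RMSE, RAE and RRSE for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    mean_predicted = sum(predicted) / n

    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]

    # Errors of the baseline that always predicts the mean of the actual values.
    baseline_abs = sum(abs(a - mean_actual) for a in actual)
    baseline_sq = sum((a - mean_actual) ** 2 for a in actual)

    covariance = sum((a - mean_actual) * (p - mean_predicted)
                     for a, p in zip(actual, predicted))
    predicted_sq = sum((p - mean_predicted) ** 2 for p in predicted)

    return {
        "correlation": covariance / math.sqrt(baseline_sq * predicted_sq),
        "mean_absolute_error": sum(abs_errors) / n,
        "root_mean_squared_error": math.sqrt(sum(sq_errors) / n),
        "relative_absolute_error": sum(abs_errors) / baseline_abs,
        "root_relative_squared_error": math.sqrt(sum(sq_errors) / baseline_sq),
    }


# Made-up values, only to exercise the function.
summary = regression_summary([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A relative error of 100% means the model is no better than always predicting the mean, which is why values of 75% to 79% for the inward stock model are considered poor.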

Figure: Histograms showing the distribution of each attribute in the dataset.

Figure: Scatter plot with X: PredictedOutwardStockAsAPercentageOfGDP and Y: OutwardStockAsAPercentageOfGDP. The plot shows that the predictions are roughly proportional to the real values.

6.2 Prediction by Decision Tree

Using a decision tree to predict OutwardStockAsAPercentageOfGDP

Result:

=== Summary ===

Correctly Classified Instances 1907 97.4949 %

Incorrectly Classified Instances 49 2.5051 %

Kappa statistic 0.9093

Mean absolute error 0.005

Root mean squared error 0.0602

Relative absolute error 8.8114 %

Root relative squared error 36.062 %

Total Number of Instances 1956

Analysis

We then changed our approach to predicting the target value. By using the “Discretize” filter to convert numeric values to nominal data, we can do classification with a decision tree: we build a tree and predict which interval of OutwardStockAsAPercentageOfGDP a query falls into.

At the beginning, the resulting model had an accuracy of around 70%, which is acceptable but not very good. We again suspected outliers. After eliminating them, which cut the instances from 2520 to 1956, we get a much better result with an accuracy of 97%.

Looking closely at the decision tree, we found that the attribute at the top level of the tree is “PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP”. We can therefore infer that this attribute is the most discriminating for the target value we want to predict.
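The attribute at the root of a decision tree is the one that scores best on the learner's split criterion. The report does not state which tree learner or criterion was used, so the Python sketch below assumes plain information gain, as in ID3-style learners; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Reduction in class entropy obtained by splitting on a nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Toy data: four instances with a discretized target and two discretized attributes.
target = ["low", "low", "high", "high"]
informative = ["bin1", "bin1", "bin2", "bin2"]    # separates the classes perfectly
uninformative = ["bin1", "bin2", "bin1", "bin2"]  # tells nothing about the class

print(information_gain(informative, target))    # 1.0
print(information_gain(uninformative, target))  # 0.0
```

Under this criterion, the attribute with the highest gain over the whole training set becomes the root, which matches the intuition that the root attribute is the most discriminating one.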

Figure: Histograms showing the distribution of each attribute, colored by class.

Figure: Scatter plot with X: OutwardStockAsAPercentageOfGDP and Y: PredictedOutwardStockAsAPercentageOfGDP.

6.3 Prediction by K-Nearest Neighbor

Using nearest neighbor to predict OutwardStockAsAPercentageOfGDP

Result:

K=1

=== Cross-validation ===

=== Summary ===

Correlation coefficient 0.9379

Mean absolute error 0.0332

Root mean squared error 0.0501

Relative absolute error 37.9836 %

Root relative squared error 37.582 %

Total Number of Instances 1001

Analysis

Then we tried an instance-based lazy learning method for the prediction. After removing outliers with the “RemoveMisclassified” filter, the number of instances is cut down to 1001. The result for K=1 is shown above. We also tried K=2 and K=3, but the results were not as good as with K=1. Overall the accuracy is high: the correlation coefficient (0.9379) is close to that of linear regression (0.9398), and the relative absolute error is lower (37.98% versus 45.45%). This makes sense because investment indicators may follow certain patterns in the economic indicators, so a query can get a reasonable estimate from the real-valued labels of the k nearest neighbors of the unknown sample. If the query in the mining task comes from a developed country, its indicators may fall in the range where other developed countries are located.

One problem with the k-nearest neighbor classifier is that it assigns equal weight to each attribute. This may cause confusion when there are many irrelevant attributes in the data.
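The effect of equal attribute weights can be seen in a minimal k-nearest neighbor regressor. The following Python sketch is not the Weka implementation: the function name, the optional `weights` parameter, and the two training instances are hypothetical, and the second attribute is assumed to be irrelevant noise.

```python
import math


def knn_predict(train_x, train_y, query, k=1, weights=None):
    """Average the targets of the k training instances closest to `query`.

    `weights` scales each attribute's contribution to the Euclidean distance;
    by default all attributes count equally.
    """
    if weights is None:
        weights = [1.0] * len(query)

    def distance(row):
        return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, row, query)))

    nearest = sorted(range(len(train_x)), key=lambda i: distance(train_x[i]))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


# Two normalized training instances: [relevant attribute, irrelevant attribute].
train_x = [[0.10, 0.90], [0.80, 0.20]]
train_y = [0.05, 0.30]
query = [0.15, 0.10]

# Equal weights: the irrelevant attribute pulls the query towards the second instance.
print(knn_predict(train_x, train_y, query, k=1))                      # 0.3

# Ignoring the irrelevant attribute selects the first instance instead.
print(knn_predict(train_x, train_y, query, k=1, weights=[1.0, 0.0]))  # 0.05
```

The query is nearly identical to the first instance on the relevant attribute, yet the unweighted distance picks the second one, which is the kind of confusion described above.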

6.4 Conclusion:

• When we employed linear regression, we assumed that our prediction task fits a linear model. According to the equation, the target value we want to predict is related only to a subset of the attributes, so we can predict outward stock as a percentage of GDP from these attributes. However, since we have 20 attributes in total, it is possible that the task would be better served by a nonlinear regression model. Nevertheless, we chose to simplify the task by using linear regression.

• With the symbolic learning method, the decision tree, we obtained a satisfactory prediction. The decision-tree building algorithm found that the attribute “PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP” does the best job of splitting the training data at the root of the tree. Since we discretized each attribute into 10 bins, the prediction is only a range rather than an exact number. The high accuracy may come from the fact that most of the instances fall into a few ranges. How to discretize the data into a suitable number of bins, so that the distribution is better suited to building the decision tree, is therefore a primary factor.

• One problem with prediction by the case-based (nearest neighbor) method is that the result depends heavily on the original data. If the unknown instance to be predicted does not resemble any of the initial data, the technique will not give a good prediction. Assigning equal weight to each attribute may also be a problem, since the linear regression result suggests that only a subset of the attributes is related to the prediction. Choosing a proper weighting scheme for the attributes is therefore a difficult yet important issue when using this method.

7.0 Cluster Analysis

Here we use the clustering algorithms provided by Weka to group the data into clusters whose objects are highly similar to one another and dissimilar to the objects in other clusters. We try to analyze the relationship between development status and countries' financial structure by applying five clustering methods. The attributes used are those chosen in the attribute analysis.

7.1 Using EM

Grouping: The instances are automatically grouped into 8 clusters. The details are as follows:

Number of clusters selected by cross validation: 8

Cluster: 0 Prior probability: 0.2877

Cluster: 1 Prior probability: 0.1483

Cluster: 2 Prior probability: 0.0376

Cluster: 3 Prior probability: 0.2409

Cluster: 4 Prior probability: 0.0292

Cluster: 5 Prior probability: 0.1564

Cluster: 6 Prior probability: 0.0449

Cluster: 7 Prior probability: 0.055

7.1.1 Correlations

We use the visualization of this clustering result to understand the correlations between financial structure attributes and foreign direct investment attributes. We find that some attributes are relevant to foreign direct investment and some are not.

1. Irrelevant attributes: All the instances are plotted together, and there is no way to distinguish the clusters in the visualization.

Year

GDPProductPerCapitaCurrentPrices

GDPbasedonPurchasingPowerParityValuationofCountryGDP

GDPperCapitaCurrentPrices

GDPCurrentPricesUSDollars

GDPDeflator

Inflation

PPPUSdollarExchangeRate

GDPbasedonPurchasingPowerParityShareofWorldTotal

2. Relevant attributes: The instances can be generally separated into two

clusters in the visualization.

DepositMoneyBankvsCentralBankAssets

DepositMoneyBankAssetstoGDP

PrivateCreditbyDepositMoneyBankstoGDP

BankDeposits

FinancialSystemDeposits

LiquidLiabilitiesToGDP

GDPbasedonPurchasingPowerParityPerCapitaGDP

For example, the visualization of PPPUSdollarExchangeRate vs. InflowsAsAPercentageOfGFCF shows no correlation. All instances are located around the same place, and the clusters cannot be distinguished.

Figure: PPPUSdollarExchangeRate vs. InflowsAsAPercentageOfGFCF

The visualization of GDPperCapitaCurrentPrices vs. InflowsAsAPercentageOfGFCF shows a positive correlation.

Figure: GDPperCapitaCurrentPrices vs. InflowsAsAPercentageOfGFCF

The visualization of Development vs. InflowsAsAPercentageOfGFCF shows an interesting correlation. The left cluster consists mostly of developed countries, the middle cluster has both developed and developing countries, and the right cluster consists mostly of other countries.

Figure: Development vs. InflowsAsAPercentageOfGFCF

7.2 Using SimpleKMeans

We set the number of clusters to two. The first cluster consists mostly of developed countries, and the second consists mostly of developing and other countries. The percentage of incorrectly clustered instances is 24.9206%, so the clustering is reasonably, though not fully, reliable. This result tells us that financial structure and direct investment are related to a country's development status.

Clustered Instances

0 814 ( 32%)

1 1706 ( 68%)

Class attribute: Development

Classes to Clusters:

0 1 <-- assigned to cluster

499 77 | Developed

287 1393 | Developing

28 236 | Other

Cluster 0 <-- Developed

Cluster 1 <-- Developing

Incorrectly clustered instances : 628.0 24.9206 %
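The error figure can be reproduced from the classes-to-clusters table above. The following is a minimal Python sketch, not Weka code: the function name is hypothetical, and the cluster-to-class mapping is taken as given from the output rather than recomputed.

```python
def classes_to_clusters_error(matrix, cluster_label):
    """Count instances whose class differs from the class mapped to their cluster.

    matrix: {class_name: [count in cluster 0, count in cluster 1, ...]}
    cluster_label: {cluster_index: class_name}, as reported in the output
    """
    total = sum(sum(row) for row in matrix.values())
    # An instance is counted as correct only if its cluster is mapped to its own class.
    correct = sum(matrix[label][cluster] for cluster, label in cluster_label.items())
    incorrect = total - correct
    return incorrect, 100.0 * incorrect / total


# Counts copied from the SimpleKMeans output above.
kmeans_matrix = {
    "Developed": [499, 77],
    "Developing": [287, 1393],
    "Other": [28, 236],
}
kmeans_mapping = {0: "Developed", 1: "Developing"}

incorrect, percent = classes_to_clusters_error(kmeans_matrix, kmeans_mapping)
print(incorrect, round(percent, 4))  # 628 24.9206
```

Because there are three classes but only two clusters, no cluster is mapped to Other, so all 264 Other instances count as errors no matter how they are grouped.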

7.3 Using Cobweb

In this clustering task, the instances are split into equal-sized clusters of exactly 105 instances each. As a result, 95.8333 % of the instances are incorrectly clustered, so this result gives no help to our analysis.

7.4 Using FarthestFirst

The result here shows two clusters. One cluster contains almost all instances (97%), and the other contains only 3%. The result is not useful.

Clustered Instances

0 2440 ( 97%)

1 80 ( 3%)

Class attribute: Development

Classes to Clusters:

0 1 <-- assigned to cluster

499 77 | Developed

1677 3 | Developing

264 0 | Other

Cluster 0 <-- Developing

Cluster 1 <-- Developed

Incorrectly clustered instances : 766.0 30.3968 %
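One plausible reason for such a lopsided split is the way farthest-first traversal chooses its centers: each new center is the point farthest from the centers already chosen, so extreme instances tend to become centers of small clusters. The sketch below illustrates the general technique in Python under simplifying assumptions (Euclidean distance, a fixed starting point, hypothetical toy data). It is not Weka's implementation.

```python
import math


def farthest_first(points, k, first=0):
    """Pick k centers by farthest-first traversal, then assign points to the nearest center."""
    centers = [points[first]]
    # Distance from every point to its nearest chosen center so far.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx])) for p, d in zip(points, nearest)]
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
    return centers, labels


# Hypothetical toy data: a dense group plus one extreme point.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.1), (5.0, 5.0)]
centers, labels = farthest_first(toy, k=2)
print(labels)  # [0, 0, 0, 0, 0, 1]
```

In the toy data the single extreme point becomes the second center and ends up alone in its cluster, which mirrors the 97% versus 3% split reported above.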

7.5 Using MakeDensityBasedClusterer

Two clusters are formed. The first cluster contains almost 95% of the developed countries, around 20% of the developing countries, and about 10% of the other countries. The second cluster consists mostly of developing countries and other countries. This result agrees with what we obtained using SimpleKMeans.

Clustered Instances

0 871 ( 35%)

1 1649 ( 65%)

Log likelihood: 21.99495

Class attribute: Development

Classes to Clusters:

0 1 <-- assigned to cluster

539 37 | Developed

307 1373 | Developing

25 239 | Other

Cluster 0 <-- Developed

Cluster 1 <-- Developing

Incorrectly clustered instances : 608.0 24.127 %
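The share of each class that lands in cluster 0 can be read directly from the table above. The short Python sketch below does the arithmetic; the function name is hypothetical and the counts are copied from the output.

```python
def class_shares(matrix, cluster):
    """Percentage of each class's instances that were assigned to the given cluster."""
    return {name: 100.0 * counts[cluster] / sum(counts) for name, counts in matrix.items()}


# Counts copied from the MakeDensityBasedClusterer output above.
density_matrix = {
    "Developed": [539, 37],
    "Developing": [307, 1373],
    "Other": [25, 239],
}

shares = class_shares(density_matrix, cluster=0)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
# Developed: 93.6%
# Developing: 18.3%
# Other: 9.5%
```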

7.6 Other experiment

In this section we try another approach to grouping the instances: we use the attributes related to a country’s foreign direct investment as the class to map to the clusters. We first discretize the four attributes InflowsAsAPercentageOfGFCF, InwardStockAsAPercentageOfGDP, OutflowsAsAPercentageOfGFCF, and OutwardStockAsAPercentageOfGDP, and then select them as the class. However, the results of this experiment are unsatisfactory: too many instances are incorrectly clustered (84.0476 % for InflowsAsAPercentageOfGFCF).

Clustered Instances

0 396 ( 16%)

1 461 ( 18%)

2 295 ( 12%)

3 324 ( 13%)

4 156 ( 6%)

5 162 ( 6%)

6 139 ( 6%)

7 128 ( 5%)

8 253 ( 10%)

9 206 ( 8%)

Class attribute: InflowsAsAPercentageOfGFCF

Classes to Clusters:

0 1 2 3 4 5 6 7 8 9 <-- assigned to cluster

18 83 35 78 16 4 11 6 38 20 | '(-inf-0.538132]'

28 43 25 40 15 9 23 4 41 17 | '(0.538132-0.538189]'

31 37 19 36 19 13 27 12 33 14 | '(0.538189-0.538266]'

32 44 23 31 21 18 22 18 26 16 | '(0.538266-0.538364]'

29 48 36 22 9 29 21 16 16 18 | '(0.538364-0.53849]'

42 40 31 20 23 20 12 18 25 16 | '(0.53849-0.538637]'

39 40 24 24 20 30 11 11 26 19 | '(0.538637-0.538861]'

57 44 40 22 8 18 6 11 15 25 | '(0.538861-0.539225]'

49 44 39 20 13 15 4 16 20 26 | '(0.539225-0.539841]'

71 38 23 31 12 6 2 16 13 35 | '(0.539841-inf)'

Cluster 0 <-- '(0.539841-inf)'

Cluster 1 <-- '(0.538364-0.53849]'

Cluster 2 <-- '(0.538861-0.539225]'

Cluster 3 <-- '(-inf-0.538132]'

Cluster 4 <-- '(0.53849-0.538637]'

Cluster 5 <-- '(0.538637-0.538861]'

Cluster 6 <-- '(0.538189-0.538266]'

Cluster 7 <-- '(0.538266-0.538364]'

Cluster 8 <-- '(0.538132-0.538189]'

Cluster 9 <-- '(0.539225-0.539841]'

Incorrectly clustered instances : 2118.0 84.0476 %
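The report does not say which discretization method was used. The row totals of the table above range from 241 to 309 instances per interval, which is consistent with equal-frequency binning, so the sketch below illustrates that technique in Python. This is an assumption for illustration only: the function names and sample values are hypothetical, and the cut points are chosen more simply than a tool such as Weka would choose them.

```python
def equal_frequency_edges(values, bins):
    """Return bins - 1 cut points so that each bin holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # The i-th cut point is the largest value that still belongs to bin i - 1.
    return [ordered[(i * n) // bins - 1] for i in range(1, bins)]


def bin_index(value, edges):
    """Return the bin of value, treating each bin as a (lower, upper] interval."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)


# Hypothetical sample values, not taken from the report's dataset.
sample = [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]
edges = equal_frequency_edges(sample, bins=5)
counts = [0] * 5
for v in sample:
    counts[bin_index(v, edges)] += 1
print(edges, counts)  # [2, 4, 6, 8] [2, 2, 2, 2, 2]
```

When most values are packed into a very narrow range, as the interval labels above indicate, equal-frequency cut points lie extremely close together, so small differences in the underlying value move an instance from one class to another.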

7.7 Conclusion

From this cluster analysis, we draw the following conclusions:

1. Deposit money banks play an important role: they influence the performance of a country’s foreign direct investment. The higher the ratio of Deposit Money Bank versus Central Bank Assets, the larger the investment inflow from foreign countries.

2. Financial System Deposits correlates highly with a country’s inward/outward and inflow/outflow investment movements. Demand, time, and saving deposits in deposit money banks and other financial institutions are related to how active a country’s foreign direct investment is.

3. GDP dominates all the attributes as the decisive criterion for assessing a country’s foreign direct investment behavior. Both GDP itself and the ratios of other attributes to GDP strongly affect how active a country’s foreign direct investment is.

4. GDPDeflator, Inflation, and PPPUSdollarExchangeRate have nothing to do with a country’s performance regarding foreign direct investment.

5. A country’s level of development is partly reflected in the grouping results of the clustering, but the correspondence is not perfect. Other factors must be involved.

6. Based on the data’s intrinsic characteristics, the instances can be grouped into 8 clusters, a number determined by the EM algorithm. The EM algorithm can decide automatically how many clusters to create, whereas with algorithms such as SimpleKMeans and FarthestFirst we must presume the number of clusters in advance.

8.0 Resources

8.1 Software Environment

1. Weka

2. Excel

8.2 Hardware Environment

Laptop 1:

Intel® Pentium® M Processor (1.3 GHz)

256 MB DDR SDRAM

40 GB 4200 RPM HD

Laptop 2:

Intel® Pentium® M Processor (1.5 GHz)

512 MB DDR SDRAM

40 GB 4200 RPM HD

9.0 References

1. United Nations http://www.unctad.org/Templates/Page.asp?intItemID=1923&lang=1

2. IMF http://www.imf.org/external/pubs/ft/weo/2004/02/data/index.htm

3. World Bank http://econ.worldbank.org/view.php?type=18&id=3343