G22.3033-010
Data Warehousing and Data Mining
Project Design
Correlations between Financial Structure
and Foreign Direct Investment
Cheng-Cheng Ku (N18898117)
Chien-wen Hsu (N16208816)
Li-Heng Chen (N14028026)
May 2, 2005
Table of Contents
1.0 Introduction
1.1 Motives
1.2 Project Description
1.3 High Level Goal
1.4 Hypotheses
1.5 Datasets
2.0 Project Design
3.0 Data Preprocessing
4.0 Characterization
4.1 Generalization
4.2 Analysis of Attribute Relevance
4.3 Attribute Removal
4.4 Attribute Analysis
4.5 Conclusion
5.0 Association
5.1 Select Attribute
5.2 Discretize
5.3 Association
5.4 Conclusion
6.0 Classification and Prediction
6.1 Prediction by Linear Regression
6.2 Prediction by Decision Tree
6.3 Prediction by K-Nearest Neighbor
6.4 Conclusion
7.0 Cluster Analysis
7.1 EM
7.2 SimpleKMeans
7.3 Cobweb
7.4 FarthestFirst
7.5 MakeDensityBasedClusterer
7.6 Other experiment
7.7 Conclusion
8.0 Resources
8.1 Software
8.2 Hardware
9.0 References
1.0 Introduction
1.1 Motives
Business dominates today's world, and our lives are shaped by the fluctuations of the
economy. The economic situation can change not only our daily life but also the
future of a country. Without a healthy business environment, the people of a country
cannot have a stable life.
Besides a country's financial development, globalization is another factor that we
cannot ignore when talking about the business world. A hundred years ago, we did not
have to care about what happened in other countries because it made little difference
to our lives. Today the story is different. Many of us still remember the financial
crises that suddenly emerged in South East Asia and Latin America. At that time,
every country watched the situation closely because everyone understood its
seriousness: if it was not handled carefully, anyone could be the next victim.
Because of the importance and complexity of the business world, many
professionals and companies have researched financial, monetary and other business
areas. They try to discover the hidden patterns behind the business world, to find
ways to control them, and even to predict what will happen next.
Recognizing the importance of the global business world, we decided to collect
data in this area as our research topic. We also want to see whether we can find
patterns that help us make business decisions in the future.
1.2 Project Description
The characteristics and activity of a country's market attract foreign investors to
pour money into that market. The amount of foreign direct investment flowing from
one country to another is often related to the characteristics of the receiving
country's economic system.
In this project, we intend to discover the correlations between the Financial
Development and Structure of countries and their Foreign Direct Investment. First,
we will analyze the inflow, outflow and position of each country's foreign direct
investment against the size of its financial and stock markets, GDP per capita
growth and economic index growth. For example, we build a sub-dataset from every
selected country's Stock market capitalization to GDP and its foreign direct
investment to and from other countries. We then compare the data columns in this
sub-dataset across all the countries to look for correlations.
Second, both datasets we chose are time series, which gives us a good chance to
project the future. We will try to predict a country's direct investment for the next
year from the data of the past 10 or 20 years. For example, we can predict which
developing country has the most promising economic growth and whether its direct
investment grows at the same speed.
1.3 High Level Goal
We want to achieve two high-level goals in this project by applying data mining
techniques to discover useful and meaningful patterns in our datasets.
1. Analyze the correlations between Financial Development and Structure of
countries, and their Foreign Direct Investment.
2. Generate a prediction equation for a country’s Foreign Direct Investment.
1.4 Hypotheses
1. Many factors can bias the data. For example, the political structure of a
country may change suddenly in a specific year, or speculators may attack a financial
or stock market by selling large amounts of financial or monetary products. We should
pay attention to these special cases, and when we build a predictive model we should
not take the biased data into account.
2. We assume that all people and companies in the market are logical and make
their decisions by rational judgment. Only under this assumption can we analyze
reasonable reactions from a known model. We cannot predict what an investor will do
if he does not want to maximize his profit; he could act against his own benefit in
ways that most people would not.
1.5 Datasets
1. World Bank Research Dataset
(http://econ.worldbank.org/view.php?type=18&id=3343)
This dataset of financial development and structure across countries and over time
unites a range of indicators that measure the size, activity, and efficiency of financial
intermediaries and markets. First published in 1999, it improved on previous efforts
by presenting data on the public share of commercial banks, by introducing indicators
of the size and activity of non-bank financial institutions, and by presenting measures
of the size of bond and primary equity markets.
The indicators for each country include Private credit by deposit money banks to
GDP, Financial system deposits, Net Interest Margin, Stock market capitalization to
GDP, Private bond market capitalization to GDP, and others. The time series run from
1960 to 2001.
2. UNCTAD Databases (United Nations Conference on Trade and Development
Database)
The UNCTAD provides time series of economic data and development
indicators, in some cases going back as far as 1950, in order to keep track of trends in
world trade, the global economy and development. It is possible to view the latest
revised figures as well as the full time series.
The Foreign Direct Investment database (FDI) presents inflows, outflows,
inward stocks and outward stocks of foreign direct investment for 196 reporting
economies in an interactive format.
These data correspond to the WIR 2004 Annex B tables. According to "Definitions and
Sources" in the above mentioned publication, Foreign direct investment (FDI) is
defined as an investment involving a long-term relationship and reflecting a lasting
interest in and control by a resident entity in one economy (foreign direct investor or
parent enterprise) of an enterprise resident in a different economy (FDI enterprise or
affiliate enterprise or foreign affiliate). This definition is based on the FDI concept as
presented in the IMF Balance of Payments Manual (BPM 5, 1993) and is also a basis
for that adopted in the second edition of the OECD Detailed Benchmark Definition of
FDI. FDI implies that the investor exerts a significant degree of influence on the
management of the enterprise resident in the other economy. Such investment involves
both the initial transaction between the two entities and all subsequent transactions
between them and among foreign affiliates, both incorporated and unincorporated. The
benefits that direct investors expect to derive from a voice in management are different
from those anticipated by portfolio investors, who have no significant influence over
the operations of enterprises. Direct investors are in a position to obtain benefits in
addition to investment income, such as management fees opportunities or similar types
of income (in contrast to portfolio investors, whose primary concerns are capital safety
and returns generated). A direct investment enterprise is defined as an incorporated or
unincorporated enterprise in which the direct investor, resident in another economy,
owns 10 percent or more of the ordinary shares or voting power (or the equivalent).
However, this criterion is not strictly observed by all countries reporting, which may
decide to also include in the FDI figures those investments that do not yield 10 percent
or more of voting power, but are nonetheless judged to give investors a significant
voice in management. Most direct investment enterprises are either branches or
subsidiaries that are wholly or majority owned by non-residents or in which a clear
majority of voting stock is held by a single direct investor or group. The borderline
cases are therefore likely to form a rather small proportion of the whole FDI cluster.
FDI may be undertaken by individuals as well as by business entities. For more detailed
information on concepts presented in this table, please refer to the IMF Balance of
Payments Manual (BPM 5, 1993) and to UNCTAD's World Investment Report 2004:
The Shift Towards Services.
3. International Monetary Fund Dataset (IMF Dataset)
2.0 Project Design
Our project is divided into three phases. First, we perform data preprocessing to
integrate the different datasets and to handle missing values. We also discretize the
data when an analysis task requires it. In the second phase, we produce high-level
data descriptions: we use different data characterization techniques to understand
how the data are distributed and what general information we can obtain before
proceeding to more in-depth analysis. Finally, we conduct three data mining tasks in
parallel, using different methods to uncover more of the intrinsic characteristics
hidden in the data.
3.0 Data Preprocessing
Our project is based on three datasets: 1. the World Bank Research Dataset, 2. the
UNCTAD databases (United Nations Conference on Trade and Development Database), and
3. the IMF dataset (International Monetary Fund Dataset). These three datasets have
distinct characteristics, so we need to apply different approaches to each of them.
Data Selection
1. Countries covered
The World Bank Research Dataset covers 184 countries with 16 different financial
indicators in Excel file format. The IMF dataset covers 179 countries in total, while
the United Nations dataset contains around 200 countries. We use the intersection of
the three, which gives a total of 105 selected countries.
2. Year covered
The three datasets cover different time periods, so we choose the years that all of
them cover. The World Bank Research Dataset reports its financial indicators from
1960 to 2003, the IMF dataset covers 1980 to 2005, and the United Nations dataset
covers 1980 to 2003.
We thus work on the years from 1980 to 2003.
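The country and year selection described above amounts to two set intersections. The following is a minimal sketch of that idea, not the script used in the project: the country codes "A" to "E" are placeholders and the `coverage` structure is an assumption made for illustration. Only the year ranges are taken from the text.

```python
# Hypothetical coverage summary per dataset. Country codes are placeholders;
# the year ranges are the ones quoted in the text (end year inclusive).
coverage = {
    "WorldBank": {"countries": {"A", "B", "C", "D"}, "years": range(1960, 2004)},
    "IMF": {"countries": {"B", "C", "D", "E"}, "years": range(1980, 2006)},
    "UNCTAD": {"countries": {"A", "C", "D", "E"}, "years": range(1980, 2004)},
}


def common_countries(coverage):
    """Countries reported by every dataset."""
    return set.intersection(*(set(info["countries"]) for info in coverage.values()))


def common_years(coverage):
    """Years covered by every dataset, as a sorted list."""
    return sorted(set.intersection(*(set(info["years"]) for info in coverage.values())))


countries = common_countries(coverage)
years = common_years(coverage)
print(sorted(countries))    # ['C', 'D']
print(years[0], years[-1])  # 1980 2003
```

With the real country lists in place of the placeholders, the same intersection would yield the selected countries.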
3. Criteria and Indicator Selected
The World Bank Research Dataset provides a total of 17 different indicators that
measure the size, activity, and efficiency of financial intermediaries and markets.
We select the indicators we consider relevant to international direct investment,
including the following:
Central Bank assets to total financial assets
Deposit Money Bank Assets to total financial assets
Other Financial Institutions Assets to total financial assets
Deposit money bank vs. central bank assets
Liquid liabilities to GDP
Central Bank Assets to GDP
Deposit Money Bank Assets to GDP
Other Financial Institutions Assets to GDP
Private credit by deposit money banks to GDP
Private credit by deposit money banks and other financial institutions to GDP
Bank deposits
Financial system deposits
Concentration
Overhead costs
Net interest margin
Life insurance penetration
Non-life insurance penetration
Stock market capitalization to GDP
Stock market total value traded to GDP
Stock market turnover ratio
Private bond market capitalization to GDP
Public bond market capitalization to GDP
One issue to address is the possibility that indicators which seem unrelated to
international direct investment are actually correlated with it. We might therefore
still need to run some experiments on those seemingly irrelevant indicators.
Data Cleaning
In this project, several data cleaning issues need to be considered and dealt with:
missing values, noisy data, and inconsistent data.
1) Fill in missing values
Method: replace missing values for nominal and numeric attributes with the modes
and means of the data (using the ReplaceMissingValues filter in Weka).
2) Identify outliers/smooth out noisy data
Outliers may be found in the dataset. Possible causes include an international
economic crisis or a natural catastrophe. These outliers must be identified
when we want to perform a prediction based on the time-series property. On the
other hand, if we focus on the investment activities in a specific year, such
an outlier need not be deleted, because it is reasonable that an unusual event
like a tsunami affects Thailand's investment behavior.
3) Correct inconsistent data
Any unreasonable data must be identified and dealt with here, such as a
negative value or a value more than 1 in World Bank’s dataset.
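The cleaning steps above can be illustrated with a small sketch. It is not Weka's implementation: it only shows mean imputation for a numeric attribute and a range check that flags ratio values below 0 or above 1. The function names and sample values are made up for illustration.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from
    m = mean(observed)
    return [m if v is None else v for v in values]


def out_of_range(values, low=0.0, high=1.0):
    """Indices of non-missing values that fall outside [low, high]."""
    return [i for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]


# Made-up ratio values: one missing, one above 1, one negative.
series = [0.25, None, 0.75, 1.5, -0.5]
suspect = out_of_range(series)                       # [3, 4]
filled = fill_missing_with_mean([0.25, None, 0.75])  # [0.25, 0.5, 0.75]
print(suspect, filled)
```

Flagging rather than silently dropping the suspect values keeps the decision about each unreasonable entry explicit.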
Data Integration
Since we use three different datasets in this project, we have to deal with data
integration. First, these datasets address different topics and indicators. Second,
their units of measurement differ as well.
We use Excel and scripts to transpose the different data into an identical format.
Data Transformation
Data generalization could be used here by generalizing country to a higher-level
concept such as Asia, Europe, or America. In addition, data aggregation could be done
by aggregating annual data into 5-year or 10-year values. Smoothing is handled in the
way described above under data cleaning, and normalization is described below.
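As a rough illustration of the generalization and aggregation ideas above, the sketch below maps countries to regions and averages annual values within 5-year periods. The `REGION` mapping, the choice of the mean as the aggregate, and the sample records are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: country -> region.
REGION = {"Thailand": "Asia", "Korea": "Asia", "France": "Europe"}


def aggregate(records, period=5):
    """Average values per (region, first year of the period)."""
    buckets = defaultdict(list)
    for country, year, value in records:
        start = year - (year % period)  # e.g. 1983 -> 1980 for 5-year periods
        buckets[(REGION[country], start)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Made-up (country, year, value) records.
records = [
    ("Thailand", 1980, 1.0),
    ("Korea", 1983, 3.0),
    ("France", 1981, 2.0),
    ("Thailand", 1985, 4.0),
]
summary = aggregate(records)
print(summary[("Asia", 1980)])  # 2.0
```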
Combined new dataset
Data Normalization
The values of different indicators vary drastically. In view of this, we normalize
each value to the range 0 to 1.
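Scaling to the range 0 to 1 is commonly done with min-max normalization. The sketch below assumes that method, since the text does not name the exact formula, and uses made-up sample values.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1.

    Uses v' = (v - min) / (max - min). A constant column maps to all zeros.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([10.0, 20.0, 30.0])
print(normalized)  # [0.0, 0.5, 1.0]
```

Each indicator would be normalized separately, so that indicators with large absolute values do not dominate the others.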
Data Discretization
We apply an instance filter that discretizes a range of numeric attributes in the
dataset into nominal attributes. Discretization is by simple binning.
Discretization is applied for specific classifiers such as the Decision Tree algorithm.
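Simple binning can be read as equal-width binning: the value range is cut into intervals of the same width and each value is replaced by the index of its interval. The sketch below illustrates that reading. It is not Weka's filter, and its treatment of boundary values (intervals closed on the left) is an assumption that may differ from Weka's.

```python
def equal_width_bins(values, n_bins):
    """Return a bin index in 0..n_bins-1 for every value, using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # constant attribute: a single bin
    labels = []
    for v in values:
        index = int((v - lo) / width)
        labels.append(min(index, n_bins - 1))  # the maximum goes into the last bin
    return labels


labels = equal_width_bins([0.0, 0.1, 0.5, 0.9, 1.0], n_bins=2)
print(labels)  # [0, 0, 1, 1, 1]
```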
4.0 Characterization
In this phase, we perform generalization and attribute analysis. These two steps
help us better understand our data in terms of distribution and relevance.
4.1 Generalization
For generalization, we use the attribute-oriented induction approach to data
generalization and summarization-based characterization. In our case, the attributes
of the dataset from the World Bank Group are indicators of a nation's financial
structure. All we need to do is select the countries we are interested in mining. We
may choose countries by region, by the amount of direct investment, or by the size of
their financial institutions. Then we perform attribute removal and attribute
generalization on our initial working relation (i.e. the collection of task-relevant
data). Because the time series are short, we select a relatively conservative
generalization threshold so that attributes remain at a rather low abstraction level.
Finally, we categorize the countries into three groups: Developed, Developing, and
Underdeveloped countries. This gives us a general view of the distribution of our data
for each attribute.
4.2 Analysis of Attribute Relevance
For attribute analysis, we evaluate each attribute in the candidate relation using
methods provided by Weka, such as information gain analysis. The attributes are then
ranked according to their computed relevance to the data mining task. Attributes that
are irrelevant or only weakly relevant to the task are then removed. Because we are
only interested in particular countries in the dataset, the remaining unselected data
can be used as the contrasting class. This step results in an initial target class
working relation for the further mining process.
To achieve a more accurate analysis result, selecting the right attributes to
include in our data mining tasks is crucial. But how do we know which attributes we
should consider? We therefore need an analysis of attribute relevance, which can give
us a trustworthy guideline.
4.3 Attribute Removal
Before performing attribute analysis, we first have to consider whether an
attribute provides information necessary to our analysis, and whether it could
sabotage our analysis or confuse our judgment. We find that two types of attributes
should be removed before we analyze attribute relevance: nominal attributes and
numeric attributes expressed as absolute values. Nominal attributes such as
CountryName do not provide any useful information for our data mining tasks.
Absolute values such as GDPCurrentPriceBaseOnNationalDollar, on the other hand, may
vary dramatically in a short time because the value of a country's currency surges
or drops.
4.4 Attribute Analysis
1. Class Selection
In our project, the main focus is the relationship between countries' financial
structure and their foreign direct investment. We therefore use the attributes related
to foreign direct investment as our classes and look for other attributes relevant to
these classes. After this step, we are able to filter out the irrelevant or less
relevant attributes, which makes our data mining tasks simpler and more accurate.
The selected classes are:
InflowsAsAPercentageOfGFCF
InwardStockAsAPercentageOfGDP
OutflowsAsAPercentageOfGFCF
OutwardStockAsAPercentageOfGDP
2. Evaluator and Search Method
Weka provides several different evaluators and search methods to facilitate
attribute analysis. Some evaluators must be used with certain search methods and vice
versa. We use two combinations of evaluator and search method to perform the attribute
analysis. After the more relevant attributes have been selected by the two approaches,
we take the intersection of both result sets as our final attributes.
1) CfsSubsetEval + BestFirst:
Here we use CfsSubsetEval to evaluate the attribute subsets searched by the
BestFirst algorithm.
2) InfoGainAttributeEval + Ranker:
The second time we use InfoGainAttributeEval and Ranker. The
InfoGainAttributeEval evaluator must be used with the Ranker search algorithm,
so the result is listed in ranked order. In addition, the attributes to be
evaluated must be nominal. To solve this problem, we perform a data
preprocessing task, discretization, which divides the numeric attributes
into 10 bins with equal frequency. After discretization, the number of
instances in each bin is approximately the same.
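Information gain measures how much knowing a nominal attribute reduces the entropy of the class. The sketch below computes it for attributes that are already discretized. It shows the standard definition rather than Weka's code, and the toy labels are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(attribute, labels):
    """Entropy of the class minus the entropy remaining after splitting on the attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [lab for a, lab in zip(attribute, labels) if a == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Toy data: a discretized attribute and a discretized class.
target = ["small", "small", "large", "large"]
gain_perfect = information_gain(["low", "low", "high", "high"], target)  # 1.0
gain_none = information_gain(["low", "high", "low", "high"], target)     # 0.0
print(gain_perfect, gain_none)
```

Ranking attributes by this score, highest first, corresponds to the ordered list that a ranking search produces.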
3) Result
The attributes selected by the two combinations of evaluator and search method
are quite different. Some attributes selected by the first combination rank rather
low in the second. After performing attribute analysis, the relevant attributes for
each aspect of the analysis are listed as follows (the attributes that appear in
both sets are shown in bold). Each line gives the information gain, the attribute
index, and the attribute name.
a. InflowsAsAPercentageOfGFCF:
0.16407 1 Year
0.16026 27 Inflation
0.12796 21 GDPbasedonPurchasingPowerParityShareofWorldTotal
0.11649 26 GDPDeflator
0.11296 29 PPPUSdollarExchangeRate
0.10125 9 FinancialSystemDeposits
0.09923 23 GDPperCapitaCurrentPrices
0.09613 22 GDPbasedonPurchasingPowerParityValuationofCountryGDP
0.09493 20 GDPbasedonPurchasingPowerParityPerCapitaGDP
0.09306 3 DepositMoneyBankvsCentralBankAssets
0.08968 8 BankDeposits
0.08933 25 GDPCurrentPricesUSDollars
0.08722 6 PrivateCreditbyDepositMoneyBankstoGDP
0.08318 5 DepositMoneyBankAssetstoGDP
0.07852 10 LiquidLiabilitiesToGDP
0.07844 24 GDPProductPerCapitaCurrentPrices
0.07764 7 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP
0.0716 18 CurrentAccountBalance
0.07032 28 InflationAnnualPercentChange
0.0619 4 CentralBankAssetstoGDP
0.0562 19 CurrentAccountBalanceinPercentofGDP
0.03593 2 Development
0.03264 11 LifeInsurancePenetration
0.03153 13 StockMarketCapitalizationToGDP
0.02643 12 Non-lifeInsurancePenetration
0.02124 15 StockMarketTurnoverRatio
0.02027 14 StockMarketTotalValueTradedToGDP
0.00989 17 PublicBondMarketCapitalizationToGDP
0.00684 16 PrivateBondMarketCapitalizationToGDP
b. InwardStockAsAPercentageOfGDP
0.1797 20 GDPbasedonPurchasingPowerParityShareofWorldTotal
0.1774 7 BankDeposits
0.1765 8 FinancialSystemDeposits
0.1695 26 Inflation
0.1637 28 PPPUSdollarExchangeRate
0.1531 22 GDPperCapitaCurrentPrices
0.1492 25 GDPDeflator
0.1446 19 GDPbasedonPurchasingPowerParityPerCapitaGDP
0.1411 2 DepositMoneyBankvsCentralBankAssets
0.1402 9 LiquidLiabilitiesToGDP
0.129 6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP
0.1283 21 GDPbasedonPurchasingPowerParityValuationofCountryGDP
0.1216 5 PrivateCreditbyDepositMoneyBankstoGDP
0.1108 24 GDPCurrentPricesUSDollars
0.1105 3 CentralBankAssetstoGDP
0.108 4 DepositMoneyBankAssetstoGDP
0.1031 27 InflationAnnualPercentChange
0.0959 23 GDPProductPerCapitaCurrentPrices
0.0778 17 CurrentAccountBalance
0.0671 18 CurrentAccountBalanceinPercentofGDP
0.056 10 LifeInsurancePenetration
0.0494 1 Development
0.0445 12 StockMarketCapitalizationToGDP
0.0409 11 Non-lifeInsurancePenetration
0.0318 13 StockMarketTotalValueTradedToGDP
0.0248 14 StockMarketTurnoverRatio
0.0135 15 PrivateBondMarketCapitalizationToGDP
0.0108 16 PublicBondMarketCapitalizationToGDP
c. OutflowsAsAPercentageOfGFCF
0.397 22 GDPperCapitaCurrentPrices
0.3873 19 GDPbasedonPurchasingPowerParityPerCapitaGDP
0.2846 6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP
0.2707 24 GDPCurrentPricesUSDollars
0.2533 5 PrivateCreditbyDepositMoneyBankstoGDP
0.253 4 DepositMoneyBankAssetstoGDP
0.2332 20 GDPbasedonPurchasingPowerParityShareofWorldTotal
0.2186 1 Development
0.2158 17 CurrentAccountBalance
0.214 26 Inflation
0.21 21 GDPbasedonPurchasingPowerParityValuationofCountryGDP
0.2028 2 DepositMoneyBankvsCentralBankAssets
0.1907 25 GDPDeflator
0.178 8 FinancialSystemDeposits
0.1766 7 BankDeposits
0.1759 23 GDPProductPerCapitaCurrentPrices
0.1579 27 InflationAnnualPercentChange
0.1446 10 LifeInsurancePenetration
0.1436 9 LiquidLiabilitiesToGDP
0.1383 28 PPPUSdollarExchangeRate
0.1343 3 CentralBankAssetstoGDP
0.1253 18 CurrentAccountBalanceinPercentofGDP
0.1119 12 StockMarketCapitalizationToGDP
0.1113 13 StockMarketTotalValueTradedToGDP
0.0746 11 Non-lifeInsurancePenetration
0.0564 14 StockMarketTurnoverRatio
0.0268 15 PrivateBondMarketCapitalizationToGDP
0.0175 16 PublicBondMarketCapitalizationToGDP
d. OutwardStockAsAPercentageOfGDP
0.5374 22 GDPperCapitaCurrentPrices
0.526 19 GDPbasedonPurchasingPowerParityPerCapitaGDP
0.3609 6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP
0.3481 4 DepositMoneyBankAssetstoGDP
0.3454 5 PrivateCreditbyDepositMoneyBankstoGDP
0.3269 24 GDPCurrentPricesUSDollars
0.3191 20 GDPbasedonPurchasingPowerParityShareofWorldTotal
0.2794 8 FinancialSystemDeposits
0.2744 2 DepositMoneyBankvsCentralBankAssets
0.2703 21 GDPbasedonPurchasingPowerParityValuationofCountryGDP
0.2666 1 Development
0.2567 7 BankDeposits
0.2557 23 GDPProductPerCapitaCurrentPrices
0.2477 26 Inflation
0.2423 17 CurrentAccountBalance
0.2336 28 PPPUSdollarExchangeRate
0.2104 9 LiquidLiabilitiesToGDP
0.2082 27 InflationAnnualPercentChange
0.1884 25 GDPDeflator
0.1803 10 LifeInsurancePenetration
0.1657 3 CentralBankAssetstoGDP
0.1509 18 CurrentAccountBalanceinPercentofGDP
0.1492 12 StockMarketCapitalizationToGDP
0.1275 13 StockMarketTotalValueTradedToGDP
0.0917 11 Non-lifeInsurancePenetration
0.0608 14 StockMarketTurnoverRatio
0.0285 15 PrivateBondMarketCapitalizationToGDP
0.0231 16 PublicBondMarketCapitalizationToGDP
The results of the two evaluation schemes show that the two method sets differ
most for the class OutflowsAsAPercentageOfGFCF and agree most for
InwardStockAsAPercentageOfGDP. To filter out the most irrelevant attributes, we
choose all the attributes appearing in the first set and the first fifteen attributes
listed in the second set. However, attributes from the first set whose values in the
second set are smaller than 0.1 are eliminated.
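The selection rule above leaves some room for interpretation. The sketch below encodes one possible reading: keep the attributes chosen by the subset evaluator plus the top-ranked attributes of the ranking, then drop any subset-selected attribute whose ranking score is below the threshold. The function name, the attribute names "A" to "D" and the scores are hypothetical.

```python
def combine_selections(cfs_selected, ranked_gains, top_k=15, min_gain=0.1):
    """Combine a subset selection with a ranked list.

    cfs_selected: attribute names chosen by the subset evaluator.
    ranked_gains: (attribute, score) pairs sorted by descending score.
    """
    gains = dict(ranked_gains)
    top = [name for name, _ in ranked_gains[:top_k]]
    kept = []
    for name in list(cfs_selected) + top:
        if name in kept:
            continue
        # A subset-selected attribute with a low ranking score is eliminated.
        if name in cfs_selected and gains.get(name, 0.0) < min_gain:
            continue
        kept.append(name)
    return kept


# Hypothetical scores; "D" is subset-selected but scores below 0.1.
ranked = [("A", 0.40), ("B", 0.25), ("C", 0.08), ("D", 0.02)]
selected = combine_selections(["A", "D"], ranked, top_k=3)
print(selected)  # ['A', 'B', 'C']
```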
4.5 Conclusion
According to the result of this attribute analysis, the most relevant
attributes for each of the Foreign Direct Investment attributes are:
e. InflowsAsAPercentageOfGFCF:
0.16407 1 Year
0.16026 27 Inflation
0.12796 21 GDPbasedonPurchasingPowerParityShareofWorldTotal
0.11649 26 GDPDeflator
0.11296 29 PPPUSdollarExchangeRate
0.10125 9 FinancialSystemDeposits
0.09923 23 GDPperCapitaCurrentPrices
0.09613 22 GDPbasedonPurchasingPowerParityValuationofCountryGDP
f. InwardStockAsAPercentageOfGDP
0.1797 20 GDPbasedonPurchasingPowerParityShareofWorldTotal
0.1774 7 BankDeposits
0.1765 8 FinancialSystemDeposits
0.1695 26 Inflation
0.1637 28 PPPUSdollarExchangeRate
0.1531 22 GDPperCapitaCurrentPrices
0.1492 25 GDPDeflator
0.1446 19 GDPbasedonPurchasingPowerParityPerCapitaGDP
g. OutflowsAsAPercentageOfGFCF
0.397 22 GDPperCapitaCurrentPrices
0.3873 19 GDPbasedonPurchasingPowerParityPerCapitaGDP
0.2846 6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP
0.2707 24 GDPCurrentPricesUSDollars
0.2533 5 PrivateCreditbyDepositMoneyBankstoGDP
0.253 4 DepositMoneyBankAssetstoGDP
0.2332 20 GDPbasedonPurchasingPowerParityShareofWorldTotal
0.2186 1 Development
h. OutwardStockAsAPercentageOfGDP
0.5374 22 GDPperCapitaCurrentPrices
0.526 19 GDPbasedonPurchasingPowerParityPerCapitaGDP
0.3609 6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP
0.3481 4 DepositMoneyBankAssetstoGDP
0.3454 5 PrivateCreditbyDepositMoneyBankstoGDP
0.3269 24 GDPCurrentPricesUSDollars
0.3191 20 GDPbasedonPurchasingPowerParityShareofWorldTotal
0.2794 8 FinancialSystemDeposits
The attributes that appear repeatedly across the four categories should play
more important roles in our data mining task. These attributes are:
22 GDPperCapitaCurrentPrices
19 GDPbasedonPurchasingPowerParityPerCapitaGDP
6 PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP
4 DepositMoneyBankAssetstoGDP
24 GDPCurrentPricesUSDollars
20 GDPbasedonPurchasingPowerParityShareofWorldTotal
8 FinancialSystemDeposits
7 BankDeposits
26 Inflation
5.0 Association
As the textbook states, the rule A => B holds in the transaction set D with
support s, where s is the percentage of transactions in D that contain A ∪ B (that
is, both A and B). This is taken to be the probability P(A ∪ B). The rule A => B has
confidence c in the transaction set D if c is the percentage of transactions in D
containing A that also contain B. This is taken to be the conditional probability
P(B|A).
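The two definitions can be checked on a tiny example. The sketch below computes support and confidence exactly as defined above, on a made-up transaction set with placeholder items "a", "b" and "c".

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of A and B together, divided by the support of A."""
    support_a = support(transactions, antecedent)
    if support_a == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(transactions, set(antecedent) | set(consequent)) / support_a


# Made-up transactions over placeholder items.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
s = support(transactions, {"a", "b"})       # 2 of 4 transactions -> 0.5
c = confidence(transactions, {"a"}, {"b"})  # 0.5 / 0.75 -> about 0.667
print(s, round(c, 3))
```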
To mine our association rules, we use the Apriori and Predictive Apriori
algorithms. Apriori is an influential algorithm for mining frequent itemsets for
Boolean association rules.
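The level-wise idea behind Apriori can be sketched briefly: frequent itemsets of size k are joined into candidates of size k+1, and a candidate is kept only if all of its subsets of size k are frequent and its own support reaches the minimum. The code below is a simplified illustration on made-up transactions. It is not Weka's implementation and it omits rule generation.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    singles = {frozenset([item]) for t in transactions for item in t}
    level = {s for s in singles if support(s) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {
            cand for cand in candidates
            if all(frozenset(sub) in frequent for sub in combinations(cand, k - 1))
            and support(cand) >= min_support
        }
    return frequent


transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
result = frequent_itemsets(transactions, min_support=0.5)
print(sorted("".join(sorted(s)) for s in result))  # ['a', 'ab', 'b', 'bc', 'c']
```

In this example the candidate {a, b, c} is pruned without counting, because its subset {a, c} is not frequent.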
5.1 Step 1 of association: Select Attributes
Before running our association analysis, we have to choose the attributes to
analyze. We tried two ways to select attributes:
1. Using the Select Attributes function in Weka
2. Using ratio or percentage attributes only
The first way to select attributes
First, select the attributes by the class “FDI inflows millions of dollars”.
Attribute Evaluator : weka.attributeSelection.CfsSubsetEval
Search Method : weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -S 1
Attribute Selection Mode: Use full training set.
Class: FDIInflowsMillionsOfDollars
Selected attributes: 15,19,20,22,23,26,28,29,31,32,34,35,37,38 : 14
StockMarketTotalValueTradedToGDP
CurrentAccountBalance
CurrentAccountBalanceinPercentofGDP
GDPbasedonPurchasingPowerParityShareofWorldTotal
GDPbasedonPurchasingPowerParityValuationofCountryGDP
GDPCurrentPricesUSDollars
Inflation
InflationAnnualPercentChange
FDIInwardStockMillionsOfDollars
FDIOutwardStockMillionsOfDollars
FDIOutflowsMillionsOfDollars
InflowsAsAPercentageOfGFCF
OutflowsAsAPercentageOfGFCF
OutwardStockAsAPercentageOfGDP
By the definition of attribute selection, these 14 attributes have a relatively
high correlation with our target class “FDIInflowsMillionsOfDollars”.
Because we want to find association relations for Foreign Direct Investment that
include both inflows and outflows, we also have to select the attributes that are
related to outflows.
Second, select the attributes by the class “FDI outflows millions of dollars”.
Attribute Evaluator : weka.attributeSelection.CfsSubsetEval
Search Method : weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -S 1
Attribute Selection Mode: Use full training set.
Class: FDIOutflowsMillionsOfDollars
Selected attributes: 23,25,31,32,33,37 : 6
GDPbasedonPurchasingPowerParityValuationofCountryGDP
GDPProductPerCapitaCurrentPrices
FDIInwardStockMillionsOfDollars
FDIOutwardStockMillionsOfDollars
FDIInflowsMillionsOfDollars
OutflowsAsAPercentageOfGFCF
As described before, the attributes we want to use should be related to both inflows
and outflows, so we take the union of the two attribute selection sets, which together
with CountryName, Year and Development gives our final 19 attributes.
This is the first set of attributes that we use for our association analysis:
CountryName, Year, Development, StockMarketTotalValueTradedToGDP, CurrentAccountBalance,
CurrentAccountBalanceinPercentofGDP, GDPbasedonPurchasingPowerParityShareofWorldTotal,
GDPbasedonPurchasingPowerParityValuationofCountryGDP, GDPProductPerCapitaCurrentPrices,
GDPCurrentPricesUSDollars, Inflation, InflationAnnualPercentChange, FDIInwardStockMillionsOfDollars,
FDIOutwardStockMillionsOfDollars, FDIInflowsMillionsOfDollars,
FDIOutflowsMillionsOfDollars, InflowsAsAPercentageOfGFCF,
OutflowsAsAPercentageOfGFCF, OutwardStockAsAPercentageOfGDP
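The count of 19 can be reproduced from the two selection results. The sketch below takes the attribute names from the two Weka listings above and assumes, as the final list suggests, that CountryName, Year and Development were added on top of the union. The helper `ordered_union` is a hypothetical name.

```python
inflow_selected = [
    "StockMarketTotalValueTradedToGDP", "CurrentAccountBalance",
    "CurrentAccountBalanceinPercentofGDP",
    "GDPbasedonPurchasingPowerParityShareofWorldTotal",
    "GDPbasedonPurchasingPowerParityValuationofCountryGDP",
    "GDPCurrentPricesUSDollars", "Inflation", "InflationAnnualPercentChange",
    "FDIInwardStockMillionsOfDollars", "FDIOutwardStockMillionsOfDollars",
    "FDIOutflowsMillionsOfDollars", "InflowsAsAPercentageOfGFCF",
    "OutflowsAsAPercentageOfGFCF", "OutwardStockAsAPercentageOfGDP",
]
outflow_selected = [
    "GDPbasedonPurchasingPowerParityValuationofCountryGDP",
    "GDPProductPerCapitaCurrentPrices", "FDIInwardStockMillionsOfDollars",
    "FDIOutwardStockMillionsOfDollars", "FDIInflowsMillionsOfDollars",
    "OutflowsAsAPercentageOfGFCF",
]
# Assumption: these three attributes are always kept alongside the union.
identifiers = ["CountryName", "Year", "Development"]


def ordered_union(*groups):
    """Union of several name lists that keeps the first-seen order."""
    seen, result = set(), []
    for group in groups:
        for name in group:
            if name not in seen:
                seen.add(name)
                result.append(name)
    return result


union = ordered_union(inflow_selected, outflow_selected)
final_attributes = ordered_union(identifiers, union)
print(len(union), len(final_attributes))  # 16 19
```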
The second way to select attributes
Besides the previous way of selecting attributes, we try another set of attributes to
see whether we get a different result. By directly reviewing the characteristics of
all the attributes, we can pick the ratio-only attributes. The reason we select only
ratio attributes is that the differences in absolute values between countries are too
large, which makes association more complicated. For example, the difference in GDP
between the USA and Korea might be very large. However, ratio attributes such as the
GDP to inflation ratio or the direct investment to bank assets ratio may be much
closer between these two countries. In mixed attribute sets, the attributes with
large differences could disturb our patterns and blur the similarity between the two
countries.
After removing the non-ratio and non-percentage attributes, we have 23 attributes
left:
CountryName, Year, Development, DepositMoneyBankvsCentralBankAssets, CentralBankAssetstoGDP,
DepositMoneyBankAssetstoGDP, PrivateCreditbyDepositMoneyBankstoGDP,
PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP, LiquidLiabilitiesToGDP, LifeInsurancePenetration,
Non-lifeInsurancePenetration, StockMarketCapitalizationToGDP, StockMarketTotalValueTradedToGDP,
StockMarketTurnoverRatio, PrivateBondMarketCapitalizationToGDP, PublicBondMarketCapitalizationToGDP,
CurrentAccountBalanceinPercentofGDP, InflationAnnualPercentChange, PPPUSdollarExchangeRate,
InflowsAsAPercentageOfGFCF, InwardStockAsAPercentageOfGDP, OutflowsAsAPercentageOfGFCF,
OutwardStockAsAPercentageOfGDP
5.2 Step 2 of association: Discretize
After selecting the attributes, there is one more step before we can run the
association analysis: we have to discretize the data. As described before, the Apriori
algorithm we use cannot handle purely numerical data. After discretization, each value
falls into one of several distinct bins, such as “between 0 and 1” or “between 1 and 3”,
rather than keeping its original numerical value.
In Weka, after selecting the attributes in the previous step, we apply the Discretize
filter to the dataset. One important option must be set: “use equal frequency” under
the Discretize filter should be set to True. Without it, the discretization is
meaningless for us, since most of the data will fall into a few specific bins, and with
nearly all the data in the same bins we cannot get appropriate results.
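The difference between the two binning strategies can be illustrated with a small Python sketch. It is not Weka's implementation: the function names and the sample values are hypothetical, and the equal-frequency version simply splits by rank, so tied values may land in different bins, which a cut-point based filter would avoid.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Hypothetical skewed attribute: many small values and one very large one.
sample = [1, 2, 2, 3, 3, 4, 5, 6, 1000]
width_bins = equal_width_bins(sample, 3)
freq_bins = equal_frequency_bins(sample, 3)
print(width_bins)  # all but the largest value share bin 0
print(freq_bins)   # three values per bin
```

With equal-width bins the single large value stretches the range, so nearly every instance ends up in the first bin. With equal-frequency bins each bin receives the same number of instances, which is the behavior the option above is meant to provide.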
5.3 Step 3 of association: Association
1. Associated by the first 19 attributes with Apriori:
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0
Instances: 2520
Associator model (full training set)
Apriori
=======
Minimum support: 0.15
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
Generated sets of large itemsets:
Size of set of large itemsets L(1): 9
Size of set of large itemsets L(2): 11
Size of set of large itemsets L(3): 7
Size of set of large itemsets L(4): 1
Best rules found:
1. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 480 ==>
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 480 conf:(1)
2. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]'
OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 450 ==> OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]'
450 conf:(1)
3. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 458 ==>
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 457 conf:(1)
4. Development=Developing FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' 396 ==>
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 391 conf:(0.99)
5. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' 544 ==>
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 537 conf:(0.99)
6. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]'
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 457 ==> OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]'
450 conf:(0.98)
7. FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 479 ==>
OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 471 conf:(0.98)
8. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 458 ==>
OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 450
conf:(0.98)
9. FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 458 ==>
OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 450 conf:(0.98)
10. FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 621 ==> OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]'
603 conf:(0.97)
What we could get from this result:
The ten best rules all describe strong associations among the Foreign Direct
Investment outflow indicators: FDI outward stock, FDI outflows, outflows as a
percentage of GFCF, and outward stock as a percentage of GDP. All confidences are very
high, at 0.97 or above. The minimum support of 0.15 (378 of the 2520 instances) is also
a reasonable choice for this analysis.
The rules are probably all related to Foreign Direct Investment outflows because we
selected these 19 attributes by their correlation with FDI outflows and FDI inflows.
We do get some patterns from this association analysis, but we want to find more than
that. That is why we repeat the association analysis with a different algorithm and
with different attributes.
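The confidence values that Weka prints can be checked by hand from the two counts shown in each rule. The following Python sketch does this for rules 5 and 10 of the Apriori output above; the function name is hypothetical, and the counts and the total of 2520 instances are copied from that output.

```python
def rule_metrics(antecedent_count, rule_count, total_instances):
    """Return (support, confidence) for a rule A ==> B.

    antecedent_count: instances matching A
    rule_count: instances matching both A and B
    """
    support = rule_count / total_instances
    confidence = rule_count / antecedent_count
    return support, confidence


TOTAL = 2520  # "Instances: 2520" in the Weka run

# Rule 5: 544 instances match the antecedent, 537 also match the consequent.
support_5, confidence_5 = rule_metrics(544, 537, TOTAL)
# Rule 10: 621 instances match the antecedent, 603 also match the consequent.
support_10, confidence_10 = rule_metrics(621, 603, TOTAL)

print(round(support_5, 3), round(confidence_5, 2))
print(round(support_10, 3), round(confidence_10, 2))
```

Both rules cover more than 15% of the instances, and the rounded confidences agree with the conf:(0.99) and conf:(0.97) values reported by Weka.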
2. Associated by the first 19 attributes with Predictive Apriori:
Scheme: weka.associations.PredictiveApriori -N 100
Instances: 2520
Associator model (full training set)
PredictiveApriori
===================
Best rules found:
1. GDPProductPerCapitaCurrentPrices='(-inf-0.000001]' FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 228 ==>
OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 228 acc:(0.99391)
2. GDPProductPerCapitaCurrentPrices='(-inf-0.000001]' FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]'
OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 198 ==> OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]'
198 acc:(0.99347)
3. OutflowsAsAPercentageOfGFCF='(-inf-0.427376]' 191 ==> FDIOutflowsMillionsOfDollars='(-inf-0.030614]' 191
acc:(0.99334)
4. GDPProductPerCapitaCurrentPrices='(-inf-0.000001]' FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]'
FDIOutflowsMillionsOfDollars='(0.030614-0.030615]' 190 ==> OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]'
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 190 acc:(0.99332)
5. FDIInwardStockMillionsOfDollars='(-inf-0.000618]' FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' 184 ==>
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 184 acc:(0.99319)
6. Development=Developing GDPProductPerCapitaCurrentPrices='(-inf-0.000001]'
FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' 164 ==>
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 164 acc:(0.9927)
7. GDPbasedonPurchasingPowerParityValuationofCountryGDP='(-inf-0.000206]'
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 163 ==>
FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' 163 acc:(0.99267)
8. Development=Developed GDPbasedonPurchasingPowerParityValuationofCountryGDP='(0.05223-inf)' 162 ==>
GDPCurrentPricesUSDollars='(0.032872-inf)' 162 acc:(0.99264)
9. GDPbasedonPurchasingPowerParityValuationofCountryGDP='(-inf-0.000206]'
FDIOutwardStockMillionsOfDollars='(0.00007-0.000072]' OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 155 ==>
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 155 acc:(0.99242)
10. FDIInflowsMillionsOfDollars='(-inf-0.014284]' 243 ==> InflowsAsAPercentageOfGFCF='(-inf-0.538132]' 242
acc:(0.99236)
What we could get from this result:
Briefly speaking, these association rules reveal some more useful correlations
between attributes.
For example, the first rule tells us that if GDP per capita falls in a specific range
and FDI outflows also fall in a specific range, we can predict the range of outflows
as a percentage of gross fixed capital formation.
For us, outward stock as a percentage of GDP is the index that interests us most.
According to the rules we get, we can predict this index from GDP per capita, the
foreign direct investment outward stock amount, GDP based on purchasing power
parity and so on.
Another number that interests us is outflows as a percentage of gross fixed capital
formation.
To predict this index, we should know GDP per capita, FDI outflows, FDI outward
stock and so on.
3. Associated by the 23 ratio and percentage attributes with Apriori:
Note: At first, we could not get any rules with the default settings. Therefore,
we lowered the minimum confidence from 0.9 to 0.5 in order to generate some rules.
With this setting, we get a few results, as we expected.
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0
Instances: 2520
Associator model (full training set)
Apriori
=======
Minimum support: 0.1
Minimum metric <confidence>: 0.5
Number of cycles performed: 18
Generated sets of large itemsets:
Size of set of large itemsets L(1): 79
Size of set of large itemsets L(2): 5
Size of set of large itemsets L(3): 1
Best rules found:
1. OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 612 ==> OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]'
531 conf:(0.87)
2. Development=Developing OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 464 ==>
OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 393 conf:(0.85)
3. OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 746 ==> Development=Developing 571 conf:(0.77)
4. OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 612 ==> Development=Developing 464 conf:(0.76)
5. OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 531 ==>
Development=Developing 393 conf:(0.74)
6. OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 746 ==> OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]'
531 conf:(0.71)
7. Development=Developing OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 571 ==>
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 393 conf:(0.69)
8. OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 612 ==> Development=Developing
OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 393 conf:(0.64)
9. OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 746 ==> Development=Developing
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 393 conf:(0.53)
10. Development=Developed 576 ==> PPPUSdollarExchangeRate='(0.000001-0.000002]' 303 conf:(0.53)
What we could get from this result:
As we can see, the confidences in this association test are lower than in the previous
tests. However, we do get some more useful information here.
For developing countries, outward stock as a percentage of GDP and outflows as a
percentage of GFCF are highly associated. We can also see that if a country is
developed, its PPP US dollar exchange rate falls into one specific range in about half
of the instances (confidence 0.53).
4. Associated by the ratio and percentage 23 attributes with Predictive Apriori:
Scheme: weka.associations.PredictiveApriori -N 100
Relation: finstructure-weka.filters.unsupervised.attribute.Normalize-weka.filters.unsupervised.attribute.Remove-R9-10,19,31-34-weka.filters.unsupervised.attribute.Remove-R18-25-weka.filters.unsupervised.attribute.Discretize-F-B10-M-1.0-Rfirst-last
Instances: 2520
Associator model (full training set)
PredictiveApriori
=======
Best rules found:
1. PPPUSdollarExchangeRate='(0.000313-inf)' 188 ==> Development=Developing 188 acc:(0.99325)
2. DepositMoneyBankAssetstoGDP='(0.549816-inf)'
PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP='(0.525423-inf)'
LiquidLiabilitiesToGDP='(0.413579-inf)' 105 ==> PrivateCreditbyDepositMoneyBankstoGDP='(0.489811-inf)' 105
acc:(0.9898)
3. LiquidLiabilitiesToGDP='(-inf-0.077226]' OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 97 ==>
Development=Developing 97 acc:(0.9891)
4. DepositMoneyBankvsCentralBankAssets='(0.713174-inf)' DepositMoneyBankAssetstoGDP='(0.549816-inf)'
PrivateCreditbyDepositMoneyBankstoGDP='(0.489811-inf)' 97 ==> Development=Developed 97 acc:(0.9891)
5. DepositMoneyBankvsCentralBankAssets='(-inf-0.328866]' DepositMoneyBankAssetstoGDP='(-inf-0.053588]'
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 94 ==> Development=Developing 94 acc:(0.98881)
29. DepositMoneyBankvsCentralBankAssets='(-inf-0.328866]' InwardStockAsAPercentageOfGDP='(-inf-0.55239]' 56 ==>
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 56 acc:(0.9823)
33. StockMarketCapitalizationToGDP='(0.288662-inf)' OutflowsAsAPercentageOfGFCF='(0.461885-inf)' 55 ==>
OutwardStockAsAPercentageOfGDP='(0.240554-inf)' 55 acc:(0.982)
52. Year=1981 OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 46 ==>
OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 46 acc:(0.97879)
56. PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP='(0.15771-0.207416]'
OutwardStockAsAPercentageOfGDP='(0.065911-0.06675]' 78 ==> OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 77
acc:(0.97557)
75. CountryName=BurkinaFaso 24 ==> Development=Developing InwardStockAsAPercentageOfGDP='(-inf-0.55239]' 24
acc:(0.96134)
89. CountryName=Dominica 24 ==> Development=Developing OutflowsAsAPercentageOfGFCF='(0.427376-0.427634]' 24
acc:(0.96134)
92. CountryName=Egypt 24 ==> Development=Developing InwardStockAsAPercentageOfGDP='(-inf-0.55239]' 24
acc:(0.96134)
What we could get from this result:
Weka returned 100 rules in the original result; we show only a few of them here.
From the first rule, we can see the relation between the PPP US dollar exchange rate
and developing countries. If we know a country's PPP exchange rate, we may be able to
guess whether it is a developing country.
Meanwhile, outward stock as a percentage of GDP is the index that interests us most.
We can predict this index from inward stock as a percentage of GDP, whether the
country is a developing country, stock market capitalization to GDP and so on.
Another number that interests us is outflows as a percentage of gross fixed capital
formation.
To predict this index, we should know private credit by deposit money banks and other
financial institutions, outward stock as a percentage of GDP, country name and other
attributes that appear in the rules.
5.4 Conclusion:
By definition, rules that satisfy both a minimum support threshold and a minimum
confidence threshold are called strong. In our case, we get several strong rules from
the association analysis. However, not all of them are useful to us, so we have to
choose which rules we need and which ones we should apply. For example, many of the
rules we got have nothing to do with FDI inflows and outflows or with the inward and
outward stock ratios.
By selecting the useful rules, we can use them for prediction as we did above.
Of course, even if some attributes and rules are not of interest now, we might still
need them later. When we want to predict other attributes, we just need to repeat the
same process. We can analyze many relations between attributes and also try to predict
data from a few key values.
6.0 Classification and Prediction:
According to our project goal, our task is to predict certain countries' direct
investment in the United States by analyzing the correlation between their financial
structure and their direct investment in the United States. Countries whose financial
structures differ in size may have different international investment behavior and
strategy, and other factors, such as vicinity or development status, may affect
investment in the United States as well. So we can construct a classification model by
decision tree induction to classify the data of an unseen country, and then make a
prediction according to the class that the unseen country matches. For countries that
have already been analyzed in our mining process, we can simply employ the statistical
technique of regression to predict continuous values.
Select Attributes
Originally we have 38 attributes in our dataset. After using all of them for the
mining task, we found that, because the values of some attributes vary enormously, the
resulting prediction is not very satisfactory, even though we normalized the attributes
to remove differences in scale. So we decided to filter out the attributes whose values
are absolute amounts and vary dramatically from country to country. We keep only the
attributes that are percentages or ratios.
Secondly, we remove the country name, year, and development attributes, because our
goal is to build a model that predicts a nation’s inward or outward investment from
the other attributes’ values, excluding country, year and development status. This
leaves 20 attributes for the data mining task.
6.1 Prediction by Linear Regression
Using Linear Regression to make a prediction on OutwardStockAsAPercentageOfGDP
and InwardStockAsAPercentageOfGDP
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: finstructure-weka.filters.unsupervised.attribute.Normalize-weka.filters.unsupervised.attribute.Remove-R9-10,19,31-34-weka.filters.unsupervised.attribute.Remove-R18-25-weka.filters.unsupervised.attribute.Remove-R1-3-weka.filters.unsupervised.instance.RemoveMisclassified-Wweka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8-C-1-F0-T0.05-I0
Instances: 1915
Attributes: 20
DepositMoneyBankvsCentralBankAssets
CentralBankAssetstoGDP
DepositMoneyBankAssetstoGDP
PrivateCreditbyDepositMoneyBankstoGDP
PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP
LiquidLiabilitiesToGDP
LifeInsurancePenetration
Non-lifeInsurancePenetration
StockMarketCapitalizationToGDP
StockMarketTotalValueTradedToGDP
StockMarketTurnoverRatio
PrivateBondMarketCapitalizationToGDP
PublicBondMarketCapitalizationToGDP
CurrentAccountBalanceinPercentofGDP
InflationAnnualPercentChange
PPPUSdollarExchangeRate
InflowsAsAPercentageOfGFCF
InwardStockAsAPercentageOfGDP
OutflowsAsAPercentageOfGFCF
OutwardStockAsAPercentageOfGDP
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Linear Regression Model (Outward Stock as a percentage of GDP)
OutwardStockAsAPercentageOfGDP =
-0.0229 * DepositMoneyBankvsCentralBankAssets +
-0.0361 * CentralBankAssetstoGDP +
0.1014 * DepositMoneyBankAssetstoGDP +
0.0207 * PrivateCreditbyDepositMoneyBankstoGDP +
-0.0107 * PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP +
-0.07 * LiquidLiabilitiesToGDP +
0.0544 * LifeInsurancePenetration +
0.0899 * StockMarketCapitalizationToGDP +
0.2151 * StockMarketTotalValueTradedToGDP +
-0.1165 * StockMarketTurnoverRatio +
0.0169 * PublicBondMarketCapitalizationToGDP +
0.165 * CurrentAccountBalanceinPercentofGDP +
0.1594 * InwardStockAsAPercentageOfGDP +
1.8957 * OutflowsAsAPercentageOfGFCF +
-0.9621
Time taken to build model: 0.2 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.9398
Mean absolute error 0.012
Root mean squared error 0.0158
Relative absolute error 45.4532 %
Root relative squared error 34.1512 %
Total Number of Instances 1915
Linear Regression Model ( Inward stock as a percentage of GDP)
InwardStockAsAPercentageOfGDP =
0.0634 * DepositMoneyBankvsCentralBankAssets +
0.0588 * CentralBankAssetstoGDP +
-0.1443 * DepositMoneyBankAssetstoGDP +
0.0851 * PrivateCreditbyDepositMoneyBankstoGDP +
-0.0384 * PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP +
0.0846 * LiquidLiabilitiesToGDP +
-0.03 * LifeInsurancePenetration +
0.0716 * Non-lifeInsurancePenetration +
-0.1021 * StockMarketTotalValueTradedToGDP +
0.0355 * StockMarketTurnoverRatio +
-0.0273 * PrivateBondMarketCapitalizationToGDP +
-0.1699 * CurrentAccountBalanceinPercentofGDP +
15.8396 * InflowsAsAPercentageOfGFCF +
-0.7026 * OutflowsAsAPercentageOfGFCF +
0.4928 * OutwardStockAsAPercentageOfGDP +
-7.5842
Time taken to build model: 0.19 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.6141
Mean absolute error 0.0192
Root mean squared error 0.0289
Relative absolute error 75.1824 %
Root relative squared error 79.1605 %
Total Number of Instances 1915
Analysis
Since our task is to predict a continuous value rather than a categorical label, we
choose the statistical technique of regression to tackle this problem.
The prediction result for outward stock as a percentage of GDP is acceptable
(correlation coefficient 0.9398), while that for inward stock as a percentage of GDP is
not (correlation coefficient 0.6141, with both relative errors above 75%). The problem
may be that the linear regression method is not suitable for the model we want to
build, or that the attributes in the dataset are less correlated with inward stock.
Since the model is not good enough to predict inward stock as a percentage of GDP, we
focus on the model for predicting outward stock as a percentage of GDP.
At the beginning, the regression equation we got involved only 12 attributes and its
correlation coefficient was 0.8074. Its relative absolute error and root relative
squared error were too high, which we think was probably caused by outliers. Therefore
we used the “RemoveMisclassified” filter to remove outliers, which reduced the
instances from 2520 to 1915. We then built a better model whose correlation
coefficient is 0.9398, with a relative absolute error and a root relative squared
error both below 50% on the filtered data.
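The error measures in the Weka summaries can be made concrete with a short Python sketch. It uses the usual textbook definitions, in which the two relative errors compare the model against a baseline that always predicts the mean of the actual values; Weka's exact baseline may differ slightly, and the function name and sample numbers below are hypothetical rather than taken from the project data.

```python
import math


def regression_metrics(actual, predicted):
    """Return correlation, MAE, RMSE, RAE (%) and RRSE (%) for paired lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline: always predict the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100 * abs_err / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_err / base_sq),
    }


# Hypothetical values, not taken from the project data.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Read this way, a relative absolute error of 45.4532 % means that the model's total absolute error is a little under half of what the mean-only baseline would produce, while a value near 100 % would mean the model is hardly better than that baseline.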
Histogram shows the distribution of each attribute in the dataset.
X: PredictedOutwardStockAsAPercentageOfGDP
Y: OutwardStockAsAPercentageOfGDP
The plot diagram shows that the prediction is relatively proportional to the real value.
6.2 Prediction by Decision Tree
Using Decision Tree to make a prediction on OutwardStockAsAPercentageOfGDP
Result:
=== Summary ===
Correctly Classified Instances 1907 97.4949 %
Incorrectly Classified Instances 49 2.5051 %
Kappa statistic 0.9093
Mean absolute error 0.005
Root mean squared error 0.0602
Relative absolute error 8.8114 %
Root relative squared error 36.062 %
Total Number of Instances 1956
Analysis
We then changed our approach to predicting the target value. By using the “Discretize”
filter to convert numeric values to nominal data, we can do classification with a
decision tree. We are able to build a decision tree and predict which interval of
OutwardStockAsAPercentageOfGDP a query will fall in.
At the beginning, the resulting model had an accuracy of around 70%, which is
acceptable but not very good. Since we think outliers are always an important issue,
we eliminated them, cutting the instances down from 2520 to 1956, and got a much
better result with an accuracy of 97%.
Taking a close look at the decision tree, we found that the top level of the tree is
the attribute “PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP”.
Therefore we can infer that this attribute discriminates best among the intervals of
the target value we want to predict.
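How a tree learner decides which attribute goes at the root can be sketched in Python with information gain. This is an illustration only: the report does not state which splitting criterion the Weka learner used (it may be gain ratio or another measure), and the attribute names and rows below are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction in `target` obtained by splitting rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical discretized rows; names and values are illustrative only.
rows = [
    {"credit": "low", "turnover": "low", "outward": "low"},
    {"credit": "low", "turnover": "high", "outward": "low"},
    {"credit": "high", "turnover": "low", "outward": "high"},
    {"credit": "high", "turnover": "high", "outward": "high"},
]
gains = {a: information_gain(rows, a, "outward") for a in ("credit", "turnover")}
root = max(gains, key=gains.get)
print(gains, root)
```

In this toy data the "credit" attribute separates the target intervals perfectly and "turnover" not at all, so "credit" would be placed at the root, which mirrors the reasoning applied to the real tree above.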
Histogram shows the distribution of each attribute with coloring by class grouping.
X: OutwardStockAsAPercentageOfGDP
Y: PredictedOutwardStockAsAPercentageOfGDP
6.3 Prediction by K-Nearest Neighbor
Using Nearest Neighbor to make a prediction on OutwardStockAsAPercentageOfGDP
Result:
K=1
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.9379
Mean absolute error 0.0332
Root mean squared error 0.0501
Relative absolute error 37.9836 %
Root relative squared error 37.582 %
Total Number of Instances 1001
Analysis
Then we tried an instance-based lazy learning method for the prediction. After removing
outliers with the “RemoveMisclassified” filter, the number of instances is cut down to
1001. The result above is for K=1. We also tried K=2 and K=3, but the results are not as
good as with K=1. Generally speaking, the accuracy is comparable to that of linear
regression: the correlation coefficient is 0.9379 versus 0.9398 and the relative
absolute error is lower (37.9836 % versus 45.4532 %), although the two runs use
different filtered subsets (1001 and 1915 instances). This makes sense because
investment indicators may follow certain patterns of the economic indicators, so a query
can get a reasonable estimate from the real-valued labels of the k nearest neighbors of
the unknown sample. If the query in the mining task comes from a developed country, its
indicators may fall in the range where other developed countries are located.
One problem with the k-nearest neighbor classifier is that it assigns equal weight to
each attribute. This may cause confusion when there are many irrelevant attributes in
the data.
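The effect of equal attribute weights can be seen in a small Python sketch of nearest-neighbor prediction. It is not the Weka classifier used in the experiment: the function, the optional per-attribute weights and the three training rows are all hypothetical, with the second attribute deliberately made irrelevant to the target.

```python
import math


def knn_predict(train, query, k=1, weights=None):
    """Predict a numeric target as the mean target of the k nearest rows.

    train: list of (feature_list, target) pairs
    weights: optional per-attribute weights; None means equal weights
    """
    if weights is None:
        weights = [1.0] * len(query)

    def distance(features):
        return math.sqrt(sum(w * (f - q) ** 2
                             for w, f, q in zip(weights, features, query)))

    nearest = sorted(train, key=lambda row: distance(row[0]))[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical normalized data: the first attribute tracks the target,
# the second attribute is unrelated noise.
train = [([0.10, 0.90], 0.1), ([0.20, 0.10], 0.2), ([0.80, 0.85], 0.8)]
query = [0.75, 0.10]

equal = knn_predict(train, query, k=1)
weighted = knn_predict(train, query, k=1, weights=[1.0, 0.0])
print(equal, weighted)
```

With equal weights the noisy second attribute pulls the query toward the wrong neighbor, while giving that attribute a weight of zero recovers the neighbor whose relevant attribute is closest.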
6.4 Conclusion:
• When we employed linear regression, we assumed that our prediction fits a linear
model. According to the resulting equation, the target value we want to predict is
related to only a subset of the attributes, so we can predict outward stock as a
percentage of GDP from these attributes. But since we have 20 attributes in total, it
is possible that our task would be better served by a nonlinear regression model.
Nevertheless, we chose to simplify the task by using linear regression.
• By using a symbolic learning method, the decision tree, we obtained a satisfactory
prediction. From the tree-building algorithm, we found that the attribute
“PrivateCreditbyDepositMoneyBanksandOtherFinancialInstitutionstoGDP” does the
best job of splitting the training data at the root of the tree. Since we discretized
each attribute into 10 bins, the prediction is only a range rather than an exact
number. The high accuracy may come from the fact that most of the instances fall in
certain ranges. Therefore, how to discretize the data into an appropriate number of
bins, so that the distribution suits decision tree building, is a primary factor.
• One problem with prediction by a case-based method is that the result depends
heavily on the original data. If the unknown data to be predicted does not resemble
any of the initial data, the technique will not produce a good prediction. Besides,
assigning equal weight to each attribute may also be a problem, since the linear
regression result shows that only a subset of the attributes is related to the
prediction. Therefore, choosing a proper weighting scheme for the attributes is a
difficult yet important issue when using this method.
7.0 Cluster Analysis
Here we use the clustering algorithms provided by Weka to group the data into clusters
whose objects are highly similar to each other compared with the objects in other
clusters. We try to analyze the relationship between development status and countries’
financial structure by applying five clustering methods. The attributes used are those
chosen in the attribute analysis.
7.1 Using EM
Grouping: The instances are automatically grouped into 8 clusters. The details are
listed as follows:
Number of clusters selected by cross validation: 8
Cluster: 0 Prior probability: 0.2877
Cluster: 1 Prior probability: 0.1483
Cluster: 2 Prior probability: 0.0376
Cluster: 3 Prior probability: 0.2409
Cluster: 4 Prior probability: 0.0292
Cluster: 5 Prior probability: 0.1564
Cluster: 6 Prior probability: 0.0449
Cluster: 7 Prior probability: 0.055
7.1.1 Correlations
We use the visualization of this clustering result to understand the correlations
between financial structure attributes and foreign direct investment attributes. We
find that some attributes are relevant to foreign direct investment and some are not.
1. Irrelevant attributes: All the instances lie together, and the clusters cannot
be distinguished in the visualization.
Year
GDPProductPerCapitaCurrentPrices
GDPbasedonPurchasingPowerParityValuationofCountryGDP
GDPperCapitaCurrentPrices
GDPCurrentPricesUSDollars
GDPDeflator
Inflation
PPPUSdollarExchangeRate
GDPbasedonPurchasingPowerParityShareofWorldTotal
2. Relevant attributes: The instances can be generally separated into two
clusters in the visualization.
DepositMoneyBankvsCentralBankAssets
DepositMoneyBankAssetstoGDP
PrivateCreditbyDepositMoneyBankstoGDP
BankDeposits
FinancialSystemDeposits
LiquidLiabilitiesToGDP
GDPbasedonPurchasingPowerParityPerCapitaGDP
For example, the following visualization shows no correlation. All
instances are located around the same place, and the clusters cannot be told
apart.
PPPUSdollarExchangeRate vs. InflowsAsAPercentageOfGFCF
The following visualization represents positive correlation.
GDPperCapitaCurrentPrices vs InflowsAsAPercentageOfGFCF
Here the picture shows the interesting correlation between Development and
InflowsAsAPercentageOfGFCF. The left cluster consists mostly of developed
countries, the middle cluster has both developed and developing countries, and the
right cluster consists mostly of other countries.
Development vs. InflowsAsAPercentageOfGFCF
7.2 Using SimpleKMeans
We set the number of clusters to two. The first cluster consists mostly of
developed countries, and the second cluster consists mostly of developing and other
countries. The percentage of incorrectly clustered instances is 24.9206 %, so about
three quarters of the instances fall into the cluster that matches their development
class. This clustering result suggests that financial structure and direct investment
are related to countries’ development status.
Clustered Instances
0 814 ( 32%)
1 1706 ( 68%)
Class attribute: Development
Classes to Clusters:
0 1 <-- assigned to cluster
499 77 | Developed
287 1393 | Developing
28 236 | Other
Cluster 0 <-- Developed
Cluster 1 <-- Developing
Incorrectly clustered instances : 628.0 24.9206 %
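The "incorrectly clustered" figure can be reproduced from the classes-to-clusters table. The Python sketch below assumes that each cluster is mapped to a distinct class in the way that maximizes the number of matching instances; the function name is hypothetical, the counts are copied from the SimpleKMeans table above, and the brute-force search is only practical for small tables like this one.

```python
from itertools import permutations


def classes_to_clusters_error(table):
    """Return (best_mapping, incorrect_count) for a classes-to-clusters table.

    table[class_name] is a list of instance counts, one per cluster.
    Each cluster is mapped to a distinct class so that the number of
    correctly clustered instances is as large as possible.
    """
    classes = list(table)
    n_clusters = len(next(iter(table.values())))
    total = sum(sum(counts) for counts in table.values())
    best_mapping, best_correct = None, -1
    for assignment in permutations(classes, n_clusters):
        correct = sum(table[cls][cluster]
                      for cluster, cls in enumerate(assignment))
        if correct > best_correct:
            best_mapping, best_correct = assignment, correct
    return best_mapping, total - best_correct


kmeans_table = {
    "Developed": [499, 77],
    "Developing": [287, 1393],
    "Other": [28, 236],
}
mapping, incorrect = classes_to_clusters_error(kmeans_table)
print(mapping, incorrect, round(100 * incorrect / 2520, 4))
```

Cluster 0 maps to Developed and cluster 1 to Developing, leaving 628 of the 2520 instances outside their mapped cluster, which matches the 24.9206 % reported by Weka. Because there are three classes and only two clusters, every "Other" instance counts as incorrectly clustered.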
7.3 Using Cobweb
In this clustering task, the instances are separated into equally sized clusters of
exactly 105 instances each. The percentage of incorrectly clustered instances is
95.8333 %, so this result gives no help to our analysis.
7.4 Using FarthestFirst
The result here shows two clusters. One cluster contains almost all the instances
(97%), and the other only 3%. The result is not useful.
Clustered Instances
0 2440 ( 97%)
1 80 ( 3%)
Class attribute: Development
Classes to Clusters:
0 1 <-- assigned to cluster
499 77 | Developed
1677 3 | Developing
264 0 | Other
Cluster 0 <-- Developing
Cluster 1 <-- Developed
Incorrectly clustered instances : 766.0 30.3968 %
7.5 Using MakeDensityBasedClusterer
Two clusters are formed. The first cluster contains about 94% of the
developed-country instances, about 18% of the developing-country instances, and about
9% of the other instances. The second cluster consists mainly of developing and other
countries. This result corresponds to what we had when using SimpleKMeans.
Clustered Instances
0 871 ( 35%)
1 1649 ( 65%)
Log likelihood: 21.99495
Class attribute: Development
Classes to Clusters:
0 1 <-- assigned to cluster
539 37 | Developed
307 1373 | Developing
25 239 | Other
Cluster 0 <-- Developed
Cluster 1 <-- Developing
Incorrectly clustered instances : 608.0 24.127 %
7.6 Other experiment
In this section we try another approach to grouping the instances. We would
like to use the attributes related to countries’ foreign direct investment as the class
to map to the clusters. We first discretize the four attributes
InflowsAsAPercentageOfGFCF, InwardStockAsAPercentageOfGDP,
OutflowsAsAPercentageOfGFCF, and OutwardStockAsAPercentageOfGDP. These
attributes are then selected as the class. However, the results of this experiment are
unsatisfactory: too many instances are incorrectly clustered (84.0476 % in the run
shown here, with InflowsAsAPercentageOfGFCF as the class).
Clustered Instances
0 396 ( 16%)
1 461 ( 18%)
2 295 ( 12%)
3 324 ( 13%)
4 156 ( 6%)
5 162 ( 6%)
6 139 ( 6%)
7 128 ( 5%)
8 253 ( 10%)
9 206 ( 8%)
Class attribute: InflowsAsAPercentageOfGFCF
Classes to Clusters:
0 1 2 3 4 5 6 7 8 9 <-- assigned to cluster
18 83 35 78 16 4 11 6 38 20 | '(-inf-0.538132]'
28 43 25 40 15 9 23 4 41 17 | '(0.538132-0.538189]'
31 37 19 36 19 13 27 12 33 14 | '(0.538189-0.538266]'
32 44 23 31 21 18 22 18 26 16 | '(0.538266-0.538364]'
29 48 36 22 9 29 21 16 16 18 | '(0.538364-0.53849]'
42 40 31 20 23 20 12 18 25 16 | '(0.53849-0.538637]'
39 40 24 24 20 30 11 11 26 19 | '(0.538637-0.538861]'
57 44 40 22 8 18 6 11 15 25 | '(0.538861-0.539225]'
49 44 39 20 13 15 4 16 20 26 | '(0.539225-0.539841]'
71 38 23 31 12 6 2 16 13 35 | '(0.539841-inf)'
Cluster 0 <-- '(0.539841-inf)'
Cluster 1 <-- '(0.538364-0.53849]'
Cluster 2 <-- '(0.538861-0.539225]'
Cluster 3 <-- '(-inf-0.538132]'
Cluster 4 <-- '(0.53849-0.538637]'
Cluster 5 <-- '(0.538637-0.538861]'
Cluster 6 <-- '(0.538189-0.538266]'
Cluster 7 <-- '(0.538266-0.538364]'
Cluster 8 <-- '(0.538132-0.538189]'
Cluster 9 <-- '(0.539225-0.539841]'
Incorrectly clustered instances : 2118.0 84.0476 %
7.7 Conclusion
From this cluster analysis, we draw the following conclusions:
1. Deposit money banks play an important role in a country’s foreign direct
investment. The higher the ratio of deposit money bank assets to central bank
assets, the larger the investment inflow from other countries.
2. Financial system deposits are highly correlated with a country’s
inward/outward and inflow/outflow investment movements. Demand, time
and saving deposits in deposit money banks and other financial institutions
are related to how active a country’s foreign direct investment is.
3. GDP dominates the attributes as the decisive criterion for viewing a
country’s foreign direct investment behavior. Both the ratios of other
attributes to GDP and GDP itself have a deep impact on how active a country’s
foreign direct investment is.
4. GDPDeflator, Inflation and PPPUSdollarExchangeRate show no relationship with a
country’s foreign direct investment performance in the cluster visualizations.
5. A country’s level of development is partly reflected in the grouping results of
the clustering, but the correspondence is not complete. There must be some
other factors involved.
6. Based on the intrinsic characteristics of the data, the instances can be grouped
into 8 clusters, as determined by the EM algorithm. EM can automatically decide
how many clusters to create, while other algorithms, such as SimpleKMeans, do
not. We need to presume how many clusters there will be when using those
algorithms.
8.0 Resources
8.1 Software Environment
1. Weka
2. Excel
8.2 Hardware Environment
Laptop 1:
Intel® Pentium® M Processor (1.3 GHz)
256 MB DDR SDRAM
40 GB 4200 RPM HD
Laptop 2:
Intel® Pentium® M Processor (1.5 GHz)
512 MB DDR SDRAM
40 GB 4200 RPM HD
9.0 References
1. United Nations http://www.unctad.org/Templates/Page.asp?intItemID=1923&lang=1
2. IMF http://www.imf.org/external/pubs/ft/weo/2004/02/data/index.htm
3. World Bank http://econ.worldbank.org/view.php?type=18&id=3343