Technology Vertical Lead Generation
¹K. S. Hemapriya, ²Anshul Saxena, ³Dr. N. Sangeetha
¹Student, Department of Management, Kumaraguru College of Technology, Coimbatore 641049, Tamil Nadu, India
²Assistant Professor, Department of Management, Kumaraguru College of Technology, Coimbatore 641049, Tamil Nadu, India
³Senior Associate Professor, Department of Mechanical Engineering, Kumaraguru College of Technology, Saravanampatty, Coimbatore: 49, Tamil Nadu, India. [email protected]
Abstract. This study builds a model that generates leads from major project posting portals such as Freelancer, based on the technology verticals of the company. Reaching the potential target customer matters more than selling to everyone in the market, so identifying the right project to bid on is essential. The study also identifies areas for skill set enhancement by finding the skills that are frequently demanded along with the key skills of the technical team, which increases the team's scope for getting more projects.
Keywords: Competition, lead generation, model, market, skill set, services, target customer.
1. INTRODUCTION
Lead generation, which is the creation of customer interest in or enquiry into the services of a business, is becoming challenging in IT because of high competition in the market and rapid advances in technology and innovation. Lead generation helps make customers show interest in the products or services of a company, or helps find prospective projects for the company. It is more important to reach the potential target customer than to sell to everyone in the market. Analysing the market trend and having employees with the right skills help provide additional services to existing clients and win new clients. This study aims to create a model that generates leads for the technical team at an IT services company, based on its technology verticals, by collecting project data from major project posting portals such as Freelancer. The study also identifies areas for skill set enhancement of the team, which increases its chances of getting more projects.
2. REVIEW OF LITERATURE.
International Journal of Pure and Applied Mathematics, Volume 119 No. 17 2018, 2687-2697, ISSN: 1314-3395 (on-line version), url: http://www.acadpubl.eu/hub/, Special Issue
1. Ku Chun Kit and Dr. David Rossiter¹, in their paper "Business Lead Qualification by Online Information Scraping", created an automated lead qualifier based on online information scraping. They collected real sales data from a partnering company, then classified the companies into two groups depending on whether or not a deal was made with that company. The system retrieved each company's website URL and scraped information from the company websites and social network profile pages. The information was then cleaned and used to train three models. Predictions generated by the models were combined using an algorithm to collectively qualify new business leads.
¹ Kit, K. C., & Rossiter, D. (2017). Business Lead Qualification by Online Information Scraping.
2. Jeffrey Kohl Wilkins and Jack Marshall Zoken², in their invention "Internet-enabled lead generation", describe a method of generating intender leads in a distributed computer system, which includes the steps of identifying purchase indicators and extracting prospect identifiers from the purchase indicators. Purchase indicators are pieces of data that represent a potential future purchase by a prospect. For example, an online classified advertisement selling an automobile is a purchase indicator for a potential future purchase of a new car by the old car seller. The prospect identifier, such as a telephone number or email address, uniquely identifies the prospect likely to make the future purchase. Preferably, the method also contains the steps of obtaining full contact information for the prospect from a profile database, applying a predictive model to the prospects to select intender leads, and transferring the intender leads to an interested party, such as a direct marketing service or sales force. An intender lead is a lead for a person intending to make a purchase of a particular product or service within a given time period. Only some of the prospects are actual intenders.
² Wilkins, J. K., & Zoken, J. M. (2005). U.S. Patent No. 6,868,389. Washington, DC: U.S. Patent and Trademark Office.
3. Richard Baron Penman, Timothy Baldwin and David Martinez³, in their paper "Web Scraping Made Simple with Site Scraper", created a tool, Site Scraper, which automatically learns XPath-based patterns to identify where a user-defined list of strings occurs in a given set of web pages. To train it, Site Scraper is given a small set of example URLs from a given website and the strings that the user wishes to scrape from each. This is used to generate an XPath query describing where to find the desired strings, which can be applied to scrape them from any web page with a similar structure. Importantly, the user interacts with Site Scraper at the level of content, not mark-up, so no specialist knowledge is required, and if the structure of a website changes but the content stays constant, Site Scraper can automatically retrain its model without human intervention.
³ Penman, R. B., Baldwin, T., & Martinez, D. (2009). Web scraping made simple with sitescraper.
4. John J. Salerno and Douglas M. Boulware⁴, in their invention "Method and apparatus for improved web scraping", describe a way to enable the parser component of a web search engine to adapt in response to frequent web page format changes at web sites. The parser "learns", from a set of defined HTTP links, how to find and parse web pages returned from a search engine query. The invention intelligently locates various tokens/strings that will correctly extract attributes associated with the returned item. The invention may operate either automatically or in a user-assisted fashion.
5. Dr. N. Fathima Thabassum⁵, in her "Study on The Freelancing Remote Job Websites", studied the various freelancing websites available online, how they work, the realities of the online job market and the services they offer. She also studied the top freelancing websites and made a comparative study of them on various aspects.
⁴ Salerno, J. J., & Boulware, D. M. (2006). Method and apparatus for improved web scraping. U.S. Patent No. 7,072,890. Washington, DC: U.S. Patent and Trademark Office.
⁵ Thabassum, N. F. (2013). A Study on The Freelancing Remote Job Websites. International Journal of Business Research and Management, 4, 42-50.
6. Mr. Himanshu Kunwar⁶, in his model "Logistic Regresssion in R" on Kaggle, predicts the purchase of products from social media advertising based on various factors.
7. Mr. Salem Marafi⁷, in his model, performs Market Basket Analysis with R on the groceries dataset [10] (S. Arunadevi and Vijeta Iyer, 2017). He applies the Apriori algorithm to the analysis.
3. RESEARCH METHODOLOGY
CRISP-DM - Cross-Industry Standard Process for Data Mining
3.1. BUSINESS UNDERSTANDING
3.1.1. Business Objectives
● To generate project leads for the technology verticals in the company.
● To identify the technologies commonly used in the market along with the company's existing technologies.
3.1.2. Determine data mining goals
● To generate leads based on the skill set of the team.
● To apply logistic regression to train a model and predict the acceptance of a project.
● To analyse the commonly occurring skill sets in the project database using the Apriori algorithm.
⁶ Himanshu, Kunwar. (2018, January 05). Logistic Regresssion in R. Retrieved April 09, 2018, from https://www.kaggle.com/suncor/social-adv
⁷ Market Basket Analysis with R. Retrieved April 11, 2018, from http://www.salemmarafi.com/code/market-basket-analysis-with-r/
3.1.3. Project plan
● Identify the technology verticals in the team.
● Collect the project data from project websites such as Freelancer.com.
● Manipulate and prepare the data for input to the model.
● Build a model that predicts the acceptance of projects in the test data based on the training data.
● Find the skills related to the technology verticals.
3.1.4. Business success criteria
● Successful lead conversion and project confirmation.
● An efficient model for predicting leads.
3.2. DATA UNDERSTANDING
3.2.1. Initial Data: The appropriate fields for data collection are chosen, and data is scraped from job portals such as Freelancer, Upwork and Guru into separate tables.
3.2.2. Data Description: The project posts extracted from the major job posting websites have fields such as title, description, skill set required, bid, price, days left, location and ID.
3.2.3. Data Quality: The quality of the data is checked. Missing values and fields were identified and replaced with NAs. Outliers in the bid value were identified and removed. Derived fields such as continent were derived from country and location.
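The data quality checks described above can be sketched as follows. This is a minimal illustration rather than the script used in the study: the field name `bid`, the use of `None` to stand in for NA, and the interquartile range (IQR) rule for outliers are all assumptions, since the study does not state which outlier rule was applied.

```python
from statistics import quantiles


def clean_bids(projects, k=1.5):
    """Mark missing bids as None (NA) and drop projects whose bid is an IQR outlier.

    `projects` is a list of dicts with a hypothetical "bid" field.
    The IQR rule with factor k is an assumption, not the study's stated method.
    """
    cleaned = []
    for p in projects:
        bid = p.get("bid")
        # Treat empty strings and absent fields as missing values.
        cleaned.append({**p, "bid": None if bid in ("", None) else float(bid)})

    known = sorted(p["bid"] for p in cleaned if p["bid"] is not None)
    if len(known) < 4:
        return cleaned  # too few values to estimate quartiles
    q1, _, q3 = quantiles(known, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    # Keep rows with a missing bid (NA) or a bid inside the fences.
    return [p for p in cleaned if p["bid"] is None or low <= p["bid"] <= high]


bids = [20, 25, 30, 35, 40, 45, 50, 55, "", 5000]
sample = [{"id": i, "bid": b} for i, b in enumerate(bids)]
result = clean_bids(sample)
```

In this toy sample the bid of 5000 falls outside the upper fence and is dropped, while the project with a missing bid is kept with `None` in place of the value.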
3.3. DATA PREPARATION
3.3.1. Data set description:
The fields common to the three websites include
1. Title: Title of the project
2. Description: Description of the project
3. Skill set: Skill set required
4. Bid: Average bid by other freelancers
5. Days Left: Active days of the project
6. Location: Location of the project employer
7. ID: Project ID
8. Verified: Whether the project employer is verified or not.
3.3.2. Data selection: Major project websites are chosen, and the data is collected from the websites that permit data scraping.
3.3.3. Data Cleansing: Redundant and fake projects are removed and outliers are eliminated.
3.3.4. Integrate data: Data from all three websites is merged and integrated for analysis.
3.3.5. Construct data: The data is normalised (the multiple skill requirements given for each project are normalised), and derived data such as continent is extracted from country.
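As an illustration of the construct step, the sketch below expands a project with several required skills into one row per skill and derives the continent from the country. The field names, the comma-separated skill format and the small `COUNTRY_TO_CONTINENT` lookup are hypothetical; the study does not describe how the mapping was built.

```python
# Hypothetical, deliberately tiny lookup; a real mapping would cover every country.
COUNTRY_TO_CONTINENT = {
    "India": "Asia",
    "United States": "North America",
    "Germany": "Europe",
    "Australia": "Oceania",
}


def construct_rows(project):
    """Expand one project into one row per required skill and add a continent field."""
    continent = COUNTRY_TO_CONTINENT.get(project["country"], "Unknown")
    # Assumes the skill set is stored as a comma-separated string.
    skills = [s.strip() for s in project["skillset"].split(",") if s.strip()]
    return [
        {"id": project["id"], "skill": skill,
         "country": project["country"], "continent": continent}
        for skill in skills
    ]


rows = construct_rows(
    {"id": 101, "skillset": "Java, MySQL ,MongoDB", "country": "India"}
)
```

Each resulting row carries a single skill, which is the shape needed later for counting how often skills occur together.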
3.3.6. Format data: The fields are converted to common formats (for example, all bid values are converted to dollars).
3.4. MODELLING
3.4.1. Modelling Technique. The model that generates leads based on the skill set of the team works by matching the team's skills against the skillset field of each project [11] (Irfan Ahmed Mohammed Saleem and Dr. S. Jaisankar, 2018). To predict the acceptance of the leads, multiple logistic regression is used, which takes multiple factors (skills, average bid value, verified) to decide on the acceptance value, as these three fields are the primary factors considered before bidding on a project.
3.4.2. Test design
● Data is sampled from the master data by selecting an equal number of projects from each category. The data is split into training data (70%) and testing data (30%). The acceptance value for the training data is obtained from the team member. Based on the values in the training data, the acceptance value is predicted for the test data.
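The test design above can be sketched as follows, assuming each project carries a hypothetical `category` field and that "equal projects from each category" means the same number of randomly drawn projects per category. The per-category sample size and the random seed are illustrative choices, not values from the study.

```python
import random
from collections import defaultdict


def sample_and_split(projects, per_category, train_fraction=0.7, seed=42):
    """Draw the same number of projects from each category, then split 70/30."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for p in projects:
        by_category[p["category"]].append(p)

    sampled = []
    for items in by_category.values():
        # Take at most `per_category` projects from every category.
        sampled.extend(rng.sample(items, min(per_category, len(items))))

    rng.shuffle(sampled)
    cut = round(len(sampled) * train_fraction)
    return sampled[:cut], sampled[cut:]


projects = [
    {"id": i, "category": c}
    for i, c in enumerate(["java", "php", "python"] * 20)
]
train, test = sample_and_split(projects, per_category=10)
```

With three categories and ten projects drawn from each, the 30 sampled projects are divided into 21 training and 9 test projects.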
3.4.3. Model
● The model predicts the acceptance of a project based on the skill set of the team.
● The skill set specialisation of the team, the average bid value and the verified flag are the independent factors used to build the model, as the team mostly selects projects based on these factors. Based on the coefficient of correlation, the factors with high correlation (Java, MySQL, MongoDB, Apache, Avg Bid Value) are considered as factors determining the acceptance of a project, and the remaining factors are eliminated.
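A correlation-based selection of factors could look like the sketch below. It assumes the Pearson correlation coefficient and an arbitrary cut-off of 0.3; the study does not say which coefficient or threshold was used, and the toy columns are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_factors(columns, accepted, threshold=0.3):
    """Keep the columns whose absolute correlation with `accepted` meets the threshold."""
    scores = {name: pearson(values, accepted) for name, values in columns.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}


# Toy data: 1 = skill required by the project, accepted = team member's decision.
accepted = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "java":   [1, 1, 1, 0, 0, 0, 0, 0],  # mostly aligned with acceptance
    "python": [1, 1, 0, 0, 1, 1, 0, 0],  # unrelated to acceptance
}
kept = select_factors(columns, accepted)
```

Here `java` is retained with a correlation of about 0.77, while `python` has a correlation of zero with acceptance and is eliminated.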
3.4.4. Model Assessment:
● The model is assessed using Mean Absolute Error (MAE). The accuracy measured this way is found to be 81.67% for this model.
● It is also assessed using the Receiver Operating Characteristic (ROC) curve, which has 89.48% of the area under the curve.
3.5. EVALUATION
The model is assessed by periodically checking the conversion of leads. The data collected periodically is assessed based on its conversion rate, and a process review is made.
3.6. DEPLOYMENT
3.6.1. Deployment plan: Deployment is done by registering the company as a freelancer on the freelancing websites, assigning a team to bid for projects and appointing a bid writer to write the bids.
3.6.2. Monitoring and maintenance plan: The data is scraped from the websites periodically to check whether there is a need to upgrade the skills of the technical team or to include a new technology vertical.
4. MODEL
4.1. Model to extract the project leads according to the skill set of the team.
A model is created in Python which matches the skill set of the team members with the skill set required for the projects and segregates the projects that match. All the projects (45580) on the Freelancer website on Feb 26th 2018 were scraped, and the model was run on the project data collected.
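import csv

import pymysql

# Connect to the local MySQL database that holds the team skills
# and the scraped Freelancer projects.
con = pymysql.connect(host="127.0.0.1", user="root", password="",
                      database="skillset", autocommit=True, charset="utf8")
cursor = con.cursor()

# UnionSkills holds the team's skills; group_concat returns them as one
# comma-separated string, which is split back into individual skills.
cursor.execute("select group_concat(skills) from UnionSkills")
skills = []
for (joined,) in cursor.fetchall():
    if joined:
        skills.extend(s.strip() for s in joined.split(",") if s.strip())
print(skills)

for skill in skills:
    # Parameterised LIKE query: projects whose skillset mentions this skill.
    cursor.execute(
        "select * from projecttablefreelancer where skillset like %s",
        ("%" + skill + "%",),
    )
    rows = cursor.fetchall()
    print(skill, len(rows))
    if rows:
        # Append the matching projects to one CSV file per skill.
        with open("skills_" + skill + ".csv", "a", newline="") as f:
            writer = csv.writer(f, delimiter=",")
            data_rows = []
            for data in rows:
                # Flatten every field to a single-line string and tag the row
                # with the skill that matched.
                d = [str(i).replace("\n", "").strip() for i in data]
                d.append(skill)
                data_rows.append(d)
            try:
                writer.writerows(data_rows)
            except Exception as e:
                print(e)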
4.2. Model to predict acceptance of project from the sampled data from the leads generated.
The segregated projects are transformed based on the skill set known to the technology team member (known = 1, not known = 0), the average bid value and the verified flag. The acceptance value for a sample of the data is obtained from the technical team member whose skill set is used to build the model. The data is split into 70% training data and 30% test data, and a model is built on the training data and used to predict the acceptance of the 30% test data.
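One way to read the transformation described above is as a row of 0/1 indicators, one per team skill, plus the bid value and the verified flag. The sketch below follows that reading; the field names (`skillset`, `avg_bid`, `verified`) and the comma-separated skill format are assumptions made for illustration.

```python
def encode_project(project, team_skills):
    """Turn one project into a numeric feature row (known skill = 1, not known = 0)."""
    required = {s.strip().lower() for s in project["skillset"].split(",")}
    # One indicator per team skill: 1 if the project asks for it, else 0.
    row = {skill: int(skill.lower() in required) for skill in team_skills}
    row["avg_bid"] = float(project["avg_bid"])
    row["verified"] = int(bool(project["verified"]))
    return row


team_skills = ["Java", "MySQL", "MongoDB", "Apache"]
row = encode_project(
    {"skillset": "java, mongodb, React", "avg_bid": "250", "verified": True},
    team_skills,
)
```

Skills required by the project but unknown to the team, such as React in this example, simply do not appear as columns.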
4.3.1. Logistic regression
Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more independent variables. This study covers the case of a binary dependent variable, that is, where the output can take only two values, "0" and "1".
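To make the technique concrete, the following is a from-scratch sketch of binary logistic regression fitted by batch gradient descent. The study does not state which software fitted its model, so this is not the authors' implementation; the learning rate, number of epochs and toy data are illustrative assumptions.

```python
from math import exp


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this row drives the gradient.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability that the project is accepted (label 1)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data: columns are [knows_java, employer_verified]; label is acceptance.
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]
w, b = fit_logistic(X, y)
```

In this toy data acceptance depends only on the first column, so the fitted model assigns a probability above 0.5 to `[1, 0]` and below 0.5 to `[0, 1]`.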
4.3.2. Model
Interpretation
It is found that the significance is high (p-value < 0.05) for Java, MySQL, NoSQL Couch and Mongo, Apache, Bid value in $ and Verified, hence these fields are considered as factors which affect acceptance.
4.3.3. Prediction and accuracy of the model
ROC Curve
The ROC curve is a fundamental tool for diagnostic test evaluation. In a Receiver Operating Characteristic (ROC) curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (100 - specificity) for different cut-off points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.
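The construction of the ROC curve described above can be sketched in a few lines: each distinct score is used as a cut-off, giving one (false positive rate, true positive rate) point, and the area under the curve is obtained with the trapezoidal rule. The labels and scores are made-up values for illustration, not the study's data.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per cut-off."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Lowering the cut-off past each distinct score admits more predictions.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= t and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= t and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))
```

For this toy input the area is 8/9, which matches the fraction of (accepted, rejected) pairs in which the accepted project receives the higher score.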
The model is assessed using Mean Absolute Error (MAE). The accuracy measured this way is found to be 81.67% for this model.
Interpretation:
The model is built using the training data, and the acceptance value is predicted for the test data. The accuracy under Mean Absolute Error is found to be 81.6%, and the area under the ROC curve is found to be 89.4%.
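The study does not spell out how the MAE figure is turned into an accuracy. One plausible reading, assumed in the sketch below, is that predicted probabilities are first cut at 0.5 to give 0/1 labels; for 0/1 labels the MAE equals the share of misclassified projects, so the accuracy is 1 - MAE. The cut-off and the sample values are illustrative assumptions.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def accuracy_from_mae(actual, probabilities, cutoff=0.5):
    """Threshold probabilities to 0/1 labels and report 1 - MAE as accuracy."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    return 1.0 - mean_absolute_error(actual, predicted)


actual = [1, 0, 1, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.7, 0.6]
acc = accuracy_from_mae(actual, probabilities)
```

Two of the five toy projects are misclassified at the 0.5 cut-off, so the MAE is 0.4 and the accuracy is 0.6.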
4.4. Frequently occurring skill sets along with skills that have high significance for the acceptance value in the predictive model.
4.4.1. Apriori algorithm.
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database. Association rule analysis is a technique to uncover how items are associated with each other. There are three common ways to measure association.
Support indicates how popular a skill set is, measured as the proportion of projects in which it appears.
Confidence indicates how likely skill Y is to occur when skill X occurs, expressed as {X -> Y}.
Lift indicates how likely skill Y is to occur when skill X occurs, while controlling for how popular skill Y is.
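The three measures can be computed directly from their definitions, as in the sketch below. Each project is represented as a set of required skills; the five toy projects are invented for illustration and are not taken from the study's database.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Support of X and Y together divided by the support of X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """Confidence of {X -> Y} divided by the support of Y."""
    return confidence(transactions, x, y) / support(transactions, y)


projects = [
    {"java", "mysql"},
    {"java", "mysql", "apache"},
    {"java", "mongodb"},
    {"php", "mysql"},
    {"php", "html"},
]
java_support = support(projects, {"java"})
java_mysql_conf = confidence(projects, {"java"}, {"mysql"})
java_mysql_lift = lift(projects, {"java"}, {"mysql"})
```

A lift above 1, as for {java -> mysql} here (about 1.11), means the two skills are demanded together more often than their individual popularity alone would suggest.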
In the association rule graphs, larger circles imply higher support, while red circles imply higher lift.
Association rules can be created for the skills that have high significance in the model (Java, Apache, MySQL and MongoDB), and the other skill sets that frequently occur in projects along with these skill sets are found. Association rules can also be created for PHP, which is the most demanded language in the project database; this helps identify related skills frequently demanded along with the key skills of the team.
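The level-by-level search that Apriori performs can be sketched as below. This is a compact illustration of the general algorithm, not the tool used in the study; the toy projects and the minimum support of 0.4 are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    n = len(transactions)

    def frequent(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent individual items.
    level = frequent({frozenset([item]) for t in transactions for item in t})
    result = dict(level)
    size = 2
    while level:
        # Join step: merge frequent (size-1)-itemsets into size-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        level = frequent(candidates)
        result.update(level)
        size += 1
    return result


projects = [
    frozenset({"java", "mysql"}),
    frozenset({"java", "mysql", "apache"}),
    frozenset({"java", "mongodb"}),
    frozenset({"php", "mysql"}),
    frozenset({"php", "html"}),
]
itemsets = apriori(projects, min_support=0.4)
```

On this toy data the frequent item sets are {java}, {mysql}, {php} and {java, mysql}; association rules such as {java -> mysql} are then derived from the frequent sets with more than one item.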
4.4.2. Most frequently found skills along with Java in the project database.
Graph
Figure 4.15. Association rule graph - Java
Association rules
Table 4.7. Association rule - Java
Inference
● The most popular skill set pattern is JavaScript and Vue.js.
● Another popular combination is Big Data and Java.
● If the skill Mathematics is demanded, Matlab and Mathematica are likely to be demanded as well.
5. FINDINGS
1. A model to predict the acceptance of a project for a member of the technology team was built, and its accuracy was found to be 89.4% (ROC curve).
2. Association rules were found for the skill sets that have high significance for acceptance. They show the most frequently occurring combinations of skill sets demanded in projects, along with the skill sets that influence the acceptance of a project.
5.1. Future work and enhancement.
1. A dynamic application can be built which analyses the skill sets of all projects according to the technology vertical and dynamically suggests the projects that have a higher chance of acceptance.
5.2. CONCLUSION.
Thus, a model which generates leads according to the skill set of the team was created, and a market analysis of skill sets by location, frequency and average bid value was carried out, from which insights were found.
6. BIBLIOGRAPHY
[1] Kit, K. C., & Rossiter, D. Business Lead Qualification by Online Information Scraping.
[2] Kit, K. C., & Rossiter, D. Business Lead Qualification by Online Information Scraping.
[3] Wilkins, J. K., & Zoken, J. M. (2005). U.S. Patent No. 6,868,389. Washington, DC: U.S. Patent and Trademark Office.
[4] Penman, R. B., Baldwin, T., & Martinez, D. (2009). Web scraping made simple with sitescraper.
[5] Salerno, J. J., & Boulware, D. M. (2006). U.S. Patent No. 7,072,890. Washington, DC: U.S. Patent and Trademark Office.
[6] Thabassum, N. F. (2013). A Study on The Freelancing Remote Job Websites. International Journal of Business Research and Management, 4, 42-50.
[7] Himanshu, Kunwar. (2018, January 05). Logistic Regresssion in R. Retrieved April 09, 2018, from https://www.kaggle.com/suncor/social-adv
[8] Hendra Herviawan, M. (2017, December 25). Customer Segmentation using RFM Analysis (R). Retrieved April 09, 2018, from https://www.kaggle.com/hendraherviawan/customer-segmentation-using-rfm-analysis-r/notebook
[9] https://public.tableau.com/views/TableauSuperstoreRFMAnalysis_0/TableauSuperstoreRFMAnalysis.
[10] S. Arunadevi and Vijeta Iyer (2017), A Study on M/M/C Queue Model under Monte Carlo simulation in Traffic Model, International Journal of Pure and Applied Mathematics, Vol. 116, no. 12, pp. 199-207.
[11] Irfan Ahmed Mohammed Saleem, Dr. S. Jaisankar (2018), A Study On Kaizen Based Soft-Computing In Electric Vehicle Manufacturing Processes, International Journal of Innovations in Scientific and Engineering Research, Vol. 5, Issue 5, 31-39.
[12] https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html.
[13] http://www.salemmarafi.com/code/market-basket-analysis-with-r/
[14] http://www.citationmachine.net/items/707381222/copy?copy-full-bib=true