Devsumi2013_15-c-7 Ad Tech Targeting Technology

SummitDevelopers

Developers Summit 2013 Action !

Ad Tech Targeting Technology

Pig, Mahout, KVS

Yuichi Ota (太田祐一)

Owl Data Inc. (株式会社オウルデータ)

President and CEO,

and Data Mining Engineer

15-C-7 #devsumiC

Friday, February 15, 2013

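About me

Yuichi Ota (太田祐一)

• Recently had a child and became an ikumen (a father actively involved in childcare)

• Originally worked at a securities firm

• Went independent and failed

• Was taken in by スペイシーズ as a data mining engineer

• The DMP business was spun off into a new company (Owl Data), and I became its representative director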


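Just collect the data and get your hands on it!

• Today's topic is so-called "big data" analysis

• Setting aside whether it really counts as big data, just try analyzing the data you already have in a distributed environment!

• If you build on a distributed environment from the start, you can scale later!

MY RECOMMENDED NEXT ACTION!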


So, how are ad tech and data related?

What is ad tech?

Technology for getting people to take "action"


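Technology for delivering ads

Ads on a childcare information site

It is a childcare information site, so show ads aimed at women aged 25 to 35

But lately quite a few ikumen (fathers actively involved in childcare) are visiting too, aren't they?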


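Audience targeting

He is a man viewing a childbirth and childcare information site, so recommend insurance products

Childcare information × male

A man who views childcare information sites, is considering switching insurance, is interested in stock investing, and is being pestered by his wife to take her on a trip, but has no money because he recently bought a car...

Even with a lot of data, you cannot make full use of it

We want to automatically group similar people into segments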



Targeting flow

[Diagram: site.jp, userId = XYZ, KVS, information for XYZ]

Get the user identification ID from the cookie

Get the user's information from the KVS
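The two steps above (cookie to user ID, user ID to KVS record) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `TargetingLookup`, the cookie name `userId`, and the in-memory `HashMap` standing in for the KVS are all hypothetical and do not come from the slides.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: an in-memory map stands in for the real KVS.
class TargetingLookup {
    private final Map<String, String> kvs = new HashMap<>();

    // Stores the information for one user (the value format is left open here).
    void put(String userId, String userInfo) {
        kvs.put(userId, userInfo);
    }

    // Pulls the "userId" value out of a Cookie header such as "lang=ja; userId=XYZ".
    static Optional<String> userIdFromCookie(String cookieHeader) {
        if (cookieHeader == null) {
            return Optional.empty();
        }
        for (String pair : cookieHeader.split(";")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2 && kv[0].equals("userId")) {
                return Optional.of(kv[1]);
            }
        }
        return Optional.empty();
    }

    // Cookie -> userId -> user information; empty when either step finds nothing.
    Optional<String> lookup(String cookieHeader) {
        return userIdFromCookie(cookieHeader).map(kvs::get);
    }

    public static void main(String[] args) {
        TargetingLookup targeting = new TargetingLookup();
        targeting.put("XYZ", "information for XYZ");
        System.out.println(targeting.lookup("lang=ja; userId=XYZ").orElse("(unknown user)"));
    }
}
```

Returning an empty `Optional` for unknown users keeps the "no cookie" and "no record" cases on the same code path, so the caller can fall back to untargeted ads in either case.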


[Diagram: site.jp, userId = XYZ, KVS, information for XYZ, vector for XYZ]

Vectorize the user information and apply Random Forest

[Diagram: site.jp, userId = XYZ, KVS, information for XYZ, vector for XYZ, clusterId = 5]

Get the clusterId and use the result


Model creation flow

Store each user's information in the KVS

Vectorize each user's features

Standardize the user vectors

Run k-Means on the user vectors with Mahout

Build a Random Forest model from the k-Means results


Storing in the KVS

• Which KVS should you adopt?

‣ Information on comparison sites is not reliable at all

• What data format should you use?

‣ JSON? XML? Avro? MessagePack? protobuf?



Vectorization

{"userId":12345,"categories":{"beauty":0,"childcare":7,"finance":10,"cars":4,"manga":1,"furniture":2,"appliances":8,"pets":0,"gourmet":4},"amount":7600}

Beauty  Childcare  Finance  Cars   Manga  Furniture  Appliances  Pets   Gourmet  Amount
0       7          10       4      1      2          8           0      4        7600
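A minimal sketch of this vectorization step, assuming the user record has already been parsed into a map of category counts plus an amount. The class `UserVectorizer`, the English category names, and the fixed feature order are illustrative assumptions; the slides do not show how the real conversion is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: turns one user's category counts and amount into a fixed-order vector.
class UserVectorizer {
    // Assumed feature order; the last vector element is the amount.
    static final String[] CATEGORIES = {
        "beauty", "childcare", "finance", "cars", "manga",
        "furniture", "appliances", "pets", "gourmet"
    };

    // Missing categories count as 0 so that every user gets a vector of the same length.
    static double[] toVector(Map<String, Integer> categoryCounts, double amount) {
        double[] vector = new double[CATEGORIES.length + 1];
        for (int i = 0; i < CATEGORIES.length; i++) {
            vector[i] = categoryCounts.getOrDefault(CATEGORIES[i], 0);
        }
        vector[CATEGORIES.length] = amount;
        return vector;
    }

    // Formats the vector as one space-delimited text line.
    static String toLine(double[] vector) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < vector.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(vector[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("childcare", 7);
        counts.put("finance", 10);
        counts.put("cars", 4);
        counts.put("manga", 1);
        counts.put("furniture", 2);
        counts.put("appliances", 8);
        counts.put("gourmet", 4);
        System.out.println(toLine(toVector(counts, 7600)));
    }
}
```

Fixing the feature order in one place matters because every later step (standardization, k-Means, Random Forest) relies on column positions rather than names.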


Standardizing the vectors

Before standardization:

Beauty  Childcare  Finance  Cars   Manga  Furniture  Appliances  Pets   Gourmet  Amount
0       3          7        5      1      0          6           0      3        8000
5       6          0        0      0      4          0           0      5        12000
0       0          1        2      4      3          2           6      3        23000

After standardization (mean 0, standard deviation 1):

Beauty  Childcare  Finance  Cars   Manga  Furniture  Appliances  Pets   Gourmet  Amount
-0.58   0.00       1.14     1.06   -0.32  -1.12      1.09        -0.58  -0.58    -0.82
1.15    1.00       -0.70    -0.93  -0.80  0.80       -0.87       -0.58  1.15     -0.30
-0.58   -1.00      -0.44    -0.13  1.12   0.32       -0.22       1.15   -0.58    1.12
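The standardization can be sketched in plain Java as below. The slide's numbers are consistent with dividing by the sample standard deviation (denominator n - 1), so this sketch assumes that convention. The class name `Standardizer` is hypothetical, and the talk itself performs this step on Hadoop with Pig rather than in a single process.

```java
// Hypothetical single-process sketch of column-wise standardization (z-scores).
class Standardizer {
    // rows[user][feature] -> same shape, each column scaled to mean 0 and
    // sample standard deviation 1. Requires at least two rows.
    static double[][] standardize(double[][] rows) {
        int n = rows.length;
        int d = rows[0].length;
        double[][] out = new double[n][d];
        for (int j = 0; j < d; j++) {
            double sum = 0;
            for (double[] row : rows) {
                sum += row[j];
            }
            double mean = sum / n;
            double squares = 0;
            for (double[] row : rows) {
                squares += (row[j] - mean) * (row[j] - mean);
            }
            double sd = Math.sqrt(squares / (n - 1));
            for (int i = 0; i < n; i++) {
                // A constant column has sd 0; map it to 0 instead of dividing by zero.
                out[i][j] = sd == 0 ? 0 : (rows[i][j] - mean) / sd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] raw = {
            {0, 3, 7, 5, 1, 0, 6, 0, 3, 8000},
            {5, 6, 0, 0, 0, 4, 0, 0, 5, 12000},
            {0, 0, 1, 2, 4, 3, 2, 6, 3, 23000}
        };
        for (double[] row : standardize(raw)) {
            StringBuilder sb = new StringBuilder();
            for (double value : row) {
                sb.append(String.format("%.2f ", value));
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

Without this step the amount column (thousands) would dominate the category counts (single digits) in any distance calculation.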


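Using Pig

A tool that lets you drive Hadoop without writing tedious MapReduce programs in Java or running Hadoop commands (originally created at Yahoo)

http://pig.apache.org/releases.html#Download

(This assumes you already have a Hadoop environment)

You can start using it just by downloading it and editing the configuration file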


Using Pig UDFs

Writing UDFs lets you do all sorts of things

% git clone https://github.com/gh-gsd/pigudf.git

% ant jar.build

Computing variance with Pig

register path/to/udfs.jar;
set job.priority very_low;
set job.name 'CalcVariance';
define VAR com.gsd.pig.udf.Variance();
A = load 'mydata' as (data:double);
B = group A all;
C = foreach B generate VAR(A.data);
store C into 'path/to/hdfs/rawdata';


k-Means with Mahout

Mahout is a Java library for data mining and machine learning, used mainly with Hadoop

% git clone https://github.com/apache/mahout.git

% cd mahout

% git checkout mahout-0.7

% mvn install -DskipTests

k-Means

Mahout k-Means data types

Input: key: Text, value: VectorWritable

Output: key: IntWritable, value: WeightedVectorWritable


Converting the data

From: a text file with each user vector formatted as space-delimited values

To: key: Text, value: VectorWritable

% $MAHOUT_HOME/bin/mahout \
    org.apache.mahout.clustering.conversion.InputDriver \
    -i path/to/hdfs/rawData \
    -o path/to/hdfs/vectorData


k-Means command

% $MAHOUT_HOME/bin/mahout kmeans \
    -i path/to/hdfs/vectorData \
    -c path/to/hdfs/initialClusters \
    -o path/to/hdfs/kmeansResults \
    -k 10 \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 20 \
    -xm mapreduce

-i: input path
-c: initial clusters path (any path will do when the initial clusters are chosen at random)
-o: output path for the results
-k: number of clusters
-dm: distance measure
-x: maximum number of iterations
-xm: execution method

The results are written to path/to/hdfs/kmeansResults/clusteredPoints
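To make the distance measure concrete, here is a small sketch of cosine distance (1 minus cosine similarity) and of the k-Means assignment step that picks the nearest centroid. It is an independent illustration, not Mahout's `CosineDistanceMeasure` code. The class `CosineDistance` and the handling of zero vectors are assumptions.

```java
// Hypothetical sketch of cosine distance and nearest-centroid assignment.
class CosineDistance {
    // 1 - (a . b) / (|a| |b|): 0 for the same direction, 1 for orthogonal, 2 for opposite.
    static double distance(double[] a, double[] b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 1.0; // assumption: a zero vector is treated as orthogonal to everything
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    // Assignment step of k-Means: index of the centroid closest to the point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = distance(point, centroids[c]);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{1, 0}, {1, 1.1}, {-1, -1}};
        System.out.println(nearest(new double[] {1, 1}, centroids));
    }
}
```

Cosine distance compares the direction of two vectors rather than their length, so two users with the same mix of interests land close together even if one is far more active.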


k-Means results

// Map example: writes each clustered point as "clusterId,v1,v2,..."
// COMMA is assumed to be a String constant "," defined in the enclosing class.
public void map(IntWritable key, WeightedVectorWritable value,
    OutputCollector<NullWritable, Text> output, Reporter reporter) throws IOException {
  try {
    StringBuilder sb = new StringBuilder();
    sb.append(key.toString()); // cluster ID
    Iterator<Element> ite = value.getVector().iterator();
    while (ite.hasNext()) {
      sb.append(COMMA + ite.next().get()); // vector values
    }
    output.collect(NullWritable.get(), new Text(sb.toString()));
  } catch (InvalidDatastoreException e) {
    throw new IOException(e);
  }
}


k-Means results

Each line is the cluster ID followed by the vector values:

1,0,3,5,7,1,0,9,8,0,1
5,6,0,0,0,2,0,0,0,0,2
2,0,8,2,0,0,0,0,0,0,0
1,0,4,2,3,0,0,6,7,0,4
4,0,1,1,1,1,1,1,2,0,0
3,5,2,0,1,3,0,9,8,0,1
...




Random Forest with Mahout

Decision tree

Random Forest

[Figure: individual decision trees output c-2, c-2, and c-5]
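A Random Forest classifier combines its trees by majority vote, which is presumably what the c-2, c-2, c-5 labels illustrate. The sketch below shows only that voting step. The class `ForestVote` and the tie-breaking rule are assumptions, and Mahout's own implementation is not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the voting step in a Random Forest.
class ForestVote {
    // Returns the label predicted by the most trees.
    // Assumption: on a tie, the label that reached the top count first wins.
    static String majority(String[] treePredictions) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String prediction : treePredictions) {
            int count = counts.merge(prediction, 1, Integer::sum);
            if (count > bestCount) {
                best = prediction;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three trees vote c-2, c-2, c-5, so the forest answers c-2.
        System.out.println(majority(new String[] {"c-2", "c-2", "c-5"}));
    }
}
```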


Random Forest with Mahout

% mahout org.apache.mahout.classifier.df.tools.Describe \
    -p path/to/cluster_id_to_vector_text_file \
    -f path/to/dataset.info \
    -d L 10 N

-p: input file
-f: output path for the dataset description
-d: data format (L = label, 10 N = ten numeric attributes)

% mahout org.apache.mahout.classifier.df.mapreduce.BuildForest \
    -d path/to/cluster_id_to_vector_text_file \
    -ds path/to/dataset.info \
    -o path/to/decision_forest \
    -t 10

-d: input file
-ds: dataset file
-o: model output path
-t: number of trees to build


Testing the model

% mahout org.apache.mahout.classifier.df.mapreduce.TestForest \
    -i path/to/test/data \
    -ds path/to/dataset.info \
    -m path/to/decision_forest/nsl-forest \
    -a \
    -mr \
    -o path/to/output/

-i: input test data
-ds: dataset file
-m: path to the model
-a: show the test results
-mr: use MapReduce
-o: output path for the test results


Test results

12/10/13 18:08:56 INFO mapreduce.TestForest: Classification Time: 0h 0m 6s 355
12/10/13 18:08:56 INFO mapreduce.TestForest:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   : 17657  78.3224%
Incorrectly Classified Instances : 4887   21.6776%
Total Classified Instances       : 22544

=======================================================
Confusion Matrix
-------------------------------------------------------
a     b     <--Classified as
9459  252   | 9711   a = normal
4635  8198  | 12833  b = anomaly
Default Category: unknown: 2
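The summary percentage follows directly from the confusion matrix: correctly classified instances lie on the diagonal, so accuracy is (9459 + 8198) / 22544. A small sketch of that arithmetic, with the hypothetical class name `ConfusionMatrixStats`:

```java
// Hypothetical sketch: recomputes the accuracy shown in the summary from the confusion matrix.
class ConfusionMatrixStats {
    // matrix[actual][predicted]; accuracy = sum of the diagonal / sum of all cells.
    static double accuracy(long[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int actual = 0; actual < matrix.length; actual++) {
            for (int predicted = 0; predicted < matrix[actual].length; predicted++) {
                total += matrix[actual][predicted];
                if (actual == predicted) {
                    correct += matrix[actual][predicted];
                }
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        long[][] matrix = {
            {9459, 252},   // actual a = normal
            {4635, 8198}   // actual b = anomaly
        };
        System.out.printf("%.4f%%%n", accuracy(matrix) * 100);
    }
}
```

Most of the errors sit in one cell: 4635 of the 12833 anomaly rows were classified as normal, which a single accuracy figure hides.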




Classify

…

// Load the decision forest
DataInputStream decisionForestBinary = getDecisionForestBinary();
DecisionForest decisionForest = DecisionForest.read(decisionForestBinary);

// Load the dataset
DataInputStream datasetBinary = getDatasetBinary();
Dataset dataset = Dataset.read(datasetBinary);

// Get the scaled vector
String scaledVector = getScaledVector();

DataConverter dataConverter = new DataConverter(dataset);
Instance instance = dataConverter.convert(scaledVector);

// Random number generator
Random random = new Random();

// Get the result
double id = decisionForest.classify(random, instance);









Thank you very much.

If you would like to work at Owl Data, or want to hear more details,

I will be waiting at "Ask The Speaker".

