ML Approaches – Conceptual Stuff Nitin Kohli DS W210 – Capstone Project


Sequence Matching – Marathon Runner Analogy

• Imagine you are watching a runner run a marathon
• During the marathon, the runner reaches various checkpoints and their time is recorded at each one
• For instance, if there are 26 checkpoints in the race and we know the runner’s time at the first 3 checkpoints, we can use this information to estimate the time at the 4th checkpoint, the 5th checkpoint, and so on
• In general, we can infer the time that particular runner needs to complete the remainder of the race


Sequence Matching

• We apply this analogy to trains within the BART system
• A train departs a “starting” station at a particular time and checks in at checkpoints (train stops) along the way
• This information gives us a partial story of the sequence of arrival times for the train
• To deduce the remaining times, we match these incomplete sequences against complete historical sequences and use the matches to predict the next arrival time


Lag Time Analysis

• Unlike the marathon runner in our previous analogy, once a train arrives at a given station it does not immediately continue
• It pauses for a bit to allow passengers to get on and off the train before continuing
• Thus, we need to supplement the arrival time from the sequence matching by accounting for the lag time at a given station
• This is done using a Ridge Regression with features such as (but not limited to):
  • Length of the train
  • Which stop the train is at
  • Time of the arrival (estimated)
  • Whether the arrival is in the AM or PM
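A minimal sketch of the lag-time idea, assuming a closed-form ridge fit written in plain Python. The function names, the feature encoding, the lag unit (seconds), and the toy numbers are all hypothetical and only illustrate the shape of the model. For simplicity the intercept is penalized along with the other weights.

```python
from typing import List


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_fit(X: List[List[float]], y: List[float], alpha: float) -> List[float]:
    """Ridge weights w = (X^T X + alpha * I)^{-1} X^T y."""
    p = len(X[0])
    xtx = [
        [sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
         for j in range(p)]
        for i in range(p)
    ]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict_lag(weights: List[float], features: List[float]) -> float:
    """Predicted lag time for one feature row."""
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical rows: [intercept, train length in cars, stop index, 1 if PM else 0]
X_toy = [
    [1.0, 8.0, 1.0, 0.0],
    [1.0, 10.0, 2.0, 0.0],
    [1.0, 6.0, 3.0, 1.0],
    [1.0, 9.0, 4.0, 1.0],
    [1.0, 10.0, 5.0, 0.0],
]
y_toy = [37.0, 42.0, 40.0, 47.0, 45.0]  # made-up lag times in seconds

weights = ridge_fit(X_toy, y_toy, alpha=1.0)
print(predict_lag(weights, [1.0, 8.0, 2.0, 1.0]))
```

The estimated time of arrival could enter as one more numeric column; it is left out here only to keep the toy data small.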


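Summary: Sequence Matching -> Lag Time Prediction -> Updated Departure Time -> Repeat

[Diagram: the observed partial sequence $Y_1, Y_2, \dots, Y_j$ is matched against the historical sequences $S^{(1)}, \dots, S^{(k)}$, each of the form $S_1^{(i)}, S_2^{(i)}, \dots, S_j^{(i)}, S_{j+1}^{(i)}$. The Sequence Matching Model produces a preliminary estimate $Y_{j+1}^{Pre}$, which the Lag Model updates to $Y_{j+1}$.]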


Tech Summary of System Level Prediction

1. User enters various information
2. We first need to tell the user when the train will arrive at the selected station for departure
   - This means we need to query the MySQL db to find the most recent trains heading in the direction of the user
   - Then, we need to use the previous train stops as an input to perform a sequence match
   - Once we have a sequence match, we can predict from the matched sequences
   - This will give us a predicted arrival time at the next stop
   - But at each stop, the train will wait some time before departing from that station
   - This is where the lag_times model comes in: the current stop, length of the train, etc. are used to predict how long the train will wait at a given station
3. Repeat this process until we have predictions for both the departure station and the arrival station
4. Output these values back to the user in the UI


ML Approaches – “Mathy” Stuff



• The following slides have (for the most part) all the math that was done to construct the system level prediction
• They include the first model, which we iterated on to arrive at the more accurate second model


Conceptual Framework: k-Nearest Sequences

• In the picture on the right, note that there are 5 distinct paths

• Within each path, trains can run in either direction

• Thus, there are 10 directional paths

• For each directional path, we will denote the stops using {1,2,…,n}

• For example, on the orange line from Richmond to Fremont,
  • 1 will refer to Richmond
  • 2 will refer to El Cerrito del Norte
  • 3 will refer to El Cerrito Plaza, etc.


Conceptual Framework Continued

• For each directional path, suppose we have historical information about the time of the arrival at each stop

• Coerce each time from […] to a real number by […]

• Let $S_j^{(i)}$ be the numerical version of the $i$-th historical time at the $j$-th stop

• Denote the $i$-th historical path as $S^{(i)} = \left(S_1^{(i)}, S_2^{(i)}, S_3^{(i)}, \dots, S_{n-1}^{(i)}, S_n^{(i)}\right)$

[Diagram: the historical paths for $i = 1, 2, 3$, each drawn as the chain $S_1^{(i)}, S_2^{(i)}, S_3^{(i)}, \dots, S_{n-1}^{(i)}, S_n^{(i)}$.]
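A minimal sketch of coercing a clock time to a real number, assuming minutes since midnight as the target scale. That scale is an assumption made for illustration, not necessarily the conversion used in the project, and the helper names are hypothetical.

```python
def clock_to_minutes(hhmm: str) -> float:
    """Convert 'H:MM' or 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60.0 + int(minutes)


def minutes_to_clock(value: float) -> str:
    """Round a real-valued time back to the nearest 'H:MM' string."""
    total = int(round(value)) % (24 * 60)
    return f"{total // 60}:{total % 60:02d}"


arrival = clock_to_minutes("6:32")
print(arrival, minutes_to_clock(arrival))
```

Any monotone scale (decimal hours, seconds since the start of service) would work the same way, as long as every sequence on a directional path uses the same one.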


Conceptual Framework Continued

• Suppose we only have information about the times up till the $j$-th stop, denoted as $Y = (Y_1, Y_2, \dots, Y_j)$

• If we can find historical sequences similar to $Y$, then we can complete the sequence by using the historical information

• Suppose we have a set of historical sequences for our desired directional path

• Let $D$ be a distance function. Under this function, we locate the $k$ closest partial sequences to $Y$

[Diagram: $Y_1, Y_2, \dots, Y_j$ is compared with the first $j$ entries of each historical sequence $S_1^{(i)}, S_2^{(i)}, \dots, S_n^{(i)}$. The $k$ closest sequences, $S^{(1)}, \dots, S^{(k)}$, supply the entries $S_{j+1}, \dots, S_n$ used to fill in $Y_{j+1}, \dots, Y_n$.]
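A minimal sketch of locating the k closest partial sequences, assuming Euclidean distance for D and comparing the observed times with the first j entries of each historical sequence. The function names and the toy numbers are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest_sequences(
    partial: Sequence[float],
    history: List[Sequence[float]],
    k: int,
    distance: Callable[[Sequence[float], Sequence[float]], float] = euclidean,
) -> List[Tuple[float, Sequence[float]]]:
    """Return (distance, sequence) pairs for the k closest historical sequences."""
    j = len(partial)
    # Only sequences that extend past stop j can help complete the partial one.
    scored = [(distance(partial, seq[:j]), seq) for seq in history if len(seq) > j]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


# Made-up real-valued arrival times for one directional path.
history = [
    [0, 4, 7, 11],
    [0, 5, 9, 13],
    [0, 4, 8, 12],
    [0, 9, 15, 20],
]
partial = [0, 4, 8]
for dist, seq in k_nearest_sequences(partial, history, k=2):
    print(round(dist, 3), seq)
```

Passing a different `distance` callable is enough to swap in another choice of D.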


Approach 1: Complete the whole sequence

Once we have determined the $k$ nearest sequences, we can complete the entire sequence by taking a weighted sum of their values for the remaining components.

As for the choice of weights, we can do

1. Uniform weighting

2. Anti-distance weighting
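A minimal sketch of the two weighting schemes, assuming that "anti-distance" means weights proportional to the inverse of the distance (an interpretation, not confirmed by the slides). Function names and toy numbers are hypothetical. Weights are normalized to sum to 1 so the weighted sum stays on the same time scale.

```python
from typing import List, Sequence


def completion_weights(
    distances: Sequence[float], scheme: str = "uniform", eps: float = 1e-9
) -> List[float]:
    """Normalized weights for the k nearest sequences."""
    if scheme == "uniform":
        raw = [1.0] * len(distances)
    elif scheme == "inverse_distance":
        # eps keeps an exact match (distance 0) from dividing by zero.
        raw = [1.0 / (d + eps) for d in distances]
    else:
        raise ValueError(f"unknown weighting scheme: {scheme}")
    total = sum(raw)
    return [r / total for r in raw]


def complete_sequence(
    partial: Sequence[float],
    neighbours: Sequence[Sequence[float]],
    distances: Sequence[float],
    scheme: str = "uniform",
) -> List[float]:
    """Fill in every remaining component with a weighted sum over the neighbours."""
    j = len(partial)
    n = len(neighbours[0])
    weights = completion_weights(distances, scheme)
    tail = [
        sum(w * seq[pos] for w, seq in zip(weights, neighbours))
        for pos in range(j, n)
    ]
    return list(partial) + tail


neighbours = [[0, 4, 8, 12], [0, 5, 10, 14]]
distances = [1.0, 3.0]
print(complete_sequence([0, 4], neighbours, distances, "uniform"))
print(complete_sequence([0, 4], neighbours, distances, "inverse_distance"))
```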


Approach 1: Empirical Results

• Used 2 distance functions:
  • Euclidean distance
  • Cosine similarity
• Used both uniform and anti-distance weighting
• Method of evaluation: RMSE on the real-valued time conversions
• Results:
  • Euclidean distance with uniform weighting outperformed all other combinations
  • The most accurate value was the $(j+1)$-st term
  • Sample RMSE for Euclidean distance with uniform weighting, 4 historical values, and k = 10:

    0.126   0.159   0.161   0.16   0.158

  • Sample result for Euclidean distance with uniform weighting, 4 historical values, and k = 10:

    Predicted   6:32   6:40   6:42   6:44   6:45
    Actuals     6:32   6:40   6:41   6:43   6:44
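A minimal sketch of the evaluation pieces named above: a cosine-based distance (taken here as 1 minus cosine similarity, which is an assumption about how the similarity was turned into a distance) and an RMSE computed separately for each predicted term. Function names and toy numbers are hypothetical and are not the project's results.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; smaller means more similar."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def rmse_per_term(
    predicted: Sequence[Sequence[float]], actual: Sequence[Sequence[float]]
) -> List[float]:
    """RMSE at each future position, taken across all test sequences."""
    n_terms = len(predicted[0])
    result = []
    for pos in range(n_terms):
        squared = [(p[pos] - a[pos]) ** 2 for p, a in zip(predicted, actual)]
        result.append(math.sqrt(sum(squared) / len(squared)))
    return result


# Two made-up test sequences with two predicted terms each.
predicted = [[1.0, 2.0], [3.0, 5.0]]
actual = [[1.0, 3.0], [3.0, 4.0]]
print(rmse_per_term(predicted, actual))
```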


Approach 2: A Dynamic Probabilistic System

• Predicting all at once has one major flaw: it doesn’t explicitly use the time of the most recent arrival
• The previous method was a purely historical analysis
• A more accurate method would make use of the most recent arrival time when predicting each future value
• Thus, we aim to predict each value in a stepwise manner
• Suppose we had some algorithm that could predict $Y_{j+1}$ from $(Y_1, \dots, Y_j)$
• The overall procedure would then do the following:
  • Given $(Y_1, \dots, Y_j)$, predict $Y_{j+1}$ by using the k-nearest sequences
  • Treat $Y_{j+1}$ as given. Now, predict $Y_{j+2}$ by using the k-nearest sequences
  • Repeat until we predict $Y_n$
• Question: How do we construct this algorithm?


Constructing the Algorithm

We want to predict the time the train reaches the next stop, given the information of $(Y_1, \dots, Y_j)$. We formulate this task as an expected value problem. Suppose […] tells us […].

Observe that […]

So, all we need to do is deduce the value of […]. By definition, […]

Thus, all we need to do is approximate the integral. We will use the $k$ closest sequences to $(Y_1, \dots, Y_j)$ given our distance function $D$ to do so.


Solution: Invoke the Weak Law of Large Numbers

Invoking the Weak Law of Large Numbers (WLLN), we can approximate the integral by using the sample average of the $k$ closest sequences to $(Y_1, \dots, Y_j)$.

Thus, […]
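One way to write the approximation described above, offered as an illustrative reading rather than the exact expression from the slides. It assumes the prediction is the conditional expectation of $Y_{j+1}$, that $f$ denotes an (assumed) conditional density, and that $S^{(1)}, \dots, S^{(k)}$ are the $k$ historical sequences closest to $(Y_1, \dots, Y_j)$ under $D$:

```latex
\mathbb{E}\left[\, Y_{j+1} \mid Y_1, \dots, Y_j \,\right]
  = \int y \, f\!\left(y \mid Y_1, \dots, Y_j\right) dy
  \;\approx\; \frac{1}{k} \sum_{i=1}^{k} S_{j+1}^{(i)}
```

Under this reading, the sample average over the $(j+1)$-th entries of the nearest sequences stands in for the integral.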


Dynamic Probabilistic System Algorithm

• Given $Y = (Y_1, \dots, Y_j)$, repeat until $Y$ has length $n$:
  • Denote the length of $Y$ as $j$
  • Compute the k-nearest sequences to $Y$
  • Predict $Y_{j+1}$
  • Append $Y_{j+1}$ to $Y$
• Note: At every step we re-compute the k-nearest sequences
  • This way we can keep using the most relevant information
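The loop above can be sketched as follows, assuming Euclidean distance and a plain average of the neighbours' next entries as the one-step prediction. Both are assumptions made for illustration, the function names and toy numbers are hypothetical, and the lag-time model is left out.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_next(
    partial: Sequence[float], history: List[Sequence[float]], k: int
) -> float:
    """Average the next entry of the k historical sequences closest to `partial`."""
    j = len(partial)
    candidates = [seq for seq in history if len(seq) > j]
    if not candidates:
        raise ValueError("no historical sequence extends past the current stop")
    candidates.sort(key=lambda seq: euclidean(partial, seq[:j]))
    nearest = candidates[:k]
    return sum(seq[j] for seq in nearest) / len(nearest)


def complete_stepwise(
    partial: Sequence[float], history: List[Sequence[float]], k: int, n: int
) -> List[float]:
    """Predict one stop at a time, re-computing the neighbours at every step."""
    seq = list(partial)
    while len(seq) < n:
        # The newly predicted value is treated as given on the next pass.
        seq.append(predict_next(seq, history, k))
    return seq


# Made-up real-valued arrival times for one directional path.
history = [
    [0, 4, 8, 12],
    [0, 4, 7, 11],
    [0, 9, 15, 20],
]
print(complete_stepwise([0, 4], history, k=2, n=4))
```

Because each predicted value is appended before the next neighbour search, the set of nearest sequences can change from one step to the next.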