.. _batch:

Batch aggregate feature jobs
============================

In the previous section, we went over the core concepts of the aggregation framework and discussed how you can set up your own `AggregateGroups` to compute aggregate features.

Given these groups, this section discusses how to set up offline batch jobs that produce the corresponding aggregate features, updated daily. To accomplish this, we need to set up a summingbird-scalding job pointed at the input data records containing the features and labels to be aggregated.

Input Data
----------

In order to generate aggregate features, the relevant input features need to be available offline as a daily scalding source in `DataRecord` format (typically `DailySuffixFeatureSource <https://cgit.twitter.biz/source/tree/src/scala/com/twitter/ml/api/FeatureSource.scala>`_; `HourlySuffixFeatureSource` may also work, but we have not tested it).

.. admonition:: Note

  The input data source should contain the keys, features and labels you want to use in your `AggregateGroups`.
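
For reference, here is a minimal sketch of reading such a daily source. It assumes the usual `FeatureSource` reading pattern (the same one used for `AggregatesV2FeatureSource` later on this page); the HDFS path and `dateRange` are placeholders for your own values.

.. code-block:: scala

  import com.twitter.ml.api.DailySuffixFeatureSource
  import com.twitter.ml.api.DataSetPipe

  // Placeholder path: point this at the daily DataRecords containing your
  // keys, features and labels. `dateRange` is assumed to be a scalding
  // DateRange available in your job.
  val inputData: DataSetPipe =
    DailySuffixFeatureSource("/user/timelines/processed/suggests/recap/data_records")(dateRange).read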

Aggregation Config
------------------

Now that we have a daily data source with input features and labels, we need to set up the `AggregateGroup` config itself. This config contains all the aggregation groups you would like to compute, and we will go through the implementation step by step.

.. admonition:: Example: Timelines Quality config

  `TimelinesAggregationConfig <https://cgit.twitter.biz/source/tree/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfig.scala>`_ imports the configured `AggregationGroups` from `TimelinesAggregationConfigDetails <https://cgit.twitter.biz/source/tree/src/scala/com/twitter/timelines/prediction/common/aggregates/TimelinesAggregationConfigDetails.scala>`_. The config is then referenced by the implementing summingbird-scalding job, which we will set up below.
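
For orientation, here is a minimal sketch of what such a config object might look like. It is not the framework's exact API: the object name and the `AggregateGroup` (`userAggregatesV2`) are placeholders, and the expansion into `TypedAggregateGroup`\ s via `buildTypedAggregateGroups()` is an assumption based on how `aggregatesToCompute` is consumed by the scalding job below.

.. code-block:: scala

  import com.twitter.timelines.data_processing.ml_util.aggregation_framework._

  object MyProjectAggregationConfig {
    // `userAggregatesV2` stands for an AggregateGroup defined as described in
    // the previous section; each group is expanded into its typed aggregate
    // groups, which is what the batch job consumes.
    val aggregatesToCompute: Set[TypedAggregateGroup[_]] =
      Set(userAggregatesV2).flatMap(_.buildTypedAggregateGroups())
  }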

OfflineAggregateSource
----------------------

Each `AggregateGroup` needs to define a (daily) source of input features. We use `OfflineAggregateSource` for this: it tells the aggregation framework where the input data set lives and which timestamp feature the framework should use to decay aggregate feature values:

.. code-block:: scala

  val timelinesDailyRecapSource = OfflineAggregateSource(
    name = "timelines_daily_recap",
    timestampFeature = TIMESTAMP,
    scaldingHdfsPath = Some("/user/timelines/processed/suggests/recap/data_records"),
    scaldingSuffixType = Some("daily"),
    withValidation = true
  )

.. admonition:: Note

  .. cssclass:: shortlist

  #. The name is not important as long as it is unique.
  #. `timestampFeature` must be a discrete feature of type `com.twitter.ml.api.Feature[Long]` representing the “time” of a given training record in milliseconds: for example, the time at which the engagement, push open or abuse event you are training on took place. If your daily training data does not already have such a feature, you need to add one (see the sketch after this note).
  #. `scaldingSuffixType` can be “hourly” or “daily” depending on the type of source (`HourlySuffixFeatureSource` vs `DailySuffixFeatureSource`).
  #. Set `withValidation` to true to validate the presence of a _SUCCESS file. Context: https://jira.twitter.biz/browse/TQ-10618
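
If you need to add such a timestamp feature, here is a minimal sketch of declaring one with the standard `com.twitter.ml.api.Feature` API. The feature name `"timestamp"` is an assumption: it must match whatever name is used in your DataRecords, and its value must be in milliseconds.

.. code-block:: scala

  import com.twitter.ml.api.Feature

  // Discrete (Long-valued) feature holding the record's event time in
  // milliseconds; referenced as `timestampFeature` in OfflineAggregateSource.
  val TIMESTAMP: Feature[java.lang.Long] = new Feature.Discrete("timestamp")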

Output HDFS store
-----------------

The output HDFS store is where the computed aggregate features are stored. This store contains all computed aggregate feature values and is incrementally updated by the aggregates job every day.

.. code-block:: scala

  val outputHdfsPath = "/user/timelines/processed/aggregates_v2"
  val timelinesOfflineAggregateSink = new OfflineStoreCommonConfig {
    override def apply(startDate: String) = new OfflineAggregateStoreCommonConfig(
      outputHdfsPathPrefix = outputHdfsPath,
      dummyAppId = "timelines_aggregates_v2_ro", // unused - can be arbitrary
      dummyDatasetPrefix = "timelines_aggregates_v2_ro", // unused - can be arbitrary
      startDate = startDate
    )
  }

Note: `dummyAppId` and `dummyDatasetPrefix` are unused, so they can be set to any arbitrary value (they should eventually be removed on the framework side).

The `outputHdfsPathPrefix` is the only field that matters: set it to the HDFS path where you want to store the aggregate features, and make sure you have plenty of quota available at that path.
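
To connect this sink to a concrete output store, each `AggregateGroup`'s store definition (covered in the previous section) references it via its `commonConfig` field. As a reminder, here is a sketch along those lines; the store name `user_aggregates`, the `startDate` and `batchesToKeep` values are illustrative, so double-check them against your own config.

.. code-block:: scala

  // Output store for the example AggregateGroup; its `name` becomes the
  // output directory under outputHdfsPathPrefix (see "Output Aggregate
  // Features" below), and `startDate` should match the --start-time used
  // for the initial run.
  val USER_AGGREGATE_STORE = OfflineAggregateDataRecordStore(
    name = "user_aggregates",
    startDate = "2017-03-03 00:00:00",
    commonConfig = timelinesOfflineAggregateSink,
    batchesToKeep = 6
  )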

Setting Up Aggregates Job
-------------------------

Once you have defined a config file with the aggregates you would like to compute, the next step is to create the aggregates scalding job using that config (`example <https://cgit.twitter.biz/source/tree/timelines/data_processing/ad_hoc/aggregate_interactions/v2/offline_aggregation/TimelinesAggregationScaldingJob.scala>`_). This is very concise and requires only a few lines of code:

.. code-block:: scala

  object TimelinesAggregationScaldingJob extends AggregatesV2ScaldingJob {
    override val aggregatesToCompute = TimelinesAggregationConfig.aggregatesToCompute
  }

Now that the scalding job is implemented with the aggregation config, we need to set up a capesos config similar to https://cgit.twitter.biz/source/tree/science/scalding/mesos/timelines/prod.yml:

.. code-block:: yaml

  # Common configuration shared by all aggregates v2 jobs
  __aggregates_v2_common__: &__aggregates_v2_common__
    class: HadoopSummingbirdProducer
    bundle: offline_aggregation-deploy.tar.gz
    mainjar: offline_aggregation-deploy.jar
    pants_target: "bundle timelines/data_processing/ad_hoc/aggregate_interactions/v2/offline_aggregation:bin"
    cron_collision_policy: CANCEL_NEW
    use_libjar_wild_card: true

.. code-block:: yaml

  # Specific job computing user aggregates
  user_aggregates_v2:
    <<: *__aggregates_v2_common__
    cron_schedule: "25 * * * *"
    arguments: --batches 1 --output_stores user_aggregates --job_name timelines_user_aggregates_v2

.. admonition:: Important

  Each `AggregateGroup` in your config should have its own associated offline job, which specifies `output_stores` pointing to the output store name you defined in your config.

Running The Job
---------------

When you run the batch job for the first time, you need to add a temporary entry to your capesos yml file that looks like this:

.. code-block:: yaml

  user_aggregates_v2_initial_run:
    <<: *__aggregates_v2_common__
    cron_schedule: "25 * * * *"
    arguments: --batches 1 --start-time 2017-03-03 00:00:00 --output_stores user_aggregates --job_name timelines_user_aggregates_v2

.. admonition:: Start Time

  The additional `--start-time` argument should match the `startDate` in your config for that `AggregateGroup`, but in the format `yyyy-mm-dd hh:mm:ss`.

To invoke the initial run via capesos, we would do the following (in the Timelines case):

.. code-block:: bash

  CAPESOSPY_ENV=prod capesospy-v2 update --build_locally --start_cron user_aggregates_v2_initial_run science/scalding/mesos/timelines/prod.yml

Once it is running smoothly, you can deschedule the initial run job and delete the temporary entry from your production yml config:

.. code-block:: bash

  aurora cron deschedule atla/timelines/prod/user_aggregates_v2_initial_run

Note: deschedule the initial run preemptively to avoid repeatedly overwriting the same initial results.

Then schedule the production job from jenkins using something like this:

.. code-block:: bash

  CAPESOSPY_ENV=prod capesospy-v2 update user_aggregates_v2 science/scalding/mesos/timelines/prod.yml

All future runs (from the second run onwards) will use the permanent entry in the capesos yml config, which does not have `--start-time` specified.

.. admonition:: Job name has to match

  It is important that the production run shares the same `--job_name` as the initial run so that eagleeye/statebird can keep track of it correctly.

Output Aggregate Features
-------------------------

This scalding job, using the example config from the earlier section, would output a `VersionedKeyValSource` to `/user/timelines/processed/aggregates_v2/user_aggregates` on HDFS.

Note that `/user/timelines/processed/aggregates_v2` is the explicitly defined root path, while `user_aggregates` is the output directory of the example `AggregateGroup` defined earlier. The latter can be different for different `AggregateGroups` defined in your config.

The `VersionedKeyValSource` is difficult to use directly in your jobs/offline trainings, but we provide an adapted source, `AggregatesV2FeatureSource`, that makes it easy to join and use in your jobs:

.. code-block:: scala

  import com.twitter.timelines.data_processing.ml_util.aggregation_framework.conversion._

  val pipe: DataSetPipe = AggregatesV2FeatureSource(
    rootPath = "/user/timelines/processed/aggregates_v2",
    storeName = "user_aggregates",
    aggregates = TimelinesAggregationConfig.aggregatesToCompute,
    trimThreshold = 0
  )(dateRange).read

Simply replace `rootPath`, `storeName` and the `aggregates` object with whatever you defined. The `trimThreshold` tells the framework to trim all features below a certain cutoff: 0 is a safe default to begin with.

.. admonition:: Usage

  This can now be used like any other `DataSetPipe` in offline ML jobs. You can write out the features to a `DailySuffixFeatureSource`, join them with your data offline for trainings, or write them to a Manhattan store for serving online.

Aggregate Features Example
--------------------------

Here is a sample of the aggregate features we just computed:

.. code-block:: text

  user_aggregate_v2.pair.any_label.any_feature.50.days.count: 100.0
  user_aggregate_v2.pair.any_label.tweetsource.is_quote.50.days.count: 30.0
  user_aggregate_v2.pair.is_favorited.any_feature.50.days.count: 10.0
  user_aggregate_v2.pair.is_favorited.tweetsource.is_quote.50.days.count: 6.0
  meta.user_id: 123456789

Aggregate feature names match a `prefix.pair.label.feature.half_life.metric` schema and correspond to what was defined in the aggregation config for each of these fields.
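
To make the schema concrete, here is a small illustrative snippet (not framework code) that rebuilds one of the names above from its `prefix.pair.label.feature.half_life.metric` fields:

.. code-block:: scala

  // Illustrative only: decomposing an aggregate feature name into its parts.
  val prefix = "user_aggregate_v2"
  val label = "is_favorited"            // the label being aggregated
  val feature = "tweetsource.is_quote"  // the feature being conditioned on
  val halfLife = "50.days"              // the configured half life
  val metric = "count"                  // the aggregation metric
  val featureName = s"$prefix.pair.$label.$feature.$halfLife.$metric"
  // => "user_aggregate_v2.pair.is_favorited.tweetsource.is_quote.50.days.count"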

.. admonition:: Example

  In this example, the above features capture that userId 123456789L has:

  - A 50-day decayed count of 100 training records with any label or feature (“tweet impressions”)
  - A 50-day decayed count of 30 records that are “quote tweets” (tweetsource.is_quote = true)
  - A 50-day decayed count of 10 records that are favorites on any type of tweet (is_favorited = true)
  - A 50-day decayed count of 6 records that are “favorites” on “quote tweets” (both of the above are true)

  By combining the above, a model might infer that for this specific user, quote tweets make up 30% of all impressions and have a favorite rate of 6/30 = 20%, compared to a favorite rate of 10/100 = 10% on the total population of tweets.

  Therefore, being a quote tweet makes this specific user `123456789L` approximately twice as likely to favorite the tweet, which is useful for prediction and could result in the ML model giving higher scores to quote tweets and ranking them higher in a personalized fashion for this user.

Tests for Feature Names
-----------------------

When you change or add an `AggregateGroup`, feature names might change. The Feature Store provides a testing mechanism to assert that the feature names change as you expect. See `tests for feature names <https://docbird.twitter.biz/ml_feature_store/catalog.html#tests-for-feature-names>`_.