Compare commits


5 Commits

Author SHA1 Message Date
twitter-team
90d7ea370e README updates: representation-manager and representation-scorer 2023-04-28 14:18:30 -05:00
twitter-team
5edbbeedb3 Open-sourcing Representation Scorer
Representation Scorer (RSX) serves as a centralized scoring system, offering SimClusters or other embedding-based scoring solutions as machine learning features.
2023-04-28 14:18:16 -05:00
twitter-team
43cdcf2ed6 Open-sourcing Representation Manager
Representation Manager (RMS) serves as a centralized embedding management system, providing SimClusters or other embeddings as a facade over the underlying storage or services.
2023-04-28 14:17:58 -05:00
twitter-team
197bf2c563 Open-sourcing Timelines Aggregation Framework
Open-sourcing the Aggregation Framework, a config-driven, Summingbird-based framework for generating real-time and batch aggregate features to be consumed by ML models.
2023-04-28 14:17:02 -05:00
twitter-team
b5e849b029 User Signals in Candidate Sourcing Stage
Add an overview README about how Twitter uses user signals in candidate retrieval.
2023-04-28 14:16:22 -05:00
230 changed files with 22465 additions and 0 deletions


@@ -18,8 +18,11 @@ Product surfaces at Twitter are built on a shared set of data, models, and softw
| | [recos-injector](recos-injector/README.md) | Streaming event processor for building input streams for [GraphJet](https://github.com/twitter/GraphJet) based services. |
| | [graph-feature-service](graph-feature-service/README.md) | Serves graph features for a directed pair of Users (e.g. how many of User A's following liked Tweets from User B). |
| | [topic-social-proof](topic-social-proof/README.md) | Identifies topics related to individual Tweets. |
| | [representation-scorer](representation-scorer/README.md) | Computes scores between pairs of entities (Users, Tweets, etc.) using embedding similarity. |
| Software framework | [navi](navi/README.md) | High-performance machine learning model serving, written in Rust. |
| | [product-mixer](product-mixer/README.md) | Software framework for building feeds of content. |
| | [timelines-aggregation-framework](timelines/data_processing/ml_util/aggregation_framework/README.md) | Framework for generating aggregate features in batch or real time. |
| | [representation-manager](representation-manager/README.md) | Service to retrieve embeddings (i.e. SimClusters and TwHIN). |
| | [twml](twml/README.md) | Legacy machine learning framework built on TensorFlow v1. |
The product surface currently included in this repository is the For You Timeline.

RETREIVAL_SIGNALS.md (new file, 51 lines)

@@ -0,0 +1,51 @@
# Signals for Candidate Sources
## Overview
The candidate sourcing stage of the Twitter recommendation algorithm narrows the candidate pool from approximately 1 billion Tweets down to just a few thousand. It uses Twitter user behavior as the primary input. This document enumerates all the signals used during the candidate sourcing phase.
| Signals | Description |
| :-------------------- | :-------------------------------------------------------------------- |
| Author Follow | The accounts the user explicitly follows. |
| Author Unfollow | The accounts the user recently unfollowed. |
| Author Mute | The accounts the user has muted. |
| Author Block | The accounts the user has blocked. |
| Tweet Favorite | The Tweets on which the user clicked the like button. |
| Tweet Unfavorite | The Tweets on which the user clicked the unlike button. |
| Retweet | The Tweets the user Retweeted. |
| Quote Tweet | The Tweets the user Retweeted with comments. |
| Tweet Reply | The Tweets the user replied to. |
| Tweet Share | The Tweets on which the user clicked the share button. |
| Tweet Bookmark | The Tweets on which the user clicked the bookmark button. |
| Tweet Click | The Tweets the user clicked through to view the Tweet detail page. |
| Tweet Video Watch | The video Tweets of which the user watched a certain number of seconds or percentage. |
| Tweet Don't like | The Tweets on which the user clicked the "Not interested in this tweet" button. |
| Tweet Report | The Tweets on which the user clicked the "Report Tweet" button. |
| Notification Open | The push notification Tweets the user opened. |
| Ntab click | The Tweets the user clicked on the Notifications page. |
| User AddressBook | The account identifiers of the authors in the user's address book. |
## Usage Details
Twitter uses these user signals as training labels and/or ML features in each candidate sourcing algorithm. The following table shows how they are used in each component.
| Signals | USS | SimClusters | TwHin | UTEG | FRS | Light Ranking |
| :-------------------- | :----------------- | :----------------- | :----------------- | :----------------- | :----------------- | :----------------- |
| Author Follow | Features | Features / Labels | Features / Labels | Features | Features / Labels | N/A |
| Author Unfollow | Features | N/A | N/A | N/A | N/A | N/A |
| Author Mute | Features | N/A | N/A | N/A | Features | N/A |
| Author Block | Features | N/A | N/A | N/A | Features | N/A |
| Tweet Favorite | Features | Features | Features / Labels | Features | Features / Labels | Features / Labels |
| Tweet Unfavorite | Features | Features | N/A | N/A | N/A | N/A |
| Retweet | Features | N/A | Features / Labels | Features | Features / Labels | Features / Labels |
| Quote Tweet | Features | N/A | Features / Labels | Features | Features / Labels | Features / Labels |
| Tweet Reply | Features | N/A | Features | Features | Features / Labels | Features |
| Tweet Share | Features | N/A | N/A | N/A | Features | N/A |
| Tweet Bookmark | Features | N/A | N/A | N/A | N/A | N/A |
| Tweet Click | Features | N/A | N/A | N/A | Features | Labels |
| Tweet Video Watch | Features | Features | N/A | N/A | N/A | Labels |
| Tweet Don't like | Features | N/A | N/A | N/A | N/A | N/A |
| Tweet Report | Features | N/A | N/A | N/A | N/A | N/A |
| Notification Open | Features | Features | Features | N/A | Features | N/A |
| Ntab click | Features | Features | Features | N/A | Features | N/A |
| User AddressBook | N/A | N/A | N/A | N/A | Features | N/A |
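The usage matrix above can be read as a lookup from a (signal, component) pair to how that signal is used. A minimal sketch in Scala, with hypothetical names (`Usage`, `usageOf`) and only a few rows transcribed from the table:

```scala
// Hypothetical encoding of a few rows of the usage table above; the names
// here are illustrative, not from any Twitter API.
sealed trait Usage
case object NotUsed extends Usage
case object FeaturesOnly extends Usage
case object FeaturesAndLabels extends Usage

val signalUsage: Map[(String, String), Usage] = Map(
  ("Author Follow", "SimClusters")  -> FeaturesAndLabels,
  ("Tweet Favorite", "SimClusters") -> FeaturesOnly,
  ("Tweet Favorite", "TwHin")       -> FeaturesAndLabels,
  ("User AddressBook", "FRS")       -> FeaturesOnly
)

// Anything not listed in the table is N/A for that component.
def usageOf(signal: String, component: String): Usage =
  signalUsage.getOrElse((signal, component), NotUsed)
```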


@@ -0,0 +1 @@
# This prevents SQ query from grabbing //:all since it traverses up once to find a BUILD


@@ -0,0 +1,4 @@
# Representation Manager #
**Representation Manager** (RMS) serves as a centralized embedding management system, providing SimClusters or other embeddings as a facade over the underlying storage or services.


@@ -0,0 +1,4 @@
#!/usr/bin/env bash
# Deploy representation-manager via the shared relevance-platform deploy tool,
# forwarding any extra CLI arguments ("$@").
JOB=representation-manager bazel run --ui_event_filters=-info,-stdout,-stderr --noshow_progress \
  //relevance-platform/src/main/python/deploy -- "$@"


@@ -0,0 +1,17 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finatra/inject/inject-thrift-client",
"frigate/frigate-common/src/main/scala/com/twitter/frigate/common/store/strato",
"hermit/hermit-core/src/main/scala/com/twitter/hermit/store/common",
"relevance-platform/src/main/scala/com/twitter/relevance_platform/common/readablestore",
"representation-manager/client/src/main/scala/com/twitter/representation_manager/config",
"representation-manager/server/src/main/thrift:thrift-scala",
"src/scala/com/twitter/simclusters_v2/common",
"src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala",
"stitch/stitch-storehaus",
"strato/src/main/scala/com/twitter/strato/client",
],
)


@@ -0,0 +1,208 @@
package com.twitter.representation_manager
import com.twitter.finagle.memcached.{Client => MemcachedClient}
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.frigate.common.store.strato.StratoFetchableStore
import com.twitter.hermit.store.common.ObservedCachedReadableStore
import com.twitter.hermit.store.common.ObservedReadableStore
import com.twitter.representation_manager.config.ClientConfig
import com.twitter.representation_manager.config.DisabledInMemoryCacheParams
import com.twitter.representation_manager.config.EnabledInMemoryCacheParams
import com.twitter.representation_manager.thriftscala.SimClustersEmbeddingView
import com.twitter.simclusters_v2.common.SimClustersEmbedding
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.LocaleEntityId
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.TopicId
import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding}
import com.twitter.storehaus.ReadableStore
import com.twitter.strato.client.{Client => StratoClient}
import com.twitter.strato.thrift.ScroogeConvImplicits._
/**
 * Offers helpers to build readable stores for a given
 * SimClustersEmbeddingView (i.e. embeddingType and modelVersion). It applies the
 * ClientConfig for a particular service and builds ReadableStores that implement that config.
 */
class StoreBuilder(
clientConfig: ClientConfig,
stratoClient: StratoClient,
memCachedClient: MemcachedClient,
globalStats: StatsReceiver,
) {
private val stats =
globalStats.scope("representation_manager_client").scope(this.getClass.getSimpleName)
// Column consts
private val ColPathPrefix = "recommendations/representation_manager/"
private val SimclustersTweetColPath = ColPathPrefix + "simClustersEmbedding.Tweet"
private val SimclustersUserColPath = ColPathPrefix + "simClustersEmbedding.User"
private val SimclustersTopicIdColPath = ColPathPrefix + "simClustersEmbedding.TopicId"
private val SimclustersLocaleEntityIdColPath =
ColPathPrefix + "simClustersEmbedding.LocaleEntityId"
def buildSimclustersTweetEmbeddingStore(
embeddingColumnView: SimClustersEmbeddingView
): ReadableStore[Long, SimClustersEmbedding] = {
val rawStore = StratoFetchableStore
.withView[Long, SimClustersEmbeddingView, ThriftSimClustersEmbedding](
stratoClient,
SimclustersTweetColPath,
embeddingColumnView)
.mapValues(SimClustersEmbedding(_))
addCacheLayer(rawStore, embeddingColumnView)
}
def buildSimclustersUserEmbeddingStore(
embeddingColumnView: SimClustersEmbeddingView
): ReadableStore[Long, SimClustersEmbedding] = {
val rawStore = StratoFetchableStore
.withView[Long, SimClustersEmbeddingView, ThriftSimClustersEmbedding](
stratoClient,
SimclustersUserColPath,
embeddingColumnView)
.mapValues(SimClustersEmbedding(_))
addCacheLayer(rawStore, embeddingColumnView)
}
def buildSimclustersTopicIdEmbeddingStore(
embeddingColumnView: SimClustersEmbeddingView
): ReadableStore[TopicId, SimClustersEmbedding] = {
val rawStore = StratoFetchableStore
.withView[TopicId, SimClustersEmbeddingView, ThriftSimClustersEmbedding](
stratoClient,
SimclustersTopicIdColPath,
embeddingColumnView)
.mapValues(SimClustersEmbedding(_))
addCacheLayer(rawStore, embeddingColumnView)
}
def buildSimclustersLocaleEntityIdEmbeddingStore(
embeddingColumnView: SimClustersEmbeddingView
): ReadableStore[LocaleEntityId, SimClustersEmbedding] = {
val rawStore = StratoFetchableStore
.withView[LocaleEntityId, SimClustersEmbeddingView, ThriftSimClustersEmbedding](
stratoClient,
SimclustersLocaleEntityIdColPath,
embeddingColumnView)
.mapValues(SimClustersEmbedding(_))
addCacheLayer(rawStore, embeddingColumnView)
}
def buildSimclustersTweetEmbeddingStoreWithEmbeddingIdAsKey(
embeddingColumnView: SimClustersEmbeddingView
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val rawStore = StratoFetchableStore
.withView[Long, SimClustersEmbeddingView, ThriftSimClustersEmbedding](
stratoClient,
SimclustersTweetColPath,
embeddingColumnView)
.mapValues(SimClustersEmbedding(_))
val embeddingIdAsKeyStore = rawStore.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.TweetId(tweetId)) =>
tweetId
}
addCacheLayer(embeddingIdAsKeyStore, embeddingColumnView)
}
def buildSimclustersUserEmbeddingStoreWithEmbeddingIdAsKey(
embeddingColumnView: SimClustersEmbeddingView
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val rawStore = StratoFetchableStore
.withView[Long, SimClustersEmbeddingView, ThriftSimClustersEmbedding](
stratoClient,
SimclustersUserColPath,
embeddingColumnView)
.mapValues(SimClustersEmbedding(_))
val embeddingIdAsKeyStore = rawStore.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.UserId(userId)) =>
userId
}
addCacheLayer(embeddingIdAsKeyStore, embeddingColumnView)
}
def buildSimclustersTopicEmbeddingStoreWithEmbeddingIdAsKey(
embeddingColumnView: SimClustersEmbeddingView
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val rawStore = StratoFetchableStore
.withView[TopicId, SimClustersEmbeddingView, ThriftSimClustersEmbedding](
stratoClient,
SimclustersTopicIdColPath,
embeddingColumnView)
.mapValues(SimClustersEmbedding(_))
val embeddingIdAsKeyStore = rawStore.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.TopicId(topicId)) =>
topicId
}
addCacheLayer(embeddingIdAsKeyStore, embeddingColumnView)
}
def buildSimclustersTopicIdEmbeddingStoreWithEmbeddingIdAsKey(
embeddingColumnView: SimClustersEmbeddingView
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val rawStore = StratoFetchableStore
.withView[TopicId, SimClustersEmbeddingView, ThriftSimClustersEmbedding](
stratoClient,
SimclustersTopicIdColPath,
embeddingColumnView)
.mapValues(SimClustersEmbedding(_))
val embeddingIdAsKeyStore = rawStore.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.TopicId(topicId)) =>
topicId
}
addCacheLayer(embeddingIdAsKeyStore, embeddingColumnView)
}
def buildSimclustersLocaleEntityIdEmbeddingStoreWithEmbeddingIdAsKey(
embeddingColumnView: SimClustersEmbeddingView
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val rawStore = StratoFetchableStore
.withView[LocaleEntityId, SimClustersEmbeddingView, ThriftSimClustersEmbedding](
stratoClient,
SimclustersLocaleEntityIdColPath,
embeddingColumnView)
.mapValues(SimClustersEmbedding(_))
val embeddingIdAsKeyStore = rawStore.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.LocaleEntityId(localeEntityId)) =>
localeEntityId
}
addCacheLayer(embeddingIdAsKeyStore, embeddingColumnView)
}
private def addCacheLayer[K](
rawStore: ReadableStore[K, SimClustersEmbedding],
embeddingColumnView: SimClustersEmbeddingView,
): ReadableStore[K, SimClustersEmbedding] = {
// Add in-memory caching based on ClientConfig
val inMemCacheParams = clientConfig.inMemoryCacheConfig
.getCacheSetup(embeddingColumnView.embeddingType, embeddingColumnView.modelVersion)
val statsPerStore = stats
.scope(embeddingColumnView.embeddingType.name).scope(embeddingColumnView.modelVersion.name)
inMemCacheParams match {
case DisabledInMemoryCacheParams =>
ObservedReadableStore(
store = rawStore
)(statsPerStore)
case EnabledInMemoryCacheParams(ttl, maxKeys, cacheName) =>
ObservedCachedReadableStore.from[K, SimClustersEmbedding](
rawStore,
ttl = ttl,
maxKeys = maxKeys,
cacheName = cacheName,
windowSize = 10000L
)(statsPerStore)
}
}
}
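The builder methods above all follow one composition pattern: a raw Strato-backed store is wrapped with a value transform (`mapValues`) and, for the `WithEmbeddingIdAsKey` variants, a key transform (`composeKeyMapping`) before the cache layer is added. A self-contained sketch of that pattern, using a toy `SimpleStore` trait as a stand-in for `com.twitter.storehaus.ReadableStore` (the real one is asynchronous):

```scala
// Toy stand-in for ReadableStore: synchronous, single abstract method.
trait SimpleStore[K, V] { self =>
  def get(k: K): Option[V]

  // Transform values on the way out, like .mapValues(SimClustersEmbedding(_)).
  def mapValues[V2](f: V => V2): SimpleStore[K, V2] =
    (k: K) => self.get(k).map(f)

  // Accept a richer key type, like .composeKeyMapping[SimClustersEmbeddingId].
  def composeKeyMapping[K2](f: K2 => K): SimpleStore[K2, V] =
    (k2: K2) => self.get(f(k2))
}

// Raw store: tweetId -> raw payload (here just a String).
val rawStore: SimpleStore[Long, String] =
  (id: Long) => if (id == 1L) Some("embedding-for-tweet-1") else None

// Hypothetical richer key, mirroring SimClustersEmbeddingId.
case class EmbeddingId(tweetId: Long)

val store: SimpleStore[EmbeddingId, String] =
  rawStore
    .mapValues(_.toUpperCase)                  // "decode" step
    .composeKeyMapping[EmbeddingId](_.tweetId) // richer-key step
```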


@@ -0,0 +1,12 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finatra/inject/inject-thrift-client",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/common",
"representation-manager/server/src/main/thrift:thrift-scala",
"src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala",
"strato/src/main/scala/com/twitter/strato/client",
],
)


@@ -0,0 +1,25 @@
package com.twitter.representation_manager.config
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.ModelVersion
/*
 * The RMS client config class.
 * For now we only support setting up in-memory cache params, but we expect to enable other
 * customisations in the near future, e.g. request timeout.
 *
 * --------------------------------------------
 * PLEASE NOTE:
 * An in-memory cache is not necessarily a free performance win; anyone considering one should
 * investigate rather than blindly enabling it.
 * */
class ClientConfig(inMemCacheParamsOverrides: Map[
(EmbeddingType, ModelVersion),
InMemoryCacheParams
] = Map.empty) {
// In memory cache config per embedding
val inMemCacheParams = DefaultInMemoryCacheConfig.cacheParamsMap ++ inMemCacheParamsOverrides
val inMemoryCacheConfig = new InMemoryCacheConfig(inMemCacheParams)
}
object DefaultClientConfig extends ClientConfig
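The override precedence in `ClientConfig` comes from Scala's `Map` concatenation: in `defaults ++ overrides`, right-hand entries win on key collisions. A hedged sketch with plain-string stand-ins for the thrift `EmbeddingType`/`ModelVersion` keys and simplified param values:

```scala
// Defaults, keyed by (embeddingType, modelVersion); values are simplified
// stand-ins for InMemoryCacheParams.
val defaultParams: Map[(String, String), String] = Map(
  ("LogFavBasedTweet", "Model20m145k2020") -> "disabled",
  ("FavTfgTopic", "Model20m145k2020")      -> "disabled"
)

// A caller-supplied override for one embedding.
val overrides: Map[(String, String), String] = Map(
  ("LogFavBasedTweet", "Model20m145k2020") -> "ttl=10s,maxKeys=1000"
)

// The right-hand operand of ++ wins, mirroring
// DefaultInMemoryCacheConfig.cacheParamsMap ++ inMemCacheParamsOverrides.
val merged = defaultParams ++ overrides
```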


@@ -0,0 +1,53 @@
package com.twitter.representation_manager.config
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.ModelVersion
import com.twitter.util.Duration
/*
 * --------------------------------------------
 * PLEASE NOTE:
 * An in-memory cache is not necessarily a free performance win; anyone considering one should
 * investigate rather than blindly enabling it.
 * --------------------------------------------
 * */
sealed trait InMemoryCacheParams
/*
 * Holds the params required to set up an in-memory cache for a single embedding store.
 */
case class EnabledInMemoryCacheParams(
ttl: Duration,
maxKeys: Int,
cacheName: String)
extends InMemoryCacheParams
object DisabledInMemoryCacheParams extends InMemoryCacheParams
/*
 * The class for the in-memory cache config. Clients can pass in their own cacheParamsMap to
 * create a new InMemoryCacheConfig instead of using the DefaultInMemoryCacheConfig object below.
 * */
class InMemoryCacheConfig(
cacheParamsMap: Map[
(EmbeddingType, ModelVersion),
InMemoryCacheParams
] = Map.empty) {
def getCacheSetup(
embeddingType: EmbeddingType,
modelVersion: ModelVersion
): InMemoryCacheParams = {
// When requested embedding type doesn't exist, we return DisabledInMemoryCacheParams
cacheParamsMap.getOrElse((embeddingType, modelVersion), DisabledInMemoryCacheParams)
}
}
/*
* Default config for the in-memory cache
* Clients can directly import and use this one if they don't want to set up a customised config
* */
object DefaultInMemoryCacheConfig extends InMemoryCacheConfig {
  // Default to no in-memory caching; the explicit type keeps the empty map
  // usable as the base for client overrides.
  val cacheParamsMap: Map[(EmbeddingType, ModelVersion), InMemoryCacheParams] = Map.empty
}
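`getCacheSetup` is a plain map lookup with the disabled-cache sentinel as the fallback. A self-contained sketch of that behavior, with simplified stand-in types rather than the real thrift enums:

```scala
// Simplified stand-ins for InMemoryCacheParams and its two cases.
sealed trait CacheParams
case class Enabled(ttlSeconds: Int, maxKeys: Int) extends CacheParams
case object Disabled extends CacheParams

class CacheConfig(paramsMap: Map[(String, String), CacheParams]) {
  // Fall back to Disabled when nothing is registered for the pair.
  def getCacheSetup(embeddingType: String, modelVersion: String): CacheParams =
    paramsMap.getOrElse((embeddingType, modelVersion), Disabled)
}

val config = new CacheConfig(Map(
  ("LogFavBasedTweet", "Model20m145k2020") -> Enabled(ttlSeconds = 10, maxKeys = 1000)
))
```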


@@ -0,0 +1,21 @@
jvm_binary(
name = "bin",
basename = "representation-manager",
main = "com.twitter.representation_manager.RepresentationManagerFedServerMain",
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finatra/inject/inject-logback/src/main/scala",
"loglens/loglens-logback/src/main/scala/com/twitter/loglens/logback",
"representation-manager/server/src/main/resources",
"representation-manager/server/src/main/scala/com/twitter/representation_manager",
"twitter-server/logback-classic/src/main/scala",
],
)
# Aurora Workflows build phase convention requires a jvm_app named with ${project-name}-app
jvm_app(
name = "representation-manager-app",
archive = "zip",
binary = ":bin",
)


@@ -0,0 +1,7 @@
resources(
sources = [
"*.xml",
"config/*.yml",
],
tags = ["bazel-compatible"],
)


@@ -0,0 +1,219 @@
# ---------- traffic percentage by embedding type and model version ----------
# Decider strings are built dynamically following the rule
# s"enable_${embeddingType.name}_${modelVersion.name}",
# so this file should be updated accordingly if usage changes in the embedding stores
# Tweet embeddings
"enable_LogFavBasedTweet_Model20m145k2020":
comment: "Enable x% read traffic (0<=x<=10000, e.g. 1000=10%) for LogFavBasedTweet - Model20m145k2020. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavBasedTweet_Model20m145kUpdated":
comment: "Enable x% read traffic (0<=x<=10000, e.g. 1000=10%) for LogFavBasedTweet - Model20m145kUpdated. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavLongestL2EmbeddingTweet_Model20m145k2020":
comment: "Enable x% read traffic (0<=x<=10000, e.g. 1000=10%) for LogFavLongestL2EmbeddingTweet - Model20m145k2020. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavLongestL2EmbeddingTweet_Model20m145kUpdated":
comment: "Enable x% read traffic (0<=x<=10000, e.g. 1000=10%) for LogFavLongestL2EmbeddingTweet - Model20m145kUpdated. 0 means return EMPTY for all requests."
default_availability: 10000
# Topic embeddings
"enable_FavTfgTopic_Model20m145k2020":
comment: "Enable the read traffic to FavTfgTopic - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavBasedKgoApeTopic_Model20m145k2020":
comment: "Enable the read traffic to LogFavBasedKgoApeTopic - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
# User embeddings - KnownFor
"enable_FavBasedProducer_Model20m145kUpdated":
comment: "Enable the read traffic to FavBasedProducer - Model20m145kUpdated from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_FavBasedProducer_Model20m145k2020":
comment: "Enable the read traffic to FavBasedProducer - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_FollowBasedProducer_Model20m145k2020":
comment: "Enable the read traffic to FollowBasedProducer - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_AggregatableFavBasedProducer_Model20m145kUpdated":
comment: "Enable the read traffic to AggregatableFavBasedProducer - Model20m145kUpdated from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_AggregatableFavBasedProducer_Model20m145k2020":
comment: "Enable the read traffic to AggregatableFavBasedProducer - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_AggregatableLogFavBasedProducer_Model20m145kUpdated":
comment: "Enable the read traffic to AggregatableLogFavBasedProducer - Model20m145kUpdated from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_AggregatableLogFavBasedProducer_Model20m145k2020":
comment: "Enable the read traffic to AggregatableLogFavBasedProducer - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_RelaxedAggregatableLogFavBasedProducer_Model20m145kUpdated":
comment: "Enable the read traffic to RelaxedAggregatableLogFavBasedProducer - Model20m145kUpdated from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_RelaxedAggregatableLogFavBasedProducer_Model20m145k2020":
comment: "Enable the read traffic to RelaxedAggregatableLogFavBasedProducer - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
# User embeddings - InterestedIn
"enable_LogFavBasedUserInterestedInFromAPE_Model20m145k2020":
comment: "Enable the read traffic to LogFavBasedUserInterestedInFromAPE - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_FollowBasedUserInterestedInFromAPE_Model20m145k2020":
comment: "Enable the read traffic to FollowBasedUserInterestedInFromAPE - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_FavBasedUserInterestedIn_Model20m145kUpdated":
comment: "Enable the read traffic to FavBasedUserInterestedIn - Model20m145kUpdated from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_FavBasedUserInterestedIn_Model20m145k2020":
comment: "Enable the read traffic to FavBasedUserInterestedIn - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_FollowBasedUserInterestedIn_Model20m145k2020":
comment: "Enable the read traffic to FollowBasedUserInterestedIn - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavBasedUserInterestedIn_Model20m145k2020":
comment: "Enable the read traffic to LogFavBasedUserInterestedIn - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_FavBasedUserInterestedInFromPE_Model20m145kUpdated":
comment: "Enable the read traffic to FavBasedUserInterestedInFromPE - Model20m145kUpdated from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_FilteredUserInterestedIn_Model20m145kUpdated":
comment: "Enable the read traffic to FilteredUserInterestedIn - Model20m145kUpdated from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_FilteredUserInterestedIn_Model20m145k2020":
comment: "Enable the read traffic to FilteredUserInterestedIn - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_FilteredUserInterestedInFromPE_Model20m145kUpdated":
comment: "Enable the read traffic to FilteredUserInterestedInFromPE - Model20m145kUpdated from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_UnfilteredUserInterestedIn_Model20m145kUpdated":
comment: "Enable the read traffic to UnfilteredUserInterestedIn - Model20m145kUpdated from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_UnfilteredUserInterestedIn_Model20m145k2020":
comment: "Enable the read traffic to UnfilteredUserInterestedIn - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_UserNextInterestedIn_Model20m145k2020":
comment: "Enable the read traffic to UserNextInterestedIn - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavBasedUserInterestedMaxpoolingAddressBookFromIIAPE_Model20m145k2020":
comment: "Enable the read traffic to LogFavBasedUserInterestedMaxpoolingAddressBookFromIIAPE - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavBasedUserInterestedAverageAddressBookFromIIAPE_Model20m145k2020":
comment: "Enable the read traffic to LogFavBasedUserInterestedAverageAddressBookFromIIAPE - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE_Model20m145k2020":
comment: "Enable the read traffic to LogFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE_Model20m145k2020":
comment: "Enable the read traffic to LogFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE_Model20m145k2020":
comment: "Enable the read traffic to LogFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
"enable_LogFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE_Model20m145k2020":
comment: "Enable the read traffic to LogFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE - Model20m145k2020 from 0% to 100%. 0 means return EMPTY for all requests."
default_availability: 10000
# ---------- load shedding by caller id ----------
# To create a new decider, add here with the same format and caller's details :
# "representation-manager_load_shed_by_caller_id_twtr:{{role}}:{{name}}:{{environment}}:{{cluster}}"
# All the deciders below are generated by this script:
# ./strato/bin/fed deciders representation-manager --service-role=representation-manager --service-name=representation-manager
# If you need to run the script and paste the output, add ONLY the prod deciders here.
"representation-manager_load_shed_by_caller_id_all":
comment: "Reject all traffic from caller id: all"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:cr-mixer:cr-mixer:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:cr-mixer:cr-mixer:prod:atla"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:cr-mixer:cr-mixer:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:cr-mixer:cr-mixer:prod:pdxa"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:simclusters-ann:simclusters-ann-1:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:simclusters-ann:simclusters-ann-1:prod:atla"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:simclusters-ann:simclusters-ann-1:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:simclusters-ann:simclusters-ann-1:prod:pdxa"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:simclusters-ann:simclusters-ann-3:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:simclusters-ann:simclusters-ann-3:prod:atla"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:simclusters-ann:simclusters-ann-3:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:simclusters-ann:simclusters-ann-3:prod:pdxa"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:simclusters-ann:simclusters-ann-4:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:simclusters-ann:simclusters-ann-4:prod:atla"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:simclusters-ann:simclusters-ann-4:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:simclusters-ann:simclusters-ann-4:prod:pdxa"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:simclusters-ann:simclusters-ann-experimental:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:simclusters-ann:simclusters-ann-experimental:prod:atla"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:simclusters-ann:simclusters-ann-experimental:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:simclusters-ann:simclusters-ann-experimental:prod:pdxa"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:simclusters-ann:simclusters-ann:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:simclusters-ann:simclusters-ann:prod:atla"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:simclusters-ann:simclusters-ann:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:simclusters-ann:simclusters-ann:prod:pdxa"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:stratostore:stratoapi:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:stratostore:stratoapi:prod:atla"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:stratostore:stratoserver:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:stratostore:stratoserver:prod:atla"
default_availability: 0
"representation-manager_load_shed_by_caller_id_twtr:svc:stratostore:stratoserver:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:stratostore:stratoserver:prod:pdxa"
default_availability: 0
# ---------- Dark Traffic Proxy ----------
representation-manager_forward_dark_traffic:
comment: "Defines the percentage of traffic to forward to diffy-proxy. Set to 0 to disable dark traffic forwarding"
default_availability: 0
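The decider config above follows two conventions: embedding-store keys are built as `s"enable_${embeddingType.name}_${modelVersion.name}"`, and availability is on a 0..10000 scale (1000 = 10%). A hedged sketch of how such a gate might be evaluated; the actual decider implementation is not part of this diff, so the gating logic here is an assumption for illustration:

```scala
import scala.util.Random

// Build the decider key for an embedding store, per the comment
// at the top of the decider file.
def deciderKey(embeddingType: String, modelVersion: String): String =
  s"enable_${embeddingType}_${modelVersion}"

// Availability is out of 10000: 10000 = always on, 0 = always off
// (assumed gating logic, not the real decider library).
def isAvailable(availability: Int, rng: Random = new Random): Boolean =
  rng.nextInt(10000) < availability

val deciders: Map[String, Int] = Map(
  deciderKey("LogFavBasedTweet", "Model20m145k2020") -> 10000,
  "representation-manager_forward_dark_traffic"      -> 0
)
```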


@@ -0,0 +1,165 @@
<configuration>
<shutdownHook class="ch.qos.logback.core.hook.DelayingShutdownHook"/>
<!-- ===================================================== -->
<!-- Service Config -->
<!-- ===================================================== -->
<property name="DEFAULT_SERVICE_PATTERN"
value="%-16X{traceId} %-12X{clientId:--} %-16X{method} %-25logger{0} %msg"/>
<property name="DEFAULT_ACCESS_PATTERN"
value="%msg"/>
<!-- ===================================================== -->
<!-- Common Config -->
<!-- ===================================================== -->
<!-- JUL/JDK14 to Logback bridge -->
<contextListener class="ch.qos.logback.classic.jul.LevelChangePropagator">
<resetJUL>true</resetJUL>
</contextListener>
<!-- ====================================================================================== -->
<!-- NOTE: The following appenders use a simple TimeBasedRollingPolicy configuration. -->
<!-- You may want to consider using a more advanced SizeAndTimeBasedRollingPolicy. -->
<!-- See: https://logback.qos.ch/manual/appenders.html#SizeAndTimeBasedRollingPolicy -->
<!-- ====================================================================================== -->
<!-- Service Log (rollover daily, keep maximum of 21 days of gzip compressed logs) -->
<appender name="SERVICE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${log.service.output}</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<!-- daily rollover -->
<fileNamePattern>${log.service.output}.%d.gz</fileNamePattern>
<!-- the maximum total size of all the log files -->
<totalSizeCap>3GB</totalSizeCap>
<!-- keep maximum 21 days' worth of history -->
<maxHistory>21</maxHistory>
<cleanHistoryOnStart>true</cleanHistoryOnStart>
</rollingPolicy>
<encoder>
<pattern>%date %.-3level ${DEFAULT_SERVICE_PATTERN}%n</pattern>
</encoder>
</appender>
<!-- Access Log (rollover daily, keep maximum of 21 days of gzip compressed logs) -->
<appender name="ACCESS" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${log.access.output}</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<!-- daily rollover -->
<fileNamePattern>${log.access.output}.%d.gz</fileNamePattern>
<!-- the maximum total size of all the log files -->
<totalSizeCap>100MB</totalSizeCap>
<!-- keep maximum 7 days' worth of history -->
<maxHistory>7</maxHistory>
<cleanHistoryOnStart>true</cleanHistoryOnStart>
</rollingPolicy>
<encoder>
<pattern>${DEFAULT_ACCESS_PATTERN}%n</pattern>
</encoder>
</appender>
<!-- LogLens -->
<appender name="LOGLENS" class="com.twitter.loglens.logback.LoglensAppender">
<mdcAdditionalContext>true</mdcAdditionalContext>
<category>${log.lens.category}</category>
<index>${log.lens.index}</index>
<tag>${log.lens.tag}/service</tag>
<encoder>
<pattern>%msg</pattern>
</encoder>
</appender>
<!-- LogLens Access -->
<appender name="LOGLENS-ACCESS" class="com.twitter.loglens.logback.LoglensAppender">
<mdcAdditionalContext>true</mdcAdditionalContext>
<category>${log.lens.category}</category>
<index>${log.lens.index}</index>
<tag>${log.lens.tag}/access</tag>
<encoder>
<pattern>%msg</pattern>
</encoder>
</appender>
<!-- Pipeline Execution Logs -->
<appender name="ALLOW-LISTED-PIPELINE-EXECUTIONS" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>allow_listed_pipeline_executions.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<!-- daily rollover -->
<fileNamePattern>allow_listed_pipeline_executions.log.%d.gz</fileNamePattern>
<!-- the maximum total size of all the log files -->
<totalSizeCap>100MB</totalSizeCap>
<!-- keep maximum 7 days' worth of history -->
<maxHistory>7</maxHistory>
<cleanHistoryOnStart>true</cleanHistoryOnStart>
</rollingPolicy>
<encoder>
<pattern>%date %.-3level ${DEFAULT_SERVICE_PATTERN}%n</pattern>
</encoder>
</appender>
<!-- ===================================================== -->
<!-- Primary Async Appenders -->
<!-- ===================================================== -->
<property name="async_queue_size" value="${queue.size:-50000}"/>
<property name="async_max_flush_time" value="${max.flush.time:-0}"/>
<appender name="ASYNC-SERVICE" class="com.twitter.inject.logback.AsyncAppender">
<queueSize>${async_queue_size}</queueSize>
<maxFlushTime>${async_max_flush_time}</maxFlushTime>
<appender-ref ref="SERVICE"/>
</appender>
<appender name="ASYNC-ACCESS" class="com.twitter.inject.logback.AsyncAppender">
<queueSize>${async_queue_size}</queueSize>
<maxFlushTime>${async_max_flush_time}</maxFlushTime>
<appender-ref ref="ACCESS"/>
</appender>
<appender name="ASYNC-ALLOW-LISTED-PIPELINE-EXECUTIONS" class="com.twitter.inject.logback.AsyncAppender">
<queueSize>${async_queue_size}</queueSize>
<maxFlushTime>${async_max_flush_time}</maxFlushTime>
<appender-ref ref="ALLOW-LISTED-PIPELINE-EXECUTIONS"/>
</appender>
<appender name="ASYNC-LOGLENS" class="com.twitter.inject.logback.AsyncAppender">
<queueSize>${async_queue_size}</queueSize>
<maxFlushTime>${async_max_flush_time}</maxFlushTime>
<appender-ref ref="LOGLENS"/>
</appender>
<appender name="ASYNC-LOGLENS-ACCESS" class="com.twitter.inject.logback.AsyncAppender">
<queueSize>${async_queue_size}</queueSize>
<maxFlushTime>${async_max_flush_time}</maxFlushTime>
<appender-ref ref="LOGLENS-ACCESS"/>
</appender>
<!-- ===================================================== -->
<!-- Package Config -->
<!-- ===================================================== -->
<!-- Per-Package Config -->
<logger name="com.twitter" level="INHERITED"/>
<logger name="com.twitter.wilyns" level="INHERITED"/>
<logger name="com.twitter.configbus.client.file" level="INHERITED"/>
<logger name="com.twitter.finagle.mux" level="INHERITED"/>
<logger name="com.twitter.finagle.serverset2" level="INHERITED"/>
<logger name="com.twitter.logging.ScribeHandler" level="INHERITED"/>
<logger name="com.twitter.zookeeper.client.internal" level="INHERITED"/>
<!-- Root Config -->
<!-- For all logs except access logs, disable logging below the configured log_level by default. This can be overridden in the per-package loggers, and dynamically in the admin panel of individual instances. -->
<root level="${log_level:-INFO}">
<appender-ref ref="ASYNC-SERVICE"/>
<appender-ref ref="ASYNC-LOGLENS"/>
</root>
<!-- Access Logging -->
<!-- Access logs are turned off by default -->
<logger name="com.twitter.finatra.thrift.filters.AccessLoggingFilter" level="OFF" additivity="false">
<appender-ref ref="ASYNC-ACCESS"/>
<appender-ref ref="ASYNC-LOGLENS-ACCESS"/>
</logger>
</configuration>
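The async appenders configured above decouple request threads from log I/O via a bounded queue drained by a background thread. A minimal sketch of that pattern — illustrative only, not the Logback/Finatra implementation:

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

// Minimal sketch of the async-appender pattern: callers enqueue events on a
// bounded queue and a background thread drains them to the real appender.
final class AsyncAppenderSketch(delegate: String => Unit, queueSize: Int = 50000) {
  private val queue = new ArrayBlockingQueue[String](queueSize)
  @volatile private var running = true

  private val worker = new Thread(() => {
    // Keep draining until stopped AND the queue is empty.
    while (running || !queue.isEmpty) {
      val event = queue.poll(10, TimeUnit.MILLISECONDS)
      if (event != null) delegate(event)
    }
  })
  worker.setDaemon(true)
  worker.start()

  // Drops the event (returns false) if the queue is full, instead of blocking the caller.
  def append(event: String): Boolean = queue.offer(event)

  def stop(): Unit = { running = false; worker.join() }
}
```

A bounded queue with drop-on-full matches the spirit of `queueSize`/`maxFlushTime` above: logging backpressure is never allowed to stall serving threads.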

@@ -0,0 +1,13 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finatra/inject/inject-thrift-client",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/columns/topic",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/columns/tweet",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/columns/user",
"strato/src/main/scala/com/twitter/strato/fed",
"strato/src/main/scala/com/twitter/strato/fed/server",
],
)

@@ -0,0 +1,40 @@
package com.twitter.representation_manager
import com.google.inject.Module
import com.twitter.inject.thrift.modules.ThriftClientIdModule
import com.twitter.representation_manager.columns.topic.LocaleEntityIdSimClustersEmbeddingCol
import com.twitter.representation_manager.columns.topic.TopicIdSimClustersEmbeddingCol
import com.twitter.representation_manager.columns.tweet.TweetSimClustersEmbeddingCol
import com.twitter.representation_manager.columns.user.UserSimClustersEmbeddingCol
import com.twitter.representation_manager.modules.CacheModule
import com.twitter.representation_manager.modules.InterestsThriftClientModule
import com.twitter.representation_manager.modules.LegacyRMSConfigModule
import com.twitter.representation_manager.modules.StoreModule
import com.twitter.representation_manager.modules.TimerModule
import com.twitter.representation_manager.modules.UttClientModule
import com.twitter.strato.fed._
import com.twitter.strato.fed.server._
object RepresentationManagerFedServerMain extends RepresentationManagerFedServer
trait RepresentationManagerFedServer extends StratoFedServer {
override def dest: String = "/s/representation-manager/representation-manager"
override val modules: Seq[Module] =
Seq(
CacheModule,
InterestsThriftClientModule,
LegacyRMSConfigModule,
StoreModule,
ThriftClientIdModule,
TimerModule,
UttClientModule
)
override def columns: Seq[Class[_ <: StratoFed.Column]] =
Seq(
classOf[TweetSimClustersEmbeddingCol],
classOf[UserSimClustersEmbeddingCol],
classOf[TopicIdSimClustersEmbeddingCol],
classOf[LocaleEntityIdSimClustersEmbeddingCol]
)
}

@@ -0,0 +1,9 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"strato/src/main/scala/com/twitter/strato/fed",
"strato/src/main/scala/com/twitter/strato/fed/server",
],
)

@@ -0,0 +1,26 @@
package com.twitter.representation_manager.columns
import com.twitter.strato.access.Access.LdapGroup
import com.twitter.strato.config.ContactInfo
import com.twitter.strato.config.FromColumns
import com.twitter.strato.config.Has
import com.twitter.strato.config.Prefix
import com.twitter.strato.config.ServiceIdentifierPattern
object ColumnConfigBase {
/****************** Internal permissions *******************/
val recosPermissions: Seq[com.twitter.strato.config.Policy] = Seq()
/****************** External permissions *******************/
// This is used to grant limited access to members outside of the RP team.
val externalPermissions: Seq[com.twitter.strato.config.Policy] = Seq()
val contactInfo: ContactInfo = ContactInfo(
description = "Please contact Relevance Platform for more details",
contactEmail = "no-reply@twitter.com",
ldapGroup = "ldap",
jiraProject = "JIRA",
links = Seq("http://go/rms-runbook")
)
}

@@ -0,0 +1,14 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finatra/inject/inject-core/src/main/scala",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/columns",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/modules",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/store",
"representation-manager/server/src/main/thrift:thrift-scala",
"strato/src/main/scala/com/twitter/strato/fed",
"strato/src/main/scala/com/twitter/strato/fed/server",
],
)

@@ -0,0 +1,77 @@
package com.twitter.representation_manager.columns.topic
import com.twitter.representation_manager.columns.ColumnConfigBase
import com.twitter.representation_manager.store.TopicSimClustersEmbeddingStore
import com.twitter.representation_manager.thriftscala.SimClustersEmbeddingView
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.LocaleEntityId
import com.twitter.stitch
import com.twitter.stitch.Stitch
import com.twitter.stitch.storehaus.StitchOfReadableStore
import com.twitter.strato.catalog.OpMetadata
import com.twitter.strato.config.AnyOf
import com.twitter.strato.config.ContactInfo
import com.twitter.strato.config.FromColumns
import com.twitter.strato.config.Policy
import com.twitter.strato.config.Prefix
import com.twitter.strato.data.Conv
import com.twitter.strato.data.Description.PlainText
import com.twitter.strato.data.Lifecycle
import com.twitter.strato.fed._
import com.twitter.strato.thrift.ScroogeConv
import javax.inject.Inject
class LocaleEntityIdSimClustersEmbeddingCol @Inject() (
embeddingStore: TopicSimClustersEmbeddingStore)
extends StratoFed.Column(
"recommendations/representation_manager/simClustersEmbedding.LocaleEntityId")
with StratoFed.Fetch.Stitch {
private val storeStitch: SimClustersEmbeddingId => Stitch[SimClustersEmbedding] =
StitchOfReadableStore(embeddingStore.topicSimClustersEmbeddingStore.mapValues(_.toThrift))
val colPermissions: Seq[com.twitter.strato.config.Policy] =
ColumnConfigBase.recosPermissions ++ ColumnConfigBase.externalPermissions :+ FromColumns(
Set(
Prefix("ml/featureStore/simClusters"),
))
override val policy: Policy = AnyOf({
colPermissions
})
override type Key = LocaleEntityId
override type View = SimClustersEmbeddingView
override type Value = SimClustersEmbedding
override val keyConv: Conv[Key] = ScroogeConv.fromStruct[LocaleEntityId]
override val viewConv: Conv[View] = ScroogeConv.fromStruct[SimClustersEmbeddingView]
override val valueConv: Conv[Value] = ScroogeConv.fromStruct[SimClustersEmbedding]
override val contactInfo: ContactInfo = ColumnConfigBase.contactInfo
override val metadata: OpMetadata = OpMetadata(
lifecycle = Some(Lifecycle.Production),
description = Some(
PlainText(
"The Topic SimClusters Embedding Endpoint in Representation Management Service with LocaleEntityId." +
" TDD: http://go/rms-tdd"))
)
override def fetch(key: Key, view: View): Stitch[Result[Value]] = {
val embeddingId = SimClustersEmbeddingId(
view.embeddingType,
view.modelVersion,
InternalId.LocaleEntityId(key)
)
storeStitch(embeddingId)
.map(embedding => found(embedding))
.handle {
case stitch.NotFound => missing
}
}
}
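The fetch flow in this column — combine the caller-supplied view (embedding type, model version) with the key into a `SimClustersEmbeddingId`, look it up, and translate a store miss into a `missing` result — can be sketched without the Stitch/Strato machinery. All types and names below are illustrative stand-ins:

```scala
// Dependency-free sketch of the column's fetch flow. Illustrative types only:
// the real column uses Stitch, thrift structs, and a ReadableStore.
case class SketchEmbeddingId(embeddingType: String, modelVersion: String, internalId: Long)

sealed trait SketchResult[+A]
case class Found[A](value: A) extends SketchResult[A]
case object Missing extends SketchResult[Nothing]

object FetchSketch {
  def fetch(
    store: Map[SketchEmbeddingId, Vector[Double]]
  )(key: Long, embeddingType: String, modelVersion: String): SketchResult[Vector[Double]] = {
    // Combine the caller-supplied view (type + model version) with the key.
    val id = SketchEmbeddingId(embeddingType, modelVersion, key)
    // A store miss becomes Missing rather than an error, mirroring the
    // `case stitch.NotFound => missing` handler in the column.
    store.get(id).map(Found(_)).getOrElse(Missing)
  }
}
```

The same shape repeats in the TopicId, Tweet, and User columns; only the key type and the `InternalId` constructor differ.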

@@ -0,0 +1,74 @@
package com.twitter.representation_manager.columns.topic
import com.twitter.representation_manager.columns.ColumnConfigBase
import com.twitter.representation_manager.store.TopicSimClustersEmbeddingStore
import com.twitter.representation_manager.thriftscala.SimClustersEmbeddingView
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.TopicId
import com.twitter.stitch
import com.twitter.stitch.Stitch
import com.twitter.stitch.storehaus.StitchOfReadableStore
import com.twitter.strato.catalog.OpMetadata
import com.twitter.strato.config.AnyOf
import com.twitter.strato.config.ContactInfo
import com.twitter.strato.config.FromColumns
import com.twitter.strato.config.Policy
import com.twitter.strato.config.Prefix
import com.twitter.strato.data.Conv
import com.twitter.strato.data.Description.PlainText
import com.twitter.strato.data.Lifecycle
import com.twitter.strato.fed._
import com.twitter.strato.thrift.ScroogeConv
import javax.inject.Inject
class TopicIdSimClustersEmbeddingCol @Inject() (embeddingStore: TopicSimClustersEmbeddingStore)
extends StratoFed.Column("recommendations/representation_manager/simClustersEmbedding.TopicId")
with StratoFed.Fetch.Stitch {
private val storeStitch: SimClustersEmbeddingId => Stitch[SimClustersEmbedding] =
StitchOfReadableStore(embeddingStore.topicSimClustersEmbeddingStore.mapValues(_.toThrift))
val colPermissions: Seq[com.twitter.strato.config.Policy] =
ColumnConfigBase.recosPermissions ++ ColumnConfigBase.externalPermissions :+ FromColumns(
Set(
Prefix("ml/featureStore/simClusters"),
))
override val policy: Policy = AnyOf({
colPermissions
})
override type Key = TopicId
override type View = SimClustersEmbeddingView
override type Value = SimClustersEmbedding
override val keyConv: Conv[Key] = ScroogeConv.fromStruct[TopicId]
override val viewConv: Conv[View] = ScroogeConv.fromStruct[SimClustersEmbeddingView]
override val valueConv: Conv[Value] = ScroogeConv.fromStruct[SimClustersEmbedding]
override val contactInfo: ContactInfo = ColumnConfigBase.contactInfo
override val metadata: OpMetadata = OpMetadata(
lifecycle = Some(Lifecycle.Production),
description = Some(PlainText(
"The Topic SimClusters Embedding Endpoint in Representation Management Service with TopicId." +
" TDD: http://go/rms-tdd"))
)
override def fetch(key: Key, view: View): Stitch[Result[Value]] = {
val embeddingId = SimClustersEmbeddingId(
view.embeddingType,
view.modelVersion,
InternalId.TopicId(key)
)
storeStitch(embeddingId)
.map(embedding => found(embedding))
.handle {
case stitch.NotFound => missing
}
}
}

@@ -0,0 +1,14 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finatra/inject/inject-core/src/main/scala",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/columns",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/modules",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/store",
"representation-manager/server/src/main/thrift:thrift-scala",
"strato/src/main/scala/com/twitter/strato/fed",
"strato/src/main/scala/com/twitter/strato/fed/server",
],
)

@@ -0,0 +1,73 @@
package com.twitter.representation_manager.columns.tweet
import com.twitter.representation_manager.columns.ColumnConfigBase
import com.twitter.representation_manager.store.TweetSimClustersEmbeddingStore
import com.twitter.representation_manager.thriftscala.SimClustersEmbeddingView
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.stitch
import com.twitter.stitch.Stitch
import com.twitter.stitch.storehaus.StitchOfReadableStore
import com.twitter.strato.catalog.OpMetadata
import com.twitter.strato.config.AnyOf
import com.twitter.strato.config.ContactInfo
import com.twitter.strato.config.FromColumns
import com.twitter.strato.config.Policy
import com.twitter.strato.config.Prefix
import com.twitter.strato.data.Conv
import com.twitter.strato.data.Description.PlainText
import com.twitter.strato.data.Lifecycle
import com.twitter.strato.fed._
import com.twitter.strato.thrift.ScroogeConv
import javax.inject.Inject
class TweetSimClustersEmbeddingCol @Inject() (embeddingStore: TweetSimClustersEmbeddingStore)
extends StratoFed.Column("recommendations/representation_manager/simClustersEmbedding.Tweet")
with StratoFed.Fetch.Stitch {
private val storeStitch: SimClustersEmbeddingId => Stitch[SimClustersEmbedding] =
StitchOfReadableStore(embeddingStore.tweetSimClustersEmbeddingStore.mapValues(_.toThrift))
val colPermissions: Seq[com.twitter.strato.config.Policy] =
ColumnConfigBase.recosPermissions ++ ColumnConfigBase.externalPermissions :+ FromColumns(
Set(
Prefix("ml/featureStore/simClusters"),
))
override val policy: Policy = AnyOf({
colPermissions
})
override type Key = Long // TweetId
override type View = SimClustersEmbeddingView
override type Value = SimClustersEmbedding
override val keyConv: Conv[Key] = Conv.long
override val viewConv: Conv[View] = ScroogeConv.fromStruct[SimClustersEmbeddingView]
override val valueConv: Conv[Value] = ScroogeConv.fromStruct[SimClustersEmbedding]
override val contactInfo: ContactInfo = ColumnConfigBase.contactInfo
override val metadata: OpMetadata = OpMetadata(
lifecycle = Some(Lifecycle.Production),
description = Some(
PlainText("The Tweet SimClusters Embedding Endpoint in Representation Management Service." +
" TDD: http://go/rms-tdd"))
)
override def fetch(key: Key, view: View): Stitch[Result[Value]] = {
val embeddingId = SimClustersEmbeddingId(
view.embeddingType,
view.modelVersion,
InternalId.TweetId(key)
)
storeStitch(embeddingId)
.map(embedding => found(embedding))
.handle {
case stitch.NotFound => missing
}
}
}

@@ -0,0 +1,14 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finatra/inject/inject-core/src/main/scala",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/columns",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/modules",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/store",
"representation-manager/server/src/main/thrift:thrift-scala",
"strato/src/main/scala/com/twitter/strato/fed",
"strato/src/main/scala/com/twitter/strato/fed/server",
],
)

@@ -0,0 +1,73 @@
package com.twitter.representation_manager.columns.user
import com.twitter.representation_manager.columns.ColumnConfigBase
import com.twitter.representation_manager.store.UserSimClustersEmbeddingStore
import com.twitter.representation_manager.thriftscala.SimClustersEmbeddingView
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbedding
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.stitch
import com.twitter.stitch.Stitch
import com.twitter.stitch.storehaus.StitchOfReadableStore
import com.twitter.strato.catalog.OpMetadata
import com.twitter.strato.config.AnyOf
import com.twitter.strato.config.ContactInfo
import com.twitter.strato.config.FromColumns
import com.twitter.strato.config.Policy
import com.twitter.strato.config.Prefix
import com.twitter.strato.data.Conv
import com.twitter.strato.data.Description.PlainText
import com.twitter.strato.data.Lifecycle
import com.twitter.strato.fed._
import com.twitter.strato.thrift.ScroogeConv
import javax.inject.Inject
class UserSimClustersEmbeddingCol @Inject() (embeddingStore: UserSimClustersEmbeddingStore)
extends StratoFed.Column("recommendations/representation_manager/simClustersEmbedding.User")
with StratoFed.Fetch.Stitch {
private val storeStitch: SimClustersEmbeddingId => Stitch[SimClustersEmbedding] =
StitchOfReadableStore(embeddingStore.userSimClustersEmbeddingStore.mapValues(_.toThrift))
val colPermissions: Seq[com.twitter.strato.config.Policy] =
ColumnConfigBase.recosPermissions ++ ColumnConfigBase.externalPermissions :+ FromColumns(
Set(
Prefix("ml/featureStore/simClusters"),
))
override val policy: Policy = AnyOf({
colPermissions
})
override type Key = Long // UserId
override type View = SimClustersEmbeddingView
override type Value = SimClustersEmbedding
override val keyConv: Conv[Key] = Conv.long
override val viewConv: Conv[View] = ScroogeConv.fromStruct[SimClustersEmbeddingView]
override val valueConv: Conv[Value] = ScroogeConv.fromStruct[SimClustersEmbedding]
override val contactInfo: ContactInfo = ColumnConfigBase.contactInfo
override val metadata: OpMetadata = OpMetadata(
lifecycle = Some(Lifecycle.Production),
description = Some(
PlainText("The User SimClusters Embedding Endpoint in Representation Management Service." +
" TDD: http://go/rms-tdd"))
)
override def fetch(key: Key, view: View): Stitch[Result[Value]] = {
val embeddingId = SimClustersEmbeddingId(
view.embeddingType,
view.modelVersion,
InternalId.UserId(key)
)
storeStitch(embeddingId)
.map(embedding => found(embedding))
.handle {
case stitch.NotFound => missing
}
}
}

@@ -0,0 +1,13 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"decider/src/main/scala",
"finagle/finagle-memcached",
"hermit/hermit-core/src/main/scala/com/twitter/hermit/store/common",
"relevance-platform/src/main/scala/com/twitter/relevance_platform/common/injection",
"src/scala/com/twitter/simclusters_v2/common",
"src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala",
],
)

@@ -0,0 +1,153 @@
package com.twitter.representation_manager.common
import com.twitter.bijection.scrooge.BinaryScalaCodec
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.memcached.Client
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.hashing.KeyHasher
import com.twitter.hermit.store.common.ObservedMemcachedReadableStore
import com.twitter.relevance_platform.common.injection.LZ4Injection
import com.twitter.simclusters_v2.common.SimClustersEmbedding
import com.twitter.simclusters_v2.common.SimClustersEmbeddingIdCacheKeyBuilder
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.EmbeddingType._
import com.twitter.simclusters_v2.thriftscala.ModelVersion
import com.twitter.simclusters_v2.thriftscala.ModelVersion._
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding}
import com.twitter.storehaus.ReadableStore
import com.twitter.util.Duration
/*
* NOTE - All the cache configs here are just placeholders; none of them is used anywhere in RMS yet.
* */
sealed trait MemCacheParams
sealed trait MemCacheConfig
/*
* This holds the params that are required to set up a memcache cache for a single embedding store.
* */
case class EnabledMemCacheParams(ttl: Duration) extends MemCacheParams
object DisabledMemCacheParams extends MemCacheParams
/*
* We use this MemCacheConfig as the single source of truth to set up the memcache for all RMS use
* cases. Clients cannot override it.
* */
object MemCacheConfig {
val keyHasher: KeyHasher = KeyHasher.FNV1A_64
val hashKeyPrefix: String = "RMS"
val simclustersEmbeddingCacheKeyBuilder =
SimClustersEmbeddingIdCacheKeyBuilder(keyHasher.hashKey, hashKeyPrefix)
val cacheParamsMap: Map[
(EmbeddingType, ModelVersion),
MemCacheParams
] = Map(
// Tweet Embeddings
(LogFavBasedTweet, Model20m145kUpdated) -> EnabledMemCacheParams(ttl = 10.minutes),
(LogFavBasedTweet, Model20m145k2020) -> EnabledMemCacheParams(ttl = 10.minutes),
(LogFavLongestL2EmbeddingTweet, Model20m145kUpdated) -> EnabledMemCacheParams(ttl = 10.minutes),
(LogFavLongestL2EmbeddingTweet, Model20m145k2020) -> EnabledMemCacheParams(ttl = 10.minutes),
// User - KnownFor Embeddings
(FavBasedProducer, Model20m145kUpdated) -> EnabledMemCacheParams(ttl = 12.hours),
(FavBasedProducer, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(FollowBasedProducer, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(AggregatableLogFavBasedProducer, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(RelaxedAggregatableLogFavBasedProducer, Model20m145kUpdated) -> EnabledMemCacheParams(ttl =
12.hours),
(RelaxedAggregatableLogFavBasedProducer, Model20m145k2020) -> EnabledMemCacheParams(ttl =
12.hours),
// User - InterestedIn Embeddings
(LogFavBasedUserInterestedInFromAPE, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(FollowBasedUserInterestedInFromAPE, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(FavBasedUserInterestedIn, Model20m145kUpdated) -> EnabledMemCacheParams(ttl = 12.hours),
(FavBasedUserInterestedIn, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(FollowBasedUserInterestedIn, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(LogFavBasedUserInterestedIn, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(FavBasedUserInterestedInFromPE, Model20m145kUpdated) -> EnabledMemCacheParams(ttl = 12.hours),
(FilteredUserInterestedIn, Model20m145kUpdated) -> EnabledMemCacheParams(ttl = 12.hours),
(FilteredUserInterestedIn, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(FilteredUserInterestedInFromPE, Model20m145kUpdated) -> EnabledMemCacheParams(ttl = 12.hours),
(UnfilteredUserInterestedIn, Model20m145kUpdated) -> EnabledMemCacheParams(ttl = 12.hours),
(UnfilteredUserInterestedIn, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(UserNextInterestedIn, Model20m145k2020) -> EnabledMemCacheParams(ttl =
30.minutes), // embedding is updated every 2 hours; keep the TTL lower to avoid staleness
(
LogFavBasedUserInterestedMaxpoolingAddressBookFromIIAPE,
Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(
LogFavBasedUserInterestedAverageAddressBookFromIIAPE,
Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(
LogFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE,
Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(
LogFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE,
Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(
LogFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE,
Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(
LogFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE,
Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
// Topic Embeddings
(FavTfgTopic, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
(LogFavBasedKgoApeTopic, Model20m145k2020) -> EnabledMemCacheParams(ttl = 12.hours),
)
def getCacheSetup(
embeddingType: EmbeddingType,
modelVersion: ModelVersion
): MemCacheParams = {
// When the requested (embeddingType, modelVersion) pair doesn't exist, we return DisabledMemCacheParams
cacheParamsMap.getOrElse((embeddingType, modelVersion), DisabledMemCacheParams)
}
def getCacheKeyPrefix(embeddingType: EmbeddingType, modelVersion: ModelVersion) =
s"${embeddingType.value}_${modelVersion.value}_"
def getStatsName(embeddingType: EmbeddingType, modelVersion: ModelVersion) =
s"${embeddingType.name}_${modelVersion.name}_mem_cache"
/**
* Build a ReadableStore based on MemCacheConfig.
*
* If memcache is disabled, it returns a plain readable-store wrapper of the rawStore,
* with SimClustersEmbedding as the value type;
* if memcache is enabled, it returns an ObservedMemcachedReadableStore wrapper of the rawStore,
* with the memcache set up according to the EnabledMemCacheParams.
* */
def buildMemCacheStoreForSimClustersEmbedding(
rawStore: ReadableStore[SimClustersEmbeddingId, ThriftSimClustersEmbedding],
cacheClient: Client,
embeddingType: EmbeddingType,
modelVersion: ModelVersion,
stats: StatsReceiver
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val cacheParams = getCacheSetup(embeddingType, modelVersion)
val store = cacheParams match {
case DisabledMemCacheParams => rawStore
case EnabledMemCacheParams(ttl) =>
val memCacheKeyPrefix = MemCacheConfig.getCacheKeyPrefix(
embeddingType,
modelVersion
)
val statsName = MemCacheConfig.getStatsName(
embeddingType,
modelVersion
)
ObservedMemcachedReadableStore.fromCacheClient(
backingStore = rawStore,
cacheClient = cacheClient,
ttl = ttl
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = stats.scope(statsName),
keyToString = { k => memCacheKeyPrefix + k.toString }
)
}
store.mapValues(SimClustersEmbedding(_))
}
}
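The wrap-only-when-enabled pattern in `buildMemCacheStoreForSimClustersEmbedding` can be sketched with plain functions. A toy in-memory map stands in for memcached, and all names below are illustrative:

```scala
import scala.collection.mutable

// Toy stand-in for the cache-or-passthrough pattern above: wrap a raw lookup
// in a cache only when params for the (embeddingType, modelVersion) pair are enabled.
sealed trait SketchCacheParams
case class SketchEnabled(ttlMinutes: Int) extends SketchCacheParams
case object SketchDisabled extends SketchCacheParams

object CacheWrapSketch {
  val paramsMap: Map[(String, String), SketchCacheParams] =
    Map(("LogFavBasedTweet", "Model20m145k2020") -> SketchEnabled(10))

  // Unknown pairs fall back to disabled caching, mirroring getCacheSetup above.
  def getCacheSetup(e: String, m: String): SketchCacheParams =
    paramsMap.getOrElse((e, m), SketchDisabled)

  // TTL is ignored by this toy cache; a real memcached client would honor it.
  def build(raw: Long => Option[String], e: String, m: String): Long => Option[String] =
    getCacheSetup(e, m) match {
      case SketchDisabled => raw
      case SketchEnabled(_) =>
        val cache = mutable.Map.empty[Long, Option[String]]
        (k: Long) => cache.getOrElseUpdate(k, raw(k))
    }
}
```

The key design point carries over: callers get back the same `Long => Option[String]` shape either way, so enabling or disabling the cache per embedding type is invisible to downstream code.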

@@ -0,0 +1,25 @@
package com.twitter.representation_manager.common
import com.twitter.decider.Decider
import com.twitter.decider.RandomRecipient
import com.twitter.decider.Recipient
import com.twitter.simclusters_v2.common.DeciderGateBuilderWithIdHashing
import javax.inject.Inject
case class RepresentationManagerDecider @Inject() (decider: Decider) {
val deciderGateBuilder = new DeciderGateBuilderWithIdHashing(decider)
def isAvailable(feature: String, recipient: Option[Recipient]): Boolean = {
decider.isAvailable(feature, recipient)
}
/**
* When useRandomRecipient is set to false, the decider is either completely on or off.
* When useRandomRecipient is set to true, the decider is on for the specified % of traffic.
*/
def isAvailable(feature: String, useRandomRecipient: Boolean = true): Boolean = {
if (useRandomRecipient) isAvailable(feature, Some(RandomRecipient))
else isAvailable(feature, None)
}
}
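The two modes documented above — all-or-nothing without a recipient, percentage rollout via `RandomRecipient` — can be sketched without the decider library. The 0..10000 availability range and the fallback behavior for unknown features are assumptions for illustration, not the real `com.twitter.decider` API:

```scala
import scala.util.Random

// Illustrative sketch of the two decider modes described above.
// Assumes the conventional 0..10000 availability range.
final class DeciderSketch(availabilities: Map[String, Int], rng: Random) {
  private def availability(feature: String): Int = availabilities.getOrElse(feature, 0)

  def isAvailable(feature: String, useRandomRecipient: Boolean = true): Boolean =
    if (useRandomRecipient)
      // Percentage mode: on for roughly availability/10000 of calls.
      rng.nextInt(10000) < availability(feature)
    else
      // All-or-nothing mode: on iff the feature has any availability at all.
      availability(feature) > 0
}
```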

@@ -0,0 +1,25 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"content-recommender/server/src/main/scala/com/twitter/contentrecommender:representation-manager-deps",
"frigate/frigate-common/src/main/scala/com/twitter/frigate/common/store/strato",
"frigate/frigate-common/src/main/scala/com/twitter/frigate/common/util",
"hermit/hermit-core/src/main/scala/com/twitter/hermit/store/common",
"relevance-platform/src/main/scala/com/twitter/relevance_platform/common/injection",
"relevance-platform/src/main/scala/com/twitter/relevance_platform/common/readablestore",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/common",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/store",
"src/scala/com/twitter/ml/api/embedding",
"src/scala/com/twitter/simclusters_v2/common",
"src/scala/com/twitter/simclusters_v2/score",
"src/scala/com/twitter/simclusters_v2/summingbird/stores",
"src/scala/com/twitter/storehaus_internal/manhattan",
"src/scala/com/twitter/storehaus_internal/util",
"src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala",
"src/thrift/com/twitter/socialgraph:thrift-scala",
"storage/clients/manhattan/client/src/main/scala",
"tweetypie/src/scala/com/twitter/tweetypie/util",
],
)


@@ -0,0 +1,846 @@
package com.twitter.representation_manager.migration
import com.twitter.bijection.Injection
import com.twitter.bijection.scrooge.BinaryScalaCodec
import com.twitter.contentrecommender.store.ApeEntityEmbeddingStore
import com.twitter.contentrecommender.store.InterestsOptOutStore
import com.twitter.contentrecommender.store.SemanticCoreTopicSeedStore
import com.twitter.contentrecommender.twistly
import com.twitter.conversions.DurationOps._
import com.twitter.decider.Decider
import com.twitter.escherbird.util.uttclient.CacheConfigV2
import com.twitter.escherbird.util.uttclient.CachedUttClientV2
import com.twitter.escherbird.util.uttclient.UttClientCacheConfigsV2
import com.twitter.escherbird.utt.strato.thriftscala.Environment
import com.twitter.finagle.ThriftMux
import com.twitter.finagle.memcached.Client
import com.twitter.finagle.mtls.authentication.ServiceIdentifier
import com.twitter.finagle.mtls.client.MtlsStackClient.MtlsThriftMuxClientSyntax
import com.twitter.finagle.mux.ClientDiscardedRequestException
import com.twitter.finagle.service.ReqRep
import com.twitter.finagle.service.ResponseClass
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.finagle.thrift.ClientId
import com.twitter.frigate.common.store.strato.StratoFetchableStore
import com.twitter.frigate.common.util.SeqLongInjection
import com.twitter.hashing.KeyHasher
import com.twitter.hermit.store.common.DeciderableReadableStore
import com.twitter.hermit.store.common.ObservedCachedReadableStore
import com.twitter.hermit.store.common.ObservedMemcachedReadableStore
import com.twitter.hermit.store.common.ObservedReadableStore
import com.twitter.interests.thriftscala.InterestsThriftService
import com.twitter.relevance_platform.common.injection.LZ4Injection
import com.twitter.relevance_platform.common.readablestore.ReadableStoreWithTimeout
import com.twitter.representation_manager.common.RepresentationManagerDecider
import com.twitter.representation_manager.store.DeciderConstants
import com.twitter.representation_manager.store.DeciderKey
import com.twitter.simclusters_v2.common.ModelVersions
import com.twitter.simclusters_v2.common.SimClustersEmbedding
import com.twitter.simclusters_v2.common.SimClustersEmbeddingIdCacheKeyBuilder
import com.twitter.simclusters_v2.stores.SimClustersEmbeddingStore
import com.twitter.simclusters_v2.summingbird.stores.PersistentTweetEmbeddingStore
import com.twitter.simclusters_v2.summingbird.stores.ProducerClusterEmbeddingReadableStores
import com.twitter.simclusters_v2.summingbird.stores.UserInterestedInReadableStore
import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.EmbeddingType._
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.ModelVersion
import com.twitter.simclusters_v2.thriftscala.ModelVersion.Model20m145k2020
import com.twitter.simclusters_v2.thriftscala.ModelVersion.Model20m145kUpdated
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbedding
import com.twitter.simclusters_v2.thriftscala.SimClustersMultiEmbeddingId
import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding}
import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams
import com.twitter.storehaus.ReadableStore
import com.twitter.storehaus_internal.manhattan.Athena
import com.twitter.storehaus_internal.manhattan.ManhattanRO
import com.twitter.storehaus_internal.manhattan.ManhattanROConfig
import com.twitter.storehaus_internal.util.ApplicationID
import com.twitter.storehaus_internal.util.DatasetName
import com.twitter.storehaus_internal.util.HDFSPath
import com.twitter.strato.client.Strato
import com.twitter.strato.client.{Client => StratoClient}
import com.twitter.strato.thrift.ScroogeConvImplicits._
import com.twitter.tweetypie.util.UserId
import com.twitter.util.Duration
import com.twitter.util.Future
import com.twitter.util.Throw
import com.twitter.util.Timer
import javax.inject.Inject
import javax.inject.Named
import scala.reflect.ClassTag
class LegacyRMS @Inject() (
serviceIdentifier: ServiceIdentifier,
cacheClient: Client,
stats: StatsReceiver,
decider: Decider,
clientId: ClientId,
timer: Timer,
@Named("cacheHashKeyPrefix") val cacheHashKeyPrefix: String = "RMS",
@Named("useContentRecommenderConfiguration") val useContentRecommenderConfiguration: Boolean =
false) {
private val mhMtlsParams: ManhattanKVClientMtlsParams = ManhattanKVClientMtlsParams(
serviceIdentifier)
private val rmsDecider = RepresentationManagerDecider(decider)
val keyHasher: KeyHasher = KeyHasher.FNV1A_64
private val embeddingCacheKeyBuilder =
SimClustersEmbeddingIdCacheKeyBuilder(keyHasher.hashKey, cacheHashKeyPrefix)
private val statsReceiver = stats.scope("representation_management")
// Strato client, default timeout = 280ms
val stratoClient: StratoClient =
Strato.client
.withMutualTls(serviceIdentifier)
.build()
// Builds ThriftMux client builder for Content-Recommender service
private def makeThriftClientBuilder(
requestTimeout: Duration
): ThriftMux.Client = {
ThriftMux.client
.withClientId(clientId)
.withMutualTls(serviceIdentifier)
.withRequestTimeout(requestTimeout)
.withStatsReceiver(statsReceiver.scope("clnt"))
.withResponseClassifier {
case ReqRep(_, Throw(_: ClientDiscardedRequestException)) => ResponseClass.Ignorable
}
}
private def makeThriftClient[ThriftServiceType: ClassTag](
dest: String,
label: String,
requestTimeout: Duration = 450.milliseconds
): ThriftServiceType = {
makeThriftClientBuilder(requestTimeout)
.build[ThriftServiceType](dest, label)
}
/* ***** SimClusters Embedding Stores ***** */
implicit val simClustersEmbeddingIdInjection: Injection[SimClustersEmbeddingId, Array[Byte]] =
BinaryScalaCodec(SimClustersEmbeddingId)
implicit val simClustersEmbeddingInjection: Injection[ThriftSimClustersEmbedding, Array[Byte]] =
BinaryScalaCodec(ThriftSimClustersEmbedding)
implicit val simClustersMultiEmbeddingInjection: Injection[SimClustersMultiEmbedding, Array[
Byte
]] =
BinaryScalaCodec(SimClustersMultiEmbedding)
implicit val simClustersMultiEmbeddingIdInjection: Injection[SimClustersMultiEmbeddingId, Array[
Byte
]] =
BinaryScalaCodec(SimClustersMultiEmbeddingId)
def getEmbeddingsDataset(
mhMtlsParams: ManhattanKVClientMtlsParams,
datasetName: String
): ReadableStore[SimClustersEmbeddingId, ThriftSimClustersEmbedding] = {
ManhattanRO.getReadableStoreWithMtls[SimClustersEmbeddingId, ThriftSimClustersEmbedding](
ManhattanROConfig(
HDFSPath(""), // not needed
ApplicationID("content_recommender_athena"),
DatasetName(datasetName), // this should be correct
Athena
),
mhMtlsParams
)
}
lazy val logFavBasedLongestL2Tweet20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore =
PersistentTweetEmbeddingStore
.longestL2NormTweetEmbeddingStoreManhattan(
mhMtlsParams,
PersistentTweetEmbeddingStore.LogFavBased20m145k2020Dataset,
statsReceiver,
maxLength = 10,
).mapValues(_.toThrift)
val memcachedStore = ObservedMemcachedReadableStore.fromCacheClient(
backingStore = rawStore,
cacheClient = cacheClient,
ttl = 15.minutes
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver =
statsReceiver.scope("log_fav_based_longest_l2_tweet_embedding_20m145k2020_mem_cache"),
keyToString = { k =>
s"scez_l2:${LogFavBasedTweet}_${ModelVersions.Model20M145K2020}_$k"
}
)
val inMemoryCacheStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] =
memcachedStore
.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(
LogFavLongestL2EmbeddingTweet,
Model20m145k2020,
InternalId.TweetId(tweetId)) =>
tweetId
}
.mapValues(SimClustersEmbedding(_))
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
inMemoryCacheStore,
ttl = 12.minute,
maxKeys = 1048575,
cacheName = "log_fav_based_longest_l2_tweet_embedding_20m145k2020_cache",
windowSize = 10000L
)(statsReceiver.scope("log_fav_based_longest_l2_tweet_embedding_20m145k2020_store"))
}
lazy val logFavBased20M145KUpdatedTweetEmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore =
PersistentTweetEmbeddingStore
.mostRecentTweetEmbeddingStoreManhattan(
mhMtlsParams,
PersistentTweetEmbeddingStore.LogFavBased20m145kUpdatedDataset,
statsReceiver
).mapValues(_.toThrift)
val memcachedStore = ObservedMemcachedReadableStore.fromCacheClient(
backingStore = rawStore,
cacheClient = cacheClient,
ttl = 10.minutes
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = statsReceiver.scope("log_fav_based_tweet_embedding_mem_cache"),
keyToString = { k =>
// SimClusters_embedding_LZ4/embeddingType_modelVersion_tweetId
s"scez:${LogFavBasedTweet}_${ModelVersions.Model20M145KUpdated}_$k"
}
)
val inMemoryCacheStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
memcachedStore
.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(
LogFavBasedTweet,
Model20m145kUpdated,
InternalId.TweetId(tweetId)) =>
tweetId
}
.mapValues(SimClustersEmbedding(_))
}
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
inMemoryCacheStore,
ttl = 5.minute,
maxKeys = 1048575, // 200MB
cacheName = "log_fav_based_tweet_embedding_cache",
windowSize = 10000L
)(statsReceiver.scope("log_fav_based_tweet_embedding_store"))
}
lazy val logFavBased20M145K2020TweetEmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore =
PersistentTweetEmbeddingStore
.mostRecentTweetEmbeddingStoreManhattan(
mhMtlsParams,
PersistentTweetEmbeddingStore.LogFavBased20m145k2020Dataset,
statsReceiver,
maxLength = 10,
).mapValues(_.toThrift)
val memcachedStore = ObservedMemcachedReadableStore.fromCacheClient(
backingStore = rawStore,
cacheClient = cacheClient,
ttl = 15.minutes
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = statsReceiver.scope("log_fav_based_tweet_embedding_20m145k2020_mem_cache"),
keyToString = { k =>
// SimClusters_embedding_LZ4/embeddingType_modelVersion_tweetId
s"scez:${LogFavBasedTweet}_${ModelVersions.Model20M145K2020}_$k"
}
)
val inMemoryCacheStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] =
memcachedStore
.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(
LogFavBasedTweet,
Model20m145k2020,
InternalId.TweetId(tweetId)) =>
tweetId
}
.mapValues(SimClustersEmbedding(_))
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
inMemoryCacheStore,
ttl = 12.minute,
maxKeys = 16777215,
cacheName = "log_fav_based_tweet_embedding_20m145k2020_cache",
windowSize = 10000L
)(statsReceiver.scope("log_fav_based_tweet_embedding_20m145k2020_store"))
}
lazy val favBasedTfgTopicEmbedding2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val stratoStore =
StratoFetchableStore
.withUnitView[SimClustersEmbeddingId, ThriftSimClustersEmbedding](
stratoClient,
"recommendations/simclusters_v2/embeddings/favBasedTFGTopic20M145K2020")
val truncatedStore = stratoStore.mapValues { embedding =>
SimClustersEmbedding(embedding, truncate = 50)
}
ObservedCachedReadableStore.from(
ObservedReadableStore(truncatedStore)(
statsReceiver.scope("fav_tfg_topic_embedding_2020_cache_backing_store")),
ttl = 12.hours,
maxKeys = 262143, // 200MB
cacheName = "fav_tfg_topic_embedding_2020_cache",
windowSize = 10000L
)(statsReceiver.scope("fav_tfg_topic_embedding_2020_cache"))
}
lazy val logFavBasedApe20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
ObservedReadableStore(
StratoFetchableStore
.withUnitView[SimClustersEmbeddingId, ThriftSimClustersEmbedding](
stratoClient,
"recommendations/simclusters_v2/embeddings/logFavBasedAPE20M145K2020")
.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(
AggregatableLogFavBasedProducer,
Model20m145k2020,
internalId) =>
SimClustersEmbeddingId(AggregatableLogFavBasedProducer, Model20m145k2020, internalId)
}
.mapValues(embedding => SimClustersEmbedding(embedding, 50))
)(statsReceiver.scope("aggregatable_producer_embeddings_by_logfav_score_2020"))
}
val interestService: InterestsThriftService.MethodPerEndpoint =
makeThriftClient[InterestsThriftService.MethodPerEndpoint](
"/s/interests-thrift-service/interests-thrift-service",
"interests_thrift_service"
)
val interestsOptOutStore: InterestsOptOutStore = InterestsOptOutStore(interestService)
// Cache up to 2^18 UTT entities, aiming for a ~100% cache hit rate
lazy val defaultCacheConfigV2: CacheConfigV2 = CacheConfigV2(262143)
lazy val uttClientCacheConfigsV2: UttClientCacheConfigsV2 = UttClientCacheConfigsV2(
getTaxonomyConfig = defaultCacheConfigV2,
getUttTaxonomyConfig = defaultCacheConfigV2,
getLeafIds = defaultCacheConfigV2,
getLeafUttEntities = defaultCacheConfigV2
)
// CachedUttClient backed by the StratoClient
lazy val cachedUttClientV2: CachedUttClientV2 = new CachedUttClientV2(
stratoClient = stratoClient,
env = Environment.Prod,
cacheConfigs = uttClientCacheConfigsV2,
statsReceiver = statsReceiver.scope("cached_utt_client")
)
lazy val semanticCoreTopicSeedStore: ReadableStore[
SemanticCoreTopicSeedStore.Key,
Seq[UserId]
] = {
/*
Up to 1000 Long seeds per topic/language = 62.5kb per topic/language (worst case)
Assume ~10k active topic/languages ~= 650MB (worst case)
*/
val underlying = new SemanticCoreTopicSeedStore(cachedUttClientV2, interestsOptOutStore)(
statsReceiver.scope("semantic_core_topic_seed_store"))
val memcacheStore = ObservedMemcachedReadableStore.fromCacheClient(
backingStore = underlying,
cacheClient = cacheClient,
ttl = 12.hours
)(
valueInjection = SeqLongInjection,
statsReceiver = statsReceiver.scope("topic_producer_seed_store_mem_cache"),
keyToString = { k => s"tpss:${k.entityId}_${k.languageCode}" }
)
ObservedCachedReadableStore.from[SemanticCoreTopicSeedStore.Key, Seq[UserId]](
store = memcacheStore,
ttl = 6.hours,
maxKeys = 20e3.toInt,
cacheName = "topic_producer_seed_store_cache",
windowSize = 5000
)(statsReceiver.scope("topic_producer_seed_store_cache"))
}
lazy val logFavBasedApeEntity20M145K2020EmbeddingStore: ApeEntityEmbeddingStore = {
val apeStore = logFavBasedApe20M145K2020EmbeddingStore.composeKeyMapping[UserId]({ id =>
SimClustersEmbeddingId(
AggregatableLogFavBasedProducer,
Model20m145k2020,
InternalId.UserId(id))
})
new ApeEntityEmbeddingStore(
semanticCoreSeedStore = semanticCoreTopicSeedStore,
aggregatableProducerEmbeddingStore = apeStore,
statsReceiver = statsReceiver.scope("log_fav_based_ape_entity_2020_embedding_store"))
}
lazy val logFavBasedApeEntity20M145K2020EmbeddingCachedStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val truncatedStore =
logFavBasedApeEntity20M145K2020EmbeddingStore.mapValues(_.truncate(50).toThrift)
val memcachedStore = ObservedMemcachedReadableStore
.fromCacheClient(
backingStore = truncatedStore,
cacheClient = cacheClient,
ttl = 12.hours
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = statsReceiver.scope("log_fav_based_ape_entity_2020_embedding_mem_cache"),
keyToString = { k => embeddingCacheKeyBuilder.apply(k) }
).mapValues(SimClustersEmbedding(_))
val inMemoryCachedStore =
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
memcachedStore,
ttl = 6.hours,
maxKeys = 262143,
cacheName = "log_fav_based_ape_entity_2020_embedding_cache",
windowSize = 10000L
)(statsReceiver.scope("log_fav_based_ape_entity_2020_embedding_cached_store"))
DeciderableReadableStore(
inMemoryCachedStore,
rmsDecider.deciderGateBuilder.idGateWithHashing[SimClustersEmbeddingId](
DeciderKey.enableLogFavBasedApeEntity20M145K2020EmbeddingCachedStore),
statsReceiver.scope("log_fav_based_ape_entity_2020_embedding_deciderable_store")
)
}
lazy val relaxedLogFavBasedApe20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
ObservedReadableStore(
StratoFetchableStore
.withUnitView[SimClustersEmbeddingId, ThriftSimClustersEmbedding](
stratoClient,
"recommendations/simclusters_v2/embeddings/logFavBasedAPERelaxedFavEngagementThreshold20M145K2020")
.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(
RelaxedAggregatableLogFavBasedProducer,
Model20m145k2020,
internalId) =>
SimClustersEmbeddingId(
RelaxedAggregatableLogFavBasedProducer,
Model20m145k2020,
internalId)
}
.mapValues(embedding => SimClustersEmbedding(embedding).truncate(50))
)(statsReceiver.scope(
"aggregatable_producer_embeddings_by_logfav_score_relaxed_fav_engagement_threshold_2020"))
}
lazy val relaxedLogFavBasedApe20M145K2020EmbeddingCachedStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val truncatedStore =
relaxedLogFavBasedApe20M145K2020EmbeddingStore.mapValues(_.truncate(50).toThrift)
val memcachedStore = ObservedMemcachedReadableStore
.fromCacheClient(
backingStore = truncatedStore,
cacheClient = cacheClient,
ttl = 12.hours
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver =
statsReceiver.scope("relaxed_log_fav_based_ape_entity_2020_embedding_mem_cache"),
keyToString = { k: SimClustersEmbeddingId => embeddingCacheKeyBuilder.apply(k) }
).mapValues(SimClustersEmbedding(_))
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
memcachedStore,
ttl = 6.hours,
maxKeys = 262143,
cacheName = "relaxed_log_fav_based_ape_entity_2020_embedding_cache",
windowSize = 10000L
)(statsReceiver.scope("relaxed_log_fav_based_ape_entity_2020_embedding_cache_store"))
}
lazy val favBasedProducer20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val underlyingStore = ProducerClusterEmbeddingReadableStores
.getProducerTopKSimClusters2020EmbeddingsStore(
mhMtlsParams
).composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(
FavBasedProducer,
Model20m145k2020,
InternalId.UserId(userId)) =>
userId
}.mapValues { topSimClustersWithScore =>
ThriftSimClustersEmbedding(topSimClustersWithScore.topClusters.take(10))
}
// same memcache config as for favBasedUserInterestedIn20M145K2020Store
val memcachedStore = ObservedMemcachedReadableStore
.fromCacheClient(
backingStore = underlyingStore,
cacheClient = cacheClient,
ttl = 24.hours
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = statsReceiver.scope("fav_based_producer_embedding_20M_145K_2020_mem_cache"),
keyToString = { k => embeddingCacheKeyBuilder.apply(k) }
).mapValues(SimClustersEmbedding(_))
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
memcachedStore,
ttl = 12.hours,
maxKeys = 16777215,
cacheName = "fav_based_producer_embedding_20M_145K_2020_embedding_cache",
windowSize = 10000L
)(statsReceiver.scope("fav_based_producer_embedding_20M_145K_2020_embedding_store"))
}
// Production
lazy val interestedIn20M145KUpdatedStore: ReadableStore[UserId, ClustersUserIsInterestedIn] = {
UserInterestedInReadableStore.defaultStoreWithMtls(
mhMtlsParams,
modelVersion = ModelVersions.Model20M145KUpdated
)
}
// Production
lazy val interestedIn20M145K2020Store: ReadableStore[UserId, ClustersUserIsInterestedIn] = {
UserInterestedInReadableStore.defaultStoreWithMtls(
mhMtlsParams,
modelVersion = ModelVersions.Model20M145K2020
)
}
// Production
lazy val InterestedInFromPE20M145KUpdatedStore: ReadableStore[
UserId,
ClustersUserIsInterestedIn
] = {
UserInterestedInReadableStore.defaultIIPEStoreWithMtls(
mhMtlsParams,
modelVersion = ModelVersions.Model20M145KUpdated)
}
lazy val simClustersInterestedInStore: ReadableStore[
(UserId, ModelVersion),
ClustersUserIsInterestedIn
] = {
new ReadableStore[(UserId, ModelVersion), ClustersUserIsInterestedIn] {
override def get(k: (UserId, ModelVersion)): Future[Option[ClustersUserIsInterestedIn]] = {
k match {
case (userId, Model20m145kUpdated) =>
interestedIn20M145KUpdatedStore.get(userId)
case (userId, Model20m145k2020) =>
interestedIn20M145K2020Store.get(userId)
case _ =>
Future.None
}
}
}
}
lazy val simClustersInterestedInFromProducerEmbeddingsStore: ReadableStore[
(UserId, ModelVersion),
ClustersUserIsInterestedIn
] = {
new ReadableStore[(UserId, ModelVersion), ClustersUserIsInterestedIn] {
override def get(k: (UserId, ModelVersion)): Future[Option[ClustersUserIsInterestedIn]] = {
k match {
case (userId, ModelVersion.Model20m145kUpdated) =>
InterestedInFromPE20M145KUpdatedStore.get(userId)
case _ =>
Future.None
}
}
}
}
lazy val userInterestedInStore =
new twistly.interestedin.EmbeddingStore(
interestedInStore = simClustersInterestedInStore,
interestedInFromProducerEmbeddingStore = simClustersInterestedInFromProducerEmbeddingsStore,
statsReceiver = statsReceiver
)
// Production
lazy val favBasedUserInterestedIn20M145KUpdatedStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val underlyingStore =
UserInterestedInReadableStore
.defaultSimClustersEmbeddingStoreWithMtls(
mhMtlsParams,
EmbeddingType.FavBasedUserInterestedIn,
ModelVersion.Model20m145kUpdated)
.mapValues(_.toThrift)
val memcachedStore = ObservedMemcachedReadableStore
.fromCacheClient(
backingStore = underlyingStore,
cacheClient = cacheClient,
ttl = 12.hours
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = statsReceiver.scope("fav_based_user_interested_in_mem_cache"),
keyToString = { k => embeddingCacheKeyBuilder.apply(k) }
).mapValues(SimClustersEmbedding(_))
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
memcachedStore,
ttl = 6.hours,
maxKeys = 262143,
cacheName = "fav_based_user_interested_in_cache",
windowSize = 10000L
)(statsReceiver.scope("fav_based_user_interested_in_store"))
}
// Production
lazy val LogFavBasedInterestedInFromAPE20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val underlyingStore =
UserInterestedInReadableStore
.defaultIIAPESimClustersEmbeddingStoreWithMtls(
mhMtlsParams,
EmbeddingType.LogFavBasedUserInterestedInFromAPE,
ModelVersion.Model20m145k2020)
.mapValues(_.toThrift)
val memcachedStore = ObservedMemcachedReadableStore
.fromCacheClient(
backingStore = underlyingStore,
cacheClient = cacheClient,
ttl = 12.hours
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = statsReceiver.scope("log_fav_based_user_interested_in_from_ape_mem_cache"),
keyToString = { k => embeddingCacheKeyBuilder.apply(k) }
).mapValues(SimClustersEmbedding(_))
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
memcachedStore,
ttl = 6.hours,
maxKeys = 262143,
cacheName = "log_fav_based_user_interested_in_from_ape_cache",
windowSize = 10000L
)(statsReceiver.scope("log_fav_based_user_interested_in_from_ape_store"))
}
// Production
lazy val FollowBasedInterestedInFromAPE20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val underlyingStore =
UserInterestedInReadableStore
.defaultIIAPESimClustersEmbeddingStoreWithMtls(
mhMtlsParams,
EmbeddingType.FollowBasedUserInterestedInFromAPE,
ModelVersion.Model20m145k2020)
.mapValues(_.toThrift)
val memcachedStore = ObservedMemcachedReadableStore
.fromCacheClient(
backingStore = underlyingStore,
cacheClient = cacheClient,
ttl = 12.hours
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = statsReceiver.scope("follow_based_user_interested_in_from_ape_mem_cache"),
keyToString = { k => embeddingCacheKeyBuilder.apply(k) }
).mapValues(SimClustersEmbedding(_))
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
memcachedStore,
ttl = 6.hours,
maxKeys = 262143,
cacheName = "follow_based_user_interested_in_from_ape_cache",
windowSize = 10000L
)(statsReceiver.scope("follow_based_user_interested_in_from_ape_store"))
}
// Production
lazy val favBasedUserInterestedIn20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val underlyingStore: ReadableStore[SimClustersEmbeddingId, ThriftSimClustersEmbedding] =
UserInterestedInReadableStore
.defaultSimClustersEmbeddingStoreWithMtls(
mhMtlsParams,
EmbeddingType.FavBasedUserInterestedIn,
ModelVersion.Model20m145k2020).mapValues(_.toThrift)
ObservedMemcachedReadableStore
.fromCacheClient(
backingStore = underlyingStore,
cacheClient = cacheClient,
ttl = 12.hours
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = statsReceiver.scope("fav_based_user_interested_in_2020_mem_cache"),
keyToString = { k => embeddingCacheKeyBuilder.apply(k) }
).mapValues(SimClustersEmbedding(_))
}
// Production
lazy val logFavBasedUserInterestedIn20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val underlyingStore =
UserInterestedInReadableStore
.defaultSimClustersEmbeddingStoreWithMtls(
mhMtlsParams,
EmbeddingType.LogFavBasedUserInterestedIn,
ModelVersion.Model20m145k2020)
val memcachedStore = ObservedMemcachedReadableStore
.fromCacheClient(
backingStore = underlyingStore.mapValues(_.toThrift),
cacheClient = cacheClient,
ttl = 12.hours
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = statsReceiver.scope("log_fav_based_user_interested_in_2020_store"),
keyToString = { k => embeddingCacheKeyBuilder.apply(k) }
).mapValues(SimClustersEmbedding(_))
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
memcachedStore,
ttl = 6.hours,
maxKeys = 262143,
cacheName = "log_fav_based_user_interested_in_2020_cache",
windowSize = 10000L
)(statsReceiver.scope("log_fav_based_user_interested_in_2020_store"))
}
// Production
lazy val favBasedUserInterestedInFromPE20M145KUpdatedStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val underlyingStore =
UserInterestedInReadableStore
.defaultIIPESimClustersEmbeddingStoreWithMtls(
mhMtlsParams,
EmbeddingType.FavBasedUserInterestedInFromPE,
ModelVersion.Model20m145kUpdated)
.mapValues(_.toThrift)
val memcachedStore = ObservedMemcachedReadableStore
.fromCacheClient(
backingStore = underlyingStore,
cacheClient = cacheClient,
ttl = 12.hours
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(ThriftSimClustersEmbedding)),
statsReceiver = statsReceiver.scope("fav_based_user_interested_in_from_pe_mem_cache"),
keyToString = { k => embeddingCacheKeyBuilder.apply(k) }
).mapValues(SimClustersEmbedding(_))
ObservedCachedReadableStore.from[SimClustersEmbeddingId, SimClustersEmbedding](
memcachedStore,
ttl = 6.hours,
maxKeys = 262143,
cacheName = "fav_based_user_interested_in_from_pe_cache",
windowSize = 10000L
)(statsReceiver.scope("fav_based_user_interested_in_from_pe_cache"))
}
private val underlyingStores: Map[
(EmbeddingType, ModelVersion),
ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding]
] = Map(
// Tweet Embeddings
(LogFavBasedTweet, Model20m145kUpdated) -> logFavBased20M145KUpdatedTweetEmbeddingStore,
(LogFavBasedTweet, Model20m145k2020) -> logFavBased20M145K2020TweetEmbeddingStore,
(
LogFavLongestL2EmbeddingTweet,
Model20m145k2020) -> logFavBasedLongestL2Tweet20M145K2020EmbeddingStore,
// Entity Embeddings
(FavTfgTopic, Model20m145k2020) -> favBasedTfgTopicEmbedding2020Store,
(
LogFavBasedKgoApeTopic,
Model20m145k2020) -> logFavBasedApeEntity20M145K2020EmbeddingCachedStore,
// KnownFor Embeddings
(FavBasedProducer, Model20m145k2020) -> favBasedProducer20M145K2020EmbeddingStore,
(
RelaxedAggregatableLogFavBasedProducer,
Model20m145k2020) -> relaxedLogFavBasedApe20M145K2020EmbeddingCachedStore,
// InterestedIn Embeddings
(
LogFavBasedUserInterestedInFromAPE,
Model20m145k2020) -> LogFavBasedInterestedInFromAPE20M145K2020Store,
(
FollowBasedUserInterestedInFromAPE,
Model20m145k2020) -> FollowBasedInterestedInFromAPE20M145K2020Store,
(FavBasedUserInterestedIn, Model20m145kUpdated) -> favBasedUserInterestedIn20M145KUpdatedStore,
(FavBasedUserInterestedIn, Model20m145k2020) -> favBasedUserInterestedIn20M145K2020Store,
(LogFavBasedUserInterestedIn, Model20m145k2020) -> logFavBasedUserInterestedIn20M145K2020Store,
(
FavBasedUserInterestedInFromPE,
Model20m145kUpdated) -> favBasedUserInterestedInFromPE20M145KUpdatedStore,
(FilteredUserInterestedIn, Model20m145kUpdated) -> userInterestedInStore,
(FilteredUserInterestedIn, Model20m145k2020) -> userInterestedInStore,
(FilteredUserInterestedInFromPE, Model20m145kUpdated) -> userInterestedInStore,
(UnfilteredUserInterestedIn, Model20m145kUpdated) -> userInterestedInStore,
(UnfilteredUserInterestedIn, Model20m145k2020) -> userInterestedInStore,
)
val simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val underlying: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] =
SimClustersEmbeddingStore.buildWithDecider(
underlyingStores = underlyingStores,
decider = rmsDecider.decider,
statsReceiver = statsReceiver.scope("simClusters_embeddings_store_deciderable")
)
val underlyingWithTimeout: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] =
new ReadableStoreWithTimeout(
rs = underlying,
decider = rmsDecider.decider,
enableTimeoutDeciderKey = DeciderConstants.enableSimClustersEmbeddingStoreTimeouts,
timeoutValueKey = DeciderConstants.simClustersEmbeddingStoreTimeoutValueMillis,
timer = timer,
statsReceiver = statsReceiver.scope("simClusters_embedding_store_timeouts")
)
ObservedReadableStore(
store = underlyingWithTimeout
)(statsReceiver.scope("simClusters_embeddings_store"))
}
}
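The lazy vals above repeat one layering: a Manhattan- or Strato-backed store wrapped in a memcached tier, then an in-process cache. A minimal read-through sketch of a single tier (hypothetical `Store` trait; not the storehaus `ReadableStore` API, which is `Future`-based):

```scala
import scala.collection.mutable

trait Store[K, V] { def get(k: K): Option[V] }

// Read-through cache tier: serve from the local map when possible,
// otherwise fetch from the backing tier and remember the result.
final class ReadThroughCache[K, V](backing: Store[K, V], maxKeys: Int)
    extends Store[K, V] {
  private val cache = mutable.LinkedHashMap.empty[K, V]

  override def get(k: K): Option[V] =
    cache.get(k).orElse {
      val fetched = backing.get(k)
      fetched.foreach { v =>
        if (cache.size >= maxKeys) cache.remove(cache.head._1) // FIFO eviction
        cache.update(k, v)
      }
      fetched
    }
}
```

In `LegacyRMS` this pattern is stacked twice per store — a memcached tier keyed via `embeddingCacheKeyBuilder`, then an in-memory tier bounded by `maxKeys` — with each tier's ttl trading staleness for load on the backing store.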


@@ -0,0 +1,18 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication",
"finagle/finagle-stats",
"finatra/inject/inject-core/src/main/scala",
"frigate/frigate-common/src/main/scala/com/twitter/frigate/common/util",
"interests-service/thrift/src/main/thrift:thrift-scala",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/common",
"servo/util",
"src/scala/com/twitter/storehaus_internal/manhattan",
"src/scala/com/twitter/storehaus_internal/memcache",
"src/scala/com/twitter/storehaus_internal/util",
"strato/src/main/scala/com/twitter/strato/client",
],
)


@@ -0,0 +1,34 @@
package com.twitter.representation_manager.modules
import com.google.inject.Provides
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.memcached.Client
import com.twitter.finagle.mtls.authentication.ServiceIdentifier
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.inject.TwitterModule
import com.twitter.storehaus_internal.memcache.MemcacheStore
import com.twitter.storehaus_internal.util.ClientName
import com.twitter.storehaus_internal.util.ZkEndPoint
import javax.inject.Singleton
object CacheModule extends TwitterModule {
private val cacheDest = flag[String]("cache_module.dest", "Path to memcache service")
private val timeout = flag[Int]("memcache.timeout", "Memcache client timeout")
private val retries = flag[Int]("memcache.retries", "Memcache timeout retries")
@Singleton
@Provides
def providesCache(
serviceIdentifier: ServiceIdentifier,
stats: StatsReceiver
): Client =
MemcacheStore.memcachedClient(
name = ClientName("memcache_representation_manager"),
dest = ZkEndPoint(cacheDest()),
timeout = timeout().milliseconds,
retries = retries(),
statsReceiver = stats.scope("cache_client"),
serviceIdentifier = serviceIdentifier
)
}


@@ -0,0 +1,40 @@
package com.twitter.representation_manager.modules
import com.google.inject.Provides
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.ThriftMux
import com.twitter.finagle.mtls.authentication.ServiceIdentifier
import com.twitter.finagle.mtls.client.MtlsStackClient.MtlsThriftMuxClientSyntax
import com.twitter.finagle.mux.ClientDiscardedRequestException
import com.twitter.finagle.service.ReqRep
import com.twitter.finagle.service.ResponseClass
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.finagle.thrift.ClientId
import com.twitter.inject.TwitterModule
import com.twitter.interests.thriftscala.InterestsThriftService
import com.twitter.util.Throw
import javax.inject.Singleton
object InterestsThriftClientModule extends TwitterModule {
@Singleton
@Provides
def providesInterestsThriftClient(
clientId: ClientId,
serviceIdentifier: ServiceIdentifier,
statsReceiver: StatsReceiver
): InterestsThriftService.MethodPerEndpoint = {
ThriftMux.client
.withClientId(clientId)
.withMutualTls(serviceIdentifier)
.withRequestTimeout(450.milliseconds)
.withStatsReceiver(statsReceiver.scope("InterestsThriftClient"))
.withResponseClassifier {
case ReqRep(_, Throw(_: ClientDiscardedRequestException)) => ResponseClass.Ignorable
}
.build[InterestsThriftService.MethodPerEndpoint](
dest = "/s/interests-thrift-service/interests-thrift-service",
label = "interests_thrift_service"
)
}
}

View File

@ -0,0 +1,18 @@
package com.twitter.representation_manager.modules
import com.google.inject.Provides
import com.twitter.inject.TwitterModule
import javax.inject.Named
import javax.inject.Singleton
object LegacyRMSConfigModule extends TwitterModule {
@Singleton
@Provides
@Named("cacheHashKeyPrefix")
def providesCacheHashKeyPrefix: String = "RMS"
@Singleton
@Provides
@Named("useContentRecommenderConfiguration")
def providesUseContentRecommenderConfiguration: Boolean = false
}

View File

@ -0,0 +1,24 @@
package com.twitter.representation_manager.modules
import com.google.inject.Provides
import javax.inject.Singleton
import com.twitter.inject.TwitterModule
import com.twitter.decider.Decider
import com.twitter.finagle.mtls.authentication.ServiceIdentifier
import com.twitter.representation_manager.common.RepresentationManagerDecider
import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams
object StoreModule extends TwitterModule {
@Singleton
@Provides
def providesMhMtlsParams(
serviceIdentifier: ServiceIdentifier
): ManhattanKVClientMtlsParams = ManhattanKVClientMtlsParams(serviceIdentifier)
@Singleton
@Provides
def providesRmsDecider(
decider: Decider
): RepresentationManagerDecider = RepresentationManagerDecider(decider)
}

View File

@ -0,0 +1,13 @@
package com.twitter.representation_manager.modules
import com.google.inject.Provides
import com.twitter.finagle.util.DefaultTimer
import com.twitter.inject.TwitterModule
import com.twitter.util.Timer
import javax.inject.Singleton
object TimerModule extends TwitterModule {
@Singleton
@Provides
def providesTimer: Timer = DefaultTimer
}

View File

@ -0,0 +1,39 @@
package com.twitter.representation_manager.modules
import com.google.inject.Provides
import com.twitter.escherbird.util.uttclient.CacheConfigV2
import com.twitter.escherbird.util.uttclient.CachedUttClientV2
import com.twitter.escherbird.util.uttclient.UttClientCacheConfigsV2
import com.twitter.escherbird.utt.strato.thriftscala.Environment
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.inject.TwitterModule
import com.twitter.strato.client.{Client => StratoClient}
import javax.inject.Singleton
object UttClientModule extends TwitterModule {
@Singleton
@Provides
def providesUttClient(
stratoClient: StratoClient,
statsReceiver: StatsReceiver
): CachedUttClientV2 = {
// Cache up to 2^18 - 1 (262143) UTT entities, aiming for a near-100% cache hit rate
val defaultCacheConfigV2: CacheConfigV2 = CacheConfigV2(262143)
val uttClientCacheConfigsV2: UttClientCacheConfigsV2 = UttClientCacheConfigsV2(
getTaxonomyConfig = defaultCacheConfigV2,
getUttTaxonomyConfig = defaultCacheConfigV2,
getLeafIds = defaultCacheConfigV2,
getLeafUttEntities = defaultCacheConfigV2
)
// Build a CachedUttClientV2 that reads through the StratoClient
new CachedUttClientV2(
stratoClient = stratoClient,
env = Environment.Prod,
cacheConfigs = uttClientCacheConfigsV2,
statsReceiver = statsReceiver.scope("cached_utt_client")
)
}
}

View File

@ -0,0 +1,16 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"content-recommender/server/src/main/scala/com/twitter/contentrecommender:representation-manager-deps",
"frigate/frigate-common/src/main/scala/com/twitter/frigate/common/util",
"hermit/hermit-core/src/main/scala/com/twitter/hermit/store/common",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/common",
"src/scala/com/twitter/simclusters_v2/stores",
"src/scala/com/twitter/simclusters_v2/summingbird/stores",
"src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala",
"storage/clients/manhattan/client/src/main/scala",
"tweetypie/src/scala/com/twitter/tweetypie/util",
],
)

View File

@ -0,0 +1,39 @@
package com.twitter.representation_manager.store
import com.twitter.servo.decider.DeciderKeyEnum
object DeciderConstants {
// Deciders inherited from CR and RSX that are only used in LegacyRMS.
// Their values are controlled by CR's and RSX's yml files and decider dashboards.
// We will remove them once the migration is complete.
val enableLogFavBasedApeEntity20M145KUpdatedEmbeddingCachedStore =
"enableLogFavBasedApeEntity20M145KUpdatedEmbeddingCachedStore"
val enableLogFavBasedApeEntity20M145K2020EmbeddingCachedStore =
"enableLogFavBasedApeEntity20M145K2020EmbeddingCachedStore"
val enablelogFavBased20M145K2020TweetEmbeddingStoreTimeouts =
"enable_log_fav_based_tweet_embedding_20m145k2020_timeouts"
val logFavBased20M145K2020TweetEmbeddingStoreTimeoutValueMillis =
"log_fav_based_tweet_embedding_20m145k2020_timeout_value_millis"
val enablelogFavBased20M145KUpdatedTweetEmbeddingStoreTimeouts =
"enable_log_fav_based_tweet_embedding_20m145kUpdated_timeouts"
val logFavBased20M145KUpdatedTweetEmbeddingStoreTimeoutValueMillis =
"log_fav_based_tweet_embedding_20m145kUpdated_timeout_value_millis"
val enableSimClustersEmbeddingStoreTimeouts = "enable_sim_clusters_embedding_store_timeouts"
val simClustersEmbeddingStoreTimeoutValueMillis =
"sim_clusters_embedding_store_timeout_value_millis"
}
// Necessary for using servo Gates
object DeciderKey extends DeciderKeyEnum {
val enableLogFavBasedApeEntity20M145KUpdatedEmbeddingCachedStore: Value = Value(
DeciderConstants.enableLogFavBasedApeEntity20M145KUpdatedEmbeddingCachedStore
)
val enableLogFavBasedApeEntity20M145K2020EmbeddingCachedStore: Value = Value(
DeciderConstants.enableLogFavBasedApeEntity20M145K2020EmbeddingCachedStore
)
}
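
A minimal sketch of how the `DeciderKey` values above are typically consumed via servo Gates (the `DeciderGateBuilder.idGate` call is an assumption about the servo API, and the wiring is hypothetical):

```scala
import com.twitter.servo.decider.DeciderGateBuilder
import com.twitter.servo.util.Gate

// Hypothetical wiring: turn the enum value into an id-keyed gate, so the
// cached APE entity store can be ramped per-id from the decider dashboard.
def apeCachedStoreGate(deciderGateBuilder: DeciderGateBuilder): Gate[Long] =
  deciderGateBuilder.idGate(
    DeciderKey.enableLogFavBasedApeEntity20M145K2020EmbeddingCachedStore)
```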

View File

@ -0,0 +1,198 @@
package com.twitter.representation_manager.store
import com.twitter.contentrecommender.store.ApeEntityEmbeddingStore
import com.twitter.contentrecommender.store.InterestsOptOutStore
import com.twitter.contentrecommender.store.SemanticCoreTopicSeedStore
import com.twitter.conversions.DurationOps._
import com.twitter.escherbird.util.uttclient.CachedUttClientV2
import com.twitter.finagle.memcached.Client
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.frigate.common.store.strato.StratoFetchableStore
import com.twitter.frigate.common.util.SeqLongInjection
import com.twitter.hermit.store.common.ObservedCachedReadableStore
import com.twitter.hermit.store.common.ObservedMemcachedReadableStore
import com.twitter.hermit.store.common.ObservedReadableStore
import com.twitter.interests.thriftscala.InterestsThriftService
import com.twitter.representation_manager.common.MemCacheConfig
import com.twitter.representation_manager.common.RepresentationManagerDecider
import com.twitter.simclusters_v2.common.SimClustersEmbedding
import com.twitter.simclusters_v2.stores.SimClustersEmbeddingStore
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.EmbeddingType._
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.ModelVersion
import com.twitter.simclusters_v2.thriftscala.ModelVersion._
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.TopicId
import com.twitter.simclusters_v2.thriftscala.LocaleEntityId
import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding}
import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams
import com.twitter.storehaus.ReadableStore
import com.twitter.strato.client.{Client => StratoClient}
import com.twitter.tweetypie.util.UserId
import javax.inject.Inject
class TopicSimClustersEmbeddingStore @Inject() (
stratoClient: StratoClient,
cacheClient: Client,
globalStats: StatsReceiver,
mhMtlsParams: ManhattanKVClientMtlsParams,
rmsDecider: RepresentationManagerDecider,
interestService: InterestsThriftService.MethodPerEndpoint,
uttClient: CachedUttClientV2) {
private val stats = globalStats.scope(this.getClass.getSimpleName)
private val interestsOptOutStore = InterestsOptOutStore(interestService)
/**
* Note this is NOT an embedding store. It maps each topic to the list of author account ids
* that we use to represent it.
*/
private val semanticCoreTopicSeedStore: ReadableStore[
SemanticCoreTopicSeedStore.Key,
Seq[UserId]
] = {
/*
Up to 1000 Long seeds per topic/language ~= 8KB per topic/language (worst case)
Assume ~10k active topic/languages ~= 80MB (worst case)
*/
val underlying = new SemanticCoreTopicSeedStore(uttClient, interestsOptOutStore)(
stats.scope("semantic_core_topic_seed_store"))
val memcacheStore = ObservedMemcachedReadableStore.fromCacheClient(
backingStore = underlying,
cacheClient = cacheClient,
ttl = 12.hours)(
valueInjection = SeqLongInjection,
statsReceiver = stats.scope("topic_producer_seed_store_mem_cache"),
keyToString = { k => s"tpss:${k.entityId}_${k.languageCode}" }
)
ObservedCachedReadableStore.from[SemanticCoreTopicSeedStore.Key, Seq[UserId]](
store = memcacheStore,
ttl = 6.hours,
maxKeys = 20e3.toInt,
cacheName = "topic_producer_seed_store_cache",
windowSize = 5000
)(stats.scope("topic_producer_seed_store_cache"))
}
private val favBasedTfgTopicEmbedding20m145k2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore =
StratoFetchableStore
.withUnitView[SimClustersEmbeddingId, ThriftSimClustersEmbedding](
stratoClient,
"recommendations/simclusters_v2/embeddings/favBasedTFGTopic20M145K2020").mapValues(
embedding => SimClustersEmbedding(embedding, truncate = 50).toThrift)
.composeKeyMapping[LocaleEntityId] { localeEntityId =>
SimClustersEmbeddingId(
FavTfgTopic,
Model20m145k2020,
InternalId.LocaleEntityId(localeEntityId))
}
buildLocaleEntityIdMemCacheStore(rawStore, FavTfgTopic, Model20m145k2020)
}
private val logFavBasedApeEntity20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val apeStore = StratoFetchableStore
.withUnitView[SimClustersEmbeddingId, ThriftSimClustersEmbedding](
stratoClient,
"recommendations/simclusters_v2/embeddings/logFavBasedAPE20M145K2020")
.mapValues(embedding => SimClustersEmbedding(embedding, truncate = 50))
.composeKeyMapping[UserId]({ id =>
SimClustersEmbeddingId(
AggregatableLogFavBasedProducer,
Model20m145k2020,
InternalId.UserId(id))
})
val rawStore = new ApeEntityEmbeddingStore(
semanticCoreSeedStore = semanticCoreTopicSeedStore,
aggregatableProducerEmbeddingStore = apeStore,
statsReceiver = stats.scope("log_fav_based_ape_entity_2020_embedding_store"))
.mapValues(embedding => SimClustersEmbedding(embedding.toThrift, truncate = 50).toThrift)
.composeKeyMapping[TopicId] { topicId =>
SimClustersEmbeddingId(
LogFavBasedKgoApeTopic,
Model20m145k2020,
InternalId.TopicId(topicId))
}
buildTopicIdMemCacheStore(rawStore, LogFavBasedKgoApeTopic, Model20m145k2020)
}
private def buildTopicIdMemCacheStore(
rawStore: ReadableStore[TopicId, ThriftSimClustersEmbedding],
embeddingType: EmbeddingType,
modelVersion: ModelVersion
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val observedStore: ObservedReadableStore[TopicId, ThriftSimClustersEmbedding] =
ObservedReadableStore(
store = rawStore
)(stats.scope(embeddingType.name).scope(modelVersion.name))
val storeWithKeyMapping = observedStore.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.TopicId(topicId)) =>
topicId
}
MemCacheConfig.buildMemCacheStoreForSimClustersEmbedding(
storeWithKeyMapping,
cacheClient,
embeddingType,
modelVersion,
stats
)
}
private def buildLocaleEntityIdMemCacheStore(
rawStore: ReadableStore[LocaleEntityId, ThriftSimClustersEmbedding],
embeddingType: EmbeddingType,
modelVersion: ModelVersion
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val observedStore: ObservedReadableStore[LocaleEntityId, ThriftSimClustersEmbedding] =
ObservedReadableStore(
store = rawStore
)(stats.scope(embeddingType.name).scope(modelVersion.name))
val storeWithKeyMapping = observedStore.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.LocaleEntityId(localeEntityId)) =>
localeEntityId
}
MemCacheConfig.buildMemCacheStoreForSimClustersEmbedding(
storeWithKeyMapping,
cacheClient,
embeddingType,
modelVersion,
stats
)
}
private val underlyingStores: Map[
(EmbeddingType, ModelVersion),
ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding]
] = Map(
// Topic Embeddings
(FavTfgTopic, Model20m145k2020) -> favBasedTfgTopicEmbedding20m145k2020Store,
(LogFavBasedKgoApeTopic, Model20m145k2020) -> logFavBasedApeEntity20M145K2020EmbeddingStore,
)
val topicSimClustersEmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
SimClustersEmbeddingStore.buildWithDecider(
underlyingStores = underlyingStores,
decider = rmsDecider.decider,
statsReceiver = stats
)
}
}
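
The store built above is keyed by the full `SimClustersEmbeddingId`, so a caller assembles the (EmbeddingType, ModelVersion, InternalId) triple and the decider-gated store routes it to the matching underlying store. A hypothetical lookup (field names and ids here are illustrative):

```scala
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.LocaleEntityId
import com.twitter.simclusters_v2.thriftscala.ModelVersion
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId

// Hypothetical caller: fetch the FavTfgTopic embedding for an (entity, locale) pair.
val id = SimClustersEmbeddingId(
  EmbeddingType.FavTfgTopic,
  ModelVersion.Model20m145k2020,
  InternalId.LocaleEntityId(LocaleEntityId(entityId = 123L, language = "en")))
// topicStore.topicSimClustersEmbeddingStore.get(id) returns a
// Future[Option[SimClustersEmbedding]]; it resolves to None when the decider
// disables the underlying store or no embedding exists for the id.
```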

View File

@ -0,0 +1,141 @@
package com.twitter.representation_manager.store
import com.twitter.finagle.memcached.Client
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.hermit.store.common.ObservedReadableStore
import com.twitter.representation_manager.common.MemCacheConfig
import com.twitter.representation_manager.common.RepresentationManagerDecider
import com.twitter.simclusters_v2.common.SimClustersEmbedding
import com.twitter.simclusters_v2.common.TweetId
import com.twitter.simclusters_v2.stores.SimClustersEmbeddingStore
import com.twitter.simclusters_v2.summingbird.stores.PersistentTweetEmbeddingStore
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.EmbeddingType._
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.ModelVersion
import com.twitter.simclusters_v2.thriftscala.ModelVersion._
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding}
import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams
import com.twitter.storehaus.ReadableStore
import javax.inject.Inject
class TweetSimClustersEmbeddingStore @Inject() (
cacheClient: Client,
globalStats: StatsReceiver,
mhMtlsParams: ManhattanKVClientMtlsParams,
rmsDecider: RepresentationManagerDecider) {
private val stats = globalStats.scope(this.getClass.getSimpleName)
val logFavBasedLongestL2Tweet20M145KUpdatedEmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore =
PersistentTweetEmbeddingStore
.longestL2NormTweetEmbeddingStoreManhattan(
mhMtlsParams,
PersistentTweetEmbeddingStore.LogFavBased20m145kUpdatedDataset,
stats
).mapValues(_.toThrift)
buildMemCacheStore(rawStore, LogFavLongestL2EmbeddingTweet, Model20m145kUpdated)
}
val logFavBasedLongestL2Tweet20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore =
PersistentTweetEmbeddingStore
.longestL2NormTweetEmbeddingStoreManhattan(
mhMtlsParams,
PersistentTweetEmbeddingStore.LogFavBased20m145k2020Dataset,
stats
).mapValues(_.toThrift)
buildMemCacheStore(rawStore, LogFavLongestL2EmbeddingTweet, Model20m145k2020)
}
val logFavBased20M145KUpdatedTweetEmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore =
PersistentTweetEmbeddingStore
.mostRecentTweetEmbeddingStoreManhattan(
mhMtlsParams,
PersistentTweetEmbeddingStore.LogFavBased20m145kUpdatedDataset,
stats
).mapValues(_.toThrift)
buildMemCacheStore(rawStore, LogFavBasedTweet, Model20m145kUpdated)
}
val logFavBased20M145K2020TweetEmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore =
PersistentTweetEmbeddingStore
.mostRecentTweetEmbeddingStoreManhattan(
mhMtlsParams,
PersistentTweetEmbeddingStore.LogFavBased20m145k2020Dataset,
stats
).mapValues(_.toThrift)
buildMemCacheStore(rawStore, LogFavBasedTweet, Model20m145k2020)
}
private def buildMemCacheStore(
rawStore: ReadableStore[TweetId, ThriftSimClustersEmbedding],
embeddingType: EmbeddingType,
modelVersion: ModelVersion
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val observedStore: ObservedReadableStore[TweetId, ThriftSimClustersEmbedding] =
ObservedReadableStore(
store = rawStore
)(stats.scope(embeddingType.name).scope(modelVersion.name))
val storeWithKeyMapping = observedStore.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.TweetId(tweetId)) =>
tweetId
}
MemCacheConfig.buildMemCacheStoreForSimClustersEmbedding(
storeWithKeyMapping,
cacheClient,
embeddingType,
modelVersion,
stats
)
}
private val underlyingStores: Map[
(EmbeddingType, ModelVersion),
ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding]
] = Map(
// Tweet Embeddings
(LogFavBasedTweet, Model20m145kUpdated) -> logFavBased20M145KUpdatedTweetEmbeddingStore,
(LogFavBasedTweet, Model20m145k2020) -> logFavBased20M145K2020TweetEmbeddingStore,
(
LogFavLongestL2EmbeddingTweet,
Model20m145kUpdated) -> logFavBasedLongestL2Tweet20M145KUpdatedEmbeddingStore,
(
LogFavLongestL2EmbeddingTweet,
Model20m145k2020) -> logFavBasedLongestL2Tweet20M145K2020EmbeddingStore,
)
val tweetSimClustersEmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
SimClustersEmbeddingStore.buildWithDecider(
underlyingStores = underlyingStores,
decider = rmsDecider.decider,
statsReceiver = stats
)
}
}

View File

@ -0,0 +1,602 @@
package com.twitter.representation_manager.store
import com.twitter.contentrecommender.twistly
import com.twitter.finagle.memcached.Client
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.frigate.common.store.strato.StratoFetchableStore
import com.twitter.hermit.store.common.ObservedReadableStore
import com.twitter.representation_manager.common.MemCacheConfig
import com.twitter.representation_manager.common.RepresentationManagerDecider
import com.twitter.simclusters_v2.common.ModelVersions
import com.twitter.simclusters_v2.common.SimClustersEmbedding
import com.twitter.simclusters_v2.stores.SimClustersEmbeddingStore
import com.twitter.simclusters_v2.summingbird.stores.ProducerClusterEmbeddingReadableStores
import com.twitter.simclusters_v2.summingbird.stores.UserInterestedInReadableStore
import com.twitter.simclusters_v2.summingbird.stores.UserInterestedInReadableStore.getStore
import com.twitter.simclusters_v2.summingbird.stores.UserInterestedInReadableStore.modelVersionToDatasetMap
import com.twitter.simclusters_v2.summingbird.stores.UserInterestedInReadableStore.knownModelVersions
import com.twitter.simclusters_v2.summingbird.stores.UserInterestedInReadableStore.toSimClustersEmbedding
import com.twitter.simclusters_v2.thriftscala.ClustersUserIsInterestedIn
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.EmbeddingType._
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.ModelVersion
import com.twitter.simclusters_v2.thriftscala.ModelVersion._
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding}
import com.twitter.storage.client.manhattan.kv.ManhattanKVClientMtlsParams
import com.twitter.storehaus.ReadableStore
import com.twitter.storehaus_internal.manhattan.Apollo
import com.twitter.storehaus_internal.manhattan.ManhattanCluster
import com.twitter.strato.client.{Client => StratoClient}
import com.twitter.strato.thrift.ScroogeConvImplicits._
import com.twitter.tweetypie.util.UserId
import com.twitter.util.Future
import javax.inject.Inject
class UserSimClustersEmbeddingStore @Inject() (
stratoClient: StratoClient,
cacheClient: Client,
globalStats: StatsReceiver,
mhMtlsParams: ManhattanKVClientMtlsParams,
rmsDecider: RepresentationManagerDecider) {
private val stats = globalStats.scope(this.getClass.getSimpleName)
private val favBasedProducer20M145KUpdatedEmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore = ProducerClusterEmbeddingReadableStores
.getProducerTopKSimClustersEmbeddingsStore(
mhMtlsParams
).mapValues { topSimClustersWithScore =>
ThriftSimClustersEmbedding(topSimClustersWithScore.topClusters)
}.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.UserId(userId)) =>
userId
}
buildMemCacheStore(rawStore, FavBasedProducer, Model20m145kUpdated)
}
private val favBasedProducer20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore = ProducerClusterEmbeddingReadableStores
.getProducerTopKSimClusters2020EmbeddingsStore(
mhMtlsParams
).mapValues { topSimClustersWithScore =>
ThriftSimClustersEmbedding(topSimClustersWithScore.topClusters)
}.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.UserId(userId)) =>
userId
}
buildMemCacheStore(rawStore, FavBasedProducer, Model20m145k2020)
}
private val followBasedProducer20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore = ProducerClusterEmbeddingReadableStores
.getProducerTopKSimClustersEmbeddingsByFollowStore(
mhMtlsParams
).mapValues { topSimClustersWithScore =>
ThriftSimClustersEmbedding(topSimClustersWithScore.topClusters)
}.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(_, _, InternalId.UserId(userId)) =>
userId
}
buildMemCacheStore(rawStore, FollowBasedProducer, Model20m145k2020)
}
private val logFavBasedApe20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore = StratoFetchableStore
.withUnitView[SimClustersEmbeddingId, ThriftSimClustersEmbedding](
stratoClient,
"recommendations/simclusters_v2/embeddings/logFavBasedAPE20M145K2020")
.mapValues(embedding => SimClustersEmbedding(embedding, truncate = 50).toThrift)
buildMemCacheStore(rawStore, AggregatableLogFavBasedProducer, Model20m145k2020)
}
private val rawRelaxedLogFavBasedApe20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
ThriftSimClustersEmbedding
] = {
StratoFetchableStore
.withUnitView[SimClustersEmbeddingId, ThriftSimClustersEmbedding](
stratoClient,
"recommendations/simclusters_v2/embeddings/logFavBasedAPERelaxedFavEngagementThreshold20M145K2020")
.mapValues(embedding => SimClustersEmbedding(embedding, truncate = 50).toThrift)
}
private val relaxedLogFavBasedApe20M145K2020EmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildMemCacheStore(
rawRelaxedLogFavBasedApe20M145K2020EmbeddingStore,
RelaxedAggregatableLogFavBasedProducer,
Model20m145k2020)
}
private val relaxedLogFavBasedApe20m145kUpdatedEmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore = rawRelaxedLogFavBasedApe20M145K2020EmbeddingStore
.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(
RelaxedAggregatableLogFavBasedProducer,
Model20m145kUpdated,
internalId) =>
SimClustersEmbeddingId(
RelaxedAggregatableLogFavBasedProducer,
Model20m145k2020,
internalId)
}
buildMemCacheStore(rawStore, RelaxedAggregatableLogFavBasedProducer, Model20m145kUpdated)
}
private val logFavBasedInterestedInFromAPE20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildUserInterestedInStore(
UserInterestedInReadableStore.defaultIIAPESimClustersEmbeddingStoreWithMtls,
LogFavBasedUserInterestedInFromAPE,
Model20m145k2020)
}
private val followBasedInterestedInFromAPE20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildUserInterestedInStore(
UserInterestedInReadableStore.defaultIIAPESimClustersEmbeddingStoreWithMtls,
FollowBasedUserInterestedInFromAPE,
Model20m145k2020)
}
private val favBasedUserInterestedIn20M145KUpdatedStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildUserInterestedInStore(
UserInterestedInReadableStore.defaultSimClustersEmbeddingStoreWithMtls,
FavBasedUserInterestedIn,
Model20m145kUpdated)
}
private val favBasedUserInterestedIn20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildUserInterestedInStore(
UserInterestedInReadableStore.defaultSimClustersEmbeddingStoreWithMtls,
FavBasedUserInterestedIn,
Model20m145k2020)
}
private val followBasedUserInterestedIn20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildUserInterestedInStore(
UserInterestedInReadableStore.defaultSimClustersEmbeddingStoreWithMtls,
FollowBasedUserInterestedIn,
Model20m145k2020)
}
private val logFavBasedUserInterestedIn20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildUserInterestedInStore(
UserInterestedInReadableStore.defaultSimClustersEmbeddingStoreWithMtls,
LogFavBasedUserInterestedIn,
Model20m145k2020)
}
private val favBasedUserInterestedInFromPE20M145KUpdatedStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildUserInterestedInStore(
UserInterestedInReadableStore.defaultIIPESimClustersEmbeddingStoreWithMtls,
FavBasedUserInterestedInFromPE,
Model20m145kUpdated)
}
private val twistlyUserInterestedInStore: ReadableStore[
SimClustersEmbeddingId,
ThriftSimClustersEmbedding
] = {
val interestedIn20M145KUpdatedStore = {
UserInterestedInReadableStore.defaultStoreWithMtls(
mhMtlsParams,
modelVersion = ModelVersions.Model20M145KUpdated
)
}
val interestedIn20M145K2020Store = {
UserInterestedInReadableStore.defaultStoreWithMtls(
mhMtlsParams,
modelVersion = ModelVersions.Model20M145K2020
)
}
val interestedInFromPE20M145KUpdatedStore = {
UserInterestedInReadableStore.defaultIIPEStoreWithMtls(
mhMtlsParams,
modelVersion = ModelVersions.Model20M145KUpdated)
}
val simClustersInterestedInStore: ReadableStore[
(UserId, ModelVersion),
ClustersUserIsInterestedIn
] = {
new ReadableStore[(UserId, ModelVersion), ClustersUserIsInterestedIn] {
override def get(k: (UserId, ModelVersion)): Future[Option[ClustersUserIsInterestedIn]] = {
k match {
case (userId, Model20m145kUpdated) =>
interestedIn20M145KUpdatedStore.get(userId)
case (userId, Model20m145k2020) =>
interestedIn20M145K2020Store.get(userId)
case _ =>
Future.None
}
}
}
}
val simClustersInterestedInFromProducerEmbeddingsStore: ReadableStore[
(UserId, ModelVersion),
ClustersUserIsInterestedIn
] = {
new ReadableStore[(UserId, ModelVersion), ClustersUserIsInterestedIn] {
override def get(k: (UserId, ModelVersion)): Future[Option[ClustersUserIsInterestedIn]] = {
k match {
case (userId, ModelVersion.Model20m145kUpdated) =>
interestedInFromPE20M145KUpdatedStore.get(userId)
case _ =>
Future.None
}
}
}
}
new twistly.interestedin.EmbeddingStore(
interestedInStore = simClustersInterestedInStore,
interestedInFromProducerEmbeddingStore = simClustersInterestedInFromProducerEmbeddingsStore,
statsReceiver = stats
).mapValues(_.toThrift)
}
private val userNextInterestedIn20m145k2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildUserInterestedInStore(
UserInterestedInReadableStore.defaultNextInterestedInStoreWithMtls,
UserNextInterestedIn,
Model20m145k2020)
}
private val filteredUserInterestedIn20m145kUpdatedStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildMemCacheStore(twistlyUserInterestedInStore, FilteredUserInterestedIn, Model20m145kUpdated)
}
private val filteredUserInterestedIn20m145k2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildMemCacheStore(twistlyUserInterestedInStore, FilteredUserInterestedIn, Model20m145k2020)
}
private val filteredUserInterestedInFromPE20m145kUpdatedStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildMemCacheStore(
twistlyUserInterestedInStore,
FilteredUserInterestedInFromPE,
Model20m145kUpdated)
}
private val unfilteredUserInterestedIn20m145kUpdatedStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildMemCacheStore(
twistlyUserInterestedInStore,
UnfilteredUserInterestedIn,
Model20m145kUpdated)
}
private val unfilteredUserInterestedIn20m145k2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
buildMemCacheStore(twistlyUserInterestedInStore, UnfilteredUserInterestedIn, Model20m145k2020)
}
// [Experimental] User InterestedIn, generated by aggregating IIAPE embedding from AddressBook
private val logFavBasedInterestedMaxpoolingAddressBookFromIIAPE20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val datasetName = "addressbook_sims_embedding_iiape_maxpooling"
val appId = "wtf_embedding_apollo"
buildUserInterestedInStoreGeneric(
simClustersEmbeddingStoreWithMtls,
LogFavBasedUserInterestedMaxpoolingAddressBookFromIIAPE,
Model20m145k2020,
datasetName = datasetName,
appId = appId,
manhattanCluster = Apollo
)
}
private val logFavBasedInterestedAverageAddressBookFromIIAPE20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val datasetName = "addressbook_sims_embedding_iiape_average"
val appId = "wtf_embedding_apollo"
buildUserInterestedInStoreGeneric(
simClustersEmbeddingStoreWithMtls,
LogFavBasedUserInterestedAverageAddressBookFromIIAPE,
Model20m145k2020,
datasetName = datasetName,
appId = appId,
manhattanCluster = Apollo
)
}
private val logFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val datasetName = "addressbook_sims_embedding_iiape_booktype_maxpooling"
val appId = "wtf_embedding_apollo"
buildUserInterestedInStoreGeneric(
simClustersEmbeddingStoreWithMtls,
LogFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE,
Model20m145k2020,
datasetName = datasetName,
appId = appId,
manhattanCluster = Apollo
)
}
private val logFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val datasetName = "addressbook_sims_embedding_iiape_largestdim_maxpooling"
val appId = "wtf_embedding_apollo"
buildUserInterestedInStoreGeneric(
simClustersEmbeddingStoreWithMtls,
LogFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE,
Model20m145k2020,
datasetName = datasetName,
appId = appId,
manhattanCluster = Apollo
)
}
private val logFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val datasetName = "addressbook_sims_embedding_iiape_louvain_maxpooling"
val appId = "wtf_embedding_apollo"
buildUserInterestedInStoreGeneric(
simClustersEmbeddingStoreWithMtls,
LogFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE,
Model20m145k2020,
datasetName = datasetName,
appId = appId,
manhattanCluster = Apollo
)
}
private val logFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE20M145K2020Store: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val datasetName = "addressbook_sims_embedding_iiape_connected_maxpooling"
val appId = "wtf_embedding_apollo"
buildUserInterestedInStoreGeneric(
simClustersEmbeddingStoreWithMtls,
LogFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE,
Model20m145k2020,
datasetName = datasetName,
appId = appId,
manhattanCluster = Apollo
)
}
/**
* Helper to build a readable store for a UserInterestedIn embedding, given:
* 1. A storeFunc from UserInterestedInReadableStore
* 2. An EmbeddingType
* 3. A ModelVersion
* 4. The shared MemCacheConfig
*/
private def buildUserInterestedInStore(
storeFunc: (ManhattanKVClientMtlsParams, EmbeddingType, ModelVersion) => ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
],
embeddingType: EmbeddingType,
modelVersion: ModelVersion
): ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore = storeFunc(mhMtlsParams, embeddingType, modelVersion)
.mapValues(_.toThrift)
val observedStore = ObservedReadableStore(
store = rawStore
)(stats.scope(embeddingType.name).scope(modelVersion.name))
MemCacheConfig.buildMemCacheStoreForSimClustersEmbedding(
observedStore,
cacheClient,
embeddingType,
modelVersion,
stats
)
}
private def buildUserInterestedInStoreGeneric(
storeFunc: (ManhattanKVClientMtlsParams, EmbeddingType, ModelVersion, String, String,
ManhattanCluster) => ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
],
embeddingType: EmbeddingType,
modelVersion: ModelVersion,
datasetName: String,
appId: String,
manhattanCluster: ManhattanCluster
): ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
val rawStore =
storeFunc(mhMtlsParams, embeddingType, modelVersion, datasetName, appId, manhattanCluster)
.mapValues(_.toThrift)
val observedStore = ObservedReadableStore(
store = rawStore
)(stats.scope(embeddingType.name).scope(modelVersion.name))
MemCacheConfig.buildMemCacheStoreForSimClustersEmbedding(
observedStore,
cacheClient,
embeddingType,
modelVersion,
stats
)
}
private def simClustersEmbeddingStoreWithMtls(
mhMtlsParams: ManhattanKVClientMtlsParams,
embeddingType: EmbeddingType,
modelVersion: ModelVersion,
datasetName: String,
appId: String,
manhattanCluster: ManhattanCluster
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
if (!modelVersionToDatasetMap.contains(ModelVersions.toKnownForModelVersion(modelVersion))) {
throw new IllegalArgumentException(
"Unknown model version: " + modelVersion + ". Known model versions: " + knownModelVersions)
}
getStore(appId, mhMtlsParams, datasetName, manhattanCluster)
.composeKeyMapping[SimClustersEmbeddingId] {
case SimClustersEmbeddingId(theEmbeddingType, theModelVersion, InternalId.UserId(userId))
if theEmbeddingType == embeddingType && theModelVersion == modelVersion =>
userId
}.mapValues(toSimClustersEmbedding(_, embeddingType))
}
private def buildMemCacheStore(
rawStore: ReadableStore[SimClustersEmbeddingId, ThriftSimClustersEmbedding],
embeddingType: EmbeddingType,
modelVersion: ModelVersion
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val observedStore = ObservedReadableStore(
store = rawStore
)(stats.scope(embeddingType.name).scope(modelVersion.name))
MemCacheConfig.buildMemCacheStoreForSimClustersEmbedding(
observedStore,
cacheClient,
embeddingType,
modelVersion,
stats
)
}
private val underlyingStores: Map[
(EmbeddingType, ModelVersion),
ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding]
] = Map(
// KnownFor Embeddings
(FavBasedProducer, Model20m145kUpdated) -> favBasedProducer20M145KUpdatedEmbeddingStore,
(FavBasedProducer, Model20m145k2020) -> favBasedProducer20M145K2020EmbeddingStore,
(FollowBasedProducer, Model20m145k2020) -> followBasedProducer20M145K2020EmbeddingStore,
(AggregatableLogFavBasedProducer, Model20m145k2020) -> logFavBasedApe20M145K2020EmbeddingStore,
(
RelaxedAggregatableLogFavBasedProducer,
Model20m145kUpdated) -> relaxedLogFavBasedApe20m145kUpdatedEmbeddingStore,
(
RelaxedAggregatableLogFavBasedProducer,
Model20m145k2020) -> relaxedLogFavBasedApe20M145K2020EmbeddingStore,
// InterestedIn Embeddings
(
LogFavBasedUserInterestedInFromAPE,
Model20m145k2020) -> logFavBasedInterestedInFromAPE20M145K2020Store,
(
FollowBasedUserInterestedInFromAPE,
Model20m145k2020) -> followBasedInterestedInFromAPE20M145K2020Store,
(FavBasedUserInterestedIn, Model20m145kUpdated) -> favBasedUserInterestedIn20M145KUpdatedStore,
(FavBasedUserInterestedIn, Model20m145k2020) -> favBasedUserInterestedIn20M145K2020Store,
(FollowBasedUserInterestedIn, Model20m145k2020) -> followBasedUserInterestedIn20M145K2020Store,
(LogFavBasedUserInterestedIn, Model20m145k2020) -> logFavBasedUserInterestedIn20M145K2020Store,
(
FavBasedUserInterestedInFromPE,
Model20m145kUpdated) -> favBasedUserInterestedInFromPE20M145KUpdatedStore,
(FilteredUserInterestedIn, Model20m145kUpdated) -> filteredUserInterestedIn20m145kUpdatedStore,
(FilteredUserInterestedIn, Model20m145k2020) -> filteredUserInterestedIn20m145k2020Store,
(
FilteredUserInterestedInFromPE,
Model20m145kUpdated) -> filteredUserInterestedInFromPE20m145kUpdatedStore,
(
UnfilteredUserInterestedIn,
Model20m145kUpdated) -> unfilteredUserInterestedIn20m145kUpdatedStore,
(UnfilteredUserInterestedIn, Model20m145k2020) -> unfilteredUserInterestedIn20m145k2020Store,
(UserNextInterestedIn, Model20m145k2020) -> userNextInterestedIn20m145k2020Store,
(
LogFavBasedUserInterestedMaxpoolingAddressBookFromIIAPE,
Model20m145k2020) -> logFavBasedInterestedMaxpoolingAddressBookFromIIAPE20M145K2020Store,
(
LogFavBasedUserInterestedAverageAddressBookFromIIAPE,
Model20m145k2020) -> logFavBasedInterestedAverageAddressBookFromIIAPE20M145K2020Store,
(
LogFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE,
Model20m145k2020) -> logFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE20M145K2020Store,
(
LogFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE,
Model20m145k2020) -> logFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE20M145K2020Store,
(
LogFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE,
Model20m145k2020) -> logFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE20M145K2020Store,
(
LogFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE,
Model20m145k2020) -> logFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE20M145K2020Store,
)
val userSimClustersEmbeddingStore: ReadableStore[
SimClustersEmbeddingId,
SimClustersEmbedding
] = {
SimClustersEmbeddingStore.buildWithDecider(
underlyingStores = underlyingStores,
decider = rmsDecider.decider,
statsReceiver = stats
)
}
}

View File

@ -0,0 +1,18 @@
create_thrift_libraries(
base_name = "thrift",
sources = [
"com/twitter/representation_manager/service.thrift",
],
platform = "java8",
tags = [
"bazel-compatible",
],
dependency_roots = [
"src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift",
],
generate_languages = [
"java",
"scala",
"strato",
],
)

View File

@ -0,0 +1,14 @@
namespace java com.twitter.representation_manager.thriftjava
#@namespace scala com.twitter.representation_manager.thriftscala
#@namespace strato com.twitter.representation_manager
include "com/twitter/simclusters_v2/online_store.thrift"
include "com/twitter/simclusters_v2/identifier.thrift"
/**
* A uniform column view for all kinds of SimClusters based embeddings.
**/
struct SimClustersEmbeddingView {
1: required identifier.EmbeddingType embeddingType
2: required online_store.ModelVersion modelVersion
}(persisted = 'false', hasPersonalData = 'false')

View File

@ -0,0 +1 @@
# This prevents SQ query from grabbing //:all since it traverses up once to find a BUILD

View File

@ -0,0 +1,5 @@
# Representation Scorer #
**Representation Scorer** (RSX) serves as a centralized scoring system, offering SimClusters and other embedding-based scores as machine learning features.
The Representation Scorer acquires user behavior data from the User Signal Service (USS) and extracts embeddings from the Representation Manager (RMS). It then calculates both pairwise and listwise features. These features are used at various stages, including candidate retrieval and ranking.
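The pairwise features described above boil down to similarity between two embeddings. As a rough illustration (not the production code, which uses `SimClustersEmbedding` from `simclusters_v2`), a SimClusters embedding can be thought of as a sparse clusterId-to-weight map, and a pairwise score as the cosine similarity of two such maps; all names here are hypothetical:

```scala
// Illustrative sketch only: pairwise embedding scoring over a sparse
// clusterId -> weight representation. The real service scores
// SimClustersEmbedding values; these names are assumptions.
object PairwiseScoringSketch {
  type Embedding = Map[Int, Double] // clusterId -> weight

  // Sparse dot product: only ids present in both embeddings contribute.
  def dotProduct(a: Embedding, b: Embedding): Double =
    a.iterator.map { case (id, w) => w * b.getOrElse(id, 0.0) }.sum

  // Cosine similarity, guarding against zero-norm embeddings.
  def cosineSimilarity(a: Embedding, b: Embedding): Double = {
    val denom = math.sqrt(dotProduct(a, a)) * math.sqrt(dotProduct(b, b))
    if (denom == 0.0) 0.0 else dotProduct(a, b) / denom
  }
}
```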

View File

@ -0,0 +1,8 @@
#!/bin/bash
export CANARY_CHECK_ROLE="representation-scorer"
export CANARY_CHECK_NAME="representation-scorer"
export CANARY_CHECK_INSTANCES="0-19"
python3 relevance-platform/tools/canary_check.py "$@"

View File

@ -0,0 +1,4 @@
#!/usr/bin/env bash
JOB=representation-scorer bazel run --ui_event_filters=-info,-stdout,-stderr --noshow_progress \
//relevance-platform/src/main/python/deploy -- "$@"

View File

@ -0,0 +1,66 @@
#!/bin/bash
set -eu
DC="atla"
ROLE="$USER"
SERVICE="representation-scorer"
INSTANCE="0"
KEY="$DC/$ROLE/devel/$SERVICE/$INSTANCE"
while test $# -gt 0; do
case "$1" in
-h|--help)
echo "$0 Set up an ssh tunnel for $SERVICE remote debugging and disable aurora health checks"
echo " "
echo "See representation-scorer/README.md for details of how to use this script, and go/remote-debug for"
echo "general information about remote debugging in Aurora"
echo " "
echo "Default instance if called with no args:"
echo " $KEY"
echo " "
echo "Positional args:"
echo " $0 [datacentre] [role] [service_name] [instance]"
echo " "
echo "Options:"
echo " -h, --help show brief help"
exit 0
;;
*)
break
;;
esac
done
if [ -n "${1-}" ]; then
DC="$1"
fi
if [ -n "${2-}" ]; then
ROLE="$2"
fi
if [ -n "${3-}" ]; then
SERVICE="$3"
fi
if [ -n "${4-}" ]; then
INSTANCE="$4"
fi
KEY="$DC/$ROLE/devel/$SERVICE/$INSTANCE"
read -p "Set up remote debugger tunnel for $KEY? (y/n) " -r CONFIRM
if [[ ! $CONFIRM =~ ^[Yy]$ ]]; then
echo "Exiting, tunnel not created"
exit 1
fi
echo "Disabling health check and opening tunnel. Exit with control-c when you're finished"
CMD="aurora task ssh $KEY -c 'touch .healthchecksnooze' && aurora task ssh $KEY -L '5005:debug' --ssh-options '-N -S none -v '"
echo "Running $CMD"
eval "$CMD"

View File

@ -0,0 +1,39 @@
Representation Scorer (RSX)
###########################
Overview
========
Representation Scorer (RSX) is a StratoFed service that serves scores for pairs of entities (User, Tweet, Topic...) based on some representation of those entities. For example, it serves User-Tweet scores based on the cosine similarity of the SimClusters embeddings for each entity. It aims to provide these scores at low latency and high scale, to support applications such as scoring for ANN candidate generation and feature hydration via the feature store.
Current use cases
-----------------
RSX currently serves traffic for the following use cases:
- User-Tweet similarity scores for Home ranking, using SimClusters embedding dot product
- Topic-Tweet similarity scores for topical tweet candidate generation and topic social proof, using SimClusters embedding cosine similarity and CERTO scores
- Tweet-Tweet and User-Tweet similarity scores for ANN candidate generation, using SimClusters embedding cosine similarity
- (in development) User-Tweet similarity scores for Home ranking, based on various aggregations of similarities with recent faves, retweets and follows performed by the user
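The in-development Home ranking use case above aggregates a candidate Tweet's similarity with each of a user's recent engagements into listwise features. A minimal sketch of that idea, with hypothetical names and feature keys:

```scala
// Illustrative sketch only: turn per-engagement similarity scores
// (e.g. cosine similarity between a candidate Tweet and each recently
// faved or retweeted Tweet) into listwise aggregate features.
object ListwiseFeatureSketch {
  def aggregate(similarities: Seq[Double]): Map[String, Double] =
    if (similarities.isEmpty) Map.empty
    else
      Map(
        "max" -> similarities.max,                      // closest recent engagement
        "avg" -> similarities.sum / similarities.size   // overall affinity
      )
}
```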
Getting Started
===============
Fetching scores
---------------
Scores are served from the recommendations/representation_scorer/score column.
Using RSX for your application
------------------------------
RSX may be a good fit for your application if you need scores based on combinations of SimClusters embeddings for core nouns. We also plan to support other embeddings and scoring approaches in the future.
.. toctree::
:maxdepth: 2
:hidden:
index

View File

@ -0,0 +1,22 @@
jvm_binary(
name = "bin",
basename = "representation-scorer",
main = "com.twitter.representationscorer.RepresentationScorerFedServerMain",
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finatra/inject/inject-logback/src/main/scala",
"loglens/loglens-logback/src/main/scala/com/twitter/loglens/logback",
"representation-scorer/server/src/main/resources",
"representation-scorer/server/src/main/scala/com/twitter/representationscorer",
"twitter-server/logback-classic/src/main/scala",
],
)
# Aurora Workflows build phase convention requires a jvm_app named with ${project-name}-app
jvm_app(
name = "representation-scorer-app",
archive = "zip",
binary = ":bin",
tags = ["bazel-compatible"],
)

View File

@ -0,0 +1,9 @@
resources(
sources = [
"*.xml",
"*.yml",
"com/twitter/slo/slo.json",
"config/*.yml",
],
tags = ["bazel-compatible"],
)

View File

@ -0,0 +1,55 @@
{
"servers": [
{
"name": "strato",
"indicators": [
{
"id": "success_rate_3m",
"indicator_type": "SuccessRateIndicator",
"duration": 3,
"duration_unit": "MINUTES"
}, {
"id": "latency_3m_p99",
"indicator_type": "LatencyIndicator",
"duration": 3,
"duration_unit": "MINUTES",
"percentile": 0.99
}
],
"objectives": [
{
"indicator": "success_rate_3m",
"objective_type": "SuccessRateObjective",
"operator": ">=",
"threshold": 0.995
},
{
"indicator": "latency_3m_p99",
"objective_type": "LatencyObjective",
"operator": "<=",
"threshold": 50
}
],
"long_term_objectives": [
{
"id": "success_rate_28_days",
"objective_type": "SuccessRateObjective",
"operator": ">=",
"threshold": 0.993,
"duration": 28,
"duration_unit": "DAYS"
},
{
"id": "latency_p99_28_days",
"objective_type": "LatencyObjective",
"operator": "<=",
"threshold": 60,
"duration": 28,
"duration_unit": "DAYS",
"percentile": 0.99
}
]
}
],
"@version": 1
}

View File

@ -0,0 +1,155 @@
enableLogFavBasedApeEntity20M145KUpdatedEmbeddingCachedStore:
comment: "Enable to use the non-empty store for logFavBasedApeEntity20M145KUpdatedEmbeddingCachedStore (from 0% to 100%). 0 means use EMPTY readable store for all requests."
default_availability: 0
enableLogFavBasedApeEntity20M145K2020EmbeddingCachedStore:
comment: "Enable to use the non-empty store for logFavBasedApeEntity20M145K2020EmbeddingCachedStore (from 0% to 100%). 0 means use EMPTY readable store for all requests."
default_availability: 0
representation-scorer_forward_dark_traffic:
comment: "Defines the percentage of traffic to forward to diffy-proxy. Set to 0 to disable dark traffic forwarding"
default_availability: 0
"representation-scorer_load_shed_non_prod_callers":
comment: "Discard traffic from all non-prod callers"
default_availability: 0
enable_log_fav_based_tweet_embedding_20m145k2020_timeouts:
comment: "If enabled, set a timeout on calls to the logFavBased20M145K2020TweetEmbeddingStore"
default_availability: 0
log_fav_based_tweet_embedding_20m145k2020_timeout_value_millis:
comment: "The value of this decider defines the timeout (in milliseconds) to use on calls to the logFavBased20M145K2020TweetEmbeddingStore, i.e. 1.50% is 150ms. Only applied if enable_log_fav_based_tweet_embedding_20m145k2020_timeouts is true"
default_availability: 2000
enable_log_fav_based_tweet_embedding_20m145kUpdated_timeouts:
comment: "If enabled, set a timeout on calls to the logFavBased20M145KUpdatedTweetEmbeddingStore"
default_availability: 0
log_fav_based_tweet_embedding_20m145kUpdated_timeout_value_millis:
comment: "The value of this decider defines the timeout (in milliseconds) to use on calls to the logFavBased20M145KUpdatedTweetEmbeddingStore, i.e. 1.50% is 150ms. Only applied if enable_log_fav_based_tweet_embedding_20m145kUpdated_timeouts is true"
default_availability: 2000
enable_cluster_tweet_index_store_timeouts:
comment: "If enabled, set a timeout on calls to the ClusterTweetIndexStore"
default_availability: 0
cluster_tweet_index_store_timeout_value_millis:
comment: "The value of this decider defines the timeout (in milliseconds) to use on calls to the ClusterTweetIndexStore, i.e. 1.50% is 150ms. Only applied if enable_cluster_tweet_index_store_timeouts is true"
default_availability: 2000
representation_scorer_fetch_signal_share:
comment: "If enabled, fetches share signals from USS"
default_availability: 0
representation_scorer_fetch_signal_reply:
comment: "If enabled, fetches reply signals from USS"
default_availability: 0
representation_scorer_fetch_signal_original_tweet:
comment: "If enabled, fetches original tweet signals from USS"
default_availability: 0
representation_scorer_fetch_signal_video_playback:
comment: "If enabled, fetches video playback signals from USS"
default_availability: 0
representation_scorer_fetch_signal_block:
comment: "If enabled, fetches account block signals from USS"
default_availability: 0
representation_scorer_fetch_signal_mute:
comment: "If enabled, fetches account mute signals from USS"
default_availability: 0
representation_scorer_fetch_signal_report:
comment: "If enabled, fetches tweet report signals from USS"
default_availability: 0
representation_scorer_fetch_signal_dont_like:
comment: "If enabled, fetches tweet don't like signals from USS"
default_availability: 0
representation_scorer_fetch_signal_see_fewer:
comment: "If enabled, fetches tweet see fewer signals from USS"
default_availability: 0
# To create a new decider, add here with the same format and caller's details: "representation-scorer_load_shed_by_caller_id_twtr:{{role}}:{{name}}:{{environment}}:{{cluster}}"
# All the deciders below are generated by this script - ./strato/bin/fed deciders ./ --service-role=representation-scorer --service-name=representation-scorer
# If you need to run the script and paste the output, add only the prod deciders here. Non-prod ones are being taken care of by representation-scorer_load_shed_non_prod_callers
"representation-scorer_load_shed_by_caller_id_all":
comment: "Reject all traffic from caller id: all"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:frigate:frigate-pushservice-canary:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:frigate:frigate-pushservice-canary:prod:atla"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:frigate:frigate-pushservice-canary:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:frigate:frigate-pushservice-canary:prod:pdxa"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:frigate:frigate-pushservice-send:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:frigate:frigate-pushservice-send:prod:atla"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:frigate:frigate-pushservice:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:frigate:frigate-pushservice:prod:atla"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:frigate:frigate-pushservice:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:frigate:frigate-pushservice:prod:pdxa"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:frigate:frigate-pushservice:staging:atla":
comment: "Reject all traffic from caller id: twtr:svc:frigate:frigate-pushservice:staging:atla"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:frigate:frigate-pushservice:staging:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:frigate:frigate-pushservice:staging:pdxa"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:home-scorer:home-scorer:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:home-scorer:home-scorer:prod:atla"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:home-scorer:home-scorer:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:home-scorer:home-scorer:prod:pdxa"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:stratostore:stratoapi:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:stratostore:stratoapi:prod:atla"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:stratostore:stratoserver:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:stratostore:stratoserver:prod:atla"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:stratostore:stratoserver:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:stratostore:stratoserver:prod:pdxa"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:timelinescorer:timelinescorer:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:timelinescorer:timelinescorer:prod:atla"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:timelinescorer:timelinescorer:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:timelinescorer:timelinescorer:prod:pdxa"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:topic-social-proof:topic-social-proof:prod:atla":
comment: "Reject all traffic from caller id: twtr:svc:topic-social-proof:topic-social-proof:prod:atla"
default_availability: 0
"representation-scorer_load_shed_by_caller_id_twtr:svc:topic-social-proof:topic-social-proof:prod:pdxa":
comment: "Reject all traffic from caller id: twtr:svc:topic-social-proof:topic-social-proof:prod:pdxa"
default_availability: 0
"enable_sim_clusters_embedding_store_timeouts":
comment: "If enabled, set a timeout on calls to the SimClustersEmbeddingStore"
default_availability: 10000
sim_clusters_embedding_store_timeout_value_millis:
comment: "The value of this decider defines the timeout (in milliseconds) to use on calls to the SimClustersEmbeddingStore, i.e. 1.50% is 150ms. Only applied if enable_sim_clusters_embedding_store_timeouts is true"
default_availability: 2000
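The timeout deciders above overload the decider availability scale: per the comments, a value of 150 renders as "1.50%" but is read back as a 150ms timeout, so `default_availability: 2000` means 2000ms. A sketch of that convention (helper names are hypothetical):

```scala
// Illustrative sketch only: decider availability is expressed in units of
// 0.01% (10000 = 100%); the timeout deciders reuse the raw value directly
// as milliseconds, so availability 150 displays as 1.50% and means 150ms.
object DeciderTimeoutSketch {
  def asPercent(availability: Int): Double = availability / 100.0
  def asMillis(availability: Int): Int = availability
}
```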

View File

@ -0,0 +1,165 @@
<configuration>
<shutdownHook class="ch.qos.logback.core.hook.DelayingShutdownHook"/>
<!-- ===================================================== -->
<!-- Service Config -->
<!-- ===================================================== -->
<property name="DEFAULT_SERVICE_PATTERN"
value="%-16X{traceId} %-12X{clientId:--} %-16X{method} %-25logger{0} %msg"/>
<property name="DEFAULT_ACCESS_PATTERN"
value="%msg"/>
<!-- ===================================================== -->
<!-- Common Config -->
<!-- ===================================================== -->
<!-- JUL/JDK14 to Logback bridge -->
<contextListener class="ch.qos.logback.classic.jul.LevelChangePropagator">
<resetJUL>true</resetJUL>
</contextListener>
<!-- ====================================================================================== -->
<!-- NOTE: The following appenders use a simple TimeBasedRollingPolicy configuration. -->
<!-- You may want to consider using a more advanced SizeAndTimeBasedRollingPolicy. -->
<!-- See: https://logback.qos.ch/manual/appenders.html#SizeAndTimeBasedRollingPolicy -->
<!-- ====================================================================================== -->
<!-- Service Log (rollover daily, keep maximum of 21 days of gzip compressed logs) -->
<appender name="SERVICE" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${log.service.output}</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<!-- daily rollover -->
<fileNamePattern>${log.service.output}.%d.gz</fileNamePattern>
<!-- the maximum total size of all the log files -->
<totalSizeCap>3GB</totalSizeCap>
<!-- keep maximum 21 days' worth of history -->
<maxHistory>21</maxHistory>
<cleanHistoryOnStart>true</cleanHistoryOnStart>
</rollingPolicy>
<encoder>
<pattern>%date %.-3level ${DEFAULT_SERVICE_PATTERN}%n</pattern>
</encoder>
</appender>
<!-- Access Log (rollover daily, keep maximum of 21 days of gzip compressed logs) -->
<appender name="ACCESS" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>${log.access.output}</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<!-- daily rollover -->
<fileNamePattern>${log.access.output}.%d.gz</fileNamePattern>
<!-- the maximum total size of all the log files -->
<totalSizeCap>100MB</totalSizeCap>
<!-- keep maximum 7 days' worth of history -->
<maxHistory>7</maxHistory>
<cleanHistoryOnStart>true</cleanHistoryOnStart>
</rollingPolicy>
<encoder>
<pattern>${DEFAULT_ACCESS_PATTERN}%n</pattern>
</encoder>
</appender>
<!--LogLens -->
<appender name="LOGLENS" class="com.twitter.loglens.logback.LoglensAppender">
<mdcAdditionalContext>true</mdcAdditionalContext>
<category>${log.lens.category}</category>
<index>${log.lens.index}</index>
<tag>${log.lens.tag}/service</tag>
<encoder>
<pattern>%msg</pattern>
</encoder>
</appender>
<!-- LogLens Access -->
<appender name="LOGLENS-ACCESS" class="com.twitter.loglens.logback.LoglensAppender">
<mdcAdditionalContext>true</mdcAdditionalContext>
<category>${log.lens.category}</category>
<index>${log.lens.index}</index>
<tag>${log.lens.tag}/access</tag>
<encoder>
<pattern>%msg</pattern>
</encoder>
</appender>
<!-- Pipeline Execution Logs -->
<appender name="ALLOW-LISTED-PIPELINE-EXECUTIONS" class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>allow_listed_pipeline_executions.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<!-- daily rollover -->
<fileNamePattern>allow_listed_pipeline_executions.log.%d.gz</fileNamePattern>
<!-- the maximum total size of all the log files -->
<totalSizeCap>100MB</totalSizeCap>
<!-- keep maximum 7 days' worth of history -->
<maxHistory>7</maxHistory>
<cleanHistoryOnStart>true</cleanHistoryOnStart>
</rollingPolicy>
<encoder>
<pattern>%date %.-3level ${DEFAULT_SERVICE_PATTERN}%n</pattern>
</encoder>
</appender>
<!-- ===================================================== -->
<!-- Primary Async Appenders -->
<!-- ===================================================== -->
<property name="async_queue_size" value="${queue.size:-50000}"/>
<property name="async_max_flush_time" value="${max.flush.time:-0}"/>
<appender name="ASYNC-SERVICE" class="com.twitter.inject.logback.AsyncAppender">
<queueSize>${async_queue_size}</queueSize>
<maxFlushTime>${async_max_flush_time}</maxFlushTime>
<appender-ref ref="SERVICE"/>
</appender>
<appender name="ASYNC-ACCESS" class="com.twitter.inject.logback.AsyncAppender">
<queueSize>${async_queue_size}</queueSize>
<maxFlushTime>${async_max_flush_time}</maxFlushTime>
<appender-ref ref="ACCESS"/>
</appender>
<appender name="ASYNC-ALLOW-LISTED-PIPELINE-EXECUTIONS" class="com.twitter.inject.logback.AsyncAppender">
<queueSize>${async_queue_size}</queueSize>
<maxFlushTime>${async_max_flush_time}</maxFlushTime>
<appender-ref ref="ALLOW-LISTED-PIPELINE-EXECUTIONS"/>
</appender>
<appender name="ASYNC-LOGLENS" class="com.twitter.inject.logback.AsyncAppender">
<queueSize>${async_queue_size}</queueSize>
<maxFlushTime>${async_max_flush_time}</maxFlushTime>
<appender-ref ref="LOGLENS"/>
</appender>
<appender name="ASYNC-LOGLENS-ACCESS" class="com.twitter.inject.logback.AsyncAppender">
<queueSize>${async_queue_size}</queueSize>
<maxFlushTime>${async_max_flush_time}</maxFlushTime>
<appender-ref ref="LOGLENS-ACCESS"/>
</appender>
<!-- ===================================================== -->
<!-- Package Config -->
<!-- ===================================================== -->
<!-- Per-Package Config -->
<logger name="com.twitter" level="INHERITED"/>
<logger name="com.twitter.wilyns" level="INHERITED"/>
<logger name="com.twitter.configbus.client.file" level="INHERITED"/>
<logger name="com.twitter.finagle.mux" level="INHERITED"/>
<logger name="com.twitter.finagle.serverset2" level="INHERITED"/>
<logger name="com.twitter.logging.ScribeHandler" level="INHERITED"/>
<logger name="com.twitter.zookeeper.client.internal" level="INHERITED"/>
<!-- Root Config -->
<!-- For all logs except access logs, disable logging below the configured log_level by default. This can be overridden in the per-package loggers, and dynamically in the admin panel of individual instances. -->
<root level="${log_level:-INFO}">
<appender-ref ref="ASYNC-SERVICE"/>
<appender-ref ref="ASYNC-LOGLENS"/>
</root>
<!-- Access Logging -->
<!-- Access logs are turned off by default -->
<logger name="com.twitter.finatra.thrift.filters.AccessLoggingFilter" level="OFF" additivity="false">
<appender-ref ref="ASYNC-ACCESS"/>
<appender-ref ref="ASYNC-LOGLENS-ACCESS"/>
</logger>
</configuration>

View File

@ -0,0 +1,13 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finagle-internal/slo/src/main/scala/com/twitter/finagle/slo",
"finatra/inject/inject-thrift-client",
"representation-scorer/server/src/main/scala/com/twitter/representationscorer/columns",
"strato/src/main/scala/com/twitter/strato/fed",
"strato/src/main/scala/com/twitter/strato/fed/server",
"twitter-server-internal/src/main/scala",
],
)

View File

@ -0,0 +1,38 @@
package com.twitter.representationscorer
import com.google.inject.Module
import com.twitter.inject.thrift.modules.ThriftClientIdModule
import com.twitter.representationscorer.columns.ListScoreColumn
import com.twitter.representationscorer.columns.ScoreColumn
import com.twitter.representationscorer.columns.SimClustersRecentEngagementSimilarityColumn
import com.twitter.representationscorer.columns.SimClustersRecentEngagementSimilarityUserTweetEdgeColumn
import com.twitter.representationscorer.modules.CacheModule
import com.twitter.representationscorer.modules.EmbeddingStoreModule
import com.twitter.representationscorer.modules.RMSConfigModule
import com.twitter.representationscorer.modules.TimerModule
import com.twitter.representationscorer.twistlyfeatures.UserSignalServiceRecentEngagementsClientModule
import com.twitter.strato.fed._
import com.twitter.strato.fed.server._
object RepresentationScorerFedServerMain extends RepresentationScorerFedServer
trait RepresentationScorerFedServer extends StratoFedServer {
override def dest: String = "/s/representation-scorer/representation-scorer"
override val modules: Seq[Module] =
Seq(
CacheModule,
ThriftClientIdModule,
UserSignalServiceRecentEngagementsClientModule,
TimerModule,
RMSConfigModule,
EmbeddingStoreModule
)
override def columns: Seq[Class[_ <: StratoFed.Column]] =
Seq(
classOf[ListScoreColumn],
classOf[ScoreColumn],
classOf[SimClustersRecentEngagementSimilarityUserTweetEdgeColumn],
classOf[SimClustersRecentEngagementSimilarityColumn]
)
}

View File

@ -0,0 +1,16 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"content-recommender/thrift/src/main/thrift:thrift-scala",
"finatra/inject/inject-core/src/main/scala",
"representation-scorer/server/src/main/scala/com/twitter/representationscorer/common",
"representation-scorer/server/src/main/scala/com/twitter/representationscorer/modules",
"representation-scorer/server/src/main/scala/com/twitter/representationscorer/scorestore",
"representation-scorer/server/src/main/scala/com/twitter/representationscorer/twistlyfeatures",
"representation-scorer/server/src/main/thrift:thrift-scala",
"strato/src/main/scala/com/twitter/strato/fed",
"strato/src/main/scala/com/twitter/strato/fed/server",
],
)

View File

@ -0,0 +1,13 @@
package com.twitter.representationscorer.columns
import com.twitter.strato.config.{ContactInfo => StratoContactInfo}
object Info {
val contactInfo: StratoContactInfo = StratoContactInfo(
description = "Please contact Relevance Platform team for more details",
contactEmail = "no-reply@twitter.com",
ldapGroup = "representation-scorer-admins",
jiraProject = "JIRA",
links = Seq("http://go.twitter.biz/rsx-runbook")
)
}

View File

@ -0,0 +1,116 @@
package com.twitter.representationscorer.columns
import com.twitter.representationscorer.thriftscala.ListScoreId
import com.twitter.representationscorer.thriftscala.ListScoreResponse
import com.twitter.representationscorer.scorestore.ScoreStore
import com.twitter.representationscorer.thriftscala.ScoreResult
import com.twitter.simclusters_v2.common.SimClustersEmbeddingId.LongInternalId
import com.twitter.simclusters_v2.common.SimClustersEmbeddingId.LongSimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.Score
import com.twitter.simclusters_v2.thriftscala.ScoreId
import com.twitter.simclusters_v2.thriftscala.ScoreInternalId
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingPairScoreId
import com.twitter.stitch
import com.twitter.stitch.Stitch
import com.twitter.strato.catalog.OpMetadata
import com.twitter.strato.config.ContactInfo
import com.twitter.strato.config.Policy
import com.twitter.strato.data.Conv
import com.twitter.strato.data.Description.PlainText
import com.twitter.strato.data.Lifecycle
import com.twitter.strato.fed._
import com.twitter.strato.thrift.ScroogeConv
import com.twitter.util.Future
import com.twitter.util.Return
import com.twitter.util.Throw
import javax.inject.Inject
class ListScoreColumn @Inject() (scoreStore: ScoreStore)
extends StratoFed.Column("recommendations/representation_scorer/listScore")
with StratoFed.Fetch.Stitch {
override val policy: Policy = Common.rsxReadPolicy
override type Key = ListScoreId
override type View = Unit
override type Value = ListScoreResponse
override val keyConv: Conv[Key] = ScroogeConv.fromStruct[ListScoreId]
override val viewConv: Conv[View] = Conv.ofType
override val valueConv: Conv[Value] = ScroogeConv.fromStruct[ListScoreResponse]
override val contactInfo: ContactInfo = Info.contactInfo
override val metadata: OpMetadata = OpMetadata(
lifecycle = Some(Lifecycle.Production),
description = Some(
PlainText(
"Scoring for multiple candidate entities against a single target entity"
))
)
override def fetch(key: Key, view: View): Stitch[Result[Value]] = {
val target = SimClustersEmbeddingId(
embeddingType = key.targetEmbeddingType,
modelVersion = key.modelVersion,
internalId = key.targetId
)
val scoreIds = key.candidateIds.map { candidateId =>
val candidate = SimClustersEmbeddingId(
embeddingType = key.candidateEmbeddingType,
modelVersion = key.modelVersion,
internalId = candidateId
)
ScoreId(
algorithm = key.algorithm,
internalId = ScoreInternalId.SimClustersEmbeddingPairScoreId(
SimClustersEmbeddingPairScoreId(target, candidate)
)
)
}
Stitch
.callFuture {
val (keys: Iterable[ScoreId], vals: Iterable[Future[Option[Score]]]) =
scoreStore.uniformScoringStore.multiGet(scoreIds.toSet).unzip
val results: Future[Iterable[Option[Score]]] = Future.collectToTry(vals.toSeq) map {
tryOptVals =>
tryOptVals map {
case Return(Some(v)) => Some(v)
case Return(None) => None
case Throw(_) => None
}
}
val scoreMap: Future[Map[Long, Double]] = results.map { scores =>
keys
.zip(scores).collect {
case (
ScoreId(
_,
ScoreInternalId.SimClustersEmbeddingPairScoreId(
SimClustersEmbeddingPairScoreId(
_,
LongSimClustersEmbeddingId(candidateId)))),
Some(score)) =>
(candidateId, score.score)
}.toMap
}
scoreMap
}
.map { (scores: Map[Long, Double]) =>
val orderedScores = key.candidateIds.collect {
case LongInternalId(id) => ScoreResult(scores.get(id))
case _ =>
// This will return None scores for candidates that don't have Long ids, but that's fine:
// at the moment we're only scoring Tweets.
ScoreResult(None)
}
found(ListScoreResponse(orderedScores))
}
.handle {
case stitch.NotFound => missing
}
}
}
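The core of `ListScoreColumn.fetch` above is a batched lookup whose results must be re-aligned with the request order: `multiGet` returns scores keyed by candidate, and the response places each score (or `None`) back at its candidate's position. A minimal standalone sketch of that alignment, using plain `Map`/`Seq` instead of the Storehaus and Stitch types (`ListScoreAlignment` and `align` are hypothetical names, not part of RSX):

```scala
object ListScoreAlignment {
  // Re-align batched lookup results with the original candidate order.
  // Candidates missing from the score map (lookup failures, non-Long ids)
  // come back as None, mirroring ScoreResult(None) in the column above.
  def align(candidateIds: Seq[Long], scores: Map[Long, Double]): Seq[Option[Double]] =
    candidateIds.map(scores.get)
}
```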

View File

@@ -0,0 +1,48 @@
package com.twitter.representationscorer.columns
import com.twitter.contentrecommender.thriftscala.ScoringResponse
import com.twitter.representationscorer.scorestore.ScoreStore
import com.twitter.simclusters_v2.thriftscala.ScoreId
import com.twitter.stitch
import com.twitter.stitch.Stitch
import com.twitter.strato.config.ContactInfo
import com.twitter.strato.config.Policy
import com.twitter.strato.catalog.OpMetadata
import com.twitter.strato.data.Conv
import com.twitter.strato.data.Lifecycle
import com.twitter.strato.data.Description.PlainText
import com.twitter.strato.fed._
import com.twitter.strato.thrift.ScroogeConv
import javax.inject.Inject
class ScoreColumn @Inject() (scoreStore: ScoreStore)
extends StratoFed.Column("recommendations/representation_scorer/score")
with StratoFed.Fetch.Stitch {
override val policy: Policy = Common.rsxReadPolicy
override type Key = ScoreId
override type View = Unit
override type Value = ScoringResponse
override val keyConv: Conv[Key] = ScroogeConv.fromStruct[ScoreId]
override val viewConv: Conv[View] = Conv.ofType
override val valueConv: Conv[Value] = ScroogeConv.fromStruct[ScoringResponse]
override val contactInfo: ContactInfo = Info.contactInfo
override val metadata: OpMetadata = OpMetadata(
lifecycle = Some(Lifecycle.Production),
description = Some(PlainText(
"The Uniform Scoring Endpoint in Representation Scorer for the Content-Recommender." +
" TDD: http://go/representation-scorer-tdd Guideline: http://go/uniform-scoring-guideline"))
)
override def fetch(key: Key, view: View): Stitch[Result[Value]] =
scoreStore
.uniformScoringStoreStitch(key)
.map(score => found(ScoringResponse(Some(score))))
.handle {
case stitch.NotFound => missing
}
}

View File

@@ -0,0 +1,52 @@
package com.twitter.representationscorer.columns
import com.twitter.representationscorer.common.TweetId
import com.twitter.representationscorer.common.UserId
import com.twitter.representationscorer.thriftscala.RecentEngagementSimilaritiesResponse
import com.twitter.representationscorer.twistlyfeatures.Scorer
import com.twitter.stitch
import com.twitter.stitch.Stitch
import com.twitter.strato.catalog.OpMetadata
import com.twitter.strato.config.ContactInfo
import com.twitter.strato.config.Policy
import com.twitter.strato.data.Conv
import com.twitter.strato.data.Description.PlainText
import com.twitter.strato.data.Lifecycle
import com.twitter.strato.fed._
import com.twitter.strato.thrift.ScroogeConv
import javax.inject.Inject
class SimClustersRecentEngagementSimilarityColumn @Inject() (scorer: Scorer)
extends StratoFed.Column(
"recommendations/representation_scorer/simClustersRecentEngagementSimilarity")
with StratoFed.Fetch.Stitch {
override val policy: Policy = Common.rsxReadPolicy
override type Key = (UserId, Seq[TweetId])
override type View = Unit
override type Value = RecentEngagementSimilaritiesResponse
override val keyConv: Conv[Key] = Conv.ofType[(Long, Seq[Long])]
override val viewConv: Conv[View] = Conv.ofType
override val valueConv: Conv[Value] =
ScroogeConv.fromStruct[RecentEngagementSimilaritiesResponse]
override val contactInfo: ContactInfo = Info.contactInfo
override val metadata: OpMetadata = OpMetadata(
lifecycle = Some(Lifecycle.Production),
description = Some(
PlainText(
"User-Tweet scores based on the user's recent engagements for multiple tweets."
))
)
override def fetch(key: Key, view: View): Stitch[Result[Value]] =
scorer
.get(key._1, key._2)
.map(results => found(RecentEngagementSimilaritiesResponse(results)))
.handle {
case stitch.NotFound => missing
}
}

View File

@@ -0,0 +1,52 @@
package com.twitter.representationscorer.columns
import com.twitter.representationscorer.common.TweetId
import com.twitter.representationscorer.common.UserId
import com.twitter.representationscorer.thriftscala.SimClustersRecentEngagementSimilarities
import com.twitter.representationscorer.twistlyfeatures.Scorer
import com.twitter.stitch
import com.twitter.stitch.Stitch
import com.twitter.strato.catalog.OpMetadata
import com.twitter.strato.config.ContactInfo
import com.twitter.strato.config.Policy
import com.twitter.strato.data.Conv
import com.twitter.strato.data.Description.PlainText
import com.twitter.strato.data.Lifecycle
import com.twitter.strato.fed._
import com.twitter.strato.thrift.ScroogeConv
import javax.inject.Inject
class SimClustersRecentEngagementSimilarityUserTweetEdgeColumn @Inject() (scorer: Scorer)
extends StratoFed.Column(
"recommendations/representation_scorer/simClustersRecentEngagementSimilarity.UserTweetEdge")
with StratoFed.Fetch.Stitch {
override val policy: Policy = Common.rsxReadPolicy
override type Key = (UserId, TweetId)
override type View = Unit
override type Value = SimClustersRecentEngagementSimilarities
override val keyConv: Conv[Key] = Conv.ofType[(Long, Long)]
override val viewConv: Conv[View] = Conv.ofType
override val valueConv: Conv[Value] =
ScroogeConv.fromStruct[SimClustersRecentEngagementSimilarities]
override val contactInfo: ContactInfo = Info.contactInfo
override val metadata: OpMetadata = OpMetadata(
lifecycle = Some(Lifecycle.Production),
description = Some(
PlainText(
"User-Tweet scores based on the user's recent engagements"
))
)
override def fetch(key: Key, view: View): Stitch[Result[Value]] =
scorer
.get(key._1, key._2)
.map(found(_))
.handle {
case stitch.NotFound => missing
}
}

View File

@@ -0,0 +1,9 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"decider/src/main/scala",
"src/scala/com/twitter/simclusters_v2/common",
],
)

View File

@@ -0,0 +1,7 @@
package com.twitter.representationscorer
object DeciderConstants {
val enableSimClustersEmbeddingStoreTimeouts = "enable_sim_clusters_embedding_store_timeouts"
val simClustersEmbeddingStoreTimeoutValueMillis =
"sim_clusters_embedding_store_timeout_value_millis"
}

View File

@@ -0,0 +1,27 @@
package com.twitter.representationscorer.common
import com.twitter.decider.Decider
import com.twitter.decider.RandomRecipient
import com.twitter.decider.Recipient
import com.twitter.simclusters_v2.common.DeciderGateBuilderWithIdHashing
import javax.inject.Inject
import javax.inject.Singleton
@Singleton
case class RepresentationScorerDecider @Inject() (decider: Decider) {
val deciderGateBuilder = new DeciderGateBuilderWithIdHashing(decider)
def isAvailable(feature: String, recipient: Option[Recipient]): Boolean = {
decider.isAvailable(feature, recipient)
}
/**
* When useRandomRecipient is set to false, the decider is either completely on or off.
* When useRandomRecipient is set to true, the decider is on for the specified % of traffic.
*/
def isAvailable(feature: String, useRandomRecipient: Boolean = true): Boolean = {
if (useRandomRecipient) isAvailable(feature, Some(RandomRecipient))
else isAvailable(feature, None)
}
}
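The doc comment above describes two gating modes for `RepresentationScorerDecider`. A standalone sketch of that behavior, with availability expressed as a 0-100 percentage (`DeciderSketch` is hypothetical and not the internal Decider API):

```scala
import scala.util.Random

object DeciderSketch {
  // With a random recipient, a feature at availability p is on for roughly p% of calls;
  // without one, the gate is all-or-nothing: fully on at 100, otherwise off.
  def isAvailable(percentage: Int, useRandomRecipient: Boolean, rng: Random = new Random): Boolean =
    if (useRandomRecipient) rng.nextInt(100) < percentage
    else percentage >= 100
}
```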

View File

@@ -0,0 +1,6 @@
package com.twitter.representationscorer
package object common {
type UserId = Long
type TweetId = Long
}

View File

@@ -0,0 +1,19 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"finagle-internal/mtls/src/main/scala/com/twitter/finagle/mtls/authentication",
"finagle/finagle-stats",
"finatra/inject/inject-core/src/main/scala",
"representation-manager/client/src/main/scala/com/twitter/representation_manager",
"representation-manager/client/src/main/scala/com/twitter/representation_manager/config",
"representation-manager/server/src/main/scala/com/twitter/representation_manager/migration",
"representation-scorer/server/src/main/scala/com/twitter/representationscorer/common",
"servo/util",
"src/scala/com/twitter/simclusters_v2/stores",
"src/scala/com/twitter/storehaus_internal/memcache",
"src/scala/com/twitter/storehaus_internal/util",
"src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala",
],
)

View File

@@ -0,0 +1,34 @@
package com.twitter.representationscorer.modules
import com.google.inject.Provides
import com.twitter.finagle.memcached.Client
import javax.inject.Singleton
import com.twitter.conversions.DurationOps._
import com.twitter.inject.TwitterModule
import com.twitter.finagle.mtls.authentication.ServiceIdentifier
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.storehaus_internal.memcache.MemcacheStore
import com.twitter.storehaus_internal.util.ClientName
import com.twitter.storehaus_internal.util.ZkEndPoint
object CacheModule extends TwitterModule {
private val cacheDest = flag[String]("cache_module.dest", "Path to memcache service")
private val timeout = flag[Int]("memcache.timeout", "Memcache client timeout")
private val retries = flag[Int]("memcache.retries", "Memcache timeout retries")
@Singleton
@Provides
def providesCache(
serviceIdentifier: ServiceIdentifier,
stats: StatsReceiver
): Client =
MemcacheStore.memcachedClient(
name = ClientName("memcache_representation_manager"),
dest = ZkEndPoint(cacheDest()),
timeout = timeout().milliseconds,
retries = retries(),
statsReceiver = stats.scope("cache_client"),
serviceIdentifier = serviceIdentifier
)
}

View File

@@ -0,0 +1,100 @@
package com.twitter.representationscorer.modules
import com.google.inject.Provides
import com.twitter.decider.Decider
import com.twitter.finagle.memcached.{Client => MemcachedClient}
import com.twitter.finagle.mtls.authentication.ServiceIdentifier
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.finagle.thrift.ClientId
import com.twitter.hermit.store.common.ObservedReadableStore
import com.twitter.inject.TwitterModule
import com.twitter.relevance_platform.common.readablestore.ReadableStoreWithTimeout
import com.twitter.representation_manager.migration.LegacyRMS
import com.twitter.representationscorer.DeciderConstants
import com.twitter.simclusters_v2.common.SimClustersEmbedding
import com.twitter.simclusters_v2.stores.SimClustersEmbeddingStore
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.EmbeddingType._
import com.twitter.simclusters_v2.thriftscala.ModelVersion
import com.twitter.simclusters_v2.thriftscala.ModelVersion._
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.storehaus.ReadableStore
import com.twitter.util.Timer
import javax.inject.Singleton
object EmbeddingStoreModule extends TwitterModule {
@Singleton
@Provides
def providesEmbeddingStore(
memCachedClient: MemcachedClient,
serviceIdentifier: ServiceIdentifier,
clientId: ClientId,
timer: Timer,
decider: Decider,
stats: StatsReceiver
): ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val cacheHashKeyPrefix: String = "RMS"
val embeddingStoreClient = new LegacyRMS(
serviceIdentifier,
memCachedClient,
stats,
decider,
clientId,
timer,
cacheHashKeyPrefix
)
val underlyingStores: Map[
(EmbeddingType, ModelVersion),
ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding]
] = Map(
// Tweet Embeddings
(
LogFavBasedTweet,
Model20m145k2020) -> embeddingStoreClient.logFavBased20M145K2020TweetEmbeddingStore,
(
LogFavLongestL2EmbeddingTweet,
Model20m145k2020) -> embeddingStoreClient.logFavBasedLongestL2Tweet20M145K2020EmbeddingStore,
// InterestedIn Embeddings
(
LogFavBasedUserInterestedInFromAPE,
Model20m145k2020) -> embeddingStoreClient.LogFavBasedInterestedInFromAPE20M145K2020Store,
(
FavBasedUserInterestedIn,
Model20m145k2020) -> embeddingStoreClient.favBasedUserInterestedIn20M145K2020Store,
// Author Embeddings
(
FavBasedProducer,
Model20m145k2020) -> embeddingStoreClient.favBasedProducer20M145K2020EmbeddingStore,
// Entity Embeddings
(
LogFavBasedKgoApeTopic,
Model20m145k2020) -> embeddingStoreClient.logFavBasedApeEntity20M145K2020EmbeddingCachedStore,
(FavTfgTopic, Model20m145k2020) -> embeddingStoreClient.favBasedTfgTopicEmbedding2020Store,
)
val simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] = {
val underlying: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] =
SimClustersEmbeddingStore.buildWithDecider(
underlyingStores = underlyingStores,
decider = decider,
statsReceiver = stats.scope("simClusters_embeddings_store_deciderable")
)
val underlyingWithTimeout: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding] =
new ReadableStoreWithTimeout(
rs = underlying,
decider = decider,
enableTimeoutDeciderKey = DeciderConstants.enableSimClustersEmbeddingStoreTimeouts,
timeoutValueKey = DeciderConstants.simClustersEmbeddingStoreTimeoutValueMillis,
timer = timer,
statsReceiver = stats.scope("simClusters_embedding_store_timeouts")
)
ObservedReadableStore(
store = underlyingWithTimeout
)(stats.scope("simClusters_embeddings_store"))
}
simClustersEmbeddingStore
}
}

View File

@@ -0,0 +1,63 @@
package com.twitter.representationscorer.modules
import com.google.inject.Provides
import com.twitter.conversions.DurationOps._
import com.twitter.inject.TwitterModule
import com.twitter.representation_manager.config.ClientConfig
import com.twitter.representation_manager.config.EnabledInMemoryCacheParams
import com.twitter.representation_manager.config.InMemoryCacheParams
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.EmbeddingType._
import com.twitter.simclusters_v2.thriftscala.ModelVersion
import com.twitter.simclusters_v2.thriftscala.ModelVersion._
import javax.inject.Singleton
object RMSConfigModule extends TwitterModule {
def getCacheName(embeddingType: EmbeddingType, modelVersion: ModelVersion): String =
s"${embeddingType.name}_${modelVersion.name}_in_mem_cache"
@Singleton
@Provides
def providesRMSClientConfig: ClientConfig = {
val cacheParamsMap: Map[
(EmbeddingType, ModelVersion),
InMemoryCacheParams
] = Map(
// Tweet Embeddings
(LogFavBasedTweet, Model20m145k2020) -> EnabledInMemoryCacheParams(
ttl = 10.minutes,
maxKeys = 1048575, // 800MB
cacheName = getCacheName(LogFavBasedTweet, Model20m145k2020)),
(LogFavLongestL2EmbeddingTweet, Model20m145k2020) -> EnabledInMemoryCacheParams(
ttl = 5.minutes,
maxKeys = 1048575, // 800MB
cacheName = getCacheName(LogFavLongestL2EmbeddingTweet, Model20m145k2020)),
// User - KnownFor Embeddings
(FavBasedProducer, Model20m145k2020) -> EnabledInMemoryCacheParams(
ttl = 1.day,
maxKeys = 500000, // 400MB
cacheName = getCacheName(FavBasedProducer, Model20m145k2020)),
// User - InterestedIn Embeddings
(LogFavBasedUserInterestedInFromAPE, Model20m145k2020) -> EnabledInMemoryCacheParams(
ttl = 6.hours,
maxKeys = 262143,
cacheName = getCacheName(LogFavBasedUserInterestedInFromAPE, Model20m145k2020)),
(FavBasedUserInterestedIn, Model20m145k2020) -> EnabledInMemoryCacheParams(
ttl = 6.hours,
maxKeys = 262143,
cacheName = getCacheName(FavBasedUserInterestedIn, Model20m145k2020)),
// Topic Embeddings
(FavTfgTopic, Model20m145k2020) -> EnabledInMemoryCacheParams(
ttl = 12.hours,
maxKeys = 262143, // 200MB
cacheName = getCacheName(FavTfgTopic, Model20m145k2020)),
(LogFavBasedKgoApeTopic, Model20m145k2020) -> EnabledInMemoryCacheParams(
ttl = 6.hours,
maxKeys = 262143,
cacheName = getCacheName(LogFavBasedKgoApeTopic, Model20m145k2020)),
)
new ClientConfig(inMemCacheParamsOverrides = cacheParamsMap)
}
}

View File

@@ -0,0 +1,13 @@
package com.twitter.representationscorer.modules
import com.google.inject.Provides
import com.twitter.finagle.util.DefaultTimer
import com.twitter.inject.TwitterModule
import com.twitter.util.Timer
import javax.inject.Singleton
object TimerModule extends TwitterModule {
@Singleton
@Provides
def providesTimer: Timer = DefaultTimer
}

View File

@@ -0,0 +1,19 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"frigate/frigate-common/src/main/scala/com/twitter/frigate/common/util",
"hermit/hermit-core/src/main/scala/com/twitter/hermit/store/common",
"relevance-platform/src/main/scala/com/twitter/relevance_platform/common/injection",
"representation-manager/client/src/main/scala/com/twitter/representation_manager",
"representation-manager/client/src/main/scala/com/twitter/representation_manager/config",
"representation-scorer/server/src/main/scala/com/twitter/representationscorer/common",
"src/scala/com/twitter/simclusters_v2/score",
"src/scala/com/twitter/topic_recos/common",
"src/scala/com/twitter/topic_recos/stores",
"src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift-scala",
"src/thrift/com/twitter/topic_recos:topic_recos-thrift-scala",
"stitch/stitch-storehaus",
],
)

View File

@@ -0,0 +1,168 @@
package com.twitter.representationscorer.scorestore
import com.twitter.bijection.scrooge.BinaryScalaCodec
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.memcached.Client
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.hashing.KeyHasher
import com.twitter.hermit.store.common.ObservedCachedReadableStore
import com.twitter.hermit.store.common.ObservedMemcachedReadableStore
import com.twitter.hermit.store.common.ObservedReadableStore
import com.twitter.relevance_platform.common.injection.LZ4Injection
import com.twitter.simclusters_v2.common.SimClustersEmbedding
import com.twitter.simclusters_v2.score.ScoreFacadeStore
import com.twitter.simclusters_v2.score.SimClustersEmbeddingPairScoreStore
import com.twitter.simclusters_v2.thriftscala.EmbeddingType.FavTfgTopic
import com.twitter.simclusters_v2.thriftscala.EmbeddingType.LogFavBasedKgoApeTopic
import com.twitter.simclusters_v2.thriftscala.EmbeddingType.LogFavBasedTweet
import com.twitter.simclusters_v2.thriftscala.ModelVersion.Model20m145kUpdated
import com.twitter.simclusters_v2.thriftscala.Score
import com.twitter.simclusters_v2.thriftscala.ScoreId
import com.twitter.simclusters_v2.thriftscala.ScoringAlgorithm
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.stitch.storehaus.StitchOfReadableStore
import com.twitter.storehaus.ReadableStore
import com.twitter.strato.client.{Client => StratoClient}
import com.twitter.topic_recos.stores.CertoTweetTopicScoresStore
import javax.inject.Inject
import javax.inject.Singleton
@Singleton()
class ScoreStore @Inject() (
simClustersEmbeddingStore: ReadableStore[SimClustersEmbeddingId, SimClustersEmbedding],
stratoClient: StratoClient,
representationScorerCacheClient: Client,
stats: StatsReceiver) {
private val keyHasher = KeyHasher.FNV1A_64
private val statsReceiver = stats.scope("score_store")
/* Score Stores */
private val simClustersEmbeddingCosineSimilarityScoreStore =
ObservedReadableStore(
SimClustersEmbeddingPairScoreStore
.buildCosineSimilarityStore(simClustersEmbeddingStore)
.toThriftStore
)(statsReceiver.scope("simClusters_embedding_cosine_similarity_score_store"))
private val simClustersEmbeddingDotProductScoreStore =
ObservedReadableStore(
SimClustersEmbeddingPairScoreStore
.buildDotProductStore(simClustersEmbeddingStore)
.toThriftStore
)(statsReceiver.scope("simClusters_embedding_dot_product_score_store"))
private val simClustersEmbeddingJaccardSimilarityScoreStore =
ObservedReadableStore(
SimClustersEmbeddingPairScoreStore
.buildJaccardSimilarityStore(simClustersEmbeddingStore)
.toThriftStore
)(statsReceiver.scope("simClusters_embedding_jaccard_similarity_score_store"))
private val simClustersEmbeddingEuclideanDistanceScoreStore =
ObservedReadableStore(
SimClustersEmbeddingPairScoreStore
.buildEuclideanDistanceStore(simClustersEmbeddingStore)
.toThriftStore
)(statsReceiver.scope("simClusters_embedding_euclidean_distance_score_store"))
private val simClustersEmbeddingManhattanDistanceScoreStore =
ObservedReadableStore(
SimClustersEmbeddingPairScoreStore
.buildManhattanDistanceStore(simClustersEmbeddingStore)
.toThriftStore
)(statsReceiver.scope("simClusters_embedding_manhattan_distance_score_store"))
private val simClustersEmbeddingLogCosineSimilarityScoreStore =
ObservedReadableStore(
SimClustersEmbeddingPairScoreStore
.buildLogCosineSimilarityStore(simClustersEmbeddingStore)
.toThriftStore
)(statsReceiver.scope("simClusters_embedding_log_cosine_similarity_score_store"))
private val simClustersEmbeddingExpScaledCosineSimilarityScoreStore =
ObservedReadableStore(
SimClustersEmbeddingPairScoreStore
.buildExpScaledCosineSimilarityStore(simClustersEmbeddingStore)
.toThriftStore
)(statsReceiver.scope("simClusters_embedding_exp_scaled_cosine_similarity_score_store"))
// Use the default setting
private val topicTweetRankingScoreStore =
TopicTweetRankingScoreStore.buildTopicTweetRankingStore(
FavTfgTopic,
LogFavBasedKgoApeTopic,
LogFavBasedTweet,
Model20m145kUpdated,
consumerEmbeddingMultiplier = 1.0,
producerEmbeddingMultiplier = 1.0
)
private val topicTweetsCortexThresholdStore = TopicTweetsCosineSimilarityAggregateStore(
TopicTweetsCosineSimilarityAggregateStore.DefaultScoreKeys,
statsReceiver.scope("topic_tweets_cortex_threshold_store")
)
val topicTweetCertoScoreStore: ObservedCachedReadableStore[ScoreId, Score] = {
val underlyingStore = ObservedReadableStore(
TopicTweetCertoScoreStore(CertoTweetTopicScoresStore.prodStore(stratoClient))
)(statsReceiver.scope("topic_tweet_certo_score_store"))
val memcachedStore = ObservedMemcachedReadableStore
.fromCacheClient(
backingStore = underlyingStore,
cacheClient = representationScorerCacheClient,
ttl = 10.minutes
)(
valueInjection = LZ4Injection.compose(BinaryScalaCodec(Score)),
statsReceiver = statsReceiver.scope("topic_tweet_certo_store_memcache"),
keyToString = { k: ScoreId =>
s"certocs:${keyHasher.hashKey(k.toString.getBytes)}"
}
)
ObservedCachedReadableStore.from[ScoreId, Score](
memcachedStore,
ttl = 5.minutes,
maxKeys = 1000000,
cacheName = "topic_tweet_certo_store_cache",
windowSize = 10000L
)(statsReceiver.scope("topic_tweet_certo_store_cache"))
}
val uniformScoringStore: ReadableStore[ScoreId, Score] =
ScoreFacadeStore.buildWithMetrics(
readableStores = Map(
ScoringAlgorithm.PairEmbeddingCosineSimilarity ->
simClustersEmbeddingCosineSimilarityScoreStore,
ScoringAlgorithm.PairEmbeddingDotProduct ->
simClustersEmbeddingDotProductScoreStore,
ScoringAlgorithm.PairEmbeddingJaccardSimilarity ->
simClustersEmbeddingJaccardSimilarityScoreStore,
ScoringAlgorithm.PairEmbeddingEuclideanDistance ->
simClustersEmbeddingEuclideanDistanceScoreStore,
ScoringAlgorithm.PairEmbeddingManhattanDistance ->
simClustersEmbeddingManhattanDistanceScoreStore,
ScoringAlgorithm.PairEmbeddingLogCosineSimilarity ->
simClustersEmbeddingLogCosineSimilarityScoreStore,
ScoringAlgorithm.PairEmbeddingExpScaledCosineSimilarity ->
simClustersEmbeddingExpScaledCosineSimilarityScoreStore,
// Certo normalized cosine score between topic-tweet pairs
ScoringAlgorithm.CertoNormalizedCosineScore
-> topicTweetCertoScoreStore,
// Certo normalized dot-product score between topic-tweet pairs
ScoringAlgorithm.CertoNormalizedDotProductScore
-> topicTweetCertoScoreStore
),
aggregatedStores = Map(
ScoringAlgorithm.WeightedSumTopicTweetRanking ->
topicTweetRankingScoreStore,
ScoringAlgorithm.CortexTopicTweetLabel ->
topicTweetsCortexThresholdStore,
),
statsReceiver = stats
)
val uniformScoringStoreStitch: ScoreId => com.twitter.stitch.Stitch[Score] =
StitchOfReadableStore(uniformScoringStore)
}
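`uniformScoringStore` above fans out to one underlying store per `ScoringAlgorithm`. A minimal standalone sketch of that facade pattern, with plain score functions standing in for the Storehaus stores (`ScoreFacadeSketch` and its member names are hypothetical):

```scala
object ScoreFacadeSketch {
  type ScoreFn = (Seq[Double], Seq[Double]) => Double

  val dotProduct: ScoreFn = (a, b) => a.zip(b).map { case (x, y) => x * y }.sum

  val cosine: ScoreFn = (a, b) => {
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dotProduct(a, b) / norm
  }

  // Route each request to the scorer registered for its algorithm,
  // the way ScoreFacadeStore dispatches on ScoringAlgorithm above.
  private val stores: Map[String, ScoreFn] =
    Map("PairEmbeddingDotProduct" -> dotProduct, "PairEmbeddingCosineSimilarity" -> cosine)

  def score(algorithm: String, a: Seq[Double], b: Seq[Double]): Option[Double] =
    stores.get(algorithm).map(_(a, b))
}
```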

View File

@@ -0,0 +1,106 @@
package com.twitter.representationscorer.scorestore
import com.twitter.simclusters_v2.common.TweetId
import com.twitter.simclusters_v2.thriftscala.ScoreInternalId.GenericPairScoreId
import com.twitter.simclusters_v2.thriftscala.ScoringAlgorithm.CertoNormalizedDotProductScore
import com.twitter.simclusters_v2.thriftscala.ScoringAlgorithm.CertoNormalizedCosineScore
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.TopicId
import com.twitter.simclusters_v2.thriftscala.{Score => ThriftScore}
import com.twitter.simclusters_v2.thriftscala.{ScoreId => ThriftScoreId}
import com.twitter.storehaus.FutureOps
import com.twitter.storehaus.ReadableStore
import com.twitter.topic_recos.thriftscala.Scores
import com.twitter.topic_recos.thriftscala.TopicToScores
import com.twitter.util.Future
/**
* Score store to get Certo <topic, tweet> scores.
* Currently, the store supports two Scoring Algorithms (i.e., two types of Certo scores):
* 1. NormalizedDotProduct
* 2. NormalizedCosine
* Querying with either scoring algorithm returns the corresponding Certo score.
*/
case class TopicTweetCertoScoreStore(certoStratoStore: ReadableStore[TweetId, TopicToScores])
extends ReadableStore[ThriftScoreId, ThriftScore] {
override def multiGet[K1 <: ThriftScoreId](ks: Set[K1]): Map[K1, Future[Option[ThriftScore]]] = {
val tweetIds =
ks.map(_.internalId).collect {
case GenericPairScoreId(scoreId) =>
((scoreId.id1, scoreId.id2): @annotation.nowarn(
"msg=may not be exhaustive|max recursion depth")) match {
case (InternalId.TweetId(tweetId), _) => tweetId
case (_, InternalId.TweetId(tweetId)) => tweetId
}
}
val result = for {
certoScores <- Future.collect(certoStratoStore.multiGet(tweetIds))
} yield {
ks.map { k =>
(k.algorithm, k.internalId) match {
case (CertoNormalizedDotProductScore, GenericPairScoreId(scoreId)) =>
(scoreId.id1, scoreId.id2) match {
case (InternalId.TweetId(tweetId), InternalId.TopicId(topicId)) =>
(
k,
extractScore(
tweetId,
topicId,
certoScores,
_.followerL2NormalizedDotProduct8HrHalfLife))
case (InternalId.TopicId(topicId), InternalId.TweetId(tweetId)) =>
(
k,
extractScore(
tweetId,
topicId,
certoScores,
_.followerL2NormalizedDotProduct8HrHalfLife))
case _ => (k, None)
}
case (CertoNormalizedCosineScore, GenericPairScoreId(scoreId)) =>
(scoreId.id1, scoreId.id2) match {
case (InternalId.TweetId(tweetId), InternalId.TopicId(topicId)) =>
(
k,
extractScore(
tweetId,
topicId,
certoScores,
_.followerL2NormalizedCosineSimilarity8HrHalfLife))
case (InternalId.TopicId(topicId), InternalId.TweetId(tweetId)) =>
(
k,
extractScore(
tweetId,
topicId,
certoScores,
_.followerL2NormalizedCosineSimilarity8HrHalfLife))
case _ => (k, None)
}
case _ => (k, None)
}
}.toMap
}
FutureOps.liftValues(ks, result)
}
/**
* Given tweetToCertoScores, extract certain Certo score between the given tweetId and topicId.
* The Certo score of interest is specified using scoreExtractor.
*/
def extractScore(
tweetId: TweetId,
topicId: TopicId,
tweetToCertoScores: Map[TweetId, Option[TopicToScores]],
scoreExtractor: Scores => Double
): Option[ThriftScore] = {
tweetToCertoScores.get(tweetId).flatMap {
case Some(topicToScores) =>
topicToScores.topicToScores.flatMap(_.get(topicId).map(scoreExtractor).map(ThriftScore(_)))
case _ => Some(ThriftScore(0.0))
}
}
}
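`extractScore` above threads through two layers of `Option`: an absent tweet yields `None`, a tweet fetched without a score map yields a default 0.0 score, and a present map is probed for the topic with the caller-supplied selector. A standalone sketch of the same pattern using plain types (`CertoExtractSketch` and its `Scores` case class are hypothetical stand-ins for the Thrift structs):

```scala
object CertoExtractSketch {
  final case class Scores(dotProduct: Double, cosine: Double)

  // Mirrors extractScore above: None when the tweet was never fetched,
  // Some(0.0) when it was fetched but carried no score map, and the
  // selected field when the topic is present.
  def extractScore(
    tweetId: Long,
    topicId: Long,
    tweetToScores: Map[Long, Option[Map[Long, Scores]]],
    selector: Scores => Double
  ): Option[Double] =
    tweetToScores.get(tweetId).flatMap {
      case Some(topicToScores) => topicToScores.get(topicId).map(selector)
      case None => Some(0.0)
    }
}
```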

View File

@@ -0,0 +1,48 @@
package com.twitter.representationscorer.scorestore
import com.twitter.simclusters_v2.score.WeightedSumAggregatedScoreStore
import com.twitter.simclusters_v2.score.WeightedSumAggregatedScoreStore.WeightedSumAggregatedScoreParameter
import com.twitter.simclusters_v2.thriftscala.{EmbeddingType, ModelVersion, ScoringAlgorithm}
object TopicTweetRankingScoreStore {
val producerEmbeddingScoreMultiplier = 1.0
val consumerEmbeddingScoreMultiplier = 1.0
/**
* Build the scoring store for TopicTweet Ranking based on Default Multipliers.
* If you want to compare rankings under different multipliers, register a new
* ScoringAlgorithm and let the upstream select a different scoringAlgorithm via params.
*/
def buildTopicTweetRankingStore(
consumerEmbeddingType: EmbeddingType,
producerEmbeddingType: EmbeddingType,
tweetEmbeddingType: EmbeddingType,
modelVersion: ModelVersion,
consumerEmbeddingMultiplier: Double = consumerEmbeddingScoreMultiplier,
producerEmbeddingMultiplier: Double = producerEmbeddingScoreMultiplier
): WeightedSumAggregatedScoreStore = {
WeightedSumAggregatedScoreStore(
List(
WeightedSumAggregatedScoreParameter(
ScoringAlgorithm.PairEmbeddingCosineSimilarity,
consumerEmbeddingMultiplier,
WeightedSumAggregatedScoreStore.genericPairScoreIdToSimClustersEmbeddingPairScoreId(
consumerEmbeddingType,
tweetEmbeddingType,
modelVersion
)
),
WeightedSumAggregatedScoreParameter(
ScoringAlgorithm.PairEmbeddingCosineSimilarity,
producerEmbeddingMultiplier,
WeightedSumAggregatedScoreStore.genericPairScoreIdToSimClustersEmbeddingPairScoreId(
producerEmbeddingType,
tweetEmbeddingType,
modelVersion
)
)
)
)
}
}
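The ranking store above combines a consumer-embedding and a producer-embedding cosine score as a weighted sum, with both multipliers defaulting to 1.0. A standalone sketch of just that arithmetic (`WeightedSumRankingSketch` is a hypothetical name):

```scala
object WeightedSumRankingSketch {
  // Final topic-tweet ranking score: each embedding component scaled by its multiplier.
  def rank(
    consumerScore: Double,
    producerScore: Double,
    consumerMultiplier: Double = 1.0,
    producerMultiplier: Double = 1.0
  ): Double =
    consumerScore * consumerMultiplier + producerScore * producerMultiplier
}
```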

View File

@@ -0,0 +1,148 @@
package com.twitter.representationscorer.scorestore
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.frigate.common.util.StatsUtil
import com.twitter.representationscorer.scorestore.TopicTweetsCosineSimilarityAggregateStore.ScoreKey
import com.twitter.simclusters_v2.common.TweetId
import com.twitter.simclusters_v2.score.AggregatedScoreStore
import com.twitter.simclusters_v2.thriftscala.ScoreInternalId.GenericPairScoreId
import com.twitter.simclusters_v2.thriftscala.ScoringAlgorithm.CortexTopicTweetLabel
import com.twitter.simclusters_v2.thriftscala.{
EmbeddingType,
InternalId,
ModelVersion,
ScoreInternalId,
ScoringAlgorithm,
SimClustersEmbeddingId,
TopicId,
Score => ThriftScore,
ScoreId => ThriftScoreId,
SimClustersEmbeddingPairScoreId => ThriftSimClustersEmbeddingPairScoreId
}
import com.twitter.storehaus.ReadableStore
import com.twitter.topic_recos.common.Configs.{DefaultModelVersion, MinCosineSimilarityScore}
import com.twitter.topic_recos.common._
import com.twitter.util.Future
/**
* Calculates the cosine similarity scores of arbitrary combinations of TopicEmbeddings and
* TweetEmbeddings.
* The class has 2 uses:
* 1. For internal uses. TSP will call this store to fetch the raw scores for (topic, tweet) with
* all available embedding types. We calculate all the scores here, so the caller can do filtering
* & score caching on their side. This will make it possible to DDG different embedding scores.
*
* 2. For external calls from Cortex. We return true (or 1.0) for any given (topic, tweet) if their
* cosine similarity passes the threshold for any of the embedding types.
* The expected input type is
* ScoreId(
* PairEmbeddingCosineSimilarity,
* GenericPairScoreId(TopicId, TweetId)
* )
*/
case class TopicTweetsCosineSimilarityAggregateStore(
scoreKeys: Seq[ScoreKey],
statsReceiver: StatsReceiver)
extends AggregatedScoreStore {
def toCortexScore(scoresMap: Map[ScoreKey, Double]): Double = {
val passThreshold = scoresMap.exists {
case (_, score) => score >= MinCosineSimilarityScore
}
if (passThreshold) 1.0 else 0.0
}
/**
* To be called by Cortex through the Unified Score API ONLY. Calculates all embedding scores for
* the (topic, tweet) pair and returns 1.0 if any of them passes the minimum threshold.
*
* Expects a GenericPairScoreId(PairEmbeddingCosineSimilarity, (TopicId, TweetId)) as input.
*/
override def get(k: ThriftScoreId): Future[Option[ThriftScore]] = {
StatsUtil.trackOptionStats(statsReceiver) {
(k.algorithm, k.internalId) match {
case (CortexTopicTweetLabel, GenericPairScoreId(genericPairScoreId)) =>
(genericPairScoreId.id1, genericPairScoreId.id2) match {
case (InternalId.TopicId(topicId), InternalId.TweetId(tweetId)) =>
TopicTweetsCosineSimilarityAggregateStore
.getRawScoresMap(topicId, tweetId, scoreKeys, scoreFacadeStore)
.map { scoresMap => Some(ThriftScore(toCortexScore(scoresMap))) }
case (InternalId.TweetId(tweetId), InternalId.TopicId(topicId)) =>
TopicTweetsCosineSimilarityAggregateStore
.getRawScoresMap(topicId, tweetId, scoreKeys, scoreFacadeStore)
.map { scoresMap => Some(ThriftScore(toCortexScore(scoresMap))) }
case _ =>
Future.None
// Do not accept other InternalId combinations
}
case _ =>
// Do not accept other Id types for now
Future.None
}
}
}
}
object TopicTweetsCosineSimilarityAggregateStore {
val TopicEmbeddingTypes: Seq[EmbeddingType] =
Seq(
EmbeddingType.FavTfgTopic,
EmbeddingType.LogFavBasedKgoApeTopic
)
// Add new embedding types here if you want to test new Tweet embedding performance.
val TweetEmbeddingTypes: Seq[EmbeddingType] = Seq(EmbeddingType.LogFavBasedTweet)
val ModelVersions: Seq[ModelVersion] =
Seq(DefaultModelVersion)
val DefaultScoreKeys: Seq[ScoreKey] = {
for {
modelVersion <- ModelVersions
topicEmbeddingType <- TopicEmbeddingTypes
tweetEmbeddingType <- TweetEmbeddingTypes
} yield {
ScoreKey(
topicEmbeddingType = topicEmbeddingType,
tweetEmbeddingType = tweetEmbeddingType,
modelVersion = modelVersion
)
}
}
case class ScoreKey(
topicEmbeddingType: EmbeddingType,
tweetEmbeddingType: EmbeddingType,
modelVersion: ModelVersion)
def getRawScoresMap(
topicId: TopicId,
tweetId: TweetId,
scoreKeys: Seq[ScoreKey],
uniformScoringStore: ReadableStore[ThriftScoreId, ThriftScore]
): Future[Map[ScoreKey, Double]] = {
val scoresMapFut = scoreKeys.map { key =>
val scoreInternalId = ScoreInternalId.SimClustersEmbeddingPairScoreId(
ThriftSimClustersEmbeddingPairScoreId(
buildTopicEmbedding(topicId, key.topicEmbeddingType, key.modelVersion),
SimClustersEmbeddingId(
key.tweetEmbeddingType,
key.modelVersion,
InternalId.TweetId(tweetId))
))
val scoreFut = uniformScoringStore
.get(
ThriftScoreId(
algorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity, // Hard code as cosine sim
internalId = scoreInternalId
))
key -> scoreFut
}.toMap
Future
.collect(scoresMapFut).map(_.collect {
case (key, Some(ThriftScore(score))) =>
(key, score)
})
}
}
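The Cortex path above collapses many per-embedding cosine scores into one binary label. A minimal self-contained sketch of that thresholding logic, with an illustrative threshold and key shape standing in for the production `MinCosineSimilarityScore` and `ScoreKey`:

```scala
// Sketch of toCortexScore: return 1.0 if any embedding combination's cosine
// similarity passes the minimum threshold, else 0.0. Threshold value and
// ScoreKey shape here are illustrative, not the production config.
object CortexScoreSketch {
  val MinCosineSimilarityScore = 0.5 // assumed value for illustration

  case class ScoreKey(topicEmbedding: String, tweetEmbedding: String)

  def toCortexScore(scoresMap: Map[ScoreKey, Double]): Double =
    if (scoresMap.values.exists(_ >= MinCosineSimilarityScore)) 1.0 else 0.0
}
```

Because only existence matters here, a single passing embedding type is enough; callers that need the raw per-embedding scores use the internal path instead.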

View File

@@ -0,0 +1,20 @@
scala_library(
compiler_option_sets = ["fatal_warnings"],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"3rdparty/jvm/com/github/ben-manes/caffeine",
"finatra/inject/inject-core/src/main/scala",
"representation-scorer/server/src/main/scala/com/twitter/representationscorer/common",
"representation-scorer/server/src/main/scala/com/twitter/representationscorer/scorestore",
"representation-scorer/server/src/main/thrift:thrift-scala",
"src/thrift/com/twitter/twistly:twistly-scala",
"stitch/stitch-core",
"stitch/stitch-core:cache",
"strato/config/columns/recommendations/twistly:twistly-strato-client",
"strato/config/columns/recommendations/user-signal-service:user-signal-service-strato-client",
"strato/src/main/scala/com/twitter/strato/client",
"user-signal-service/thrift/src/main/thrift:thrift-scala",
"util/util-core",
],
)

View File

@@ -0,0 +1,65 @@
package com.twitter.representationscorer.twistlyfeatures
import com.twitter.conversions.DurationOps._
import com.twitter.util.Duration
import com.twitter.util.Time
case class Engagements(
favs7d: Seq[UserSignal] = Nil,
retweets7d: Seq[UserSignal] = Nil,
follows30d: Seq[UserSignal] = Nil,
shares7d: Seq[UserSignal] = Nil,
replies7d: Seq[UserSignal] = Nil,
originalTweets7d: Seq[UserSignal] = Nil,
videoPlaybacks7d: Seq[UserSignal] = Nil,
block30d: Seq[UserSignal] = Nil,
mute30d: Seq[UserSignal] = Nil,
report30d: Seq[UserSignal] = Nil,
dontlike30d: Seq[UserSignal] = Nil,
seeFewer30d: Seq[UserSignal] = Nil) {
import Engagements._
private val now = Time.now
private val oneDayAgo = (now - OneDaySpan).inMillis
private val sevenDaysAgo = (now - SevenDaysSpan).inMillis
// All ids from the signals grouped by type (tweetIds, userIds, etc)
val tweetIds: Seq[Long] =
(favs7d ++ retweets7d ++ shares7d
++ replies7d ++ originalTweets7d ++ videoPlaybacks7d
++ report30d ++ dontlike30d ++ seeFewer30d)
.map(_.targetId)
val authorIds: Seq[Long] = (follows30d ++ block30d ++ mute30d).map(_.targetId)
// Tweet signals
val dontlike7d: Seq[UserSignal] = dontlike30d.filter(_.timestamp > sevenDaysAgo)
val seeFewer7d: Seq[UserSignal] = seeFewer30d.filter(_.timestamp > sevenDaysAgo)
val favs1d: Seq[UserSignal] = favs7d.filter(_.timestamp > oneDayAgo)
val retweets1d: Seq[UserSignal] = retweets7d.filter(_.timestamp > oneDayAgo)
val shares1d: Seq[UserSignal] = shares7d.filter(_.timestamp > oneDayAgo)
val replies1d: Seq[UserSignal] = replies7d.filter(_.timestamp > oneDayAgo)
val originalTweets1d: Seq[UserSignal] = originalTweets7d.filter(_.timestamp > oneDayAgo)
val videoPlaybacks1d: Seq[UserSignal] = videoPlaybacks7d.filter(_.timestamp > oneDayAgo)
val dontlike1d: Seq[UserSignal] = dontlike7d.filter(_.timestamp > oneDayAgo)
val seeFewer1d: Seq[UserSignal] = seeFewer7d.filter(_.timestamp > oneDayAgo)
// User signals
val follows7d: Seq[UserSignal] = follows30d.filter(_.timestamp > sevenDaysAgo)
val block7d: Seq[UserSignal] = block30d.filter(_.timestamp > sevenDaysAgo)
val mute7d: Seq[UserSignal] = mute30d.filter(_.timestamp > sevenDaysAgo)
val report7d: Seq[UserSignal] = report30d.filter(_.timestamp > sevenDaysAgo)
val block1d: Seq[UserSignal] = block7d.filter(_.timestamp > oneDayAgo)
val mute1d: Seq[UserSignal] = mute7d.filter(_.timestamp > oneDayAgo)
val report1d: Seq[UserSignal] = report7d.filter(_.timestamp > oneDayAgo)
}
object Engagements {
val OneDaySpan: Duration = 1.days
val SevenDaysSpan: Duration = 7.days
val ThirtyDaysSpan: Duration = 30.days
}
case class UserSignal(targetId: Long, timestamp: Long)
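`Engagements` fetches each signal once at its widest window and derives the narrower windows by timestamp filtering, so no extra backend calls are needed. The pattern, reduced to a self-contained sketch (names here are illustrative):

```scala
// Sketch of the window-derivation pattern in Engagements: narrower windows
// (1d, 7d) are filtered views of the widest fetched window (7d or 30d).
object WindowSketch {
  case class UserSignal(targetId: Long, timestamp: Long)

  val OneDayMs: Long = 24L * 60 * 60 * 1000

  // Keep only signals strictly newer than the cutoff timestamp, mirroring
  // the `_.timestamp > sevenDaysAgo` filters above.
  def within(signals: Seq[UserSignal], cutoffMs: Long): Seq[UserSignal] =
    signals.filter(_.timestamp > cutoffMs)
}
```

For example, `favs1d` is just `within(favs7d, now - OneDayMs)`: the one-day list is always a subset of the seven-day list.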

View File

@@ -0,0 +1,3 @@
package com.twitter.representationscorer.twistlyfeatures
case class ScoreResult(id: Long, score: Option[Double])

View File

@@ -0,0 +1,474 @@
package com.twitter.representationscorer.twistlyfeatures
import com.twitter.finagle.stats.Counter
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.representationscorer.common.TweetId
import com.twitter.representationscorer.common.UserId
import com.twitter.representationscorer.scorestore.ScoreStore
import com.twitter.representationscorer.thriftscala.SimClustersRecentEngagementSimilarities
import com.twitter.simclusters_v2.thriftscala.EmbeddingType
import com.twitter.simclusters_v2.thriftscala.InternalId
import com.twitter.simclusters_v2.thriftscala.ModelVersion
import com.twitter.simclusters_v2.thriftscala.ScoreId
import com.twitter.simclusters_v2.thriftscala.ScoreInternalId
import com.twitter.simclusters_v2.thriftscala.ScoringAlgorithm
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingId
import com.twitter.simclusters_v2.thriftscala.SimClustersEmbeddingPairScoreId
import com.twitter.stitch.Stitch
import javax.inject.Inject
class Scorer @Inject() (
fetchEngagementsFromUSS: Long => Stitch[Engagements],
scoreStore: ScoreStore,
stats: StatsReceiver) {
import Scorer._
private val scoreStats = stats.scope("score")
private val scoreCalculationStats = scoreStats.scope("calculation")
private val scoreResultStats = scoreStats.scope("result")
private val scoresNonEmptyCounter = scoreResultStats.scope("all").counter("nonEmpty")
private val scoresNonZeroCounter = scoreResultStats.scope("all").counter("nonZero")
private val tweetScoreStats = scoreCalculationStats.scope("tweetScore").stat("latency")
private val userScoreStats = scoreCalculationStats.scope("userScore").stat("latency")
private val favNonZero = scoreResultStats.scope("favs").counter("nonZero")
private val favNonEmpty = scoreResultStats.scope("favs").counter("nonEmpty")
private val retweetsNonZero = scoreResultStats.scope("retweets").counter("nonZero")
private val retweetsNonEmpty = scoreResultStats.scope("retweets").counter("nonEmpty")
private val followsNonZero = scoreResultStats.scope("follows").counter("nonZero")
private val followsNonEmpty = scoreResultStats.scope("follows").counter("nonEmpty")
private val sharesNonZero = scoreResultStats.scope("shares").counter("nonZero")
private val sharesNonEmpty = scoreResultStats.scope("shares").counter("nonEmpty")
private val repliesNonZero = scoreResultStats.scope("replies").counter("nonZero")
private val repliesNonEmpty = scoreResultStats.scope("replies").counter("nonEmpty")
private val originalTweetsNonZero = scoreResultStats.scope("originalTweets").counter("nonZero")
private val originalTweetsNonEmpty = scoreResultStats.scope("originalTweets").counter("nonEmpty")
private val videoViewsNonZero = scoreResultStats.scope("videoViews").counter("nonZero")
private val videoViewsNonEmpty = scoreResultStats.scope("videoViews").counter("nonEmpty")
private val blockNonZero = scoreResultStats.scope("block").counter("nonZero")
private val blockNonEmpty = scoreResultStats.scope("block").counter("nonEmpty")
private val muteNonZero = scoreResultStats.scope("mute").counter("nonZero")
private val muteNonEmpty = scoreResultStats.scope("mute").counter("nonEmpty")
private val reportNonZero = scoreResultStats.scope("report").counter("nonZero")
private val reportNonEmpty = scoreResultStats.scope("report").counter("nonEmpty")
private val dontlikeNonZero = scoreResultStats.scope("dontlike").counter("nonZero")
private val dontlikeNonEmpty = scoreResultStats.scope("dontlike").counter("nonEmpty")
private val seeFewerNonZero = scoreResultStats.scope("seeFewer").counter("nonZero")
private val seeFewerNonEmpty = scoreResultStats.scope("seeFewer").counter("nonEmpty")
private def getTweetScores(
candidateTweetId: TweetId,
sourceTweetIds: Seq[TweetId]
): Stitch[Seq[ScoreResult]] = {
val getScoresStitch = Stitch.traverse(sourceTweetIds) { sourceTweetId =>
scoreStore
.uniformScoringStoreStitch(getTweetScoreId(sourceTweetId, candidateTweetId))
.liftNotFoundToOption
.map(score => ScoreResult(sourceTweetId, score.map(_.score)))
}
Stitch.time(getScoresStitch).flatMap {
case (tryResult, duration) =>
tweetScoreStats.add(duration.inMillis)
Stitch.const(tryResult)
}
}
private def getUserScores(
tweetId: TweetId,
authorIds: Seq[UserId]
): Stitch[Seq[ScoreResult]] = {
val getScoresStitch = Stitch.traverse(authorIds) { authorId =>
scoreStore
.uniformScoringStoreStitch(getAuthorScoreId(authorId, tweetId))
.liftNotFoundToOption
.map(score => ScoreResult(authorId, score.map(_.score)))
}
Stitch.time(getScoresStitch).flatMap {
case (tryResult, duration) =>
userScoreStats.add(duration.inMillis)
Stitch.const(tryResult)
}
}
/**
* Get the [[SimClustersRecentEngagementSimilarities]] result containing the similarity
* features for the given userId-TweetId.
*/
def get(
userId: UserId,
tweetId: TweetId
): Stitch[SimClustersRecentEngagementSimilarities] = {
get(userId, Seq(tweetId)).map(x => x.head)
}
/**
* Get a list of [[SimClustersRecentEngagementSimilarities]] results containing the similarity
* features between the given user and each of the given tweets.
* The results are guaranteed to have the same size and order as the requested tweetIds.
*/
def get(
userId: UserId,
tweetIds: Seq[TweetId]
): Stitch[Seq[SimClustersRecentEngagementSimilarities]] = {
fetchEngagementsFromUSS(userId)
.flatMap(engagements => {
// For each tweet received in the request, compute the similarity scores between them
// and the user signals fetched from USS.
Stitch
.join(
Stitch.traverse(tweetIds)(id => getTweetScores(id, engagements.tweetIds)),
Stitch.traverse(tweetIds)(id => getUserScores(id, engagements.authorIds)),
)
.map {
case (tweetScoresSeq, userScoreSeq) =>
// All seqs have equal size: scores that don't exist are returned as None rather than omitted.
(tweetScoresSeq, userScoreSeq).zipped.map { (tweetScores, userScores) =>
computeSimilarityScoresPerTweet(
engagements,
tweetScores.groupBy(_.id),
userScores.groupBy(_.id))
}
}
})
}
/**
*
* Computes the [[SimClustersRecentEngagementSimilarities]]
* using the given tweet-tweet and user-tweet scores in TweetScoresMap
* and the user signals in [[Engagements]].
*/
private def computeSimilarityScoresPerTweet(
engagements: Engagements,
tweetScores: Map[TweetId, Seq[ScoreResult]],
authorScores: Map[UserId, Seq[ScoreResult]]
): SimClustersRecentEngagementSimilarities = {
val favs7d = engagements.favs7d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val favs1d = engagements.favs1d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val retweets7d = engagements.retweets7d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val retweets1d = engagements.retweets1d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val follows30d = engagements.follows30d.view
.flatMap(s => authorScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val follows7d = engagements.follows7d.view
.flatMap(s => authorScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val shares7d = engagements.shares7d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val shares1d = engagements.shares1d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val replies7d = engagements.replies7d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val replies1d = engagements.replies1d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val originalTweets7d = engagements.originalTweets7d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val originalTweets1d = engagements.originalTweets1d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val videoViews7d = engagements.videoPlaybacks7d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val videoViews1d = engagements.videoPlaybacks1d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
// Block and mute signals target users, so their scores come from authorScores.
val block30d = engagements.block30d.view
.flatMap(s => authorScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val block7d = engagements.block7d.view
.flatMap(s => authorScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val block1d = engagements.block1d.view
.flatMap(s => authorScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val mute30d = engagements.mute30d.view
.flatMap(s => authorScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val mute7d = engagements.mute7d.view
.flatMap(s => authorScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val mute1d = engagements.mute1d.view
.flatMap(s => authorScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val report30d = engagements.report30d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val report7d = engagements.report7d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val report1d = engagements.report1d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val dontlike30d = engagements.dontlike30d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val dontlike7d = engagements.dontlike7d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val dontlike1d = engagements.dontlike1d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val seeFewer30d = engagements.seeFewer30d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val seeFewer7d = engagements.seeFewer7d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val seeFewer1d = engagements.seeFewer1d.view
.flatMap(s => tweetScores.get(s.targetId))
.flatten.flatMap(_.score)
.force
val result = SimClustersRecentEngagementSimilarities(
fav1dLast10Max = max(favs1d),
fav1dLast10Avg = avg(favs1d),
fav7dLast10Max = max(favs7d),
fav7dLast10Avg = avg(favs7d),
retweet1dLast10Max = max(retweets1d),
retweet1dLast10Avg = avg(retweets1d),
retweet7dLast10Max = max(retweets7d),
retweet7dLast10Avg = avg(retweets7d),
follow7dLast10Max = max(follows7d),
follow7dLast10Avg = avg(follows7d),
follow30dLast10Max = max(follows30d),
follow30dLast10Avg = avg(follows30d),
share1dLast10Max = max(shares1d),
share1dLast10Avg = avg(shares1d),
share7dLast10Max = max(shares7d),
share7dLast10Avg = avg(shares7d),
reply1dLast10Max = max(replies1d),
reply1dLast10Avg = avg(replies1d),
reply7dLast10Max = max(replies7d),
reply7dLast10Avg = avg(replies7d),
originalTweet1dLast10Max = max(originalTweets1d),
originalTweet1dLast10Avg = avg(originalTweets1d),
originalTweet7dLast10Max = max(originalTweets7d),
originalTweet7dLast10Avg = avg(originalTweets7d),
videoPlayback1dLast10Max = max(videoViews1d),
videoPlayback1dLast10Avg = avg(videoViews1d),
videoPlayback7dLast10Max = max(videoViews7d),
videoPlayback7dLast10Avg = avg(videoViews7d),
block1dLast10Max = max(block1d),
block1dLast10Avg = avg(block1d),
block7dLast10Max = max(block7d),
block7dLast10Avg = avg(block7d),
block30dLast10Max = max(block30d),
block30dLast10Avg = avg(block30d),
mute1dLast10Max = max(mute1d),
mute1dLast10Avg = avg(mute1d),
mute7dLast10Max = max(mute7d),
mute7dLast10Avg = avg(mute7d),
mute30dLast10Max = max(mute30d),
mute30dLast10Avg = avg(mute30d),
report1dLast10Max = max(report1d),
report1dLast10Avg = avg(report1d),
report7dLast10Max = max(report7d),
report7dLast10Avg = avg(report7d),
report30dLast10Max = max(report30d),
report30dLast10Avg = avg(report30d),
dontlike1dLast10Max = max(dontlike1d),
dontlike1dLast10Avg = avg(dontlike1d),
dontlike7dLast10Max = max(dontlike7d),
dontlike7dLast10Avg = avg(dontlike7d),
dontlike30dLast10Max = max(dontlike30d),
dontlike30dLast10Avg = avg(dontlike30d),
seeFewer1dLast10Max = max(seeFewer1d),
seeFewer1dLast10Avg = avg(seeFewer1d),
seeFewer7dLast10Max = max(seeFewer7d),
seeFewer7dLast10Avg = avg(seeFewer7d),
seeFewer30dLast10Max = max(seeFewer30d),
seeFewer30dLast10Avg = avg(seeFewer30d),
)
trackStats(result)
result
}
private def trackStats(result: SimClustersRecentEngagementSimilarities): Unit = {
val scores = Seq(
result.fav7dLast10Max,
result.retweet7dLast10Max,
result.follow30dLast10Max,
result.share1dLast10Max,
result.share7dLast10Max,
result.reply7dLast10Max,
result.originalTweet7dLast10Max,
result.videoPlayback7dLast10Max,
result.block30dLast10Max,
result.mute30dLast10Max,
result.report30dLast10Max,
result.dontlike30dLast10Max,
result.seeFewer30dLast10Max
)
val nonEmpty = scores.exists(_.isDefined)
val nonZero = scores.exists(_.exists(_ > 0))
if (nonEmpty) {
scoresNonEmptyCounter.incr()
}
if (nonZero) {
scoresNonZeroCounter.incr()
}
// We use the largest window of a given type of score,
// because the largest window is inclusive of smaller windows.
trackSignalStats(favNonEmpty, favNonZero, result.fav7dLast10Avg)
trackSignalStats(retweetsNonEmpty, retweetsNonZero, result.retweet7dLast10Avg)
trackSignalStats(followsNonEmpty, followsNonZero, result.follow30dLast10Avg)
trackSignalStats(sharesNonEmpty, sharesNonZero, result.share7dLast10Avg)
trackSignalStats(repliesNonEmpty, repliesNonZero, result.reply7dLast10Avg)
trackSignalStats(originalTweetsNonEmpty, originalTweetsNonZero, result.originalTweet7dLast10Avg)
trackSignalStats(videoViewsNonEmpty, videoViewsNonZero, result.videoPlayback7dLast10Avg)
trackSignalStats(blockNonEmpty, blockNonZero, result.block30dLast10Avg)
trackSignalStats(muteNonEmpty, muteNonZero, result.mute30dLast10Avg)
trackSignalStats(reportNonEmpty, reportNonZero, result.report30dLast10Avg)
trackSignalStats(dontlikeNonEmpty, dontlikeNonZero, result.dontlike30dLast10Avg)
trackSignalStats(seeFewerNonEmpty, seeFewerNonZero, result.seeFewer30dLast10Avg)
}
private def trackSignalStats(nonEmpty: Counter, nonZero: Counter, score: Option[Double]): Unit = {
if (score.nonEmpty) {
nonEmpty.incr()
if (score.get > 0)
nonZero.incr()
}
}
}
object Scorer {
def avg(s: Traversable[Double]): Option[Double] =
if (s.isEmpty) None else Some(s.sum / s.size)
def max(s: Traversable[Double]): Option[Double] =
if (s.isEmpty) None else Some(s.max) // fold from 0.0 would be wrong for negative cosine similarities
private def getAuthorScoreId(
userId: UserId,
tweetId: TweetId
) = {
ScoreId(
algorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity,
internalId = ScoreInternalId.SimClustersEmbeddingPairScoreId(
SimClustersEmbeddingPairScoreId(
SimClustersEmbeddingId(
internalId = InternalId.UserId(userId),
modelVersion = ModelVersion.Model20m145k2020,
embeddingType = EmbeddingType.FavBasedProducer
),
SimClustersEmbeddingId(
internalId = InternalId.TweetId(tweetId),
modelVersion = ModelVersion.Model20m145k2020,
embeddingType = EmbeddingType.LogFavBasedTweet
)
))
)
}
private def getTweetScoreId(
sourceTweetId: TweetId,
candidateTweetId: TweetId
) = {
ScoreId(
algorithm = ScoringAlgorithm.PairEmbeddingCosineSimilarity,
internalId = ScoreInternalId.SimClustersEmbeddingPairScoreId(
SimClustersEmbeddingPairScoreId(
SimClustersEmbeddingId(
internalId = InternalId.TweetId(sourceTweetId),
modelVersion = ModelVersion.Model20m145k2020,
embeddingType = EmbeddingType.LogFavLongestL2EmbeddingTweet
),
SimClustersEmbeddingId(
internalId = InternalId.TweetId(candidateTweetId),
modelVersion = ModelVersion.Model20m145k2020,
embeddingType = EmbeddingType.LogFavBasedTweet
)
))
)
}
}
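Each feature in `SimClustersRecentEngagementSimilarities` is the max or average of the per-signal similarity scores (up to 10 per signal type), with `None` when no scores exist. A self-contained sketch of that aggregation; note the max is taken over the actual values, so it also behaves correctly for negative cosine similarities:

```scala
// Sketch of Scorer's aggregation helpers: an absent score set produces None
// rather than a misleading 0.0 default.
object AggregationSketch {
  def avg(s: Seq[Double]): Option[Double] =
    if (s.isEmpty) None else Some(s.sum / s.size)

  def max(s: Seq[Double]): Option[Double] =
    if (s.isEmpty) None else Some(s.max)
}
```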

View File

@@ -0,0 +1,155 @@
package com.twitter.representationscorer.twistlyfeatures
import com.twitter.decider.SimpleRecipient
import com.twitter.finagle.stats.Stat
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.representationscorer.common._
import com.twitter.representationscorer.twistlyfeatures.Engagements._
import com.twitter.simclusters_v2.common.SimClustersEmbeddingId.LongInternalId
import com.twitter.stitch.Stitch
import com.twitter.strato.generated.client.recommendations.user_signal_service.SignalsClientColumn
import com.twitter.strato.generated.client.recommendations.user_signal_service.SignalsClientColumn.Value
import com.twitter.usersignalservice.thriftscala.BatchSignalRequest
import com.twitter.usersignalservice.thriftscala.SignalRequest
import com.twitter.usersignalservice.thriftscala.SignalType
import com.twitter.util.Time
import scala.collection.mutable.ArrayBuffer
import com.twitter.usersignalservice.thriftscala.ClientIdentifier
class UserSignalServiceRecentEngagementsClient(
stratoClient: SignalsClientColumn,
decider: RepresentationScorerDecider,
stats: StatsReceiver) {
import UserSignalServiceRecentEngagementsClient._
private val signalStats = stats.scope("user-signal-service", "signal")
private val signalTypeStats: Map[SignalType, Stat] =
SignalType.list.map(s => (s, signalStats.scope(s.name).stat("size"))).toMap
def get(userId: UserId): Stitch[Engagements] = {
val request = buildRequest(userId)
stratoClient.fetcher.fetch(request).map(_.v).lowerFromOption().map { response =>
val now = Time.now
val sevenDaysAgo = now - SevenDaysSpan
val thirtyDaysAgo = now - ThirtyDaysSpan
Engagements(
favs7d = getUserSignals(response, SignalType.TweetFavorite, sevenDaysAgo),
retweets7d = getUserSignals(response, SignalType.Retweet, sevenDaysAgo),
follows30d = getUserSignals(response, SignalType.AccountFollowWithDelay, thirtyDaysAgo),
shares7d = getUserSignals(response, SignalType.TweetShareV1, sevenDaysAgo),
replies7d = getUserSignals(response, SignalType.Reply, sevenDaysAgo),
originalTweets7d = getUserSignals(response, SignalType.OriginalTweet, sevenDaysAgo),
videoPlaybacks7d =
getUserSignals(response, SignalType.VideoView90dPlayback50V1, sevenDaysAgo),
block30d = getUserSignals(response, SignalType.AccountBlock, thirtyDaysAgo),
mute30d = getUserSignals(response, SignalType.AccountMute, thirtyDaysAgo),
report30d = getUserSignals(response, SignalType.TweetReport, thirtyDaysAgo),
dontlike30d = getUserSignals(response, SignalType.TweetDontLike, thirtyDaysAgo),
seeFewer30d = getUserSignals(response, SignalType.TweetSeeFewer, thirtyDaysAgo),
)
}
}
private def getUserSignals(
response: Value,
signalType: SignalType,
earliestValidTimestamp: Time
): Seq[UserSignal] = {
val signals = response.signalResponse
.getOrElse(signalType, Seq.empty)
.view
.filter(_.timestamp > earliestValidTimestamp.inMillis)
.map(s => s.targetInternalId.collect { case LongInternalId(id) => (id, s.timestamp) })
.collect { case Some((id, engagedAt)) => UserSignal(id, engagedAt) }
.take(EngagementsToScore)
.force
signalTypeStats(signalType).add(signals.size)
signals
}
private def buildRequest(userId: Long) = {
val recipient = Some(SimpleRecipient(userId))
// Signals RSX always fetches
val requestSignals = ArrayBuffer(
SignalRequestFav,
SignalRequestRetweet,
SignalRequestFollow
)
// Signals under experimentation. We use individual deciders to disable them if necessary.
// If experiments are successful, they will become permanent.
if (decider.isAvailable(FetchSignalShareDeciderKey, recipient))
requestSignals.append(SignalRequestShare)
if (decider.isAvailable(FetchSignalReplyDeciderKey, recipient))
requestSignals.append(SignalRequestReply)
if (decider.isAvailable(FetchSignalOriginalTweetDeciderKey, recipient))
requestSignals.append(SignalRequestOriginalTweet)
if (decider.isAvailable(FetchSignalVideoPlaybackDeciderKey, recipient))
requestSignals.append(SignalRequestVideoPlayback)
if (decider.isAvailable(FetchSignalBlockDeciderKey, recipient))
requestSignals.append(SignalRequestBlock)
if (decider.isAvailable(FetchSignalMuteDeciderKey, recipient))
requestSignals.append(SignalRequestMute)
if (decider.isAvailable(FetchSignalReportDeciderKey, recipient))
requestSignals.append(SignalRequestReport)
if (decider.isAvailable(FetchSignalDontlikeDeciderKey, recipient))
requestSignals.append(SignalRequestDontlike)
if (decider.isAvailable(FetchSignalSeeFewerDeciderKey, recipient))
requestSignals.append(SignalRequestSeeFewer)
BatchSignalRequest(userId, requestSignals, Some(ClientIdentifier.RepresentationScorerHome))
}
}
object UserSignalServiceRecentEngagementsClient {
val FetchSignalShareDeciderKey = "representation_scorer_fetch_signal_share"
val FetchSignalReplyDeciderKey = "representation_scorer_fetch_signal_reply"
val FetchSignalOriginalTweetDeciderKey = "representation_scorer_fetch_signal_original_tweet"
val FetchSignalVideoPlaybackDeciderKey = "representation_scorer_fetch_signal_video_playback"
val FetchSignalBlockDeciderKey = "representation_scorer_fetch_signal_block"
val FetchSignalMuteDeciderKey = "representation_scorer_fetch_signal_mute"
val FetchSignalReportDeciderKey = "representation_scorer_fetch_signal_report"
val FetchSignalDontlikeDeciderKey = "representation_scorer_fetch_signal_dont_like"
val FetchSignalSeeFewerDeciderKey = "representation_scorer_fetch_signal_see_fewer"
val EngagementsToScore = 10
private val engagementsToScoreOpt: Option[Long] = Some(EngagementsToScore)
val SignalRequestFav: SignalRequest =
SignalRequest(engagementsToScoreOpt, SignalType.TweetFavorite)
val SignalRequestRetweet: SignalRequest = SignalRequest(engagementsToScoreOpt, SignalType.Retweet)
val SignalRequestFollow: SignalRequest =
SignalRequest(engagementsToScoreOpt, SignalType.AccountFollowWithDelay)
// New experimental signals
val SignalRequestShare: SignalRequest =
SignalRequest(engagementsToScoreOpt, SignalType.TweetShareV1)
val SignalRequestReply: SignalRequest = SignalRequest(engagementsToScoreOpt, SignalType.Reply)
val SignalRequestOriginalTweet: SignalRequest =
SignalRequest(engagementsToScoreOpt, SignalType.OriginalTweet)
val SignalRequestVideoPlayback: SignalRequest =
SignalRequest(engagementsToScoreOpt, SignalType.VideoView90dPlayback50V1)
// Negative signals
val SignalRequestBlock: SignalRequest =
SignalRequest(engagementsToScoreOpt, SignalType.AccountBlock)
val SignalRequestMute: SignalRequest =
SignalRequest(engagementsToScoreOpt, SignalType.AccountMute)
val SignalRequestReport: SignalRequest =
SignalRequest(engagementsToScoreOpt, SignalType.TweetReport)
val SignalRequestDontlike: SignalRequest =
SignalRequest(engagementsToScoreOpt, SignalType.TweetDontLike)
val SignalRequestSeeFewer: SignalRequest =
SignalRequest(engagementsToScoreOpt, SignalType.TweetSeeFewer)
}
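`buildRequest` above assembles a fixed core of signal requests plus experimental ones gated by per-user deciders. Reduced to a sketch in which the decider result is a plain boolean stand-in (the real client consults the `representation_scorer_fetch_signal_*` decider keys per recipient):

```scala
// Sketch of decider-gated request assembly: core signals are always fetched,
// experimental signals only when their decider passes for this user.
object RequestAssemblySketch {
  def buildSignals(
    core: Seq[String],
    experimental: Seq[(String, Boolean)] // (signal name, decider result)
  ): Seq[String] =
    core ++ experimental.collect { case (signal, true) => signal }
}
```

This keeps the always-on set stable while letting each experimental signal be switched off independently if it misbehaves.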

View File

@@ -0,0 +1,57 @@
package com.twitter.representationscorer.twistlyfeatures
import com.github.benmanes.caffeine.cache.Caffeine
import com.twitter.stitch.cache.EvictingCache
import com.google.inject.Provides
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.inject.TwitterModule
import com.twitter.representationscorer.common.RepresentationScorerDecider
import com.twitter.stitch.Stitch
import com.twitter.stitch.cache.ConcurrentMapCache
import com.twitter.stitch.cache.MemoizeQuery
import com.twitter.strato.client.Client
import com.twitter.strato.generated.client.recommendations.user_signal_service.SignalsClientColumn
import java.util.concurrent.ConcurrentMap
import java.util.concurrent.TimeUnit
import javax.inject.Singleton
object UserSignalServiceRecentEngagementsClientModule extends TwitterModule {
@Singleton
@Provides
def provide(
client: Client,
decider: RepresentationScorerDecider,
statsReceiver: StatsReceiver
): Long => Stitch[Engagements] = {
val stratoClient = new SignalsClientColumn(client)
/*
This cache holds a user's recent engagements for a short period of time, so that batched requests
for multiple (userId, tweetId) pairs don't each need to re-fetch them.
[1] Caffeine cache keys/values must be objects, so we cannot use the `Long` primitive directly.
The boxed java.lang.Long works as a key, since it is an object. In most situations the compiler
can see where auto(un)boxing can occur. However, here we seem to need some wrapper functions
with explicit types to allow the boxing to happen.
*/
val mapCache: ConcurrentMap[java.lang.Long, Stitch[Engagements]] =
Caffeine
.newBuilder()
.expireAfterWrite(5, TimeUnit.SECONDS)
.maximumSize(
1000 // We estimate 5M unique users in a 5m period - with 2k RSX instances, assume that one will see < 1k in a 5s period
)
.build[java.lang.Long, Stitch[Engagements]]
.asMap
statsReceiver.provideGauge("ussRecentEngagementsClient", "cache_size") { mapCache.size.toFloat }
val engagementsClient =
new UserSignalServiceRecentEngagementsClient(stratoClient, decider, statsReceiver)
val f = (l: java.lang.Long) => engagementsClient.get(l) // See note [1] above
val cachedCall = MemoizeQuery(f, EvictingCache.lazily(new ConcurrentMapCache(mapCache)))
(l: Long) => cachedCall(l) // see note [1] above
}
}
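The note-[1] workaround above exists because `ConcurrentMap` keys must be objects. A self-contained sketch of the same boxing-at-the-boundary memoization using a plain `ConcurrentHashMap` (the production code additionally layers Stitch's `EvictingCache` and a 5-second Caffeine TTL on top; names here are illustrative):

```scala
import java.util.concurrent.ConcurrentHashMap

// Sketch: memoize a Long-keyed computation by boxing the key into
// java.lang.Long at the call boundary, as in note [1].
object BoxedMemoSketch {
  private val cache = new ConcurrentHashMap[java.lang.Long, String]()
  var computeCount = 0 // exposed only to observe cache hits in this sketch

  private def compute(l: java.lang.Long): String = {
    computeCount += 1
    s"engagements-for-$l"
  }

  // The explicitly typed java.lang.Long lambda lets scalac box the Scala Long key.
  val cachedCall: Long => String =
    (l: Long) => cache.computeIfAbsent(l, (k: java.lang.Long) => compute(k))
}
```

Repeated calls with the same key hit the map instead of recomputing, which is the whole point of the short-TTL cache in the module above.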

View File

@@ -0,0 +1,20 @@
create_thrift_libraries(
base_name = "thrift",
sources = [
"com/twitter/representationscorer/service.thrift",
],
platform = "java8",
tags = [
"bazel-compatible",
],
dependency_roots = [
"src/thrift/com/twitter/simclusters_v2:simclusters_v2-thrift",
],
generate_languages = [
"java",
"scala",
"strato",
],
provides_java_name = "representationscorer-service-thrift-java",
provides_scala_name = "representationscorer-service-thrift-scala",
)

View File

@@ -0,0 +1,106 @@
namespace java com.twitter.representationscorer.thriftjava
#@namespace scala com.twitter.representationscorer.thriftscala
#@namespace strato com.twitter.representationscorer
include "com/twitter/simclusters_v2/identifier.thrift"
include "com/twitter/simclusters_v2/online_store.thrift"
include "com/twitter/simclusters_v2/score.thrift"
struct SimClustersRecentEngagementSimilarities {
// All scores computed using cosine similarity
// 1 - 1000 Positive Signals
1: optional double fav1dLast10Max // max score from last 10 faves in the last 1 day
2: optional double fav1dLast10Avg // avg score from last 10 faves in the last 1 day
3: optional double fav7dLast10Max // max score from last 10 faves in the last 7 days
4: optional double fav7dLast10Avg // avg score from last 10 faves in the last 7 days
5: optional double retweet1dLast10Max // max score from last 10 retweets in the last 1 day
6: optional double retweet1dLast10Avg // avg score from last 10 retweets in the last 1 day
7: optional double retweet7dLast10Max // max score from last 10 retweets in the last 7 days
8: optional double retweet7dLast10Avg // avg score from last 10 retweets in the last 7 days
9: optional double follow7dLast10Max // max score from the last 10 follows in the last 7 days
10: optional double follow7dLast10Avg // avg score from the last 10 follows in the last 7 days
11: optional double follow30dLast10Max // max score from the last 10 follows in the last 30 days
12: optional double follow30dLast10Avg // avg score from the last 10 follows in the last 30 days
13: optional double share1dLast10Max // max score from last 10 shares in the last 1 day
14: optional double share1dLast10Avg // avg score from last 10 shares in the last 1 day
15: optional double share7dLast10Max // max score from last 10 shares in the last 7 days
16: optional double share7dLast10Avg // avg score from last 10 shares in the last 7 days
17: optional double reply1dLast10Max // max score from last 10 replies in the last 1 day
18: optional double reply1dLast10Avg // avg score from last 10 replies in the last 1 day
19: optional double reply7dLast10Max // max score from last 10 replies in the last 7 days
20: optional double reply7dLast10Avg // avg score from last 10 replies in the last 7 days
21: optional double originalTweet1dLast10Max // max score from last 10 original tweets in the last 1 day
22: optional double originalTweet1dLast10Avg // avg score from last 10 original tweets in the last 1 day
23: optional double originalTweet7dLast10Max // max score from last 10 original tweets in the last 7 days
24: optional double originalTweet7dLast10Avg // avg score from last 10 original tweets in the last 7 days
25: optional double videoPlayback1dLast10Max // max score from last 10 video playback50 in the last 1 day
26: optional double videoPlayback1dLast10Avg // avg score from last 10 video playback50 in the last 1 day
27: optional double videoPlayback7dLast10Max // max score from last 10 video playback50 in the last 7 days
28: optional double videoPlayback7dLast10Avg // avg score from last 10 video playback50 in the last 7 days
// 1001 - 2000 Implicit Signals
// 2001 - 3000 Negative Signals
// Block Series
2001: optional double block1dLast10Avg
2002: optional double block1dLast10Max
2003: optional double block7dLast10Avg
2004: optional double block7dLast10Max
2005: optional double block30dLast10Avg
2006: optional double block30dLast10Max
// Mute Series
2101: optional double mute1dLast10Avg
2102: optional double mute1dLast10Max
2103: optional double mute7dLast10Avg
2104: optional double mute7dLast10Max
2105: optional double mute30dLast10Avg
2106: optional double mute30dLast10Max
// Report Series
2201: optional double report1dLast10Avg
2202: optional double report1dLast10Max
2203: optional double report7dLast10Avg
2204: optional double report7dLast10Max
2205: optional double report30dLast10Avg
2206: optional double report30dLast10Max
// Dontlike
2301: optional double dontlike1dLast10Avg
2302: optional double dontlike1dLast10Max
2303: optional double dontlike7dLast10Avg
2304: optional double dontlike7dLast10Max
2305: optional double dontlike30dLast10Avg
2306: optional double dontlike30dLast10Max
// SeeFewer
2401: optional double seeFewer1dLast10Avg
2402: optional double seeFewer1dLast10Max
2403: optional double seeFewer7dLast10Avg
2404: optional double seeFewer7dLast10Max
2405: optional double seeFewer30dLast10Avg
2406: optional double seeFewer30dLast10Max
}(persisted='true', hasPersonalData = 'true')
/*
* List score API
*/
struct ListScoreId {
1: required score.ScoringAlgorithm algorithm
2: required online_store.ModelVersion modelVersion
3: required identifier.EmbeddingType targetEmbeddingType
4: required identifier.InternalId targetId
5: required identifier.EmbeddingType candidateEmbeddingType
6: required list<identifier.InternalId> candidateIds
}(hasPersonalData = 'true')
struct ScoreResult {
// This api does not communicate why a score is missing. For example, it may be unavailable
// because the referenced entities do not exist (e.g. the embedding was not found) or because
// timeouts prevented us from calculating it.
1: optional double score
}
struct ListScoreResponse {
1: required list<ScoreResult> scores // Guaranteed to be the same number/order as requested
}
struct RecentEngagementSimilaritiesResponse {
1: required list<SimClustersRecentEngagementSimilarities> results // Guaranteed to be the same number/order as requested
}
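The max/avg fields above all follow one recipe: keep the N most recent engagements inside a time window and reduce their similarity scores. A hedged sketch of that aggregation (names and types are illustrative, not the production API):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class LastNAgg {
    public record Engagement(long timestampMs, double score) {}

    // Max and average over the n most recent engagements within windowMs of nowMs,
    // e.g. windowMs = 86_400_000 for the "1d" fields, n = 10 for "Last10".
    public static double[] maxAvgLastN(List<Engagement> events, long nowMs, long windowMs, int n) {
        List<Double> scores = events.stream()
            .filter(e -> nowMs - e.timestampMs() <= windowMs)              // inside the window
            .sorted(Comparator.comparingLong(Engagement::timestampMs).reversed()) // newest first
            .limit(n)                                                      // keep the last n
            .map(Engagement::score)
            .collect(Collectors.toList());
        double max = scores.stream().mapToDouble(d -> d).max().orElse(Double.NaN);
        double avg = scores.stream().mapToDouble(d -> d).average().orElse(Double.NaN);
        return new double[] {max, avg};
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        List<Engagement> evts = List.of(
            new Engagement(now - 1000, 0.2),
            new Engagement(now - 2000, 0.8),
            new Engagement(now - 90_000_000L, 0.9)); // older than a 1-day window, excluded
        double[] r = maxAvgLastN(evts, now, 86_400_000L, 10);
        System.out.println(r[0] + " " + r[1]); // max and avg of the in-window scores
    }
}
```

Missing fields in the Thrift struct correspond to the empty case here, which is why every field is `optional`.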


@@ -0,0 +1,68 @@
package com.twitter.timelines.prediction.common.aggregates
import com.twitter.ml.api.Feature
import com.twitter.ml.api.FeatureContext
import com.twitter.ml.api.ITransform
import com.twitter.ml.api.constant.SharedFeatures
import java.lang.{Double => JDouble}
import com.twitter.timelines.prediction.common.adapters.AdapterConsumer
import com.twitter.timelines.prediction.common.adapters.EngagementLabelFeaturesDataRecordUtils
import com.twitter.ml.api.DataRecord
import com.twitter.ml.api.RichDataRecord
import com.twitter.timelines.suggests.common.engagement.thriftscala.EngagementType
import com.twitter.timelines.suggests.common.engagement.thriftscala.Engagement
import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures
import com.twitter.timelines.prediction.features.common.CombinedFeatures
/**
 * Transforms BCE-event UUA data records that contain only continuous dwell time into data records
 * that contain the corresponding binary label features.
 * The input UUA data records have USER_ID, SOURCE_TWEET_ID, TIMESTAMP and
 * zero or one of the (TWEET_DETAIL_DWELL_TIME_MS, PROFILE_DWELL_TIME_MS, FULLSCREEN_VIDEO_DWELL_TIME_MS) features.
 * We use the different engagement TIME_MS features to differentiate engagement types,
 * and then re-use the function in EngagementTypeConverter to add the binary label to the data record.
 **/
object BCELabelTransformFromUUADataRecord extends ITransform {
val dwellTimeFeatureToEngagementMap = Map(
TimelinesSharedFeatures.TWEET_DETAIL_DWELL_TIME_MS -> EngagementType.TweetDetailDwell,
TimelinesSharedFeatures.PROFILE_DWELL_TIME_MS -> EngagementType.ProfileDwell,
TimelinesSharedFeatures.FULLSCREEN_VIDEO_DWELL_TIME_MS -> EngagementType.FullscreenVideoDwell
)
def dwellFeatureToEngagement(
rdr: RichDataRecord,
dwellTimeFeature: Feature[JDouble],
engagementType: EngagementType
): Option[Engagement] = {
if (rdr.hasFeature(dwellTimeFeature)) {
Some(
Engagement(
engagementType = engagementType,
timestampMs = rdr.getFeatureValue(SharedFeatures.TIMESTAMP),
weight = Some(rdr.getFeatureValue(dwellTimeFeature))
))
} else {
None
}
}
override def transformContext(featureContext: FeatureContext): FeatureContext = {
featureContext.addFeatures(
(CombinedFeatures.TweetDetailDwellEngagements ++ CombinedFeatures.ProfileDwellEngagements ++ CombinedFeatures.FullscreenVideoDwellEngagements).toSeq: _*)
}
override def transform(record: DataRecord): Unit = {
val rdr = new RichDataRecord(record)
val engagements = dwellTimeFeatureToEngagementMap
.map {
case (dwellTimeFeature, engagementType) =>
dwellFeatureToEngagement(rdr, dwellTimeFeature, engagementType)
}.flatten.toSeq
    // Re-use BCE (behavior client events) label conversion in EngagementTypeConverter to align with BCE label generation for offline training data
EngagementLabelFeaturesDataRecordUtils.setDwellTimeFeatures(
rdr,
Some(engagements),
AdapterConsumer.Combined)
}
}
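The core move in `dwellFeatureToEngagement` is "if the record carries this dwell-time feature, emit a typed engagement; otherwise emit nothing". A hedged Java sketch of that conversion, with a plain `Map` standing in for `RichDataRecord` (all names here are illustrative):

```java
import java.util.Map;
import java.util.Optional;

public class DwellToLabel {
    public enum EngagementType { TweetDetailDwell, ProfileDwell, FullscreenVideoDwell }

    public record Engagement(EngagementType type, long timestampMs, double weightMs) {}

    // Present only when the record actually has the dwell-time feature, so a
    // record with just PROFILE_DWELL_TIME_MS yields exactly one engagement.
    public static Optional<Engagement> dwellToEngagement(
            Map<String, Double> record, String dwellFeature, EngagementType type, long timestampMs) {
        return Optional.ofNullable(record.get(dwellFeature))
            .map(ms -> new Engagement(type, timestampMs, ms)); // dwell time becomes the label weight
    }

    public static void main(String[] args) {
        Map<String, Double> record = Map.of("profile_dwell_time_ms", 3200.0);
        Optional<Engagement> label =
            dwellToEngagement(record, "profile_dwell_time_ms", EngagementType.ProfileDwell, 1_700_000_000_000L);
        System.out.println(label.isPresent()); // the matching feature produces a label
    }
}
```

Mapping each dwell-time feature through this function and flattening the `Optional`s reproduces the `engagements` sequence built in `transform` above.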


@@ -0,0 +1,353 @@
create_datasets(
base_name = "original_author_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/original_author_aggregates/1556496000000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.OriginalAuthor",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "twitter_wide_user_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/twitter_wide_user_aggregates/1556496000000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.TwitterWideUser",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "twitter_wide_user_author_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/twitter_wide_user_author_aggregates/1556323200000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.TwitterWideUserAuthor",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_aggregates/1556150400000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.User",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_author_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_author_aggregates/1556064000000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserAuthor",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "aggregates_canary",
fallback_path = "gs://user.timelines.dp.gcp.twttr.net//canaries/processed/aggregates_v2/user_aggregates/1622851200000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.User",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_engager_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_engager_aggregates/1556496000000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserEngager",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_original_author_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_original_author_aggregates/1556496000000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserOriginalAuthor",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "author_topic_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/author_topic_aggregates/1589932800000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.AuthorTopic",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_topic_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_topic_aggregates/1590278400000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserTopic",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_inferred_topic_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_inferred_topic_aggregates/1599696000000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserInferredTopic",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_mention_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_mention_aggregates/1556582400000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserMention",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_request_dow_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_request_dow_aggregates/1556236800000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserRequestDow",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_request_hour_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_request_hour_aggregates/1556150400000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserRequestHour",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_list_aggregates",
fallback_path = "viewfs://hadoop-proc2-nn.atla.twitter.com/user/timelines/processed/aggregates_v2/user_list_aggregates/1590624000000",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserList",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
create_datasets(
base_name = "user_media_understanding_annotation_aggregates",
key_type = "com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey",
platform = "java8",
role = "timelines",
scala_schema = "com.twitter.timelines.prediction.common.aggregates.TimelinesAggregationKeyValInjections.UserMediaUnderstandingAnnotation",
segment_type = "snapshot",
tags = ["bazel-compatible"],
val_type = "(com.twitter.summingbird.batch.BatchID, com.twitter.ml.api.DataRecord)",
scala_dependencies = [
":injections",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
],
)
scala_library(
sources = [
"BCELabelTransformFromUUADataRecord.scala",
"FeatureSelectorConfig.scala",
"RecapUserFeatureAggregation.scala",
"RectweetUserFeatureAggregation.scala",
"TimelinesAggregationConfig.scala",
"TimelinesAggregationConfigDetails.scala",
"TimelinesAggregationConfigTrait.scala",
"TimelinesAggregationSources.scala",
],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
":aggregates_canary-scala",
":author_topic_aggregates-scala",
":original_author_aggregates-scala",
":twitter_wide_user_aggregates-scala",
":twitter_wide_user_author_aggregates-scala",
":user_aggregates-scala",
":user_author_aggregates-scala",
":user_engager_aggregates-scala",
":user_inferred_topic_aggregates-scala",
":user_list_aggregates-scala",
":user_media_understanding_annotation_aggregates-scala",
":user_mention_aggregates-scala",
":user_original_author_aggregates-scala",
":user_request_dow_aggregates-scala",
":user_request_hour_aggregates-scala",
":user_topic_aggregates-scala",
"src/java/com/twitter/ml/api:api-base",
"src/java/com/twitter/ml/api/constant",
"src/java/com/twitter/ml/api/matcher",
"src/scala/com/twitter/common/text/util",
"src/scala/com/twitter/dal/client/dataset",
"src/scala/com/twitter/frigate/data_pipeline/features_aggregated/core",
"src/scala/com/twitter/scalding_internal/multiformat/format",
"src/scala/com/twitter/timelines/prediction/common/adapters:engagement-converter",
"src/scala/com/twitter/timelines/prediction/features/client_log_event",
"src/scala/com/twitter/timelines/prediction/features/common",
"src/scala/com/twitter/timelines/prediction/features/engagement_features",
"src/scala/com/twitter/timelines/prediction/features/escherbird",
"src/scala/com/twitter/timelines/prediction/features/itl",
"src/scala/com/twitter/timelines/prediction/features/list_features",
"src/scala/com/twitter/timelines/prediction/features/p_home_latest",
"src/scala/com/twitter/timelines/prediction/features/real_graph",
"src/scala/com/twitter/timelines/prediction/features/recap",
"src/scala/com/twitter/timelines/prediction/features/request_context",
"src/scala/com/twitter/timelines/prediction/features/simcluster",
"src/scala/com/twitter/timelines/prediction/features/time_features",
"src/scala/com/twitter/timelines/prediction/transform/filter",
"src/thrift/com/twitter/timelines/suggests/common:engagement-scala",
"timelines/data_processing/ad_hoc/recap/data_record_preparation:recap_data_records_agg_minimal-java",
"util/util-core:scala",
],
)
scala_library(
name = "injections",
sources = [
"FeatureSelectorConfig.scala",
"RecapUserFeatureAggregation.scala",
"RectweetUserFeatureAggregation.scala",
"TimelinesAggregationConfigDetails.scala",
"TimelinesAggregationConfigTrait.scala",
"TimelinesAggregationKeyValInjections.scala",
"TimelinesAggregationSources.scala",
],
platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
"src/java/com/twitter/ml/api:api-base",
"src/java/com/twitter/ml/api/constant",
"src/java/com/twitter/ml/api/matcher",
"src/scala/com/twitter/common/text/util",
"src/scala/com/twitter/dal/client/dataset",
"src/scala/com/twitter/frigate/data_pipeline/features_aggregated/core",
"src/scala/com/twitter/scalding_internal/multiformat/format",
"src/scala/com/twitter/timelines/prediction/features/client_log_event",
"src/scala/com/twitter/timelines/prediction/features/common",
"src/scala/com/twitter/timelines/prediction/features/engagement_features",
"src/scala/com/twitter/timelines/prediction/features/escherbird",
"src/scala/com/twitter/timelines/prediction/features/itl",
"src/scala/com/twitter/timelines/prediction/features/list_features",
"src/scala/com/twitter/timelines/prediction/features/p_home_latest",
"src/scala/com/twitter/timelines/prediction/features/real_graph",
"src/scala/com/twitter/timelines/prediction/features/recap",
"src/scala/com/twitter/timelines/prediction/features/request_context",
"src/scala/com/twitter/timelines/prediction/features/semantic_core_features",
"src/scala/com/twitter/timelines/prediction/features/simcluster",
"src/scala/com/twitter/timelines/prediction/features/time_features",
"src/scala/com/twitter/timelines/prediction/transform/filter",
"timelines/data_processing/ad_hoc/recap/data_record_preparation:recap_data_records_agg_minimal-java",
"util/util-core:scala",
],
)


@@ -0,0 +1,121 @@
package com.twitter.timelines.prediction.common.aggregates
import com.twitter.ml.api.matcher.FeatureMatcher
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.TypedAggregateGroup
import scala.collection.JavaConverters._
object FeatureSelectorConfig {
val BasePairsToStore = Seq(
("twitter_wide_user_aggregate.pair", "*"),
("twitter_wide_user_author_aggregate.pair", "*"),
("user_aggregate_v5.continuous.pair", "*"),
("user_aggregate_v7.pair", "*"),
("user_author_aggregate_v2.pair", "recap.earlybird.*"),
("user_author_aggregate_v2.pair", "recap.searchfeature.*"),
("user_author_aggregate_v2.pair", "recap.tweetfeature.embeds*"),
("user_author_aggregate_v2.pair", "recap.tweetfeature.link_count*"),
("user_author_aggregate_v2.pair", "engagement_features.in_network.*"),
("user_author_aggregate_v2.pair", "recap.tweetfeature.is_reply.*"),
("user_author_aggregate_v2.pair", "recap.tweetfeature.is_retweet.*"),
("user_author_aggregate_v2.pair", "recap.tweetfeature.num_mentions.*"),
("user_author_aggregate_v5.pair", "*"),
("user_author_aggregate_tweetsource_v1.pair", "*"),
("user_engager_aggregate.pair", "*"),
("user_mention_aggregate.pair", "*"),
("user_request_context_aggregate.dow.pair", "*"),
("user_request_context_aggregate.hour.pair", "*"),
("user_aggregate_v6.pair", "*"),
("user_original_author_aggregate_v1.pair", "*"),
("user_original_author_aggregate_v2.pair", "*"),
("original_author_aggregate_v1.pair", "*"),
("original_author_aggregate_v2.pair", "*"),
("author_topic_aggregate.pair", "*"),
("user_list_aggregate.pair", "*"),
("user_topic_aggregate.pair", "*"),
("user_topic_aggregate_v2.pair", "*"),
("user_inferred_topic_aggregate.pair", "*"),
("user_inferred_topic_aggregate_v2.pair", "*"),
("user_media_annotation_aggregate.pair", "*"),
("user_author_good_click_aggregate.pair", "*"),
("user_engager_good_click_aggregate.pair", "*")
)
val PairsToStore = BasePairsToStore ++ Seq(
("user_aggregate_v2.pair", "*"),
("user_aggregate_v5.boolean.pair", "*"),
("user_aggregate_tweetsource_v1.pair", "*"),
)
val LabelsToStore = Seq(
"any_label",
"recap.engagement.is_favorited",
"recap.engagement.is_retweeted",
"recap.engagement.is_replied",
"recap.engagement.is_open_linked",
"recap.engagement.is_profile_clicked",
"recap.engagement.is_clicked",
"recap.engagement.is_photo_expanded",
"recap.engagement.is_video_playback_50",
"recap.engagement.is_video_quality_viewed",
"recap.engagement.is_replied_reply_impressed_by_author",
"recap.engagement.is_replied_reply_favorited_by_author",
"recap.engagement.is_replied_reply_replied_by_author",
"recap.engagement.is_report_tweet_clicked",
"recap.engagement.is_block_clicked",
"recap.engagement.is_mute_clicked",
"recap.engagement.is_dont_like",
"recap.engagement.is_good_clicked_convo_desc_favorited_or_replied",
"recap.engagement.is_good_clicked_convo_desc_v2",
"itl.engagement.is_favorited",
"itl.engagement.is_retweeted",
"itl.engagement.is_replied",
"itl.engagement.is_open_linked",
"itl.engagement.is_profile_clicked",
"itl.engagement.is_clicked",
"itl.engagement.is_photo_expanded",
"itl.engagement.is_video_playback_50"
)
val PairGlobsToStore = for {
(prefix, suffix) <- PairsToStore
label <- LabelsToStore
} yield FeatureMatcher.glob(prefix + "." + label + "." + suffix)
val BaseAggregateV2FeatureSelector = FeatureMatcher
.none()
.or(
FeatureMatcher.glob("meta.user_id"),
FeatureMatcher.glob("meta.author_id"),
FeatureMatcher.glob("entities.original_author_id"),
FeatureMatcher.glob("entities.topic_id"),
FeatureMatcher
.glob("entities.inferred_topic_ids" + TypedAggregateGroup.SparseFeatureSuffix),
FeatureMatcher.glob("timelines.meta.list_id"),
FeatureMatcher.glob("list.id"),
FeatureMatcher
.glob("engagement_features.user_ids.public" + TypedAggregateGroup.SparseFeatureSuffix),
FeatureMatcher
.glob("entities.users.mentioned_screen_names" + TypedAggregateGroup.SparseFeatureSuffix),
FeatureMatcher.glob("user_aggregate_v2.pair.recap.engagement.is_dont_like.*"),
FeatureMatcher.glob("user_author_aggregate_v2.pair.any_label.recap.tweetfeature.has_*"),
FeatureMatcher.glob("request_context.country_code"),
FeatureMatcher.glob("request_context.timestamp_gmt_dow"),
FeatureMatcher.glob("request_context.timestamp_gmt_hour"),
FeatureMatcher.glob(
"semantic_core.media_understanding.high_recall.non_sensitive.entity_ids" + TypedAggregateGroup.SparseFeatureSuffix)
)
val AggregatesV2ProdFeatureSelector = BaseAggregateV2FeatureSelector
.orList(PairGlobsToStore.asJava)
val ReducedPairGlobsToStore = (for {
(prefix, suffix) <- BasePairsToStore
label <- LabelsToStore
} yield FeatureMatcher.glob(prefix + "." + label + "." + suffix)) ++ Seq(
FeatureMatcher.glob("user_aggregate_v2.pair.any_label.*"),
FeatureMatcher.glob("user_aggregate_v2.pair.recap.engagement.is_favorited.*"),
FeatureMatcher.glob("user_aggregate_v2.pair.recap.engagement.is_photo_expanded.*"),
FeatureMatcher.glob("user_aggregate_v2.pair.recap.engagement.is_profile_clicked.*")
)
}
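`PairGlobsToStore` is a cross product: every `(prefix, suffix)` pair is combined with every label into a `prefix.label.suffix` glob. A minimal Java sketch of that construction (values here are illustrative subsets of the lists above):

```java
import java.util.ArrayList;
import java.util.List;

public class GlobCrossProduct {
    // For p pairs and l labels this yields p * l glob patterns, matching the
    // for-comprehension that builds PairGlobsToStore.
    public static List<String> buildGlobs(List<String[]> pairs, List<String> labels) {
        List<String> globs = new ArrayList<>();
        for (String[] pair : pairs) {
            for (String label : labels) {
                globs.add(pair[0] + "." + label + "." + pair[1]);
            }
        }
        return globs;
    }

    public static void main(String[] args) {
        List<String[]> pairs = List.of(
            new String[] {"user_aggregate_v2.pair", "*"},
            new String[] {"user_author_aggregate_v2.pair", "recap.earlybird.*"});
        List<String> labels = List.of("any_label", "recap.engagement.is_favorited");
        List<String> globs = buildGlobs(pairs, labels);
        System.out.println(globs.size());   // 2 pairs x 2 labels
        System.out.println(globs.get(0));   // first combined pattern
    }
}
```

`ReducedPairGlobsToStore` follows the same recipe over the smaller `BasePairsToStore` list, then appends a handful of hand-picked globs.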


@@ -0,0 +1,6 @@
## Timelines Aggregation Jobs
This directory contains the specific definition of aggregate jobs that generate features used by the Heavy Ranker.
The primary files of interest are [`TimelinesAggregationConfigDetails.scala`](TimelinesAggregationConfigDetails.scala), which contains the definitions for the batch aggregate jobs, and [`real_time/TimelinesOnlineAggregationConfigBase.scala`](real_time/TimelinesOnlineAggregationConfigBase.scala), which contains the definitions for the real-time aggregate jobs.
The aggregation framework that these jobs are based on is [here](../../../../../../../../timelines/data_processing/ml_util/aggregation_framework).


@@ -0,0 +1,415 @@
package com.twitter.timelines.prediction.common.aggregates
import com.twitter.ml.api.Feature
import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures
import com.twitter.timelines.prediction.features.engagement_features.EngagementDataRecordFeatures
import com.twitter.timelines.prediction.features.real_graph.RealGraphDataRecordFeatures
import com.twitter.timelines.prediction.features.recap.RecapFeatures
import com.twitter.timelines.prediction.features.time_features.TimeDataRecordFeatures
object RecapUserFeatureAggregation {
val RecapFeaturesForAggregation: Set[Feature[_]] =
Set(
RecapFeatures.HAS_IMAGE,
RecapFeatures.HAS_VIDEO,
RecapFeatures.FROM_MUTUAL_FOLLOW,
RecapFeatures.HAS_CARD,
RecapFeatures.HAS_NEWS,
RecapFeatures.REPLY_COUNT,
RecapFeatures.FAV_COUNT,
RecapFeatures.RETWEET_COUNT,
RecapFeatures.BLENDER_SCORE,
RecapFeatures.CONVERSATIONAL_COUNT,
RecapFeatures.IS_BUSINESS_SCORE,
RecapFeatures.CONTAINS_MEDIA,
RecapFeatures.RETWEET_SEARCHER,
RecapFeatures.REPLY_SEARCHER,
RecapFeatures.MENTION_SEARCHER,
RecapFeatures.REPLY_OTHER,
RecapFeatures.RETWEET_OTHER,
RecapFeatures.MATCH_UI_LANG,
RecapFeatures.MATCH_SEARCHER_MAIN_LANG,
RecapFeatures.MATCH_SEARCHER_LANGS,
RecapFeatures.TWEET_COUNT_FROM_USER_IN_SNAPSHOT,
RecapFeatures.TEXT_SCORE,
RealGraphDataRecordFeatures.NUM_RETWEETS_EWMA,
RealGraphDataRecordFeatures.NUM_RETWEETS_NON_ZERO_DAYS,
RealGraphDataRecordFeatures.NUM_RETWEETS_ELAPSED_DAYS,
RealGraphDataRecordFeatures.NUM_RETWEETS_DAYS_SINCE_LAST,
RealGraphDataRecordFeatures.NUM_FAVORITES_EWMA,
RealGraphDataRecordFeatures.NUM_FAVORITES_NON_ZERO_DAYS,
RealGraphDataRecordFeatures.NUM_FAVORITES_ELAPSED_DAYS,
RealGraphDataRecordFeatures.NUM_FAVORITES_DAYS_SINCE_LAST,
RealGraphDataRecordFeatures.NUM_MENTIONS_EWMA,
RealGraphDataRecordFeatures.NUM_MENTIONS_NON_ZERO_DAYS,
RealGraphDataRecordFeatures.NUM_MENTIONS_ELAPSED_DAYS,
RealGraphDataRecordFeatures.NUM_MENTIONS_DAYS_SINCE_LAST,
RealGraphDataRecordFeatures.NUM_TWEET_CLICKS_EWMA,
RealGraphDataRecordFeatures.NUM_TWEET_CLICKS_NON_ZERO_DAYS,
RealGraphDataRecordFeatures.NUM_TWEET_CLICKS_ELAPSED_DAYS,
RealGraphDataRecordFeatures.NUM_TWEET_CLICKS_DAYS_SINCE_LAST,
RealGraphDataRecordFeatures.NUM_PROFILE_VIEWS_EWMA,
RealGraphDataRecordFeatures.NUM_PROFILE_VIEWS_NON_ZERO_DAYS,
RealGraphDataRecordFeatures.NUM_PROFILE_VIEWS_ELAPSED_DAYS,
RealGraphDataRecordFeatures.NUM_PROFILE_VIEWS_DAYS_SINCE_LAST,
RealGraphDataRecordFeatures.TOTAL_DWELL_TIME_EWMA,
RealGraphDataRecordFeatures.TOTAL_DWELL_TIME_NON_ZERO_DAYS,
RealGraphDataRecordFeatures.TOTAL_DWELL_TIME_ELAPSED_DAYS,
RealGraphDataRecordFeatures.TOTAL_DWELL_TIME_DAYS_SINCE_LAST,
RealGraphDataRecordFeatures.NUM_INSPECTED_TWEETS_EWMA,
RealGraphDataRecordFeatures.NUM_INSPECTED_TWEETS_NON_ZERO_DAYS,
RealGraphDataRecordFeatures.NUM_INSPECTED_TWEETS_ELAPSED_DAYS,
RealGraphDataRecordFeatures.NUM_INSPECTED_TWEETS_DAYS_SINCE_LAST
)
val RecapLabelsForAggregation: Set[Feature.Binary] =
Set(
RecapFeatures.IS_FAVORITED,
RecapFeatures.IS_RETWEETED,
RecapFeatures.IS_CLICKED,
RecapFeatures.IS_PROFILE_CLICKED,
RecapFeatures.IS_OPEN_LINKED
)
val DwellDuration: Set[Feature[_]] =
Set(
TimelinesSharedFeatures.DWELL_TIME_MS
)
val UserFeaturesV2: Set[Feature[_]] = RecapFeaturesForAggregation ++ Set(
RecapFeatures.HAS_VINE,
RecapFeatures.HAS_PERISCOPE,
RecapFeatures.HAS_PRO_VIDEO,
RecapFeatures.HAS_VISIBLE_LINK,
RecapFeatures.BIDIRECTIONAL_FAV_COUNT,
RecapFeatures.UNIDIRECTIONAL_FAV_COUNT,
RecapFeatures.BIDIRECTIONAL_REPLY_COUNT,
RecapFeatures.UNIDIRECTIONAL_REPLY_COUNT,
RecapFeatures.BIDIRECTIONAL_RETWEET_COUNT,
RecapFeatures.UNIDIRECTIONAL_RETWEET_COUNT,
RecapFeatures.EMBEDS_URL_COUNT,
RecapFeatures.EMBEDS_IMPRESSION_COUNT,
RecapFeatures.VIDEO_VIEW_COUNT,
RecapFeatures.IS_RETWEET,
RecapFeatures.IS_REPLY,
RecapFeatures.IS_EXTENDED_REPLY,
RecapFeatures.HAS_LINK,
RecapFeatures.HAS_TREND,
RecapFeatures.LINK_LANGUAGE,
RecapFeatures.NUM_HASHTAGS,
RecapFeatures.NUM_MENTIONS,
RecapFeatures.IS_SENSITIVE,
RecapFeatures.HAS_MULTIPLE_MEDIA,
RecapFeatures.USER_REP,
RecapFeatures.FAV_COUNT_V2,
RecapFeatures.RETWEET_COUNT_V2,
RecapFeatures.REPLY_COUNT_V2,
RecapFeatures.LINK_COUNT,
EngagementDataRecordFeatures.InNetworkFavoritesCount,
EngagementDataRecordFeatures.InNetworkRetweetsCount,
EngagementDataRecordFeatures.InNetworkRepliesCount
)
val UserAuthorFeaturesV2: Set[Feature[_]] = Set(
RecapFeatures.HAS_IMAGE,
RecapFeatures.HAS_VINE,
RecapFeatures.HAS_PERISCOPE,
RecapFeatures.HAS_PRO_VIDEO,
RecapFeatures.HAS_VIDEO,
RecapFeatures.HAS_CARD,
RecapFeatures.HAS_NEWS,
RecapFeatures.HAS_VISIBLE_LINK,
RecapFeatures.REPLY_COUNT,
RecapFeatures.FAV_COUNT,
RecapFeatures.RETWEET_COUNT,
RecapFeatures.BLENDER_SCORE,
RecapFeatures.CONVERSATIONAL_COUNT,
RecapFeatures.IS_BUSINESS_SCORE,
RecapFeatures.CONTAINS_MEDIA,
RecapFeatures.RETWEET_SEARCHER,
RecapFeatures.REPLY_SEARCHER,
RecapFeatures.MENTION_SEARCHER,
RecapFeatures.REPLY_OTHER,
RecapFeatures.RETWEET_OTHER,
RecapFeatures.MATCH_UI_LANG,
RecapFeatures.MATCH_SEARCHER_MAIN_LANG,
RecapFeatures.MATCH_SEARCHER_LANGS,
RecapFeatures.TWEET_COUNT_FROM_USER_IN_SNAPSHOT,
RecapFeatures.TEXT_SCORE,
RecapFeatures.BIDIRECTIONAL_FAV_COUNT,
RecapFeatures.UNIDIRECTIONAL_FAV_COUNT,
RecapFeatures.BIDIRECTIONAL_REPLY_COUNT,
RecapFeatures.UNIDIRECTIONAL_REPLY_COUNT,
RecapFeatures.BIDIRECTIONAL_RETWEET_COUNT,
RecapFeatures.UNIDIRECTIONAL_RETWEET_COUNT,
RecapFeatures.EMBEDS_URL_COUNT,
RecapFeatures.EMBEDS_IMPRESSION_COUNT,
RecapFeatures.VIDEO_VIEW_COUNT,
RecapFeatures.IS_RETWEET,
RecapFeatures.IS_REPLY,
RecapFeatures.HAS_LINK,
RecapFeatures.HAS_TREND,
RecapFeatures.LINK_LANGUAGE,
RecapFeatures.NUM_HASHTAGS,
RecapFeatures.NUM_MENTIONS,
RecapFeatures.IS_SENSITIVE,
RecapFeatures.HAS_MULTIPLE_MEDIA,
RecapFeatures.FAV_COUNT_V2,
RecapFeatures.RETWEET_COUNT_V2,
RecapFeatures.REPLY_COUNT_V2,
RecapFeatures.LINK_COUNT,
EngagementDataRecordFeatures.InNetworkFavoritesCount,
EngagementDataRecordFeatures.InNetworkRetweetsCount,
EngagementDataRecordFeatures.InNetworkRepliesCount
)
val UserAuthorFeaturesV2Count: Set[Feature[_]] = Set(
RecapFeatures.HAS_IMAGE,
RecapFeatures.HAS_VINE,
RecapFeatures.HAS_PERISCOPE,
RecapFeatures.HAS_PRO_VIDEO,
RecapFeatures.HAS_VIDEO,
RecapFeatures.HAS_CARD,
RecapFeatures.HAS_NEWS,
RecapFeatures.HAS_VISIBLE_LINK,
RecapFeatures.FAV_COUNT,
RecapFeatures.CONTAINS_MEDIA,
RecapFeatures.RETWEET_SEARCHER,
RecapFeatures.REPLY_SEARCHER,
RecapFeatures.MENTION_SEARCHER,
RecapFeatures.REPLY_OTHER,
RecapFeatures.RETWEET_OTHER,
RecapFeatures.MATCH_UI_LANG,
RecapFeatures.MATCH_SEARCHER_MAIN_LANG,
RecapFeatures.MATCH_SEARCHER_LANGS,
RecapFeatures.IS_RETWEET,
RecapFeatures.IS_REPLY,
RecapFeatures.HAS_LINK,
RecapFeatures.HAS_TREND,
RecapFeatures.IS_SENSITIVE,
RecapFeatures.HAS_MULTIPLE_MEDIA,
EngagementDataRecordFeatures.InNetworkFavoritesCount
)
val UserTopicFeaturesV2Count: Set[Feature[_]] = Set(
RecapFeatures.HAS_IMAGE,
RecapFeatures.HAS_VIDEO,
RecapFeatures.HAS_CARD,
RecapFeatures.HAS_NEWS,
RecapFeatures.FAV_COUNT,
RecapFeatures.CONTAINS_MEDIA,
RecapFeatures.RETWEET_SEARCHER,
RecapFeatures.REPLY_SEARCHER,
RecapFeatures.MENTION_SEARCHER,
RecapFeatures.REPLY_OTHER,
RecapFeatures.RETWEET_OTHER,
RecapFeatures.MATCH_UI_LANG,
RecapFeatures.MATCH_SEARCHER_MAIN_LANG,
RecapFeatures.MATCH_SEARCHER_LANGS,
RecapFeatures.IS_RETWEET,
RecapFeatures.IS_REPLY,
RecapFeatures.HAS_LINK,
RecapFeatures.HAS_TREND,
RecapFeatures.IS_SENSITIVE,
EngagementDataRecordFeatures.InNetworkFavoritesCount,
EngagementDataRecordFeatures.InNetworkRetweetsCount,
TimelinesSharedFeatures.NUM_CAPS,
TimelinesSharedFeatures.ASPECT_RATIO_DEN,
TimelinesSharedFeatures.NUM_NEWLINES,
TimelinesSharedFeatures.IS_360,
TimelinesSharedFeatures.IS_MANAGED,
TimelinesSharedFeatures.IS_MONETIZABLE,
TimelinesSharedFeatures.HAS_SELECTED_PREVIEW_IMAGE,
TimelinesSharedFeatures.HAS_TITLE,
TimelinesSharedFeatures.HAS_DESCRIPTION,
TimelinesSharedFeatures.HAS_VISIT_SITE_CALL_TO_ACTION,
TimelinesSharedFeatures.HAS_WATCH_NOW_CALL_TO_ACTION
)
val UserFeaturesV5Continuous: Set[Feature[_]] = Set(
TimelinesSharedFeatures.QUOTE_COUNT,
TimelinesSharedFeatures.VISIBLE_TOKEN_RATIO,
TimelinesSharedFeatures.WEIGHTED_FAV_COUNT,
TimelinesSharedFeatures.WEIGHTED_RETWEET_COUNT,
TimelinesSharedFeatures.WEIGHTED_REPLY_COUNT,
TimelinesSharedFeatures.WEIGHTED_QUOTE_COUNT,
TimelinesSharedFeatures.EMBEDS_IMPRESSION_COUNT_V2,
TimelinesSharedFeatures.EMBEDS_URL_COUNT_V2,
TimelinesSharedFeatures.DECAYED_FAVORITE_COUNT,
TimelinesSharedFeatures.DECAYED_RETWEET_COUNT,
TimelinesSharedFeatures.DECAYED_REPLY_COUNT,
TimelinesSharedFeatures.DECAYED_QUOTE_COUNT,
TimelinesSharedFeatures.FAKE_FAVORITE_COUNT,
TimelinesSharedFeatures.FAKE_RETWEET_COUNT,
TimelinesSharedFeatures.FAKE_REPLY_COUNT,
TimelinesSharedFeatures.FAKE_QUOTE_COUNT,
TimeDataRecordFeatures.LAST_FAVORITE_SINCE_CREATION_HRS,
TimeDataRecordFeatures.LAST_RETWEET_SINCE_CREATION_HRS,
TimeDataRecordFeatures.LAST_REPLY_SINCE_CREATION_HRS,
TimeDataRecordFeatures.LAST_QUOTE_SINCE_CREATION_HRS,
TimeDataRecordFeatures.TIME_SINCE_LAST_FAVORITE_HRS,
TimeDataRecordFeatures.TIME_SINCE_LAST_RETWEET_HRS,
TimeDataRecordFeatures.TIME_SINCE_LAST_REPLY_HRS,
TimeDataRecordFeatures.TIME_SINCE_LAST_QUOTE_HRS
)
val UserFeaturesV5Boolean: Set[Feature[_]] = Set(
TimelinesSharedFeatures.LABEL_ABUSIVE_FLAG,
TimelinesSharedFeatures.LABEL_ABUSIVE_HI_RCL_FLAG,
TimelinesSharedFeatures.LABEL_DUP_CONTENT_FLAG,
TimelinesSharedFeatures.LABEL_NSFW_HI_PRC_FLAG,
TimelinesSharedFeatures.LABEL_NSFW_HI_RCL_FLAG,
TimelinesSharedFeatures.LABEL_SPAM_FLAG,
TimelinesSharedFeatures.LABEL_SPAM_HI_RCL_FLAG,
TimelinesSharedFeatures.PERISCOPE_EXISTS,
TimelinesSharedFeatures.PERISCOPE_IS_LIVE,
TimelinesSharedFeatures.PERISCOPE_HAS_BEEN_FEATURED,
TimelinesSharedFeatures.PERISCOPE_IS_CURRENTLY_FEATURED,
TimelinesSharedFeatures.PERISCOPE_IS_FROM_QUALITY_SOURCE,
TimelinesSharedFeatures.HAS_QUOTE
)
val UserAuthorFeaturesV5: Set[Feature[_]] = Set(
TimelinesSharedFeatures.HAS_QUOTE,
TimelinesSharedFeatures.LABEL_ABUSIVE_FLAG,
TimelinesSharedFeatures.LABEL_ABUSIVE_HI_RCL_FLAG,
TimelinesSharedFeatures.LABEL_DUP_CONTENT_FLAG,
TimelinesSharedFeatures.LABEL_NSFW_HI_PRC_FLAG,
TimelinesSharedFeatures.LABEL_NSFW_HI_RCL_FLAG,
TimelinesSharedFeatures.LABEL_SPAM_FLAG,
TimelinesSharedFeatures.LABEL_SPAM_HI_RCL_FLAG
)
val UserTweetSourceFeaturesV1Continuous: Set[Feature[_]] = Set(
TimelinesSharedFeatures.NUM_CAPS,
TimelinesSharedFeatures.NUM_WHITESPACES,
TimelinesSharedFeatures.TWEET_LENGTH,
TimelinesSharedFeatures.ASPECT_RATIO_DEN,
TimelinesSharedFeatures.ASPECT_RATIO_NUM,
TimelinesSharedFeatures.BIT_RATE,
TimelinesSharedFeatures.HEIGHT_1,
TimelinesSharedFeatures.HEIGHT_2,
TimelinesSharedFeatures.HEIGHT_3,
TimelinesSharedFeatures.HEIGHT_4,
TimelinesSharedFeatures.VIDEO_DURATION,
TimelinesSharedFeatures.WIDTH_1,
TimelinesSharedFeatures.WIDTH_2,
TimelinesSharedFeatures.WIDTH_3,
TimelinesSharedFeatures.WIDTH_4,
TimelinesSharedFeatures.NUM_MEDIA_TAGS
)
val UserTweetSourceFeaturesV1Boolean: Set[Feature[_]] = Set(
TimelinesSharedFeatures.HAS_QUESTION,
TimelinesSharedFeatures.RESIZE_METHOD_1,
TimelinesSharedFeatures.RESIZE_METHOD_2,
TimelinesSharedFeatures.RESIZE_METHOD_3,
TimelinesSharedFeatures.RESIZE_METHOD_4
)
val UserTweetSourceFeaturesV2Continuous: Set[Feature[_]] = Set(
TimelinesSharedFeatures.NUM_EMOJIS,
TimelinesSharedFeatures.NUM_EMOTICONS,
TimelinesSharedFeatures.NUM_NEWLINES,
TimelinesSharedFeatures.NUM_STICKERS,
TimelinesSharedFeatures.NUM_FACES,
TimelinesSharedFeatures.NUM_COLOR_PALLETTE_ITEMS,
TimelinesSharedFeatures.VIEW_COUNT,
TimelinesSharedFeatures.TWEET_LENGTH_TYPE
)
val UserTweetSourceFeaturesV2Boolean: Set[Feature[_]] = Set(
TimelinesSharedFeatures.IS_360,
TimelinesSharedFeatures.IS_MANAGED,
TimelinesSharedFeatures.IS_MONETIZABLE,
TimelinesSharedFeatures.IS_EMBEDDABLE,
TimelinesSharedFeatures.HAS_SELECTED_PREVIEW_IMAGE,
TimelinesSharedFeatures.HAS_TITLE,
TimelinesSharedFeatures.HAS_DESCRIPTION,
TimelinesSharedFeatures.HAS_VISIT_SITE_CALL_TO_ACTION,
TimelinesSharedFeatures.HAS_WATCH_NOW_CALL_TO_ACTION
)
val UserAuthorTweetSourceFeaturesV1: Set[Feature[_]] = Set(
TimelinesSharedFeatures.HAS_QUESTION,
TimelinesSharedFeatures.TWEET_LENGTH,
TimelinesSharedFeatures.VIDEO_DURATION,
TimelinesSharedFeatures.NUM_MEDIA_TAGS
)
val UserAuthorTweetSourceFeaturesV2: Set[Feature[_]] = Set(
TimelinesSharedFeatures.NUM_CAPS,
TimelinesSharedFeatures.NUM_WHITESPACES,
TimelinesSharedFeatures.ASPECT_RATIO_DEN,
TimelinesSharedFeatures.ASPECT_RATIO_NUM,
TimelinesSharedFeatures.BIT_RATE,
TimelinesSharedFeatures.TWEET_LENGTH_TYPE,
TimelinesSharedFeatures.NUM_EMOJIS,
TimelinesSharedFeatures.NUM_EMOTICONS,
TimelinesSharedFeatures.NUM_NEWLINES,
TimelinesSharedFeatures.NUM_STICKERS,
TimelinesSharedFeatures.NUM_FACES,
TimelinesSharedFeatures.IS_360,
TimelinesSharedFeatures.IS_MANAGED,
TimelinesSharedFeatures.IS_MONETIZABLE,
TimelinesSharedFeatures.HAS_SELECTED_PREVIEW_IMAGE,
TimelinesSharedFeatures.HAS_TITLE,
TimelinesSharedFeatures.HAS_DESCRIPTION,
TimelinesSharedFeatures.HAS_VISIT_SITE_CALL_TO_ACTION,
TimelinesSharedFeatures.HAS_WATCH_NOW_CALL_TO_ACTION
)
val UserAuthorTweetSourceFeaturesV2Count: Set[Feature[_]] = Set(
TimelinesSharedFeatures.NUM_CAPS,
TimelinesSharedFeatures.ASPECT_RATIO_DEN,
TimelinesSharedFeatures.NUM_NEWLINES,
TimelinesSharedFeatures.IS_360,
TimelinesSharedFeatures.IS_MANAGED,
TimelinesSharedFeatures.IS_MONETIZABLE,
TimelinesSharedFeatures.HAS_SELECTED_PREVIEW_IMAGE,
TimelinesSharedFeatures.HAS_TITLE,
TimelinesSharedFeatures.HAS_DESCRIPTION,
TimelinesSharedFeatures.HAS_VISIT_SITE_CALL_TO_ACTION,
TimelinesSharedFeatures.HAS_WATCH_NOW_CALL_TO_ACTION
)
val LabelsV2: Set[Feature.Binary] = RecapLabelsForAggregation ++ Set(
RecapFeatures.IS_REPLIED,
RecapFeatures.IS_PHOTO_EXPANDED,
RecapFeatures.IS_VIDEO_PLAYBACK_50
)
val TwitterWideFeatures: Set[Feature[_]] = Set(
RecapFeatures.IS_REPLY,
TimelinesSharedFeatures.HAS_QUOTE,
RecapFeatures.HAS_MENTION,
RecapFeatures.HAS_HASHTAG,
RecapFeatures.HAS_LINK,
RecapFeatures.HAS_CARD,
RecapFeatures.CONTAINS_MEDIA
)
val TwitterWideLabels: Set[Feature.Binary] = Set(
RecapFeatures.IS_FAVORITED,
RecapFeatures.IS_RETWEETED,
RecapFeatures.IS_REPLIED
)
val ReciprocalLabels: Set[Feature.Binary] = Set(
RecapFeatures.IS_REPLIED_REPLY_IMPRESSED_BY_AUTHOR,
RecapFeatures.IS_REPLIED_REPLY_REPLIED_BY_AUTHOR,
RecapFeatures.IS_REPLIED_REPLY_FAVORITED_BY_AUTHOR
)
val NegativeEngagementLabels: Set[Feature.Binary] = Set(
RecapFeatures.IS_REPORT_TWEET_CLICKED,
RecapFeatures.IS_BLOCK_CLICKED,
RecapFeatures.IS_MUTE_CLICKED,
RecapFeatures.IS_DONT_LIKE
)
val GoodClickLabels: Set[Feature.Binary] = Set(
RecapFeatures.IS_GOOD_CLICKED_CONVO_DESC_V1,
RecapFeatures.IS_GOOD_CLICKED_CONVO_DESC_V2,
)
}

@@ -0,0 +1,52 @@
package com.twitter.timelines.prediction.common.aggregates
import com.twitter.ml.api.Feature
import com.twitter.timelines.prediction.features.engagement_features.EngagementDataRecordFeatures
import com.twitter.timelines.prediction.features.itl.ITLFeatures
object RectweetUserFeatureAggregation {
val RectweetLabelsForAggregation: Set[Feature.Binary] =
Set(
ITLFeatures.IS_FAVORITED,
ITLFeatures.IS_RETWEETED,
ITLFeatures.IS_REPLIED,
ITLFeatures.IS_CLICKED,
ITLFeatures.IS_PROFILE_CLICKED,
ITLFeatures.IS_OPEN_LINKED,
ITLFeatures.IS_PHOTO_EXPANDED,
ITLFeatures.IS_VIDEO_PLAYBACK_50
)
val TweetFeatures: Set[Feature[_]] = Set(
ITLFeatures.HAS_IMAGE,
ITLFeatures.HAS_CARD,
ITLFeatures.HAS_NEWS,
ITLFeatures.REPLY_COUNT,
ITLFeatures.FAV_COUNT,
ITLFeatures.RETWEET_COUNT,
ITLFeatures.MATCHES_UI_LANG,
ITLFeatures.MATCHES_SEARCHER_MAIN_LANG,
ITLFeatures.MATCHES_SEARCHER_LANGS,
ITLFeatures.TEXT_SCORE,
ITLFeatures.LINK_LANGUAGE,
ITLFeatures.NUM_HASHTAGS,
ITLFeatures.NUM_MENTIONS,
ITLFeatures.IS_SENSITIVE,
ITLFeatures.HAS_VIDEO,
ITLFeatures.HAS_LINK,
ITLFeatures.HAS_VISIBLE_LINK,
EngagementDataRecordFeatures.InNetworkFavoritesCount
// nice to have, but currently not hydrated in the RecommendedTweet payload
//EngagementDataRecordFeatures.InNetworkRetweetsCount,
//EngagementDataRecordFeatures.InNetworkRepliesCount
)
val ReciprocalLabels: Set[Feature.Binary] = Set(
ITLFeatures.IS_REPLIED_REPLY_IMPRESSED_BY_AUTHOR,
ITLFeatures.IS_REPLIED_REPLY_REPLIED_BY_AUTHOR,
ITLFeatures.IS_REPLIED_REPLY_FAVORITED_BY_AUTHOR,
ITLFeatures.IS_REPLIED_REPLY_RETWEETED_BY_AUTHOR,
ITLFeatures.IS_REPLIED_REPLY_QUOTED_BY_AUTHOR
)
}

@@ -0,0 +1,80 @@
package com.twitter.timelines.prediction.common.aggregates
import com.twitter.dal.client.dataset.KeyValDALDataset
import com.twitter.ml.api.DataRecord
import com.twitter.ml.api.FeatureContext
import com.twitter.scalding_internal.multiformat.format.keyval
import com.twitter.summingbird.batch.BatchID
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.conversion.CombineCountsPolicy
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregateStore
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationKey
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.OfflineAggregateDataRecordStore
import scala.collection.JavaConverters._
object TimelinesAggregationConfig extends TimelinesAggregationConfigTrait {
override def outputHdfsPath: String = "/user/timelines/processed/aggregates_v2"
def storeToDatasetMap: Map[String, KeyValDALDataset[
keyval.KeyVal[AggregationKey, (BatchID, DataRecord)]
]] = Map(
AuthorTopicAggregateStore -> AuthorTopicAggregatesScalaDataset,
UserTopicAggregateStore -> UserTopicAggregatesScalaDataset,
UserInferredTopicAggregateStore -> UserInferredTopicAggregatesScalaDataset,
UserAggregateStore -> UserAggregatesScalaDataset,
UserAuthorAggregateStore -> UserAuthorAggregatesScalaDataset,
UserOriginalAuthorAggregateStore -> UserOriginalAuthorAggregatesScalaDataset,
OriginalAuthorAggregateStore -> OriginalAuthorAggregatesScalaDataset,
UserEngagerAggregateStore -> UserEngagerAggregatesScalaDataset,
UserMentionAggregateStore -> UserMentionAggregatesScalaDataset,
TwitterWideUserAggregateStore -> TwitterWideUserAggregatesScalaDataset,
TwitterWideUserAuthorAggregateStore -> TwitterWideUserAuthorAggregatesScalaDataset,
UserRequestHourAggregateStore -> UserRequestHourAggregatesScalaDataset,
UserRequestDowAggregateStore -> UserRequestDowAggregatesScalaDataset,
UserListAggregateStore -> UserListAggregatesScalaDataset,
UserMediaUnderstandingAnnotationAggregateStore -> UserMediaUnderstandingAnnotationAggregatesScalaDataset
)
override def mkPhysicalStore(store: AggregateStore): AggregateStore = store match {
case s: OfflineAggregateDataRecordStore =>
s.toOfflineAggregateDataRecordStoreWithDAL(storeToDatasetMap(s.name))
case _ => throw new IllegalArgumentException("Unsupported logical dataset type.")
}
object CombineCountPolicies {
val EngagerCountsPolicy: CombineCountsPolicy = mkCountsPolicy("user_engager_aggregate")
val EngagerGoodClickCountsPolicy: CombineCountsPolicy = mkCountsPolicy(
"user_engager_good_click_aggregate")
val RectweetEngagerCountsPolicy: CombineCountsPolicy =
mkCountsPolicy("rectweet_user_engager_aggregate")
val MentionCountsPolicy: CombineCountsPolicy = mkCountsPolicy("user_mention_aggregate")
val RectweetSimclustersTweetCountsPolicy: CombineCountsPolicy =
mkCountsPolicy("rectweet_user_simcluster_tweet_aggregate")
val UserInferredTopicCountsPolicy: CombineCountsPolicy =
mkCountsPolicy("user_inferred_topic_aggregate")
val UserInferredTopicV2CountsPolicy: CombineCountsPolicy =
mkCountsPolicy("user_inferred_topic_aggregate_v2")
val UserMediaUnderstandingAnnotationCountsPolicy: CombineCountsPolicy =
mkCountsPolicy("user_media_annotation_aggregate")
private[this] def mkCountsPolicy(prefix: String): CombineCountsPolicy = {
val features = TimelinesAggregationConfig.aggregatesToCompute
.filter(_.aggregatePrefix == prefix)
.flatMap(_.allOutputFeatures)
CombineCountsPolicy(
topK = 2,
aggregateContextToPrecompute = new FeatureContext(features.asJava),
hardLimit = Some(20)
)
}
}
}
object TimelinesAggregationCanaryConfig extends TimelinesAggregationConfigTrait {
override def outputHdfsPath: String = "/user/timelines/canaries/processed/aggregates_v2"
override def mkPhysicalStore(store: AggregateStore): AggregateStore = store match {
case s: OfflineAggregateDataRecordStore =>
s.toOfflineAggregateDataRecordStoreWithDAL(dalDataset = AggregatesCanaryScalaDataset)
case _ => throw new IllegalArgumentException("Unsupported logical dataset type.")
}
}
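The `CombineCountPolicies` above are built with `topK = 2` and `hardLimit = Some(20)`. A minimal sketch of the assumed pruning behavior in plain Scala collections; `combineCounts`, its parameters, and the exact semantics are illustrative assumptions, not the framework's real `CombineCountsPolicy` API:

```scala
// Sketch only: assumed top-K count pruning, expressed with plain Scala
// collections. Keeps the largest counts first, capped by both topK and
// hardLimit. Names here are illustrative, not the framework's API.
def combineCounts(
    counts: Map[String, Long],
    topK: Int,
    hardLimit: Int
): Seq[(String, Long)] =
  counts.toSeq
    .sortBy { case (_, c) => -c } // largest counts first
    .take(topK min hardLimit)     // keep at most topK, never above hardLimit
```

With `topK = 2`, only the two highest per-key counts would survive, regardless of how many distinct engager or topic keys were observed.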

@@ -0,0 +1,579 @@
package com.twitter.timelines.prediction.common.aggregates
import com.twitter.conversions.DurationOps._
import com.twitter.ml.api.constant.SharedFeatures.AUTHOR_ID
import com.twitter.ml.api.constant.SharedFeatures.USER_ID
import com.twitter.timelines.data_processing.ml_util.aggregation_framework._
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.metrics._
import com.twitter.timelines.data_processing.ml_util.transforms.DownsampleTransform
import com.twitter.timelines.data_processing.ml_util.transforms.RichRemoveAuthorIdZero
import com.twitter.timelines.data_processing.ml_util.transforms.RichRemoveUserIdZero
import com.twitter.timelines.prediction.features.common.TimelinesSharedFeatures
import com.twitter.timelines.prediction.features.engagement_features.EngagementDataRecordFeatures
import com.twitter.timelines.prediction.features.engagement_features.EngagementDataRecordFeatures.RichUnifyPublicEngagersTransform
import com.twitter.timelines.prediction.features.list_features.ListFeatures
import com.twitter.timelines.prediction.features.recap.RecapFeatures
import com.twitter.timelines.prediction.features.request_context.RequestContextFeatures
import com.twitter.timelines.prediction.features.semantic_core_features.SemanticCoreFeatures
import com.twitter.timelines.prediction.transform.filter.FilterInNetworkTransform
import com.twitter.timelines.prediction.transform.filter.FilterImageTweetTransform
import com.twitter.timelines.prediction.transform.filter.FilterVideoTweetTransform
import com.twitter.timelines.prediction.transform.filter.FilterOutImageVideoTweetTransform
import com.twitter.util.Duration
trait TimelinesAggregationConfigDetails extends Serializable {
import TimelinesAggregationSources._
def outputHdfsPath: String
/**
 * Converts the given logical store to a physical store. We do not specify the
 * physical store directly on the [[AggregateGroup]] because doing so would create
 * a cyclic dependency: the physical stores are DalDatasets whose PersonalDataType
 * annotations are derived from the [[AggregateGroup]] itself.
 */
def mkPhysicalStore(store: AggregateStore): AggregateStore
def defaultMaxKvSourceFailures: Int = 100
val timelinesOfflineAggregateSink = new OfflineStoreCommonConfig {
override def apply(startDate: String) = OfflineAggregateStoreCommonConfig(
outputHdfsPathPrefix = outputHdfsPath,
dummyAppId = "timelines_aggregates_v2_ro",
dummyDatasetPrefix = "timelines_aggregates_v2_ro",
startDate = startDate
)
}
val UserAggregateStore = "user_aggregates"
val UserAuthorAggregateStore = "user_author_aggregates"
val UserOriginalAuthorAggregateStore = "user_original_author_aggregates"
val OriginalAuthorAggregateStore = "original_author_aggregates"
val UserEngagerAggregateStore = "user_engager_aggregates"
val UserMentionAggregateStore = "user_mention_aggregates"
val TwitterWideUserAggregateStore = "twitter_wide_user_aggregates"
val TwitterWideUserAuthorAggregateStore = "twitter_wide_user_author_aggregates"
val UserRequestHourAggregateStore = "user_request_hour_aggregates"
val UserRequestDowAggregateStore = "user_request_dow_aggregates"
val UserListAggregateStore = "user_list_aggregates"
val AuthorTopicAggregateStore = "author_topic_aggregates"
val UserTopicAggregateStore = "user_topic_aggregates"
val UserInferredTopicAggregateStore = "user_inferred_topic_aggregates"
val UserMediaUnderstandingAnnotationAggregateStore =
"user_media_understanding_annotation_aggregates"
val AuthorCountryCodeAggregateStore = "author_country_code_aggregates"
val OriginalAuthorCountryCodeAggregateStore = "original_author_country_code_aggregates"
/**
 * Step 3: Configure all aggregates to compute.
 * Note that different subsets of aggregates in this list
 * can be launched by different summingbird job instances.
 * Any given job can be responsible for a set of AggregateGroup
 * configs whose outputStores share the exact same startDate.
 * AggregateGroups that do not share the same inputSource,
 * outputStore, or startDate MUST be launched using different
 * summingbird jobs, each passed a different --start-time argument.
 * See science/scalding/mesos/timelines/prod.yaml for an example
 * of how to configure your own job.
 */
val negativeDownsampleTransform =
DownsampleTransform(
negativeSamplingRate = 0.03,
keepLabels = RecapUserFeatureAggregation.LabelsV2)
val negativeRecTweetDownsampleTransform = DownsampleTransform(
negativeSamplingRate = 0.03,
keepLabels = RectweetUserFeatureAggregation.RectweetLabelsForAggregation
)
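The two `DownsampleTransform`s above keep 3% of negatives. A minimal sketch of the assumed semantics in plain Scala; `keepRecord` and its parameters are illustrative, not the framework's real API:

```scala
import scala.util.Random

// Sketch only: assumed negative-downsampling semantics. A record carrying
// any label in keepLabels always survives; a label-free ("negative")
// record survives with probability negativeSamplingRate. Names here are
// illustrative, not the framework's API.
def keepRecord(
    labels: Set[String],
    keepLabels: Set[String],
    negativeSamplingRate: Double,
    rng: Random
): Boolean =
  labels.exists(keepLabels) || rng.nextDouble() < negativeSamplingRate
```

Under these assumptions, positives are never dropped, so the sampling only shrinks the dominant negative class before aggregation.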
val userAggregatesV2: AggregateGroup =
AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_aggregate_v2",
preTransforms = Seq(RichRemoveUserIdZero), /* Eliminates reducer skew */
keys = Set(USER_ID),
features = RecapUserFeatureAggregation.UserFeaturesV2,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric, SumMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserAggregateStore,
startDate = "2016-07-15 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
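Each group here aggregates with `halfLives = Set(50.days)`. A minimal sketch of how an exponential half-life would weight observations by age; the names and the exact decay formula are illustrative assumptions, not the framework's implementation:

```scala
import scala.math.pow

// Sketch only: exponential half-life weighting. An observation ageDays
// old keeps weight 0.5^(ageDays / halfLifeDays), so a 50-day-old
// engagement counts half as much as a fresh one under a 50-day half-life.
def decayWeight(ageDays: Double, halfLifeDays: Double): Double =
  pow(0.5, ageDays / halfLifeDays)

// Decayed sum over (count, ageDays) observations.
def decayedCount(obs: Seq[(Double, Double)], halfLifeDays: Double): Double =
  obs.map { case (count, age) => count * decayWeight(age, halfLifeDays) }.sum
```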
val userAuthorAggregatesV2: Set[AggregateGroup] = {
/**
 * NOTE: Records from out-of-network authors must be filtered out of the recap
 * input records (which, after the recap and rectweet models were merged, include
 * out-of-network records as well) before computing user-author aggregates. This
 * is necessary to limit the growth rate of user-author aggregates.
 */
val allFeatureAggregates = Set(
AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_author_aggregate_v2",
preTransforms = Seq(FilterInNetworkTransform, RichRemoveUserIdZero),
keys = Set(USER_ID, AUTHOR_ID),
features = RecapUserFeatureAggregation.UserAuthorFeaturesV2,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(SumMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserAuthorAggregateStore,
startDate = "2016-07-15 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
)
val countAggregates: Set[AggregateGroup] = Set(
AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_author_aggregate_v2",
preTransforms = Seq(FilterInNetworkTransform, RichRemoveUserIdZero),
keys = Set(USER_ID, AUTHOR_ID),
features = RecapUserFeatureAggregation.UserAuthorFeaturesV2Count,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserAuthorAggregateStore,
startDate = "2016-07-15 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
)
allFeatureAggregates ++ countAggregates
}
val userAggregatesV5Continuous: AggregateGroup =
AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_aggregate_v5.continuous",
preTransforms = Seq(RichRemoveUserIdZero),
keys = Set(USER_ID),
features = RecapUserFeatureAggregation.UserFeaturesV5Continuous,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric, SumMetric, SumSqMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserAggregateStore,
startDate = "2016-07-15 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val userAuthorAggregatesV5: AggregateGroup =
AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_author_aggregate_v5",
preTransforms = Seq(FilterInNetworkTransform, RichRemoveUserIdZero),
keys = Set(USER_ID, AUTHOR_ID),
features = RecapUserFeatureAggregation.UserAuthorFeaturesV5,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserAuthorAggregateStore,
startDate = "2016-07-15 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val tweetSourceUserAuthorAggregatesV1: AggregateGroup =
AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_author_aggregate_tweetsource_v1",
preTransforms = Seq(FilterInNetworkTransform, RichRemoveUserIdZero),
keys = Set(USER_ID, AUTHOR_ID),
features = RecapUserFeatureAggregation.UserAuthorTweetSourceFeaturesV1,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric, SumMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserAuthorAggregateStore,
startDate = "2016-07-15 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val userEngagerAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_engager_aggregate",
keys = Set(USER_ID, EngagementDataRecordFeatures.PublicEngagementUserIds),
features = Set.empty,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserEngagerAggregateStore,
startDate = "2016-09-02 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
)),
preTransforms = Seq(
RichRemoveUserIdZero,
RichUnifyPublicEngagersTransform
)
)
val userMentionAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
preTransforms = Seq(RichRemoveUserIdZero), /* Eliminates reducer skew */
aggregatePrefix = "user_mention_aggregate",
keys = Set(USER_ID, RecapFeatures.MENTIONED_SCREEN_NAMES),
features = Set.empty,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserMentionAggregateStore,
startDate = "2017-03-01 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
)),
includeAnyLabel = false
)
val twitterWideUserAggregates = AggregateGroup(
inputSource = timelinesDailyTwitterWideSource,
preTransforms = Seq(RichRemoveUserIdZero), /* Eliminates reducer skew */
aggregatePrefix = "twitter_wide_user_aggregate",
keys = Set(USER_ID),
features = RecapUserFeatureAggregation.TwitterWideFeatures,
labels = RecapUserFeatureAggregation.TwitterWideLabels,
metrics = Set(CountMetric, SumMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = TwitterWideUserAggregateStore,
startDate = "2016-12-28 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val twitterWideUserAuthorAggregates = AggregateGroup(
inputSource = timelinesDailyTwitterWideSource,
preTransforms = Seq(RichRemoveUserIdZero), /* Eliminates reducer skew */
aggregatePrefix = "twitter_wide_user_author_aggregate",
keys = Set(USER_ID, AUTHOR_ID),
features = RecapUserFeatureAggregation.TwitterWideFeatures,
labels = RecapUserFeatureAggregation.TwitterWideLabels,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = TwitterWideUserAuthorAggregateStore,
startDate = "2016-12-28 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
)),
includeAnyLabel = false
)
/**
* User-HourOfDay and User-DayOfWeek aggregations, both for recap and rectweet
*/
val userRequestHourAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_request_context_aggregate.hour",
preTransforms = Seq(RichRemoveUserIdZero, negativeDownsampleTransform),
keys = Set(USER_ID, RequestContextFeatures.TIMESTAMP_GMT_HOUR),
features = Set.empty,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserRequestHourAggregateStore,
startDate = "2017-08-01 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val userRequestDowAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_request_context_aggregate.dow",
preTransforms = Seq(RichRemoveUserIdZero, negativeDownsampleTransform),
keys = Set(USER_ID, RequestContextFeatures.TIMESTAMP_GMT_DOW),
features = Set.empty,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserRequestDowAggregateStore,
startDate = "2017-08-01 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val authorTopicAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "author_topic_aggregate",
preTransforms = Seq(RichRemoveUserIdZero),
keys = Set(AUTHOR_ID, TimelinesSharedFeatures.TOPIC_ID),
features = Set.empty,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = AuthorTopicAggregateStore,
startDate = "2020-05-19 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val userTopicAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_topic_aggregate",
preTransforms = Seq(RichRemoveUserIdZero),
keys = Set(USER_ID, TimelinesSharedFeatures.TOPIC_ID),
features = Set.empty,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserTopicAggregateStore,
startDate = "2020-05-23 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val userTopicAggregatesV2 = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_topic_aggregate_v2",
preTransforms = Seq(RichRemoveUserIdZero),
keys = Set(USER_ID, TimelinesSharedFeatures.TOPIC_ID),
features = RecapUserFeatureAggregation.UserTopicFeaturesV2Count,
labels = RecapUserFeatureAggregation.LabelsV2,
includeAnyFeature = false,
includeAnyLabel = false,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserTopicAggregateStore,
startDate = "2020-05-23 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val userInferredTopicAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_inferred_topic_aggregate",
preTransforms = Seq(RichRemoveUserIdZero),
keys = Set(USER_ID, TimelinesSharedFeatures.INFERRED_TOPIC_IDS),
features = Set.empty,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserInferredTopicAggregateStore,
startDate = "2020-09-09 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val userInferredTopicAggregatesV2 = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_inferred_topic_aggregate_v2",
preTransforms = Seq(RichRemoveUserIdZero),
keys = Set(USER_ID, TimelinesSharedFeatures.INFERRED_TOPIC_IDS),
features = RecapUserFeatureAggregation.UserTopicFeaturesV2Count,
labels = RecapUserFeatureAggregation.LabelsV2,
includeAnyFeature = false,
includeAnyLabel = false,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserInferredTopicAggregateStore,
startDate = "2020-09-09 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val userReciprocalEngagementAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_aggregate_v6",
preTransforms = Seq(RichRemoveUserIdZero),
keys = Set(USER_ID),
features = Set.empty,
labels = RecapUserFeatureAggregation.ReciprocalLabels,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserAggregateStore,
startDate = "2016-07-15 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
)),
includeAnyLabel = false
)
val userOriginalAuthorReciprocalEngagementAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_original_author_aggregate_v1",
preTransforms = Seq(RichRemoveUserIdZero, RichRemoveAuthorIdZero),
keys = Set(USER_ID, TimelinesSharedFeatures.ORIGINAL_AUTHOR_ID),
features = Set.empty,
labels = RecapUserFeatureAggregation.ReciprocalLabels,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserOriginalAuthorAggregateStore,
startDate = "2018-12-26 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
)),
includeAnyLabel = false
)
val originalAuthorReciprocalEngagementAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "original_author_aggregate_v1",
preTransforms = Seq(RichRemoveUserIdZero, RichRemoveAuthorIdZero),
keys = Set(TimelinesSharedFeatures.ORIGINAL_AUTHOR_ID),
features = Set.empty,
labels = RecapUserFeatureAggregation.ReciprocalLabels,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = OriginalAuthorAggregateStore,
startDate = "2023-02-25 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
)),
includeAnyLabel = false
)
val originalAuthorNegativeEngagementAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "original_author_aggregate_v2",
preTransforms = Seq(RichRemoveUserIdZero, RichRemoveAuthorIdZero),
keys = Set(TimelinesSharedFeatures.ORIGINAL_AUTHOR_ID),
features = Set.empty,
labels = RecapUserFeatureAggregation.NegativeEngagementLabels,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = OriginalAuthorAggregateStore,
startDate = "2023-02-25 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
)),
includeAnyLabel = false
)
val userListAggregates: AggregateGroup =
AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_list_aggregate",
keys = Set(USER_ID, ListFeatures.LIST_ID),
features = Set.empty,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserListAggregateStore,
startDate = "2020-05-28 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
)),
preTransforms = Seq(RichRemoveUserIdZero)
)
val userMediaUnderstandingAnnotationAggregates: AggregateGroup = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_media_annotation_aggregate",
preTransforms = Seq(RichRemoveUserIdZero),
keys =
Set(USER_ID, SemanticCoreFeatures.mediaUnderstandingHighRecallNonSensitiveEntityIdsFeature),
features = Set.empty,
labels = RecapUserFeatureAggregation.LabelsV2,
metrics = Set(CountMetric),
halfLives = Set(50.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserMediaUnderstandingAnnotationAggregateStore,
startDate = "2021-03-20 00:00",
commonConfig = timelinesOfflineAggregateSink
))
)
val userAuthorGoodClickAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_author_good_click_aggregate",
preTransforms = Seq(FilterInNetworkTransform, RichRemoveUserIdZero),
keys = Set(USER_ID, AUTHOR_ID),
features = RecapUserFeatureAggregation.UserAuthorFeaturesV2,
labels = RecapUserFeatureAggregation.GoodClickLabels,
metrics = Set(SumMetric),
halfLives = Set(14.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserAuthorAggregateStore,
startDate = "2016-07-15 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
))
)
val userEngagerGoodClickAggregates = AggregateGroup(
inputSource = timelinesDailyRecapMinimalSource,
aggregatePrefix = "user_engager_good_click_aggregate",
keys = Set(USER_ID, EngagementDataRecordFeatures.PublicEngagementUserIds),
features = Set.empty,
labels = RecapUserFeatureAggregation.GoodClickLabels,
metrics = Set(CountMetric),
halfLives = Set(14.days),
outputStore = mkPhysicalStore(
OfflineAggregateDataRecordStore(
name = UserEngagerAggregateStore,
startDate = "2016-09-02 00:00",
commonConfig = timelinesOfflineAggregateSink,
maxKvSourceFailures = defaultMaxKvSourceFailures
)),
preTransforms = Seq(
RichRemoveUserIdZero,
RichUnifyPublicEngagersTransform
)
)
}
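Every group above carries a decay horizon: `halfLives = Set(50.days)` for the long-term aggregates and `14.days` for the good-click groups, meaning a counted engagement loses half its weight each half-life. A minimal standalone sketch of that decay rule (illustrative only, not the framework's actual implementation):

```scala
object HalfLifeDecay {
  // Weight remaining from `count` after `ageDays`, halving every `halfLifeDays`.
  def decayed(count: Double, ageDays: Double, halfLifeDays: Double): Double =
    count * math.pow(0.5, ageDays / halfLifeDays)

  def main(args: Array[String]): Unit = {
    println(decayed(100.0, 50.0, 50.0))  // one half-life: 50.0
    println(decayed(100.0, 100.0, 50.0)) // two half-lives: 25.0
  }
}
```

A short half-life (14 days) makes a feature track recent behavior; the 50-day horizon favors stable, long-term preferences.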



@@ -0,0 +1,50 @@
package com.twitter.timelines.prediction.common.aggregates
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregationConfig
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.AggregateGroup
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.TypedAggregateGroup
trait TimelinesAggregationConfigTrait
extends TimelinesAggregationConfigDetails
with AggregationConfig {
private val aggregateGroups = Set(
authorTopicAggregates,
userTopicAggregates,
userTopicAggregatesV2,
userInferredTopicAggregates,
userInferredTopicAggregatesV2,
userAggregatesV2,
userAggregatesV5Continuous,
userReciprocalEngagementAggregates,
userAuthorAggregatesV5,
userOriginalAuthorReciprocalEngagementAggregates,
originalAuthorReciprocalEngagementAggregates,
tweetSourceUserAuthorAggregatesV1,
userEngagerAggregates,
userMentionAggregates,
twitterWideUserAggregates,
twitterWideUserAuthorAggregates,
userRequestHourAggregates,
userRequestDowAggregates,
userListAggregates,
userMediaUnderstandingAnnotationAggregates,
) ++ userAuthorAggregatesV2
val aggregatesToComputeList: Set[List[TypedAggregateGroup[_]]] =
aggregateGroups.map(_.buildTypedAggregateGroups())
override val aggregatesToCompute: Set[TypedAggregateGroup[_]] = aggregatesToComputeList.flatten
/*
 * Feature selection config to save storage space and Manhattan query bandwidth.
 * Only the most important features, found via offline RCE simulations, are used
 * for actual training and serving. This selector is used by
* [[com.twitter.timelines.data_processing.jobs.timeline_ranking_user_features.TimelineRankingAggregatesV2FeaturesProdJob]]
* but defined here to keep it in sync with the config that computes the aggregates.
*/
val AggregatesV2FeatureSelector = FeatureSelectorConfig.AggregatesV2ProdFeatureSelector
def filterAggregatesGroups(storeNames: Set[String]): Set[AggregateGroup] = {
aggregateGroups.filter(aggregateGroup => storeNames.contains(aggregateGroup.outputStore.name))
}
}
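`filterAggregatesGroups` simply selects groups by the name of their output store. Reduced to its essentials with toy stand-in types (illustrative names only, not the framework's real `AggregateStore`/`AggregateGroup`), the pattern is:

```scala
// Toy stand-ins for the framework's store and group types.
final case class Store(name: String)
final case class Group(aggregatePrefix: String, outputStore: Store)

object FilterGroupsSketch {
  // Keep only the groups whose output store is in the requested set.
  def filterGroups(groups: Set[Group], storeNames: Set[String]): Set[Group] =
    groups.filter(g => storeNames.contains(g.outputStore.name))
}
```

This lets a single shared config drive many jobs, each computing only the aggregates destined for its own store.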


@@ -0,0 +1,48 @@
package com.twitter.timelines.prediction.common.aggregates
import com.twitter.ml.api.DataRecord
import com.twitter.scalding_internal.multiformat.format.keyval.KeyValInjection
import com.twitter.summingbird.batch.BatchID
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.{
AggregateStore,
AggregationKey,
OfflineAggregateInjections,
TypedAggregateGroup
}
object TimelinesAggregationKeyValInjections extends TimelinesAggregationConfigTrait {
import OfflineAggregateInjections.getInjection
type KVInjection = KeyValInjection[AggregationKey, (BatchID, DataRecord)]
val AuthorTopic: KVInjection = getInjection(filter(AuthorTopicAggregateStore))
val UserTopic: KVInjection = getInjection(filter(UserTopicAggregateStore))
val UserInferredTopic: KVInjection = getInjection(filter(UserInferredTopicAggregateStore))
val User: KVInjection = getInjection(filter(UserAggregateStore))
val UserAuthor: KVInjection = getInjection(filter(UserAuthorAggregateStore))
val UserOriginalAuthor: KVInjection = getInjection(filter(UserOriginalAuthorAggregateStore))
val OriginalAuthor: KVInjection = getInjection(filter(OriginalAuthorAggregateStore))
val UserEngager: KVInjection = getInjection(filter(UserEngagerAggregateStore))
val UserMention: KVInjection = getInjection(filter(UserMentionAggregateStore))
val TwitterWideUser: KVInjection = getInjection(filter(TwitterWideUserAggregateStore))
val TwitterWideUserAuthor: KVInjection = getInjection(filter(TwitterWideUserAuthorAggregateStore))
val UserRequestHour: KVInjection = getInjection(filter(UserRequestHourAggregateStore))
val UserRequestDow: KVInjection = getInjection(filter(UserRequestDowAggregateStore))
val UserList: KVInjection = getInjection(filter(UserListAggregateStore))
val UserMediaUnderstandingAnnotation: KVInjection = getInjection(
filter(UserMediaUnderstandingAnnotationAggregateStore))
private def filter(storeName: String): Set[TypedAggregateGroup[_]] = {
val groups = aggregatesToCompute.filter(_.outputStore.name == storeName)
require(groups.nonEmpty)
groups
}
override def outputHdfsPath: String = "/user/timelines/processed/aggregates_v2"
// This object is not used to execute any online or offline aggregates job; it only holds
// all PDT-enabled KeyValInjections, so no physical store needs to be constructed.
// The identity operation is used as the default.
override def mkPhysicalStore(store: AggregateStore): AggregateStore = store
}


@@ -0,0 +1,45 @@
package com.twitter.timelines.prediction.common.aggregates
import com.twitter.ml.api.constant.SharedFeatures.TIMESTAMP
import com.twitter.timelines.data_processing.ml_util.aggregation_framework.OfflineAggregateSource
import com.twitter.timelines.prediction.features.p_home_latest.HomeLatestUserAggregatesFeatures
import timelines.data_processing.ad_hoc.recap.data_record_preparation.RecapDataRecordsAggMinimalJavaDataset
/**
* Any update here should be in sync with [[TimelinesFeatureGroups]] and [[AggMinimalDataRecordGeneratorJob]].
*/
object TimelinesAggregationSources {
/**
 * These are the recap data records after post-processing in [[GenerateRecapAggMinimalDataRecordsJob]].
*/
val timelinesDailyRecapMinimalSource = OfflineAggregateSource(
name = "timelines_daily_recap",
timestampFeature = TIMESTAMP,
dalDataSet = Some(RecapDataRecordsAggMinimalJavaDataset),
scaldingSuffixType = Some("dal"),
withValidation = true
)
val timelinesDailyTwitterWideSource = OfflineAggregateSource(
name = "timelines_daily_twitter_wide",
timestampFeature = TIMESTAMP,
scaldingHdfsPath = Some("/user/timelines/processed/suggests/recap/twitter_wide_data_records"),
scaldingSuffixType = Some("daily"),
withValidation = true
)
val timelinesDailyListTimelineSource = OfflineAggregateSource(
name = "timelines_daily_list_timeline",
timestampFeature = TIMESTAMP,
scaldingHdfsPath = Some("/user/timelines/processed/suggests/recap/all_features/list"),
scaldingSuffixType = Some("hourly"),
withValidation = true
)
val timelinesDailyHomeLatestSource = OfflineAggregateSource(
name = "timelines_daily_home_latest",
timestampFeature = HomeLatestUserAggregatesFeatures.AGGREGATE_TIMESTAMP_MS,
scaldingHdfsPath = Some("/user/timelines/processed/p_home_latest/user_aggregates"),
scaldingSuffixType = Some("daily")
)
}


@@ -0,0 +1,70 @@
package com.twitter.timelines.prediction.common.aggregates.real_time
import com.twitter.dal.personal_data.thriftjava.PersonalDataType.UserState
import com.twitter.ml.api.Feature.Binary
import com.twitter.ml.api.{DataRecord, Feature, FeatureContext, RichDataRecord}
import com.twitter.ml.featurestore.catalog.entities.core.Author
import com.twitter.ml.featurestore.catalog.features.magicrecs.UserActivity
import com.twitter.ml.featurestore.lib.data.PredictionRecord
import com.twitter.ml.featurestore.lib.feature.{BoundFeature, BoundFeatureSet}
import com.twitter.ml.featurestore.lib.{UserId, Discrete => FSDiscrete}
import com.twitter.timelines.prediction.common.adapters.TimelinesAdapterBase
import java.lang.{Boolean => JBoolean}
import java.util
import scala.collection.JavaConverters._
object AuthorFeaturesAdapter extends TimelinesAdapterBase[PredictionRecord] {
val UserStateBoundFeature: BoundFeature[UserId, FSDiscrete] = UserActivity.UserState.bind(Author)
val UserFeaturesSet: BoundFeatureSet = BoundFeatureSet(UserStateBoundFeature)
/**
 * Boolean features about the author's user state.
* enum UserState {
* NEW = 0,
* NEAR_ZERO = 1,
* VERY_LIGHT = 2,
* LIGHT = 3,
* MEDIUM_TWEETER = 4,
* MEDIUM_NON_TWEETER = 5,
* HEAVY_NON_TWEETER = 6,
* HEAVY_TWEETER = 7
* }(persisted='true')
*/
val IS_USER_NEW = new Binary("timelines.author.user_state.is_user_new", Set(UserState).asJava)
val IS_USER_LIGHT = new Binary("timelines.author.user_state.is_user_light", Set(UserState).asJava)
val IS_USER_MEDIUM_TWEETER =
new Binary("timelines.author.user_state.is_user_medium_tweeter", Set(UserState).asJava)
val IS_USER_MEDIUM_NON_TWEETER =
new Binary("timelines.author.user_state.is_user_medium_non_tweeter", Set(UserState).asJava)
val IS_USER_HEAVY_NON_TWEETER =
new Binary("timelines.author.user_state.is_user_heavy_non_tweeter", Set(UserState).asJava)
val IS_USER_HEAVY_TWEETER =
new Binary("timelines.author.user_state.is_user_heavy_tweeter", Set(UserState).asJava)
val userStateToFeatureMap: Map[Long, Binary] = Map(
0L -> IS_USER_NEW,
1L -> IS_USER_LIGHT,
2L -> IS_USER_LIGHT,
3L -> IS_USER_LIGHT,
4L -> IS_USER_MEDIUM_TWEETER,
5L -> IS_USER_MEDIUM_NON_TWEETER,
6L -> IS_USER_HEAVY_NON_TWEETER,
7L -> IS_USER_HEAVY_TWEETER
)
val UserStateBooleanFeatures: Set[Feature[_]] = userStateToFeatureMap.values.toSet
private val allFeatures: Seq[Feature[_]] = UserStateBooleanFeatures.toSeq
override def getFeatureContext: FeatureContext = new FeatureContext(allFeatures: _*)
override def commonFeatures: Set[Feature[_]] = Set.empty
override def adaptToDataRecords(record: PredictionRecord): util.List[DataRecord] = {
val newRecord = new RichDataRecord(new DataRecord)
record
.getFeatureValue(UserStateBoundFeature)
.flatMap { userState => userStateToFeatureMap.get(userState.value) }.foreach {
booleanFeature => newRecord.setFeatureValue[JBoolean](booleanFeature, true)
}
List(newRecord.getRecord).asJava
}
}
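The adapter collapses the eight `UserState` values into six boolean features (NEAR_ZERO, VERY_LIGHT, and LIGHT all map to the single "light" feature) and emits a one-hot record. A self-contained sketch of that mapping, with plain strings standing in for the `Binary` features and `Map[String, Boolean]` for the data record:

```scala
object UserStateOneHotSketch {
  // String stand-ins for the Binary features above; keys follow the UserState enum ids.
  val userStateToFeature: Map[Long, String] = Map(
    0L -> "is_user_new",
    1L -> "is_user_light", // NEAR_ZERO, VERY_LIGHT and LIGHT collapse into one bucket
    2L -> "is_user_light",
    3L -> "is_user_light",
    4L -> "is_user_medium_tweeter",
    5L -> "is_user_medium_non_tweeter",
    6L -> "is_user_heavy_non_tweeter",
    7L -> "is_user_heavy_tweeter"
  )

  // Mirrors adaptToDataRecords: an absent or unknown state yields an empty record.
  def adapt(userState: Option[Long]): Map[String, Boolean] =
    userState.flatMap(userStateToFeature.get).map(f => Map(f -> true)).getOrElse(Map.empty)
}
```

Only the matching feature is set; unset features are simply absent from the record rather than written as false.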


@@ -0,0 +1,199 @@
heron_binary(
name = "heron-without-jass",
main = "com.twitter.timelines.prediction.common.aggregates.real_time.TypeSafeRunner",
oss = True,
platform = "java8",
runtime_platform = "java8",
tags = ["bazel-compatible"],
dependencies = [
":real_time",
"3rdparty/jvm/org/slf4j:slf4j-jdk14",
],
)
jvm_app(
name = "rta_heron",
binary = ":heron-without-jass",
bundles = [
bundle(
fileset = ["resources/jaas.conf"],
),
],
tags = [
"bazel-compatible",
"bazel-only",
],
)
scala_library(
sources = ["*.scala"],
platform = "java8",
strict_deps = False,
tags = ["bazel-compatible"],
dependencies = [
":online-configs",
"3rdparty/src/jvm/com/twitter/summingbird:storm",
"src/java/com/twitter/heron/util",
"src/java/com/twitter/ml/api:api-base",
"src/java/com/twitter/ml/api/constant",
"src/scala/com/twitter/frigate/data_pipeline/features_aggregated/core:core-features",
"src/scala/com/twitter/ml/api/util",
"src/scala/com/twitter/storehaus_internal/memcache",
"src/scala/com/twitter/storehaus_internal/util",
"src/scala/com/twitter/summingbird_internal/bijection:bijection-implicits",
"src/scala/com/twitter/summingbird_internal/runner/store_config",
"src/scala/com/twitter/summingbird_internal/runner/storm",
"src/scala/com/twitter/summingbird_internal/sources/storm/remote:ClientEventSourceScrooge2",
"src/scala/com/twitter/timelines/prediction/adapters/client_log_event",
"src/scala/com/twitter/timelines/prediction/adapters/client_log_event_mr",
"src/scala/com/twitter/timelines/prediction/features/client_log_event",
"src/scala/com/twitter/timelines/prediction/features/common",
"src/scala/com/twitter/timelines/prediction/features/list_features",
"src/scala/com/twitter/timelines/prediction/features/recap",
"src/scala/com/twitter/timelines/prediction/features/user_health",
"src/thrift/com/twitter/ml/api:data-java",
"src/thrift/com/twitter/timelines/suggests/common:record-scala",
"timelinemixer/common/src/main/scala/com/twitter/timelinemixer/clients/served_features_cache",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
"timelines/data_processing/ml_util/aggregation_framework/heron",
"timelines/data_processing/ml_util/aggregation_framework/job",
"timelines/data_processing/ml_util/aggregation_framework/metrics",
"timelines/data_processing/ml_util/transforms",
"timelines/src/main/scala/com/twitter/timelines/clients/memcache_common",
"util/util-core:scala",
],
)
scala_library(
name = "online-configs",
sources = [
"AuthorFeaturesAdapter.scala",
"Event.scala",
"FeatureStoreUtils.scala",
"StormAggregateSourceUtils.scala",
"TimelinesOnlineAggregationConfig.scala",
"TimelinesOnlineAggregationConfigBase.scala",
"TimelinesOnlineAggregationSources.scala",
"TimelinesStormAggregateSource.scala",
"TweetFeaturesReadableStore.scala",
"UserFeaturesAdapter.scala",
"UserFeaturesReadableStore.scala",
],
platform = "java8",
strict_deps = True,
tags = ["bazel-compatible"],
dependencies = [
":base-config",
"3rdparty/src/jvm/com/twitter/scalding:db",
"3rdparty/src/jvm/com/twitter/storehaus:core",
"3rdparty/src/jvm/com/twitter/summingbird:core",
"3rdparty/src/jvm/com/twitter/summingbird:online",
"3rdparty/src/jvm/com/twitter/summingbird:storm",
"abuse/detection/src/main/thrift/com/twitter/abuse/detection/mention_interactions:thrift-scala",
"snowflake/src/main/scala/com/twitter/snowflake/id",
"snowflake/src/main/thrift:thrift-scala",
"src/java/com/twitter/ml/api:api-base",
"src/java/com/twitter/ml/api/constant",
"src/scala/com/twitter/frigate/data_pipeline/features_aggregated/core:core-features",
"src/scala/com/twitter/ml/api/util:datarecord",
"src/scala/com/twitter/ml/featurestore/catalog/datasets/geo:geo-user-location",
"src/scala/com/twitter/ml/featurestore/catalog/datasets/magicrecs:user-features",
"src/scala/com/twitter/ml/featurestore/catalog/entities/core",
"src/scala/com/twitter/ml/featurestore/catalog/features/core:user",
"src/scala/com/twitter/ml/featurestore/catalog/features/geo",
"src/scala/com/twitter/ml/featurestore/catalog/features/magicrecs:user-activity",
"src/scala/com/twitter/ml/featurestore/catalog/features/magicrecs:user-info",
"src/scala/com/twitter/ml/featurestore/catalog/features/trends:tweet_trends_scores",
"src/scala/com/twitter/ml/featurestore/lib/data",
"src/scala/com/twitter/ml/featurestore/lib/dataset/offline",
"src/scala/com/twitter/ml/featurestore/lib/export/strato:app-names",
"src/scala/com/twitter/ml/featurestore/lib/feature",
"src/scala/com/twitter/ml/featurestore/lib/online",
"src/scala/com/twitter/ml/featurestore/lib/params",
"src/scala/com/twitter/storehaus_internal/util",
"src/scala/com/twitter/summingbird_internal/bijection:bijection-implicits",
"src/scala/com/twitter/summingbird_internal/runner/store_config",
"src/scala/com/twitter/summingbird_internal/runner/storm",
"src/scala/com/twitter/summingbird_internal/sources/common",
"src/scala/com/twitter/summingbird_internal/sources/common/remote:ClientEventSourceScrooge",
"src/scala/com/twitter/summingbird_internal/sources/storm/remote:ClientEventSourceScrooge2",
"src/scala/com/twitter/timelines/prediction/adapters/client_log_event",
"src/scala/com/twitter/timelines/prediction/adapters/client_log_event_mr",
"src/scala/com/twitter/timelines/prediction/common/adapters:base",
"src/scala/com/twitter/timelines/prediction/common/adapters:engagement-converter",
"src/scala/com/twitter/timelines/prediction/common/aggregates",
"src/scala/com/twitter/timelines/prediction/features/client_log_event",
"src/scala/com/twitter/timelines/prediction/features/common",
"src/scala/com/twitter/timelines/prediction/features/list_features",
"src/scala/com/twitter/timelines/prediction/features/recap",
"src/scala/com/twitter/timelines/prediction/features/user_health",
"src/thrift/com/twitter/clientapp/gen:clientapp-scala",
"src/thrift/com/twitter/dal/personal_data:personal_data-java",
"src/thrift/com/twitter/ml/api:data-java",
"src/thrift/com/twitter/timelines/suggests/common:engagement-java",
"src/thrift/com/twitter/timelines/suggests/common:engagement-scala",
"src/thrift/com/twitter/timelines/suggests/common:record-scala",
"src/thrift/com/twitter/timelineservice/injection:thrift-scala",
"src/thrift/com/twitter/timelineservice/server/suggests/logging:thrift-scala",
"strato/src/main/scala/com/twitter/strato/client",
"timelinemixer/common/src/main/scala/com/twitter/timelinemixer/clients/served_features_cache",
"timelines/data_processing/ad_hoc/suggests/common:raw_training_data_creator",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
"timelines/data_processing/ml_util/aggregation_framework/heron:configs",
"timelines/data_processing/ml_util/aggregation_framework/metrics",
"timelines/data_processing/ml_util/transforms",
"timelines/data_processing/util:rich-request",
"tweetsource/common/src/main/thrift:thrift-scala",
"twitter-server-internal/src/main/scala",
"unified_user_actions/client/src/main/scala/com/twitter/unified_user_actions/client/config",
"unified_user_actions/client/src/main/scala/com/twitter/unified_user_actions/client/summingbird",
"unified_user_actions/thrift/src/main/thrift/com/twitter/unified_user_actions:unified_user_actions-scala",
"util/util-core:scala",
"util/util-stats/src/main/scala/com/twitter/finagle/stats",
],
)
scala_library(
name = "base-config",
sources = [
"AuthorFeaturesAdapter.scala",
"TimelinesOnlineAggregationConfigBase.scala",
"TweetFeaturesAdapter.scala",
"UserFeaturesAdapter.scala",
],
platform = "java8",
strict_deps = True,
tags = ["bazel-compatible"],
dependencies = [
"src/java/com/twitter/ml/api:api-base",
"src/java/com/twitter/ml/api/constant",
"src/resources/com/twitter/timelines/prediction/common/aggregates/real_time",
"src/scala/com/twitter/ml/api/util:datarecord",
"src/scala/com/twitter/ml/featurestore/catalog/datasets/magicrecs:user-features",
"src/scala/com/twitter/ml/featurestore/catalog/entities/core",
"src/scala/com/twitter/ml/featurestore/catalog/features/core:user",
"src/scala/com/twitter/ml/featurestore/catalog/features/geo",
"src/scala/com/twitter/ml/featurestore/catalog/features/magicrecs:user-activity",
"src/scala/com/twitter/ml/featurestore/catalog/features/magicrecs:user-info",
"src/scala/com/twitter/ml/featurestore/catalog/features/trends:tweet_trends_scores",
"src/scala/com/twitter/ml/featurestore/lib/data",
"src/scala/com/twitter/ml/featurestore/lib/feature",
"src/scala/com/twitter/timelines/prediction/common/adapters:base",
"src/scala/com/twitter/timelines/prediction/common/adapters:engagement-converter",
"src/scala/com/twitter/timelines/prediction/common/aggregates",
"src/scala/com/twitter/timelines/prediction/features/client_log_event",
"src/scala/com/twitter/timelines/prediction/features/common",
"src/scala/com/twitter/timelines/prediction/features/list_features",
"src/scala/com/twitter/timelines/prediction/features/recap",
"src/scala/com/twitter/timelines/prediction/features/user_health",
"src/thrift/com/twitter/dal/personal_data:personal_data-java",
"src/thrift/com/twitter/ml/api:feature_context-java",
"src/thrift/com/twitter/timelines/suggests/common:engagement-scala",
"timelines/data_processing/ml_util/aggregation_framework:common_types",
"timelines/data_processing/ml_util/aggregation_framework/heron:base-config",
"timelines/data_processing/ml_util/aggregation_framework/metrics",
"timelines/data_processing/ml_util/transforms",
"util/util-core:scala",
"util/util-core:util-core-util",
],
)


@@ -0,0 +1,11 @@
package com.twitter.timelines.prediction.common.aggregates.real_time
private[real_time] sealed trait Event[T] { def event: T }
private[real_time] case class HomeEvent[T](override val event: T) extends Event[T]
private[real_time] case class ProfileEvent[T](override val event: T) extends Event[T]
private[real_time] case class SearchEvent[T](override val event: T) extends Event[T]
private[real_time] case class UuaEvent[T](override val event: T) extends Event[T]
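The sealed wrapper lets downstream topology code tag which surface an engagement came from and branch on it exhaustively — the compiler warns if a variant is left unmatched. A standalone sketch of the pattern with two variants (hypothetical names; the real code also has search and UUA events):

```scala
sealed trait EventSketch[T] { def event: T }
final case class HomeEventSketch[T](event: T) extends EventSketch[T]
final case class ProfileEventSketch[T](event: T) extends EventSketch[T]

object EventRouting {
  // Because EventSketch is sealed, this match is checked for exhaustiveness.
  def surface(e: EventSketch[_]): String = e match {
    case HomeEventSketch(_)    => "home"
    case ProfileEventSketch(_) => "profile"
  }
}
```

The payload type `T` stays generic, so the same wrapper works for raw client events and for already-adapted data records.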


@@ -0,0 +1,53 @@
package com.twitter.timelines.prediction.common.aggregates.real_time
import com.twitter.finagle.mtls.authentication.ServiceIdentifier
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.ml.featurestore.catalog.datasets.magicrecs.UserFeaturesDataset
import com.twitter.ml.featurestore.catalog.datasets.geo.GeoUserLocationDataset
import com.twitter.ml.featurestore.lib.dataset.DatasetParams
import com.twitter.ml.featurestore.lib.export.strato.FeatureStoreAppNames
import com.twitter.ml.featurestore.lib.online.FeatureStoreClient
import com.twitter.ml.featurestore.lib.params.FeatureStoreParams
import com.twitter.strato.client.{Client, Strato}
import com.twitter.strato.opcontext.Attribution.ManhattanAppId
import com.twitter.util.Duration
private[real_time] object FeatureStoreUtils {
private def mkStratoClient(serviceIdentifier: ServiceIdentifier): Client =
Strato.client
.withMutualTls(serviceIdentifier)
.withRequestTimeout(Duration.fromMilliseconds(50))
.build()
private val featureStoreParams: FeatureStoreParams =
FeatureStoreParams(
perDataset = Map(
UserFeaturesDataset.id ->
DatasetParams(
stratoSuffix = Some(FeatureStoreAppNames.Timelines),
attributions = Seq(ManhattanAppId("athena", "timelines_aggregates_v2_features_by_user"))
),
GeoUserLocationDataset.id ->
DatasetParams(
attributions = Seq(ManhattanAppId("starbuck", "timelines_geo_features_by_user"))
)
)
)
def mkFeatureStoreClient(
serviceIdentifier: ServiceIdentifier,
statsReceiver: StatsReceiver
): FeatureStoreClient = {
com.twitter.server.Init() // necessary in order to use WilyNS path
val stratoClient: Client = mkStratoClient(serviceIdentifier)
val featureStoreClient: FeatureStoreClient = FeatureStoreClient(
featureSet =
UserFeaturesAdapter.UserFeaturesSet ++ AuthorFeaturesAdapter.UserFeaturesSet ++ TweetFeaturesAdapter.TweetFeaturesSet,
client = stratoClient,
statsReceiver = statsReceiver,
featureStoreParams = featureStoreParams
)
featureStoreClient
}
}

Some files were not shown because too many files have changed in this diff.