This commit is contained in:
Kyle McIndoe 2023-04-11 09:06:41 -04:00
commit 3925317a3d
41 changed files with 5432 additions and 116 deletions

View File

@@ -1,6 +1,6 @@
# Twitter Recommendation Algorithm
# Twitter's Recommendation Algorithm
The Twitter Recommendation Algorithm is a set of services and jobs that are responsible for constructing and serving the
Twitter's Recommendation Algorithm is a set of services and jobs that are responsible for constructing and serving the
Home Timeline. For an introduction to how the algorithm works, please refer to our [engineering blog](https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm). The
diagram below illustrates how major services and jobs interconnect.
@@ -13,24 +13,24 @@ These are the main components of the Recommendation Algorithm included in this repository.
| Feature | [SimClusters](src/scala/com/twitter/simclusters_v2/README.md) | Community detection and sparse embeddings into those communities. |
| | [TwHIN](https://github.com/twitter/the-algorithm-ml/blob/main/projects/twhin/README.md) | Dense knowledge graph embeddings for Users and Tweets. |
| | [trust-and-safety-models](trust_and_safety_models/README.md) | Models for detecting NSFW or abusive content. |
| | [real-graph](src/scala/com/twitter/interaction_graph/README.md) | Model to predict likelihood of a Twitter User interacting with another User. |
| | [real-graph](src/scala/com/twitter/interaction_graph/README.md) | Model to predict the likelihood of a Twitter User interacting with another User. |
| | [tweepcred](src/scala/com/twitter/graph/batch/job/tweepcred/README) | Page-Rank algorithm for calculating Twitter User reputation. |
| | [recos-injector](recos-injector/README.md) | Streaming event processor for building input streams for [GraphJet](https://github.com/twitter/GraphJet) based services. |
| | [graph-feature-service](graph-feature-service/README.md) | Serves graph features for a directed pair of Users (e.g. how many of User A's following liked Tweets from User B). |
| Candidate Source | [search-index](src/java/com/twitter/search/README.md) | Find and rank In-Network Tweets. ~50% of Tweets come from this candidate source. |
| | [cr-mixer](cr-mixer/README.md) | Coordination layer for fetching Out-of-Network tweet candidates from underlying compute services. |
| | [user-tweet-entity-graph](src/scala/com/twitter/recos/user_tweet_entity_graph/README.md) (UTEG)| Maintains an in memory User to Tweet interaction graph, and finds candidates based on traversals of this graph. This is built on the [GraphJet](https://github.com/twitter/GraphJet) framework. Several other GraphJet based features and candidate sources are located [here](src/scala/com/twitter/recos) |
| | [user-tweet-entity-graph](src/scala/com/twitter/recos/user_tweet_entity_graph/README.md) (UTEG)| Maintains an in memory User to Tweet interaction graph, and finds candidates based on traversals of this graph. This is built on the [GraphJet](https://github.com/twitter/GraphJet) framework. Several other GraphJet based features and candidate sources are located [here](src/scala/com/twitter/recos). |
| | [follow-recommendation-service](follow-recommendations-service/README.md) (FRS)| Provides Users with recommendations for accounts to follow, and Tweets from those accounts. |
| Ranking | [light-ranker](src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md) | Light ranker model used by search index (Earlybird) to rank Tweets. |
| Ranking | [light-ranker](src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md) | Light Ranker model used by search index (Earlybird) to rank Tweets. |
| | [heavy-ranker](https://github.com/twitter/the-algorithm-ml/blob/main/projects/home/recap/README.md) | Neural network for ranking candidate tweets. One of the main signals used to select timeline Tweets post candidate sourcing. |
| Tweet mixing & filtering | [home-mixer](home-mixer/README.md) | Main service used to construct and serve the Home Timeline. Built on [product-mixer](product-mixer/README.md) |
| Tweet mixing & filtering | [home-mixer](home-mixer/README.md) | Main service used to construct and serve the Home Timeline. Built on [product-mixer](product-mixer/README.md). |
| | [visibility-filters](visibilitylib/README.md) | Responsible for filtering Twitter content to support legal compliance, improve product quality, increase user trust, protect revenue through the use of hard-filtering, visible product treatments, and coarse-grained downranking. |
| | [timelineranker](timelineranker/README.md) | Legacy service which provides relevance-scored tweets from the Earlybird Search Index and UTEG service. |
| Software framework | [navi](navi/navi/README.md) | High performance, machine learning model serving written in Rust. |
| Software framework | [navi](navi/README.md) | High performance, machine learning model serving written in Rust. |
| | [product-mixer](product-mixer/README.md) | Software framework for building feeds of content. |
| | [twml](twml/README.md) | Legacy machine learning framework built on TensorFlow v1. |
We include Bazel BUILD files for most components, but not a top level BUILD or WORKSPACE file.
We include Bazel BUILD files for most components, but not a top-level BUILD or WORKSPACE file.
## Contributing

View File

@@ -0,0 +1,232 @@
import argparse
import logging
import os
import pkgutil
import sys
from urllib.parse import urlsplit
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import faiss
def parse_d6w_config(argv=None):
"""Parse d6w config.
:param argv: d6w config
:return: dictionary containing d6w config
"""
parser = argparse.ArgumentParser(
description="See https://docbird.twitter.biz/d6w/model.html for any parameters inherited from d6w job config"
)
parser.add_argument("--job_name", dest="job_name", required=True, help="d6w attribute")
parser.add_argument("--project", dest="project", required=True, help="d6w attribute")
parser.add_argument(
"--staging_location", dest="staging_location", required=True, help="d6w attribute"
)
parser.add_argument("--temp_location", dest="temp_location", required=True, help="d6w attribute")
parser.add_argument(
"--output_location",
dest="output_location",
required=True,
help="GCS bucket and path where resulting artifacts are uploaded",
)
parser.add_argument(
"--service_account_email", dest="service_account_email", required=True, help="d6w attribute"
)
parser.add_argument(
"--factory_string",
dest="factory_string",
required=False,
help="FAISS factory string describing index to build. See https://github.com/facebookresearch/faiss/wiki/The-index-factory",
)
parser.add_argument(
"--metric",
dest="metric",
required=True,
help="Metric used to compute distance between embeddings. Valid values are 'l2', 'ip', 'l1', 'linf'",
)
parser.add_argument(
"--use_gpu",
dest="gpu",
required=True,
help="--use_gpu=yes if you want to use GPU during index building",
)
known_args, unknown_args = parser.parse_known_args(argv)
d6w_config = vars(known_args)
d6w_config["gpu"] = d6w_config["gpu"].lower() == "yes"
d6w_config["metric"] = parse_metric(d6w_config)
"""
WARNING: Currently, d6w (a Twitter tool used to deploy Dataflow jobs to GCP) and
PipelineOptions.for_dataflow_runner (a helper method in twitter.ml.common.apache_beam) do not
play nicely together. The helper method will overwrite some of the config specified in the d6w
file using the defaults in https://sourcegraph.twitter.biz/git.twitter.biz/source/-/blob/src/python/twitter/ml/common/apache_beam/__init__.py?L24.
However, the d6w output message will still report that the config specified in the d6w file was used.
"""
logging.warning(
f"The following d6w config parameters will be overwritten by the defaults in "
f"https://sourcegraph.twitter.biz/git.twitter.biz/source/-/blob/src/python/twitter/ml/common/apache_beam/__init__.py?L24\n"
f"{str(unknown_args)}"
)
return d6w_config
def get_bq_query():
"""
Query is expected to return rows with unique entityId
"""
return pkgutil.get_data(__name__, "bq.sql").decode("utf-8")
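# Note: bq.sql itself is not included here. Based on how extract_output() consumes
# rows below, each row is expected to provide {"entityId": <int64>, "embedding": [<float>, ...]}.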
def parse_metric(config):
metric_str = config["metric"].lower()
if metric_str == "l2":
return faiss.METRIC_L2
elif metric_str == "ip":
return faiss.METRIC_INNER_PRODUCT
elif metric_str == "l1":
return faiss.METRIC_L1
elif metric_str == "linf":
return faiss.METRIC_Linf
else:
raise Exception(f"Unknown metric: {metric_str}")
def run_pipeline(argv=None):
    # Avoid a mutable default argument: argv is extended below, so a shared
    # default list would leak flags across calls.
    argv = list(argv) if argv is not None else []
    config = parse_d6w_config(argv)
    argv_with_extras = argv
if config["gpu"]:
argv_with_extras.extend(["--experiments", "use_runner_v2"])
argv_with_extras.extend(
["--experiments", "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"]
)
argv_with_extras.extend(
[
"--worker_harness_container_image",
"gcr.io/twttr-recos-ml-prod/dataflow-gpu/beam2_39_0_py3_7",
]
)
options = PipelineOptions(argv_with_extras)
output_bucket_name = urlsplit(config["output_location"]).netloc
with beam.Pipeline(options=options) as p:
input_data = p | "Read from BigQuery" >> beam.io.ReadFromBigQuery(
method=beam.io.ReadFromBigQuery.Method.DIRECT_READ,
query=get_bq_query(),
use_standard_sql=True,
)
index_built = input_data | "Build and upload index" >> beam.CombineGlobally(
MergeAndBuildIndex(
output_bucket_name,
config["output_location"],
config["factory_string"],
config["metric"],
config["gpu"],
)
)
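        # CombineGlobally funnels every row into a single accumulator, so the FAISS
        # index in extract_output() is built on one worker holding the full dataset
        # in memory.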
# Make linter happy
index_built
class MergeAndBuildIndex(beam.CombineFn):
def __init__(self, bucket_name, gcs_output_path, factory_string, metric, gpu):
self.bucket_name = bucket_name
self.gcs_output_path = gcs_output_path
self.factory_string = factory_string
self.metric = metric
self.gpu = gpu
def create_accumulator(self):
return []
def add_input(self, accumulator, element):
accumulator.append(element)
return accumulator
def merge_accumulators(self, accumulators):
merged = []
for accum in accumulators:
merged.extend(accum)
return merged
def extract_output(self, rows):
# Reimports are needed on workers
import glob
import subprocess
import faiss
from google.cloud import storage
import numpy as np
client = storage.Client()
bucket = client.get_bucket(self.bucket_name)
logging.info("Building FAISS index")
logging.info(f"There are {len(rows)} rows")
ids = np.array([x["entityId"] for x in rows]).astype("long")
embeds = np.array([x["embedding"] for x in rows]).astype("float32")
dimensions = len(embeds[0])
N = ids.shape[0]
logging.info(f"There are {dimensions} dimensions")
if self.factory_string is None:
M = 48
            divisible_dimensions = (dimensions // M) * M
            if divisible_dimensions != dimensions:
                opq_prefix = f"OPQ{M}_{divisible_dimensions}"
else:
opq_prefix = f"OPQ{M}"
clusters = N // 20
self.factory_string = f"{opq_prefix},IVF{clusters},PQ{M}"
logging.info(f"Factory string is {self.factory_string}, metric={self.metric}")
if self.gpu:
logging.info("Using GPU")
res = faiss.StandardGpuResources()
cpu_index = faiss.index_factory(dimensions, self.factory_string, self.metric)
cpu_index = faiss.IndexIDMap(cpu_index)
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)
gpu_index.train(embeds)
gpu_index.add_with_ids(embeds, ids)
cpu_index = faiss.index_gpu_to_cpu(gpu_index)
else:
logging.info("Using CPU")
cpu_index = faiss.index_factory(dimensions, self.factory_string, self.metric)
cpu_index = faiss.IndexIDMap(cpu_index)
cpu_index.train(embeds)
cpu_index.add_with_ids(embeds, ids)
logging.info("Built faiss index")
local_path = "/indices"
logging.info(f"Writing indices to local {local_path}")
subprocess.run(f"mkdir -p {local_path}".strip().split())
local_index_path = os.path.join(local_path, "result.index")
faiss.write_index(cpu_index, local_index_path)
logging.info(f"Done writing indices to local {local_path}")
logging.info(f"Uploading to GCS with path {self.gcs_output_path}")
assert os.path.isdir(local_path)
for local_file in glob.glob(local_path + "/*"):
remote_path = os.path.join(
self.gcs_output_path.split("/")[-1], local_file[1 + len(local_path) :]
)
blob = bucket.blob(remote_path)
blob.upload_from_filename(local_file)
if __name__ == "__main__":
logging.getLogger().setLevel(logging.INFO)
run_pipeline(sys.argv)

cr-mixer/README.md Normal file
View File

@@ -0,0 +1,7 @@
# CR-Mixer
CR-Mixer is a candidate generation service proposed as part of the Personalization Strategy vision for Twitter. Its aim is to speed up the iteration and development of candidate generation and light ranking. The service acts as a lightweight coordinating layer that delegates candidate generation tasks to underlying compute services. It focuses on Twitter's candidate generation use cases and offers a centralized platform for fetching, mixing, and managing candidate sources and light rankers. The overarching goal is to increase the speed and ease of testing and developing candidate generation pipelines, ultimately delivering more value to Twitter users.
CR-Mixer acts as a configurator and delegator, providing abstractions for the challenging parts of candidate generation and handling performance issues. It will offer a one-stop shop for fetching and mixing candidate sources, a managed and shared performant platform, a light ranking layer, a common filtering layer, a version control system, a co-owned feature switch set, and peripheral tooling.
CR-Mixer's pipeline consists of 4 steps: source signal extraction, candidate generation, filtering, and ranking. It also provides peripheral tooling like scribing, debugging, and monitoring. The service fetches source signals externally from stores like UserProfileService and RealGraph, calls external candidate generation services, and caches results. Filters are applied for deduping and pre-ranking, and a light ranking step follows.
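The flow can be sketched roughly as follows. This is a hedged Python illustration of the coordination pattern only, not CR-Mixer's actual (Scala) implementation, and every name in it is hypothetical:
```
from typing import Callable, List

def cr_mixer_pipeline(
    user_id: int,
    signal_stores: List[Callable],       # e.g. UserProfileService, RealGraph fetchers
    candidate_services: List[Callable],  # underlying candidate generation services
    filters: List[Callable],             # dedup / pre-ranking filters
    light_ranker: Callable,
):
    # 1. Source signal extraction: fetch signals from external stores.
    signals = [s for store in signal_stores for s in store(user_id)]
    # 2. Candidate generation: delegate to underlying compute services.
    candidates = [c for svc in candidate_services for c in svc(signals)]
    # 3. Filtering: dedup and pre-ranking filters applied in sequence.
    for f in filters:
        candidates = f(candidates)
    # 4. Light ranking.
    return light_ranker(candidates)
```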

View File

@@ -0,0 +1,138 @@
package com.twitter.cr_mixer.similarity_engine
import com.twitter.finagle.stats.StatsReceiver
import com.twitter.search.earlybird.thriftscala.EarlybirdRequest
import com.twitter.search.earlybird.thriftscala.EarlybirdService
import com.twitter.search.earlybird.thriftscala.ThriftSearchQuery
import com.twitter.util.Time
import com.twitter.search.common.query.thriftjava.thriftscala.CollectorParams
import com.twitter.search.common.ranking.thriftscala.ThriftRankingParams
import com.twitter.search.common.ranking.thriftscala.ThriftScoringFunctionType
import com.twitter.search.earlybird.thriftscala.ThriftSearchRelevanceOptions
import javax.inject.Inject
import javax.inject.Singleton
import EarlybirdSimilarityEngineBase._
import com.twitter.cr_mixer.config.TimeoutConfig
import com.twitter.cr_mixer.similarity_engine.EarlybirdTensorflowBasedSimilarityEngine.EarlybirdTensorflowBasedSearchQuery
import com.twitter.cr_mixer.util.EarlybirdSearchUtil.EarlybirdClientId
import com.twitter.cr_mixer.util.EarlybirdSearchUtil.FacetsToFetch
import com.twitter.cr_mixer.util.EarlybirdSearchUtil.GetCollectorTerminationParams
import com.twitter.cr_mixer.util.EarlybirdSearchUtil.GetEarlybirdQuery
import com.twitter.cr_mixer.util.EarlybirdSearchUtil.MetadataOptions
import com.twitter.cr_mixer.util.EarlybirdSearchUtil.GetNamedDisjunctions
import com.twitter.search.earlybird.thriftscala.ThriftSearchRankingMode
import com.twitter.simclusters_v2.common.TweetId
import com.twitter.simclusters_v2.common.UserId
import com.twitter.util.Duration
@Singleton
case class EarlybirdTensorflowBasedSimilarityEngine @Inject() (
earlybirdSearchClient: EarlybirdService.MethodPerEndpoint,
timeoutConfig: TimeoutConfig,
stats: StatsReceiver)
extends EarlybirdSimilarityEngineBase[EarlybirdTensorflowBasedSearchQuery] {
import EarlybirdTensorflowBasedSimilarityEngine._
override val statsReceiver: StatsReceiver = stats.scope(this.getClass.getSimpleName)
override def getEarlybirdRequest(
query: EarlybirdTensorflowBasedSearchQuery
): Option[EarlybirdRequest] = {
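    // Seed user ids feed both followedUserIds here and fromUserIDFilter64 in
    // getThriftSearchQuery, so there is nothing to search when the seed set is empty.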
if (query.seedUserIds.nonEmpty)
Some(
EarlybirdRequest(
searchQuery = getThriftSearchQuery(query, timeoutConfig.earlybirdServerTimeout),
clientHost = None,
clientRequestID = None,
clientId = Some(EarlybirdClientId),
clientRequestTimeMs = Some(Time.now.inMilliseconds),
cachingParams = None,
timeoutMs = timeoutConfig.earlybirdServerTimeout.inMilliseconds.intValue(),
facetRequest = None,
termStatisticsRequest = None,
debugMode = 0,
debugOptions = None,
searchSegmentId = None,
returnStatusType = None,
successfulResponseThreshold = None,
querySource = None,
getOlderResults = Some(false),
followedUserIds = Some(query.seedUserIds),
adjustedProtectedRequestParams = None,
adjustedFullArchiveRequestParams = None,
getProtectedTweetsOnly = Some(false),
retokenizeSerializedQuery = None,
skipVeryRecentTweets = true,
experimentClusterToUse = None
))
else None
}
}
object EarlybirdTensorflowBasedSimilarityEngine {
case class EarlybirdTensorflowBasedSearchQuery(
searcherUserId: Option[UserId],
seedUserIds: Seq[UserId],
maxNumTweets: Int,
beforeTweetIdExclusive: Option[TweetId],
afterTweetIdExclusive: Option[TweetId],
filterOutRetweetsAndReplies: Boolean,
useTensorflowRanking: Boolean,
excludedTweetIds: Set[TweetId],
maxNumHitsPerShard: Int)
extends EarlybirdSearchQuery
private def getThriftSearchQuery(
query: EarlybirdTensorflowBasedSearchQuery,
processingTimeout: Duration
): ThriftSearchQuery =
ThriftSearchQuery(
serializedQuery = GetEarlybirdQuery(
query.beforeTweetIdExclusive,
query.afterTweetIdExclusive,
query.excludedTweetIds,
query.filterOutRetweetsAndReplies).map(_.serialize),
fromUserIDFilter64 = Some(query.seedUserIds),
numResults = query.maxNumTweets,
// Whether to collect conversation IDs. Remove it for now.
// collectConversationId = Gate.True(), // true for Home
rankingMode = ThriftSearchRankingMode.Relevance,
relevanceOptions = Some(getRelevanceOptions),
collectorParams = Some(
CollectorParams(
// numResultsToReturn defines how many results each EB shard will return to search root
numResultsToReturn = 1000,
// terminationParams.maxHitsToProcess is used for early terminating per shard results fetching.
terminationParams =
GetCollectorTerminationParams(query.maxNumHitsPerShard, processingTimeout)
)),
facetFieldNames = Some(FacetsToFetch),
resultMetadataOptions = Some(MetadataOptions),
searcherId = query.searcherUserId,
searchStatusIds = None,
namedDisjunctionMap = GetNamedDisjunctions(query.excludedTweetIds)
)
// The specific values of recap relevance/reranking options correspond to
// experiment: enable_recap_reranking_2988,timeline_internal_disable_recap_filter
// bucket : enable_rerank,disable_filter
private def getRelevanceOptions: ThriftSearchRelevanceOptions = {
ThriftSearchRelevanceOptions(
proximityScoring = true,
maxConsecutiveSameUser = Some(2),
rankingParams = Some(getTensorflowBasedRankingParams),
maxHitsToProcess = Some(500),
maxUserBlendCount = Some(3),
proximityPhraseWeight = 9.0,
returnAllResults = Some(true)
)
}
private def getTensorflowBasedRankingParams: ThriftRankingParams = {
ThriftRankingParams(
`type` = Some(ThriftScoringFunctionType.TensorflowBased),
selectedTensorflowModel = Some("timelines_rectweet_replica"),
minScore = -1.0e100,
applyBoosts = false,
authorSpecificScoreAdjustments = None
)
}
}

View File

@@ -0,0 +1,227 @@
package com.twitter.home_mixer.functional_component.decorator
import com.twitter.conversions.DurationOps._
import com.twitter.home_mixer.model.HomeFeatures._
import com.twitter.product_mixer.core.feature.featuremap.FeatureMap
import com.twitter.timelinemixer.injection.model.candidate.SemanticCoreFeatures
import com.twitter.tweetypie.{thriftscala => tpt}
object HomeTweetTypePredicates {
/**
* IMPORTANT: Please avoid logging tweet types that are tied to sensitive
* internal author information / labels (e.g. blink labels, abuse labels, or geo-location).
*/
private[this] val CandidatePredicates: Seq[(String, FeatureMap => Boolean)] = Seq(
("with_candidate", _ => true),
("retweet", _.getOrElse(IsRetweetFeature, false)),
("reply", _.getOrElse(InReplyToTweetIdFeature, None).nonEmpty),
("image", _.getOrElse(EarlybirdFeature, None).exists(_.hasImage)),
("video", _.getOrElse(EarlybirdFeature, None).exists(_.hasVideo)),
("link", _.getOrElse(EarlybirdFeature, None).exists(_.hasVisibleLink)),
("quote", _.getOrElse(EarlybirdFeature, None).exists(_.hasQuote.contains(true))),
("like_social_context", _.getOrElse(NonSelfFavoritedByUserIdsFeature, Seq.empty).nonEmpty),
("protected", _.getOrElse(EarlybirdFeature, None).exists(_.isProtected)),
(
"has_exclusive_conversation_author_id",
_.getOrElse(ExclusiveConversationAuthorIdFeature, None).nonEmpty),
("is_eligible_for_connect_boost", _.getOrElse(AuthorIsEligibleForConnectBoostFeature, false)),
("hashtag", _.getOrElse(EarlybirdFeature, None).exists(_.numHashtags > 0)),
("has_scheduled_space", _.getOrElse(AudioSpaceMetaDataFeature, None).exists(_.isScheduled)),
("has_recorded_space", _.getOrElse(AudioSpaceMetaDataFeature, None).exists(_.isRecorded)),
("is_read_from_cache", _.getOrElse(IsReadFromCacheFeature, false)),
(
"is_self_thread_tweet",
_.getOrElse(ConversationFeature, None).exists(_.isSelfThreadTweet.contains(true))),
("get_initial", _.getOrElse(GetInitialFeature, false)),
("get_newer", _.getOrElse(GetNewerFeature, false)),
("get_middle", _.getOrElse(GetMiddleFeature, false)),
("get_older", _.getOrElse(GetOlderFeature, false)),
("pull_to_refresh", _.getOrElse(PullToRefreshFeature, false)),
("polling", _.getOrElse(PollingFeature, false)),
("tls_size_20_plus", _ => false),
("near_empty", _ => false),
("ranked_request", _ => false),
("mutual_follow", _.getOrElse(EarlybirdFeature, None).exists(_.fromMutualFollow)),
("has_ticketed_space", _.getOrElse(AudioSpaceMetaDataFeature, None).exists(_.hasTickets)),
("in_utis_top5", _.getOrElse(PositionFeature, None).exists(_ < 5)),
("is_utis_pos0", _.getOrElse(PositionFeature, None).exists(_ == 0)),
("is_utis_pos1", _.getOrElse(PositionFeature, None).exists(_ == 1)),
("is_utis_pos2", _.getOrElse(PositionFeature, None).exists(_ == 2)),
("is_utis_pos3", _.getOrElse(PositionFeature, None).exists(_ == 3)),
("is_utis_pos4", _.getOrElse(PositionFeature, None).exists(_ == 4)),
(
"is_signup_request",
candidate => candidate.getOrElse(AccountAgeFeature, None).exists(_.untilNow < 30.minutes)),
("empty_request", _ => false),
("served_size_less_than_5", _.getOrElse(ServedSizeFeature, None).exists(_ < 5)),
("served_size_less_than_10", _.getOrElse(ServedSizeFeature, None).exists(_ < 10)),
("served_size_less_than_20", _.getOrElse(ServedSizeFeature, None).exists(_ < 20)),
("served_size_less_than_50", _.getOrElse(ServedSizeFeature, None).exists(_ < 50)),
(
"served_size_between_50_and_100",
_.getOrElse(ServedSizeFeature, None).exists(size => size >= 50 && size < 100)),
("authored_by_contextual_user", _.getOrElse(AuthoredByContextualUserFeature, false)),
("has_ancestors", _.getOrElse(AncestorsFeature, Seq.empty).nonEmpty),
("full_scoring_succeeded", _.getOrElse(FullScoringSucceededFeature, false)),
(
"account_age_less_than_30_minutes",
_.getOrElse(AccountAgeFeature, None).exists(_.untilNow < 30.minutes)),
(
"account_age_less_than_1_day",
_.getOrElse(AccountAgeFeature, None).exists(_.untilNow < 1.day)),
(
"account_age_less_than_7_days",
_.getOrElse(AccountAgeFeature, None).exists(_.untilNow < 7.days)),
(
"directed_at_user_is_in_first_degree",
_.getOrElse(EarlybirdFeature, None).exists(_.directedAtUserIdIsInFirstDegree.contains(true))),
("root_user_is_in_first_degree", _ => false),
(
"has_semantic_core_annotation",
_.getOrElse(EarlybirdFeature, None).exists(_.semanticCoreAnnotations.nonEmpty)),
("is_request_context_foreground", _.getOrElse(IsForegroundRequestFeature, false)),
(
"part_of_utt",
_.getOrElse(EarlybirdFeature, None)
.exists(_.semanticCoreAnnotations.exists(_.exists(annotation =>
annotation.domainId == SemanticCoreFeatures.UnifiedTwitterTaxonomy)))),
("is_random_tweet", _.getOrElse(IsRandomTweetFeature, false)),
("has_random_tweet_in_response", _.getOrElse(HasRandomTweetFeature, false)),
("is_random_tweet_above_in_utis", _.getOrElse(IsRandomTweetAboveFeature, false)),
("is_request_context_launch", _.getOrElse(IsLaunchRequestFeature, false)),
("viewer_is_employee", _ => false),
("viewer_is_timelines_employee", _ => false),
("viewer_follows_any_topics", _.getOrElse(UserFollowedTopicsCountFeature, None).exists(_ > 0)),
(
"has_ancestor_authored_by_viewer",
candidate =>
candidate
.getOrElse(AncestorsFeature, Seq.empty).exists(ancestor =>
candidate.getOrElse(ViewerIdFeature, 0L) == ancestor.userId)),
("ancestor", _.getOrElse(IsAncestorCandidateFeature, false)),
(
"root_ancestor",
candidate =>
candidate.getOrElse(IsAncestorCandidateFeature, false) && candidate
.getOrElse(InReplyToTweetIdFeature, None).isEmpty),
(
"deep_reply",
candidate =>
candidate.getOrElse(InReplyToTweetIdFeature, None).nonEmpty && candidate
.getOrElse(AncestorsFeature, Seq.empty).size > 2),
(
"has_simcluster_embeddings",
_.getOrElse(
SimclustersTweetTopKClustersWithScoresFeature,
Map.empty[String, Double]).nonEmpty),
(
"tweet_age_less_than_15_seconds",
_.getOrElse(OriginalTweetCreationTimeFromSnowflakeFeature, None)
.exists(_.untilNow <= 15.seconds)),
("is_followed_topic_tweet", _ => false),
("is_recommended_topic_tweet", _ => false),
("is_topic_tweet", _ => false),
("preferred_language_matches_tweet_language", _ => false),
(
"device_language_matches_tweet_language",
candidate =>
candidate.getOrElse(TweetLanguageFeature, None) ==
candidate.getOrElse(DeviceLanguageFeature, None)),
("question", _.getOrElse(EarlybirdFeature, None).exists(_.hasQuestion.contains(true))),
("in_network", _.getOrElse(FromInNetworkSourceFeature, true)),
("viewer_follows_original_author", _ => false),
("has_account_follow_prompt", _ => false),
("has_relevance_prompt", _ => false),
("has_topic_annotation_haug_prompt", _ => false),
("has_topic_annotation_random_precision_prompt", _ => false),
("has_topic_annotation_prompt", _ => false),
(
"has_political_annotation",
_.getOrElse(EarlybirdFeature, None).exists(
_.semanticCoreAnnotations.exists(
_.exists(annotation =>
SemanticCoreFeatures.PoliticalDomains.contains(annotation.domainId) ||
(annotation.domainId == SemanticCoreFeatures.UnifiedTwitterTaxonomy &&
annotation.entityId == SemanticCoreFeatures.UttPoliticsEntityId))))),
(
"is_dont_at_me_by_invitation",
_.getOrElse(EarlybirdFeature, None).exists(
_.conversationControl.exists(_.isInstanceOf[tpt.ConversationControl.ByInvitation]))),
(
"is_dont_at_me_community",
_.getOrElse(EarlybirdFeature, None)
.exists(_.conversationControl.exists(_.isInstanceOf[tpt.ConversationControl.Community]))),
("has_zero_score", _.getOrElse(ScoreFeature, None).exists(_ == 0.0)),
("is_viewer_not_invited_to_reply", _ => false),
("is_viewer_invited_to_reply", _ => false),
("has_gte_10_favs", _.getOrElse(EarlybirdFeature, None).exists(_.favCountV2.exists(_ >= 10))),
("has_gte_100_favs", _.getOrElse(EarlybirdFeature, None).exists(_.favCountV2.exists(_ >= 100))),
("has_gte_1k_favs", _.getOrElse(EarlybirdFeature, None).exists(_.favCountV2.exists(_ >= 1000))),
(
"has_gte_10k_favs",
_.getOrElse(EarlybirdFeature, None).exists(_.favCountV2.exists(_ >= 10000))),
(
"has_gte_100k_favs",
_.getOrElse(EarlybirdFeature, None).exists(_.favCountV2.exists(_ >= 100000))),
("above_neighbor_is_topic_tweet", _ => false),
("is_topic_tweet_with_neighbor_below", _ => false),
("has_audio_space", _.getOrElse(AudioSpaceMetaDataFeature, None).exists(_.hasSpace)),
("has_live_audio_space", _.getOrElse(AudioSpaceMetaDataFeature, None).exists(_.isLive)),
(
"has_gte_10_retweets",
_.getOrElse(EarlybirdFeature, None).exists(_.retweetCountV2.exists(_ >= 10))),
(
"has_gte_100_retweets",
_.getOrElse(EarlybirdFeature, None).exists(_.retweetCountV2.exists(_ >= 100))),
(
"has_gte_1k_retweets",
_.getOrElse(EarlybirdFeature, None).exists(_.retweetCountV2.exists(_ >= 1000))),
(
"has_us_political_annotation",
_.getOrElse(EarlybirdFeature, None)
.exists(_.semanticCoreAnnotations.exists(_.exists(annotation =>
annotation.domainId == SemanticCoreFeatures.UnifiedTwitterTaxonomy &&
annotation.entityId == SemanticCoreFeatures.usPoliticalTweetEntityId &&
annotation.groupId == SemanticCoreFeatures.UsPoliticalTweetAnnotationGroupIds.BalancedV0)))),
(
"has_toxicity_score_above_threshold",
_.getOrElse(EarlybirdFeature, None).exists(_.toxicityScore.exists(_ > 0.91))),
(
"text_only",
candidate =>
candidate.getOrElse(HasDisplayedTextFeature, false) &&
!(candidate.getOrElse(EarlybirdFeature, None).exists(_.hasImage) ||
candidate.getOrElse(EarlybirdFeature, None).exists(_.hasVideo) ||
candidate.getOrElse(EarlybirdFeature, None).exists(_.hasCard))),
(
"image_only",
candidate =>
candidate.getOrElse(EarlybirdFeature, None).exists(_.hasImage) &&
!candidate.getOrElse(HasDisplayedTextFeature, false)),
("has_1_image", _.getOrElse(NumImagesFeature, None).exists(_ == 1)),
("has_2_images", _.getOrElse(NumImagesFeature, None).exists(_ == 2)),
("has_3_images", _.getOrElse(NumImagesFeature, None).exists(_ == 3)),
("has_4_images", _.getOrElse(NumImagesFeature, None).exists(_ == 4)),
("has_card", _.getOrElse(EarlybirdFeature, None).exists(_.hasCard)),
("3_or_more_consecutive_not_in_network", _ => false),
("2_or_more_consecutive_not_in_network", _ => false),
("5_out_of_7_not_in_network", _ => false),
("7_out_of_7_not_in_network", _ => false),
("5_out_of_5_not_in_network", _ => false),
("user_follow_count_gte_50", _.getOrElse(UserFollowingCountFeature, None).exists(_ > 50)),
("has_liked_by_social_context", _ => false),
("has_followed_by_social_context", _ => false),
("has_topic_social_context", _ => false),
("timeline_entry_has_banner", _ => false),
("served_in_conversation_module", _.getOrElse(ServedInConversationModuleFeature, false)),
(
"conversation_module_has_2_displayed_tweets",
_.getOrElse(ConversationModule2DisplayedTweetsFeature, false)),
("conversation_module_has_gap", _.getOrElse(ConversationModuleHasGapFeature, false)),
("served_in_recap_tweet_candidate_module_injection", _ => false),
("served_in_threaded_conversation_module", _ => false)
)
val PredicateMap = CandidatePredicates.toMap
}

View File

@@ -0,0 +1,49 @@
package com.twitter.home_mixer.util.earlybird
import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant
import com.twitter.search.common.ranking.{thriftscala => scr}
import com.twitter.search.earlybird.{thriftscala => eb}
object RelevanceSearchUtil {
val Mentions: String = EarlybirdFieldConstant.MENTIONS_FACET
val Hashtags: String = EarlybirdFieldConstant.HASHTAGS_FACET
val FacetsToFetch: Seq[String] = Seq(Mentions, Hashtags)
private val RankingParams: scr.ThriftRankingParams = {
scr.ThriftRankingParams(
`type` = Some(scr.ThriftScoringFunctionType.TensorflowBased),
selectedTensorflowModel = Some("timelines_rectweet_replica"),
minScore = -1.0e100,
selectedModels = Some(Map("home_mixer_unified_engagement_prod" -> 1.0)),
applyBoosts = false,
)
}
val MetadataOptions: eb.ThriftSearchResultMetadataOptions = {
eb.ThriftSearchResultMetadataOptions(
getTweetUrls = true,
getResultLocation = false,
getLuceneScore = false,
getInReplyToStatusId = true,
getReferencedTweetAuthorId = true,
getMediaBits = true,
getAllFeatures = true,
returnSearchResultFeatures = true,
// Set getExclusiveConversationAuthorId in order to retrieve Exclusive / SuperFollow tweets.
getExclusiveConversationAuthorId = true
)
}
val RelevanceOptions: eb.ThriftSearchRelevanceOptions = {
eb.ThriftSearchRelevanceOptions(
proximityScoring = true,
maxConsecutiveSameUser = Some(2),
rankingParams = Some(RankingParams),
maxHitsToProcess = Some(500),
maxUserBlendCount = Some(3),
proximityPhraseWeight = 9.0,
returnAllResults = Some(true)
)
}
}

navi/README.md Normal file
View File

@@ -0,0 +1,36 @@
# Navi: High-Performance Machine Learning Serving Server in Rust
Navi is a high-performance, versatile machine learning serving server implemented in Rust and tailored for production usage. It's designed to efficiently serve within the Twitter tech stack, offering top-notch performance while focusing on core features.
## Key Features
- **Minimalist Design Optimized for Production Use Cases**: Navi delivers ultra-high performance, stability, and availability, engineered to handle real-world application demands with a streamlined codebase.
- **gRPC API Compatibility with TensorFlow Serving**: Seamless integration with existing TensorFlow Serving clients via its gRPC API, enabling easy integration, smooth deployment, and scaling in production environments; see the client sketch after this list.
- **Plugin Architecture for Different Runtimes**: Navi's pluggable architecture supports various machine learning runtimes, providing adaptability and extensibility for diverse use cases. Out-of-the-box support is available for TensorFlow and Onnx Runtime, with PyTorch in an experimental state.
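Because the API is TensorFlow Serving compatible, a stock TF Serving gRPC client should be able to call a running Navi instance. A minimal Python sketch, assuming a hypothetical Navi deployment at `localhost:9000` serving a model named `my_model` with a string input tensor called `examples` (all three are assumptions, not values from this repo):
```
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")  # hypothetical Navi endpoint
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"  # hypothetical model name
request.inputs["examples"].CopyFrom(
    tf.make_tensor_proto(["serialized-input"], dtype=tf.string)
)
response = stub.Predict(request, timeout=5.0)
print(response.outputs)
```
Navi for Onnx primarily expects a single string input tensor (see Current State below), which is why this sketch sends one string input.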
## Current State
While Navi's features may not be as comprehensive as those of its open-source counterparts, its performance-first mindset makes it highly efficient.
- Navi for TensorFlow is currently the most feature-complete, supporting multiple input tensors of different types (float, int, string, etc.).
- Navi for Onnx primarily supports one input tensor of type string, used in Twitter's home recommendation with a proprietary BatchPredictRequest format.
- Navi for Pytorch is compilable and runnable but not yet production-ready in terms of performance and stability.
## Directory Structure
- `navi`: The main code repository for Navi
- `dr_transform`: Twitter-specific converter that converts BatchPredictionRequest Thrift to ndarray
- `segdense`: Twitter-specific config to specify how to retrieve feature values from BatchPredictionRequest
- `thrift_bpr_adapter`: generated thrift code for BatchPredictionRequest
## Content
We have included all *.rs source code files that make up the main Navi binaries for you to examine. However, we have not included the test and benchmark code, or various configuration files, due to data security concerns.
## Run
In navi/navi, you can run the following commands:
- `scripts/run_tf2.sh` for [TensorFlow](https://www.tensorflow.org/)
- `scripts/run_onnx.sh` for [Onnx](https://onnx.ai/)
Note that you need to create a models directory and add some model versions, preferably named using epoch time, e.g., `1679693908377`.
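A minimal sketch of creating one such version directory (the `models` base path and model name here are assumptions; check the run scripts for the exact paths they expect):
```
import os
import time

# Version directories named by epoch millis, e.g. 1679693908377.
version = str(int(time.time() * 1000))
os.makedirs(os.path.join("models", "my_model", version), exist_ok=True)
```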
## Build
You can adapt the above scripts to build using Cargo.

View File

@@ -0,0 +1,48 @@
use serde::{Deserialize, Serialize};
use serde_json::Error;
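// The field-level `rename` attributes below take precedence over the container-level
// `rename_all = "camelCase"` policy, so these fields bind to snake_case JSON keys.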
#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct AllConfig {
#[serde(rename = "train_data")]
pub train_data: TrainData,
}
#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct TrainData {
#[serde(rename = "seg_dense_schema")]
pub seg_dense_schema: SegDenseSchema,
}
#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct SegDenseSchema {
#[serde(rename = "renamed_features")]
pub renamed_features: RenamedFeatures,
}
#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct RenamedFeatures {
pub continuous: String,
pub binary: String,
pub discrete: String,
#[serde(rename = "author_embedding")]
pub author_embedding: String,
#[serde(rename = "user_embedding")]
pub user_embedding: String,
#[serde(rename = "user_eng_embedding")]
pub user_eng_embedding: String,
#[serde(rename = "meta__author_id")]
pub meta_author_id: String,
#[serde(rename = "meta__user_id")]
pub meta_user_id: String,
#[serde(rename = "meta__tweet_id")]
pub meta_tweet_id: String,
}
pub fn parse(json_str: &str) -> Result<AllConfig, Error> {
serde_json::from_str(json_str)
}

View File

@@ -0,0 +1,586 @@
use std::collections::BTreeSet;
use std::fmt::{self, Debug, Display};
use std::fs;
use bpr_thrift::data::DataRecord;
use bpr_thrift::prediction_service::BatchPredictionRequest;
use bpr_thrift::tensor::GeneralTensor;
use log::debug;
use ndarray::Array2;
use once_cell::sync::OnceCell;
use ort::tensor::InputTensor;
use prometheus::{HistogramOpts, HistogramVec};
use segdense::mapper::{FeatureMapper, MapReader};
use segdense::segdense_transform_spec_home_recap_2022::{DensificationTransformSpec, Root};
use segdense::util;
use thrift::protocol::{TBinaryInputProtocol, TSerializable};
use thrift::transport::TBufferChannel;
use crate::{all_config, all_config::AllConfig};
pub fn log_feature_match(
dr: &DataRecord,
seg_dense_config: &DensificationTransformSpec,
dr_type: String,
) {
// Note the following algorithm matches features from config using linear search.
    // Also the record source is MinDataRecord. This includes only binary and continuous features for now.
for (feature_id, feature_value) in dr.continuous_features.as_ref().unwrap() {
debug!(
"{dr_type} - Continuous Datarecord => Feature ID: {feature_id}, Feature value: {feature_value}"
);
for input_feature in &seg_dense_config.cont.input_features {
if input_feature.feature_id == *feature_id {
debug!("Matching input feature: {input_feature:?}")
}
}
}
for feature_id in dr.binary_features.as_ref().unwrap() {
debug!("{dr_type} - Binary Datarecord => Feature ID: {feature_id}");
for input_feature in &seg_dense_config.binary.input_features {
if input_feature.feature_id == *feature_id {
debug!("Found input feature: {input_feature:?}")
}
}
}
}
pub fn log_feature_matches(drs: &Vec<DataRecord>, seg_dense_config: &DensificationTransformSpec) {
for dr in drs {
log_feature_match(dr, seg_dense_config, String::from("individual"));
}
}
pub trait Converter: Send + Sync + Debug + 'static + Display {
fn convert(&self, input: Vec<Vec<u8>>) -> (Vec<InputTensor>, Vec<usize>);
}
#[derive(Debug)]
#[allow(dead_code)]
pub struct BatchPredictionRequestToTorchTensorConverter {
all_config: AllConfig,
seg_dense_config: Root,
all_config_path: String,
seg_dense_config_path: String,
feature_mapper: FeatureMapper,
user_embedding_feature_id: i64,
user_eng_embedding_feature_id: i64,
author_embedding_feature_id: i64,
discrete_features_to_report: BTreeSet<i64>,
continuous_features_to_report: BTreeSet<i64>,
discrete_feature_metrics: &'static HistogramVec,
continuous_feature_metrics: &'static HistogramVec,
}
impl Display for BatchPredictionRequestToTorchTensorConverter {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(
f,
"all_config_path: {}, seg_dense_config_path:{}",
self.all_config_path, self.seg_dense_config_path
)
}
}
impl BatchPredictionRequestToTorchTensorConverter {
pub fn new(
model_dir: &str,
model_version: &str,
reporting_feature_ids: Vec<(i64, &str)>,
register_metric_fn: Option<impl Fn(&HistogramVec)>,
) -> BatchPredictionRequestToTorchTensorConverter {
let all_config_path = format!("{model_dir}/{model_version}/all_config.json");
let seg_dense_config_path =
format!("{model_dir}/{model_version}/segdense_transform_spec_home_recap_2022.json");
let seg_dense_config = util::load_config(&seg_dense_config_path);
let all_config = all_config::parse(
&fs::read_to_string(&all_config_path)
.unwrap_or_else(|error| panic!("error loading all_config.json - {error}")),
)
.unwrap();
let feature_mapper = util::load_from_parsed_config_ref(&seg_dense_config);
let user_embedding_feature_id = Self::get_feature_id(
&all_config
.train_data
.seg_dense_schema
.renamed_features
.user_embedding,
&seg_dense_config,
);
let user_eng_embedding_feature_id = Self::get_feature_id(
&all_config
.train_data
.seg_dense_schema
.renamed_features
.user_eng_embedding,
&seg_dense_config,
);
let author_embedding_feature_id = Self::get_feature_id(
&all_config
.train_data
.seg_dense_schema
.renamed_features
.author_embedding,
&seg_dense_config,
);
static METRICS: OnceCell<(HistogramVec, HistogramVec)> = OnceCell::new();
let (discrete_feature_metrics, continuous_feature_metrics) = METRICS.get_or_init(|| {
let discrete = HistogramVec::new(
HistogramOpts::new(":navi:feature_id:discrete", "Discrete Feature ID values")
.buckets(Vec::from([
0.0f64, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0,
120.0, 130.0, 140.0, 150.0, 160.0, 170.0, 180.0, 190.0, 200.0, 250.0,
300.0, 500.0, 1000.0, 10000.0, 100000.0,
])),
&["feature_id"],
)
.expect("metric cannot be created");
let continuous = HistogramVec::new(
HistogramOpts::new(
":navi:feature_id:continuous",
"continuous Feature ID values",
)
.buckets(Vec::from([
0.0f64, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0,
120.0, 130.0, 140.0, 150.0, 160.0, 170.0, 180.0, 190.0, 200.0, 250.0, 300.0,
500.0, 1000.0, 10000.0, 100000.0,
])),
&["feature_id"],
)
.expect("metric cannot be created");
if let Some(r) = register_metric_fn {
r(&discrete);
r(&continuous);
}
(discrete, continuous)
});
let mut discrete_features_to_report = BTreeSet::new();
let mut continuous_features_to_report = BTreeSet::new();
for (feature_id, feature_type) in reporting_feature_ids.iter() {
match *feature_type {
"discrete" => discrete_features_to_report.insert(*feature_id),
"continuous" => continuous_features_to_report.insert(*feature_id),
_ => panic!("Invalid feature type {feature_type} for reporting metrics!"),
};
}
BatchPredictionRequestToTorchTensorConverter {
all_config,
seg_dense_config,
all_config_path,
seg_dense_config_path,
feature_mapper,
user_embedding_feature_id,
user_eng_embedding_feature_id,
author_embedding_feature_id,
discrete_features_to_report,
continuous_features_to_report,
discrete_feature_metrics,
continuous_feature_metrics,
}
}
fn get_feature_id(feature_name: &str, seg_dense_config: &Root) -> i64 {
// given a feature name, we get the complex feature type id
for feature in &seg_dense_config.complex_feature_type_transform_spec {
if feature.full_feature_name == feature_name {
return feature.feature_id;
}
}
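        // Feature name not found in the config: fall back to a sentinel id.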
-1
}
fn parse_batch_prediction_request(bytes: Vec<u8>) -> BatchPredictionRequest {
// parse batch prediction request into a struct from byte array repr.
let mut bc = TBufferChannel::with_capacity(bytes.len(), 0);
bc.set_readable_bytes(&bytes);
let mut protocol = TBinaryInputProtocol::new(bc, true);
BatchPredictionRequest::read_from_in_protocol(&mut protocol).unwrap()
}
fn get_embedding_tensors(
&self,
bprs: &[BatchPredictionRequest],
feature_id: i64,
batch_size: &[usize],
) -> Array2<f32> {
// given an embedding feature id, extract the float tensor array into tensors.
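        // `batch_size` holds cumulative row offsets per request (a running total),
        // so its last entry is the total number of rows across the whole batch.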
let cols: usize = 200;
let rows: usize = batch_size[batch_size.len() - 1];
let total_size = rows * cols;
let mut working_set = vec![0 as f32; total_size];
let mut bpr_start = 0;
for (bpr, &bpr_end) in bprs.iter().zip(batch_size) {
if bpr.common_features.is_some()
&& bpr.common_features.as_ref().unwrap().tensors.is_some()
&& bpr
.common_features
.as_ref()
.unwrap()
.tensors
.as_ref()
.unwrap()
.contains_key(&feature_id)
{
let source_tensor = bpr
.common_features
.as_ref()
.unwrap()
.tensors
.as_ref()
.unwrap()
.get(&feature_id)
.unwrap();
let tensor = match source_tensor {
GeneralTensor::FloatTensor(float_tensor) =>
{
float_tensor
.floats
.iter()
.map(|x| x.into_inner() as f32)
.collect::<Vec<_>>()
}
_ => vec![0 as f32; cols],
};
// since the tensor is found in common feature, add it in all batches
for row in bpr_start..bpr_end {
for col in 0..cols {
working_set[row * cols + col] = tensor[col];
}
}
}
// find the feature in individual feature list and add to corresponding batch.
for (index, datarecord) in bpr.individual_features_list.iter().enumerate() {
if datarecord.tensors.is_some()
&& datarecord
.tensors
.as_ref()
.unwrap()
.contains_key(&feature_id)
{
let source_tensor = datarecord
.tensors
.as_ref()
.unwrap()
.get(&feature_id)
.unwrap();
let tensor = match source_tensor {
GeneralTensor::FloatTensor(float_tensor) => float_tensor
.floats
.iter()
.map(|x| x.into_inner() as f32)
.collect::<Vec<_>>(),
_ => vec![0 as f32; cols],
};
for col in 0..cols {
working_set[(bpr_start + index) * cols + col] = tensor[col];
}
}
}
bpr_start = bpr_end;
}
Array2::<f32>::from_shape_vec([rows, cols], working_set).unwrap()
}
    // TODO: Refactor to create a generic version with different type and field accessors.
    // For example, parameterize and then instantiate the following:
// (FLOAT --> FLOAT, DataRecord.continuous_feature)
// (BOOL --> INT64, DataRecord.binary_feature)
// (INT64 --> INT64, DataRecord.discrete_feature)
fn get_continuous(&self, bprs: &[BatchPredictionRequest], batch_ends: &[usize]) -> InputTensor {
// These need to be part of model schema
let rows = batch_ends[batch_ends.len() - 1];
let cols = 5293;
let full_size = rows * cols;
let default_val = f32::NAN;
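        // Cells never written below remain NaN, which distinguishes features that are
        // absent from features whose value is genuinely 0.0.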
let mut tensor = vec![default_val; full_size];
let mut bpr_start = 0;
for (bpr, &bpr_end) in bprs.iter().zip(batch_ends) {
// Common features
if bpr.common_features.is_some()
&& bpr
.common_features
.as_ref()
.unwrap()
.continuous_features
.is_some()
{
let common_features = bpr
.common_features
.as_ref()
.unwrap()
.continuous_features
.as_ref()
.unwrap();
for feature in common_features {
if let Some(f_info) = self.feature_mapper.get(feature.0) {
let idx = f_info.index_within_tensor as usize;
if idx < cols {
// Set value in each row
for r in bpr_start..bpr_end {
let flat_index = r * cols + idx;
tensor[flat_index] = feature.1.into_inner() as f32;
}
}
}
if self.continuous_features_to_report.contains(feature.0) {
self.continuous_feature_metrics
.with_label_values(&[feature.0.to_string().as_str()])
.observe(feature.1.into_inner())
} else if self.discrete_features_to_report.contains(feature.0) {
self.discrete_feature_metrics
.with_label_values(&[feature.0.to_string().as_str()])
.observe(feature.1.into_inner())
}
}
}
// Process the batch of datarecords
for r in bpr_start..bpr_end {
let dr: &DataRecord = &bpr.individual_features_list[r - bpr_start];
if dr.continuous_features.is_some() {
for feature in dr.continuous_features.as_ref().unwrap() {
if let Some(f_info) = self.feature_mapper.get(feature.0) {
let idx = f_info.index_within_tensor as usize;
let flat_index = r * cols + idx;
if flat_index < tensor.len() && idx < cols {
tensor[flat_index] = feature.1.into_inner() as f32;
}
}
if self.continuous_features_to_report.contains(feature.0) {
self.continuous_feature_metrics
.with_label_values(&[feature.0.to_string().as_str()])
.observe(feature.1.into_inner())
} else if self.discrete_features_to_report.contains(feature.0) {
self.discrete_feature_metrics
.with_label_values(&[feature.0.to_string().as_str()])
.observe(feature.1.into_inner())
}
}
}
}
bpr_start = bpr_end;
}
InputTensor::FloatTensor(
Array2::<f32>::from_shape_vec([rows, cols], tensor)
.unwrap()
.into_dyn(),
)
}
fn get_binary(&self, bprs: &[BatchPredictionRequest], batch_ends: &[usize]) -> InputTensor {
// These need to be part of model schema
let rows = batch_ends[batch_ends.len() - 1];
let cols = 149;
let full_size = rows * cols;
let default_val = 0;
let mut v = vec![default_val; full_size];
let mut bpr_start = 0;
for (bpr, &bpr_end) in bprs.iter().zip(batch_ends) {
// Common features
if bpr.common_features.is_some()
&& bpr
.common_features
.as_ref()
.unwrap()
.binary_features
.is_some()
{
let common_features = bpr
.common_features
.as_ref()
.unwrap()
.binary_features
.as_ref()
.unwrap();
for feature in common_features {
if let Some(f_info) = self.feature_mapper.get(feature) {
let idx = f_info.index_within_tensor as usize;
if idx < cols {
// Set value in each row
for r in bpr_start..bpr_end {
let flat_index = r * cols + idx;
v[flat_index] = 1;
}
}
}
}
}
// Process the batch of datarecords
for r in bpr_start..bpr_end {
let dr: &DataRecord = &bpr.individual_features_list[r - bpr_start];
if dr.binary_features.is_some() {
for feature in dr.binary_features.as_ref().unwrap() {
if let Some(f_info) = self.feature_mapper.get(feature) {
let idx = f_info.index_within_tensor as usize;
let flat_index = r * cols + idx;
v[flat_index] = 1;
}
}
}
}
bpr_start = bpr_end;
}
InputTensor::Int64Tensor(
Array2::<i64>::from_shape_vec([rows, cols], v)
.unwrap()
.into_dyn(),
)
}
#[allow(dead_code)]
fn get_discrete(&self, bprs: &[BatchPredictionRequest], batch_ends: &[usize]) -> InputTensor {
// These need to be part of model schema
let rows = batch_ends[batch_ends.len() - 1];
let cols = 320;
let full_size = rows * cols;
let default_val = 0;
let mut v = vec![default_val; full_size];
let mut bpr_start = 0;
for (bpr, &bpr_end) in bprs.iter().zip(batch_ends) {
// Common features
if bpr.common_features.is_some()
&& bpr
.common_features
.as_ref()
.unwrap()
.discrete_features
.is_some()
{
let common_features = bpr
.common_features
.as_ref()
.unwrap()
.discrete_features
.as_ref()
.unwrap();
for feature in common_features {
if let Some(f_info) = self.feature_mapper.get(feature.0) {
let idx = f_info.index_within_tensor as usize;
if idx < cols {
// Set value in each row
for r in bpr_start..bpr_end {
let flat_index = r * cols + idx;
v[flat_index] = *feature.1;
}
}
}
if self.discrete_features_to_report.contains(feature.0) {
self.discrete_feature_metrics
.with_label_values(&[feature.0.to_string().as_str()])
.observe(*feature.1 as f64)
}
}
}
// Process the batch of datarecords
for r in bpr_start..bpr_end {
let dr: &DataRecord = &bpr.individual_features_list[r];
if dr.discrete_features.is_some() {
for feature in dr.discrete_features.as_ref().unwrap() {
if let Some(f_info) = self.feature_mapper.get(feature.0) {
let idx = f_info.index_within_tensor as usize;
let flat_index = r * cols + idx;
if flat_index < v.len() && idx < cols {
v[flat_index] = *feature.1;
}
}
if self.discrete_features_to_report.contains(feature.0) {
self.discrete_feature_metrics
.with_label_values(&[feature.0.to_string().as_str()])
.observe(*feature.1 as f64)
}
}
}
}
bpr_start = bpr_end;
}
InputTensor::Int64Tensor(
Array2::<i64>::from_shape_vec([rows, cols], v)
.unwrap()
.into_dyn(),
)
}
fn get_user_embedding(
&self,
bprs: &[BatchPredictionRequest],
batch_ends: &[usize],
) -> InputTensor {
InputTensor::FloatTensor(
self.get_embedding_tensors(bprs, self.user_embedding_feature_id, batch_ends)
.into_dyn(),
)
}
fn get_eng_embedding(
&self,
bpr: &[BatchPredictionRequest],
batch_ends: &[usize],
) -> InputTensor {
InputTensor::FloatTensor(
self.get_embedding_tensors(bpr, self.user_eng_embedding_feature_id, batch_ends)
.into_dyn(),
)
}
fn get_author_embedding(
&self,
bpr: &[BatchPredictionRequest],
batch_ends: &[usize],
) -> InputTensor {
InputTensor::FloatTensor(
self.get_embedding_tensors(bpr, self.author_embedding_feature_id, batch_ends)
.into_dyn(),
)
}
}
impl Converter for BatchPredictionRequestToTorchTensorConverter {
fn convert(&self, batched_bytes: Vec<Vec<u8>>) -> (Vec<InputTensor>, Vec<usize>) {
let bprs = batched_bytes
.into_iter()
.map(|bytes| {
BatchPredictionRequestToTorchTensorConverter::parse_batch_prediction_request(bytes)
})
.collect::<Vec<_>>();
let batch_ends = bprs
.iter()
.map(|bpr| bpr.individual_features_list.len())
.scan(0usize, |acc, e| {
//running total
*acc += e;
Some(*acc)
})
.collect::<Vec<_>>();
let t1 = self.get_continuous(&bprs, &batch_ends);
let t2 = self.get_binary(&bprs, &batch_ends);
//let _t3 = self.get_discrete(&bprs, &batch_ends);
let t4 = self.get_user_embedding(&bprs, &batch_ends);
let t5 = self.get_eng_embedding(&bprs, &batch_ends);
let t6 = self.get_author_embedding(&bprs, &batch_ends);
(vec![t1, t2, t4, t5, t6], batch_ends)
}
}

View File

@@ -0,0 +1,32 @@
use npyz::WriterBuilder;
use npyz::{AutoSerialize, WriteOptions};
use std::io::BufWriter;
use std::{
fs::File,
io::{self, BufRead},
};
pub fn load_batch_prediction_request_base64(file_name: &str) -> Vec<Vec<u8>> {
let file = File::open(file_name).expect("could not read file");
let mut result = vec![];
    for (line_number, line) in io::BufReader::new(file).lines().enumerate() {
        let line_count = line_number + 1; // 1-based line numbers for error reporting
match base64::decode(line.unwrap().trim()) {
Ok(payload) => result.push(payload),
Err(err) => println!("error decoding line {file_name}:{line_count} - {err}"),
}
}
println!("result len: {}", result.len());
result
}
pub fn save_to_npy<T: npyz::Serialize + AutoSerialize>(data: &[T], save_to: String) {
let mut writer = WriteOptions::new()
.default_dtype()
.shape(&[data.len() as u64, 1])
.writer(BufWriter::new(File::create(save_to).unwrap()))
.begin_nd()
.unwrap();
writer.extend(data.to_owned()).unwrap();
writer.finish().unwrap();
}

recos-injector/README.md Normal file
View File

@@ -0,0 +1,40 @@
# Recos-Injector
Recos-Injector is a streaming event processor used to build input streams for GraphJet-based services. It is a general-purpose tool that consumes arbitrary incoming event streams (e.g., Fav, RT, Follow, client_events, etc.), applies filtering, and combines and publishes cleaned-up events to corresponding GraphJet services. Each GraphJet-based service subscribes to a dedicated Kafka topic, and Recos-Injector enables GraphJet-based services to consume any events they want.
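The pattern looks roughly like the following hedged Python sketch of a consume-filter-republish loop; the topic names, event fields, and cleaning logic are all hypothetical, and the real service is not written in Python:
```
import json
from kafka import KafkaConsumer, KafkaProducer  # assumes kafka-python

consumer = KafkaConsumer(
    "raw_engagement_events",  # hypothetical input topic
    value_deserializer=lambda raw: json.loads(raw),
)
producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode())

# One dedicated output topic per downstream GraphJet service (names hypothetical).
TOPIC_BY_EVENT_TYPE = {
    "fav": "user_tweet_entity_graph_events",
    "retweet": "user_tweet_entity_graph_events",
    "follow": "user_user_graph_events",
}

for record in consumer:
    event = record.value
    topic = TOPIC_BY_EVENT_TYPE.get(event.get("type"))
    if topic is None:
        continue  # filter out event types no service subscribes to
    # "Clean up" the event: keep only the fields downstream services need.
    cleaned = {k: event[k] for k in ("type", "userId", "targetId") if k in event}
    producer.send(topic, cleaned)
```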
## How to run Recos-Injector server tests
You can run the tests using the following commands from your project's root directory:
$ bazel build recos-injector/...
$ bazel test recos-injector/...
## How to run recos-injector-server in development on a local machine
The simplest way to stand up a service is to run it locally. To run
recos-injector-server in development mode, compile the project and then
execute it with `bazel run`:
$ bazel build recos-injector/server:bin
$ bazel run recos-injector/server:bin
A tunnel can be set up in order for downstream queries to work properly.
Upon successful server startup, try to `curl` its admin endpoint in another
terminal:
$ curl -s localhost:9990/admin/ping
pong
Run `curl -s localhost:9990/admin` to see a list of all available admin endpoints.
## Querying Recos-Injector server from a Scala console
Recos-Injector does not have a Thrift endpoint. Instead, it reads Event Bus and Kafka queues and writes to the Recos-Injector Kafka.
## Generating a package for deployment
To package your service into a zip file for deployment, run:
$ bazel bundle recos-injector/server:bin --bundle-jvm-archive=zip
If the command is successful, a file named `dist/recos-injector-server.zip` will be created.

simclusters-ann/README.md Normal file
View File

@@ -0,0 +1,99 @@
# SimClusters ANN
SimClusters ANN is a service that returns tweet candidate recommendations given a SimClusters embedding. The service implements tweet recommendations based on the Approximate Cosine Similarity algorithm.
The cosine similarity between two Tweet SimClusters Embeddings represents the relevance level of the two tweets in SimClusters space. The traditional algorithm for calculating cosine similarity is expensive and hard to support with the existing infrastructure. Therefore, the Approximate Cosine Similarity algorithm was introduced to save response time by reducing I/O operations.
## Background
SimClusters V2 runtime infra introduces SimClusters and its online and offline approaches. A Heron job builds the mapping between SimClusters and Tweets, saving the top 400 Tweets for each SimCluster and the top 100 SimClusters for each Tweet. Favorite score and follow score are the two types of tweet score. In this document, the top 100 SimClusters for a Tweet based on the favorite score stand for the Tweet's SimClusters Embedding.
The cosine similarity between two Tweet SimClusters Embeddings represents the relevance level of the two tweets in SimClusters space. The score varies from 0 to 1. A high cosine similarity score (>= 0.7 in Prod) means that the users who like the two tweets share the same SimClusters.
SimClusters from the Linear Algebra Perspective discussed the difference between the dot-product and cosine similarity in SimCluster space. We believe the cosine similarity approach is better because it avoids the bias of tweet popularity.
However, calculating the cosine similarity between two Tweets is pretty expensive during Tweet candidate generation. In TWISTLY, we scan at most 15,000 (6 source tweets * 25 clusters * 100 tweets per cluster) tweet candidates for every Home Timeline request. The traditional algorithm would need API calls to fetch all 15,000 tweet SimClusters embeddings. Considering that we need to process over 6,000 RPS, it's hard to support with the existing infrastructure.
## SimClusters Approximate Cosine Similarity Core Algorithm
1. Provide a source SimCluster Embedding *SV*, *SV = [(SC1, Score), (SC2, Score), (SC3, Score) …]*
2. Fetch the top *M* tweets for each of the top *N* SimClusters in *SV*. In Prod, *M = 400*, *N = 50*. Tweets may appear in multiple SimClusters.
| SimCluster | Top Tweet 1 | Top Tweet 2 | ... |
|---|---|---|---|
| SC1 | T1: Score | T2: Score | ... |
| SC2 | T3: Score | T4: Score | ... |
3. Based on the previous table, generate an *(M · N) × N* matrix *R*, which represents the approximate SimClusters embeddings for the *M · N* tweets. Each row contains only the top *N* SimClusters from *SV*, and only the top *M* tweets from each SimCluster have a score; all other entries are 0.
| | SC1 | SC2 | ... |
|---|---|---|---|
| T1 | Score | 0 | ... |
| T2 | Score | 0 | ... |
| T3 | 0 | Score | ... |
4. Compute the dot product between the source vector and the approximate vector of each tweet (i.e., calculate *R · SV^T*). Take the top *X* tweets. In Prod, *X = 200*.
5. Fetch the full SimClusters Embeddings for the *X* tweets, calculate the cosine similarity between each of the *X* tweets and *SV*, and return the top *Y* tweets whose score is above a certain threshold *Z*.
Approximate Cosine Similarity is an approximate algorithm: instead of fetching *M · N* tweet embeddings, it only fetches *X*. In prod, *X / (M · N) × 100% = 6%*. Based on the metrics gathered during TWISTLY development, most of the response time is consumed by I/O, so the Approximate Cosine Similarity approach saves a large amount of response time.
The idea of the approximate algorithm is based on the assumption that the higher dot-product between source tweets SimCluster embedding and candidate tweets limited SimCluster Embedding, the possibility that these two tweets are relevant is higher. Additional Cosine Similarity filter is to guarantee that the results are not affected by popularity bias.
Adjusting the M, N, X, Y, Z is able to balance the precision and recall for different products. The implementation of approximate cosine similarity is used by TWISTLY, Interest-based tweet recommendation, Similar Tweet in RUX, and Author based recommendation. This algorithm is also suitable for future user or entity recommendation based on SimClusters Embedding.
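Below is a minimal, self-contained sketch of the five steps above in plain Scala. It is not the production service: `topTweetsPerCluster` and `fetchTweetEmbedding` are hypothetical stand-ins for the real embedding stores, and the default for *Y* is illustrative.
```scala
// A minimal, in-memory sketch of Approximate Cosine Similarity.
// topTweetsPerCluster and fetchTweetEmbedding stand in for the real stores;
// only the control flow mirrors the steps above.
object ApproximateCosineSimilaritySketch {
  type ClusterId = Int
  type TweetId = Long
  type Embedding = Map[ClusterId, Double]

  private def cosine(a: Embedding, b: Embedding): Double = {
    def norm(e: Embedding): Double = math.sqrt(e.values.map(s => s * s).sum)
    val dot = a.keySet.intersect(b.keySet).iterator.map(c => a(c) * b(c)).sum
    val denom = norm(a) * norm(b)
    if (denom == 0.0) 0.0 else dot / denom
  }

  def recommend(
    sv: Embedding,                                            // source embedding SV
    topTweetsPerCluster: ClusterId => Seq[(TweetId, Double)], // top M tweets per cluster
    fetchTweetEmbedding: TweetId => Embedding,                // full embedding, expensive I/O
    n: Int = 50,                                              // top N clusters of SV
    x: Int = 200,                                             // candidates kept after the approximate pass
    y: Int = 100,                                             // final result size (illustrative)
    z: Double = 0.7                                           // cosine similarity threshold
  ): Seq[(TweetId, Double)] = {
    // Step 2: fetch the top M tweets for each of SV's top N clusters.
    val topClusters = sv.toSeq.sortBy(-_._2).take(n)
    // Steps 3-4: R • SV^T reduces to summing score(tweet, cluster) * SV(cluster)
    // over the fetched entries, since all other matrix entries are 0.
    val approx = scala.collection.mutable.HashMap.empty[TweetId, Double]
    for ((cluster, svScore) <- topClusters; (tweet, score) <- topTweetsPerCluster(cluster))
      approx(tweet) = approx.getOrElse(tweet, 0.0) + score * svScore
    val topX = approx.toSeq.sortBy(-_._2).take(x).map(_._1)
    // Step 5: exact cosine similarity on only the X candidates; threshold Z, top Y.
    topX
      .map(t => t -> cosine(sv, fetchTweetEmbedding(t)))
      .filter(_._2 >= z)
      .sortBy(-_._2)
      .take(y)
  }
}
```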
# -------------------------------
# Build and Test
# -------------------------------
Compile the service
```
$ ./bazel build simclusters-ann/server:bin
```
Unit tests
```
$ ./bazel test simclusters-ann/server:bin
```
# -------------------------------
# Deploy
# -------------------------------
## Prerequisite for devel deployments
First of all, you need to generate Service to Service certificates for use while developing locally. This only needs to be done ONCE:
To add cert files to Aurora (if you want to deploy to DEVEL):
```
$ developer-cert-util --env devel --job simclusters-ann
```
## Deploying to devel/staging from a local build
Reference:
```
$ ./simclusters-ann/bin/deploy.sh --help
```
Use the script to build the service from your local branch, upload it to Packer, and deploy it to devel Aurora:
```
$ ./simclusters-ann/bin/deploy.sh atla $USER devel simclusters-ann
```
You can also deploy to staging with this script, e.g. to deploy to instance 1:
```
$ ./simclusters-ann/bin/deploy.sh atla simclusters-ann staging simclusters-ann <instance-number>
```
## Deploying to production
Production deploys should be managed by Workflows.
_Do not_ deploy to production unless it is an emergency and you have approval from oncall.
##### It is not recommended to deploy from the command line into production environments, unless 1) you're testing a small change in a Canary shard [0,9], or 2) it is an absolute emergency. Be sure to make oncalls aware of the changes you're deploying.
```
$ ./simclusters-ann/bin/deploy.sh atla simclusters-ann prod simclusters-ann <instance-number>
```
In the case of multiple instances:
```
$ ./simclusters-ann/bin/deploy.sh atla simclusters-ann prod simclusters-ann <instance-number-start>-<instance-number-end>
```
## Checking Deployed Version and Rolling Back
Wherever possible, roll back using Workflows by finding an earlier good version and clicking the "rollback" button in the UI. This is the safest and least error-prone method.

View File

@ -0,0 +1,647 @@
package com.twitter.search.common.converter.earlybird;
import java.io.IOException;
import java.util.Date;
import java.util.List;
import java.util.Optional;
import javax.annotation.concurrent.NotThreadSafe;
import com.google.common.base.Preconditions;
import org.apache.commons.collections.CollectionUtils;
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.twitter.common_internal.text.version.PenguinVersion;
import com.twitter.search.common.converter.earlybird.EncodedFeatureBuilder.TweetFeatureWithEncodeFeatures;
import com.twitter.search.common.indexing.thriftjava.Place;
import com.twitter.search.common.indexing.thriftjava.PotentialLocation;
import com.twitter.search.common.indexing.thriftjava.ProfileGeoEnrichment;
import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents;
import com.twitter.search.common.indexing.thriftjava.VersionedTweetFeatures;
import com.twitter.search.common.metrics.SearchCounter;
import com.twitter.search.common.partitioning.snowflakeparser.SnowflakeIdParser;
import com.twitter.search.common.relevance.entities.GeoObject;
import com.twitter.search.common.relevance.entities.TwitterMessage;
import com.twitter.search.common.relevance.entities.TwitterQuotedMessage;
import com.twitter.search.common.schema.base.ImmutableSchemaInterface;
import com.twitter.search.common.schema.base.Schema;
import com.twitter.search.common.schema.earlybird.EarlybirdCluster;
import com.twitter.search.common.schema.earlybird.EarlybirdEncodedFeatures;
import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants;
import com.twitter.search.common.schema.earlybird.EarlybirdFieldConstants.EarlybirdFieldConstant;
import com.twitter.search.common.schema.earlybird.EarlybirdThriftDocumentBuilder;
import com.twitter.search.common.schema.thriftjava.ThriftDocument;
import com.twitter.search.common.schema.thriftjava.ThriftIndexingEvent;
import com.twitter.search.common.schema.thriftjava.ThriftIndexingEventType;
import com.twitter.search.common.util.spatial.GeoUtil;
import com.twitter.search.common.util.text.NormalizerHelper;
import com.twitter.tweetypie.thriftjava.ComposerSource;
/**
* Converts a TwitterMessage into a ThriftVersionedEvents. This is only responsible for data that
* is available immediately when a Tweet is created. Some data, like URL data, isn't available
* immediately, and so it is processed later, in the DelayedIndexingConverter and sent as an
* update. In order to achieve this we create the document in 2 passes:
*
* 1. BasicIndexingConverter builds thriftVersionedEvents with the fields that do not require
* external services.
*
* 2. DelayedIndexingConverter builds all the document fields depending on external services, once
* those services have processed the relevant Tweet and we have retrieved that data.
*/
@NotThreadSafe
public class BasicIndexingConverter {
private static final Logger LOG = LoggerFactory.getLogger(BasicIndexingConverter.class);
private static final SearchCounter NUM_NULLCAST_FEATURE_FLAG_SET_TWEETS =
SearchCounter.export("num_nullcast_feature_flag_set_tweets");
private static final SearchCounter NUM_NULLCAST_TWEETS =
SearchCounter.export("num_nullcast_tweets");
private static final SearchCounter NUM_NON_NULLCAST_TWEETS =
SearchCounter.export("num_non_nullcast_tweets");
private static final SearchCounter ADJUSTED_BAD_CREATED_AT_COUNTER =
SearchCounter.export("adjusted_incorrect_created_at_timestamp");
private static final SearchCounter INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS =
SearchCounter.export("inconsistent_tweet_id_and_created_at_ms");
private static final SearchCounter NUM_SELF_THREAD_TWEETS =
SearchCounter.export("num_self_thread_tweets");
private static final SearchCounter NUM_EXCLUSIVE_TWEETS =
SearchCounter.export("num_exclusive_tweets");
// If a tweet carries a timestamp smaller than this threshold, we consider the timestamp invalid,
// because Twitter did not exist before Sun, 01 Jan 2006 00:00:00 GMT.
private static final long VALID_CREATION_TIME_THRESHOLD_MILLIS =
new DateTime(2006, 1, 1, 0, 0, 0, DateTimeZone.UTC).getMillis();
private final EncodedFeatureBuilder featureBuilder;
private final Schema schema;
private final EarlybirdCluster cluster;
public BasicIndexingConverter(Schema schema, EarlybirdCluster cluster) {
this.featureBuilder = new EncodedFeatureBuilder();
this.schema = schema;
this.cluster = cluster;
}
/**
* This function converts TwitterMessage to ThriftVersionedEvents, which is a generic data
* structure that can be consumed by Earlybird directly.
*/
public ThriftVersionedEvents convertMessageToThrift(
TwitterMessage message,
boolean strict,
List<PenguinVersion> penguinVersions) throws IOException {
Preconditions.checkNotNull(message);
Preconditions.checkNotNull(penguinVersions);
ThriftVersionedEvents versionedEvents = new ThriftVersionedEvents()
.setId(message.getId());
ImmutableSchemaInterface schemaSnapshot = schema.getSchemaSnapshot();
for (PenguinVersion penguinVersion : penguinVersions) {
ThriftDocument document =
buildDocumentForPenguinVersion(schemaSnapshot, message, strict, penguinVersion);
ThriftIndexingEvent thriftIndexingEvent = new ThriftIndexingEvent()
.setDocument(document)
.setEventType(ThriftIndexingEventType.INSERT)
.setSortId(message.getId());
message.getFromUserTwitterId().map(thriftIndexingEvent::setUid);
versionedEvents.putToVersionedEvents(penguinVersion.getByteValue(), thriftIndexingEvent);
}
return versionedEvents;
}
private ThriftDocument buildDocumentForPenguinVersion(
ImmutableSchemaInterface schemaSnapshot,
TwitterMessage message,
boolean strict,
PenguinVersion penguinVersion) throws IOException {
TweetFeatureWithEncodeFeatures tweetFeature =
featureBuilder.createTweetFeaturesFromTwitterMessage(
message, penguinVersion, schemaSnapshot);
EarlybirdThriftDocumentBuilder builder =
buildBasicFields(message, schemaSnapshot, cluster, tweetFeature);
buildUserFields(builder, message, tweetFeature.versionedFeatures, penguinVersion);
buildGeoFields(builder, message, tweetFeature.versionedFeatures);
buildRetweetAndReplyFields(builder, message, strict);
buildQuotesFields(builder, message);
buildVersionedFeatureFields(builder, tweetFeature.versionedFeatures);
buildAnnotationFields(builder, message);
buildNormalizedMinEngagementFields(builder, tweetFeature.encodedFeatures, cluster);
buildDirectedAtFields(builder, message);
builder.withSpaceIdFields(message.getSpaceIds());
return builder.build();
}
/**
* Build the basic fields for a tweet.
*/
public static EarlybirdThriftDocumentBuilder buildBasicFields(
TwitterMessage message,
ImmutableSchemaInterface schemaSnapshot,
EarlybirdCluster cluster,
TweetFeatureWithEncodeFeatures tweetFeature) {
EarlybirdEncodedFeatures extendedEncodedFeatures = tweetFeature.extendedEncodedFeatures;
if (extendedEncodedFeatures == null && EarlybirdCluster.isTwitterMemoryFormatCluster(cluster)) {
extendedEncodedFeatures = EarlybirdEncodedFeatures.newEncodedTweetFeatures(
schemaSnapshot, EarlybirdFieldConstant.EXTENDED_ENCODED_TWEET_FEATURES_FIELD);
}
EarlybirdThriftDocumentBuilder builder = new EarlybirdThriftDocumentBuilder(
tweetFeature.encodedFeatures,
extendedEncodedFeatures,
new EarlybirdFieldConstants(),
schemaSnapshot);
builder.withID(message.getId());
final Date createdAt = message.getDate();
long createdAtMs = createdAt == null ? 0L : createdAt.getTime();
createdAtMs = fixCreatedAtTimeStampIfNecessary(message.getId(), createdAtMs);
if (createdAtMs > 0L) {
builder.withCreatedAt((int) (createdAtMs / 1000));
}
builder.withTweetSignature(tweetFeature.versionedFeatures.getTweetSignature());
if (message.getConversationId() > 0) {
long conversationId = message.getConversationId();
builder.withLongField(
EarlybirdFieldConstant.CONVERSATION_ID_CSF.getFieldName(), conversationId);
// We only index conversation ID when it is different from the tweet ID.
if (message.getId() != conversationId) {
builder.withLongField(
EarlybirdFieldConstant.CONVERSATION_ID_FIELD.getFieldName(), conversationId);
}
}
if (message.getComposerSource().isPresent()) {
ComposerSource composerSource = message.getComposerSource().get();
builder.withIntField(
EarlybirdFieldConstant.COMPOSER_SOURCE.getFieldName(), composerSource.getValue());
if (composerSource == ComposerSource.CAMERA) {
builder.withCameraComposerSourceFlag();
}
}
EarlybirdEncodedFeatures encodedFeatures = tweetFeature.encodedFeatures;
if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.FROM_VERIFIED_ACCOUNT_FLAG)) {
builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.VERIFIED_FILTER_TERM);
}
if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.FROM_BLUE_VERIFIED_ACCOUNT_FLAG)) {
builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.BLUE_VERIFIED_FILTER_TERM);
}
if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.IS_OFFENSIVE_FLAG)) {
builder.withOffensiveFlag();
}
if (message.getNullcast()) {
NUM_NULLCAST_TWEETS.increment();
builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.NULLCAST_FILTER_TERM);
} else {
NUM_NON_NULLCAST_TWEETS.increment();
}
if (encodedFeatures.isFlagSet(EarlybirdFieldConstant.IS_NULLCAST_FLAG)) {
NUM_NULLCAST_FEATURE_FLAG_SET_TWEETS.increment();
}
if (message.isSelfThread()) {
builder.addFilterInternalFieldTerm(
EarlybirdFieldConstant.SELF_THREAD_FILTER_TERM);
NUM_SELF_THREAD_TWEETS.increment();
}
if (message.isExclusive()) {
builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.EXCLUSIVE_FILTER_TERM);
builder.withLongField(
EarlybirdFieldConstant.EXCLUSIVE_CONVERSATION_AUTHOR_ID_CSF.getFieldName(),
message.getExclusiveConversationAuthorId());
NUM_EXCLUSIVE_TWEETS.increment();
}
builder.withLanguageCodes(message.getLanguage(), message.getBCP47LanguageTag());
return builder;
}
/**
* Build the user fields.
*/
public static void buildUserFields(
EarlybirdThriftDocumentBuilder builder,
TwitterMessage message,
VersionedTweetFeatures versionedTweetFeatures,
PenguinVersion penguinVersion) {
// 1. Set all the from user fields.
if (message.getFromUserTwitterId().isPresent()) {
builder.withLongField(EarlybirdFieldConstant.FROM_USER_ID_FIELD.getFieldName(),
message.getFromUserTwitterId().get())
// CSF
.withLongField(EarlybirdFieldConstant.FROM_USER_ID_CSF.getFieldName(),
message.getFromUserTwitterId().get());
} else {
LOG.warn("fromUserTwitterId is not set in TwitterMessage! Status id: " + message.getId());
}
if (message.getFromUserScreenName().isPresent()) {
String fromUser = message.getFromUserScreenName().get();
String normalizedFromUser =
NormalizerHelper.normalizeWithUnknownLocale(fromUser, penguinVersion);
builder
.withWhiteSpaceTokenizedScreenNameField(
EarlybirdFieldConstant.TOKENIZED_FROM_USER_FIELD.getFieldName(),
normalizedFromUser)
.withStringField(EarlybirdFieldConstant.FROM_USER_FIELD.getFieldName(),
normalizedFromUser);
if (message.getTokenizedFromUserScreenName().isPresent()) {
builder.withCamelCaseTokenizedScreenNameField(
EarlybirdFieldConstant.CAMELCASE_USER_HANDLE_FIELD.getFieldName(),
fromUser,
normalizedFromUser,
message.getTokenizedFromUserScreenName().get());
}
}
Optional<String> toUserScreenName = message.getToUserLowercasedScreenName();
if (toUserScreenName.isPresent() && !toUserScreenName.get().isEmpty()) {
builder.withStringField(
EarlybirdFieldConstant.TO_USER_FIELD.getFieldName(),
NormalizerHelper.normalizeWithUnknownLocale(toUserScreenName.get(), penguinVersion));
}
if (versionedTweetFeatures.isSetUserDisplayNameTokenStreamText()) {
builder.withTokenStreamField(EarlybirdFieldConstant.TOKENIZED_USER_NAME_FIELD.getFieldName(),
versionedTweetFeatures.getUserDisplayNameTokenStreamText(),
versionedTweetFeatures.getUserDisplayNameTokenStream());
}
}
/**
* Build the geo fields.
*/
public static void buildGeoFields(
EarlybirdThriftDocumentBuilder builder,
TwitterMessage message,
VersionedTweetFeatures versionedTweetFeatures) {
double lat = GeoUtil.ILLEGAL_LATLON;
double lon = GeoUtil.ILLEGAL_LATLON;
if (message.getGeoLocation() != null) {
GeoObject location = message.getGeoLocation();
builder.withGeoField(EarlybirdFieldConstant.GEO_HASH_FIELD.getFieldName(),
location.getLatitude(), location.getLongitude(), location.getAccuracy());
if (location.getSource() != null) {
builder.withStringField(EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(),
EarlybirdFieldConstants.formatGeoType(location.getSource()));
}
if (GeoUtil.validateGeoCoordinates(location.getLatitude(), location.getLongitude())) {
lat = location.getLatitude();
lon = location.getLongitude();
}
}
// See SEARCH-14317 for investigation on how much space the geo field uses in the archive cluster.
// In lucene archives, this CSF is needed regardless of whether geoLocation is set.
builder.withLatLonCSF(lat, lon);
if (versionedTweetFeatures.isSetTokenizedPlace()) {
Place place = versionedTweetFeatures.getTokenizedPlace();
Preconditions.checkArgument(place.isSetId(), "Place ID not set for tweet "
+ message.getId());
Preconditions.checkArgument(place.isSetFullName(),
"Place full name not set for tweet " + message.getId());
builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.PLACE_ID_FIELD.getFieldName());
builder
.withStringField(EarlybirdFieldConstant.PLACE_ID_FIELD.getFieldName(), place.getId())
.withStringField(EarlybirdFieldConstant.PLACE_FULL_NAME_FIELD.getFieldName(),
place.getFullName());
if (place.isSetCountryCode()) {
builder.withStringField(EarlybirdFieldConstant.PLACE_COUNTRY_CODE_FIELD.getFieldName(),
place.getCountryCode());
}
}
if (versionedTweetFeatures.isSetTokenizedProfileGeoEnrichment()) {
ProfileGeoEnrichment profileGeoEnrichment =
versionedTweetFeatures.getTokenizedProfileGeoEnrichment();
Preconditions.checkArgument(
profileGeoEnrichment.isSetPotentialLocations(),
"ProfileGeoEnrichment.potentialLocations not set for tweet "
+ message.getId());
List<PotentialLocation> potentialLocations = profileGeoEnrichment.getPotentialLocations();
Preconditions.checkArgument(
!potentialLocations.isEmpty(),
"Found tweet with an empty ProfileGeoEnrichment.potentialLocations: "
+ message.getId());
builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.PROFILE_GEO_FILTER_TERM);
for (PotentialLocation potentialLocation : potentialLocations) {
if (potentialLocation.isSetCountryCode()) {
builder.withStringField(
EarlybirdFieldConstant.PROFILE_GEO_COUNTRY_CODE_FIELD.getFieldName(),
potentialLocation.getCountryCode());
}
if (potentialLocation.isSetRegion()) {
builder.withStringField(EarlybirdFieldConstant.PROFILE_GEO_REGION_FIELD.getFieldName(),
potentialLocation.getRegion());
}
if (potentialLocation.isSetLocality()) {
builder.withStringField(EarlybirdFieldConstant.PROFILE_GEO_LOCALITY_FIELD.getFieldName(),
potentialLocation.getLocality());
}
}
}
builder.withPlacesField(message.getPlaces());
}
/**
* Build the retweet and reply fields.
*/
public static void buildRetweetAndReplyFields(
EarlybirdThriftDocumentBuilder builder,
TwitterMessage message,
boolean strict) {
long retweetUserIdVal = -1;
long sharedStatusIdVal = -1;
if (message.getRetweetMessage() != null) {
if (message.getRetweetMessage().getSharedId() != null) {
sharedStatusIdVal = message.getRetweetMessage().getSharedId();
}
if (message.getRetweetMessage().hasSharedUserTwitterId()) {
retweetUserIdVal = message.getRetweetMessage().getSharedUserTwitterId();
}
}
long inReplyToStatusIdVal = -1;
long inReplyToUserIdVal = -1;
if (message.isReply()) {
if (message.getInReplyToStatusId().isPresent()) {
inReplyToStatusIdVal = message.getInReplyToStatusId().get();
}
if (message.getToUserTwitterId().isPresent()) {
inReplyToUserIdVal = message.getToUserTwitterId().get();
}
}
buildRetweetAndReplyFields(
retweetUserIdVal,
sharedStatusIdVal,
inReplyToStatusIdVal,
inReplyToUserIdVal,
strict,
builder);
}
/**
* Build the quotes fields.
*/
public static void buildQuotesFields(
EarlybirdThriftDocumentBuilder builder,
TwitterMessage message) {
if (message.getQuotedMessage() != null) {
TwitterQuotedMessage quoted = message.getQuotedMessage();
if (quoted != null && quoted.getQuotedStatusId() > 0 && quoted.getQuotedUserId() > 0) {
builder.withQuote(quoted.getQuotedStatusId(), quoted.getQuotedUserId());
}
}
}
/**
* Build directed at field.
*/
public static void buildDirectedAtFields(
EarlybirdThriftDocumentBuilder builder,
TwitterMessage message) {
if (message.getDirectedAtUserId().isPresent() && message.getDirectedAtUserId().get() > 0) {
builder.withDirectedAtUser(message.getDirectedAtUserId().get());
builder.addFilterInternalFieldTerm(EarlybirdFieldConstant.DIRECTED_AT_FILTER_TERM);
}
}
/**
* Build the versioned features for a tweet.
*/
public static void buildVersionedFeatureFields(
EarlybirdThriftDocumentBuilder builder,
VersionedTweetFeatures versionedTweetFeatures) {
builder
.withHashtagsField(versionedTweetFeatures.getHashtags())
.withMentionsField(versionedTweetFeatures.getMentions())
.withStocksFields(versionedTweetFeatures.getStocks())
.withResolvedLinksText(versionedTweetFeatures.getNormalizedResolvedUrlText())
.withTokenStreamField(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(),
versionedTweetFeatures.getTweetTokenStreamText(),
versionedTweetFeatures.isSetTweetTokenStream()
? versionedTweetFeatures.getTweetTokenStream() : null)
.withStringField(EarlybirdFieldConstant.SOURCE_FIELD.getFieldName(),
versionedTweetFeatures.getSource())
.withStringField(EarlybirdFieldConstant.NORMALIZED_SOURCE_FIELD.getFieldName(),
versionedTweetFeatures.getNormalizedSource());
// Internal fields for smileys and question marks
if (versionedTweetFeatures.hasPositiveSmiley) {
builder.withStringField(
EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(),
EarlybirdFieldConstant.HAS_POSITIVE_SMILEY);
}
if (versionedTweetFeatures.hasNegativeSmiley) {
builder.withStringField(
EarlybirdFieldConstant.INTERNAL_FIELD.getFieldName(),
EarlybirdFieldConstant.HAS_NEGATIVE_SMILEY);
}
if (versionedTweetFeatures.hasQuestionMark) {
builder.withStringField(EarlybirdFieldConstant.TEXT_FIELD.getFieldName(),
EarlybirdThriftDocumentBuilder.QUESTION_MARK);
}
}
/**
* Build the escherbird annotations for a tweet.
*/
public static void buildAnnotationFields(
EarlybirdThriftDocumentBuilder builder,
TwitterMessage message) {
List<TwitterMessage.EscherbirdAnnotation> escherbirdAnnotations =
message.getEscherbirdAnnotations();
if (CollectionUtils.isEmpty(escherbirdAnnotations)) {
return;
}
builder.addFacetSkipList(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName());
for (TwitterMessage.EscherbirdAnnotation annotation : escherbirdAnnotations) {
String groupDomainEntity = String.format("%d.%d.%d",
annotation.groupId, annotation.domainId, annotation.entityId);
String domainEntity = String.format("%d.%d", annotation.domainId, annotation.entityId);
String entity = String.format("%d", annotation.entityId);
builder.withStringField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(),
groupDomainEntity);
builder.withStringField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(),
domainEntity);
builder.withStringField(EarlybirdFieldConstant.ENTITY_ID_FIELD.getFieldName(),
entity);
}
}
/**
* Build the correct ThriftIndexingEvent's fields based on retweet and reply status.
*/
public static void buildRetweetAndReplyFields(
long retweetUserIdVal,
long sharedStatusIdVal,
long inReplyToStatusIdVal,
long inReplyToUserIdVal,
boolean strict,
EarlybirdThriftDocumentBuilder builder) {
Optional<Long> retweetUserId = Optional.of(retweetUserIdVal).filter(x -> x > 0);
Optional<Long> sharedStatusId = Optional.of(sharedStatusIdVal).filter(x -> x > 0);
Optional<Long> inReplyToUserId = Optional.of(inReplyToUserIdVal).filter(x -> x > 0);
Optional<Long> inReplyToStatusId = Optional.of(inReplyToStatusIdVal).filter(x -> x > 0);
// We have six combinations here. A Tweet can be
// 1) a reply to another tweet (then it has both in-reply-to-user-id and
// in-reply-to-status-id set),
// 2) directed-at a user (then it only has in-reply-to-user-id set),
// 3) not a reply at all.
// Additionally, it may or may not be a Retweet (if it is, then it has retweet-user-id and
// retweet-status-id set).
//
// We want to set some fields unconditionally, and some fields (reference-author-id and
// shared-status-id) depending on the reply/retweet combination.
//
// 1. Normal tweet (not a reply, not a retweet). None of the fields should be set.
//
// 2. Reply to a tweet (both in-reply-to-user-id and in-reply-to-status-id set).
// IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id
// SHARED_STATUS_ID_CSF should be set to in-reply-to-status-id
// IS_REPLY_FLAG should be set
//
// 3. Directed-at a user (only in-reply-to-user-id is set).
// IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id
// IS_REPLY_FLAG should be set
//
// 4. Retweet of a normal tweet (retweet-user-id and retweet-status-id are set).
// RETWEET_SOURCE_USER_ID_FIELD should be set to retweet-user-id
// SHARED_STATUS_ID_CSF should be set to retweet-status-id
// IS_RETWEET_FLAG should be set
//
// 5. Retweet of a reply (both in-reply-to-user-id and in-reply-to-status-id set,
// retweet-user-id and retweet-status-id are set).
// RETWEET_SOURCE_USER_ID_FIELD should be set to retweet-user-id
// SHARED_STATUS_ID_CSF should be set to retweet-status-id (retweet beats reply!)
// IS_RETWEET_FLAG should be set
// IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id
// IS_REPLY_FLAG should NOT be set
//
// 6. Retweet of a directed-at tweet (only in-reply-to-user-id is set,
// retweet-user-id and retweet-status-id are set).
// RETWEET_SOURCE_USER_ID_FIELD should be set to retweet-user-id
// SHARED_STATUS_ID_CSF should be set to retweet-status-id
// IS_RETWEET_FLAG should be set
// IN_REPLY_TO_USER_ID_FIELD should be set to in-reply-to-user-id
// IS_REPLY_FLAG should NOT be set
//
// In other words:
// SHARED_STATUS_ID_CSF logic: if this is a retweet SHARED_STATUS_ID_CSF should be set to
// retweet-status-id, otherwise if it's a reply to a tweet, it should be set to
// in-reply-to-status-id.
Preconditions.checkState(retweetUserId.isPresent() == sharedStatusId.isPresent());
if (retweetUserId.isPresent()) {
builder.withNativeRetweet(retweetUserId.get(), sharedStatusId.get());
if (inReplyToUserId.isPresent()) {
// Set IN_REPLY_TO_USER_ID_FIELD even if this is a retweet of a reply.
builder.withInReplyToUserID(inReplyToUserId.get());
}
} else {
// If this is a retweet of a reply, we don't want to mark it as a reply, or override fields
// set by the retweet logic.
// If we are in this branch, this is not a retweet. Potentially, we set the reply flag,
// and override shared-status-id and reference-author-id.
if (inReplyToStatusId.isPresent()) {
if (strict) {
// Enforcing that if this is a reply to a tweet, then it also has a replied-to user.
Preconditions.checkState(inReplyToUserId.isPresent());
}
builder.withReplyFlag();
builder.withLongField(
EarlybirdFieldConstant.SHARED_STATUS_ID_CSF.getFieldName(),
inReplyToStatusId.get());
builder.withLongField(
EarlybirdFieldConstant.IN_REPLY_TO_TWEET_ID_FIELD.getFieldName(),
inReplyToStatusId.get());
}
if (inReplyToUserId.isPresent()) {
builder.withReplyFlag();
builder.withInReplyToUserID(inReplyToUserId.get());
}
}
}
/**
* Build the engagement fields.
*/
public static void buildNormalizedMinEngagementFields(
EarlybirdThriftDocumentBuilder builder,
EarlybirdEncodedFeatures encodedFeatures,
EarlybirdCluster cluster) throws IOException {
if (EarlybirdCluster.isArchive(cluster)) {
int favoriteCount = encodedFeatures.getFeatureValue(EarlybirdFieldConstant.FAVORITE_COUNT);
int retweetCount = encodedFeatures.getFeatureValue(EarlybirdFieldConstant.RETWEET_COUNT);
int replyCount = encodedFeatures.getFeatureValue(EarlybirdFieldConstant.REPLY_COUNT);
builder
.withNormalizedMinEngagementField(
EarlybirdFieldConstant.NORMALIZED_FAVORITE_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD
.getFieldName(),
favoriteCount);
builder
.withNormalizedMinEngagementField(
EarlybirdFieldConstant.NORMALIZED_RETWEET_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD
.getFieldName(),
retweetCount);
builder
.withNormalizedMinEngagementField(
EarlybirdFieldConstant.NORMALIZED_REPLY_COUNT_GREATER_THAN_OR_EQUAL_TO_FIELD
.getFieldName(),
replyCount);
}
}
/**
* As seen in SEARCH-5617, we sometimes have incorrect createdAt. This method tries to fix them
* by extracting creation time from snowflake when possible.
*/
public static long fixCreatedAtTimeStampIfNecessary(long id, long createdAtMs) {
if (createdAtMs < VALID_CREATION_TIME_THRESHOLD_MILLIS
&& id > SnowflakeIdParser.SNOWFLAKE_ID_LOWER_BOUND) {
// This tweet has a snowflake ID, and we can extract timestamp from the ID.
ADJUSTED_BAD_CREATED_AT_COUNTER.increment();
return SnowflakeIdParser.getTimestampFromTweetId(id);
} else if (!SnowflakeIdParser.isTweetIDAndCreatedAtConsistent(id, createdAtMs)) {
LOG.error(
"Found inconsistent tweet ID and created at timestamp: [statusID={}], [createdAtMs={}]",
id, createdAtMs);
INCONSISTENT_TWEET_ID_AND_CREATED_AT_MS.increment();
}
return createdAtMs;
}
}
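For orientation, a minimal sketch of how a caller might drive this first indexing pass (Scala, to match the other sketches in this document; the message, schema, cluster, and Penguin version are assumed to come from the surrounding ingestion pipeline):
```scala
// Hypothetical caller sketch. This is the first of the two passes described in
// the class Javadoc; DelayedIndexingConverter later sends an update once
// external-service data (e.g. URL data) is available.
import scala.collection.JavaConverters._
import com.twitter.common_internal.text.version.PenguinVersion
import com.twitter.search.common.converter.earlybird.BasicIndexingConverter
import com.twitter.search.common.indexing.thriftjava.ThriftVersionedEvents
import com.twitter.search.common.relevance.entities.TwitterMessage
import com.twitter.search.common.schema.base.Schema
import com.twitter.search.common.schema.earlybird.EarlybirdCluster

def convertForIndexing(
    schema: Schema,
    cluster: EarlybirdCluster,
    message: TwitterMessage,
    penguinVersion: PenguinVersion): ThriftVersionedEvents = {
  val converter = new BasicIndexingConverter(schema, cluster)
  // strict = false: don't require a replied-to user to be present on replies.
  converter.convertMessageToThrift(message, false, List(penguinVersion).asJava)
}
```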

View File

@ -0,0 +1,155 @@
package com.twitter.search.earlybird.ml;
import java.io.IOException;
import com.google.common.annotations.VisibleForTesting;
import com.google.common.base.Optional;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.twitter.search.common.file.AbstractFile;
import com.twitter.search.common.file.FileUtils;
import com.twitter.search.common.metrics.SearchStatsReceiver;
import com.twitter.search.common.schema.DynamicSchema;
import com.twitter.search.common.util.ml.prediction_engine.CompositeFeatureContext;
import com.twitter.search.common.util.ml.prediction_engine.LightweightLinearModel;
import com.twitter.search.common.util.ml.prediction_engine.ModelLoader;
import static com.twitter.search.modeling.tweet_ranking.TweetScoringFeatures.CONTEXT;
import static com.twitter.search.modeling.tweet_ranking.TweetScoringFeatures.FeatureContextVersion.CURRENT_VERSION;
/**
* Loads the scoring models for tweets and provides access to them.
*
* This class relies on a list of ModelLoader objects and retrieves models from them. It
* returns the first model found according to the order in the list.
*
* For production, we load models from 2 sources: classpath and HDFS. If a model is available
* from HDFS, we return it, otherwise we use the model from the classpath.
*
* The models used for default requests (i.e. not experiments) MUST be present in the
* classpath, this allows us to avoid errors if they can't be loaded from HDFS.
* Models for experiments can live only in HDFS, so we don't need to redeploy Earlybird if we
* want to test them.
*/
public class ScoringModelsManager {
private static final Logger LOG = LoggerFactory.getLogger(ScoringModelsManager.class);
/**
* Used when
* 1. Testing
* 2. The scoring models are disabled in the config
* 3. Exceptions thrown during loading the scoring models
*/
public static final ScoringModelsManager NO_OP_MANAGER = new ScoringModelsManager() {
@Override
public boolean isEnabled() {
return false;
}
};
private final ModelLoader[] loaders;
private final DynamicSchema dynamicSchema;
public ScoringModelsManager(ModelLoader... loaders) {
this.loaders = loaders;
this.dynamicSchema = null;
}
public ScoringModelsManager(DynamicSchema dynamicSchema, ModelLoader... loaders) {
this.loaders = loaders;
this.dynamicSchema = dynamicSchema;
}
/**
* Indicates that the scoring models were enabled in the config and were loaded successfully
*/
public boolean isEnabled() {
return true;
}
public void reload() {
for (ModelLoader loader : loaders) {
loader.run();
}
}
/**
* Loads and returns the model with the given name, if one exists.
*/
public Optional<LightweightLinearModel> getModel(String modelName) {
for (ModelLoader loader : loaders) {
Optional<LightweightLinearModel> model = loader.getModel(modelName);
if (model.isPresent()) {
return model;
}
}
return Optional.absent();
}
/**
* Creates an instance that loads models from HDFS first and then from the classpath resources.
*
* If the models are not found in HDFS, it uses the models from the classpath as fallback.
*/
public static ScoringModelsManager create(
SearchStatsReceiver serverStats,
String hdfsNameNode,
String hdfsBasedPath,
DynamicSchema dynamicSchema) throws IOException {
// Create a composite feature context so we can load both legacy and schema-based models
CompositeFeatureContext featureContext = new CompositeFeatureContext(
CONTEXT, dynamicSchema::getSearchFeatureSchema);
ModelLoader hdfsLoader = createHdfsLoader(
serverStats, hdfsNameNode, hdfsBasedPath, featureContext);
ModelLoader classpathLoader = createClasspathLoader(
serverStats, featureContext);
// Explicitly load the models from the classpath
classpathLoader.run();
ScoringModelsManager manager = new ScoringModelsManager(hdfsLoader, classpathLoader);
LOG.info("Initialized ScoringModelsManager for loading models from HDFS and the classpath");
return manager;
}
protected static ModelLoader createHdfsLoader(
SearchStatsReceiver serverStats,
String hdfsNameNode,
String hdfsBasedPath,
CompositeFeatureContext featureContext) {
String hdfsVersionedPath = hdfsBasedPath + "/" + CURRENT_VERSION.getVersionDirectory();
LOG.info("Starting to load scoring models from HDFS: {}:{}",
hdfsNameNode, hdfsVersionedPath);
return ModelLoader.forHdfsDirectory(
hdfsNameNode,
hdfsVersionedPath,
featureContext,
"scoring_models_hdfs_",
serverStats);
}
/**
* Creates a loader that loads models from a default location in the classpath.
*/
@VisibleForTesting
public static ModelLoader createClasspathLoader(
SearchStatsReceiver serverStats, CompositeFeatureContext featureContext)
throws IOException {
AbstractFile defaultModelsBaseDir = FileUtils.getTmpDirHandle(
ScoringModelsManager.class,
"/com/twitter/search/earlybird/ml/default_models");
AbstractFile defaultModelsDir = defaultModelsBaseDir.getChild(
CURRENT_VERSION.getVersionDirectory());
LOG.info("Starting to load scoring models from the classpath: {}",
defaultModelsDir.getPath());
return ModelLoader.forDirectory(
defaultModelsDir,
featureContext,
"scoring_models_classpath_",
serverStats);
}
}
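For reference, a brief sketch of the call pattern implied by the factory and accessors above (Scala; the HDFS name node, base path, and model name are placeholders):
```scala
// Hypothetical wiring sketch based on the class's own factory and accessors.
import com.google.common.base.{Optional => GuavaOptional}
import com.twitter.search.common.metrics.SearchStatsReceiver
import com.twitter.search.common.schema.DynamicSchema
import com.twitter.search.common.util.ml.prediction_engine.LightweightLinearModel
import com.twitter.search.earlybird.ml.ScoringModelsManager

def loadRankingModel(
    serverStats: SearchStatsReceiver,
    dynamicSchema: DynamicSchema): GuavaOptional[LightweightLinearModel] = {
  val manager = ScoringModelsManager.create(
    serverStats,
    "hdfs-namenode.example.com", // placeholder name node
    "/models/tweet_ranking",     // placeholder base path
    dynamicSchema)
  manager.reload()                  // re-run both loaders (HDFS first, then classpath fallback)
  manager.getModel("default_model") // placeholder model name
}
```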

View File

@ -0,0 +1,83 @@
# checkstyle: noqa
from twml.feature_config import FeatureConfigBuilder
def get_feature_config(data_spec_path, label):
return (
FeatureConfigBuilder(data_spec_path=data_spec_path, debug=True)
.batch_add_features(
[
("ebd.author_specific_score", "A"),
("ebd.has_diff_lang", "A"),
("ebd.has_english_tweet_diff_ui_lang", "A"),
("ebd.has_english_ui_diff_tweet_lang", "A"),
("ebd.is_self_tweet", "A"),
("ebd.tweet_age_in_secs", "A"),
("encoded_tweet_features.favorite_count", "A"),
("encoded_tweet_features.from_verified_account_flag", "A"),
("encoded_tweet_features.has_card_flag", "A"),
# ("encoded_tweet_features.has_consumer_video_flag", "A"),
("encoded_tweet_features.has_image_url_flag", "A"),
("encoded_tweet_features.has_link_flag", "A"),
("encoded_tweet_features.has_multiple_hashtags_or_trends_flag", "A"),
# ("encoded_tweet_features.has_multiple_media_flag", "A"),
("encoded_tweet_features.has_native_image_flag", "A"),
("encoded_tweet_features.has_news_url_flag", "A"),
("encoded_tweet_features.has_periscope_flag", "A"),
("encoded_tweet_features.has_pro_video_flag", "A"),
("encoded_tweet_features.has_quote_flag", "A"),
("encoded_tweet_features.has_trend_flag", "A"),
("encoded_tweet_features.has_video_url_flag", "A"),
("encoded_tweet_features.has_vine_flag", "A"),
("encoded_tweet_features.has_visible_link_flag", "A"),
("encoded_tweet_features.is_offensive_flag", "A"),
("encoded_tweet_features.is_reply_flag", "A"),
("encoded_tweet_features.is_retweet_flag", "A"),
("encoded_tweet_features.is_sensitive_content", "A"),
# ("encoded_tweet_features.is_user_new_flag", "A"),
("encoded_tweet_features.language", "A"),
("encoded_tweet_features.link_language", "A"),
("encoded_tweet_features.num_hashtags", "A"),
("encoded_tweet_features.num_mentions", "A"),
# ("encoded_tweet_features.profile_is_egg_flag", "A"),
("encoded_tweet_features.reply_count", "A"),
("encoded_tweet_features.retweet_count", "A"),
("encoded_tweet_features.text_score", "A"),
("encoded_tweet_features.user_reputation", "A"),
("extended_encoded_tweet_features.embeds_impression_count", "A"),
("extended_encoded_tweet_features.embeds_impression_count_v2", "A"),
("extended_encoded_tweet_features.embeds_url_count", "A"),
("extended_encoded_tweet_features.embeds_url_count_v2", "A"),
("extended_encoded_tweet_features.favorite_count_v2", "A"),
("extended_encoded_tweet_features.label_abusive_hi_rcl_flag", "A"),
("extended_encoded_tweet_features.label_dup_content_flag", "A"),
("extended_encoded_tweet_features.label_nsfw_hi_prc_flag", "A"),
("extended_encoded_tweet_features.label_nsfw_hi_rcl_flag", "A"),
("extended_encoded_tweet_features.label_spam_flag", "A"),
("extended_encoded_tweet_features.label_spam_hi_rcl_flag", "A"),
("extended_encoded_tweet_features.quote_count", "A"),
("extended_encoded_tweet_features.reply_count_v2", "A"),
("extended_encoded_tweet_features.retweet_count_v2", "A"),
("extended_encoded_tweet_features.weighted_favorite_count", "A"),
("extended_encoded_tweet_features.weighted_quote_count", "A"),
("extended_encoded_tweet_features.weighted_reply_count", "A"),
("extended_encoded_tweet_features.weighted_retweet_count", "A"),
]
)
.add_labels(
[
label, # Tensor index: 0
"recap.engagement.is_clicked", # Tensor index: 1
"recap.engagement.is_favorited", # Tensor index: 2
"recap.engagement.is_open_linked", # Tensor index: 3
"recap.engagement.is_photo_expanded", # Tensor index: 4
"recap.engagement.is_profile_clicked", # Tensor index: 5
"recap.engagement.is_replied", # Tensor index: 6
"recap.engagement.is_retweeted", # Tensor index: 7
"recap.engagement.is_video_playback_50", # Tensor index: 8
"timelines.earlybird_score", # Tensor index: 9
]
)
.define_weight("meta.record_weight/type=earlybird")
.build()
)

View File

@ -0,0 +1,75 @@
Tweepcred
Tweepcred is a social network analysis tool that calculates the influence of Twitter users based on their interactions with other users. The tool uses the PageRank algorithm to rank users based on their influence.
PageRank Algorithm
PageRank is a graph algorithm that was originally developed by Google to determine the importance of web pages in search results. The algorithm works by assigning a numerical score to each page based on the number and quality of other pages that link to it. The more links a page has from other high-quality pages, the higher its PageRank score.
In the Tweepcred project, the PageRank algorithm is used to determine the influence of Twitter users based on their interactions with other users. The graph is constructed by treating Twitter users as nodes, and their interactions (mentions, retweets, etc.) as edges. The PageRank score of a user represents their influence in the network.
Tweepcred PageRank Implementation
The implementation of the PageRank algorithm in Tweepcred is based on the Hadoop MapReduce framework. The algorithm is split into two stages: preparation and iteration.
The preparation stage involves constructing the graph of Twitter users and their interactions, and initializing each user's PageRank score to a default value. This stage is implemented in the PreparePageRankData class.
The iteration stage involves repeatedly calculating and updating the PageRank scores of each user until convergence is reached. This stage is implemented in the UpdatePageRank class, which is run multiple times until the algorithm converges.
The Tweepcred PageRank implementation also includes a number of optimizations to improve performance and reduce memory usage. These optimizations include block compression, lazy loading, and in-memory caching.
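To make the iteration stage concrete, below is a minimal in-memory sketch of one weighted PageRank update plus the convergence test. It is illustrative only: the real jobs run on Hadoop via Scalding, and the jump probability, data layout, and names here are assumptions.
```scala
// Illustrative weighted PageRank sketch; the production jobs run on Hadoop.
// adjacency maps each node to (neighbor, edgeWeight) pairs whose weights sum
// to 1.0 per node; dangling-node mass is ignored for brevity.
object PageRankSketch {
  def iterate(
    ranks: Map[Long, Double],
    adjacency: Map[Long, Seq[(Long, Double)]],
    jumpProb: Double = 0.15 // probability of a random jump, an assumed default
  ): Map[Long, Double] = {
    val n = ranks.size
    val contributions = scala.collection.mutable.HashMap.empty[Long, Double]
    for ((node, rank) <- ranks; (neighbor, w) <- adjacency.getOrElse(node, Nil))
      contributions(neighbor) = contributions.getOrElse(neighbor, 0.0) + rank * w
    ranks.map { case (node, _) =>
      node -> (jumpProb / n + (1 - jumpProb) * contributions.getOrElse(node, 0.0))
    }
  }

  // Repeat until the total difference between consecutive PageRank vectors
  // drops below a threshold, mirroring WeightedPageRank's convergence test.
  def run(
    initial: Map[Long, Double],
    adjacency: Map[Long, Seq[(Long, Double)]],
    maxIterations: Int,
    threshold: Double
  ): Map[Long, Double] = {
    var ranks = initial
    var diff = Double.MaxValue
    var i = 0
    while (i < maxIterations && diff > threshold) {
      val next = iterate(ranks, adjacency)
      diff = ranks.map { case (node, r) => math.abs(r - next(node)) }.sum
      ranks = next
      i += 1
    }
    ranks
  }
}
```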
========================================== TweepcredBatchJob.scala ==========================================
This is a Scala class that represents a batch job for computing the "tweepcred" (Twitter credibility) score for Twitter users using the weighted or unweighted PageRank algorithm. The class extends the AnalyticsIterativeBatchJob class, which is part of the Scalding framework used for data processing on Hadoop.
The class defines various properties and methods that are used to configure and run the batch job. The args parameter represents the command-line arguments that are passed to the batch job, such as the --weighted flag that determines whether to use the weighted PageRank algorithm or not.
The run method overrides the run method of the base class and prints the batch statistics after the job has finished. The children method defines a list of child jobs that need to be executed as part of the batch job. The messageHeader method returns a string that represents the header of the batch job message.
========================================== ExtractTweepcred.scala ==========================================
This class is a Scalding job that calculates "tweepcred" from a given pagerank file. Tweepcred is a measure of reputation for Twitter users that takes into account the number of followers they have and the number of people they follow. If the optional argument post_adjust is set to true (default value), then the pagerank values are adjusted based on the user's follower-to-following ratio.
The class takes several command-line arguments specifying input and output files and options, and it uses the Scalding library to perform distributed data processing on the input files. It reads in the pagerank file and a user mass file, both in TSV format, and combines them to produce a new pagerank file with the adjusted values. The adjusted pagerank is then used to calculate tweepcred values, which are written to output files.
The code makes use of the MostRecentCombinedUserSnapshotSource class from the com.twitter.pluck.source.combined_user_source package to obtain user information from the user mass file. It also uses the Reputation class to perform the tweepcred calculations and adjustments.
========================================== UserMass.scala ==========================================
The UserMass class is a helper class used to calculate the "mass" of a user on Twitter, as defined by a certain algorithm. The mass score represents the user's reputation and is used in various applications, such as in determining which users should be recommended to follow or which users should have their content highlighted.
The getUserMass method of the UserMass class takes in a CombinedUser object, which contains information about a Twitter user, and returns an optional UserMassInfo object, which contains the user's ID and calculated mass score.
The algorithm used to calculate the mass score takes into account various factors such as the user's account age, number of followers and followings, device usage, and safety status (restricted, suspended, verified). The calculation involves adding and multiplying weight factors and adjusting the mass score based on a threshold for the number of friends and followers.
========================================== PreparePageRankData.scala ==========================================
The PreparePageRankData class prepares the graph data for the page rank calculation. It generates the initial pagerank and then starts the WeightedPageRank job. It has the following functionalities:
It reads the user mass TSV file generated by the twadoop user_mass job.
It reads the graph data, which is either a TSV file or a combination of flock edges and real graph inputs for weights.
It generates the initial pagerank as the starting point for the pagerank computation.
It writes the number of nodes to a TSV file and dumps the nodes to another TSV file.
It has several options like weighted, flock_edges_only, and input_pagerank to fine-tune the pagerank calculation.
It also has options for the WeightedPageRank and ExtractTweepcred jobs, like output_pagerank, output_tweepcred, maxiterations, jumpprob, threshold, and post_adjust.
The PreparePageRankData class has several helper functions like getFlockEdges, getRealGraphEdges, getFlockRealGraphEdges, and getCsvEdges that read the graph data from different sources like DAL, InteractionGraph, or CSV files. It also has the generateInitialPagerank function that generates the initial pagerank from the graph data.
========================================== WeightedPageRank.scala ==========================================
WeightedPageRank is a class that performs the weighted PageRank algorithm on a given graph.
The algorithm starts from a given PageRank value and performs one iteration, then tests for convergence. If convergence has not been reached, the algorithm clones itself and starts the next PageRank job with the updated PageRank as input. If convergence has been reached, the algorithm starts the ExtractTweepcred job instead.
The class takes in several options, including the working directory, total number of nodes, nodes file, PageRank file, total difference, whether to perform weighted PageRank, the current iteration, maximum iterations to run, probability of a random jump, and whether to do post adjust.
The algorithm reads a nodes file that includes the source node ID, destination node IDs, weights, and mass prior. The algorithm also reads an input PageRank file that includes the source node ID and mass input. The algorithm then performs one iteration of the PageRank algorithm and writes the output PageRank to a file.
The algorithm tests for convergence by calculating the total difference between the input and output PageRank masses. If convergence has not been reached, the algorithm clones itself and starts the next PageRank job. If convergence has been reached, the algorithm starts the ExtractTweepcred job.
========================================== Reputation.scala ==========================================
This is a helper class called Reputation that contains methods for calculating a user's reputation score. The first method, scaledReputation, takes a Double parameter raw, which represents the user's PageRank, and returns a Byte value that represents the user's reputation on a scale of 0 to 100. The method uses a formula that converts the logarithm of the PageRank into a number between 0 and 100.
The second method, adjustReputationsPostCalculation, takes three parameters: mass (a Double value representing the user's PageRank), numFollowers (an Int value representing the user's number of followers), and numFollowings (an Int value representing the number of users the user follows). This method reduces the PageRank of users who have a low number of followers but a high number of followings. It calculates a division factor based on the ratio of followings to followers and reduces the user's PageRank by dividing it by this factor. The method returns the adjusted PageRank.
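A hedged sketch of these two methods follows; the constants, the threshold, and the exact log-scaling curve are placeholders rather than the production values.
```scala
// Illustrative sketch of Reputation's two operations. All constants below are
// placeholders; only the shape of the computation follows the description above.
object ReputationSketch {
  // Map a raw PageRank value onto a 0-100 reputation scale via its logarithm.
  def scaledReputation(raw: Double): Byte = {
    if (raw <= 0.0) 0.toByte
    else {
      // Assumed affine transform of log10(raw), clamped to [0, 100].
      val scaled = 100.0 * (math.log10(raw) + 20.0) / 20.0 // placeholder constants
      math.max(0.0, math.min(100.0, scaled)).toByte
    }
  }

  // Reduce the PageRank of users with few followers but many followings by
  // dividing it by a factor derived from the followings-to-followers ratio.
  def adjustReputationsPostCalculation(
    mass: Double,
    numFollowers: Int,
    numFollowings: Int
  ): Double = {
    val ratio = numFollowings.toDouble / math.max(1, numFollowers)
    // Penalize only when followings clearly outnumber followers (assumed threshold).
    val divisionFactor = if (ratio > 2.0) math.log(ratio) + 1.0 else 1.0
    mass / divisionFactor
  }
}
```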

View File

@ -0,0 +1,17 @@
# UserTweetEntityGraph (UTEG)
## What is it
User Tweet Entity Graph (UTEG) is a Finagle Thrift service built on the GraphJet framework. It maintains a graph of user-tweet relationships and serves user recommendations based on traversals of this graph.
## How is it used on Twitter
UTEG generates the "XXX Liked" out-of-network tweets seen on Twitter's Home Timeline.
The core idea behind UTEG is collaborative filtering. UTEG takes a user's weighted follow graph (i.e., a list of weighted userIds) as input,
performs efficient traversal & aggregation, and returns the top tweets, weighted by the number of users who engaged with each tweet as well as
those users' weights.
UTEG is a stateful service and relies on a Kafka stream to ingest & persist states. It maintains in-memory user engagements over the past
24-48 hours. Older events are dropped and GC'ed.
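As a toy illustration of this traversal & aggregation (not GraphJet's actual API; the names and shapes below are invented):
```scala
// Toy UTEG-style collaborative filtering: score each tweet by the summed
// weights of the followed users who engaged with it. GraphJet's real storage
// and traversal are far more efficient; everything here is illustrative.
object UtegSketch {
  def topTweets(
    weightedFollowGraph: Map[Long, Double], // userId -> weight
    engagements: Long => Seq[Long],         // userId -> tweetIds engaged recently
    k: Int                                  // number of recommendations to return
  ): Seq[(Long, Double)] = {
    val scores = scala.collection.mutable.HashMap.empty[Long, Double]
    for ((user, weight) <- weightedFollowGraph; tweet <- engagements(user))
      scores(tweet) = scores.getOrElse(tweet, 0.0) + weight
    scores.toSeq.sortBy(-_._2).take(k)
  }
}
```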
For full details on storage & processing, please check out our open-sourced project GraphJet, a general-purpose high-performance in-memory storage engine.
- https://github.com/twitter/GraphJet
- http://www.vldb.org/pvldb/vol9/p1281-sharma.pdf

View File

@ -0,0 +1,581 @@
package com.twitter.simclusters_v2.common
import com.twitter.simclusters_v2.thriftscala.SimClusterWithScore
import com.twitter.simclusters_v2.thriftscala.{SimClustersEmbedding => ThriftSimClustersEmbedding}
import scala.collection.mutable
import scala.language.implicitConversions
import scala.util.hashing.MurmurHash3.arrayHash
import scala.util.hashing.MurmurHash3.productHash
import scala.math._
/**
* A representation of a SimClusters Embedding, designed for low memory footprint and performance.
* For services that cache millions of embeddings, we found this to significantly reduce allocations,
* memory footprint and overall performance.
*
* Embedding data is stored in pre-sorted arrays rather than structures which use a lot of pointers
* (e.g. Map). A minimal set of lazily-constructed intermediate data is kept.
*
* Be wary of adding further `val` or `lazy val`s to this class; materializing and storing more data
* on these objects could significantly affect in-memory cache performance.
*
* Also, if you are using this code in a place where you care about memory footprint, be careful
* not to materialize any of the lazy vals unless you need them.
*/
sealed trait SimClustersEmbedding extends Equals {
import SimClustersEmbedding._
/**
* Any compliant implementation of the SimClustersEmbedding trait must ensure that:
* - the cluster and score arrays are ordered as described below
* - the cluster and score arrays are treated as immutable (.hashCode is memoized)
* - the size of all cluster and score arrays is the same
* - all cluster scores are > 0
* - cluster ids are unique
*/
// In descending score order - this is useful for truncation, where we care most about the highest scoring elements
private[simclusters_v2] val clusterIds: Array[ClusterId]
private[simclusters_v2] val scores: Array[Double]
// In ascending cluster order. This is useful for operations where we try to find the same cluster in another embedding, e.g. dot product
private[simclusters_v2] val sortedClusterIds: Array[ClusterId]
private[simclusters_v2] val sortedScores: Array[Double]
/**
* Build and return a Set of all clusters in this embedding
*/
lazy val clusterIdSet: Set[ClusterId] = sortedClusterIds.toSet
/**
* Build and return Seq representation of this embedding
*/
lazy val embedding: Seq[(ClusterId, Double)] =
sortedClusterIds.zip(sortedScores).sortBy(-_._2).toSeq
/**
* Build and return a Map representation of this embedding
*/
lazy val map: Map[ClusterId, Double] = sortedClusterIds.zip(sortedScores).toMap
lazy val l1norm: Double = CosineSimilarityUtil.l1NormArray(sortedScores)
lazy val l2norm: Double = CosineSimilarityUtil.normArray(sortedScores)
lazy val logNorm: Double = CosineSimilarityUtil.logNormArray(sortedScores)
lazy val expScaledNorm: Double =
CosineSimilarityUtil.expScaledNormArray(sortedScores, DefaultExponent)
/**
* The L2 Normalized Embedding. Optimize for Cosine Similarity Calculation.
*/
lazy val normalizedSortedScores: Array[Double] =
CosineSimilarityUtil.applyNormArray(sortedScores, l2norm)
lazy val logNormalizedSortedScores: Array[Double] =
CosineSimilarityUtil.applyNormArray(sortedScores, logNorm)
lazy val expScaledNormalizedSortedScores: Array[Double] =
CosineSimilarityUtil.applyNormArray(sortedScores, expScaledNorm)
/**
* The Standard Deviation of an Embedding.
*/
lazy val std: Double = {
if (scores.isEmpty) {
0.0
} else {
val sum = scores.sum
val mean = sum / scores.length
var variance: Double = 0.0
for (i <- scores.indices) {
val v = scores(i) - mean
variance += (v * v)
}
math.sqrt(variance / scores.length)
}
}
/**
* Return the score of a given clusterId.
*/
def get(clusterId: ClusterId): Option[Double] = {
var i = 0
while (i < sortedClusterIds.length) {
val thisId = sortedClusterIds(i)
if (clusterId == thisId) return Some(sortedScores(i))
if (thisId > clusterId) return None
i += 1
}
None
}
/**
* Return the score of a given clusterId. If not exist, return default.
*/
def getOrElse(clusterId: ClusterId, default: Double = 0.0): Double = {
require(default >= 0.0)
var i = 0
while (i < sortedClusterIds.length) {
val thisId = sortedClusterIds(i)
if (clusterId == thisId) return sortedScores(i)
if (thisId > clusterId) return default
i += 1
}
default
}
/**
* Return the cluster ids
*/
def getClusterIds(): Array[ClusterId] = clusterIds
/**
* Return the cluster ids with the highest scores
*/
def topClusterIds(size: Int): Seq[ClusterId] = clusterIds.take(size)
/**
* Return true if this embedding contains a given clusterId
*/
def contains(clusterId: ClusterId): Boolean = clusterIdSet.contains(clusterId)
def sum(another: SimClustersEmbedding): SimClustersEmbedding = {
if (another.isEmpty) this
else if (this.isEmpty) another
else {
var i1 = 0
var i2 = 0
val l = scala.collection.mutable.ArrayBuffer.empty[(Int, Double)]
while (i1 < sortedClusterIds.length && i2 < another.sortedClusterIds.length) {
if (sortedClusterIds(i1) == another.sortedClusterIds(i2)) {
l += Tuple2(sortedClusterIds(i1), sortedScores(i1) + another.sortedScores(i2))
i1 += 1
i2 += 1
} else if (sortedClusterIds(i1) > another.sortedClusterIds(i2)) {
l += Tuple2(another.sortedClusterIds(i2), another.sortedScores(i2))
// another's cluster is lower. Increment it to see if the next one matches this one's
i2 += 1
} else {
l += Tuple2(sortedClusterIds(i1), sortedScores(i1))
// this cluster is lower. Increment it to see if the next one matches another's
i1 += 1
}
}
if (i1 == sortedClusterIds.length && i2 != another.sortedClusterIds.length)
// this was shorter. Append the remaining elements from another
l ++= another.sortedClusterIds.drop(i2).zip(another.sortedScores.drop(i2))
else if (i1 != sortedClusterIds.length && i2 == another.sortedClusterIds.length)
// another was shorter. Append the remaining elements from this
l ++= sortedClusterIds.drop(i1).zip(sortedScores.drop(i1))
SimClustersEmbedding(l)
}
}
def scalarMultiply(multiplier: Double): SimClustersEmbedding = {
require(multiplier > 0.0, "SimClustersEmbedding.scalarMultiply requires multiplier > 0.0")
DefaultSimClustersEmbedding(
clusterIds,
scores.map(_ * multiplier),
sortedClusterIds,
sortedScores.map(_ * multiplier)
)
}
def scalarDivide(divisor: Double): SimClustersEmbedding = {
require(divisor > 0.0, "SimClustersEmbedding.scalarDivide requires divisor > 0.0")
DefaultSimClustersEmbedding(
clusterIds,
scores.map(_ / divisor),
sortedClusterIds,
sortedScores.map(_ / divisor)
)
}
def dotProduct(another: SimClustersEmbedding): Double = {
CosineSimilarityUtil.dotProductForSortedClusterAndScores(
sortedClusterIds,
sortedScores,
another.sortedClusterIds,
another.sortedScores)
}
def cosineSimilarity(another: SimClustersEmbedding): Double = {
CosineSimilarityUtil.dotProductForSortedClusterAndScores(
sortedClusterIds,
normalizedSortedScores,
another.sortedClusterIds,
another.normalizedSortedScores)
}
def logNormCosineSimilarity(another: SimClustersEmbedding): Double = {
CosineSimilarityUtil.dotProductForSortedClusterAndScores(
sortedClusterIds,
logNormalizedSortedScores,
another.sortedClusterIds,
another.logNormalizedSortedScores)
}
def expScaledCosineSimilarity(another: SimClustersEmbedding): Double = {
CosineSimilarityUtil.dotProductForSortedClusterAndScores(
sortedClusterIds,
expScaledNormalizedSortedScores,
another.sortedClusterIds,
another.expScaledNormalizedSortedScores)
}
/**
* Return true if this is an empty embedding
*/
def isEmpty: Boolean = sortedClusterIds.isEmpty
/**
* Return the Jaccard Similarity Score between two embeddings.
* Note: this implementation should be optimized if we start to use it in production
*/
def jaccardSimilarity(another: SimClustersEmbedding): Double = {
if (this.isEmpty || another.isEmpty) {
0.0
} else {
val intersect = clusterIdSet.intersect(another.clusterIdSet).size
val union = clusterIdSet.union(another.clusterIdSet).size
intersect.toDouble / union
}
}
/**
* Return the Fuzzy Jaccard Similarity Score between two embeddings.
* Treat each Simclusters embedding as fuzzy set, calculate the fuzzy set similarity
* metrics of two embeddings
*
* Paper 2.2.1: https://openreview.net/pdf?id=SkxXg2C5FX
*/
def fuzzyJaccardSimilarity(another: SimClustersEmbedding): Double = {
if (this.isEmpty || another.isEmpty) {
0.0
} else {
val v1C = sortedClusterIds
val v1S = sortedScores
val v2C = another.sortedClusterIds
val v2S = another.sortedScores
require(v1C.length == v1S.length)
require(v2C.length == v2S.length)
var i1 = 0
var i2 = 0
var numerator = 0.0
var denominator = 0.0
while (i1 < v1C.length && i2 < v2C.length) {
if (v1C(i1) == v2C(i2)) {
numerator += min(v1S(i1), v2S(i2))
denominator += max(v1S(i1), v2S(i2))
i1 += 1
i2 += 1
} else if (v1C(i1) > v2C(i2)) {
denominator += v2S(i2)
i2 += 1
} else {
denominator += v1S(i1)
i1 += 1
}
}
while (i1 < v1C.length) {
denominator += v1S(i1)
i1 += 1
}
while (i2 < v2C.length) {
denominator += v2S(i2)
i2 += 1
}
numerator / denominator
}
}
/**
* Return the Euclidean Distance Score between two embeddings.
* Note: this implementation should be optimized if we start to use it in production
*/
def euclideanDistance(another: SimClustersEmbedding): Double = {
val unionClusters = clusterIdSet.union(another.clusterIdSet)
val variance = unionClusters.foldLeft(0.0) {
case (sum, clusterId) =>
val distance = math.abs(this.getOrElse(clusterId) - another.getOrElse(clusterId))
sum + distance * distance
}
math.sqrt(variance)
}
/**
* Return the Manhattan Distance Score between two embeddings.
* Note: this implementation should be optimized if we start to use it in production
*/
def manhattanDistance(another: SimClustersEmbedding): Double = {
val unionClusters = clusterIdSet.union(another.clusterIdSet)
unionClusters.foldLeft(0.0) {
case (sum, clusterId) =>
sum + math.abs(this.getOrElse(clusterId) - another.getOrElse(clusterId))
}
}
/**
* Return the number of overlapping clusters between two embeddings.
*/
def overlappingClusters(another: SimClustersEmbedding): Int = {
var i1 = 0
var i2 = 0
var count = 0
while (i1 < sortedClusterIds.length && i2 < another.sortedClusterIds.length) {
if (sortedClusterIds(i1) == another.sortedClusterIds(i2)) {
count += 1
i1 += 1
i2 += 1
} else if (sortedClusterIds(i1) > another.sortedClusterIds(i2)) {
// v2 cluster is lower. Increment it to see if the next one matches v1's
i2 += 1
} else {
// v1 cluster is lower. Increment it to see if the next one matches v2's
i1 += 1
}
}
count
}
/**
* Return the largest product cluster scores
*/
def maxElementwiseProduct(another: SimClustersEmbedding): Double = {
var i1 = 0
var i2 = 0
var maxProduct: Double = 0.0
while (i1 < sortedClusterIds.length && i2 < another.sortedClusterIds.length) {
if (sortedClusterIds(i1) == another.sortedClusterIds(i2)) {
val product = sortedScores(i1) * another.sortedScores(i2)
if (product > maxProduct) maxProduct = product
i1 += 1
i2 += 1
} else if (sortedClusterIds(i1) > another.sortedClusterIds(i2)) {
// v2 cluster is lower. Increment it to see if the next one matches v1's
i2 += 1
} else {
// v1 cluster is lower. Increment it to see if the next one matches v2's
i1 += 1
}
}
maxProduct
}
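// Illustrative example (not in the original source): for a = (1 -> 0.5, 2 -> 0.2)
// and b = (1 -> 0.4, 2 -> 0.9), the elementwise products are 0.2 and 0.18,
// so a.maxElementwiseProduct(b) == 0.2.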
/**
* Return a new SimClustersEmbedding with at most `size` clusters.
*
* Prefer to truncate on embedding construction where possible. Doing so is cheaper.
*/
def truncate(size: Int): SimClustersEmbedding = {
if (clusterIds.length <= size) {
this
} else {
val truncatedClusterIds = clusterIds.take(size)
val truncatedScores = scores.take(size)
val (sortedClusterIds, sortedScores) =
truncatedClusterIds.zip(truncatedScores).sortBy(_._1).unzip
DefaultSimClustersEmbedding(
truncatedClusterIds,
truncatedScores,
sortedClusterIds,
sortedScores)
}
}
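// Note: embeddings built by the companion constructors keep clusterIds/scores in
// descending-score order (see `order` in the companion object), so take(size) in
// truncate above retains the highest-scoring clusters.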
def toNormalized: SimClustersEmbedding = {
// Additional safety check. Only EmptyEmbedding's l2norm is 0.0.
if (l2norm == 0.0) {
EmptyEmbedding
} else {
this.scalarDivide(l2norm)
}
}
implicit def toThrift: ThriftSimClustersEmbedding = {
ThriftSimClustersEmbedding(
embedding.map {
case (clusterId, score) =>
SimClusterWithScore(clusterId, score)
}
)
}
def canEqual(a: Any): Boolean = a.isInstanceOf[SimClustersEmbedding]
/* We define equality as having the same clusters and scores.
* This implementation is arguably incorrect in this case:
* (1 -> 1.0, 2 -> 0.0) == (1 -> 1.0) // equals returns false
* However, compliant implementations of SimClustersEmbedding should not include zero-weight
* clusters, so this implementation should work correctly.
*/
override def equals(that: Any): Boolean =
that match {
case that: SimClustersEmbedding =>
that.canEqual(this) &&
this.sortedClusterIds.sameElements(that.sortedClusterIds) &&
this.sortedScores.sameElements(that.sortedScores)
case _ => false
}
/**
* hashcode implementation based on the contents of the embedding. As a lazy val, this relies on
* the embedding contents being immutable.
*/
override lazy val hashCode: Int = {
/* Arrays use object identity as hashCode, so different arrays with the same contents hash
* differently. To provide a stable hash code, we take the same approach as how a
* `case class(clusters: Seq[Int], scores: Seq[Double])` would be hashed. See
* ScalaRunTime._hashCode and MurmurHash3.productHash
* https://github.com/scala/scala/blob/2.12.x/src/library/scala/runtime/ScalaRunTime.scala#L167
* https://github.com/scala/scala/blob/2.12.x/src/library/scala/util/hashing/MurmurHash3.scala#L64
*
* Note that the hashCode is arguably incorrect in this case:
* (1 -> 1.0, 2 -> 0.0).hashCode == (1 -> 1.0).hashCode // returns false
* However, compliant implementations of SimClustersEmbedding should not include zero-weight
* clusters, so this implementation should work correctly.
*/
productHash((arrayHash(sortedClusterIds), arrayHash(sortedScores)))
}
}
object SimClustersEmbedding {
val EmptyEmbedding: SimClustersEmbedding =
DefaultSimClustersEmbedding(Array.empty, Array.empty, Array.empty, Array.empty)
val DefaultExponent: Double = 0.3
// Descending by score then ascending by ClusterId
implicit val order: Ordering[(ClusterId, Double)] =
(a: (ClusterId, Double), b: (ClusterId, Double)) => {
b._2 compare a._2 match {
case 0 => a._1 compare b._1
case c => c
}
}
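// Illustrative example (not in the original source):
//   Seq((2, 0.5), (1, 0.5), (3, 0.9)).sorted(order) == Seq((3, 0.9), (1, 0.5), (2, 0.5))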
/**
* Constructors
*
* These constructors:
* - do not make assumptions about the ordering of the clusters/scores
* - do assume that cluster ids are unique
* - ignore (drop) any cluster whose score is <= 0
*/
def apply(embedding: (ClusterId, Double)*): SimClustersEmbedding =
buildDefaultSimClustersEmbedding(embedding)
def apply(embedding: Iterable[(ClusterId, Double)]): SimClustersEmbedding =
buildDefaultSimClustersEmbedding(embedding)
def apply(embedding: Iterable[(ClusterId, Double)], size: Int): SimClustersEmbedding =
buildDefaultSimClustersEmbedding(embedding, truncate = Some(size))
implicit def apply(thriftEmbedding: ThriftSimClustersEmbedding): SimClustersEmbedding =
buildDefaultSimClustersEmbedding(thriftEmbedding.embedding.map(_.toTuple))
def apply(thriftEmbedding: ThriftSimClustersEmbedding, truncate: Int): SimClustersEmbedding =
buildDefaultSimClustersEmbedding(
thriftEmbedding.embedding.map(_.toTuple),
truncate = Some(truncate))
private def buildDefaultSimClustersEmbedding(
embedding: Iterable[(ClusterId, Double)],
truncate: Option[Int] = None
): SimClustersEmbedding = {
val truncatedIdAndScores = {
val idsAndScores = embedding.filter(_._2 > 0.0).toArray.sorted(order)
truncate match {
case Some(t) => idsAndScores.take(t)
case _ => idsAndScores
}
}
if (truncatedIdAndScores.isEmpty) {
EmptyEmbedding
} else {
val (clusterIds, scores) = truncatedIdAndScores.unzip
val (sortedClusterIds, sortedScores) = truncatedIdAndScores.sortBy(_._1).unzip
DefaultSimClustersEmbedding(clusterIds, scores, sortedClusterIds, sortedScores)
}
}
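// Illustrative example (not in the original source): input order does not matter
// and zero scores are dropped, so
//   SimClustersEmbedding(Seq(3 -> 0.1, 7 -> 0.5, 9 -> 0.0)) ==
//     SimClustersEmbedding(7 -> 0.5, 3 -> 0.1)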
/** ***** Aggregation Methods ******/
/**
* A high-performance way to sum a list of SimClustersEmbeddings.
* Recommended for use in online services, to avoid unnecessary GC.
* For offline or streaming use cases, see [[SimClustersEmbeddingMonoid]].
*/
def sum(simClustersEmbeddings: Iterable[SimClustersEmbedding]): SimClustersEmbedding = {
if (simClustersEmbeddings.isEmpty) {
EmptyEmbedding
} else {
val sum = simClustersEmbeddings.foldLeft(mutable.Map[ClusterId, Double]()) {
(sum, embedding) =>
for (i <- embedding.sortedClusterIds.indices) {
val clusterId = embedding.sortedClusterIds(i)
sum.put(clusterId, embedding.sortedScores(i) + sum.getOrElse(clusterId, 0.0))
}
sum
}
SimClustersEmbedding(sum)
}
}
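// Illustrative example (not in the original source):
//   sum(Seq(SimClustersEmbedding(1 -> 1.0), SimClustersEmbedding(1 -> 0.5, 2 -> 2.0))) ==
//     SimClustersEmbedding(1 -> 1.5, 2 -> 2.0)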
/**
* Sum a list of SimClustersEmbeddings, truncating the result to a fixed maximum size.
*/
def sum(
simClustersEmbeddings: Iterable[SimClustersEmbedding],
maxSize: Int
): SimClustersEmbedding = {
sum(simClustersEmbeddings).truncate(maxSize)
}
/**
* A high-performance way to average a list of SimClustersEmbeddings.
* Recommended for use in online services, to avoid unnecessary GC.
*/
def mean(simClustersEmbeddings: Iterable[SimClustersEmbedding]): SimClustersEmbedding = {
if (simClustersEmbeddings.isEmpty) {
EmptyEmbedding
} else {
sum(simClustersEmbeddings).scalarDivide(simClustersEmbeddings.size)
}
}
/**
* Mean of a list of SimClustersEmbeddings, truncating the result to a fixed maximum size.
*/
def mean(
simClustersEmbeddings: Iterable[SimClustersEmbedding],
maxSize: Int
): SimClustersEmbedding = {
mean(simClustersEmbeddings).truncate(maxSize)
}
}
case class DefaultSimClustersEmbedding(
override val clusterIds: Array[ClusterId],
override val scores: Array[Double],
override val sortedClusterIds: Array[ClusterId],
override val sortedScores: Array[Double])
extends SimClustersEmbedding {
override def toString: String =
s"DefaultSimClustersEmbedding(${clusterIds.zip(scores).mkString(",")})"
}
object DefaultSimClustersEmbedding {
// To support existing code which builds embeddings from a Seq
def apply(embedding: Seq[(ClusterId, Double)]): SimClustersEmbedding = SimClustersEmbedding(
embedding)
}

@@ -0,0 +1,366 @@
namespace java com.twitter.search.common.ranking.thriftjava
#@namespace scala com.twitter.search.common.ranking.thriftscala
#@namespace strato com.twitter.search.common.ranking
namespace py gen.twitter.search.common.ranking.ranking
struct ThriftLinearFeatureRankingParams {
// values below this will set the score to the minimal one
1: optional double min = -1e+100
// values above this will set the score to the maximal one
2: optional double max = 1e+100
3: optional double weight = 0
}(persisted='true')
struct ThriftAgeDecayRankingParams {
// the rate at which the score of older tweets decreases
1: optional double slope = 0.003
// the age, in minutes, at which a tweet's age score is half that of the latest tweet
2: optional double halflife = 360.0
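// e.g. with the default halflife of 360.0, a tweet that is 6 hours old gets
// half the age score of the newest tweet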
// the minimal age decay score a tweet will have
3: optional double base = 0.6
}(persisted='true')
enum ThriftScoringFunctionType {
LINEAR = 1,
MODEL_BASED = 4,
TENSORFLOW_BASED = 5,
// deprecated
TOPTWEETS = 2,
EXPERIMENTAL = 3,
}
// The struct to define a class that is to be dynamically loaded in earlybird for
// experimentation.
struct ThriftExperimentClass {
// the fully qualified class name.
1: required string name
// data source location (class/jar file) for this dynamic class on HDFS
2: optional string location
// parameters in key-value pairs for this experimental class
3: optional map<string, double> params
}(persisted='true')
// Deprecated!!
struct ThriftQueryEngagementParams {
// Rate Boosts: given a rate (usually a small fraction), the score will be multiplied by
// (1 + rate) ^ boost
// 0 means no boost; negative numbers dampen
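// e.g. a retweet rate of 0.1 with retweetRateBoost = 2 multiplies the score by
// (1 + 0.1) ^ 2 = 1.21 (illustrative numbers)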
1: optional double retweetRateBoost = 0
2: optional double replyRateBoost = 0
3: optional double faveRateBoost = 0
}(persisted='true')
struct ThriftHostQualityParams {
// Multiplier applied to host score, for tweets that have links.
// A multiplier of 0 means that this boost is not applied
1: optional double multiplier = 0.0
// Do not apply the multiplier to hosts with score above this level.
// If 0, the multiplier will be applied to any host.
2: optional double maxScoreToModify = 0.0
// Do not apply the multiplier to hosts with score below this level.
// If 0, the multiplier will be applied to any host.
3: optional double minScoreToModify = 0.0
// If true, score modification will be applied to hosts that have unknown scores.
// The host-score used will be lower than the score of any known host.
4: optional bool applyToUnknownHosts = 0
}(persisted='true')
struct ThriftCardRankingParams {
1: optional double hasCardBoost = 1.0
2: optional double domainMatchBoost = 1.0
3: optional double authorMatchBoost = 1.0
4: optional double titleMatchBoost = 1.0
5: optional double descriptionMatchBoost = 1.0
}(persisted='true')
# The ids are assigned in 'blocks'. For adding a new field, find an unused id in the appropriate
# block. Be sure to mention explicitly which ids have been removed so that they are not used again.
struct ThriftRankingParams {
1: optional ThriftScoringFunctionType type
// Dynamically loaded scorer and collector for quick experimentation.
40: optional ThriftExperimentClass expScorer
41: optional ThriftExperimentClass expCollector
// we must set it to a value that fits into a float: otherwise
// some earlybird classes that convert it to float will interpret
// it as Float.NEGATIVE_INFINITY, and some comparisons will fail
2: optional double minScore = -1e+30
10: optional ThriftLinearFeatureRankingParams parusScoreParams
11: optional ThriftLinearFeatureRankingParams retweetCountParams
12: optional ThriftLinearFeatureRankingParams replyCountParams
15: optional ThriftLinearFeatureRankingParams reputationParams
16: optional ThriftLinearFeatureRankingParams luceneScoreParams
18: optional ThriftLinearFeatureRankingParams textScoreParams
19: optional ThriftLinearFeatureRankingParams urlParams
20: optional ThriftLinearFeatureRankingParams isReplyParams
21: optional ThriftLinearFeatureRankingParams directFollowRetweetCountParams
22: optional ThriftLinearFeatureRankingParams trustedCircleRetweetCountParams
23: optional ThriftLinearFeatureRankingParams favCountParams
24: optional ThriftLinearFeatureRankingParams multipleReplyCountParams
27: optional ThriftLinearFeatureRankingParams embedsImpressionCountParams
28: optional ThriftLinearFeatureRankingParams embedsUrlCountParams
29: optional ThriftLinearFeatureRankingParams videoViewCountParams
66: optional ThriftLinearFeatureRankingParams quotedCountParams
// A map from MutableFeatureType to linear ranking params
25: optional map<byte, ThriftLinearFeatureRankingParams> offlineExperimentalFeatureRankingParams
// whether min/max for the score or ThriftLinearFeatureRankingParams should always be
// applied, or only to non-follows, non-self, non-verified
26: optional bool applyFiltersAlways = 0
// Whether to apply promotion/demotion at all for FeatureBasedScoringFunction
70: optional bool applyBoosts = 1
// UI language is English, tweet language is not
30: optional double langEnglishUIBoost = 0.3
// tweet language is English, UI language is not
31: optional double langEnglishTweetBoost = 0.7
// user language differs from tweet language, and neither is English
32: optional double langDefaultBoost = 0.1
// user that produced tweet is marked as spammer by metastore
33: optional double spamUserBoost = 1.0
// user that produced tweet is marked as nsfw by metastore
34: optional double nsfwUserBoost = 1.0
// user that produced tweet is marked as bot (self similarity) by metastore
35: optional double botUserBoost = 1.0
// An alternative way of using lucene score in the ranking function.
38: optional bool useLuceneScoreAsBoost = 0
39: optional double maxLuceneScoreBoost = 1.2
// Use user's consumed and produced languages for scoring
42: optional bool useUserLanguageInfo = 0
// Boost (demotion) if the tweet language is not one of user's understandable languages,
// nor interface language.
43: optional double unknownLanguageBoost = 0.01
// Use topic ids for scoring.
// Deprecated in SEARCH-8616.
44: optional bool deprecated_useTopicIDsBoost = 0
// Parameters for topic id scoring. See TopicIDsBoostScorer (and its test) for details.
46: optional double deprecated_maxTopicIDsBoost = 3.0
47: optional double deprecated_topicIDsBoostExponent = 2.0;
48: optional double deprecated_topicIDsBoostSlope = 2.0;
// Hit Attribute Demotion
60: optional bool enableHitDemotion = 0
61: optional double noTextHitDemotion = 1.0
62: optional double urlOnlyHitDemotion = 1.0
63: optional double nameOnlyHitDemotion = 1.0
64: optional double separateTextAndNameHitDemotion = 1.0
65: optional double separateTextAndUrlHitDemotion = 1.0
// multiplicative score boost for results deemed offensive
100: optional double offensiveBoost = 1
// multiplicative score boost for results in the searcher's social circle
101: optional double inTrustedCircleBoost = 1
// multiplicative score dampen for results with more than one hash tag
102: optional double multipleHashtagsOrTrendsBoost = 1
// multiplicative score boost for results in the searcher's direct follows
103: optional double inDirectFollowBoost = 1
// multiplicative score boost for results that has trends
104: optional double tweetHasTrendBoost = 1
// is tweet from verified account?
106: optional double tweetFromVerifiedAccountBoost = 1
// is tweet authored by the searcher? (boost is in addition to social boost)
107: optional double selfTweetBoost = 1
// multiplicative score boost for a tweet that has image url.
108: optional double tweetHasImageUrlBoost = 1
// multiplicative score boost for a tweet that has video url.
109: optional double tweetHasVideoUrlBoost = 1
// multiplicative score boost for a tweet that has news url.
110: optional double tweetHasNewsUrlBoost = 1
// is tweet from a blue-verified account?
111: optional double tweetFromBlueVerifiedAccountBoost = 1 (personalDataType = 'UserVerifiedFlag')
// subtractive penalty applied after boosts for out-of-network replies.
120: optional double outOfNetworkReplyPenalty = 10.0
150: optional ThriftQueryEngagementParams deprecatedQueryEngagementParams
160: optional ThriftHostQualityParams deprecatedHostQualityParams
// age decay params for regular tweets
203: optional ThriftAgeDecayRankingParams ageDecayParams
// for card ranking: map from card name ordinal (defined in com.twitter.search.common.constants.CardConstants)
// to ranking params
400: optional map<byte, ThriftCardRankingParams> cardRankingParams
// A map from tweet IDs to the score adjustment for that tweet. These are score
// adjustments that include one or more features that can depend on the query
// string. These features aren't indexed by Earlybird, and so their total contribution
// to the scoring function is passed in directly as part of the request. If present,
// the score adjustment for a tweet is directly added to the linear component of the
// scoring function. Since this signal can be made up of multiple features, any
// reweighting or combination of these features is assumed to be done by the caller
// (hence there is no need for a weight parameter -- the weights of the features
// included in this signal have already been incorporated by the caller).
151: optional map<i64, double> querySpecificScoreAdjustments
// A map from user ID to the score adjustment for tweets from that author.
// This field provides a way for adjusting the tweets of a specific set of users with a score
// that is not present in the Earlybird features but has to be passed from the clients, such as
// real graph weights or a combination of multiple features.
// This field should be used mainly for experimentation since it increases the size of the thrift
// requests.
154: optional map<i64, double> authorSpecificScoreAdjustments
// -------- Parameters for ThriftScoringFunctionType.MODEL_BASED --------
// Selected models along with their weights for the linear combination
152: optional map<string, double> selectedModels
153: optional bool useLogitScore = false
// -------- Parameters for ThriftScoringFunctionType.TENSORFLOW_BASED --------
// Selected tensorflow model
303: optional string selectedTensorflowModel
// -------- Deprecated Fields --------
// ID 303 has been used in the past. Resume additional deprecated fields from 304
105: optional double deprecatedTweetHasTrendInTrendingQueryBoost = 1
200: optional double deprecatedAgeDecaySlope = 0.003
201: optional double deprecatedAgeDecayHalflife = 360.0
202: optional double deprecatedAgeDecayBase = 0.6
204: optional ThriftAgeDecayRankingParams deprecatedAgeDecayForTrendsParams
301: optional double deprecatedNameQueryConfidence = 0.0
302: optional double deprecatedHashtagQueryConfidence = 0.0
// Whether to use old-style engagement features (normalized by LogNormalizer)
// or new ones (normalized by SingleBytePositiveFloatNormalizer)
50: optional bool useGranularEngagementFeatures = 0 // DEPRECATED!
}(persisted='true')
// This sorting mode is used by earlybird to retrieve the top-n facets that
// are returned to blender
enum ThriftFacetEarlybirdSortingMode {
SORT_BY_SIMPLE_COUNT = 0,
SORT_BY_WEIGHTED_COUNT = 1,
}
// This is the final sort order used by blender after all results from
// the earlybirds are merged
enum ThriftFacetFinalSortOrder {
// using the created_at date of the first tweet that contained the facet
SCORE = 0,
SIMPLE_COUNT = 1,
WEIGHTED_COUNT = 2,
CREATED_AT = 3
}
struct ThriftFacetRankingOptions {
// next available field ID = 38
// ======================================================================
// EARLYBIRD SETTINGS
//
// These parameters primarily affect how earlybird creates the top-k
// candidate list to be re-ranked by blender
// ======================================================================
// Dynamically loaded scorer and collector for quick experimentation.
26: optional ThriftExperimentClass expScorer
27: optional ThriftExperimentClass expCollector
// It should be less than or equal to reputationParams.min, and all
// tweepcreds between the two get a score of 1.0.
21: optional i32 minTweepcredFilterThreshold
// the maximum score a single tweet can contribute to the weightedCount
22: optional i32 maxScorePerTweet
15: optional ThriftFacetEarlybirdSortingMode sortingMode
// The number of top candidates earlybird returns to blender
16: optional i32 numCandidatesFromEarlybird = 100
// when to terminate early for facet search; overrides the setting in ThriftSearchQuery
34: optional i32 maxHitsToProcess = 1000
// for anti-gaming we want to limit the maximum number of hits the same user can
// contribute. Set to -1 to disable the anti-gaming filter. Overrides the setting in
// ThriftSearchQuery
35: optional i32 maxHitsPerUser = 3
// if the tweepcred of the user is bigger than this value it will not be excluded
// by the anti-gaming filter. Overrides the setting in ThriftSearchQuery
36: optional i32 maxTweepcredForAntiGaming = 65
// these settings affect how earlybird computes the weightedCount
2: optional ThriftLinearFeatureRankingParams parusScoreParams
3: optional ThriftLinearFeatureRankingParams reputationParams
17: optional ThriftLinearFeatureRankingParams favoritesParams
33: optional ThriftLinearFeatureRankingParams repliesParams
37: optional map<byte, ThriftLinearFeatureRankingParams> rankingExpScoreParams
// penalty counter settings
6: optional i32 offensiveTweetPenalty // set to -1 to disable the offensive filter
7: optional i32 antigamingPenalty // set to -1 to disable antigaming filtering
// weight of penalty counts from all tweets containing a facet, not just the tweets
// matching the query
9: optional double queryIndependentPenaltyWeight // set to 0 to not use query independent penalty weights
// penalty for keyword stuffing
60: optional i32 multipleHashtagsOrTrendsPenalty
// Language related boosts, similar to those in relevance ranking options. By default they are
// all 1.0 (no-boost).
// When the user language is English, facet language is not
11: optional double langEnglishUIBoost = 1.0
// When the facet language is English, user language is not
12: optional double langEnglishFacetBoost = 1.0
// When the user language differs from facet/tweet language, and neither is English
13: optional double langDefaultBoost = 1.0
// ======================================================================
// BLENDER SETTINGS
//
// Settings for the facet relevance scoring happening in blender
// ======================================================================
// This block of parameters is only used in the FacetsFutureManager.
// limits to discard facets
// if a facet has a higher penalty count, it will not be returned
5: optional i32 maxPenaltyCount
// if a facet has a lower simple count, it will not be returned
28: optional i32 minSimpleCount
// if a facet has a lower weighted count, it will not be returned
8: optional i32 minCount
// the maximum allowed value for offensiveCount/facetCount a facet can have in order to be returned
10: optional double maxPenaltyCountRatio
// if set to true, then facets with offensive display tweets are excluded from the resultset
29: optional bool excludePossiblySensitiveFacets
// if set to true, then only facets that have a display tweet in their ThriftFacetCountMetadata object
// will be returned to the caller
30: optional bool onlyReturnFacetsWithDisplayTweet
// parameters for scoring force-inserted media items
// Please check FacetReRanker.java computeScoreForInserted() for their usage.
38: optional double forceInsertedBackgroundExp = 0.3
39: optional double forceInsertedMinBackgroundCount = 2
40: optional double forceInsertedMultiplier = 0.01
// -----------------------------------------------------
// weights for the facet ranking formula
18: optional double simpleCountWeight_DEPRECATED
19: optional double weightedCountWeight_DEPRECATED
20: optional double backgroundModelBoost_DEPRECATED
// -----------------------------------------------------
// Following parameters are used in the FacetsReRanker
// age decay params
14: optional ThriftAgeDecayRankingParams ageDecayParams
// used in the facets reranker
23: optional double maxNormBoost = 5.0
24: optional double globalCountExponent = 3.0
25: optional double simpleCountExponent = 3.0
31: optional ThriftFacetFinalSortOrder finalSortOrder
// Run facet search as if it happened at this specific time (ms since epoch).
32: optional i64 fakeCurrentTimeMs // not really used anywhere, remove?
}(persisted='true')

File diff suppressed because it is too large

@@ -0,0 +1,53 @@
namespace java com.twitter.simclusters_v2.thriftjava
namespace py gen.twitter.simclusters_v2
#@namespace scala com.twitter.simclusters_v2.thriftscala
#@namespace strato com.twitter.simclusters_v2
include "embedding.thrift"
include "simclusters_presto.thrift"
/**
* Struct that associates a user with simcluster scores for different
* interaction types. This is meant to be used as a feature to predict abuse.
*
* This thrift struct is meant for exploration purposes. It does not have any
* assumptions about what type of interactions we use or what types of scores
* we are keeping track of.
**/
struct AdhocSingleSideClusterScores {
1: required i64 userId(personalDataType = 'UserId')
// We can give the interaction types arbitrary names. In the production
// version of this dataset, we should have a different field per interaction
// type, so that the API makes clear what is included.
2: required map<string, embedding.SimClustersEmbedding> interactionScores
}(persisted="true", hasPersonalData = 'true')
/**
* This is a prod version of the single side features. It is meant to be used as a value in a key
* value store. The pair of healthy and unhealthy scores will be different depending on the use case.
* We will use different stores for different use cases. For instance, the first instance that
* we implement will use search abuse reports and impressions. We can build stores for new values
* in the future.
*
* The consumer creates the interactions which the author receives. For instance, the consumer
* creates an abuse report for an author. The consumer scores are related to the interaction creation
* behavior of the consumer. The author scores are related to the whether the author receives these
* interactions.
*
**/
struct SingleSideUserScores {
1: required i64 userId(personalDataType = 'UserId')
2: required double consumerUnhealthyScore(personalDataType = 'EngagementScore')
3: required double consumerHealthyScore(personalDataType = 'EngagementScore')
4: required double authorUnhealthyScore(personalDataType = 'EngagementScore')
5: required double authorHealthyScore(personalDataType = 'EngagementScore')
}(persisted="true", hasPersonalData = 'true')
/**
* Struct that associates cluster-cluster interaction scores for different
* interaction types.
**/
struct AdhocCrossSimClusterInteractionScores {
1: required i64 clusterId
2: required list<simclusters_presto.ClustersScore> clusterScores
}(persisted="true")

@@ -0,0 +1,137 @@
namespace java com.twitter.simclusters_v2.thriftjava
namespace py gen.twitter.simclusters_v2.embedding
#@namespace scala com.twitter.simclusters_v2.thriftscala
#@namespace strato com.twitter.simclusters_v2
include "com/twitter/simclusters_v2/identifier.thrift"
include "com/twitter/simclusters_v2/online_store.thrift"
struct SimClusterWithScore {
1: required i32 clusterId(personalDataType = 'InferredInterests')
2: required double score(personalDataType = 'EngagementScore')
}(persisted = 'true', hasPersonalData = 'true')
struct TopSimClustersWithScore {
1: required list<SimClusterWithScore> topClusters
2: required online_store.ModelVersion modelVersion
}(persisted = 'true', hasPersonalData = 'true')
struct InternalIdWithScore {
1: required identifier.InternalId internalId
2: required double score(personalDataType = 'EngagementScore')
}(persisted = 'true', hasPersonalData = 'true')
struct InternalIdEmbedding {
1: required list<InternalIdWithScore> embedding
}(persisted = 'true', hasPersonalData = 'true')
struct SemanticCoreEntityWithScore {
1: required i64 entityId(personalDataType = 'SemanticcoreClassification')
2: required double score(personalDataType = 'EngagementScore')
}(persisted = 'true', hasPersonalData = 'true')
struct TopSemanticCoreEntitiesWithScore {
1: required list<SemanticCoreEntityWithScore> topEntities
}(persisted = 'true', hasPersonalData = 'true')
struct PersistedFullClusterId {
1: required online_store.ModelVersion modelVersion
2: required i32 clusterId(personalDataType = 'InferredInterests')
}(persisted = 'true', hasPersonalData = 'true')
struct DayPartitionedClusterId {
1: required i32 clusterId(personalDataType = 'InferredInterests')
2: required string dayPartition // format: yyyy-MM-dd
}
struct TopProducerWithScore {
1: required i64 userId(personalDataType = 'UserId')
2: required double score(personalDataType = 'EngagementScore')
}(persisted = 'true', hasPersonalData = 'true')
struct TopProducersWithScore {
1: required list<TopProducerWithScore> topProducers
}(persisted = 'true', hasPersonalData = 'true')
struct TweetWithScore {
1: required i64 tweetId(personalDataType = 'TweetId')
2: required double score(personalDataType = 'EngagementScore')
}(persisted = 'true', hasPersonalData = 'true')
struct TweetsWithScore {
1: required list<TweetWithScore> tweets
}(persisted = 'true', hasPersonalData = 'true')
struct TweetTopKTweetsWithScore {
1: required i64 tweetId(personalDataType = 'TweetId')
2: required TweetsWithScore topkTweetsWithScore
}(persisted = 'true', hasPersonalData = 'true')
/**
* The generic SimClustersEmbedding for online long-term storage and real-time calculation.
* Use SimClustersEmbeddingId as the only identifier.
* Warning: Doesn't include model version and embedding type in the value struct.
**/
struct SimClustersEmbedding {
1: required list<SimClusterWithScore> embedding
}(persisted = 'true', hasPersonalData = 'true')
struct SimClustersEmbeddingWithScore {
1: required SimClustersEmbedding embedding
2: required double score
}(persisted = 'true', hasPersonalData = 'false')
/**
* This is the recommended structure for aggregating embeddings with time decay - the metadata
* stores the information needed for decayed aggregation.
**/
struct SimClustersEmbeddingWithMetadata {
1: required SimClustersEmbedding embedding
2: required SimClustersEmbeddingMetadata metadata
}(hasPersonalData = 'true')
struct SimClustersEmbeddingIdWithScore {
1: required identifier.SimClustersEmbeddingId id
2: required double score
}(persisted = 'true', hasPersonalData = 'false')
struct SimClustersMultiEmbeddingByValues {
1: required list<SimClustersEmbeddingWithScore> embeddings
}(persisted = 'true', hasPersonalData = 'false')
struct SimClustersMultiEmbeddingByIds {
1: required list<SimClustersEmbeddingIdWithScore> ids
}(persisted = 'true', hasPersonalData = 'false')
/**
* Generic SimClusters Multiple Embeddings. The identifier.SimClustersMultiEmbeddingId is the key of
* the multiple embedding.
**/
union SimClustersMultiEmbedding {
1: SimClustersMultiEmbeddingByValues values
2: SimClustersMultiEmbeddingByIds ids
}(persisted = 'true', hasPersonalData = 'false')
/**
* The metadata of a SimClustersEmbedding. The updatedCount represents the version of the Embedding.
* For tweet embeddings, the updatedCount is the same as, or close to, the favorite count.
**/
struct SimClustersEmbeddingMetadata {
1: optional i64 updatedAtMs
2: optional i64 updatedCount
}(persisted = 'true', hasPersonalData = 'true')
/**
* The data structure for PersistentSimClustersEmbedding Store
**/
struct PersistentSimClustersEmbedding {
1: required SimClustersEmbedding embedding
2: required SimClustersEmbeddingMetadata metadata
}(persisted = 'true', hasPersonalData = 'true')
/**
* The data structure for the Multi Model PersistentSimClustersEmbedding Store
**/
struct MultiModelPersistentSimClustersEmbedding {
1: required map<online_store.ModelVersion, PersistentSimClustersEmbedding> multiModelPersistentSimClustersEmbedding
}(persisted = 'true', hasPersonalData = 'true')

@@ -0,0 +1,65 @@
namespace java com.twitter.simclusters_v2.thriftjava
namespace py gen.twitter.simclusters_v2.evaluation
#@namespace scala com.twitter.simclusters_v2.thriftscala
#@namespace strato com.twitter.simclusters_v2
/**
* Surface area at which the reference tweet was displayed to the user
**/
enum DisplayLocation {
TimelinesRecap = 1,
TimelinesRectweet = 2
}(hasPersonalData = 'false')
struct TweetLabels {
1: required bool isClicked = false(personalDataType = 'EngagementsPrivate')
2: required bool isLiked = false(personalDataType = 'EngagementsPublic')
3: required bool isRetweeted = false(personalDataType = 'EngagementsPublic')
4: required bool isQuoted = false(personalDataType = 'EngagementsPublic')
5: required bool isReplied = false(personalDataType = 'EngagementsPublic')
}(persisted = 'true', hasPersonalData = 'true')
/**
* Data container of a reference tweet with scribed user engagement labels
*/
struct ReferenceTweet {
1: required i64 tweetId(personalDataType = 'TweetId')
2: required i64 authorId(personalDataType = 'UserId')
3: required i64 timestamp(personalDataType = 'PublicTimestamp')
4: required DisplayLocation displayLocation
5: required TweetLabels labels
}(persisted="true", hasPersonalData = 'true')
/**
* Data container of a candidate tweet generated by the candidate algorithm
*/
struct CandidateTweet {
1: required i64 tweetId(personalDataType = 'TweetId')
2: optional double score(personalDataType = 'EngagementScore')
// The timestamp here is a synthetically generated timestamp
// for evaluation purposes, hence left unannotated
3: optional i64 timestamp
}(hasPersonalData = 'true')
/**
* An encapsulated collection of candidate tweets
**/
struct CandidateTweets {
1: required i64 targetUserId(personalDataType = 'UserId')
2: required list<CandidateTweet> recommendedTweets
}(hasPersonalData = 'true')
/**
* An encapsulated collection of reference tweets
**/
struct ReferenceTweets {
1: required i64 targetUserId(personalDataType = 'UserId')
2: required list<ReferenceTweet> impressedTweets
}(persisted="true", hasPersonalData = 'true')
/**
* A list of candidate tweets
**/
struct CandidateTweetsList {
1: required list<CandidateTweet> recommendedTweets
}(hasPersonalData = 'true')

@@ -0,0 +1,205 @@
namespace java com.twitter.simclusters_v2.thriftjava
namespace py gen.twitter.simclusters_v2.identifier
#@namespace scala com.twitter.simclusters_v2.thriftscala
#@namespace strato com.twitter.simclusters_v2
include "com/twitter/simclusters_v2/online_store.thrift"
/**
* The uniform type for SimClusters Embeddings.
* Each embedding has the same underlying storage.
* Warning: Every EmbeddingType should map to one and only one InternalId.
**/
enum EmbeddingType {
// Reserve 001 - 99 for Tweet embeddings
FavBasedTweet = 1, // Deprecated
FollowBasedTweet = 2, // Deprecated
LogFavBasedTweet = 3, // Production Version
FavBasedTwistlyTweet = 10, // Deprecated
LogFavBasedTwistlyTweet = 11, // Deprecated
LogFavLongestL2EmbeddingTweet = 12, // Production Version
// Tweet embeddings generated from non-fav events
// Naming convention: {Event}{Score}BasedTweet
// {Event}: The interaction event we use to build the tweet embeddings
// {Score}: The score from user InterestedIn embeddings
VideoPlayBack50LogFavBasedTweet = 21,
RetweetLogFavBasedTweet = 22,
ReplyLogFavBasedTweet = 23,
PushOpenLogFavBasedTweet = 24,
// [Experimental] Offline generated FavThroughRate-based Tweet Embedding
Pop1000RankDecay11Tweet = 30,
Pop10000RankDecay11Tweet = 31,
OonPop1000RankDecayTweet = 32,
// [Experimental] Offline generated production-like LogFavScore-based Tweet Embedding
OfflineGeneratedLogFavBasedTweet = 40,
// Reserve 51-59 for Ads Embedding
LogFavBasedAdsTweet = 51, // Experimental embedding for ads tweet candidate
LogFavClickBasedAdsTweet = 52, // Experimental embedding for ads tweet candidate
// Reserve 60-69 for Evergreen content
LogFavBasedEvergreenTweet = 60,
LogFavBasedRealTimeTweet = 65,
// Reserve 101 to 149 for Semantic Core Entity embeddings
FavBasedSematicCoreEntity = 101, // Deprecated
FollowBasedSematicCoreEntity = 102, // Deprecated
FavBasedHashtagEntity = 103, // Deprecated
FollowBasedHashtagEntity = 104, // Deprecated
ProducerFavBasedSemanticCoreEntity = 105, // Deprecated
ProducerFollowBasedSemanticCoreEntity = 106,// Deprecated
FavBasedLocaleSemanticCoreEntity = 107, // Deprecated
FollowBasedLocaleSemanticCoreEntity = 108, // Deprecated
LogFavBasedLocaleSemanticCoreEntity = 109, // Deprecated
LanguageFilteredProducerFavBasedSemanticCoreEntity = 110, // Deprecated
LanguageFilteredFavBasedLocaleSemanticCoreEntity = 111, // Deprecated
FavTfgTopic = 112, // TFG topic embedding built from fav-based user interestedIn
LogFavTfgTopic = 113, // TFG topic embedding built from logfav-based user interestedIn
FavInferredLanguageTfgTopic = 114, // TFG topic embedding built using inferred consumed languages
FavBasedKgoApeTopic = 115, // topic embedding using fav-based aggregatable producer embedding of KGO seed accounts.
LogFavBasedKgoApeTopic = 116, // topic embedding using log fav-based aggregatable producer embedding of KGO seed accounts.
FavBasedOnboardingApeTopic = 117, // topic embedding using fav-based aggregatable producer embedding of onboarding seed accounts.
LogFavBasedOnboardingApeTopic = 118, // topic embedding using log fav-based aggregatable producer embedding of onboarding seed accounts.
LogFavApeBasedMuseTopic = 119, // Deprecated
LogFavApeBasedMuseTopicExperiment = 120 // Deprecated
// Reserved 201 - 299 for Producer embeddings (KnownFor)
FavBasedProducer = 201
FollowBasedProducer = 202
AggregatableFavBasedProducer = 203 // fav-based aggregatable producer embedding.
AggregatableLogFavBasedProducer = 204 // logfav-based aggregatable producer embedding.
RelaxedAggregatableLogFavBasedProducer = 205 // logfav-based aggregatable producer embedding.
AggregatableFollowBasedProducer = 206 // follow-based aggregatable producer embedding.
KnownFor = 300
// Reserved 301 - 399 for User InterestedIn embeddings
FavBasedUserInterestedIn = 301
FollowBasedUserInterestedIn = 302
LogFavBasedUserInterestedIn = 303
RecentFollowBasedUserInterestedIn = 304 // interested-in embedding based on aggregating producer embeddings of recent follows
FilteredUserInterestedIn = 305 // interested-in embedding used by twistly read path
LogFavBasedUserInterestedInFromAPE = 306
FollowBasedUserInterestedInFromAPE = 307
TwiceUserInterestedIn = 308 // interested-in multi-embedding based on clustering producer embeddings of neighbors
UnfilteredUserInterestedIn = 309
UserNextInterestedIn = 310 // next interested-in embedding generated from BeT
// Denser User InterestedIn, generated by Producer embeddings.
FavBasedUserInterestedInFromPE = 311
FollowBasedUserInterestedInFromPE = 312
LogFavBasedUserInterestedInFromPE = 313
FilteredUserInterestedInFromPE = 314 // interested-in embedding used by twistly read path
// [Experimental] Denser User InterestedIn, generated by aggregating IIAPE embedding from AddressBook
LogFavBasedUserInterestedMaxpoolingAddressBookFromIIAPE = 320
LogFavBasedUserInterestedAverageAddressBookFromIIAPE = 321
LogFavBasedUserInterestedBooktypeMaxpoolingAddressBookFromIIAPE = 322
LogFavBasedUserInterestedLargestDimMaxpoolingAddressBookFromIIAPE = 323
LogFavBasedUserInterestedLouvainMaxpoolingAddressBookFromIIAPE = 324
LogFavBasedUserInterestedConnectedMaxpoolingAddressBookFromIIAPE = 325
// Reserved 401 - 500 for Space embeddings
FavBasedApeSpace = 401 // DEPRECATED
LogFavBasedListenerSpace = 402 // DEPRECATED
LogFavBasedAPESpeakerSpace = 403 // DEPRECATED
LogFavBasedUserInterestedInListenerSpace = 404 // DEPRECATED
// Experimental, internal-only IDs
ExperimentalThirtyDayRecentFollowBasedUserInterestedIn = 10000 // Like RecentFollowBasedUserInterestedIn, except limited to last 30 days
ExperimentalLogFavLongestL2EmbeddingTweet = 10001 // DEPRECATED
}(persisted = 'true', hasPersonalData = 'false')
/**
* The uniform type for SimClusters MultiEmbeddings.
* Warning: Every MultiEmbeddingType should map to one and only one InternalId.
**/
enum MultiEmbeddingType {
// Reserved 0-99 for Tweet based MultiEmbedding
// Reserved 100 - 199 for Topic based MultiEmbedding
LogFavApeBasedMuseTopic = 100 // Deprecated
LogFavApeBasedMuseTopicExperiment = 101 // Deprecated
// Reserved 301 - 399 for User InterestedIn embeddings
TwiceUserInterestedIn = 301 // interested-in multi-embedding based on clustering producer embeddings of neighbors
}(persisted = 'true', hasPersonalData = 'true')
// Deprecated. Please use TopicId for future cases.
struct LocaleEntityId {
1: i64 entityId
2: string language
}(persisted = 'true', hasPersonalData = 'false')
enum EngagementType {
Favorite = 1,
Retweet = 2,
}
struct UserEngagedTweetId {
1: i64 tweetId(personalDataType = 'TweetId')
2: i64 userId(personalDataType = 'UserId')
3: EngagementType engagementType(personalDataType = 'EventType')
}(persisted = 'true', hasPersonalData = 'true')
struct TopicId {
1: i64 entityId (personalDataType = 'SemanticcoreClassification')
// 2-letter ISO 639-1 language code
2: optional string language
// 2-letter ISO 3166-1 alpha-2 country code
3: optional string country
}(persisted = 'true', hasPersonalData = 'false')
struct TopicSubId {
1: i64 entityId (personalDataType = 'SemanticcoreClassification')
// 2-letter ISO 639-1 language code
2: optional string language
// 2-letter ISO 3166-1 alpha-2 country code
3: optional string country
4: i32 subId
}(persisted = 'true', hasPersonalData = 'true')
// Will be used for testing purposes in DDG 15536, 15534
struct UserWithLanguageId {
1: required i64 userId(personalDataType = 'UserId')
2: optional string langCode(personalDataType = 'InferredLanguage')
}(persisted = 'true', hasPersonalData = 'true')
/**
* The internal identifier type.
* When adding a new type, ordering must also be added in
* [[com.twitter.simclusters_v2.common.SimClustersEmbeddingId]].
**/
union InternalId {
1: i64 tweetId(personalDataType = 'TweetId')
2: i64 userId(personalDataType = 'UserId')
3: i64 entityId(personalDataType = 'SemanticcoreClassification')
4: string hashtag(personalDataType = 'PublicTweetEntitiesAndMetadata')
5: i32 clusterId
6: LocaleEntityId localeEntityId(personalDataType = 'SemanticcoreClassification')
7: UserEngagedTweetId userEngagedTweetId
8: TopicId topicId
9: TopicSubId topicSubId
10: string spaceId
11: UserWithLanguageId userWithLanguageId
}(persisted = 'true', hasPersonalData = 'true')
/**
* A uniform identifier type for all kinds of SimClusters based embeddings.
**/
struct SimClustersEmbeddingId {
1: required EmbeddingType embeddingType
2: required online_store.ModelVersion modelVersion
3: required InternalId internalId
}(persisted = 'true', hasPersonalData = 'true')
/**
* A uniform identifier type for multiple SimClusters embeddings
**/
struct SimClustersMultiEmbeddingId {
1: required MultiEmbeddingType embeddingType
2: required online_store.ModelVersion modelVersion
3: required InternalId internalId
}(persisted = 'true', hasPersonalData = 'true')

timelineranker/README.md

@@ -0,0 +1,13 @@
# TimelineRanker
**TimelineRanker** (TLR) is a legacy service that provides relevance-scored tweets from the Earlybird Search Index and User Tweet Entity Graph (UTEG) service. Despite its name, it no longer performs heavy ranking or model-based ranking itself; it only uses relevance scores from the Search Index for ranked tweet endpoints.
The following is a list of major services that Timeline Ranker interacts with:
- **Earlybird-root-superroot (a.k.a Search):** Timeline Ranker calls the Search Index's super root to fetch a list of Tweets.
- **User Tweet Entity Graph (UTEG):** Timeline Ranker calls UTEG to fetch a list of tweets liked by the users you follow.
- **Socialgraph:** Timeline Ranker calls Social Graph Service to obtain the follow graph and user states such as blocked, muted, retweets muted, etc.
- **TweetyPie:** Timeline Ranker hydrates tweets by calling TweetyPie to post-filter tweets based on certain hydrated fields.
- **Manhattan:** Timeline Ranker hydrates some tweet features (e.g., user languages) from Manhattan.
**Home Mixer** calls Timeline Ranker to fetch tweets from the Earlybird Search Index and User Tweet Entity Graph (UTEG) service to power both the For You and Following Home Timelines. Timeline Ranker performs light ranking based on Earlybird tweet candidate scores and truncates to the number of candidates requested by Home Mixer based on these scores.
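The light ranking step described above can be pictured as a score-based sort plus truncation. Below is a minimal sketch of that idea; the names (`Candidate`, `lightRank`, `maxResults`) are hypothetical, not TimelineRanker's real API:

```scala
// Hypothetical sketch of light ranking: keep the top-scoring candidates.
final case class Candidate(tweetId: Long, score: Double)

def lightRank(candidates: Seq[Candidate], maxResults: Int): Seq[Candidate] =
  candidates.sortBy(c => -c.score).take(maxResults)
```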

@@ -0,0 +1,10 @@
Trust and Safety Models
=======================
We decided to open source the training code of the following models:
- pNSFWMedia: Model to detect tweets with NSFW images. This includes adult and porn content.
- pNSFWText: Model to detect tweets with NSFW text, adult/sexual topics.
- pToxicity: Model to detect toxic tweets. Toxicity includes marginal content like insults and certain types of harassment. Toxic content does not violate Twitter's terms of service.
- pAbuse: Model to detect abusive content. This includes violations of Twitter's terms of service, including hate speech, targeted harassment and abusive behavior.
We have several more models and rules that we are not going to open source at this time because of the adversarial nature of this area. The team is considering open sourcing more models going forward and will keep the community posted accordingly.

@@ -1,7 +1,7 @@
# TWML
---
Note: `twml` is no longer under development. Much of the code here is not out of date and unused.
Note: `twml` is no longer under development. Much of the code here is out of date and unused.
It is included here for completeness, because `twml` is still used to train the light ranker models
(see `src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/README.md`)
---
@@ -10,4 +10,4 @@ TWML is one of Twitter's machine learning frameworks, which uses Tensorflow unde
deprecated,
it is still currently used to train the Earlybird light ranking models (
see `src/python/twitter/deepbird/projects/timelines/scripts/models/earlybird/train.py`).
The most relevant part of this is the `DataRecordTrainer` class, which is where the core training logic resides.
The most relevant part of this is the `DataRecordTrainer` class, which is where the core training logic resides.

@@ -494,6 +494,9 @@ visibility_library_enable_trends_representative_tweet_safety_level:
visibility_library_enable_trusted_friends_user_list_safety_level:
default_availability: 10000
visibility_library_enable_twitter_delegate_user_list_safety_level:
default_availability: 10000
visibility_library_enable_tweet_detail_safety_level:
default_availability: 10000
@@ -758,7 +761,7 @@ visibility_library_enable_short_circuiting_from_blender_visibility_library:
visibility_library_enable_short_circuiting_from_search_visibility_library:
default_availability: 0
visibility_library_enable_nsfw_text_topics_drop_rule:
visibility_library_enable_nsfw_text_high_precision_drop_rule:
default_availability: 10000
visibility_library_enable_spammy_tweet_rule_verdict_logging:

@@ -535,6 +535,9 @@ private[visibility] object DeciderKey extends DeciderKeyEnum {
val EnableTrustedFriendsUserListSafetyLevel: Value = Value(
"visibility_library_enable_trusted_friends_user_list_safety_level"
)
val EnableTwitterDelegateUserListSafetyLevel: Value = Value(
"visibility_library_enable_twitter_delegate_user_list_safety_level"
)
val EnableTweetDetailSafetyLevel: Value = Value(
"visibility_library_enable_tweet_detail_safety_level"
)
@@ -869,8 +872,8 @@ private[visibility] object DeciderKey extends DeciderKeyEnum {
"visibility_library_enable_short_circuiting_from_search_visibility_library"
)
val EnableNsfwTextTopicsDropRule: Value = Value(
"visibility_library_enable_nsfw_text_topics_drop_rule"
val EnableNsfwTextHighPrecisionDropRule: Value = Value(
"visibility_library_enable_nsfw_text_high_precision_drop_rule"
)
val EnableSpammyTweetRuleVerdictLogging: Value = Value(

@@ -198,6 +198,7 @@ private[visibility] object VisibilityDeciders {
TopicRecommendations -> DeciderKey.EnableTopicRecommendationsSafetyLevel,
TrendsRepresentativeTweet -> DeciderKey.EnableTrendsRepresentativeTweetSafetyLevel,
TrustedFriendsUserList -> DeciderKey.EnableTrustedFriendsUserListSafetyLevel,
TwitterDelegateUserList -> DeciderKey.EnableTwitterDelegateUserListSafetyLevel,
TweetDetail -> DeciderKey.EnableTweetDetailSafetyLevel,
TweetDetailNonToo -> DeciderKey.EnableTweetDetailNonTooSafetyLevel,
TweetEngagers -> DeciderKey.EnableTweetEngagersSafetyLevel,
@@ -287,7 +288,7 @@ private[visibility] object VisibilityDeciders {
RuleParams.EnableDropAllTrustedFriendsTweetsRuleParam -> DeciderKey.EnableDropAllTrustedFriendsTweetsRule,
RuleParams.EnableDropTrustedFriendsTweetContentRuleParam -> DeciderKey.EnableDropTrustedFriendsTweetContentRule,
RuleParams.EnableDropAllCollabInvitationTweetsRuleParam -> DeciderKey.EnableDropCollabInvitationTweetsRule,
RuleParams.EnableNsfwTextTopicsDropRuleParam -> DeciderKey.EnableNsfwTextTopicsDropRule,
RuleParams.EnableNsfwTextHighPrecisionDropRuleParam -> DeciderKey.EnableNsfwTextHighPrecisionDropRule,
RuleParams.EnableLikelyIvsUserLabelDropRule -> DeciderKey.EnableLikelyIvsUserLabelDropRule,
RuleParams.EnableCardUriRootDomainCardDenylistRule -> DeciderKey.EnableCardUriRootDomainDenylistRule,
RuleParams.EnableCommunityNonMemberPollCardRule -> DeciderKey.EnableCommunityNonMemberPollCardRule,

@@ -85,7 +85,7 @@ private[visibility] object RuleParams {
object EnableDropAllCollabInvitationTweetsRuleParam extends RuleParam(false)
object EnableNsfwTextTopicsDropRuleParam extends RuleParam(false)
object EnableNsfwTextHighPrecisionDropRuleParam extends RuleParam(false)
object EnableLikelyIvsUserLabelDropRule extends RuleParam(false)

@@ -186,6 +186,7 @@ private[visibility] object SafetyLevelParams {
object EnableTopicRecommendationsSafetyLevelParam extends SafetyLevelParam(false)
object EnableTrendsRepresentativeTweetSafetyLevelParam extends SafetyLevelParam(false)
object EnableTrustedFriendsUserListSafetyLevelParam extends SafetyLevelParam(false)
object EnableTwitterDelegateUserListSafetyLevelParam extends SafetyLevelParam(false)
object EnableTweetDetailSafetyLevelParam extends SafetyLevelParam(false)
object EnableTweetDetailNonTooSafetyLevelParam extends SafetyLevelParam(false)
object EnableTweetDetailWithInjectionsHydrationSafetyLevelParam extends SafetyLevelParam(false)

@@ -143,7 +143,7 @@ class VisibilityRuleEngine private[VisibilityRuleEngine] (
builder.withRuleResult(rule, RuleResult(builder.verdict, ShortCircuited))
} else {
if (rule.fallbackActionBuilder.nonEmpty) {
if (failedFeatureDependencies.nonEmpty && rule.fallbackActionBuilder.nonEmpty) {
metricsRecorder.recordRuleFallbackAction(rule.name)
}

@@ -194,6 +194,7 @@ object SafetyLevel {
ThriftSafetyLevel.TopicsLandingPageTopicRecommendations -> TopicsLandingPageTopicRecommendations,
ThriftSafetyLevel.TrendsRepresentativeTweet -> TrendsRepresentativeTweet,
ThriftSafetyLevel.TrustedFriendsUserList -> TrustedFriendsUserList,
ThriftSafetyLevel.TwitterDelegateUserList -> TwitterDelegateUserList,
ThriftSafetyLevel.GryphonDecksAndColumns -> GryphonDecksAndColumns,
ThriftSafetyLevel.TweetDetail -> TweetDetail,
ThriftSafetyLevel.TweetDetailNonToo -> TweetDetailNonToo,
@@ -772,6 +773,9 @@ object SafetyLevel {
case object TrustedFriendsUserList extends SafetyLevel {
override val enabledParam: SafetyLevelParam = EnableTrustedFriendsUserListSafetyLevelParam
}
case object TwitterDelegateUserList extends SafetyLevel {
override val enabledParam: SafetyLevelParam = EnableTwitterDelegateUserListSafetyLevelParam
}
case object TweetDetail extends SafetyLevel {
override val enabledParam: SafetyLevelParam = EnableTweetDetailSafetyLevelParam
}

@@ -379,13 +379,6 @@ object SafetyLevelGroup {
)
}
case object ProfileMixer extends SafetyLevelGroup {
override val levels: Set[SafetyLevel] = Set(
ProfileMixerMedia,
ProfileMixerFavorites,
)
}
case object Reactions extends SafetyLevelGroup {
override val levels: Set[SafetyLevel] = Set(
SignalsReactions,
@@ -516,6 +509,10 @@ object SafetyLevelGroup {
SafetyLevel.TimelineProfile,
TimelineProfileAll,
TimelineProfileSpaces,
TimelineMedia,
ProfileMixerMedia,
TimelineFavorites,
ProfileMixerFavorites
)
}

@@ -36,8 +36,8 @@ object SpaceSafetyLabelType extends SafetyLabelType {
s.SpaceSafetyLabelType.HatefulHighRecall -> HatefulHighRecall,
s.SpaceSafetyLabelType.ViolenceHighRecall -> ViolenceHighRecall,
s.SpaceSafetyLabelType.HighToxicityModelScore -> HighToxicityModelScore,
s.SpaceSafetyLabelType.UkraineCrisisTopic -> UkraineCrisisTopic,
s.SpaceSafetyLabelType.DoNotPublicPublish -> DoNotPublicPublish,
s.SpaceSafetyLabelType.DeprecatedSpaceSafetyLabel14 -> Deprecated,
s.SpaceSafetyLabelType.DeprecatedSpaceSafetyLabel15 -> Deprecated,
s.SpaceSafetyLabelType.Reserved16 -> Deprecated,
s.SpaceSafetyLabelType.Reserved17 -> Deprecated,
s.SpaceSafetyLabelType.Reserved18 -> Deprecated,
@@ -69,10 +69,6 @@ object SpaceSafetyLabelType extends SafetyLabelType {
case object ViolenceHighRecall extends SpaceSafetyLabelType
case object HighToxicityModelScore extends SpaceSafetyLabelType
case object UkraineCrisisTopic extends SpaceSafetyLabelType
case object DoNotPublicPublish extends SpaceSafetyLabelType
case object Deprecated extends SpaceSafetyLabelType
case object Unknown extends SpaceSafetyLabelType

@@ -3,6 +3,7 @@ package com.twitter.visibility.rules
import com.twitter.spam.rtf.thriftscala.SafetyResultReason
import com.twitter.util.Memoize
import com.twitter.visibility.common.actions.AppealableReason
import com.twitter.visibility.common.actions.AvoidReason.MightNotBeSuitableForAds
import com.twitter.visibility.common.actions.LimitedEngagementReason
import com.twitter.visibility.common.actions.SoftInterventionDisplayType
import com.twitter.visibility.common.actions.SoftInterventionReason
@@ -440,36 +441,6 @@ object FreedomOfSpeechNotReachActions {
}
}
case class ConversationSectionAbusiveQualityAction(
violationLevel: ViolationLevel = DefaultViolationLevel)
extends FreedomOfSpeechNotReachActionBuilder[ConversationSectionAbusiveQuality.type] {
override def actionType: Class[_] = ConversationSectionAbusiveQuality.getClass
override val actionSeverity = 5
private def toRuleResult: Reason => RuleResult = Memoize { r =>
RuleResult(ConversationSectionAbusiveQuality, Evaluated)
}
def build(evaluationContext: EvaluationContext, featureMap: Map[Feature[_], _]): RuleResult = {
val appealableReason =
FreedomOfSpeechNotReach.extractTweetSafetyLabel(featureMap).map(_.labelType) match {
case Some(label) =>
FreedomOfSpeechNotReach.eligibleTweetSafetyLabelTypesToAppealableReason(
label,
violationLevel)
case _ =>
AppealableReason.Unspecified(violationLevel.level)
}
toRuleResult(Reason.fromAppealableReason(appealableReason))
}
override def withViolationLevel(violationLevel: ViolationLevel) = {
copy(violationLevel = violationLevel)
}
}
case class SoftInterventionAvoidAction(violationLevel: ViolationLevel = DefaultViolationLevel)
extends FreedomOfSpeechNotReachActionBuilder[TweetInterstitial] {
@@ -662,6 +633,9 @@ object FreedomOfSpeechNotReachRules {
override def enabled: Seq[RuleParam[Boolean]] =
Seq(EnableFosnrRuleParam, FosnrRulesEnabledParam)
override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some(
new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds))))
}
case class ViewerIsNonFollowerNonAuthorAndTweetHasViolationOfLevel(
@@ -678,6 +652,9 @@ object FreedomOfSpeechNotReachRules {
override def enabled: Seq[RuleParam[Boolean]] =
Seq(EnableFosnrRuleParam, FosnrRulesEnabledParam)
override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some(
new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds))))
}
case class ViewerIsNonAuthorAndTweetHasViolationOfLevel(
@@ -692,6 +669,9 @@ object FreedomOfSpeechNotReachRules {
override def enabled: Seq[RuleParam[Boolean]] =
Seq(EnableFosnrRuleParam, FosnrRulesEnabledParam)
override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some(
new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds))))
}
case object TweetHasViolationOfAnyLevelFallbackDropRule

@@ -188,6 +188,7 @@ object RuleBase {
TopicRecommendations -> TopicRecommendationsPolicy,
TrendsRepresentativeTweet -> TrendsRepresentativeTweetPolicy,
TrustedFriendsUserList -> TrustedFriendsUserListPolicy,
TwitterDelegateUserList -> TwitterDelegateUserListPolicy,
TweetDetail -> TweetDetailPolicy,
TweetDetailNonToo -> TweetDetailNonTooPolicy,
TweetDetailWithInjectionsHydration -> TweetDetailWithInjectionsHydrationPolicy,

@@ -144,6 +144,9 @@ object NsfwCardImageAvoidAllUsersTweetLabelRule
action = Avoid(Some(AvoidReason.ContainsNsfwMedia)),
) {
override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam)
override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some(
new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds))))
}
object NsfwCardImageAvoidAdPlacementAllUsersTweetLabelRule
@@ -247,6 +250,9 @@ object GoreAndViolenceHighPrecisionAvoidAllUsersTweetLabelRule
TweetSafetyLabelType.GoreAndViolenceHighPrecision
) {
override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam)
override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some(
new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds))))
}
object GoreAndViolenceHighPrecisionAllUsersTweetLabelRule
@@ -266,6 +272,9 @@ object NsfwReportedHeuristicsAvoidAllUsersTweetLabelRule
TweetSafetyLabelType.NsfwReportedHeuristics
) {
override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam)
override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some(
new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds))))
}
object NsfwReportedHeuristicsAvoidAdPlacementAllUsersTweetLabelRule
@@ -274,6 +283,9 @@ object NsfwReportedHeuristicsAvoidAdPlacementAllUsersTweetLabelRule
TweetSafetyLabelType.NsfwReportedHeuristics
) {
override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam)
override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some(
new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds))))
}
object NsfwReportedHeuristicsAllUsersTweetLabelRule
@@ -294,6 +306,9 @@ object GoreAndViolenceReportedHeuristicsAvoidAllUsersTweetLabelRule
TweetSafetyLabelType.GoreAndViolenceReportedHeuristics
) {
override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam)
override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some(
new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds))))
}
object GoreAndViolenceReportedHeuristicsAvoidAdPlacementAllUsersTweetLabelRule
@@ -302,6 +317,9 @@ object GoreAndViolenceReportedHeuristicsAvoidAdPlacementAllUsersTweetLabelRule
TweetSafetyLabelType.GoreAndViolenceReportedHeuristics
) {
override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableAvoidNsfwRulesParam)
override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some(
new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds))))
}
object GoreAndViolenceHighPrecisionAllUsersTweetLabelDropRule
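
Editor's note: several rules in this file, e.g. `DynamicProductAdDropTweetLabelRule` in the next hunk, derive from `TweetHasLabelRule`: fire a fixed action whenever a tweet carries a given safety label. A simplified sketch under that assumption; the real class takes richer parameters (params, feature maps) and only the label-match shape is kept here.

```scala
// Simplified TweetHasLabelRule; the real one takes richer parameters —
// this only keeps the label-match shape.
object LabelRuleSketch {
  sealed trait TweetSafetyLabelType
  case object DynamicProductAd extends TweetSafetyLabelType
  case object NsfwReportedHeuristics extends TweetSafetyLabelType

  sealed trait Action
  final case class Drop(reason: String) extends Action

  final case class Tweet(labels: Set[TweetSafetyLabelType])

  // Fire a fixed action whenever the tweet carries the given label.
  abstract class TweetHasLabelRule(action: Action, label: TweetSafetyLabelType) {
    def apply(tweet: Tweet): Option[Action] =
      if (tweet.labels.contains(label)) Some(action) else None
  }

  object DynamicProductAdDropRuleSketch
      extends TweetHasLabelRule(Drop("Unspecified"), DynamicProductAd)

  def main(args: Array[String]): Unit = {
    val tweet = Tweet(Set(DynamicProductAd))
    println(DynamicProductAdDropRuleSketch(tweet)) // Some(Drop(Unspecified))
  }
}
```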
@@ -791,7 +809,7 @@ object SkipTweetDetailLimitedEngagementTweetLabelRule
object DynamicProductAdDropTweetLabelRule
extends TweetHasLabelRule(Drop(Unspecified), TweetSafetyLabelType.DynamicProductAd)
object NsfwTextTweetLabelTopicsDropRule
object NsfwTextHighPrecisionTweetLabelDropRule
extends RuleWithConstantAction(
Drop(Reason.Nsfw),
And(
@@ -803,7 +821,7 @@ object NsfwTextTweetLabelTopicsDropRule
)
)
with DoesLogVerdict {
override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableNsfwTextTopicsDropRuleParam)
override def enabled: Seq[RuleParam[Boolean]] = Seq(EnableNsfwTextHighPrecisionDropRuleParam)
override def actionSourceBuilder: Option[RuleActionSourceBuilder] = Some(
TweetSafetyLabelSourceBuilder(TweetSafetyLabelType.NsfwTextHighPrecision))
}
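
Editor's note: the rename from `EnableNsfwTextTopicsDropRuleParam` to `EnableNsfwTextHighPrecisionDropRuleParam` keeps the rule gated behind a runtime boolean. A sketch of how such `RuleParam` gating plausibly works; the param-resolution machinery below is invented for illustration, and only the `enabled: Seq[RuleParam[Boolean]]` shape mirrors the code above.

```scala
// Invented param-resolution machinery for illustration; only the
// "enabled: Seq[RuleParam[Boolean]]" shape mirrors the code above.
object RuleParamSketch {
  // A named boolean knob resolved per request (decider/experiment).
  final case class RuleParam[T](name: String)

  val EnableNsfwTextHighPrecisionDropRuleParam: RuleParam[Boolean] =
    RuleParam("enable_nsfw_text_high_precision_drop_rule")

  trait Rule {
    // The rule runs only when every listed param resolves to true.
    def enabled: Seq[RuleParam[Boolean]] = Nil
  }

  object NsfwTextHighPrecisionDropRuleSketch extends Rule {
    override def enabled: Seq[RuleParam[Boolean]] =
      Seq(EnableNsfwTextHighPrecisionDropRuleParam)
  }

  def isEnabled(rule: Rule, resolve: RuleParam[Boolean] => Boolean): Boolean =
    rule.enabled.forall(resolve)

  def main(args: Array[String]): Unit =
    println(isEnabled(NsfwTextHighPrecisionDropRuleSketch, _ => true)) // true
}
```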
@@ -832,7 +850,10 @@ object DoNotAmplifyTweetLabelAvoidRule
extends TweetHasLabelRule(
Avoid(),
TweetSafetyLabelType.DoNotAmplify
)
) {
override val fallbackActionBuilder: Option[ActionBuilder[_ <: Action]] = Some(
new ConstantActionBuilder(Avoid(Some(MightNotBeSuitableForAds))))
}
object NsfaHighPrecisionTweetLabelAvoidRule
extends TweetHasLabelRule(

View File

@@ -776,7 +776,10 @@ case object MagicRecsPolicy
tweetRules = MagicRecsPolicyOverrides.union(
RecommendationsPolicy.tweetRules.filterNot(_ == SafetyCrisisLevel3DropRule),
NotificationsIbisPolicy.tweetRules,
Seq(NsfaHighRecallTweetLabelRule, NsfwHighRecallTweetLabelRule),
Seq(
NsfaHighRecallTweetLabelRule,
NsfwHighRecallTweetLabelRule,
NsfwTextHighPrecisionTweetLabelDropRule),
Seq(
AuthorBlocksViewerDropRule,
ViewerBlocksAuthorRule,
@@ -1171,7 +1174,7 @@ case object ReturningUserExperiencePolicy
NsfwHighRecallTweetLabelRule,
NsfwVideoTweetLabelDropRule,
NsfwTextTweetLabelDropRule,
NsfwTextTweetLabelTopicsDropRule,
NsfwTextHighPrecisionTweetLabelDropRule,
SpamHighRecallTweetLabelDropRule,
DuplicateContentTweetLabelDropRule,
GoreAndViolenceTweetLabelRule,
@@ -1785,6 +1788,14 @@ case object TimelineListsPolicy
NsfwReportedHeuristicsAllUsersTweetLabelRule,
GoreAndViolenceReportedHeuristicsAllUsersTweetLabelRule,
NsfwCardImageAllUsersTweetLabelRule,
NsfwHighPrecisionTweetLabelAvoidRule,
NsfwHighRecallTweetLabelAvoidRule,
GoreAndViolenceHighPrecisionAvoidAllUsersTweetLabelRule,
NsfwReportedHeuristicsAvoidAllUsersTweetLabelRule,
GoreAndViolenceReportedHeuristicsAvoidAllUsersTweetLabelRule,
NsfwCardImageAvoidAllUsersTweetLabelRule,
DoNotAmplifyTweetLabelAvoidRule,
NsfaHighPrecisionTweetLabelAvoidRule,
) ++ LimitedEngagementBaseRules.tweetRules
)
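
Editor's note: `TimelineListsPolicy` now appends the Avoid-flavored label rules ahead of the shared `LimitedEngagementBaseRules.tweetRules` suffix. Policies are ordered rule lists, so composition is plain `Seq` concatenation; a minimal sketch with stand-in rule objects:

```scala
// Stand-in rule objects; only the Seq-concatenation shape is real.
object PolicyCompositionSketch {
  trait Rule
  case object NsfwCardImageAvoidRuleSketch extends Rule
  case object NsfaHighPrecisionAvoidRuleSketch extends Rule
  case object LimitedEngagementRuleSketch extends Rule

  object LimitedEngagementBaseRulesSketch {
    val tweetRules: Seq[Rule] = Seq(LimitedEngagementRuleSketch)
  }

  // A policy is an ordered rule list; shared suffixes are appended
  // with ++ so every surface picks up base-rule changes for free.
  val timelineListsTweetRules: Seq[Rule] = Seq(
    NsfwCardImageAvoidRuleSketch,
    NsfaHighPrecisionAvoidRuleSketch
  ) ++ LimitedEngagementBaseRulesSketch.tweetRules

  def main(args: Array[String]): Unit =
    println(timelineListsTweetRules) // all three, in order
}
```

Rule order matters: earlier rules take precedence, so surface-specific Avoid rules run before the shared limited-engagement suffix.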
@@ -2132,7 +2143,13 @@ case object TimelineHomePolicy
userRules = Seq(
ViewerMutesAuthorRule,
ViewerBlocksAuthorRule,
DeciderableAuthorBlocksViewerDropRule
DeciderableAuthorBlocksViewerDropRule,
ProtectedAuthorDropRule,
SuspendedAuthorRule,
DeactivatedAuthorRule,
ErasedAuthorRule,
OffboardedAuthorRule,
DropTakendownUserRule
),
policyRuleParams = SensitiveMediaSettingsTimelineHomeBaseRules.policyRuleParams
)
@@ -2171,7 +2188,13 @@ case object BaseTimelineHomePolicy
userRules = Seq(
ViewerMutesAuthorRule,
ViewerBlocksAuthorRule,
DeciderableAuthorBlocksViewerDropRule
DeciderableAuthorBlocksViewerDropRule,
ProtectedAuthorDropRule,
SuspendedAuthorRule,
DeactivatedAuthorRule,
ErasedAuthorRule,
OffboardedAuthorRule,
DropTakendownUserRule
)
)
@@ -2255,7 +2278,13 @@ case object TimelineHomeLatestPolicy
userRules = Seq(
ViewerMutesAuthorRule,
ViewerBlocksAuthorRule,
DeciderableAuthorBlocksViewerDropRule
DeciderableAuthorBlocksViewerDropRule,
ProtectedAuthorDropRule,
SuspendedAuthorRule,
DeactivatedAuthorRule,
ErasedAuthorRule,
OffboardedAuthorRule,
DropTakendownUserRule
),
policyRuleParams = SensitiveMediaSettingsTimelineHomeBaseRules.policyRuleParams
)
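
Editor's note: the same six author-state rules (`ProtectedAuthorDropRule` through `DropTakendownUserRule`) are appended to `TimelineHomePolicy`, `BaseTimelineHomePolicy`, and `TimelineHomeLatestPolicy` above. One way to read the shared block, with stand-in types; the real code repeats the list inline rather than factoring it out.

```scala
// The rule names are the diff's; UserRule and the shared-list
// factoring are illustrative only (the real code repeats the list).
object HomeUserRulesSketch {
  trait UserRule
  case object ViewerMutesAuthorRule extends UserRule
  case object ViewerBlocksAuthorRule extends UserRule
  case object DeciderableAuthorBlocksViewerDropRule extends UserRule
  case object ProtectedAuthorDropRule extends UserRule
  case object SuspendedAuthorRule extends UserRule
  case object DeactivatedAuthorRule extends UserRule
  case object ErasedAuthorRule extends UserRule
  case object OffboardedAuthorRule extends UserRule
  case object DropTakendownUserRule extends UserRule

  // The block now shared verbatim by TimelineHomePolicy,
  // BaseTimelineHomePolicy and TimelineHomeLatestPolicy.
  val homeAuthorStateRules: Seq[UserRule] = Seq(
    ViewerMutesAuthorRule,
    ViewerBlocksAuthorRule,
    DeciderableAuthorBlocksViewerDropRule,
    ProtectedAuthorDropRule,
    SuspendedAuthorRule,
    DeactivatedAuthorRule,
    ErasedAuthorRule,
    OffboardedAuthorRule,
    DropTakendownUserRule
  )

  def main(args: Array[String]): Unit =
    println(homeAuthorStateRules.size) // 9
}
```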
@@ -3283,7 +3312,7 @@ case object TopicRecommendationsPolicy
tweetRules =
Seq(
NsfwHighRecallTweetLabelRule,
NsfwTextTweetLabelTopicsDropRule
NsfwTextHighPrecisionTweetLabelDropRule
)
++ RecommendationsPolicy.tweetRules,
userRules = RecommendationsPolicy.userRules
@@ -3536,6 +3565,17 @@ case object TrustedFriendsUserListPolicy
)
)
case object TwitterDelegateUserListPolicy
extends VisibilityPolicy(
userRules = Seq(
ViewerBlocksAuthorRule,
ViewerIsAuthorDropRule,
DeactivatedAuthorRule,
AuthorBlocksViewerDropRule
),
tweetRules = Seq(DropAllRule)
)
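
Editor's note: the new `TwitterDelegateUserListPolicy` pairs a handful of user-level checks with `DropAllRule` on the tweet side, which fits a user-list surface that never renders tweets. A self-contained sketch of that shape, using a simplified `VisibilityPolicy` base and stand-in rule objects:

```scala
// Simplified VisibilityPolicy base; rule objects are stand-ins.
object DelegatePolicySketch {
  trait UserRule
  trait TweetRule
  case object ViewerBlocksAuthorRule extends UserRule
  case object AuthorBlocksViewerDropRule extends UserRule
  case object DropAllRule extends TweetRule

  abstract class VisibilityPolicy(
    val userRules: Seq[UserRule] = Nil,
    val tweetRules: Seq[TweetRule] = Nil)

  // A user-list surface never renders tweets, so the tweet side is a
  // single unconditional drop while user-level checks still apply.
  case object TwitterDelegateUserListPolicySketch
      extends VisibilityPolicy(
        userRules = Seq(ViewerBlocksAuthorRule, AuthorBlocksViewerDropRule),
        tweetRules = Seq(DropAllRule)
      )

  def main(args: Array[String]): Unit =
    println(TwitterDelegateUserListPolicySketch.tweetRules) // List(DropAllRule)
}
```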
case object QuickPromoteTweetEligibilityPolicy
extends VisibilityPolicy(
tweetRules = TweetDetailPolicy.tweetRules,

View File

@@ -100,30 +100,6 @@ object TweetRuleGenerator {
FreedomOfSpeechNotReachActions.SoftInterventionAvoidLimitedEngagementsAction(
limitedActionStrings = Some(level3LimitedActions))
)
.addSafetyLevelRule(
SafetyLevel.TimelineMedia,
FreedomOfSpeechNotReachActions
.SoftInterventionAvoidLimitedEngagementsAction(limitedActionStrings =
Some(level3LimitedActions))
)
.addSafetyLevelRule(
SafetyLevel.ProfileMixerMedia,
FreedomOfSpeechNotReachActions
.SoftInterventionAvoidLimitedEngagementsAction(limitedActionStrings =
Some(level3LimitedActions))
)
.addSafetyLevelRule(
SafetyLevel.TimelineFavorites,
FreedomOfSpeechNotReachActions
.SoftInterventionAvoidLimitedEngagementsAction(limitedActionStrings =
Some(level3LimitedActions))
)
.addSafetyLevelRule(
SafetyLevel.ProfileMixerFavorites,
FreedomOfSpeechNotReachActions
.SoftInterventionAvoidLimitedEngagementsAction(limitedActionStrings =
Some(level3LimitedActions))
)
.build,
UserType.Author -> TweetVisibilityPolicy
.builder()
@@ -159,30 +135,6 @@ object TweetRuleGenerator {
.InterstitialLimitedEngagementsAvoidAction(limitedActionStrings =
Some(level3LimitedActions))
)
.addSafetyLevelRule(
SafetyLevel.TimelineMedia,
FreedomOfSpeechNotReachActions
.InterstitialLimitedEngagementsAvoidAction(limitedActionStrings =
Some(level3LimitedActions))
)
.addSafetyLevelRule(
SafetyLevel.ProfileMixerMedia,
FreedomOfSpeechNotReachActions
.InterstitialLimitedEngagementsAvoidAction(limitedActionStrings =
Some(level3LimitedActions))
)
.addSafetyLevelRule(
SafetyLevel.TimelineFavorites,
FreedomOfSpeechNotReachActions
.InterstitialLimitedEngagementsAvoidAction(limitedActionStrings =
Some(level3LimitedActions))
)
.addSafetyLevelRule(
SafetyLevel.ProfileMixerFavorites,
FreedomOfSpeechNotReachActions
.InterstitialLimitedEngagementsAvoidAction(limitedActionStrings =
Some(level3LimitedActions))
)
.build,
),
)
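
Editor's note: both hunks in this file delete `addSafetyLevelRule` calls for the `TimelineMedia`, `ProfileMixerMedia`, `TimelineFavorites`, and `ProfileMixerFavorites` surfaces, so those levels no longer receive a per-level FOSNR override. A toy sketch of the fluent builder being edited; the real `TweetVisibilityPolicy.builder()` carries far more state, and `SafetyLevel` and the action type below are stand-ins.

```scala
// Toy fluent builder; the real TweetVisibilityPolicy.builder() carries
// far more state. SafetyLevel and the action type are stand-ins.
object TweetRuleBuilderSketch {
  sealed trait SafetyLevel
  case object ConversationFocalTweet extends SafetyLevel
  case object TimelineMedia extends SafetyLevel

  final case class SoftIntervention(limitedActions: Option[Seq[String]])

  final class TweetVisibilityPolicyBuilder(
    rules: Map[SafetyLevel, SoftIntervention] = Map.empty) {

    // Each call binds one surface to one action; the builder is
    // immutable, so chaining reads top to bottom.
    def addSafetyLevelRule(
      level: SafetyLevel,
      action: SoftIntervention
    ): TweetVisibilityPolicyBuilder =
      new TweetVisibilityPolicyBuilder(rules + (level -> action))

    def build: Map[SafetyLevel, SoftIntervention] = rules
  }

  def main(args: Array[String]): Unit = {
    val policy = new TweetVisibilityPolicyBuilder()
      .addSafetyLevelRule(ConversationFocalTweet, SoftIntervention(Some(Seq("reply"))))
      .build
    // Deleting an addSafetyLevelRule call, as the hunks above do for
    // TimelineMedia et al., leaves that surface with no per-level rule.
    println(policy.contains(TimelineMedia)) // false
  }
}
```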